Facebook Code

IPLS: Privacy-preserving storage for your WhatsApp contacts

Tue, 22 Oct 2024 14:59:00 +0200

Your contact list is fundamental to the experiences you love and enjoy on WhatsApp. With contacts, you know which of your friends and family are on WhatsApp, you can easily message or call them, and it helps give you context on who is in your groups. But losing your phone could mean losing your contact list as well. Traditionally, WhatsApp has lacked the ability to store your contact list in a way that can be easily and automatically restored in the event you lose it. What’s more, the only place you were able to add contacts was from your mobile device, by either typing in a phone number or scanning a QR code.

As part of WhatsApp’s new feature to privately add and manage your contacts on WhatsApp across linked devices, we’re announcing a novel encrypted storage system we’ve designed called Identity Proof Linked Storage (IPLS). IPLS allows you to save your contacts and automatically restore them directly through WhatsApp. With IPLS in place, you can now create contacts directly within WhatsApp and choose to sync them to your phone or securely save them only to WhatsApp – giving you the ability to create contacts that are specific to your account. If you use linked devices, this also allows you to add and manage contacts seamlessly regardless of which device you’re on.

Additionally, if you have multiple accounts on the same phone, such as a work and personal account, you can now customize your contact list for each account. If you lose your phone, your contact list can be restored on a newly registered device.

Contact names are stored encrypted within WhatsApp, and we’ve built this with additional, robust protections by using IPLS to deter access to contacts to anyone except the user.

IPLS incorporates new privacy technology that protects your contact lists in a privacy-preserving fashion. To further ensure the safety and security of this system, we’ve partnered with Cloudflare to provide independent third-party auditing of its cryptographic properties. The new technology stack was reviewed by external researchers and NCC Group, an independent cybersecurity consultant.

What is Identity Proof Linked Storage?

IPLS is a novel system at WhatsApp that allows users to store their contact names in an encrypted way. IPLS allows the client device to save the contact information using a strong encryption key generated on the client device. Its retrieval is based on the client authenticating its primary device identity.

IPLS is based on two existing pieces of technology that are already used at scale by WhatsApp: key transparency and our hardware security module (HSM).

Certain events associated with your phone’s WhatsApp application (such as installing or reinstalling) trigger the creation of a new cryptographic keypair that is associated with your phone number. WhatsApp’s key transparency system publishes records of these primary device identity key changes to an append-only, cryptographic Auditable Key Directory (AKD) that allows WhatsApp clients to automatically verify a user’s encryption key.

Key transparency allows WhatsApp, and the public at large, to cryptographically verify if a given phone number used for a WhatsApp account is tied to a given identity key.

The HSMs are employed by WhatsApp end-to-end encrypted backups and allow for private, tamper-resistant execution of application logic within WhatsApp data centers in a privacy-preserving way. Data processing within HSM’s security boundary remains opaque even to WhatsApp insiders with the highest privilege and physical access to the hardware.

The components of IPLS

The AKD and Cloudflare integration

As mentioned, the first building block of IPLS is WhatsApp’s AKD, which maps a client phone number to a client identity key. Primary device identity is used to authenticate the client to ensure that only the owner of the contact encryption key is allowed to restore the contacts.

To strengthen the single instance nature of AKD, WhatsApp has engaged Cloudflare to act as an additional witness of the additions to AKD. Cloudflare digitally signs each epoch, and associated root hash, and returns a digital signature validation confirming that the directory was not tampered with. The HSM-based Key Vault validates Cloudflare signature using Cloudflare’s public key.

WhatsApp relies on the availability of the Cloudflare signing service and cannot proceed with the updates to AKD in the absence of the digital signature of each update.

In addition, WhatsApp provides auditable proofs of consistency for the transitions between epochs. The auditable proofs are published to a write-once, read-many enabled Amazon S3 instance, which has a public interface for any entity to retrieve the proofs.

Using AKD and partnering with Cloudflare ensures that there is only a single instance of the directory that is validated by a 3rd party.

HSM-based key storage

To ensure privacy for user contacts registered on WhatsApp, contact names are first encrypted using a symmetric encryption key generated by the user’s device, and then stored in the HSM-based Key Vault. Storage and retrieval of the contact encryption key occurs via an end-to-end encrypted channel between the client and the HSM-based Key Vault, ensuring that the data in transit remains opaque to WhatsApp.

Storing the contact key in the HSM-based Key Vault ensures its availability even when the user loses their phone. If a user loses their client device and wants to restore their contacts, the new client device can retrieve the contact key by establishing a secure session with the HSM-based Key Vault. The Key Vault verifies the client identity key by accessing AKD via a secure cryptographic protocol and verifying that the client has the corresponding private key.

Once the client is verified, the new client is allowed to access the contact key in the HSM-based Key Vault using the secure channel established with the client identity key and the HSM key.

Privacy-preserving contacts storage at WhatsApp scale

IPLS is a new system that deters unauthorized access to sensitive data by effectively coupling any data access to publicly auditable identity key changes published to WhatsApp’s key transparency infrastructure. This approach is similar to how a QR code scanning technology can be used to detect a public key compromise in an end-to-end encrypted messaging system.

WhatsApp’s new approach on contacts will give users more ways to easily manage contacts across devices and accounts and store them securely without losing them if they change phones or reinstall WhatsApp. We’re excited about how IPLS has helped enable this new feature and will help ensure WhatsApp contacts are encrypted and can easily move with users when they get a new phone.

OCP Summit 2024: The open future of networking hardware for AI

Tue, 15 Oct 2024 19:06:00 +0200

At Open Compute Project Summit (OCP) 2024, we’re sharing details about our next-generation network fabric for our AI training clusters.
We’ve expanded our network hardware portfolio and are contributing two new disaggregated network fabrics and a new NIC to OCP.
We look forward to continued collaboration with OCP to open designs for racks, servers, storage boxes, and motherboards to benefit companies of all sizes across the industry.

At Meta, we believe that open hardware drives innovation. In today’s world, where more and more data center infrastructure is being devoted to supporting new and emerging AI technologies, open hardware takes on an important role in assisting with disaggregation. By breaking down traditional data center technologies into their core components we can build new systems that are more flexible, scalable, and efficient.

Since helping found OCP in 2011, we’ve shared our data center and component designs, and open-sourced our network orchestration software to spark new ideas both in our own data centers and across the industry. Those ideas have made Meta’s data centers among the most sustainable and efficient in the world. Now, through OCP, we’re bringing new open advanced network technologies to our data centers, and the wider industry, for advanced AI applications.

We’re announcing two new milestones for our data centers: Our next-generation network fabric for AI, and a new portfolio of network hardware that we’ve developed in close partnership with multiple vendors.

Disaggregated network fabrics offer significant advantages in scalability over modular-chassis fabric switches.

DSF: Scheduled fabric that is disaggregated and open

Network performance and availability play an important role in extracting the best performance out of our AI training clusters. It’s for that reason that we’ve continued to push for disaggregation in the backend network fabrics for our AI clusters. Over the past year we have developed a Disaggregated Scheduled Fabric (DSF) for our next-generation AI clusters to help us develop open, vendor-agnostic systems with interchangeable building blocks from vendors across the industry. DSF-based fabrics allow us to build large, non-blocking fabrics to support high-bandwidth AI clusters.

DSF extends our disaggregating network systems to our VoQ-based switched systems that are powered by the open OCP-SAI standard and FBOSS, Meta’s own network operating system for controlling network switches. VoQ-based traffic scheduling ensures proactive congestion avoidance in the fabric rather than reactive congestion signaling and reaction.

The DSF fabric supports an open and standard Ethernet-based RoCE interface to endpoints and accelerators across several xPUs and NICs, including Meta’s MTIA as well as from several vendors.

DSF platforms for next-generation AI fabrics

Arista 7700R4 series

The DSF platforms, Arista 7700R4 series, consist of dedicated leaf and spine systems that are combined to create a large, distributed switch. As a distributed system, DSF is designed to support high scale AI clusters.

7700R4C-38PE: DSF Leaf Switch

DSF Distributed Leaf Switch (Broadcom Jericho3-AI based)
18 x 800GE (36 x 400GE) OSFP800 host ports
20 x 800Gbps (40 x 400Gbps) fabric ports
14.4Tbps of wirespeed performance with 16GB of buffers

7720R4-128PE: DSF Spine Switch

DSF Distributed Spine Switch (Broadcom Ramon3 based)
Accelerated compute optimized pipeline
128 x 800Gbps (256 x 400Gbps) fabric ports
102.4Tbps of wirespeed performance

51T switches for next-generation 400G/800G fabrics

Minipack3 (Broadcom Tomahawk5 based, designed by Meta and manufactured by Celestica) 51.2T switch.

Meta will deploy two next-generation 400G fabric switches, the Minipack3 (the latest version of Minipack, Meta’s own fabric network switch) and the Cisco 8501, both of which are also backward compatible with previous 200G and 400G switches and will support upgrades to 400G and 800G.

The Minipack3 utilizes Broadcom’s latest Tomahawk5 ASIC while the Cisco 8501 is based on Cisco’s Silicon One G200 ASIC. These high-performance switches transmit up to 51.2 Tbps with 64x OSFP ports, and the design is optimized without the need of retimers to achieve maximum power efficiency. They also have significantly reduced power per bit compared with predecessor models.

Meta will run both the Minipack3 and Cisco 8501 on FBOSS.

Cisco 8501 (Cisco Silicon One G200 based, designed and manufactured by Cisco) 51.2T switch.

Optics: 2x400G FR4 optics for 400G/800G optical interconnection

Meta’s data center fabrics have evolved from 200 Gbps/400 Gbps to 400 Gbps/800 Gbps and we’ve already deployed 2x400G optics in our data centers.

Evolving FBOSS and SAI for DSF

We continue to embrace OCP-SAI to onboard the new network fabrics, switch hardware platforms, and optical transceivers to FBOSS. We have collaborated with vendors, and the OCP community, to evolve SAI. It now supports new features and concepts like DSF and other enhanced routing schemes.

Developers and engineers from all over the world can work with this open hardware and contribute their own software that they, in turn, can use themselves and share with the wider industry.

FBNIC: A multi-host foundational NIC designed by Meta

We are continuing to design more ASICs, including the ASIC for FBNIC. FBNIC is a true multi-host foundational NIC and contains the first of our Meta-designed network ASICs for our server fleet and MTIA solutions. It can support up to four hosts with complete datapath isolation for each host.The FBNIC driver has been upstreamed (available from v6.11 kernel). The NIC module was designed by Marvell and has been contributed to OCP.

FBNIC’s key features include:

Network interfaces for up to 4×100/4×50/4×25 GE with SerDes support for up to 56G PAM4 per lane.
Up to 4 independent PCIe Gen5 slices
HW offloads including LSO, Checksum
Line rate timestamping (for each host all the way from PHY) for PTP
Header-Data split to assist Zero-Copy
Compliant with OCP NIC 3.0, version 1.2.0, design specification

The future is open

Advancing AI means building data center infrastructure that goes beyond scale. It also has to allow for flexibility and perform efficiently and sustainably. At Meta, we envision a future of AI hardware systems that are not only scalable, but also open and collaborative.

We encourage anyone who wants to help advance the future of networking hardware for AI to engage with OCP and Meta to help share the future of AI infrastructure.

Meta’s open AI hardware vision

Tue, 15 Oct 2024 19:00:00 +0200

At the Open Compute Project (OCP) Global Summit 2024, we’re showcasing our latest open AI hardware designs with the OCP community.
These innovations include a new AI platform, cutting-edge open rack designs, and advanced network fabrics and components.
By sharing our designs, we hope to inspire collaboration and foster innovation. If you’re passionate about building the future of AI, we invite you to engage with us and OCP to help shape the next generation of open hardware for AI.

AI has been at the core of the experiences Meta has been delivering to people and businesses for years, including AI modeling innovations to optimize and improve on features like Feed and our ads system. As we develop and release new, advanced AI models, we are also driven to advance our infrastructure to support our new and emerging AI workloads.

For example, Llama 3.1 405B, Meta’s largest model, is a dense transformer with 405B parameters and a context window of up to 128k tokens. To train a large language model (LLM) of this magnitude, with over 15 trillion tokens, we had to make substantial optimizations to our entire training stack. This effort pushed our infrastructure to operate across more than 16,000 NVIDIA H100 GPUs, making Llama 3.1 405B the first model in the Llama series to be trained at such a massive scale.

Prior to Llama, our largest AI jobs ran on 128 NVIDIA A100 GPUs. But things have rapidly accelerated. Over the course of 2023, we rapidly scaled up our training clusters from 1K, 2K, 4K, to eventually 16K GPUs to support our AI workloads. Today, we’re training our models on two 24K-GPU clusters.

We don’t expect this upward trajectory for AI clusters to slow down any time soon. In fact, we expect the amount of compute needed for AI training will grow significantly from where we are today.

Building AI clusters requires more than just GPUs. Networking and bandwidth play an important role in ensuring the clusters’ performance. Our systems consist of a tightly integrated HPC compute system and an isolated high-bandwidth compute network that connects all our GPUs and domain-specific accelerators. This design is necessary to meet our injection needs and address the challenges posed by our need for bisection bandwidth.

In the next few years, we anticipate greater injection bandwidth on the order of a terabyte per second, per accelerator, with equal normalized bisection bandwidth. This represents a growth of more than an order of magnitude compared to today’s networks!

To support this growth, we need a high-performance, multi-tier, non-blocking network fabric that can utilize modern congestion control to behave predictably under heavy load. This will enable us to fully leverage the power of our AI clusters and ensure they continue to perform optimally as we push the boundaries of what is possible with AI.

Scaling AI at this speed requires open hardware solutions. Developing new architectures, network fabrics, and system designs is the most efficient and impactful when we can build it on principles of openness. By investing in open hardware, we unlock AI’s full potential and propel ongoing innovation in the field.

Introducing Catalina: Open Architecture for AI Infra

Catalina front view (left) and rear view (right).

Today, we announced the upcoming release of Catalina, our new high-powered rack designed for AI workloads, to the OCP community. Catalina is based on the NVIDIA Blackwell platform full rack-scale solution, with a focus on modularity and flexibility. It is built to support the latest NVIDIA GB200 Grace Blackwell Superchip, ensuring it meets the growing demands of modern AI infrastructure.

The growing power demands of GPUs means open rack solutions need to support higher power capability. With Catalina we’re introducing the Orv3, a high-power rack (HPR) capable of supporting up to 140kW.

The full solution is liquid cooled and consists of a power shelf that supports a compute tray, switch tray, the Orv3 HPR, the Wedge 400 fabric switch, a management switch, battery backup unit, and a rack management controller.

We aim for Catalina’s modular design to empower others to customize the rack to meet their specific AI workloads while leveraging both existing and emerging industry standards.

The Grand Teton Platform now supports AMD accelerators

In 2022, we announced Grand Teton, our next-generation AI platform (the follow-up to our Zion-EX platform). Grand Teton is designed with compute capacity to support the demands of memory-bandwidth-bound workloads, such as Meta’s deep learning recommendation models (DLRMs), as well as compute-bound workloads like content understanding.

Now, we have expanded the Grand Teton platform to support the AMD Instinct MI300X and will be contributing this new version to OCP. Like its predecessors, this new version of Grand Teton features a single monolithic system design with fully integrated power, control, compute, and fabric interfaces. This high level of integration simplifies system deployment, enabling rapid scaling with increased reliability for large-scale AI inference workloads.

In addition to supporting a range of accelerator designs, now including the AMD Instinct MI300x, Grand Teton offers significantly greater compute capacity, allowing faster convergence on a larger set of weights. This is complemented by expanded memory to store and run larger models locally, along with increased network bandwidth to scale up training cluster sizes efficiently.

Open Disaggregated Scheduled Fabric

Developing open, vendor-agnostic networking backend is going to play an important role going forward as we continue to push the performance of our AI training clusters. Disaggregating our network allows us to work with vendors from across the industry to design systems that are innovative as well as scalable, flexible, and efficient.

Our new Disaggregated Scheduled Fabric (DSF) for our next-generation AI clusters offers several advantages over our existing switches. By opening up our network fabric we can overcome limitations in scale, component supply options, and power density. DSF is powered by the open OCP-SAI standard and FBOSS, Meta’s own network operating system for controlling network switches. It also supports an open and standard Ethernet-based RoCE interface to endpoints and accelerators across several GPUS and NICS from several different vendors, including our partners at NVIDIA, Broadcom, and AMD.

In addition to DSF, we have also developed and built new 51T fabric switches based on Broadcom and Cisco ASICs. Finally, we are sharing our new FBNIC, a new NIC module that contains our first Meta-design network ASIC. In order to meet the growing needs of our AI

Meta and Microsoft: Driving Open Innovation Together

Meta and Microsoft have a long-standing partnership within OCP, beginning with the development of the Switch Abstraction Interface (SAI) for data centers in 2018. Over the years together, we’ve contributed to key initiatives such as the Open Accelerator Module (OAM) standard and SSD standardization, showcasing our shared commitment to advancing open innovation.

Our current collaboration focuses on Mount Diablo, a new disaggregated power rack. It’s a cutting-edge solution featuring a scalable 400 VDC unit that enhances efficiency and scalability. This innovative design allows more AI accelerators per IT rack, significantly advancing AI infrastructure. We’re excited to continue our collaboration through this contribution.

The open future of AI infra

Meta is committed to open source AI. We believe that open source will put the benefits and opportunities of AI into the hands of people all over the word.

AI won’t realize its full potential without collaboration. We need open software frameworks to drive model innovation, ensure portability, and promote transparency in AI development. We must also prioritize open and standardized models so we can leverage collective expertise, make AI more accessible, and work towards minimizing biases in our systems.

Just as important, we also need open AI hardware systems. These systems are necessary for delivering the kind of high-performance, cost-effective, and adaptable infrastructure necessary for AI advancement.

We encourage anyone who wants to help advance the future of AI hardware systems to engage with the OCP community. By addressing AI’s infrastructure needs together, we can unlock the true promise of open AI for everyone.

How open source AI can improve population estimates, sustainable energy, and the delivery of climate change interventions

Thu, 03 Oct 2024 18:00:00 +0200

Data for Good at Meta is open-sourcing the data used to train our AI-powered population maps.
We’re hoping that researchers and other organizations around the world will be able to leverage these tools to assist with a wide range of projects including those on climate adaptation, public health and disaster response.
The dataset and code are available now on GitHub.

To support the ongoing work of researchers, governments, nonprofits, and humanitarians around the world, the Data for Good at Meta program is open-sourcing the first set of training data and sample code used to construct Meta’s AI-powered population maps.

As the world looks towards the increasing threat of climate change, Meta’s AI-powered population maps, and the data behind them, offer significant opportunities to direct investments in disaster preparedness through improved estimation of global flood exposure and in climate adaptation planning.

By open sourcing these tools, we hope that other researchers can generate new insights for speeding the delivery of sustainable energy and climate resilient infrastructure around the world.

Why we need better population maps

Accurate estimates of population are taken for granted in many countries. Governments in advanced economies can rely on a variety of sources including tax records or census datasets to better estimate their population and make informed decisions on the delivery of services. However, in other parts of the world, accurate population data is hard to come by. In certain low- and middle-income countries, the most recent census may have been conducted decades ago or lack accurate representation of vulnerable populations. Furthermore, estimates between censuses are often fraught with inaccuracies and remote populations may be entirely missing from official sources. As a result, uncounted communities may live outside the reach of critical programs.

To combat this challenge, Meta began the process of mapping the world’s population using artificial intelligence and satellite imagery in 2017. Alongside other leading population mapping institutions like Columbia University’s Center for Earth Science Information Network (CIESIN) and WorldPop at the University of Southampton, we have openly published hundreds of high resolution population maps and datasets. These have been used around the world by governments and nonprofits for social programs ranging from the targeting of COVID-19 interventions to the delivery of clean water. As the world’s natural resource and energy demands scale, accurate population estimates also offer significant opportunities to improve sustainability efforts.

The World Bank leveraged Meta’s AI-powered population maps to identify potential COVID-19 hotspots in Kinshasa, DRC.

Background on Meta’s AI-powered population maps

Data for Good’s AI-powered population maps estimate the number of people living within 30-meter grid tiles in nearly every country around the world. These maps leverage computer vision techniques – similar to those leveraged to identify objects in photos for the visually impaired – to identify human-made structures in satellite imagery. The outputs of Meta’s AI model are then combined with population stock estimates from CIESIN to approximate the number of people living in each tile.

In addition to total population counts, Meta’s population maps also include demographic breakdowns for groups such as the number of children under five, women of reproductive age, youth, and the elderly.

AI-powered population estimates have been scientifically evaluated to be among the most accurate in the world for mapping population distribution for a variety of geographies and use-cases. For example, this 2022 paper by researchers at the University of Southampton and University of Ghana in Nature – Scientific Reports compares various population density estimates for use in mapping flooding risk in west Africa. Other studies have investigated a variety of use-cases such as mapping landslide risk and malaria eradication; and a range of countries including Haiti, Malawi, Madagascar, Nepal, Rwanda, and Thailand.

Open-sourcing training data for our AI population maps

This initial set of training data consists of almost 10 million labels for over 126 gigabytes of satellite imagery and includes human labels on these satellite imagery patches indicating if a building is present. These labels were created on satellite imagery dating from 2011 – 2020; however, even labels made on older imagery are useful to train the next generation of machine vision models (like Meta’s Segment Anything) to more accurately identify buildings in a range of land-cover environments. In addition to this first batch, we plan to release additional data and code for computer vision training in the future.

Open sourcing Meta’s training data and code allows population mapping partners like CIESIN and WorldPop to continue the progress made in the last decade. These tools reduce development costs for research units to generate even more accurate population estimates and also allows researchers working on building detection to improve their methods, especially when combined with more recent satellite imagery. Future data released from CIESIN and data collaborations like GRID3 will continue to push boundaries of spatial resolution and accuracy as the result of their work collaborating with many African countries to generate, validate, and use core spatial datasets in support of sustainable development.

To better visualize village settlement locations and calculate service coverage, World Vision turned to an innovative dataset developed by Meta’s Data for Good (D4G) and Columbia University’s Center for International Earth Science Information Network (CIESIN). The resulting High Resolution Settlement Layer (HRSL) has been a game-changer for visualizing the geography of clean water.
–Allen Hollenbach, Technical Director for World Vision Water and Sanitation

Applications in sustainable electrification, clean water, and climate change adaptation

Nonprofit organizations and governments around the world have already leveraged Meta’s AI-powered population maps for a range of social impact programs, including the World Bank’s rural electrification efforts in Somalia and Benin and similar efforts in Uganda by the World Resources Institute.

World Vision has also used these datasets in accelerating the progress in five-year plans for water and sanitation in places like Rwanda and Zambia and just recently announced having reached one million additional Rwandans with clean water using insights from these maps to track progress towards universal water coverage.

World Vision used Meta’s high resolution population maps to identify the population and associated settlements closest to existing water points and target areas where new water points were needed.

Innovation in global population mapping is only possible through the type of collaboration Meta continues to have with Columbia University and WorldPop and a shared commitment to open source enables researchers and governments around the world to participate in this process.

Please visit the Data for Good website for more information about Meta’s Data for Good program. And please visit this blog for more information about how we protect user privacy in our tools.

React at Meta Connect 2024

Wed, 02 Oct 2024 18:00:00 +0200

At Meta, React and React Native are more than just tools; they are integral to our product development and innovation. With over five thousand people at Meta building products and experiences with React every month, these technologies are fundamental to our engineering culture and our ability to quickly build and ship high quality products. In this post, we will dive into the development experiences of some of the product teams who leveraged React and React Native to deliver exciting projects showcased at Meta Connect 2024.

Instagram and Facebook For Meta Quest

https://engineering.fb.com/wp-content/uploads/2024/10/RNBlogDemo-compressed.mp4

At Connect, Mark Zuckerberg shared that we have re-built Instagram and Facebook for mixed reality (MR) on Meta Quest. Our goal was to bring our flagship social experiences to the Meta Quest headset, letting people catch up with their friends and watch Stories and Reels, all while showcasing new possibilities enabled only through MR.

Building Meta’s social apps from scratch in MR required our teams to thoughtfully leverage the platform capabilities offered by Meta Quest while keeping a tremendously high bar for quality. The teams first had to decide how to build them: reusing the existing Android apps, writing a new native Android app, or using React Native to build from scratch. We wanted to offer a hero experience that looked and felt at home on Meta Quest, taking advantage of the additional input types, gestures, and larger visual surface area. Instead of simply porting our mobile social apps, we chose React Native as it enabled our teams to iterate and build quickly with robust animation capabilities, great performance, and a shared platform that powers most of the 2D Meta Quest system apps.

On Instagram, React Native enabled our teams to build rich animations and novel interactions that embody the brand’s deep focus on quality and delight. For this new app, we introduced seamless transitions of video posts from feed into a full screen view side by side with comments, without dropping a single frame. We enabled the ability to swipe through stacks of photos with the controller joystick or pinching your hands. We also introduced a unique hover animation over interactive elements that smoothly follows your controller movements.

When building Facebook for Meta Quest, our teams took advantage of the mature code and infrastructure that supports our Facebook.com desktop experience. We leveraged code sharing technologies to reuse some of the most complex and robust features from Facebook.com like Newsfeed and commenting. Some of these code sharing technologies include our Meta open source projects like StyleX and React Strict DOM. By sharing code, our teams could spend less time on repetitive business logic and focus more on adding Meta Quest specific interactions and experiences.

Meta Horizon mobile app

This year, we also rolled out the new Meta Horizon mobile app – a new look and a new name. We expanded the app to make it easier to socialize and express yourself both in and out of the headset. We added a dedicated tab to easily customize your avatar and express your mood, right from your phone. People can also visit Horizon Worlds and complete quests from the app to unlock exclusive avatar styles, items, and emotes.

We’ve also continued to improve app performance. At Meta, our teams typically look to Facebook Marketplace as a React Native performance benchmark. However, the Meta Horizon app is a standalone app with React Native in the initialization path of the app’s cold start, compared to the Facebook app which initializes React Native when you visit your first React Native surface and not on app start. The performance results our teams delivered with React Native exceeded our original expectations and are on par with Meta’s mobile social apps.

Our Meta Horizon team worked closely with our React team to profile our application and find opportunities for improvement using Android Systrace, React DevTools, and the new React Native DevTools. The most impactful improvement that our teams made was initiating network queries earlier. Instead of initiating network requests when a component of the product surface was rendered, our teams moved that network fetch to start when the navigation button from the previous surface was clicked.

Meta Horizon Store

We also announced that the Meta Horizon Store is now open for all developers to publish apps, including 2D apps. To support this change, we made major changes to the Horizon Store; changes to our navigation to support significantly more categories, better ranking and categorization of apps, and a new “Early Access” section.

The Meta Horizon Store includes the surfaces that let you discover and acquire applications and games for Meta Quest, as well as explore Worlds you can travel to in Horizon. Since we have a centralized team that maintains the Store across four platforms (Android, iOS, Horizon OS, Web) and we need feature parity across these interfaces, the team has benefited tremendously from being able to use React and React Native even though these are primarily separate implementations today. These technologies have enabled the team to roll out new features and experiments much faster with a smaller team.

Just like the new Instagram and Facebook apps, and everything else using React at Meta, our teams use the bleeding edge of React infra like the React Compiler and the New React Native Architecture. The React team partnered with multiple teams over the last few years to build out infrastructure and capabilities to enable cross platform code sharing, which the Meta Horizon Store team has started to take advantage of. For example, the Meta Horizon Store’s navigation and routing infrastructure was originally quite different between platforms. The team is now reusing Meta’s internal router for React apps that was originally built for Facebook.com which now also works with React Native. We also converted the Meta Horizon Store on the web from using pure CSS to using StyleX, which in combination with React Strict DOM, has enabled them to reuse the Spotlight section of the Meta Horizon Store across web and mixed reality. This enabled us to more quickly support internationalized text rendering and light/dark mode for banners, and accelerated future enhancements for our merchandising team.

Meta Spatial Editor

https://engineering.fb.com/wp-content/uploads/2024/10/spatial-compressed.mp4

We announced the Meta Spatial SDK and Meta Spatial Editor to enable mobile developers to create immersive experiences for Meta Horizon OS using familiar Android languages, libraries, and tools, along with unique Meta Quest capabilities, such as physics, MR, and 3D. Creating great 3D experiences always requires being able to visualize and edit your scenes directly. The Meta Spatial Editor is a new desktop app that lets you import, organize, and transform your assets into visual compositions and export them, using the glTF standard, into Meta Spatial SDK.

Our teams built the app with React Native for Desktop, providing users with native Windows and macOS apps and providing our teams with the incredible developer experience of React. One of the key factors in the teams’ decision to use React Native for Desktop instead of other web-based desktop solutions is that React Native enables the team to utilize native integrations when needed. The main 3D scene in the app is powered by a custom 3D rendering engine, requiring a custom React Native Native Component integration. The React Native panels on the scene let users modify all sorts of properties which then communicate with the 3D renderer via C++, enabling us to update the UI at 60fps.

The Meta Spatial Editor team had many engineers who primarily had a C++ background and were used to building with Qt. These team members were initially skeptical of JavaScript but ended up loving the developer experience provided by React Native, such as Fast Refresh. Web developers take for granted that code changes can be seen on file-save, but it is still extremely uncommon for native engineers. This developer experience enabled our teams to build much more quickly with React Native.

This is how Meta builds React

Over a decade ago, Meta introduced React to the industry through open source. Our React team at Meta is so proud of these experiences that were announced at Meta Connect 2024. These products showcase the power, expressivity, and flexibility of what’s possible with React: delightful interactions, deeply complex integrations, and incredibly responsive interfaces. And of course, they all render natively on their respective platforms to match user expectations.

Over the past decade, the React team has partnered deeply with both teams at Meta as well as members of the open source community to enable these types of product and developer experiences. Engineers at Meta use React on every platform where we ship user interfaces: web, mobile, desktop, and new platforms such as MR. Each time the React team has added support for a new platform, the team has invested in deeply understanding the idioms and expectations for user experiences on that platform, then adapting and optimizing React accordingly. We’ve consistently found that improving React for one platform benefits others as well — an approach the React teams described in their Many Platform Vision.

This pattern has continued as the teams expanded support to the constraints and opportunities of mixed reality devices. Our teams have improved startup and application responsiveness, improved efficiency to reduce battery drain, and taken major steps to enable code sharing across web and native platforms — with platform-specific customizations. These wins have consistently benefited our apps on other platforms, with user experience improvements in products such as Facebook.com and Facebook Marketplace.

Our engineers invest in these improvements knowing that they will benefit not only products created by Meta, but all React products in the world. Meta continues to share these improvements with the open source community whenever we have built our confidence that they are stable enough for broader adoption. We’ve previously shared some of these improvements with the open source community, including React Compiler, React 19, React Native’s New Architecture, StyleX, React Strict DOM, and performance improvements to Hermes. These innovations and more are currently under development, and our teams look forward to sharing them with the open source community in the future!

Stranger Things ™/© Netflix. Used with permission.

Inside Bento: Jupyter Notebooks at Meta

Tue, 17 Sep 2024 19:53:00 +0200

This episode of the Meta Tech Podcast is all about Bento, Meta’s internal distribution of Jupyter Notebooks, an open-source web-based computing platform. Bento allows our engineers to mix code, text, and multimedia in a single document and serves a wide range of use cases at Meta from prototyping to complex machine learning workflows.

Pascal Hartig (@passy) is joined by Steve, whose team has built several features on top of Jupyter, including scheduled notebooks, sharing with colleagues, and running notebooks without a remote server component by leveraging WebAssembly in the browser.

Download or listen to the podcast episode below:

[embedded content]

You can also find the episode wherever you get your podcasts, including:

The Meta Tech Podcast is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.

Send us feedback on Instagram, Threads, or X.

And if you’re interested in learning more about career opportunities at Meta visit the Meta Careers page.

Simulator-based reinforcement learning for data center cooling optimization

Tue, 10 Sep 2024 18:00:00 +0200

We’re sharing more about the role that reinforcement learning plays in helping us optimize our data centers’ environmental controls.
Our reinforcement learning-based approach has helped us reduce energy consumption and water usage across various weather conditions.
Meta is revamping its new data center design to optimize for artificial intelligence and the same methodology will be applicable for future data center optimizations as well.

Efficiency is one of the key components of Meta’s approach to designing, building, and operating sustainable data centers. Besides the IT load, cooling is the primary consumer of energy and water in the data center environment. Improving the cooling efficiency helps reduce our energy use, water use, and greenhouse gas (GHG) emissions and also helps address one of the greatest challenges of all – climate change.

Most of Meta’s existing data centers use outdoor air and evaporative cooling systems to maintain environmental conditions within the envelope of temperature between 65°F and 85°F (18°C and 30°C) and relative humidity between 13 and 80%. As water and energy are consumed in the conditioning of this air, optimizing the amount of supply airflow that has to be conditioned is a high priority in terms of improving operational efficiency.

Since 2021, we have been leveraging AI to optimize the amount of airflow supply into data centers for cooling purposes. Using simulator-based reinforcement learning, we have reduced the supply fan energy consumption at one of the pilot regions by 20% and water usage by 4% on average across various weather conditions.

Previously, we shared how a physics-based thermal simulator helps us optimize our data centers’ environmental controls. Now, we will shed more light on the role of reinforcement learning in the solution. As Meta is revamping its new data center design to optimize for artificial intelligence, the same methodology will be applicable for future data center optimizations as well to improve operational efficiency.

How cooling works at Meta’s data centers

Currently, Meta’s data centers adopt a two-tiered penthouse design that utilizes 100% outside air for cooling. As shown in Figure 1, the air enters the facility through louvers on the second-floor “penthouse,” with modulating dampers regulating the volume of outside air. The air passes through a mixing room, where outdoor air, if too cold, can be mixed with heat from server exhaust when needed to regulate the temperature.

The air then passes through a series of air filters and a misting chamber where the evaporative cooling and humidification (ECH) system is used to further control the temperature and humidity. The air continues through a fan wall that pushes the air through openings in the floor that serve as an air shaft leading into the server area on the first floor. The hot air coming out from the server exhaust will be contained in the hot aisle, through exhaust shafts, and eventually released out of the building with the help of relief fans.

Water is mainly used in two ways: evaporative cooling and humidification. The evaporative cooling system converts water into vapor to lower the temperature when the outside air is too hot, while the humidification process maintains the humidity level if the air is too dry. As a result of this design, we believe Meta’s data centers are among the most advanced, energy and water efficient data centers in the world.

Figure 1: The penthouse cooling system within Meta’s data centers.

In order to supply air within the defined operating envelope, the penthouse relies on the building management system (BMS) to monitor and control different components of the mechanical system. This system performs the task of conditioning the intake air from outside by mixing, humidifying/dehumidifying, evaporative cooling, or a combination of these operations.

There are three major control loops responsible for adjusting setpoints for supply air: temperature, humidity, and airflow. The airflow setpoint is typically calculated based on a small set of input variables like current IT load, cold aisle temperature, and differential pressure between the cold aisle and hot aisle. The logic is often very simple at a linear scale, but becomes very difficult to accurately model as these values at different locations in the data center are coupled to one another and highly dependent on complex local boundary conditions. However, the amount of airflow will largely dictate the energy used by the supply fan arrays and water consumption when cooling or humidification is required. Therefore, optimizing the airflow setpoint would have the greatest impact in regards to further improving the cooling efficiency given the fact that the temperature and humidity boundary of the operating envelope is fixed.

A reinforcement learning approach to data center cooling

Reinforcement learning (RL) is good at modeling control systems as sequential state machines. It functions as a software agent that determines what action to take at each state based on some transition model – which leads to a different state – and constantly gets feedback from the environment in terms of reward. In the end, the agent learns the best policy model (typically parameterized by a deep neural network) to achieve the optimal accumulated reward. The data center cooling control can be naturally modeled under this paradigm.

At any given time, the state of a data center can be represented by a set of environmental variables monitored by many different sensors for outside air, supply air, cold aisle and hot aisle, plus IT load (i.e., power consumption by servers), etc. The action is to control setpoints – for example, the supply airflow setpoint that determines how fast the supply fans run to meet the demand. The policy is the function mapping from the state space to action space (i.e., determining the appropriate airflow setpoint based on current state conditions). Now the task is to leverage historical data we have collected from thousands of sensors in our data centers – augmented with simulated data of potential, but not yet experienced conditions – and train a better policy model that gives us better reward in terms of energy or water usage efficiency.

The idea of using AI for data center cooling optimization is not new. There are also various RL approaches reported such as, transforming cooling optimization via deep reinforcement learning and data center cooling using model-predictive control.

However, applying the control policy determined by an online RL model may result in various risks including breaches of service requirements and even thermal unsafety. To address this challenge, we adopted an offline simulator based RL approach. As illustrated in Figure 2, our RL agent operates in a simulated environment by starting from real-life historical observations, S. It then explores the action space, feeding into the simulator to predict the anticipated new state S’ and reward, R, given each sampled action, A. From there it collects the pairs (S, A) that have the best reward to form a new training data set to update the parameterized policy model.

Figure 2: Our simulated-based offline RL approach.

Our simulator is a physics-based model of building energy use that takes as inputs time series such as weather data, IT load, and setpoint schedules. The model is built with data center building parameters, including geometry, construction materials, HVAC, system configurations, component efficiencies, and control strategies. It uses differential equations to output the dynamic system response, such as the thermal load and resulting energy use, along with related metrics like cold aisle temperature and differential pressure profiles.

The simulator plays a very important role here since our goal is to optimize energy and water usage while keeping the data center condition under specs so hardware performance isn’t affected. More specifically, we want to keep the rise in cold aisle temperature below a certain threshold, or a positive pressurization from cold aisle to hot aisle, to minimize the parasitic heat caused by recirculation.

Additionally, the physics-based simulator enables us to train the RL model with all possible scenarios, not only those present in the historical data. This increases reliability during outlier events and allows for rapid deployment in newly commissioned data centers.

Results of our RL approach

In 2021, we started a pilot at one of Meta’s data center regions – having the RL model directly controlling the supply airflow setpoint. Figure 3 shows a comparison of the new setpoint, in the unit of cubic feet per minute (CFM) as the red line to the original BMS setpoint (as the dotted blue line) over one week’s duration for illustration purposes.

Figure 3: A comparison of the RL model versus the original BMS setpoint.

The fluctuation is mainly determined by the supply air temperature and server load cycles at different times of day. More importantly, as shown in Figure 4, the data center temperature conditions never went out of spec, with reduced airflow supply with respect to both cold aisle average and maximum temperature compared against the supply air temperature.

Figure 4: A data center temperature profile under RL model control.

It is noticeable that the CFM savings vary under different supply air temperatures as the univariate chart in Figure 5 shows. The CFM savings can easily be converted to energy savings used by the supply fans. Under hot and dry conditions, when evaporative cooling or humidification is required, using less air will result in less water usage as well. Over the past couple years of the pilot, on average, we were able to reduce the supply fan energy consumption by 20% and water usage by 4% across various weather conditions.

Figure 5. A breakdown of airflow savings at different supply air temperatures.

Future work for AI in data center optimization

This effort has opened the door to transform how our data centers operate. By introducing automated predictions and continuous optimizations for tuning environment conditions in our data centers we can bend the cost curve and reduce effort on labor intensive tasks.Meta is breaking ground on new types of data centers that are designed to optimize for artificial intelligence. We plan to apply the same methodology presented here to our future data centers at the design phase to help ensure they’re optimized for sustainability from day one of their operations.

We’re also currently rolling out our RL approach to data center cooling to our existing data centers. Over the couple of years we expect to achieve significant energy and water usage savings to contribute to Meta’s long- term sustainability goals.

Acknowledgements

We would like to thank our partners in IDC Facility Operations (Butch Howard, Randy Ridgway, James Monahan, Jose Montes, Larame Cummings, Gerson Arteaga Ramirez, John Fabian, and many others) for their support.

Read Meta’s 2024 Sustainability Report

Wed, 04 Sep 2024 19:04:00 +0200

The Net Zero Supplier Engagement Program enables us to set expectations with key suppliers for committing to emissions reduction targets and to support them in meeting those targets. By the end of 2023, 28% of our suppliers, based on their contribution to Meta’s emissions, have set science-aligned emissions reduction targets

Learn more

We are committed to becoming water positive in 2030. To achieve this goal, Meta will restore 200% of the water we consume in high water stress areas, and 100% of the water we consume in medium water stress areas. In 2023, 18 operational water restoration projects returned more than 1.5 billion gallons of water to high and medium water stress regions.

Learn more

Our portfolio of more than 11,700 megawatts (MW) of contracted renewable energy makes Meta one of the largest corporate buyers of renewable energy globally, and the corporate buyer with the largest operating renewable energy portfolio in the U.S. for the second year running with more than 6,700 MW online.

Learn more

Data centers are part of the global infrastructure that brings our technologies and services to life. Meta designs and operates some of the most sustainable data centers in the industry, but they still account for the highest percentage of our energy and water use.

We approach data center sustainability from the ground up—from design and construction to operations by prioritizing energy efficiency, renewable energy, water stewardship and responsibly managing the end of life of our equipment.

Meta is getting ready for post-quantum cryptography

Wed, 28 Aug 2024 18:19:00 +0200

The Quantum Apocalypse is coming. The advent of quantum computers has raised real questions about the future of data privacy over the internet. Someday, advances in quantum computing will make it possible to decrypt sensitive data that was encrypted using today’s complex cryptography systems.

In the latest episode of the Meta Tech Podcast you’ll meet Sheran and Rafael, two engineers leading Meta’s post-quantum readiness work. They sit down with Pascal Hartig (@passy) to discuss the threat of quantum computing and how Meta is working to keep today’s users safe from the quantum attacks of tomorrow.

And for more on post-quantum readiness at Meta be sure to read their blog, Post-quantum readiness for TLS at Meta.

Download or listen to the podcast episode below:

[embedded content]
You can also find the episode wherever you get your podcasts, including:

The Meta Tech Podcast is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.

Send us feedback on Instagram, Threads, or X.

And if you’re interested in learning more about career opportunities at Meta visit the Meta Careers page.

How Meta enforces purpose limitation via Privacy Aware Infrastructure at scale

Tue, 27 Aug 2024 18:00:00 +0200

At Meta, we’ve been diligently working to incorporate privacy into different systems of our software stack over the past few years. Today, we’re excited to share some cutting-edge technologies that are part of our Privacy Aware Infrastructure (PAI) initiative. These innovations mark a major milestone in our ongoing commitment to honoring user privacy.
PAI offers efficient and reliable first-class privacy constructs embedded in Meta infrastructure to address complex privacy issues. For example, we built Policy Zones that apply across our infrastructure to address restrictions on data, such as using it only for allowed purposes, providing strong guarantees for limiting the purposes of its processing.
As we expanded PAI across Meta, increasing its maturity, we gained valuable insights. Our understanding of the technology evolved, revealing the need for a larger investment than initially planned to create a cohesive ecosystem of libraries, tool suites, integrations, and more. These investments have been crucial in enforcing complex purpose limitation scenarios while ensuring scalability, reliability, and a streamlined developer experience.

Purpose limitation, a core data protection principle, is about ensuring data is only processed for explicitly stated purposes. A crucial aspect of purpose limitation is managing data as it flows across systems and services. Commonly, purpose limitation can rely on “point checking” controls at the point of data processing. This approach involves using simple if statements in code (“code assets”) or access control mechanisms for datasets (“data assets”) in data systems. However, this approach can be fragile as it requires frequent and exhaustive code audits to ensure the continuous validity of these controls, especially as the codebase evolves. Additionally, access control mechanisms manage permissions for different datasets to reflect various purposes using mechanisms like access control lists (ACLs), which requires the physical separation of data into distinct assets to ensure each maintains a single purpose. When Meta started to address more and larger-scope purpose limitation requirements that crossed dozens of our systems, these point checking controls did not scale.

At Meta, millions of data assets are crucial for powering our product ecosystem, optimizing machine learning models for personalized experiences, and ensuring our products are high quality and meet user expectations. Identifying which code branches and data assets require protection is challenging due to complex propagation requirements and permissions models that need constant revision. For example, when a data consumer reads from one data asset (“source”) and stores the output in another (“sink”), point checking controls would require complex orchestration to ensure propagation from sources to sinks, which can become operationally unviable.

To address this problem, point checking controls can be enhanced by leveraging data flow signals. Data flows can be tracked from the same origin, where relevant data is collected, using various techniques such as static code analysis, logging, and post-query processing. This creates a graph, known as “data lineage,” that tracks the relationships between source and sink data assets. By utilizing data lineage, permissions can be applied to relevant data assets based on these source-to-sink relationships. The combination of point checking and data lineage, while viable at a small scale, leads to significant operational overhead as point checking still requires auditing many individual assets.

Building on these insights, in our latest iteration, we found that the information flow control (IFC) model offers a more durable and sustainable approach by controlling not only data access but also how data is processed and transferred in real-time, rather than relying on point checking or out-of-band audits. Thus, we developed Policy Zones as our IFC-based technology and integrated it across major Meta systems to enhance our purpose limitation capabilities at scale. This effort was later expanded into the Privacy Aware Infrastructure (PAI) initiative, a transformative investment that integrates first-class privacy support into Meta’s infrastructure systems.

We believe PAI is the right investment to protect people’s privacy at scale and can effectively enforce purpose limitation requirements.

Why invest in Policy Zones?

Through our experience deploying purpose limitation solutions over the years, we identified several key themes:

Needs	Problem	Solution
Programmatic Control: We needed to rely more on programmatic controls instead of point checking human audits to control data flows, and do so in real-time	Traditional point checking controls, combined with data lineage checks, can detect data transfers within a specific time frame but not in real-time. Addressing these risks requires implementing resource-intensive human audits at access points.	In contrast, PAI is designed to check data flows in real-time during code execution, blocking problematic data flows from occurring, facilitated by UX tooling, thus making it more scalable.
Granular Flow Control: We needed to maximize the reuse of existing data and business logic on complex infra	Access control is easy to roll out when data is separated physically, but poses significant costs, complexity, and limitations when dealing with Meta’s complex infrastructure, where data for different purposes is often processed by shared code.	PAI solves this by providing precise decision making at the granular level of individual requests, function calls, or data elements, achieving logical data separation at a relatively low compute cost even on complex infrastructures where it’s needed.
Adaptable and Extensible Control: We needed to handle ever-evolving requirements, even multiple for the same data assets	We are facing a rapidly changing world for privacy. Data use restrictions can vary over time depending on evolving privacy and product requirements. A single data asset or different parts of it might be subject to multiple privacy requirements. While “point checking” can address this to some extent, it struggles to control downstream data flows, even combined with data lineage.	PAI is designed to check multiple requirements involved in data flows and is highly flexible to adapt to changing requirements.

How Policy Zones works

Let’s dive into what Policy Zones is and how we can leverage it to meet purpose limitation requirements. Policy Zones provides a comprehensive mechanism for encapsulating, evaluating, and propagating privacy constraints for data both “in transit” and “at rest,” including transitions between different systems. It conducts runtime evaluation of constraints, context propagation, and is deeply integrated with numerous data and code frameworks (e.g., HHVM, Presto, and Spark), representing a step change in how we approach information flow control.

To make the explanation more relatable and bring some levity to a serious topic, we’ll use a simple example: Let’s say a new requirement comes up, where banana data can only be used for the purposes of making smoothies and fruit baskets, but not for making banana bread. For simplicity, this example and the illustration below only demonstrate the first row of the above table.

How would developers leverage Policy Zones to implement such a requirement?

First, to demarcate relevant data assets, they assign a metadata label (“data annotation,” e.g., BANANA_DATA) to data assets at different granularities. This annotation is associated with the purpose limitation requirement as a set of data flow rules that enable systems to understand the allowed purposes for the data.

When annotated data is processed, Policy Zones kicks in and checks whether the data processing is allowed and data can flow downstream. Policy Zones has been built into different Meta systems, including:

Function-based systems that load, process, and propagate data through stacks of function calls in different programming languages. Examples include web frontend, middle-tier, and backend services.
Batch-processing systems that process data rows in batch (mainly via SQL). Examples include real-time and data warehouse systems that power Meta’s AI and analytics workloads.

Let’s dive deeper into how Policy Zones works for the function-based systems, while the same logic applies to the batch-processing systems as well.

In function-based systems, data is passed through parameters, variables, or return values in a stack of function calls.

Let’s walk through an example:

A web request, “BananaRequest,” loads annotated data from BananaDB, causing a data flow violation because the intent of the caller is unknown.
To remediate the data flow violation, we annotate BananaRequest with the BANANA_DATA label, creating a zone (“Banana Zone”) for the request.
Behind the scenes at runtime, Policy Zones programmatically checks all data flows against the flow rules based on the context, flagging new data flow violations from BananaRequest to logB and logC.
We annotate logB as banana and remove the logging of banana data into logC to cut off the disallowed data flow.
With all data flow violations remediated, the zone can be moved from logging mode to enforcement. If a developer adds a write to a sink outside of the zone, it will be blocked automatically.

In a more complex scenario, a function, “makeBananaSmoothie()” from a web request, “BreakfastRequest” calls another function, “makeBanana().” Besides the previous data flow violations, we need to remediate another data flow violation: makeBanana() returns banana data to makeBananaSmoothie(). This means we can create a “Banana Zone” from the function makeBananaSmoothie() that includes all functions that it calls directly or indirectly.

In batch-processing systems, data is processed in batches for rows from tables that are annotated as containing relevant data. When a job runs a query (usually SQL-based) to process the data, a zone is created and Policy Zones flags any data flow violations. Remediation options are provided, similar to those for function-based systems. Once all violations have been remediated, the zone can be moved from logging mode to enforcement mode to prevent future data flow violations. Data annotation can be done at various levels of granularity, including table, column, row, or potentially even cell.

When data flows across different systems (e.g., from frontend, to data warehouse, then to AI), Policy Zones ensures that relevant data is annotated correctly and thus continues to be protected according to the requirements. For some systems that don’t have Policy Zones integrated yet, the point checking control is still used to protect the data.

How we applied PAI to existing systems at scale

The above gives you a glimpse into how the technology is used to roll out a simple use case. However, adopting Policy Zones is a non-trivial task for complex requirements across tens or hundreds of systems. The requirement owner usually collaborates with other engineers who are code and data asset owners across Meta to implement different aspects of that requirement. In some cases, this may involve hundreds or thousands of engineers to complete the implementation and audits. To address this challenge, PAI offers Policy Zone Manager (PZM), a suite of UX tools that helps requirement owners to efficiently enforce privacy requirements using PAI.

Let’s take a look at how PZM makes it easy for people to satisfy their purpose limitation needs in existing systems, using the above banana requirement as an example. At a high level, the requirement owner carries out the following workflow, facilitated by PZM:

Identify relevant assets: This is to identify which source assets need to be purpose limited for the given requirement.
Discover relevant data flows: This is to discover the downstream data flows from the source assets in order to integrate Policy Zones at scale.
Remediate data flow violations: This is to allow people to choose which option to take to remediate data flow violations.
Continuously enforce and monitor data flows: This is to turn on Policy Zones enforcement and monitor it to prevent new data flow violations.

To hear more about this process, check out our presentation at the PEPR conference in June 2024.

[embedded content]

Step 1 – Identify relevant assets

For a given requirement, we check the relevant product entry points (e.g., mobile apps, web requests, and databases) to pinpoint data assets that are collected. These assets may take the form of request parameters, database entries, or event log entries. We use data structures to represent (“schematize”) these data assets and fields, capturing relevant data at various granularities. In the running example, a table in the banana database might contain entirely banana data, a single banana column, or a mix of banana and other fruit data.

In addition to manual code inspection, we heavily rely on various techniques such as our scalable ML-based classifier to automatically identify data assets.

Step 2 – Discover relevant data flows

From a given annotated source, the requirement owner can identify its downstream data flows and sinks (see diagram below). The owner can then decide how to handle these data flows. However, this process can be time consuming when there are many data flows that are one or multiple hops away from the same origin. This often occurs when implementing a new requirement over existing data flows.

Although data lineage presents significant operational overhead for point checking mechanisms, it can efficiently identify where to integrate Policy Zones into the codebase. Therefore, we have integrated data lineage into PZM, allowing requirement owners to discover multiple downstream assets from a given source simultaneously. Once the requirement has been fully implemented, we can rely solely on Policy Zones to enforce the requirements.

Step 3 – Remediate data flow violations

By default, the data flow from a source asset to a sink must meet all of the requirements of the source. If not, it’s considered a data flow violation and needs remediation, enforced by Policy Zones programmatically at runtime. There are three main cases to remediate data flow violations (using the running example to help concretize the general cases):

Case 1: Safe flow – relevant data is used for allowed purpose(s): Assign the banana annotation to the sink asset.
Case 2: Unsafe flow – relevant data is used for disallowed purpose(s): Block data access and code execution to prevent further processing of banana data.
Case 3: Reclassified flow – relevant data is not used or propagated: Annotate the data flow as reclassified as being permitted. Banana data from the source is not used or propagated to the sink.

Step 4 – Continuously enforce and monitor data flows

PAI is integrated into our major data systems to check data flows and catch violations at runtime. During the initial rollout of a new requirement, Policy Zones can be configured to allow remediations of flow violations in “logging mode.” Once Policy Zones enforcement is enabled, any data flow with unremediated violations is denied. This also prevents new data flow violations, even if code changes or new code is added.

PAI continuously monitors the enforcement of requirements to ensure that it operates correctly. PZM provides a set of verifiers to check the accuracy of asset annotations and control configurations.

Lessons learned from adoption at scale across Meta

As PAI has been adopted by a multitude of purpose limitation requirements across Meta, we’ve learned several key lessons over the past few years:

Focus on solving one specific end-to-end use case first

Initially, we developed Policy Zones for batch-processing systems with some basic use cases. However, we realized that our designs for function-based systems were quite abstract and the adoption for a large-scale use case resulted in significant challenges, consequently, requiring considerable effort to map patterns to customer needs. Furthermore, refining the APIs and building missing operational support made it work effectively end-to-end across multiple systems. Only after addressing these challenges were we able to make it more generic and proceed with integrating Policy Zones across extensive platforms.

Streamline integration complexity

Integrating PAI into major Meta systems coherently was a complex, lengthy, and challenging process. We encountered significant difficulties in integrating PAI with Meta’s diverse systems broadly. It took us years to overcome these challenges. For example, initially, product teams expended considerable effort to schematize data assets across different data systems. Then we developed reliable, computationally efficient, and widely applicable PAI libraries in various programming languages (Hack, C++, Python, etc.) that enabled a smoother integration with a broad range of Meta’s systems.

Invest in computational and developer efficiency early on

We also undertook multiple iterations to simplify PAI and improve its computational efficiency. Our initial annotation APIs were overly complex, resulting in high cognitive overhead for engineers. Furthermore, the computational overhead of data flow checking was prohibitively high in Meta’s high-throughput systems. Through several rounds of refinement, we simplified policy lattice representation and evaluation, built language-level features to natively propagate Policy Zones context, and canonicalized policy annotation structures, achieving 10x improvements in computational efficiency.

Simplified and independent annotations are a must to scale to a wide range of requirements

Initially, we employed a monolithic annotation API to model intricate data flow rules and annotate relevant code and data. However, as data from multiple requirements were combined, propagating these annotations from sources to sinks became increasingly complex, resulting in data annotation conflicts that were difficult to resolve. To address this challenge, we implemented simplified data annotations to decouple data from requirements and separate data flow rules for different requirements. This significantly streamlined the annotation process, ultimately improving developer experiences.

Build tools; they are required

We have made significant efforts to ensure the use of PAI is easy and efficient, ultimately improving the developer experience. Initially, we focused on the correctness of the technology first before investing in tooling. Adopting Policy Zones required a lot of manual effort, and it was challenging for engineers to understand how to properly annotate their assets, which led to additional cleanup work later. To address this issue, we developed the PZM tool family, which includes built-in automated rules and classifiers. These tools guide teams through standard workflows, ensuring safe and efficient rollout of purpose limitation requirements and reducing engineering efforts by orders of magnitude.

Durable privacy protection for everyone

Meta is committed to protecting user privacy. The PAI initiative is a crucial step in safeguarding data and preserving privacy efficiently and reliably. It provides a robust foundation for Meta to sustainably tackle privacy challenges, meet high reliability standards, and address future privacy issues more efficiently than traditional solutions. While we’ve laid a strong groundwork, our journey is just beginning. We aim to build upon this foundation by expanding our capabilities and controls to accommodate a wider range of privacy requirements, enhancing the developer experience, and exploring new frontiers.

We hope our work sparks innovation and fosters collaboration across the industry in the field of privacy.

Acknowledgements

The authors would like to acknowledge the contributions of many current and former Meta employees who have played a crucial role in productionizing and adopting PAI over the years. In particular, we would like to extend special thanks to (in alphabetical order) Adrian Zgorzalek, Alex Gorelik, Amritha Raghunath, Anuja Jaiswal, Brian Sniffen, Brian Romanko, Brian Spanton, David Detlefs, David Mortenson, David Taieb, Gabriela Jacques da Silva, Iuliu Rus, Jafar Husain, Jerry Pan, Jiang Wu, Joel Krebs, Jun Fang, Komal Mangtani, Marc Celani, Mark Konetchy, Michael Levin, Perry Stoll, Peter Prelich, Pieter Viljoen, Prashant Dhamdhere, Rajesh Nishtala, Rajkishan Gunasekaran, Rishab Mangla, Sergey Doroshenko, Seth Silverman, Sriguru Chakravarthi, Tarek Sheasha, Thomas Georgiou, Uday Ramesh Savagaonkar, Vitalii Tsybulnyk, Vlad Fedorov, Wolfram Schulte, and Yi Huang. We would also like to express our gratitude to all reviewers of this post, including (in alphabetical order) Aleksandar Ilic, Benjamin Renard, Emil Vazquez, Emile Litvak, Harrison Fisk, Jason Hendrickson, Jessica Retka, Nimish Shah, Sabrina B Ross, and Sam Blatchford. We would like to especially thank Emily DiPietro for championing the idea, leading the editorial effort, and pulling all required support together to make this blog post happen.

RETINAS: Real-Time Infrastructure Accounting for Sustainability

Mon, 26 Aug 2024 18:00:00 +0200

We are introducing a new metric— real-time server fleet utilization effectiveness —as part of the RETINAS initiative to help reduce emissions and achieve net zero emissions across our value chain in 2030
This new metric allows us to measure server resource usage (e.g., compute, storage) and efficiency in our large-scale data center server fleet in near real-time.
We are sharing our learnings in adopting depreciation methods for accumulated carbon assets for internal fleet measurements, and encourage further industry improvement and development on these concepts. This is not intended to replace global emissions accounting standards for purposes of external reporting.

Since 2020, Meta has maintained net zero emissions in our operations and matched 100% of our electricity use with renewable energy. However, we know our work doesn’t stop there, and we recognize our responsibility to decarbonize our footprint beyond our data centers and offices, including emissions from the server components our suppliers manufacture to our employees’ commutes. To align with the Paris Agreement, we have set a goal to reach net zero emissions across our value chain in 2030.

Meta’s Net Zero Program has three foundational pillars: understanding our emissions, reducing our emissions, and removing remaining emissions. To understand our emissions, improving the granularity, accuracy, and near real-time measurement of our greenhouse gas data goes beyond carbon accounting. The right data will help us apply actionable metrics to advance decarbonization across our business operations and with our suppliers.

With this in mind, we have created the Real Time Infrastructure Accounting for Sustainability (RETINAS) initiative, which seeks to study and understand the impact of server reliability, performance, and operational optimization on Meta’s Scope 3 emissions.

This initiative has led to the development of a new internal metric— real-time server fleet utilization effectiveness —that enables us to take action to reduce the emissions associated with the embodied carbon of our data center servers and components. Embodied carbon contributes to Meta’s upstream Scope 3 emissions, and includes the emissions associated with the full lifecycle of the manufacturing, assembly, and transportation of servers and materials in our physical infrastructure.

Optimizing the utilization of our server fleet is important to reducing these emissions. Real-time server fleet utilization effectiveness provides a framework toward effective measurement and integration of embodied carbon into ubiquitous infrastructure metrics to drive informed decisions to manage our server fleet resource usage (e.g., compute and storage) and their impacts on Meta’s Scope 3 emissions.

How we measure greenhouse gas emissions at Meta

Since 2011, Meta has reported our Scope 1 and 2 emissions. In 2017, we began reporting select Scope 3 emissions categories. Since 2019 we have reported annually on all relevant emissions defined by the Greenhouse Gas Protocol. We obtain limited assurance conducted by a third party for select environmental metrics. In our accounting, data center servers and their components are a significant driver of our Scope 3 emissions footprint, and we have taken numerous steps to deepen our understanding of those emissions in order to surface reduction opportunities.

An important reduction strategy we are focused on is the circularity of our servers and components. The more effectively and efficiently servers are utilized, the more sustainable the server fleet. We can extend the lifespans of servers, components, and network infrastructure with improvements to server reliability, efforts to reuse components based on their reliability expectations, and various performance optimizations and operational improvements (e.g., firmware/server upgrades and repairs).

While implementing these circularity strategies, we observed limitations in current carbon accounting practices to understand and weigh Scope 3 emissions trade-offs in our server fleet against traditional power, performance, and total cost of ownership (TCO) metrics, such as performance per dollar, performance per watt, and performance per dollar per watt, in real-time.

Current carbon accounting and reporting practices for Scope 3 emissions are static. For data center servers and components, in particular, this means that the entirety of the embodied emissions from the upstream supply chain, manufacturing, and logistics is attributed in the year of purchase. Benefits from circularity are not realized in our Scope 3 footprint until future purchases of new servers or components are deferred. This does not provide actionable information to our operational teams in real-time on how varying the usage or the expected life of the acquired servers can impact Meta’s Scope 3 emissions.

We see a need to develop internal metrics to monitor and incentivize greater efficiency, utilization, and extension of the expected life of servers, which will influence current and future server fleet management.

Introducing real-time server fleet utilization effectiveness

The RETINAS initiative, launched by Meta’s Infrastructure Engineering team, seeks to study and understand how server reliability, performance, and operational optimization impact Meta’s Scope 3 emissions. To understand this holistically, we introduced a standardized, fleet-wide metric for any given resource (e.g., a server or rack) that measures the utilization of embodied carbon:

Where:

This metric borrows depreciation concepts from finance and accounting practices and applies them to aspects of server reliability, efficiency, and useful life. The concept of depreciation is used to showcase the expected useful life of acquired assets. This concept also allows for tracking of acquisition and disposition of server resources at fleet scale and is reported on an ongoing basis.

Utilization metrics like power usage effectiveness (PUE) and hardware usage effectiveness (HUE) measure the effective IT usage from a power perspective at the data center and server level, respectively. Combining depreciated Scope 3 emissions with these utilization metrics allows us to standardize these measurements along with other fleet health measurements for a defined period of time.

We illustrate the usage of this metric with a set of servers and various circularity strategies.

Example (current static state)

Let’s consider an example set of servers purchased in 2023 which have associated embodied emissions attributed to the buyer with 1000 tons of CO2e. Here is how this would be represented using current, static carbon accounting methods:

There is no representation for the useful life of the example set of servers. If we change the server set’s useful life (UL) from four years to five years, the metric doesn’t move.

Example (with proposed dynamic accounting)

For the same example of servers purchased in 2023 with 1000 units of CO2e Scope 3 emissions, we use the concept of depreciation over a period of useful life of four years (example time horizon):

Depreciation in action:

If the server set’s useful life is modified from four years to five years, this would be visible as part of the depreciation metric and showcase the longevity of resource usage.

Effective change in depreciation with extension (from 4y UL to 5y UL):

Within large-scale infrastructure, there are different layers of availability within the hardware and software stack, such as hardware, firmware, the kernel, the operating system, and the application. At each layer, there are metrics associated with efficiency based on available capacity, resources, and their effective use. To represent the use of a depreciation-based metric, we examine the efficiency of a service at the application stack. The representative graph below showcases an example set of variations in the utilization effectiveness stemming from application improvements over a larger time scale.

Utilization effectiveness is defined as: Total resource available / Resource utilized

Combining the depreciation of embodied emissions resources per unit time to the utilization effectiveness for a given unit of time (say, every year), we can arrive at a more real-time measurement of server fleet utilization effectiveness of embodied carbon. (Note: The measurements for utilization effectiveness in the chart are representative values).

Our goal is to consistently minimize the real-time server fleet utilization effectiveness. Utilization effectiveness ideally is decreasing asymptotically towards 1, when resources available are 100% utilized. Depreciation of Scope 3 emissions over a longer period of time due to a longer useful life will also minimize this metric. Combined, this metric allows for ranking of different efforts one must pursue within the server fleet and compare and contrast efficiency improvements, reliability efforts like extensions or initial component selection, and associated embodied carbon impacts.

Below is the comparison of real-time server fleet utilization effectiveness and the way the metric behaves under server life extension and efficiency improvements with the above considered scenarios.

Characteristics of the metric:

The metric above can enable relative comparison of circularity strategies on the server fleet. It can be sliced horizontally into any given timescale (from seconds to years) for understanding a resource’s (e.g., servers or racks) embodied emissions attribution giving fine-grained real-time insights for the server fleet. The metric can also be vertically sliced to obtain utilization effectiveness at different layers of the stack, from entire servers, to containers, to production workloads, to app residencies for short durations – combining that with the associated resource available for the chosen abstraction.

To illustrate how this metric can be used, using the same set example as above:

Increasing server useful life from five years to seven years lowers the metric by 28% due to slower depreciation.
Enabling reuse of a component, pursuing an application efficiency improvement, or choosing server parts that have lower emissions will contribute towards the metric and enable cross-stack tradeoff.

To close, we can observe in this example that this single metric ties together different fleet operations towards a single goal of reducing embodied emissions, delivering insights for decision making at any given time horizon. By integrating depreciation and utilization effectiveness to embodied carbon, our operational and server fleet management teams can leverage this metric to make data-driven decisions that address an important portion of Meta’s Scope 3 footprint.

How PyTorch powers AI training and inference

Fri, 23 Aug 2024 18:00:00 +0200

Learn about new PyTorch advancements for LLMs and how PyTorch is enhancing every aspect of the LLM lifecycle.

In this talk from AI Infra @ Scale 2024, software engineers Wanchao Liang and Evan Smothers are joined by Meta research scientist Kimish Patel to discuss our newest features and tools that enable large-scale training, memory efficient fine-tuning, and on-device LLM capabilities.

First, they cover the importance of memory-efficient fine-tuning and a few common architectural and algorithmic techniques to enable fine-tuning on consumer-grade hardware. Then they discuss the challenges of deploying large models for on-device deployment and how techniques such as quantization make these deployments possible.

[embedded content]

Inside the hardware and co-design of MTIA

Thu, 22 Aug 2024 18:00:00 +0200

In this talk from AI Infra @ Scale 2024, Joel Colburn, a software engineer at Meta, technical lead Junqiang Lan, and software engineer Jack Montgomery discuss the second generation of MTIA, Meta’s in-house training and inference accelerator.

They cover the co-design process behind building the second generation of Meta’s first-ever custom silicon for AI workloads, including the PyTorch software ecosystem, and the model architectures for Meta’s key applications. They demonstrate how MTIA achieves the performance, efficiency, and developer experience to successfully launch models into production. They also highlight several co-design examples where special silicon features are utilized to accelerate Meta’s models.

[embedded content]

Bringing Llama 3 to life

Wed, 21 Aug 2024 18:00:00 +0200

Llama 3 is Meta’s most capable openly-available LLM to date and the recently-released Llama 3.1 will enable new workflows, such as synthetic data generation and model distillation with unmatched flexibility, control, and state-of-the-art capabilities that rival the best closed source models.

At AI Infra @ Scale 2024, Meta engineers discussed every step of how we built and brought Llama 3 to life, from data and training to inference.

Joe Spisak, Product Director and Head of Generative AI Open Source at Meta, talks about the history of Llama and Meta’s overarching vision for open source AI.

He’s joined by Delia David, a software engineer at Meta, to discuss all things data-related for GenAI. David covers the diversity, volume, and freshness of data needed for GenAI and how different data types should be extracted and prepared.

Kaushik Veeraraghavan, a software engineer at Meta, discusses how Meta trains Llama at scale and delves into the data center, networking, and software investments that have enabled the development of Meta’s Llama 3 models.

Finally, Ye (Charlotte) Qia, a production engineer at Meta, discusses how Meta handles inference for Llama. Optimizing and scaling LLM inference is important for enabling large-scale product applications. Qia introduces key parallelism techniques that help scale model sizes and context windows, which in turn influence inference system designs. She also discusses practical challenges associated with deploying these complex serving paradigms throughout Meta’s internal cloud to our data center of heterogeneous hardware.

[embedded content]

Aparna Ramani discusses the future of AI infrastructure

Tue, 20 Aug 2024 18:00:00 +0200

Delivering new AI technologies at scale also means rethinking every layer of our infrastructure – from silicon and software systems and even our data center designs.

For the second year in a row, Meta’s engineering and infrastructure teams returned for the AI Infra @ Scale conference, where they discussed the challenges of scaling up an infrastructure for AI as well as work being done on our large-scale GPU clusters, open hardware designs for next-generation data center hardware, and how Meta is building custom silicon like the Meta Training and Inference Accelerator (MTIA) to handle some of our AI training workloads.

Aparna Ramani, VP of Engineering at Meta, responsible for AI infrastructure, data infrastructure and developer infrastructure, delivered the opening keynote at AI Infra @Scale 2024 and discussed the AI landscape up to today, the technical challenges, and how solutions like open models and hardware can push AI to new frontiers.

Watch the full keynote presentation below:

[embedded content]

How Meta animates AI-generated images at scale

Wed, 14 Aug 2024 23:20:00 +0200

We launched Meta AI with the goal of giving people new ways to be more productive and unlock their creativity with generative AI (GenAI). But GenAI also comes with challenges of scale. As we deploy new GenAI technologies at Meta, we also focus on delivering these services to people as quickly and efficiently as possible.

Meta AI’s animate feature, which lets people generate a short animation of a generated image, carried unique challenges in this regard. To deploy and run at scale, our model to generate image animations had to be able to serve billions of people who use our products and services, do so quickly – with fast generation times and minimal errors, and remain resource efficient.

Here’s how we were able to deploy Meta AI’s animate feature using a combination of latency optimizations, traffic management, and other novel techniques.

Optimizing latency for generating image animations

Before launching the animate feature across our family of apps and on the Meta AI website, making animation models fast was one of our top priorities. We wanted people to see the magic of requesting an animation and seeing it appear in just a few seconds. Not only was this important from a user perspective, but the faster and more efficient we made our model, the more we could do with fewer GPUs, helping us scale in a sustainable way. Our work in creating animated stickers with video diffusion, accelerating image generation with Imagine Flash, and accelerating diffusion models through block caching all helped us develop novel techniques we used to accomplish large latency wins.

Halving floating-point precision

The first of those optimization techniques involved halving floating-point precision. We converted the model from float32 to float16, which speeds up the inference time for two reasons. First, the memory footprint of the model is halved. Second, 16 floating-point operations can be executed faster than 32. For all models, to capture these benefits we use bfloat16, a float16 variant with a smaller mantissa for training and inference.

Improving temporal-attention expansion

The second optimization improved temporal-attention expansion. Temporal-attention layers, which are attending between the time axis and text conditioning, require the context tensors to be replicated to match the time dimension, or the number of frames. Previously, this would be done before passing to cross-attention layers. However, this results in less-than-optimal performance gains. The optimized implementation we went with reduces compute and memory by taking advantage of the fact that the repeated tensors are identical, allowing for expansion to occur after passing through the cross-attention’s linear projection layers.

Leveraging DPM-Solver to reduce sampling steps

The third optimization utilized DPM-Solver. Diffusion probabilistic models (DPMs) are powerful and influential models that can produce extremely high-quality generations—but they can be slow. Other possible solutions, such as denoising diffusion-implicit models or denoising diffusion-probabilistic models, can provide quality generation but at the computational cost of more sampling steps. We leveraged DPM-Solver and a linear-in-log signal-to-noise time to reduce the number of sampling steps to 15.

Combining guidance and step distillation

The fourth optimization we implemented combined guidance and step distillation. We accomplished step distillation by initializing a teacher and student with the same weights, and then trained the student to match multiple teacher steps in a single step. Guidance distillation, in contrast, refers to how diffusion models leverage a classifier-free guidance for conditional image generation. This requires both a conditional and unconditional forward pass for every solver step.

In our case, however, we had three forward passes per step: an unconditional, an image-conditional, and a full-conditional, text-and-image step. Guidance distillation reduced these three forward passes into one, cutting inference by a factor of three. The real magic here, though, was combining these two optimizations. By training a student to imitate the classifier-free guidance and multiple steps at the same time with one forward pass through the U-Net, our final model required only eight solver steps, with just one forward pass through the U-Net per step. In the end, during training we distilled 32 teacher steps into eight student steps.

By combining guidance and step distillation we were able to distill 32 steps, each with multiple passes through the U-Net for each conditional type, down to only eight steps through the U-Net architecture.

PyTorch optimizations

The final optimization relates to deployment and architecture and involves two transformations. The first was leveraging torch scripting and freezing. By converting the model to TorchScript, we achieved many automatic optimizations. These included continuous folding, fusing multiple operations, and reducing the complexity of the computational graph. Those three optimizations helped to increase inference speed, while freezing allowed further optimization by transforming dynamically computed values in the graph to constants, reducing the total number of operations.

While these optimizations were critical for our initial launch, we have continued to push the boundaries. For example, we have since migrated all of our media inference from TorchScript to use a PyTorch 2.0-based solution, and this resulted in multiple wins for us. We were able to optimize model components at a more granular level with pytorch.compile at the component level, as well as enable advanced optimization techniques such as context parallel and sequence parallel in the new architecture. This led to additional wins, from reducing the development time of advanced features to improvements in tracing and to being able to support multi-GPU inference.

Deploying and running image animation at scale

Once we had fully optimized the model, we had a new set of challenges to tackle. How could we run this model at scale to support traffic from around the world, all while maintaining fast generation time with minimal failures and ensuring that GPUs are available for all other important use cases around the company?

We started by looking at the data for previous traffic on our AI-generated media both at their launches and over time. We used this info to calculate rough estimates of the quantity of requests we could expect, and then used our benchmarking of model speed to determine how many GPUs would be needed to accommodate that quantity. Once we’d scaled that up, we began running load tests to see if we could handle a range of traffic levels, addressing the various bottlenecks until we were able to handle the traffic projected for launch.

During this testing, we noticed that the end-to-end latency of an animation request was higher than expected—as well as higher than what we had seen after building in all the optimizations described above. Our investigations yielded that traffic was being routed globally, resulting in significant network and communication overhead and adding seconds to the end-to-end generation time. To address this, we utilized a traffic management system that fetches the service’s traffic or load data and uses that to calculate a routing table. The primary objective of the routing table is to keep as many requests as possible in the same region as their requester to avoid having traffic across regions like we were seeing before. The routing table also leverages our predefined load thresholds and routing rings to prevent overload by offloading traffic to other regions when nearing maximum capacity in a region. With these changes, the preponderance of requests remained in region and latency dropped to roughly what we would expect.

Quite a lot of moving parts make this service work. First, it takes each of the metrics that we define for a tier, fetches the value of each from the tier’s machines, and aggregates it by region. It then collects the number of requests per second that each region sends to every other region and uses that to calculate the request-per-second load cost. This tells the system that, generally speaking, the load will increase by X for every added request per second. Once this is complete, the algorithm begins, first by bringing all the traffic to the source region. We don’t yet check if the region has enough capacity or not.

The next step is to enter a loop where during every iteration we look at which region is running closest to maximum capacity. The service tries to take a chunk of that region’s requests and offload them to a nearby region that can handle them without becoming more overloaded. Various levels of overload determine how far away we consider when looking at nearby regions. For example, if the main region is only just starting to run hot, only the closest regions might be utilized. If the region is running at almost maximum capacity, farther-away regions may be unlocked for offloading. We’ll exit the loop if there are no more requests that can be moved between regions, which occurs either when every region is below the defined “overload” threshold or there are no more nearby regions the service can offload to because all nearby regions are also above the threshold. At this point, the service will calculate the optimal number of requests per second for each region and use that to create the routing table mentioned above so our service can appropriately determine where to send traffic at request time.

Part of our work to help ensure requests for animations are delivered as quickly as possible involved implementing a traffic management system to keep requests in the same region as their requester whenever possible.

With these optimizations in place, latency returned to levels that we were happy with, but we were seeing a dip in success rate. At a high level, each GPU can only actively work on one request at a time, as each request fully saturates the GPU. To maintain fast latency, it’s imperative that we don’t allow requests to queue up—otherwise, they’ll have a long wait time. To enforce this we made sure that the server load—queued requests plus in-flight requests—is at most one, and that the server rejects other new requests. Because of this, however, when we are running near our capacity limit, we will run into a number of failures. The naive solution to this issue would be to use a queue, but due to having to load balance globally, that presents its own sets of complex challenges to being efficient and fast. What we used instead was approximating by abusing retries to create a probing system that checks for free GPUs really fast and prevents failures.

This worked well before we implemented the traffic management system. That system, while effective at reducing latency, introduced more complications by reducing the number of available hosts for a request, since we no longer had the global routing. We noticed that the retry polling was no longer being helpful and actually tended to cascade if there were any spikes. Further investigation led us to discover that our router needed to have more optimized settings for retries. It had neither delay nor backoff. So if we had a region where lots of tasks were trying to run, it was stuck overloading until it started failing requests. To avoid the cascading errors, we modified these retry settings to add a marginal execution delay to a percentage of jobs at scheduling time—making them available to execute gradually instead of all at once—as well as an exponential backoff.

Once all of this was done, we had a deployed model that was highly efficient, functioned at scale, and could handle global traffic with high availability and a minimum failure rate.

By adding a marginal execution delay, optimizing retries, and exponential backoff we were able to reduce the number of errors in our system.

To learn more about our work developing and deploying GenAI to animate images, read:

Animated Stickers: Bringing Stickers to Life with Video Diffusion

Imagine Flash: Accelerating Emu Diffusion Models with Backward Distillation

Cache Me if You Can: Accelerating Diffusion Models through Block Caching

A RoCE network for distributed AI training at scale

Mon, 05 Aug 2024 18:00:00 +0200

AI networks play an important role in interconnecting tens of thousands of GPUs together, forming the foundational infrastructure for training, enabling large models with hundreds of billions of parameters such as LLAMA 3.1 405B.
This week at ACM SIGCOMM 2024 in Sydney, Australia, we are sharing details on the network we have built at Meta over the past few years to support our large-scale distributed AI training workload.
Our paper, “RDMA over Ethernet for Distributed AI Training at Meta Scale,” provides the details on how we design, implement, and operate one of the world’s largest AI networks at scale.

The growing prevalence of AI has introduced a new era of communication demands. Distributed training, in particular, imposes the most significant strain on data center networking infrastructure. For instance, a typical generative AI (GenAI) job may necessitate tight coordination of tens of thousands of GPUs over the course of several weeks. Constructing a reliable, high-performance network infrastructure capable of accommodating this burgeoning demand necessitates a reevaluation of data center network design.

When Meta introduced distributed GPU-based training, we decided to construct specialized data center networks tailored for these GPU clusters. We opted for RDMA Over Converged Ethernet version 2 (RoCEv2) as the inter-node communication transport for the majority of our AI capacity.

We have successfully expanded our RoCE networks, evolving from prototypes to the deployment of numerous clusters, each accommodating thousands of GPUs. These RoCE clusters support an extensive range of production distributed GPU training jobs, including ranking, content recommendation, content understanding, natural language processing, and GenAI model training, among other workloads.

Topology

We built a dedicated backend network specifically for distributed training. This allowed us to evolve, operate, and scale independently from the rest of the data center network. To support large language models (LLMs), we expanded the backend network towards the DC-scale, e.g., incorporating topology-awareness into the training job scheduler.

The separation

The training cluster relies on two independent networks: the frontend (FE) network for tasks such as data ingestion, checkpointing, and logging, and the backend (BE) network for training, as depicted below.

A training rack is connected to both the FE and BE of the data center network. The FE has a hierarchy of network layers – rack switches (RSWs), fabric switches (FSWs), and higher – that houses the storage warehouse, which provides GPUs with the necessary input data for training workloads. We ensure that there is enough ingress bandwidth on the rack switch to not hinder the training workload.

The BE is a specialized fabric that connects all RDMA NICs in a non-blocking architecture, providing high bandwidth, low latency, and lossless transport between any two GPUs in the cluster, regardless of their physical location. This backend fabric utilizes the RoCEv2 protocol, which encapsulates the RDMA service in UDP packets for transport over the network.

AI Zone

Our BE networks have undergone several transformations. Initially, our GPU clusters used a simple star topology with a few AI racks connected to a central Ethernet switch running the non-routable RoCEv1 protocol. This setup had clear limitations in GPU scale and switch redundancy. Therefore, we swiftly transitioned to a fabric-based architecture for extended scalability and higher availability.

We designed a two-stage Clos topology for AI racks, known as an AI Zone. The rack training switch (RTSW), serving as the leaf switch, offers scale-up connectivity for GPUs within the rack using copper-based DAC cables. The spine tier, composed of modular cluster training switches (CTSW), provides scale-out connectivity among all racks in the cluster. The CTSW has deep buffers statically divided over the ports in the chassis. The RTSWs connect to CTSWs via single-mode fiber and 400G pluggable transceivers.

The AI Zones are designed to support a large number of interconnected GPUs in a non-blocking manner. However, emerging AI advancements, such as LLMs like Llama, demand a GPU scale larger than what a single AI zone provides. To accommodate this, we designed an aggregator training switch (ATSW) layer that connects the CTSWs in a data center building, expanding the RoCE domain beyond a single AI Zone.

Note, the cross-AI Zone connectivity is oversubscribed by design, with network traffic balanced using ECMP. To mitigate the performance bottleneck for cross-AI Zone traffic, we enhanced the training job scheduler to find a “minimum cut” when dividing the training nodes into different AI Zones, reducing the cross-AI Zone traffic and thus collective completion time. The scheduler does this by learning the position of GPU servers in the logical topology to recommend a rank assignment.

Routing

The scaling of compute power and network topology discussed above led to the question of how to efficiently balance and route the massive training traffic. Specifically, the AI training workloads had several challenging characteristics:

Low entropy: Compared to traditional data center workloads, the number and the diversity of flows for AI workloads are much smaller and the flow patterns are usually repetitive and predictable.
Burstiness: On the time dimension, the flows usually exhibit the “on and of”’ nature in the time granularity of milliseconds.
Elephant flows: For each burst, the intensity of each flow could reach up to the line rate of NICs.

ECMP and path pinning

We initially considered the widely adopted ECMP, which places flows randomly based on the hashes on the five-tuple: source and destination IPs, source and destination UDP ports, and protocol. However, and as expected, ECMP rendered poor performance for the training workload due to the low flow entropy.

Alternatively, we designed and deployed a path-pinning scheme in the initial years of our deployment. This scheme routed packets to specific paths based on the destination “slice” (the index of the RTSW downlink). This worked well if each rack was fully assigned to the same job and there was no failure in the network. However, this was seldom true. We saw that the rack can be partially allocated to a job, with only one of the two hosts in the rack using the uplink bandwidth. This fragmented job placement caused uneven traffic distribution and congestion on the uplinks of the particular RTSW and degraded the training performance up to more than 30%. Further, network failures on a uplink or a CTSW caused the affected flows to be unevenly reassigned to other CTSWs by ECMP. Those reassigned flows collided with other existing flows and slowed down the whole training job.

We mitigated the immediate impact of these flow collisions by upgrading the bandwidth of the RTSW uplinks bandwidth by 2x. Hence we allowed for the RTSW uplink capacity to be 1:2 under-subscribed compared to the RTSW downlink capacity. While this mitigated the immediate performance impact, this was an expensive solution as it required 2x network capacity. Thus, we recognized this as a short-term mitigation and proceeded to further stages of routing evolution.

Queue pair scaling

We next revisited ECMP with an intent to increase the number of flows for hierarchical collectives through the queue pair (QP) scaling software feature in the collective library.

To account for this, we configured switches to perform Enhanced ECMP (E-ECMP) to additionally hash on the destination QP field of a RoCE packet using the UDF capability of the switch ASIC. This increased entropy and, compared to baseline ECMP without QP scaling, we observed that E-ECMP along with QP scaling showed performance improvement of up to 40% for the AllReduce collective.

We evaluated two QP scaling strategies. The first involved splitting each message meant to be posted over a single QP, instead onto multiple QPs resulting in multiple flows. But it also produced smaller message sizes on fabric as well as multiple ACKs. The second approach involved posting each message to a different queue, in a round-robin fashion. For the NIC message sizes demonstrated in our production with NCCL, we observed the latter to be performing well. This feature has been important for ECMP scalability by increasing the network flows for hierarchical collectives like AllReduce.

While we improved ECMP performance with QP scaling, the underlying probabilistic nature of hashing was a persistent downside of this routing scheme. Also, the need to customize the QP scaling factor and methodology based on the workload type, while workable in the short-term, presented long-term operational complexity.

Congestion control

As we transitioned to 400G deployments, we attempted to tune DCQCN to adapt to new network speeds and topology. However, with default DCQCN settings and doubled ECN thresholds compared to 200G networks, performance was degraded. Further investigation revealed that DCQCN implementation in firmware has changed, introducing bugs and reduced visibility with problems relating to correct CNP counting.

We proceeded without DCQCN for our 400G deployments. At this time, we have had over a year of experience with just PFC for flow control, without any other transport-level congestion control. We have observed stable performance and lack of persistent congestion for training collectives.

Receiver-driven traffic admission

To mitigate the congestion for 400G and beyond, we co-designed the collective library and RoCE transport to enforce receiver-driven traffic admission for better performance. The diagram below shows that the GPU-to-GPU communication architecture in our production training clusters predominantly uses two-stage copy and receiver-initiated communication via the NCCL collective library. Each GPU’s high bandwidth memory (HBM) maintains multiple channels for parallel transmission of chunked collective messages. The sender GPU threads first copy data from the compute buffer to an available channel buffer. The sender CPU proxy thread can only post an RDMA write request after receiving a clear-to-send (CTS) packet from the receiver, which includes the size and memory information. The receiver’s GPU threads then copy the channel buffer contents to the destination compute buffer. Finally, CPU proxy threads on both sides recycle the channel buffer, and the receiver CPU proxy sends another CTS packet once the channel buffer is ready.

We effectively leverage this mechanism as a receiver-driven traffic admission to limit the amount of in-flight traffic on the network, especially when congestion starts to build up. However, configuring the right setting can be challenging as:

The number of channels is limited due to the resource contention on GPU threads with concurrent compute operations;
Setting the channel buffer size requires a more careful balance between congestion spreading and bandwidth under-utilization than Infiniband due to RoCE’s more coarse-grained flow control and possible end-host slowness.

Thus, we took two steps to improve the performance. First, we experimentally determined the right parameter settings for the number of channels and channel buffer size across various training job sizes and collective types. Second, we implemented high priority queuing at switches for CTS packets to expedite the notifications and mitigate potential bandwidth starvation.

Congestion control has been a focal point of research in RDMA networks. DCQCN has been the gold standard for storage-focused networks. However, our experience with distributed AI training workloads provides a different perspective on tailoring the congestion control algorithms. Despite turning off DCQCN and multiple instances of RTSW sending PFC to a deep-buffer CTSW, we have not encountered a scenario over the last four years where production AI training traffic causes the CTSW to send PFCs to RTSWs persistently.

Our current solution depends on careful coordination between the collective communication library and the network. It may depend on the relative throughput between GPU and network, which may not be applicable to all scenarios. We encourage the research community to put more focus on this topic.

Moving forward

The design and operation of large-scale RoCE networks for distributed AI training workloads have evolved to meet the increasing demands of computational density and scale. By segregating FE and BE networks, employing various routing schemes, and optimizing collective traffic patterns, we have been able to build a performant and reliable network infrastructure. These designs and insights underline the importance of deeply understanding the training workload and translating these implications into network component design, ultimately contributing to the advancement of distributed AI training infrastructure.

With the fast growing trend of GenAI workload, our network infrastructure will evolve rapidly.

Read the paper

RDMA over Ethernet for Distributed AI Training at Meta Scale

Acknowledgements

We would like to thank all contributors to the paper, including Rui Miao, Shengbao Zheng, Sai Jayesh Bondu, Guilherme Goes, Hany Morsy, Rohit Puri, Adi Mohammad Riftadi, Ashmitha Jeevaraj Shetty, Jingyi Yang, Shuqiang Zhang, Mikel Jimenez Fernandez, Shashi Gandham, Omar Baldonado. Many current and former people in the Network Infrastructure team at Meta have contributed to productionizing RoCE networks for AI training over the years. In particular, we would like to acknowledge Srinivas Sridharan, Petr Lapukhov, Jose Leitao, and Brandon Taylor. This work is a close collaboration with our partners in Meta’s AI Production Engineering, AI and Systems Co-design, and AI Hardware Systems teams.

DCPerf: An open source benchmark suite for hyperscale compute applications

Mon, 05 Aug 2024 17:55:00 +0200

We are open-sourcing DCPerf, a collection of benchmarks that represents the diverse categories of workloads that run in data center cloud deployments.
We hope that DCperf can be used more broadly by academia, the hardware industry, and internet companies to design and evaluate future products.
DCPerf is available now on GitHub.

Hyperscale and cloud datacenter deployments constitute the largest market share of server deployments in the world today. Workloads developed by large-scale internet companies running in their datacenters have very different characteristics than those in high performance computing (HPC) or traditional enterprise market segments. Therefore, server design considerations, trade-offs and objectives for datacenter use cases are also significantly different from other market segments and require a different set of benchmarks and evaluation methodology. Existing benchmarks fall short of capturing these characteristics and hence do not provide a reliable avenue to design and optimize modern server and datacenter designs.

Introducing DCPerf

Meta developed DCPerf, a collection of benchmarks to represent the diverse categories of workloads that run in cloud deployments. Each benchmark within DCPerf is designed by referencing a large application within Meta’s production server fleet.

We used several new techniques to ensure benchmark representativeness, ranging from low-level hardware microarchitecture features to application and library usage profiles, to analyze production workloads and capture the important characteristics of these workloads in DCPerf. Designing and optimizing hardware and software on future server platforms using these benchmarks willmore closely translate into improved efficiency of hyperscaler production deployments.

DCPerf’s design process.

Over the past few years, we have continuously enhanced these benchmarks to make them compatible with different instruction set architectures, including x86 and ARM. We also validated that the benchmarks can be used to evaluate emerging industry trends, (e.g., chiplet-based architectures), and added support for multi-tenancy so that benchmarks can scale and make use of rapidly increasing core counts on modern server platforms.

Using DCPerf to improve Meta’s compute server designs

We have been using DCPerf internally, in addition to the SPEC CPU benchmark suite, for product evaluation at Meta to make the right configuration choices for our data center deployments. DCPerf also helps us make early performance projections that are used for capacity planning, identify performance bugs in hardware and system software, and jointly optimize the platform with our hardware industry collaborators.

DCPerf provides a much richer set of application software diversity and helps get better coverage signals on platform performance versus existing benchmarks such as SPEC CPU. Due to these benefits, we have also started using DCPerf to assist with our decision making process on which platforms to deploy in our data centers.

DCPerf captures the core and SOC microarchitecture characteristics of data center applications. Graph compares Instruction-Per-Cycle of production applications, DCPerf and SPEC CPU. Red circles highlight that DCPerf more accurately represents IPC of production applications.

DCPerf more closely captures the power and frequency characteristics of data center applications. This graph compares the average core frequency of production applications, DCPerf and SPEC CPU. Red circles highlight that DCPerf more accurately represents the frequency characteristics of production applications.

Improving state-of-the-art computing platforms with our hardware industry collaborators using DCPerf

Over the last two years we have collaborated with leading CPU vendors to further validate DCPerf on pre silicon and/or early silicon setups to debug performance issues and identify hardware and system software optimizations on their roadmap products. There have been multiple instances where we have been able to identify performance optimizations in areas such as CPU core microarchitecture settings and SOC power management optimizations.

The graphic below shows areas of HW/SW design where we have seen DCPerf being representative of production usage and being beneficial for delivering relevant performance signals and help with optimizations as well as areas of future work.

We are thankful for our collaborators’ support and contributions using DCPerf to drive innovation in such an important and complex area and expect to continue improving the benchmarks with new version releases over time to adapt to emerging technologies.

Enabling innovations through open collaboration

Today, we are open-sourcing DCPerf with the goal to create a collaborative and open source reference benchmark that can be used to design, develop, debug, optimize, and improve state-of-the-art in compute platform designs for hyperscale.

As an open source benchmark suite, DCPerf has the potential to become an industry standard method to capture important workload characteristics of compute workloads that run in hyperscale datacenter deployments.

Get DCPerf on GitHub

DCPerf is available now on GitHub

Meet Caddy – Meta’s next-gen mixed reality CAD software

Thu, 18 Jul 2024 15:00:00 +0200

What happens when a team of mechanical engineers get tired of looking at flat images of 3D models over Zoom?

Meet the team behind Caddy, a new CAD app for mixed reality. They join Pascal Hartig (@passy) on the Meta Tech Podcast to talk about teaching themselves to code, disrupting the CAD software space, and how they integrated Caddy with Llama 3, and so much more!

You can download Caddy today to check it out yourself!

Download or listen to the podcast episode below:

[embedded content]

You can also find the episode wherever you get your podcasts, including:

The Meta Tech Podcast is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.

Send us feedback on Instagram, Threads, or X.

And if you’re interested in learning more about career opportunities at Meta visit the Meta Careers page.

AI Lab: The secrets to keeping machine learning engineers moving fast

Tue, 16 Jul 2024 18:00:00 +0200

The key to developer velocity across AI lies in minimizing time to first batch (TTFB) for machine learning (ML) engineers.
AI Lab is a pre-production framework used internally at Meta. It allows us to continuously A/B test common ML workflows – enabling proactive improvements and automatically preventing regressions on TTFB.
AI Lab prevents TTFB regressions whilst enabling experimentation to develop improvements. For example, during the rollout of the open source Python Cinder runtime, AI Lab was used to yield a 2x increase on original TTFB improvements, reducing TTFB by up to 40%.

Time to first batch (TTFB), the delay from when a workflow is submitted to the training job’s first batch, plays an important role in accelerating our machine learning (ML) engineers’ iteration speeds. Essentially, TTFB is the time elapsed from the moment you hit the “start” button on your ML model training to the point when the first batch of data enters the model for processing. TTFB influences the overhead for all ML training jobs and is essentially the moment when developers first get a signal on their job.

By minimizing TTFB we’re unblocking our ML engineers, increasing the number of iterations they can do per day, and improving the overall speed of innovation at Meta.

Supporting TTFB across Meta requires a scalable offering to not only enable proactive improvements on this valuable metric, but also keep it healthy autonomously. To this end we’ve created AI Lab, a pre-production TTFB signal generation tool which empowers infra owners to ship new changes with high confidence, reducing TTFB by up to 40%. This, coupled with automatic prevention of regressions keeps ML engineers moving fast across Meta.

Optimizing TTFB helps ML engineers move fast

The overhead induced from TTFB is on the critical path for most ML development. It is composed of components like config validation, feature pre-processing, and infra overhead (like queuing for capacity). Optimizations to components of TTFB can even impact the entire training cycle of some models. At Meta’s scale, the metric value of TTFB often subtly changes as developers iterate on their model, launcher, or architecture.

Example TTFB measurement with components.

To get and keep ML engineers moving fast, two things are required:

Offensively improve TTFB: We need an intuitive, easy-to-use experimentation framework that allows users to quantify the impact of their changes, enabling fast iteration, and impact certification of new features, empowering infra owners to ship new changes with high confidence.
Defensively prevent regressions on TTFB: We need continuous regression prevention that tests the latest changes in a low-noise environment, whilst providing a way to monitor, detect, and prevent regressions from affecting ML engineers in the first place.

Introducing AI Lab

AI Lab is a specialized pre-production framework in which we continuously execute common ML workflows as an A/B test to accurately measure the impact of recent changes on metrics like TTFB. Built on top of the same systems as MobileLab, AI Lab automatically defends TTFB by preventing regressions prior to release and enables offensive TTFB improvements opportunistically as an experimentation framework.

Building AI Lab presented unique challenges. Because GPU capacity is such a precious resource, we had to ensure we were a net positive to capacity usage across Meta. We took care to work with partners on shrunk models and simple configurations like some that could run on only CPUs, but still prevent the regressions that would regularly tie up GPUs. To this end, we created an auto-shrinker that aims to ensure tests run the same code / configurations as production; except consume less compute resources. It does things like reduce the number of training iterations and model size, even enabling more deterministic behavior. These tests often run in <10 minutes, which is beneficial for developers iterating on potential TTFB changes. We also needed a holistic strategy to scale to the size of Meta, something we’ll cover in a later section.

AI Lab finding a regression in TTFB.

Let’s jump into a real example for how we can leverage a tool like AI Lab to reduce TTFB.

Reducing TTFB with the Python Cinder runtime and AI Lab

Meta’s open source Python Cinder runtime brought with it up to a 40% improvement in TTFB thanks to the aggressive lazy imports. Here, we see the true utility of a framework like AI Lab and how it was used to facilitate this sweeping change.

Offensively

We can leverage AI Lab instead of experimenting on real ML engineers’ workflows that may require days or weeks of turnaround to validate a performance hypothesis. With AI Lab, in less than an hour, we’re able to accurately test and measure the impact of a proposed Cinder version on TTFB across a comprehensive set of representative ML scenarios.

In practice, developers turned this into an iteration loop to test further optimizations and fine-tune Cinder, yielding a 2x increase on the original TTFB improvements they were seeing. For example, initially in profiles with Cinder enabled engineers found that up to 10% of the execution time was spent in a workflow to just pretty print. Turns out, the method of memoization used caused a repr() to happen on an underlying data structure, which just so happened to be huge in typical ML scenarios. Instead, they made an object wrapper on this underlying data structure and made memoization comparisons using the object identities instead.

AI Lab verified the improvement, enabling them to proceed with rolling out the change.

Defensively

Around when Cinder began rolling out, a regression just so happened to occur that was completely unrelated to the rollout. In this new regression, an engineer added some logging that they believed was being done asynchronously. Unbeknownst to them, the call was actually blocking due to one of the nested clients they required being synchronous. AI Lab leveraged Incident Tracker and automatically attributed the regression down to the specific change. The change author of the regression was notified shortly afterwards, reverting their change before the release went out to production.

Thanks to AI Lab, the engineers working on Cinder never had to worry about a TTFB regression occurring in the same release they rolled out in, avoiding a potential rollback.

AI Lab root causing a specific change that caused a TTFB regression.

How to achieve prevention at Meta’s scale

We want to give accurate TTFB signals as early as possible in the development cycle, but it’s infeasible to benchmark all ML scenarios for every change made by every engineer at Meta. Instead, similar to predictive test selection, we establish a limit on capacity used and set out to find as many regressions/improvements as early in the development cycle as possible. In practice, this means:

AI Lab integrating at various stages pre-production.

O(Code Changes): Running relevant, effective, and computationally efficient (often CPU-only) AI Lab tests on prospective changes before they are even reviewed.
O(Releases): Running a more holistic set of AI Lab tests prior to release and performing a bisect-like attribution process to find the root cause.
1. Attribution in this manner is highly effective and efficient; it serves as a great fallback when we must run more computationally intensive tests to find a certain regression.

AI Lab’s high-level end-to-end flow.

Should we find a statistically significant change per a t-test, we perform additional checks before marking it as a regression/improvement:

Run confirmation runs to confirm we confidently reproduce the expected regression/improvement.
Ensure the size of the regression/improvement is above a dynamic threshold based on the standard deviation of the test and a tuned receiver operating characteristic. For example, a partner may require <1 false positive per week, which sets the threshold for our tests to find as many true positives as possible whilst staying below that.

Inviting industry collaboration

While AI Lab is an internal-only tool at Meta, we would love to hear from members of the community who may be running similar platforms. Synthetic signal production is a boon to both developers and users. When developers can rapidly evaluate a hypothesis, and users can experience fewer regressions, it speeds up AI innovation across the industry. We’d love to collaborate with the industry to explore more ways we can improve on tools like AI Lab and optimize more metrics like TTFB.

Acknowledgements

AI Lab was made possible due to the foundational work of MobileLab. As we aim to scale past TTFB, we look forward to tackling AI efficiency metrics too with ServiceLab. We’d like to thank members of the AI Training Orchestration team for helping us build AI Lab and all of our users for leveraging the product to keep improving TTFB.

Taming the tail utilization of ads inference at Meta scale

Wed, 10 Jul 2024 22:30:00 +0200

Tail utilization is a significant system issue and a major factor in overload-related failures and low compute utilization.
The tail utilization optimizations at Meta have had a profound impact on model serving capacity footprint and reliability.
Failure rates, which are mostly timeout errors, were reduced by two-thirds; the compute footprint delivered 35% more work for the same amount of resources; and p99 latency was cut in half.

The inference platforms that serve the sophisticated machine learning models used by Meta’s ads delivery system require significant infrastructure capacity across CPUs, GPUs, storage, networking, and databases. Improving tail utilization – the utilization level of the top 5% of the servers when ranked by utilization– within our infrastructure is imperative to operate our fleet efficiently and sustainably.

With the growing complexity and computational intensity of these models, as well as the strict latency and throughput requirements to deliver ads, we’ve implemented system optimizations and best practices to address tail utilization. The solutions we’ve implemented for our ads inference service have positively impacted compute utilization in our ads fleet in several ways, including increasing work output by 35 percent without additional resources, decreasing timeout error rates by two-thirds, and reducing tail latency at p99 by half.

How Meta’s ads model inference service works

When placing an ad, client requests are routed to the inference service to get predictions. A single request from a client typically results in multiple model inferences being requested, depending on experiment setup, page type, and ad attributes. This is shown below in figure 1 as a request from the ads core services to the model inference service. The actual request flow is more complex but for the purpose of this post, the below schematic model should serve well.

The inference service leverages Meta infrastructure capabilities such as ServiceRouter for service discovery, load balancing, and other reliability features. The service is set up as a sharded service where each model is a shard and multiple models are hosted in a single host of a job that spans multiple hosts.

This is supported by Meta’s sharding service, Shard Manager, a universal infrastructure solution that facilitates efficient development and operation of reliable sharded applications. Meta’s advertising team leverages Shard Manager’sload balancing and shard scaling capabilities to effectively handle shards across heterogeneous hardware.

Figure 1: The ads inference architecture.

Challenges of load balancing

There are two approaches to load balancing:

Routing load balancing – load balancing across replicas of a single model. We use ServiceRouter to enable routing based load balancing.
Placement load balancing – balancing load on hosts by moving replicas of a model across hosts.

Fundamental concepts like replica estimation, snapshot transition and multi-service deployments are key aspects of model productionisation that make load balancing in this environment a complex problem.

Replica estimation

When a new version of the model enters the system, the number of replicas needed for the new model version is estimated based on historical data of the replica usage of the model.

Snapshot transition

Ads models are continuously updated to improve their performance. The ads inference system then transitions traffic from the older model to the new version. Updated and refreshed models get a new snapshot ID. Snapshot transition is the mechanism by which the refreshed model replaces the current model serving production traffic.

Multi-service deployment

Models are deployed to multiple service tiers to take advantage of hardware heterogeneity and elastic capacity.

Why is tail utilization a problem?

Tail utilization is a problem because as the number of requests increases, servers that contribute to high tail utilization become overloaded and fail, ultimately affecting our service level agreements (SLAs). Consequently, the extra headroom or buffer needed to handle increased traffic is directly determined by the tail utilization.

This is challenging because it leads to overallocation of capacity for the service. If demand increases, capacity headroom is necessary in constrained servers to maintain service levels when accommodating new demand. Since capacity is uniformly added to all servers in a cluster, generating headroom in constrained servers involves adding significantly more capacity than required for headroom.

In addition, tail utilization for most constrained servers grows faster than lower percentile utilization due to the non linear relationship between traffic increase and utilization. This is the reason why more capacity is needed even while the system is under utilized on average.

Making the utilization distribution tighter across the fleet unlocks capacity within servers running at low utilization, i.e. the fleet can support more requests and model launches while maintaining SLAs.

Figure 2: Divergence in the tail utilization distribution across percentile ranges.

How we optimized tail utilization

The implemented solution comprises a class of technical optimizations that attempt to balance the objectives of improving utilization and reducing error rate and latency.

The improvements made the utilization distribution tighter. This created the ability to move work from crunched servers to low utilization servers and absorb increased demand. As a result, the system has been able to absorb up to 35% load increase with no additional capacity.

Figure 3: Convergence of tail utilization distribution across percentiles.

The reliability also improved, reducing the timeout error rate by two-thirds and cutting latency by half.

Figure 4: System reliability over time.

The solution involved two approaches:

Tuning load balancing mechanisms
Making system level changes in model productionisation.

The first approach is well understood in the industry. The second one required significant trial, testing, and nuanced execution.

Tuning load balancing mechanisms

The power of two choices

The service mesh, ServiceRouter, provides detailed instrumentation that allows a better understanding of the load balancing characteristics. Specifically relevant to tail utilization is suboptimal load balancing because of load staleness. To address this we leveraged the power of two choices in a randomized load balancing mechanism. This algorithm requires load data from the servers. This telemetry is collected either by polling – query server load before request dispatch; or by load-header – piggyback on response.

Polling provides fresh load, while it adds an additional hop, but on the other side, load-header results in reading stale load. Load staleness is a significant issue for large services with substantial clients. Any error here due to staleness would result in random load balancing. For polling, given the inference request is computationally expensive, the overhead was found to be negligible. Using polling improved tail utilization noticeably because heavily loaded hosts were actively avoided. This approach worked very well specifically for inference requests greater than 10s of milliseconds.

ServiceRouter provides various tuning load-balancing capabilities. We tested many of these techniques, including the number of choices for server selection (i.e., power of k instead of 2), backup request configuration, and hardware-specific routing weights.

These changes offered marginal improvements. CPU utilization as load-counter was especially insightful. While it is intuitive to balance based on CPU utilization, it turned out to be not useful because: CPU utilization is aggregated over some period of time versus the need for instant load information in this case; and outstanding active tasks waiting on I/O were not taken into account correctly.

Placement load balancing

Placement load balancing helped a lot. Given the diversity in model resource demand characteristics and machine resource supply, there is significant variance in server utilization. There is an opportunity to make the utilization distribution tighter by tuning the Shard Manager load balancing configurations, such as load bands, thresholds, and balancing frequency. The basic tuning above helped and provided big gains. It also exposed a deeper problem like spiky tail utilization, which was hidden behind the high tail utilization and was fixed once identified .

System level changes

There wasn’t a single significant cause for the utilization variance and several intriguing issues emerged among them that offered valuable insights into the system characteristics.

Memory bandwidth

CPU spikes were observed when new replicas, placed on hosts already hosting other models, began serving traffic. Ideally, this should not happen because Shard Manager should only place a replica when the resource requirements are met. Upon examining the spike pattern, the team discovered that the stall cycles were increasing significantly. Using dynolog perf instrumentations, we determined that memory latency was increasing as well, which aligned with memory latency benchmarks.

Memory latency starts to increase exponentially at around 65-70% utilization. It appears to be an increase in CPU utilization, but the actual issue was that the CPU was stalling. The solution involved considering memory bandwidth as a resource during replica placement in Shard Manager.

ServiceRouter and Shard Manager expectation mismatch

There is a service control plane component called ReplicaEstimator that performs replica count estimation for a model. When ReplicaEstimator performs this estimation, the expectation is that each replica roughly receives the same amount of traffic. Shard Manager also works under this assumption that replicas of the same model will roughly be equal in their resource usage on a host. Shard Manager load balancing also assumes this property. There are also cases where Shard Manager uses load information from other replicas if load fetch fails. So ReplicaEstimator and Shard Manager share the same expectation that each replica will end up doing roughly the same amount of work.

ServiceRouter employs the default load counter, which encompasses both active and queued outstanding requests on a host. In general, this works fine when there is only one replica per host and they are expected to receive the same amount of load. However, this assumption is broken due to multi-tenancy, resulting in each host potentially having different models and outstanding requests on a host cannot be used to compare load as it can vary greatly. For example, two hosts serving the same model could have completely different load metrics leading to significant CPU imbalance issues.

The imbalance of replica load created because of the host level consolidated load counter violates Shard Manager and ReplicaEstimator expectations. A simple and elegant solution to this problem is a per-model load counter. If each model were to expose a load counter based on its own load on the server, ServiceRouter will end up balancing load across model replicas, and Shard Manger will end up more accurately balancing hosts. Replica estimation also ends up being more accurate. All expectations are aligned.

Support for this was added to the prediction client by explicitly setting the load counter per model client and exposing appropriate per model load metric on the server side. The model replica load distribution as expected became much tighter with a per-model load counter and helps with the problems discussed above.

But this also presented some challenges. Enabling per-model load counter changes the load distribution instantaneously, causing spikes until Shard Manager catches up and rebalances. The team built a mechanism to make the transition smooth by gradually rolling out the load counter change to the client. Then there are models with low load that end up having per-model load counter values of ‘0’, making it essentially random. In the default load counter configuration, such models end up using the host level load as a good proxy to decide which server to send the request to.

“Outstanding examples CPU” was the most promising load counter among many that were tested. It is the estimated total CPU time spent on active requests, and better represents the cost of outstanding work. The counter is normalized by the number of cores to account for machine heterogeneity.

Figure 5: Throughput as measured by requests per second across hosts in a tier.

Snapshot transition

Some ads models are retrained more frequently than others. Discounting real-time updated models, the majority of the models involve transitioning traffic from a previous model snapshot to the new model snapshot. Snapshot transition is a major disruption to a balanced system, especially when the transitioning models have a large number of replicas.

During peak traffic, snapshot transition can have a significant impact on utilization. Figure 6 below illustrates the issue. The snapshot transition of large models during a crunched time causes utilization to be very unbalanced until Shard Manager is able to bring it back in balance. This takes a few load balancing runs because the placement of the new model during peak traffic ends up violating CPU soft thresholds. The problem of load counters, as discussed earlier, further complicates Shard Manager’s ability to resolve issues.

Figure 6: A utilization spike due to the snapshot transition.

To mitigate this issue, the team added the snapshot transition budget capability. This allows for snapshot transitions to occur only when resource utilization is below a configured threshold. The trade-off here is between snapshot staleness and failure rate. Fast scale down of old snapshots helped minimize the overhead of snapshot staleness while maintaining lower failure rates.

Cross-service load balancing

After optimizing load balancing within a single service, the next step was to extend this to multiple services. Each regional model inference service is made up of multiple sub-services depending on hardware type and capacity pools – guaranteed and elastic pools. We changed the calculation to the compute capacity of the hosts instead of the host number. This helped with a more balanced load across tiers.

Certain hardware types are more loaded than others. Given that clients maintain separate connections to these tiers, ServiceRouter load balancing, which performs balancing within tiers, did not help. Given the production setup, it was non-trivial to put all these tiers behind a single parent tier. Therefore, the team added a small utilization balancing feedback controller to adjust traffic routing percentages and achieve balance between these tiers. Figure 7 shows an example of this being rolled out.

Figure 7: Request per service.

Replica estimation and predictive scaling

Shard Manager employs a reactive approach to load by scaling up replicas in response to a load increase. This meant increased error rates during the time replicas were scaled up and became ready. This is exacerbated by the fact that replicas with higher utilization are more prone to utilization spikes given the non-linear relationship between queries per second (QPS) and utilization. To add to this, when auto-scaling kicks in, it responds to a much larger CPU requirement and results in over-replication. We designed a simple predictive replica estimation system for the models that predicts future resource usage based on current and past usage patterns up to two hours in advance. This approach yielded significant improvements in failure rate during peak periods.

Next steps

The next step in our journey is to adopt our learnings around tail utilization to new system architectures and platforms. For example, we’re actively working to apply the utilizations discussed here to IPnext, Meta’s next-generation unified platform for managing the entire lifecycle of machine learning model deployments, from publishing to serving. IPnext’s modular design enables us to support various model architectures (e.g., for ranking or GenAI applications) through a single platform spanning multiple data center regions. Optimizing tail utilization within IPnext thereby delivering these benefits to a broader range of expanding machine learning inference use cases at Meta.

Meta’s approach to machine learning prediction robustness

Wed, 10 Jul 2024 15:00:00 +0200

Meta’s advertising business leverages large-scale machine learning (ML) recommendation models that power millions of ads recommendations per second across Meta’s family of apps. Maintaining reliability of these ML systems helps ensure the highest level of service and uninterrupted benefit delivery to our users and advertisers. To minimize disruptions and ensure our ML systems are intrinsically resilient, we have built a comprehensive set of prediction robustness solutions that ensure stability without compromising performance or availability of our ML systems.

Why is machine learning robustness difficult?

Solving for ML prediction stability has many unique characteristics, making it more complex than addressing stability challenges for traditional online services:

ML models are stochastic by nature. Prediction uncertainty is inherent, which makes it difficult to define, identify, diagnose, reproduce, and debug prediction quality issues.
Constant and frequent refreshing of models and features. ML models and features are continuously updated to learn from and reflect people’s interests, which makes it challenging to locate prediction quality issues, contain their impact, and quickly resolve them
Blurred line between reliability and performance.In traditional online services, reliability issues are easier to detect based on service metrics such as latency and availability. However, ML prediction stability implies a consistent prediction quality shift, which is harder to distinguish. For example, an “available” ML recommender system that reliably produces inaccurate predictions is actually “unreliable.”
Cumulative effect of small distribution shifts over time. Due to the stochastic nature of ML models, small regressions in prediction quality are hard to distinguish from the anticipated organic traffic-pattern changes. However, if undetected, such small prediction regressions could have a significant cumulative negative impact over time.
Long chain of complex interactions. The final ML prediction result is derived from a complex chain of processing and propagation across multiple ML systems. Regression in prediction quality could be traced back to several hops upstream in the chain, making it hard to diagnose and locate stability improvements per specific ML system.
Small fluctuations can amplify to become big impacts. Even small changes in the input data (e.g., features, training data, and model hyperparameters) can have a significant and unpredictable impact on the final predictions. This poses a major challenge in containing prediction quality issues at particular ML artifacts (model, feature, label), and it requires end-to-end global protection.
Rising complexity with rapid modeling innovations. Meta’s ML technologies are evolving rapidly, with increasingly larger and more complex models and new architectures. This requires prediction robustness solutions to evolve at the same fast pace.

Meta’s approach and progress towards prediction robustness

Meta has developed a systematic framework to build prediction robustness. This framework includes a set of prevention guardrails to build control from outside-in, fundamental understanding of the issues to gain ML insights, and a set of technical fortifications to establish intrinsic robustness.

These three approaches are exercised across models, features, training data, calibration, and interpretability to ensure all possible issues are covered throughout the ML ecosystem. With prediction robustness, Meta’s ML systems are robust by design, and any stability issues are actively monitored and resolved to ensure smooth ads delivery for our users and advertisers.

Figure 1: A simplified view of Meta’s ads recommendation system shows the flow of complex interactions for producing the final predictions.

Our prediction robustness solution systematically covers all areas of the recommender system – training data, features, models, calibration, and interpretability.

Model robustness

Model robustness challenges include model snapshot quality, model snapshot freshness, and inferencing availability. We use Snapshot Validator, an internal-only real-time, scalable, and low-latency model evaluation system, as the prevention guardrail on the quality of every single model snapshot, before it ever serves production traffic.

Snapshot Validator runs evaluations with holdout datasets on newly-published model snapshots in real-time, and it determines whether the new snapshot can serve production traffic. Snapshot Validator has reduced model snapshot corruption by 74% in the past two years. It has protected >90% of Meta ads ranking models in production without prolonging Meta’s real-time model refresh.

In addition, Meta engineers built new ML techniques to improve the intrinsic robustness of models, such as pruning less-useful modules inside models, better model generalization against overfitting, more effective quantization algorithms, and ensuring model resilience in performance even with a small amount of input data anomalies. Together these techniques have improved the ads ML model stability, making the models resilient against overfitting, loss divergence, and more.

Feature robustness

Feature robustness focuses on guaranteeing the quality of ML features across coverage, data distribution, freshness, and training-inference consistency. As prevention guardrails, robust feature monitoring systems were in production to continuously detect anomalies on ML features. As the ML-feature-value distributions can change widely with non-deterministics sways on model performance, the anomaly detection systems have turned to accommodate the particular traffic and ML prediction patterns for accuracy.

Upon detection, automated preventive measures will kick in to ensure abnormal features are not used in production. Furthermore, a real-time feature importance evaluation system is built to provide fundamental understanding of the correlation between feature quality and model prediction quality.

All these solutions have effectively contained ML feature issues on coverage drop, data corruption, and inconsistency in Meta.

Training data robustness

The wide spectrum of Meta ads products requires distinct labeling logics for model training, which significantly increases the complexity of labeling. In addition, the data sources for label calculation could be unstable, due to the complicated logging infrastructure and the organic traffic drifts. Dedicated training-data-quality systems were built as the prevention guardrails to detect label drifts over time with high accuracy, and swiftly and automatically mitigate the abnormal data changes and prevent models from learning the affected training data.

Additionally, fundamental understanding of training data label consistency has resulted in optimizations in training data generation for better model learning.

Calibration robustness

Calibration robustness builds real-time monitoring and auto-mitigation toolsets to guarantee that the final prediction is well calibrated, which is vital for advertiser experiences. The calibration mechanism is technically unique because it is unjoined-data real-time model training, and it is more sensitive to traffic distribution shifts than the joined-data mechanism.

To improve the stability and accuracy of calibration Meta has built prevention guardrails that consist of high-precision alert systems to minimize problem-detection time, as well as high-rigor, automatically orchestrated mitigations to minimize problem-mitigation time.

ML interpretability

ML interpretability focuses on identifying the root causes of all ML instability issues. Hawkeye, our internal AI debugging toolkit, allows engineers at Meta to root-cause tricky ML prediction problems. Hawkeye is an end-to-end and streamlined diagnostic experience covering all ML artifacts at Meta, and it has covered >80% of ads ML artifacts. It is now one of the most widely used tools in the Meta ML engineering community.

Beyond debugging, ML interpretability invests heavily in model internal state understanding – one of the most complex and technically challenging areas in the realm of ML stability. There are no standardized solutions to this challenge, but Meta uses model graph tracing, which uses model internal states on model activations and neuron importance, to accurately explain why models get corrupted.

Altogether, advancements in ML Interpretability have reduced the time to root-cause ML prediction issues by 50%, and have significantly boosted the fundamental understanding of model behaviors.

Improving ranking and productivity with prediction robustness

Going forward, we’ll be extending our prediction robustness solutions to improve ML ranking performance, and boost engineering productivity by accelerating ML developments.

Prediction robustness techniques can boost ML performance by making models more robust intrinsically, with more stable training, less normalized entropy explosion or loss divergence, more resilience to data shift, and stronger generalizability. We’ve seen performance gains from applying robustness techniques like gradient clipping and more robust quantization algorithms. And we will continue to identify more systematic improvement opportunities with model understanding techniques.

In addition, model performance will be improved with less staleness and stronger consistency between serving and training environments across labels, features, inference platform, and more. We plan to continue upgrading Meta’s ads ML services with stronger guarantees of training-serving consistency and more aggressive staleness SLAs.

Regarding ML development productivity, prediction robustness techniques can facilitate model development, and improve daily operations by reducing the time needed to address ML prediction stability issues. We’re currently building an intelligent ML diagnostic platform that will leverage the latest ML technologies, in the context of prediction robustness, to help even engineers with little ML knowledge locate the root cause of ML stability issues within minutes.

The platform will also evaluate reliability risk continuously across the development lifecycle, minimizing delays in ML development due to reliability regressions. It will embed reliability into every ML development stage, from idea exploration all the way to online experimentation and final launches.

Acknowledgements

We would like to thank all the team members and the leadership that contributed to make the Prediction Robustness effort successful in Meta. Special thanks to Alex Gong, Ashish Singh, Ashish Srivastava, Ben Dummitt, Booker Gong, David Serfass, David Thompson, Evan Poon, Govind Kabra, Haibo Lin, Haoyan Yuan, Igor Lytvynenko, Jie Zheng, Jin Zhu, Jing Chen, Junye Wang, Kapil Gupta, Kestutis Patiejunas, Konark Gill, Lanlan Liu, Lu Zheng, Maggie Ma, Marios Kokkodis, Namit Gupta, Ngoc Lan Nguyen, Pedro Perez de Tejada, Pratibha Udmalpet, Qiming Guo, Roopa Iyer, Rohit Iyer, Sam Elshamy, Sagar Chordia, Sheng Luo, Shuo Chang, Shupin Mao, Velavan Trichy, Weifeng Cui, Ximing Chen, Xin Zhao, Yalan Xing, Yiye Lin, Yongjun Xie, Yubin He, Yue Wang, Zewei Jiang, Santanu Kolay, Prabhakar Goyal, Neeraj Bhatia, Sandeep Pandey, Uladzimir Pashkevich, and Matt Steiner.

The key to a happy Rust/C++ relationship

Tue, 25 Jun 2024 18:00:00 +0200

The history of Rust at Meta goes all the way back to 2016, when we first started using it for source control. Today, it has been widely embraced at Meta and is one of our primary supported server-side languages (along with C++, Python, and Hack).

But that doesn’t mean there weren’t any growing pains.

Aida G., a member of one of Meta’s first Rust teams, joins Pascal Hartig (@passy) on the latest Meta Tech Podcast to dive into the challenges of getting Rust to interact with Meta’s large amount of existing C++ code.

Fortunately, the release of cxx, safe interop between C++, and even async Rust have made things a lot easier.

Download or listen to the episode below:

[embedded content]

You can also find the episode wherever you get your podcasts, including:

The Meta Tech Podcast is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.

Send us feedback on Instagram, Threads, or X.

And if you’re interested in learning more about career opportunities at Meta visit the Meta Careers page.

Leveraging AI for efficient incident response

Mon, 24 Jun 2024 18:00:00 +0200

We’re sharing how we streamline system reliability investigations using a new AI-assisted root cause analysis system.
The system uses a combination of heuristic-based retrieval and large language model-based ranking to speed up root cause identification during investigations.
Our testing has shown this new system achieves 42% accuracy in identifying root causes for investigations at their creation time related to our web monorepo.

Investigation is a critical part of ensuring system reliability, and a prerequisite to mitigating issues quickly. This is why Meta is investing in advancing our suite of investigation tooling with tools like Hawkeye, which we use internally for debugging end-to-end machine learning workflows.

Now, we’re leveraging AI to advance our investigation tools even further. We’ve streamlined our investigations through a combination of heuristic-based retrieval and large language model (LLM)-based ranking to provide AI-assisted root cause analysis. During backtesting, this system has achieved promising results: 42% accuracy in identifying root causes for investigations at their creation time related to our web monorepo.

[embedded content]

Investigations at Meta

Every investigation is unique. But identifying the root cause of an issue is necessary to mitigate it properly. Investigating issues in systems dependent on monolithic repositories can present scalability challenges due to the accumulating number of changes involved across many teams. In addition, responders need to build context on the investigation to start working on it, e.g., what is broken, which systems are involved, and who might be impacted.

These challenges can make investigating anomalies a complex and time consuming process. AI offers an opportunity to streamline the process, reducing the time needed and helping responders make better decisions. We focused on building a system capable of identifying potential code changes that might be the root cause for a given investigation.

Figure 1: A responder’s view of an investigation journey.

Our approach to root cause isolation

The system incorporates a novel heuristics-based retriever that is capable of reducing the search space from thousands of changes to a few hundred without significant reduction in accuracy using, for example., code and directory ownership or exploring the runtime code graph of impacted systems. Once we have reduced the search space to a few hundred changes relevant to the ongoing investigation, we rely on a LLM-based ranker system to identify the root cause across these changes.

Figure 2: The system flow for our AI-assisted root cause analysis system.

The ranker system uses a Llama model to further reduce the search space from hundreds of potential code changes to a list of the top five. We explored different ranking algorithms and prompting scenarios and found that ranking through election was most effective to accommodate context window limitations and enable the model to reason across different changes. To rank the changes, we structure prompts to contain a maximum of 20 changes at a time, asking the LLM to identify the top five changes. The output across the LLM requests are aggregated and the process is repeated until we have only five candidates left. Based on exhaustive backtesting, with historical investigations and the information available at their start, 42% of these investigations had the root cause in the top five suggested code changes.

Figure 3: Ranking possible code changes through election.

Training

The biggest lever to achieving 42% accuracy was fine-tuning a Llama 2 (7B) model using historical investigations for which we knew the underlying root cause. We started by running continued pre-training (CPT) using limited and approved internal wikis, Q&As, and code to expose the model to Meta artifacts. Later, we ran a supervised fine-tuning (SFT) phase where we mixed Llama 2’s original SFT data with more internal context and a dedicated investigation root cause analysis (RCA) SFT dataset to teach the model to follow RCA instructions.

Figure 4: The Llama 2 (7B) root cause analysis training process.

Our RCA SFT dataset consists of ~5,000 instruction-tuning examples with details of 2-20 changes from our retriever, including the known root cause, and information known about the investigation at its start, e.g., its title and observed impact. Naturally, the available information density is low at this point, however this allows us to perform better in similar real-world scenarios when we have limited information at the beginning of the investigation.

Using the same fine-tuning data format for each possible culprit then allows us to gather the model’s llog probabilities(logprobs) and rank our search space based on relevancy to a given investigation. We then curated a set of similar fine-tuning examples where we expect the model to yield a list of potential code changes likely responsible for the issue ordered by their logprobs-ranked relevance, with the expected root cause at the start. Appending this new dataset to the original RCA SFT dataset and re-running SFT gives the model the ability to respond appropriately to prompts asking for ranked lists of changes relevant to the investigation.

Figure 5: The process for generating fine-tuning prompts to enable the LLM to produce ranked lists.

The future of AI-assisted Investigations

The application of AI in this context presents both opportunities and risks. For instance, it can reduce effort and time needed to root cause an investigation significantly, but it can potentially suggest wrong root causes and mislead engineers. To mitigate this, we ensure that all employee-facing features prioritize closed feedback loops and explainability of results. This strategy ensures that responders can independently reproduce the results generated by our systems to validate their results. We also rely on confidence measurement methodologies to detect low confidence answers and avoid recommending them to the users – sacrificing reach in favor of precision.

By integrating AI-based systems into our internal tools we’ve successfully leveraged them for tasks like onboarding engineers to investigations and root cause isolation. Looking ahead, we envision expanding the capabilities of these systems to autonomously execute full workflows and validate their results. Additionally, we anticipate that we can further streamline the development process by utilizing AI to detect potential incidents prior to code push, thereby proactively mitigating risks before they arise.

Acknowledgements

We wish to thank contributors to this effort across many teams throughout Meta, particularly Alexandra Antiochou, Beliz Gokkaya, Julian Smida, Keito Uchiyama, Shubham Somani; and our leadership: Alexey Subach, Ahmad Mamdouh Abou, Shahin Sefati, Shah Rahman, Sharon Zeng, and Zach Rait.

PVF: A novel metric for understanding AI systems’ vulnerability against SDCs in model parameters

Wed, 19 Jun 2024 18:00:00 +0200

We’re introducing parameter vulnerability factor (PVF), a novel metric for understanding and measuring AI systems’ vulnerability against silent data corruptions (SDCs) in model parameters.
PVF can be tailored to different AI models and tasks, adapted to different hardware faults, and even extended to the training phase of AI models.
We’re sharing results of our own case studies using PVF to measure the impact of SDCs in model parameters, as well as potential methods of identifying SDCs in model parameters.

Reliability is an important aspect of any successful AI implementation. But the growing complexity and diversity of AI hardware systems also brings an increased risk of hardware faults such as bit flips. Manufacturing defects, aging components, or environmental factors can lead to data corruptions – errors or alterations in data that can occur during storage, transmission, or processing and result in unintended changes in information.

Silent data corruptions (SDCs), where an undetected hardware fault results in erroneous application behavior, have become increasingly prevalent and difficult to detect. Within AI systems, an SDC can create what is referred to as parameter corruption, where AI model parameters are corrupted and their original values are altered.

When this occurs during AI inference/servicing it can potentially lead to incorrect or degraded model output for users, ultimately affecting the quality and reliability of AI services.

Figure 1 shows an example of this, where a single bit flip can drastically alter the output of a ResNet model.

Figure 1: Flipping a random bit of one parameter in the 1st convolution (conv) layer in ResNet-18 drastically alters the model’s output.

With this escalating thread in mind, there are two important questions: How vulnerable are AI models to parameter corruptions? And how do different parts (such as modules and layers) of the models exhibit different vulnerability levels to parameter corruptions?

Answering these questions is an important part of delivering reliable AI systems and services and offers valuable insights for guiding AI hardware system design, such as when assigning AI model parameters or software variables to hardware blocks with differing fault protection capabilities. Additionally, it can provide important information for formulating strategies to detect and mitigate SDCs in AI systems in an efficient and effective manner.

Parameter vulnerability factor (PVF) is a novel metric we’ve introduced with the aim to standardize the quantification of AI model vulnerability against parameter corruptions. PVF is a versatile metric that can be tailored to different AI models/tasks and is also adaptable to different hardware fault models. Furthermore, PVF can be extended to the training phase to evaluate the effects of parameter corruptions on the model’s convergence capability.

What is PVF?

PVF is inspired by the architectural vulnerability factor (AVF) metric used within the computer architecture community. We define a model parameter’s PVF as the probability that a corruption in that specific model parameter will lead to an incorrect output. Similar to AVF, this statistical concept can be derived from statistically extensive and meaningful fault injection (FI) experiments.

PVF has several features:

Parameter-level quantitative assessment

As a quantitative metric, PVF concentrates on parameter-level vulnerability, calculating the likelihood that a corruption in a specific model parameter will lead to an incorrect model output. This “parameter” can be defined at different scales and granularities, such as an individual parameter or a group of parameters.

Scalability across AI models/tasks

PVF is scalable and applicable across a wide range of AI models, tasks, and hardware fault models.

Provides insights for guiding AI system design

PVF can provide valuable insights for AI system designers, guiding them in making informed decisions about balancing fault protection with performance and efficiency. For example, engineers might leverage PVF to help map higher vulnerable parameters to better-protected hardware blocks and explore tradeoffs on latency, power, and reliability by enabling a surgical approach to fault tolerance at selective locations instead of a catch-all/none approach.

Can be used as a standard metric for AI vulnerability/resilience evaluation

PVF has the potential to unify and standardize such practices, making it easier to compare the reliability of different AI systems/parameters and fostering open collaboration and progress in the industry and research community.

How PVF works

Similar to AVF as a statistical concept, PVF needs to be derived through a large number of FI experiments that are statistically meaningful. Figure 2 shows an overall flow to compute PVF through a FI process. We’ve presented a case study on the open-source DLRM inference with more details and example case studies in our paper.

Figure 2: Computing PVF through FI.

Figure 3 illustrates the PVF of three DLRM parameter components, embedding table, bot-MLP, and top-MLP, under 1, 2, 4, 8, 16, 32, 64, and 128 bit flips during each inference. We observe different vulnerability levels across different parts of DLRM. For example, under a single bit flip, the embedding table has relatively low PVF; this is attributed to embedding tables being highly sparse, and parameter corruptions are only activated when the particular corrupted parameter is activated by the corresponding sparse feature. However, top-MLP can have 0.4% under even a single bit flip. This is significant – for every 1000 inferences, four inferences will be incorrect. This highlights the importance of protecting specific vulnerable parameters for a given model based on the PVF measurement.

Figure 3: The PVF of DLRM parameters under random bit flips.

We observe that with 128 bit flips during each inference, for MLP components, PVF has increased to 40% and 10% for top-MLP and bot-MLP components respectively, while observing multiple NaN values. Top-MLP component has higher PVF than bot-MLP. This is attributed to the top-MLP being closer to the final model, and hence has less of a chance to be mitigated by inherent error masking probability of neural layers.

The applicability of PVF

PVF is a versatile metric where the definition of an “incorrect output” (which will vary based on the model/task) can be adapted to suit user requirements. To adapt PVF to various hardware fault models the method to calculate PVF remains consistent as depicted in Figure 2. The only modification required is the manner in which the fault is injected, based on the assumed fault models.

Furthermore, PVF can be extended to the training phase to evaluate the effects of parameter corruptions on the model’s convergence capability. During training, the model’s parameters are iteratively updated to minimize a loss function. A corruption in a parameter could potentially disrupt this learning process, preventing the model from converging to an optimal solution. By applying the PVF concept during training, we could quantify the probability that a corruption in each parameter would result in such a convergence failure.

Dr. DNA and further exploration avenues for PVF

The logical progression after understanding AI vulnerability to SDCs is to identify and lessen their impact on AI systems. To initiate this, we’ve introduced Dr. DNA, a method designed to detect and mitigate SDCs that occur during deep learning model inference. Specifically, we formulate and extract a set of unique SDC signatures from the distribution of neuron activations (DNA), based on which we propose early-stage detection and mitigation of SDCs during DNN inference.

We perform an extensive evaluation across 10 representative DNN models used in three common tasks (vision, GenAI, and segmentation) including ResNet, Vision Transformer, EfficientNet, YOLO, etc., under four different error models. Results show that Dr. DNA achieves a 100% SDC detection rate for most cases, a 95% detection rate on average and a >90% detection rate across all cases, representing 20-70% improvement over baselines. Dr. DNA can also mitigate the impact of SDCs by effectively recovering DNN model performance with <1% memory overhead and <2.5% latency overhead.

Read the research papers

PVF (Parameter Vulnerability Factor): A Novel Metric for Understanding AI Vulnerability Against SDCs in Model Parameters

Dr. DNA: Combating Silent Data Corruptions in Deep Learning using Distribution of Neuron Activations

MLow: Meta’s low bitrate audio codec

Thu, 13 Jun 2024 16:45:00 +0200

At Meta, we support real-time communication (RTC) for billions of people through our apps, including WhatsApp, Instagram, and Messenger.
We are working to make RTC accessible by providing a high-quality experience for everyone – even those who might not have the fastest connections or the latest phones.
As more and more people have relied on our products to make calls over the years, we’ve been working on new ways to ensure all calls have a solid audio quality.
We’ve built the Meta Low Bitrate (MLow) codec: a new tool that improves audio quality especially for those on slow-speed connections.

Figure 1: Increasing complexity or bitrate usually improves quality, but good codecs achieve higher quality while balancing the other two.

RTC products use many building blocks to deliver the full experience, and one of the critical components is audio/video codecs. These codecs help compress the captured audio/video data so it can be sent across the internet efficiently to the recipient, keeping the experience real time. For example, the size of raw audio captured for a typical call is 768 kbps (mono, sampling at 48kHz, bit depth 16), which modern codecs are able to compress down to 25-30 kbps. Often this compression comes at the cost of some quality (loss of information), but good codecs can strike a balance among the trio of quality, bitrate, and complexity by exploiting deep knowledge about the nature of the audio signal as well as by using psychoacoustics.

Building a good codec is quite challenging, and that is why we don’t see new codecs emerging very often. The last widely known, good open-source codec was Opus, released in 2012, which has become the codec of choice for the wide variety of applications on the internet. Meta has used Opus for all its RTC needs, and so far it has served us well – helping to deliver quality calls to billions of users across the globe.

Our motivation for building a new codec

Given the massive scale of RTC usage in Meta products, we get to see how a codec performs in a range of network scenarios and how it impacts the end user’s experience. In particular, we’ve observed that a significant chunk of calls have poor network connections throughout or for part of a call. Typically a bandwidth estimation module (BWE) detects the quality of the network, and as the network quality degrades, we need to lower the codec operating bitrate to avoid congesting the network and keep the audio flowing – impacting the trio balance referenced above. Complicating matters, conducting a video call despite poor network quality leaves little room for audio and pushes the audio bitrate further down. The lowest operating point for Opus is 6 kbps, at which it runs in NarrowBand mode (0 – 4kHz) and does not adequately capture all the sound frequencies produced by human voices—and so doesn’t sound as clear or natural. Here is an example of how Opus sounds at 6kbps and the corresponding reference file for comparison.

Raw reference signal:

Opus @ 6 kbps NarrowBand (NB):

Over the last two years, we have seen development of some new machine learning (ML)-based audio codecs that provide good quality audio at very low bitrates. In October of 2022, Meta released Encodec, which achieves amazingly crisp audio quality at very low bitrates. While these AI/ML-based codecs are able to achieve great quality at low bitrates, it often comes at the expense of heavy computational cost. Consequently, only the very high-end (expensive) mobile handsets are able to run these codecs reliably, while users running on lower-end devices continue to experience audio quality issues in low-bitrate conditions. So the net impact of these newer computationally expensive codecs is actually limited to a small portion of users.

A significant number of our users still use low-end devices. For example, more than 20 percent of our calls are made on ARMv7 devices, and 10’s of millions of daily calls on WhatsApp are on 10-year-old-plus devices. Given the readily available codec choices and our commitment to ensure that all users – regardless of what device they’re on – have a quality calling experience, we clearly need a codec with very low-compute requirements that still delivers high-quality audio at these lowest bitrates.

The MLow codec

We broke ground with our development of a new codec in late 2021. After nearly two years of active development and testing, we are proud to announce Meta Low Bitrate audio codec, aka MLow, which achieves two-times-better quality than Opus (POLQA MOS 1.89 vs 3.9 @ 6kbps WB). Even more importantly, we are able to achieve this great quality while keeping MLow’s computational complexity 10 percent lower than that of Opus.

Figure 2 below shows a MOS (Mean Opinion Score) plot on a 1-5 scale and compares the POLQA scores between Opus and MLow at various bitrates. As the chart makes evident, MLow has a huge advantage over Opus at the lowest bitrates, where it saturates quality faster than Opus.

Figure 2: POLQA score comparing Opus (WB) versus MLow at various bitrates across a large dataset of files.

We have already fully launched MLow to all Instagram and Messenger calls and are actively rolling it out on WhatsApp—and we’ve already seen incredible improvement in user engagement driven by better audio quality.

Here are some audio samples for you to listen to. We suggest that you use your favorite pair of headphones to appreciate the striking audio-quality differences.

Opus 6 kbps NB	MLow 6 kbps WB	Reference

Being able to encode high-quality audio at lower bitrates also unlocks more effective Forward Error Correction (FEC) strategies. Compared with Opus, with MLow we can afford to pack FEC at much lower bitrates, which significantly helps to improve the audio quality in packet loss scenarios.

Here are two audio samples at 14 kbps with heavy 30 percent receiver-side packet loss.

Opus:

MLow:

Note that at these bitrates, Opus is not able to encode any inband FEC. It needs a minimum of 19 kbps to encode any inband FEC at 10 percent packet loss, which hurts the audio recovery.

MLow internals

MLow builds on the concepts of a classic CELP (Code Excited Linear Prediction) codec with advancements around excitation generation, parameter quantization, and coding schemes. Figure 3 is a high-level visual of how the codec works internally. On the left we have an input signal (raw PCM audio) feeding into the encoder, which then splits the signal into two low and high-frequency bands. Then, each band is encoded separately while making use of shared information to achieve better compression. All the output is passed through a range encoder to further compress and generate an encoded payload. The decoder does the exact opposite when given the payload to generate output audio signals.

Figure 3: High level MLow encoder and decoder architecture.

With these split-band optimizations, we are able to encode the high band using very few bits, which lets MLow deliver SuperWideBand (32kHz sampling) using a much lower bitrate.

What’s next?

MLow has greatly enhanced audio quality on low-end devices while still ensuring calls are end-to-end encrypted. We are really excited about what we have accomplished in just the last two years—from developing a new codec to successfully shipping it to billions of users around the globe. We’re continuing to work on improving the audio recovery in heavy packet loss networks by pumping out more redundant audio, which MLow allows us to do efficiently. We’re excited to share more as we continue working to make it easier for all our users to make quality audio calls.

How Meta trains large language models at scale

Thu, 13 Jun 2024 00:45:00 +0200

As we continue to focus our AI research and development on solving increasingly complex problems, one of the most significant and challenging shifts we’ve experienced is the sheer scale of computation required to train large language models (LLMs).

Traditionally, our AI model training has involved a training massive number of models that required a comparatively smaller number of GPUs. This was the case for our recommendation models (e.g., our feed and ranking models) that would ingest vast amounts of information to make accurate recommendations that power most of our products.

With the advent of generative AI (GenAI), we’ve seen a shift towards fewer jobs, but incredibly large ones. Supporting GenAI at scale has meant rethinking how our software, hardware, and network infrastructure come together.

The challenges of large-scale model training

As we increase the number of GPUs in a job, the likelihood of an interruption due to a hardware failure also increases. Also, all of these GPUs still need to communicate on the same high-speed fabric to perform optimally. This underscores the importance of four factors:

Hardware reliability: Ensuring that our hardware is reliable is important. We need to minimize the chances of a hardware failure interrupting a training job. This involves rigorous testing and quality control measures, and automation to quickly detect and remediate issues.
Fast recovery on failure: Despite our best efforts, hardware failures can and do occur. When they do, we need to be able to recover quickly. This involves reducing re-scheduling overhead and fast training re-initialization.
Efficient preservation of the training state: In the event of a failure, we need to be able to pick up where we left off. This means we need to regularly checkpoint our training state and efficiently store and retrieve training data.
Optimal connectivity between GPUs: Large-scale model training involves transferring vast amounts of data between GPUs in a synchronized fashion. A slow data exchange between a subset of GPUs can compound and slow down the whole job. Solving this problem requires a robust and high-speed network infrastructure as well as efficient data transfer protocols and algorithms.

Innovating across the infrastructure stack

Perfecting every layer of our infrastructure stack is important due to the demands of GenAI at scale. This has encompassed developments in a wide range of areas.

Training software

We enable researchers to use PyTorch and other new open source developments, facilitating extremely fast research-to-production development. This includes developing new algorithms and techniques for efficient large-scale training and integrating new software tools and frameworks into our infrastructure.

Scheduling

Efficient scheduling helps ensure that our resources are used optimally. This involves sophisticated algorithms that can allocate resources based on the needs of different jobs and dynamic scheduling to adapt to changing workloads.

Hardware

We need high-performance hardware to handle the computational demands of large-scale model training. Beyond size and scale, many hardware configurations and attributes need to be best optimized for GenAI. Given that hardware development times are traditionally long, we had to adapt existing hardware, and to this end we explored various dimensions including power, HBM capacity and speed, and I/O.

We also pivoted by modifying the Grand Teton platform that was developed using NVIDIA H100 GPUs, increased the TDP of the GPUs to 700W, and moved to HBM3 on the GPUs. Since we did not have time to change the cooling infrastructure, we had to remain in an air-cooled environment. The mechanical and thermal designs had to change to accommodate this, and that triggered a validation cycle to support a large-scale deployment.

All of these hardware-related changes were challenging because we had to find a solution that fit within the existing resource constraints, with a very small degree of freedom to change and meet a tight schedule.

Data center deployment

Once we’ve chosen a GPU and system, the task of placing them in a data center for optimal usage of resources (power, cooling, networking, etc.) requires revisiting trade-offs made for other types of workloads. Data center power and cooling infrastructure cannot be changed quickly (or easily) and we had to find an optimal layout that allowed maximum compute capability within a data hall. This required relocating supporting services such as readers out of the data hall and packing as many GPU racks as possible to maximize the power and network capability for highest compute density with the largest network cluster.

Reliability

We need to plan for detection and remediation to minimize downtime during hardware failures. The number of failures scales with the size of the cluster, and having a job that spans the cluster makes it necessary to keep adequate spare capacity to restart the job as soon as possible. In addition, we monitor failures and can sometimes take preventive measures to mitigate downtime.

Some of the most frequent failure modes we have observed are:

GPUs falling off: In this case, GPUs are not detected by the host on PCIe. There are several reasons for this failure, but this failure mode is seen more in the early life and settles as the server ages.
DRAM & SRAM UCE: Uncorrectable errors are common in memories, and we monitor and identify repeat offenders, track against thresholds, and initiate RMAs when error rates exceed vendor thresholds.
HW network cable: In the general category of unreachable servers, these failures are also seen most often in the early life of the server.

Network

Large-scale model training involves transferring vast amounts of data quickly between GPUs. This requires robust and high-speed network infrastructure as well as efficient data transfer protocols and algorithms.

There are two leading choices in the industry that fit these requirements: RoCE and InfiniBand fabrics. Both of these options had tradeoffs. On the one hand, Meta had built RoCE clusters for the past four years, but the largest of those clusters only supported 4K GPUs. We needed significantly larger RoCE clusters. On the other hand, Meta had built research clusters with InfiniBand as large as 16K GPUs. However, those clusters were not tightly integrated into Meta’s production environment, nor were they built for the latest generation of GPUs/networking. This made for a difficult decision of what fabric to build with.

So we decided to build both: two 24k clusters, one with RoCE and another with InfiniBand. Our intent was to build and learn from the operational experience. These learnings will inform the future direction of GenAI fabrics. We optimized the RoCE cluster for quick build time, and the InfiniBand cluster for full-bisection bandwidth. We used both InfiniBand and RoCE clusters to train Llama 3, with the RoCE cluster used for training the largest model. Despite the underlying network technology differences between these clusters, we were able to tune both of them to provide equivalent performance for these large GenAI workloads

We optimized three aspects of the overall stack to make network communication for GenAI models performant on both clusters:

We assigned communication patterns resulting from different model, data and pipeline parallelisms to different layers of the network topology so that the network capabilities were effectively exploited.
We implemented collective communication patterns with network topology awareness so that they can be less latency-sensitive. We do this by changing the default implementation of collectives with custom algorithms such as recursive doubling or halving instead of conventional algorithms like rings.
Just like ranking jobs, GenAI jobs produce additional fat flows that make it hard to distribute traffic across all possible network paths. This required us to further invest in network load balancing and routing to achieve an optimal distribution of traffic across network resources.

We spoke in depth about our RoCE load-balancing techniques at Networking @Scale 2023.

Storage

We need efficient data-storage solutions to store the vast amounts of data used in model training. This involves investing in high-capacity and high-speed storage technologies and developing new data-storage solutions for specific workloads.

Looking ahead

In the next few years we will be working with hundreds of thousands of GPUs, handling even larger volumes of data, and dealing with longer distances and latencies. We’ll be adopting new hardware technologies—including newer GPU architectures—and evolving our infrastructure.

These challenges will push us to innovate and adapt in ways we can’t fully predict yet. But one thing is certain: We are only at the beginning of this journey. As we continue to navigate the evolving landscape of AI, we remain committed to pushing the boundaries of what’s possible.

Maintaining large-scale AI capacity at Meta

Wed, 12 Jun 2024 20:05:00 +0200

Meta is currently operating many data centers with GPU training clusters across the world. Our data centers are the backbone of our operations, meticulously designed to support the scaling demands of compute and storage. A year ago, however, as the industry reached a critical inflection point due to the rise of artificial intelligence (AI), we recognized that to lead in the generative AI space we’d need to transform our fleet.

Our increased focus on AI was driven both by its rise in driving business outcomes and the huge growth in these types of workloads’ computational needs. In addition to wider use of traditional AI for things like ad targeting, we have also seen increasing numbers of large generative AI models that mimic almost-human intelligence in everything from human verbal interaction to the creation of pictures and other media. And these types of models are huge, with trillions of training parameters, and to train them we need vast resources.

In this process, we’ve built one of the world’s largest AI training infrastructures, and it has been growing exponentially over the last years. Meta’s training infrastructure comprises dozens of AI clusters of varying sizes, with a plan to scale to 600,000 GPUs in the next year. It runs thousands of training jobs every day from hundreds of different Meta teams. Training jobs characteristics vary greatly too. They can be as small as a single GPU running for a couple minutes, while generative AI jobs can have trillions of parameters and often span thousands of hosts that need to work together and are very sensitive to interruptions. In addition to that, training jobs are tied much closer to the hardware, and that hardware varies greatly. Meta runs different types of backend networks, topologies, and training jobs that have tight dependencies between software and hardware components.

This transition has not been without its challenges. We had to reconfigure the fleet without disrupting our hypergrowth, a task akin to rebuilding an airplane mid-flight. This pushed us to innovate and collaborate with vendors and utility companies to create a supportive ecosystem. In this blog we will discuss only one of these transformations. We will describe how Meta is maintaining these training clusters and what sets us apart from the average AI environment. And what do we mean by maintaining? Basically, any kind of operation that updates or verifies software and firmware components in the clusters, including the networking path.

The main characteristics of GPU training

GPU training has some demanding characteristics:

Capacity guarantees: While some training jobs can be paused, a lot of Meta jobs are time-critical and recurring or online. This means we cannot take large amounts of capacity on a default basis.
Bad hosts are very bad: Since many jobs require all hosts to be synchronized, bad hosts that are a bit slower, have some non-fatal hardware, or have networking issues are extremely damaging.
Low interruption rate: Since many hosts work with each other on a shared problem, AI training jobs are sensitive to interruptions.
Rollout safety: The AI software stack is deep, and problems are often hard to pinpoint, so we need to be careful when rolling out new components.
Host consistency: AI training jobs are in general cross-host, and while outside of the CUDA version there are rarely hard incompatibilities, we have learned that cluster consistency is highly important for debugging and SEV avoidance.

What’s special about Meta’s GPU training?

Meta uses bespoke training hardware with the newest chips possible and high-performance backend networks that are highly speed optimized. We also try to stay as current and flexible as possible with the software stack; in the event of firmware upgrades, this allows us to utilize new features or reduce failure rates.

Together this means we have more than:

30 maintenance operations
50 different components that are updated
Three different host-verification tasks to ensure optimal performance and stability
Thousands of disruptive AI host tasks every day

And we need to do them safely, while guaranteeing capacity. After all, our training clusters are also used flexibly to run a wide variety of workloads, from single-host to some of the biggest training jobs in the world, and from offline tasks to jobs that need to be up and running 24/7.

An overview of different maintenance rollouts happening on Meta capacity over time with overlapping durations.

Given the variety of upgrades, we have a large amount of overlapping inflight changes at any given time, including some that are consistently being applied, such as verification tasks. Accepting this gives Meta the flexibility we need in using cutting-edge hardware, scaling our infrastructure, and using both in flexible ways. In smaller environments it is often possible to keep clusters in a consistent state and upgrade the whole cluster and all of its firmware and software components in the same maintenance window. Doing this in a large, diverse environment like Meta, however, would introduce big risks and be operationally infeasible. Instead, we ensure components are compatible with each other and roll component upgrades up in a sliding fashion. This approach also allows us to guarantee capacity availability.

Maintenance trains

Meta maintains capacity by using maintenance trains, which involves shutting down small amounts of capacity in a cyclic fashion.

Outside of special cases, Meta maintains its fleet of clusters using a technique called maintenance trains. This is used for all capacity, including compute and storage capacity. A small number of servers are taken out of production and maintained with all applicable upgrades. Trains provide the guarantee that all capacity minus one maintenance domain is up and running 24/7, thus providing capacity predictability. This is mandatory for all capacity that is used for online and recurring training.

Maintenance trains pick up any new upgrade and guarantee a full-visit cycle in a guaranteed timeframe. Longer-running upgrades can have lower rollout guarantees and may be scheduled to be applied in multiple cycles. So you can have many overlapping upgrades, and, if beneficial, upgrades can be aligned.

For AI capacity, we have optimized domains that allow for different kinds of AI capacity, very strict SLOs, and a contract with services that allows them to avoid maintenance-train interruptions, if possible.

Gradual rollouts

An illustration of the distinction between higher-level components of the AI stack—such as the CUDA drivers, and the lower-level components involved in the training job,.

Because of the scale of our infrastructure, we had to ensure that all disruptive rollouts outside of special cases happen in a gradual fashion. This means different servers in a cluster can run a different host stack for a short period of time. This is quite normal in traditional capacity but challenging in AI training, since AI jobs are very closely tied to the hardware.

At Meta, we’ve ensured that jobs have a consistent stack but upgrade lower-level components in a gradual fashion. In contrast to this, the AI job itself, which includes the CUDA library, is always consistent. This distinction is necessary because lower-level components often require hours to install and configure or require rebooting the host, while higher-level components in the job container itself can be restarted fluidly.

This sounds simple, but because of the tight integration of AI with hardware, we have needed to do a lot of development, including careful testing on all lower levels, special monitoring, and tight work with vendors.

By and large, this has been very successful. The AI stack in general has matured a lot over the past three years. We also added tooling for rare compatibility-breaking upgrades.

Selecting the correct maintenance domains

Maintenance domains are selected based on the amount of buffer-reserved capacity (the smaller the better) and the amount of interruptions we cause to training jobs (the bigger the better).

One way to ensure optimal AI performance was to work with AI teams to design the optimal size of maintenance domains. A maintenance domain is the percentage of capacity we take down in one go, and selecting the optimal size is a function of both the cost of interruptions and the capacity that is lost during the maintenance duration. Since interruption costs are high for AI jobs, optimizing this relationship allowed us to significantly reduce the maintenance overhead for AI capacity.

OpsPlanner: Meta disruptive-work orchestrator

Critical to AI capacity are the consistency requirements. For example, if you want to move to a new CUDA version, you may need all of the capacity on a new driver version. This becomes really difficult in an environment with thousands of hosts and lots of planned and unplanned operations that may overlap with each other. To do this safely and guarantee hosts have the correct upgrades applied before entering production, Meta has unified them in the OpsPlanner work orchestrator. Not only can it work on overlapping scopes of operations and correctly serialize them, it also takes them safely out and into production. In addition, it has a built-in handover flow that ensures correct escalation behavior and avoids overlaps and deadlocks. OpsPlanner can also ensure upgrades are applied to hosts before they are returned to production. And OpsPlanner owns planned maintenance and failure buffers and safeguards them. Furthermore, it’s highly effective and efficient: OpsPlanner currently handles a million operations per day.

Example scenarios illustrating the disruptive work scheduler the OpsPlanner needs to handle to ensure host consistency.

Safety and failure scenarios

Meta has a deep stack of safety features that includes:

Autostop of maintenance trains if maintenance or failure buffers are exhausted;
Automatic offboarding of failing upgrades; and
Rollout phases for upgrades, so that only well-tested changes reach global systems.

If something does go wrong, however, we can react quickly, depending on the needed fix, with emergency trains, large-scale maintenance for breaking upgrades, and more.

Rapidly moving to the future of generative AI

At Meta, we believe in moving fast and learning by doing. Rapid innovation is in our ethos. This is what fundamentally shaped our journey as we continually innovated towards building the foundational infrastructure that makes us leaders in generative AI. We will remain dedicated to creating technologies that not only benefit Meta but also have a positive impact on society as a whole.

As we move forward, we invite you to join us on this journey. Together, we can shape a future where AI is not just a tool but a force for good, transforming industries, empowering individuals, and creating a more sustainable world.

The best is yet to come, and we are excited to pioneer tomorrow’s possibilities in generative AI.

Unlocking the power of mixed reality devices with MobileConfig

Tue, 11 Jun 2024 18:00:00 +0200

MobileConfig enables developers to centrally manage a mobile app’s configuration parameters in our data centers. Once a parameter value is changed on our central server, billions of app devices automatically fetch and apply the new value without app updates. These remotely managed configuration parameters serve various purposes such as A/B testing, feature rollout, and app personalization. MobileConfig has been in production since 2015 and serves some of the world’s most widely used apps, including Facebook, Instagram, and Messenger.

In this blog, we describe how MobileConfig enables rapid innovation on the Meta Quest and Ray-Ban Meta smart glasses.

Configuration challenges in mixed reality

Reality Labs devices can have extended development and release cycles; thus, reliable configuration and experimentation are vital for consistency, developer velocity, safety, and overall reliability.

As the mixed reality (MR) ecosystem has grown, we have introduced many more apps. These apps often need to share configuration values. With a lack of common patterns for config usage, there are anti-patterns where engineers reinvent the wheel or build ad hoc ways of fetching configuration that are very specific for each app. We also saw usage patterns when apps and services needed to retrieve values early during boot-up that made these problems more complex. These challenges made configuration and experimentation in MR much more challenging.

Enter MobileConfig

In MobileConfig, a configuration (config for short) is defined as a set of parameters with a specific data type and can control different aspects of an application’s behavior. Developers decided on the data type (Boolean, int, double, string) for the parameter to use when creating it. MobileConfig offers a cross-platform client library and API that allows developers to easily read each config parameter in our many different applications and services. It also provides a complete set of backend tools, allowing developers to precisely control which values a given parameter receives based on client context like region or device type.

MobileConfig has several use cases, including feature flags, A/B testing, and release management. A developer can use MobileConfig to control the release of a new feature separately from the application release by placing it behind a config parameter. The config parameter can be tied to an A/B test and enabled for a small segment of users while monitoring critical application metrics such as performance or engagement.

Alternatively, the parameter can be used as a feature flag, allowing developers to control who has access to the feature while it’s under development. Once the feature is behind MobileConfig, the developer can make all these changes without touching client-side code.

Our blog on “Mobile Configuration at Meta” discusses more on how MobileConfig is implemented and the design decisions behind the technology.

Expanding MobileConfig as a platform in MR

Given our experience with mobile app development, it was clear that Meta’s family of devices would also need a configuration system to move fast, enable experimentation, and remotely control various aspects of the systems. We decided that MobileConfig would be the best solution to fit the needs of these platforms because of its proven reliability and performance, existing suite of tools for troubleshooting and debugging, and its ability to release changes safely.

We centralized all configuration requests through a single Android service (MobileConfigService), resulting in lightweight clients that do not need to fetch configs, report telemetry, or understand backend protocols. We also centralized service authentication, enabling apps on the device to leverage session-based configs with or without user info. Here, with device-level consistency, the service was able to allow for experimentation in multiple apps and services at the same time, streamlining the onboarding process for new apps and allowing them to communicate with the leading service via IPC without needing to provide additional authentication or build-tooling (for configuration).

Additionally, we have built libraries on top of MobileConfig in platforms like Windows to allow fetching configs where the MobileConfig API cannot be built as-is (aka using buck). Overall, we avoided reinventing the wheel at a large scale and had a consistent user experience on MR, as our family of apps provided a vehicle for a net increase in developer velocity with a much lower learning curve.

Expanding MobileConfig to Meta’s family of devices

The novel reuse of the MobileConfig service has enabled us to optimize our development process and improve the functionality of other low-powered devices. We never intended to use MobileConfig libraries on microcontrollers. However, we repurposed them into thin client(s) and removed dependencies to lower the overall memory footprint. Additionally, we optimized the developer authoring flow, allowing developers to target devices without standard buck integration. Work here allows our low-powered devices to communicate and consume configs over Bluetooth and Wi-Fi.

Many of these devices, such as the Ray-Ban Meta smart glasses, have multiple microcontroller architectures controlling the various aspects of the product, including power, cameras, etc. In these cases, we use several technologies and protocols, such as IPC and SPI, to sync configuration values to the various components. In cases where the specific hardware could not run the core MobileConfig library due to language limitations, we could still reuse our highly efficient cache design and data structures to deliver configuration values. We abstracted these details from the developer experience, allowing for a unified configuration workflow.

Now, devices, specifically low-power devices, are not always connected to the internet, and they still need to run experiments and launch features independently of device releases. Our design choices helped us create a seamless companion app experience that uses the same mobile configuration libraries as our other family of apps, enabling these low-power devices for the same suite of features. This newfound flexibility speeds up our development process and time to market significantly, allowing us to deliver cutting-edge technology to our customers more quickly and efficiently.

Enabling future devices with limited connectivity and scarce resources

After our success with MR devices, we consolidated efforts to leverage MobileConfig as a centralized platform. Our goal is to enable the next generation of devices with the same suite of features and capabilities. These new-generation devices possess different challenges and MobileConfig needs to be highly performant and optimized for consuming compute resources on the device.

We optimized our service to run under specific conditions only (e.g., while charging or with Wi-Fi). Even with the configuration syncs to the server, our libs ensure we don’t drain the battery since it’s in minimal supply (much different than our family of apps). We also created customized IPC APIs to reduce memory and CPU usage. Work here enables dynamic configuration sets on newly installed apps to get config values from a centralized MobileConfig service.

While being in such a constrained environment, we successfully enabled experimentation at the OS level with customized AOSP Java and Native APIs while keeping the same user experience as our other family of apps using Meta’s tools. We’ve also developed a customized tool that allows developers to quickly and efficiently use configuration with limited screen real estate, further unlocking developer efficiency. Advancements in future-gen devices have enabled us to provide a best-in-class service that sets us apart from our competitors and offers significant business value.

A leap into the future

Meta’s infrastructure is empowering a new generation of devices, helping developers move swiftly. As we build more devices and applications, we will leverage MobileConfig as a centralized platform, given its capabilities and success with our existing lineup.

Efforts here allow our product groups to scale at rates that would otherwise take months and massive engineering efforts to coordinate. We drive with joint goals across organizations that put our user’s needs at the forefront.

Serverless Jupyter Notebooks at Meta

Mon, 10 Jun 2024 18:30:00 +0200

At Meta, Bento, our internal Jupyter notebooks platform, is a popular tool that allows our engineers to mix code, text, and multimedia in a single document. Use cases run the entire spectrum from what we call “lite” workloads that involve simple prototyping to heavier and more complex machine learning workflows. However, even though the lite workflows require limited compute, users still have to go through the same process of reserving and provisioning remote compute – a process that takes time – before the notebook is ready for any code execution.

To address this problem, we have invested in building infrastructure that allows for code execution directly in the browser, removing the need to provision remote compute for some lite workloads. This infrastructure leverages a library called Pyodide that sits on top of WebAssembly (Wasm)

Here’s how we married Bento with this in-browser, serverless code execution technology to power our notebooks platform for these lite workloads.

The motivation for supporting lite workloads

We define lite workloads as workloads that only consume data from upstream systems, do not have side effects to our underlying systems, and use up to the maximum Chrome tab memory limit. We frequently get internal feedback from the owners of these lite workloads that the time and complexity in getting started is not proportionate to what they want to use Bento for.

The requirements can be summarized as follows:

An intuitive startup process that works right out of the box
A startup process that is very quick and has the notebook immediately ready for execution
A startup process that does not include the complex remote compute reservation process
An execution environment that supports the majority of the lite workloads

How we put the pieces together

How this all works

Pyodide (a Python distribution for the browser that runs on WebAssembly) is an important ingredient for this work. We’ve built a kernel abstraction around this which, when called from Bento, will just work as any of the classic kernels we have (with some limitations) and perform message passing using the Jupyter Protocol.

Kernel bridge

This is just an abstraction that allows Bento to work with both traditional server-based kernels and this new browser-based kernel with no changes whatsoever to the rest of the system. The visible manifestation of this is just a selector in the notebook that toggles between server-based kernels and serverless.

Magics

Cell magics are an important component of the Bento extension platform. In order to allow existing custom cells to work with no changes, we built middleware to capture these cell magics, process them directly in the context of javascript, and then just inject the expected results back into the Python kernel. A good example of this pattern is around %%sql, which we use to power our custom SQL cell.

We’ll showcase a few more examples in the section below on “Meta-specific” integrations.

Why we need a webworker

Since JavaScript is single-threaded, in the absence of a webworker, the entire browser would just lock up when we have “expensive” kernel operations. Having kernel operations run in a webworker with just the results being passed to the main thread helps mitigate this.

[embedded content]

Meta-specific integrations

In order to unlock additional utility and have a coherent story around the extract, transform, and load (ETL) narrative, we built integrations with an initial set of existing extensions. These represent a relatively popular set of extensions that users leverage to perform data operations.

SQL Cell

This leverages the %%sql magic to fetch data from the warehouse and make it available for further processing in the Pyodide kernel.

Google Sheets

Here, we leverage the %%googlesheet magic to fetch data from a Google sheet and make it available for further processing in the notebook.

GraphQL

Here, we leverage %%graphql magic, which powers the GraphQL cell to make data fetches and then inject the result back into the kernel for further processing.

Dataframe uploads

Data uploads are a bit trickier to pull off as compared to the data reads we showcased above. We instead achieve this functionality by:

Leveraging the %%dataframe magic that powers the upload custom cell in order to fetch the arguments in a structured way.
We then kick off an async job using Tupperware (Meta’s async tier compute platform) and show the status of the associated tupperware job in the cell output.

What’s next for serverless notebooks

While we’ve addressed the initial set of challenges to bring this product online, there is still a lot of work to be done to improve the developer experience for users. Firstly, we’re planning on improving the lite workloads heuristic. Once we have this figured out, the next step will involve defaulting all new workloads to start as serverless. Then we can quickly autodetect (based on memory requirements, data volumes, or libraries in use) whether the workload is lite enough. If not, we can automatically switch that notebook to leverage a server-based kernel with minimal interruption to the user flow.

After this, we plan to integrate with more existing cell extensions built on top of the Bento platform and thus expand the scope of what’s possible when running “serverless.”

The biggest limitation with this approach at Meta is that homegrown libraries that have not been ported to WebAssembly will be unavailable. Given this, we’re also planning to explore whether we can farm out the execution of specific “non-lite” cells to our remote execution infrastructure while making this work seamlessly with Pyodide.

Once these have been addressed, “serverless” notebooks will become the de facto landing experience in Bento.

Acknowledgments

Some of the approaches we took were directly inspired by the work done on JupyterLite and directly leverages the Pyodide library without which this project would not have been possible. I’d also like to thank all the engineers at Meta I collaborated with to make this project a reality.

Composable data management at Meta

Wed, 22 May 2024 19:15:00 +0200

In recent years, Meta’s data management systems have evolved into a composable architecture that creates interoperability, promotes reusability, and improves engineering efficiency.
We’re sharing how we’ve achieved this, in part, by leveraging Velox, Meta’s open source execution engine, as well as work ahead as we continue to rethink our data management systems.

Data is at the core of every product and service at Meta. To efficiently process data generated by billions of people, Data Infrastructure teams at Meta have built a variety of data management systems over the last decade, each targeted to a somewhat specific data processing task. Today, our data engines support workloads such as offline processing of large datasets (ETL), interactive dashboard generation, ad hoc data exploration, and stream processing as well as more recent feature engineering and data preprocessing systems that support our rapidly expanding AI/ML infrastructure.

Over time, these divergent systems created a fragmented data environment with little reuse across systems, which eventually slowed our engineering innovation. In many cases, it forced our engineers to reinvent the wheel, duplicating work and reducing our ability to quickly adapt systems as requirements evolved.

More importantly, the byproducts of this fragmentation – incompatible SQL and non-SQL APIs, and inconsistent functionality and semantics – impacted the productivity of internal data users who are commonly required to interact with multiple, distinct data systems, each with their own quirks, to finish a particular task. Fragmentation also eventually translated to high costs of ownership for running these data systems. To economically support our fast-paced environment where products are constantly evolving and generating additional requirements to our data systems, we needed more agility. We needed to change our thinking to be able to move faster.

A few years ago we embarked on a journey to address these shortcomings by rethinking how our data management systems were designed. The rationale was simple: Instead of individually developing systems as monoliths, we would identify common components, factor them out as reusable libraries, and leverage common APIs and standards to increase the interoperability between them. We would create teams that cooperate horizontally by developing shared components, concentrating our specialists in fewer but more focused teams, thus amplifyinging the impact of the teams’ work.

This ambitious program had a three-fold goal: (a) to increase the engineering efficiency of our organization by minimizing the duplication of work; (b) to improve the experience of internal data users through more consistent semantics across these engines, and ultimately, (c) to accelerate the pace of innovation in data management.

Building on similarities

With time, the evolution of these ideas gave birth to the trend now called the “composable data management system.” We have recently published this vision in a research paper in collaboration with other organizations and key leaders in the community facing similar challenges. In the paper, we make the observation that in many cases the reusability challenges are not only technical but commonly also cultural and even economic. Moreover, we discuss that while at first these specialized data systems may seem distinct, at the core they are all composed of a similar set of logical components:

A language frontend, responsible for parsing user input (such as a SQL string or a dataframe program) into an internal format;
An intermediate representation (IR), or a structured representation of computation, usually in the form of a logical and/or a physical query plan;
A query optimizer, responsible for transforming the IR into a more efficient IR ready for execution;
An execution engine library, able to locally execute query fragments (also sometimes referred to as the “eval engine”); and
An execution runtime, responsible for providing the (often distributed) environment in which query fragments can be executed.

We have also highlighted that, beyond having the same logical components, the data structures and algorithms used to implement these layers are largely consistent across systems. For example, there is nothing fundamentally different between the SQL frontend of an operational database system and that of a data warehouse, or between the expression evaluation engines of a traditional columnar DBMS and that of a stream processing engine, or between the string, date, array, or JSON manipulation functions across database systems.

Often, however, data systems do require specialized behavior. For example, stream processing systems have streaming-specific operators, and machine learning (ML) data preprocessing systems may have tensor-specific manipulation logic. The rationale is that reusable components should provide the common functionality (the intersection), while providing extensibility APIs where domain-specific features can be added. In other words, we need a mindset change as we build data systems as well as organize the engineering teams that support them: We should focus on the similarities, which are the norm, rather than on the differences, which are the exceptions.

Decomposition begins

If one were to start building data systems from scratch, there is little disagreement that reusable components are more cost effective and maintainable in the long run. However, most of our existing data systems are stable and battle-tested, and are the result of decades of engineering investment. From a cost perspective, refactoring and unifying their components could be impractical.

Yet, scale drives innovation, and to support the growing needs from our products and services, we are constantly improving the efficiency and scalability of our existing data engines. Since the execution engine is the layer where most computational resources are spent, often we have found ourselves re-implementing execution optimizations already available in a different system, or porting features across engines.

With that in mind, a few years ago we decided to take a bolder step: Instead of individually tweaking these systems, we started writing a brand new execution-layer component containing all the optimizations we needed. The strategy was to write it as a composable, reusable, and extensible library, which could be integrated into multiple data systems, therefore increasing the engineering efficiency of our organization in the long run.

Composable Execution: Velox

This is how Velox started. We created Velox in late 2020 and made it open source in 2022.

By providing a reusable, state-of-the-art execution engine that is engine- and dialect-agnostic (i.e, it can be integrated with any data system and extended to follow any SQL-dialect semantic), Velox quickly received attention from the open-source community. Beyond our initial collaborators from IBM/Ahana, Intel, and Voltron Data, today more than 200 individual collaborators from more than 20 companies around the world participate in Velox’s continued development.

Velox is currently in different stages of integration with more than 10 data systems at Meta. For example, in our Velox integration with Presto (a project cleverly named “Prestissimo”), we have seen 3-10x efficiency improvements in deployments running production workloads. In the Apache Gluten open source project created by Intel, where Velox can be used as the execution engine within Spark, a similar 3x efficiency gain has been observed on benchmarks. We have also seen engineering-efficiency improvements as new systems such as internal time-series databases and low-latency interactive engines were developed in record time by reusing the work done by a small group of focused database execution specialists.

With Velox, we intend to commoditize execution in data management by providing an open, state-of-the-art implementation. Beyond the novel composability aspect, in general lines, Velox extensively leverages the following data processing techniques to provide superior performance and efficiency:

Columnar and vectorized execution: Velox decomposes large computations into concise and tight loops, as these provide more predictable memory access patterns and can be more efficiently executed by modern CPUs.
Compressed execution: In Velox, columnar encodings have dual applicability: data compression and processing efficiency. For example, dictionary encoding can be used not only to more compactly represent the data, but also to represent the output of cardinality-reducing or increasing operations such as filters, joins, and unnests.
Lazy materialization: As many operations can be executed just by wrapping encodings around the data, the actual materialization (decoding) can be delayed and at times completely avoided.
Adaptivity: In many situations, Velox is able to learn when applying computations over successive batches of data, in order to more efficiently process incoming batches. For example, Velox keeps track of the hit rates of filters and conjuncts to optimize their order; it also keeps track of join-key cardinality to more efficiently organize the join execution; it learns column access patterns to improve prefetching logic, among other similar optimizations.

By being composable, Velox enabled us to write and maintain this complex logic once and then benefit from it multiple times. It also allowed us to build a more focused team of data execution specialists who were able to create a far more efficient execution component than what was possible with bespoke systems, due to investment fragmentation. By being open source, Velox allowed us to collaborate with the community while building these features, and to more closely partner with hardware vendors to ensure better integration with evolving hardware platforms.

System-wide integration: Open standards and Apache Arrow

To continue decomposing our monolithic systems into a more modular stack of reusable components, we had to ensure that these components could seamlessly interoperate through common APIs and standards. Engines had to understand common storage (file) formats, network serialization protocols, and table APIs, and have a unified way of expressing computation. Often, these components had to directly share in-memory datasets with each other, such as when transferring data across language boundaries (from C++ to Java or Python) for efficient UDF support. As much as possible, our focus was to use open standards in these APIs.

Yet, while creating Velox, we made the conscious design decision to extend and deviate from the open-source Apache Arrow format (a widely adopted in-memory columnar layout) and created a new columnar layout called Velox Vectors. Our goal was to accelerate data-processing operations that commonly occur in our workloads in ways that had not been possible using Arrow. The new Velox Vectors layout provided the efficiency and agility we needed to move fast, but in return it created a fragmented space with limited component interoperability.

To reduce fragmentation and create a more unified data landscape for our systems and the community, we partnered with Voltron Data and the Arrow community to align and converge the two formats. After a year of work, three new extensions inspired by Velox Vectors were added to new Apache Arrow releases: (a) StringView, (b) ListView, and (c) Run-End-Encoding (REE). Today, new Arrow releases not only enable efficient (i.e., zero-copy) in-memory communication across components using Velox and Arrow, but also increase Arrow’s applicability in modern execution engines, unlocking a variety of use cases across the industry.

This work is described in detail in our blog, Aligning Velox and Apache Arrow: Towards composable data management.

Future directions

To continue our journey towards making systems more sustainable in the long-term through composability, as well as adaptable to current and future trends, we have started investing in two new avenues. First, we have witnessed how the inflexibility of current file formats can limit the performance of large training tables for AI/ML. In addition to their massive size, these tables are often (a) much wider (i.e, containing thousands of column/feature streams), (b) can benefit from novel, more flexible and recursive encoding schemes, and (c) need parallel and more efficient decoding methods to feed data-hungry trainers. To address these needs, we have recently created and open sourced Nimble (formerly known as Alpha). Nimble is a new file format for large datasets aimed at AI/ML workloads, but that also provides compelling features for traditional analytic tables. Nimble is meant to be shared as a portable and easy-to-use library, and we believe it has the potential to supersede current mainstream analytic file formats within Meta and beyond.

Second, AI/ML compute requirements are rapidly driving innovation in data center design, steadily driving heterogeneity. To better leverage new hardware platforms, we believe AI/ML and data management systems should continue to converge through hardware-accelerated data systems, and that while fragmentation has historically hindered the adoption of hardware accelerators in data management, composable data systems will provide just about the right architecture. With Velox, we have seen that the first 3-4x efficiency improvements in data management can come purely from software techniques; moving forward, we believe that the next 10x efficiency wins will come from hardware acceleration. Although for now in this ongoing explorational effort there exist more challenges and open questions than answers, two things are well understood: Composability is paving the way for widespread hardware acceleration and other innovations in data management, and working in collaboration with the open-source community will increase our chances of success in this journey.

We believe that the future of data management is composable and hope more individuals and organizations will join us in this effort.

Post-quantum readiness for TLS at Meta

Wed, 22 May 2024 18:35:00 +0200

Today, the internet (like most digital infrastructure in general) relies heavily on the security offered by public-key cryptosystems such as RSA, Diffie-Hellman (DH), and elliptic curve cryptography (ECC). But the advent of quantum computers has raised real questions about the long-term privacy of data exchanged over the internet. In the future, significant advances in quantum computing will make it possible for adversaries to decrypt stored data that was encrypted using today’s cryptosystems.

Existing algorithms have reliably secured data for a long time. However, Shor’s algorithm can efficiently break these cryptosystems using a sufficiently large quantum computer. Although large quantum computers are not a reality yet, there’s an immediate quantum-related threat that needs to be addressed: the “store now, decrypt later” (SNDL) attack, in which attackers intercept and store encrypted data today with the intention of decrypting it at a later date when a sufficiently powerful quantum computer becomes available. This makes transitioning to quantum-resistant cryptography an endeavor of key priority.

To address this issue, the cryptography community has been working on a new class of cryptosystems known as post-quantum cryptography (PQC), which are expected to withstand quantum attacks but can be less efficient (in particular, communication bandwidth wise) than its classical counterparts. The US National Institute of Standards and Technology (NIST) is close to publishing their new PQC Standards (expected to be released this summer). Meta cryptographers are actively contributing to this and other PQC standardization processes (co-authoring the BIKE and Classic McEliece submissions to NIST, and co-editing the ISO/IEC 14888-4 standard).

How Meta is approaching the migration to PQC

Meta’s applications are used by billions of people every day. Given our focus on maintaining user privacy and security, Meta continuously raises its security bar to deploy the most advanced security and cryptographic protection techniques. As part of this continuous effort, we’ve created a workgroup to migrate to PQC, spanning from our internal infrastructure to user-facing apps. This is a highly complex multi-year effort and identifying where to first place PQC protections wasn’t trivial.

After careful analysis, protecting components that are susceptible to the SNDL attack, and where we control both endpoints, has been identified as our first priority (given their migration urgency and lack of external dependencies). In particular, protecting our internal communication traffic was the most sensitive use case that checked both boxes and thus became our first migration target.

But a direct migration to PQC wouldn’t be the most sensible approach. Migrating systems to different cryptosystems always carries some risks such as interoperability issues and security vulnerabilities. For the PQC migration specifically, the risks are even greater because some of these cryptosystems are comparatively new and/or have not experienced a long period of field testing. To reduce such risks, Meta has started transitioning to using hybrid key exchange for TLS, which combines existing classical cryptographic algorithms with a PQC algorithm. In this way, we ensure that our systems remain protected against existing attacks while also providing protection against future threats.

For our deployment, we have chosen Kyber with X25519 in a hybrid setting. Kyber is the only key encapsulation mechanism selected by NIST for standardization so far. Kyber comes in different parameterizations: Kyber512, Kyber768, and Kyber1024. Larger parameterizations provide stronger security but also require more computational resources and communication bandwidth. We aim to use Kyber768 by default, while using Kyber512 in some cases where larger parameterizations lead to prohibitive performance impact, to accelerate the deployment of PQC hybrid key exchange.

How Meta is enabling PQC

Meta’s TLS protocol library, Fizz, is designed for high security, reliability, and performance. The early work on Fizz previously helped standardize TLS 1.3 (RFC 8446). Fizz now supports a range of features including various handshake modes, PSK resumption, Diffie-Hellman key exchange authenticated with a pre-shared key for forward secrecy, async I/O, zero copy encryption, client authentication, and HelloRetryRequest. The use of our own implementation has allowed us to quickly react to new features in the TLS protocol.

Fizz is mostly built on top of three libraries: Folly, OpenSSL, and Sodium. To support PQC, we make use of liboqs, which is an open source library led by world-renowned PQC experts that has received attention from both academia and industry experts. The liboqs library implements post-quantum cryptography algorithms for key encapsulation and signature mechanisms, including Kyber. Additionally, we extended Fizz with hybrid key exchange functionality, which can make use of the new post-quantum key exchange mechanisms provided by liboqs alongside existing classical mechanisms.

Challenges

Large packet size

One of the main challenges is the size of the Kyber768 public key share, which is 1184 bytes. This is close to the typical TCP/IPv6 maximum segment size (MSS) of 1440 bytes, but is still fine for a full TLS handshake.

However, the key size becomes an issue during TLS resumption. Internally, we do Ephemeral Diffie-Hellman key exchange to achieve forward secrecy, so key exchange still happens on resumption. There will also be a pre-shared key (PSK) for authentication. These PSKs are 200-300 bytes long, and the remaining ClientHello fields can run up to 200 bytes, causing the resumption ClientHello to exceed the MSS for one packet.

Figure 1: ClientHello size, when including ECDHE keyshares and PSK, will exceed MSS.

This poses some challenges given significant usage of TCP Fast Open (TFO) for internal traffic. With TFO, the entire ClientHello could previously ride along with the TCP SYN packet, allowing the server’s TLS implementation to start processing and have its ServerHello ready to send right after its TCP SYN-ACK packet. However, when the ClientHello is too large to fit in the first packet, TFO still happens but the ClientHello is only partially sent. The client then has to wait for the TCP handshake to complete before sending the rest of the ClientHello, and needs to wait again for the ServerHello. This adds an extra round trip time (RTT) to the whole handshake process before any application data can be sent.

Figure 2: Left: TLS handshake with TFO done in same round trip as TCP handshake. Right: ClientHello exceeds MSS of one packet, one round trip added to finish TLS handshake.

After evaluating various alternatives and workarounds, and given the prohibitive key size of Kyber768, we opted to use Kyber512 in internal communications affected by this problem for now, allowing us to accelerate the PQC deployment. Kyber512’s 800-bytes-long public keys help with fitting the ClientHello into a single TCP packet, while still being considered secure by NIST. This choice ensures both security and efficient communication. In the future, an increase in MTU, or utilizing QUIC, which allows for multiple initial packets, may allow for larger ClientHellos without an additional round trip.

Multithreading problem with liboqs

After we rolled out post-quantum hybrid key exchange to our fleet, one of our internal teams started experiencing intermittent but constant segmentation fault crashes, and liboqs code was near the top of the stack trace. Here is an example stack trace:

#0  0x0000000000000000 in ?? ()
#1  
#2  0x0000000000000000 in ?? ()
#3  0x0000556ea1ed5eac in keccak_x4_inc_absorb.constprop ()

We determined the problem to be a race condition that was causing a function call to call the 0 address. The issue was filed to liboqs. To explain briefly, the race condition was in the Keccak_Dispatch function, where Keccak_Initialize_ptr would be set before setting some other function pointers. Crucially, Keccak_Initialize_ptr being set or not is used by the caller of Keccak_Dispatch to determine whether to actually call it. In a multi-threaded environment, some thread could call Keccak_Dispatch, then set Keccak_Initialize_ptr and pause there. Another thread could then take the same code path, see that Keccak_Initialize_ptr is non-zero and opt not to call Keccak_Dispatch, then call some of the other function pointers that are still zero, leading to a segfault. (The same is true of the Keccak_X4_Dispatch function.)

Although liboqs is being used by a growing number of products and companies, it appears that we were the first to encounter and report this issue, possibly due to the scale of our trial deployment. We fixed it by calling Keccak_Dispatch with pthread_once on POSIX platforms. The fix has since been submitted and merged upstream.

Cross-domain resumption handshake thrash

We rolled out post-quantum hybrid key exchange progressively, with the decision driven by the client. For instance, we started with connections between different data centers, then moved on to traffic within the data center.

Internally, we scope TLS sessions by “service” name. This allows a client to perform cross-host resumption to different servers in the same service. This includes the ability to resume from a server with which the client decides to use hybrid key exchange to one where the client does not, and vice versa, which runs into a small problem with Fizz.

As previously mentioned, we do Ephemeral Diffie-Hellman key exchange on resumption. To facilitate efficient use of computation resources, the client will send only the minimally required default keyshares, which in the resumption case means the keyshare for the previously negotiated named group. This means that when a client connects to a particular server and negotiates a classical named group, then subsequently resumes on a server with which the client should use a hybrid named group, the client would advertise the hybrid named group but send only the keyshare for the classical named group. This leads to the server negotiating the hybrid named group and replying with a HelloRetryRequest to ask the client for the hybrid keyshare, resulting in an additional 1-RTT to perform the key exchange.

To address this, we had the client split each service into different TLS session scopes – one using classical key exchange, and one using hybrid key exchange. Each session scope thus uses only one named group each, avoiding the keyshare thrashing behavior described above. The tradeoff is space consumption due to having to store more session tickets, but this has been acceptable given the small size of each session ticket (a few hundred bytes).

The computational cost of Kyber key exchange

Meta currently uses X25519 in Elliptic Curve Diffie-Hellman key exchange. During the initial rollout of hybrid key exchange with the hybrid named group X25519_kyber768, we observed a roughly 40 percent increase in CPU cycles. Although this may seem like an undesirable result, it actually indicates that Kyber768 standalone key exchange is faster than x25519, which lines up with results others have found.

Current status and future plans

Meta has deployed post-quantum hybrid key exchange for most internal service communication to protect against the SNDL threat. Since internal service communication traffic occurs within our internal network and is fully under our control, this was the logical starting point for implementing this advanced security countermeasure, even as we await the PQC standards to be published by NIST.

Implementing post-quantum hybrid key exchange to external public internet traffic poses several additional challenges, such as dependency on browsers’ TLS implementations and crypto libraries’ PQC readiness, increased communication bandwidth due to larger payloads, and more. We are looking forward to industry standardization and major browser based adoption, and we’ll keep working across Meta to harden our systems as well. We look forward to sharing more as we continue our efforts in this space.

Acknowledgements

We thank the current and past members of Meta’s Service Encryption team particularly: Isaac Elbaz, Fred Qui, Keyu Man, Puneet Mehra, Forrest Mertens, Ameya Shedarkar, and Mingtao Yang.

Behind the scenes of Threads for web

Tue, 14 May 2024 18:33:00 +0200

When Threads first launched one of the top feature requests was for a web client.

In this episode of the Meta Tech Podcast, Pascal Hartig (@passy) sits down with Ally C. and Kevin C., two engineers on the Threads Web Team that delivered the basic version of Threads for web in just under three months.

Ally and Kevin share how their team moved swiftly by leveraging Meta’s shared infrastructure and the nimble engineering practices of their colleagues who built Threads for iOS and Android. They also discuss how they balanced the need to ship new features with the desire to craft exciting experiences for people on Threads.

Download or listen to the episode below:

[embedded content]
You can also find the episode wherever you get your podcasts, including:

The Meta Tech Podcast is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.

Send us feedback on Instagram, Threads, or X.

And if you’re interested in learning more about career opportunities at Meta visit the Meta Careers page.

Building new custom silicon for Meta’s AI workloads

Thu, 11 Apr 2024 18:30:00 +0200

“Even as a kid, I wanted to understand the whole picture,” says Nicolaas V., systems engineer, sweeping his hands to show how he thinks. “Teachers would teach one thing at a time and say, ‘Don’t worry about the next part yet,’ even though it’s all related. I could never grasp that. That's why I love building at Meta — from the moment we start, we're thinking about how everything connects.”

Before joining Meta, Nicolaas built network processors for another company. “Normal chips, like CPUs, carry information with big threads — imagine strong people hammering railway stakes into the ground. The team I was on built a chip made of lots of little threads, like tiny ants carrying bits of information around the world.” Nicolaas brought this type of creative thinking to Meta when he joined in 2020 and began helping build infrastructure for the future of AI.

Where ideas and opportunities create impact

Moving from a small company, Nicolaas immediately noticed the scale and resources at Meta — making it a place he could turn ideas into reality. “There’s a culture of autonomy and trust that’s unique to Meta as a global company,” he explains. “There are more opportunities to try something new. For example, when my team discovered I knew about remote direct memory access (RDMA) and network interface cards (NICs), I was asked to join a project that hadn’t been done internally before: building a GPU cluster in infrastructure. I had the time of my life.”

While large-scale GPU clusters are now foundational for Meta, people weren’t always sure how the technology would work. “After four months of building, our small team was able to do in two hours of training in what had previously taken five days,” Nicolaas remembers. “This was an aha moment for the company, and it led us to the path we’re on today within our AI practice.”

By 2021, the GPU cluster project had grown exponentially. Nicolaas joined another project where the team needed NIC expertise. Once it became a plan of record (PoR), a senior team lead approached Nicolaas with another ground-breaking opportunity. “He said, ‘We want to build an AI chip,’” Nicolaas smiles. “The ask brought my knowledge of GPU clusters, RDMA and NICs together, and it’s how I started on the Meta Training and Inference Accelerator (MTIA) program, our in-house, custom-built silicon.”

Making a case to grow MTIA from the ground up

“The MTIA chip was a really interesting challenge because we were building from nothing. We had no background, no IP and no capability yet,” Nicolaas shares. “The existing silicon was not optimized to operate in large clusters. So we had to develop the silicon, software and hardware to connect the devices together so they could run as one giant computer. To extend the metaphor: Rather than one person hammering a railway, we had to train thousands of people to hammer at the exact same moment.”

While Nicolaas dove headfirst into the technical project, a question remained: Did it make sense for Meta to build its own chips? This opened up a discussion around business cases, which inspired Nicolaas and his team to develop a high-level business plan. “We outlined how MTIA helps us make sure we run more efficiently. We just had to build it and make it scale.”

Nicolaas standing next to two members of his team, looking at a whiteboard with a diagram on it.

This was another experience unique to Meta. “While most companies would hire someone from the outside with expertise, leadership trusted us to figure it out,” Nicolaas shares. “They encouraged us to define the project and build up very quickly.” Nicolaas rapidly expanded his skill set — negotiating with commercial vendors and leveraging his network to find the right partners for Meta. Through this approach, he saw firsthand how Meta was leading the industry forward in infrastructure built specifically for AI.

An intersectional approach to AI systems

For the past year, the next-gen MTIA chip has been Nicolaas’ primary focus on the AI systems team. He spends his time ensuring the hardware, software and silicon are integrated. “Building from the ground up has allowed me to ‘stack’ all the layers — which I love as someone who prefers to see the full picture. Our team moves away from traditional approaches, which have many abstraction layers, so we can move data up and down these systems as efficiently as possible.”

Once the team built the chip with enough connectivity and bandwidth, they moved to hardware: designing a system to connect everything together. Next, they built the software to enable communication between chips. “MTIA is relatively small compared to GPUs — that’s what makes it performant and efficient — so how do we get the system to scale larger and run bigger workloads? We did something novel: adding space in the system to start connecting future clusters of MTIA when it becomes necessary.”

As AI grows, Nicolaas believes flexibility will make the biggest difference in innovation. “Our AI models are changing faster than we can adjust the silicon, so we need to be able to flex the system to meet different balances of compute, memory and network. That's our job: to make sure we can flex ourselves as far as possible, up and down and sideways to meet the needs of our models. That’s how we’re enabling the future of AI. That’s how we’re setting up Meta and the industry to grow.”

Building new custom silicon for Meta’s AI workloads

Thu, 11 Apr 2024 18:30:00 +0200

Where ideas and opportunities create impact

Making a case to grow MTIA from the ground up

Nicolaas standing next to two members of his team, looking at a whiteboard with a diagram on it.

An intersectional approach to AI systems

Building new custom silicon for Meta’s AI workloads

Thu, 11 Apr 2024 18:30:00 +0200

Building an infrastructure for AI’s future

Thu, 11 Apr 2024 18:20:00 +0200

Kevin L., a technical program manager (TPM), studied mechanical engineering at the University of Waterloo with the dream of “building planes, cars and fast things.” When his career began, however, he was transported by a different kind of speed. “At Meta, we move fast,” he shares. “We ask each other, ‘what would you do if you weren’t afraid?’ I had this eye-opening moment that I could break out of my way of doing things.”

When Kevin joined Meta as an intern and thermal engineer in 2011, he was building cooling solutions and supporting the first generation of the Open Compute Project. “My manager sent me to a data center to fix overheating network switches. When it didn’t end up being a thermal problem, I had an opportunity to still find a solution. Meta changed my perspective on problem-solving: look at the whole issue and focus on solving integrated challenges. It doesn’t matter what your degree or title is, at Meta we are all problem-solvers.”

With this mindset, Kevin spent the next decade at Meta bringing his skills to exciting new places: the world of AI.

Solving integrated challenges

Kevin joined the release to production team as a full-time validation engineer in 2012. “Our team was scrappy — we just had to figure out issues. My focus was testing hardware in the data center, but I was also learning to debug systems and hack the Linux Kernel — skills I’d later use to solve open-ended questions around AI.”

When Meta kickstarted its AI efforts, the team approached Kevin for his expertise in graphics processing units (GPUs) for training models, which they didn’t have yet. “I bought the materials, plugged GPUs into the servers, changed power supplies and helped build our first iteration of AI compute at Meta,” he shares. As Meta shifted more toward AI, so did Kevin’s roles. In 2016, he became a TPM focused on vision and strategy for AI; in 2018 he became a TPM manager, leading a group for AI acceleration; and in 2020 he moved to the Fundamental AI Research (FAIR) org, where he helped build the AI Research SuperCluster — which Meta used to train its first Llama large language model.

“Nothing is just a hardware problem,” Kevin explains. “It’s an integrated problem from hardware to software to AI research. Moving from an engineer who deep dives into things, to a TPM who looks more broadly, helped me expand my scope.”

Building infrastructure for the future of AI

Today, Kevin sits within the infra foundation team, creating fundamental building blocks for AI infrastructure at Meta. “Our work accelerates what Meta does as a company, impacting our ability to scale and maintain reliability and functionality,” he shares. “Engineers and researchers use this infrastructure to build products and do better research. For instance, Meta recently announced two versions of our 24,576-GPU data center scale cluster, which support our current and next-gen AI models, including Llama 3, the successor to Llama 2, our open-source LLM.”

As AI training models continue getting bigger, Kevin and his team need to scale infrastructure to train them — but building large systems is a fundamental challenge. If one GPU slows or fails, Kevin explains, the whole system is affected. “We’re already collaborating with vendors to design solutions that can minimize the impact of system failures. We’re also improving checkpointing, which is like a ‘save state’ in video games, where we can return to the last known good state to resume training. These are all so that we can continue training even bigger and more complex models.”

Collaborating on a cross-functional mission

Kevin describes the AI clusters as being collaborative in every way — bringing together innovation in hardware, storage, software and network fabrics. “Picture a cluster as one big chip being split across the data center. We have to connect one GPU to another with cables. Can you imagine how many cables and switches are needed? Ensuring we maintain low latency and high bandwidth is complex from a network perspective — cross-functional work is vital.”

Kevin highlights the culture at Meta as “set up for collaborative communication” — an advantage to his team’s success. “ I have people explore different teams and tracks, acquiring diverse skills and knowledge — which we all benefit from. Every team across the company is thinking about AI and how we can build it better. The flexibility and focus we share at Meta are highly unique. ”

Collaboration at Meta extends to the open source community as well. “The commitment to open source at Meta is tried and true. We’ve been doing it, and we’ll continue to do it,” Kevin says.

That’s what innovation means to Kevin — pushing the boundaries of what’s possible. “Every few years we see things that change what we’re doing and set us up for the next big thing, from the internet to smart phones. GenAI will help every person in every industry be more creative and productive. Together, with these tools, we’ll do bigger and better things.”

Building an infrastructure for AI’s future

Thu, 11 Apr 2024 18:20:00 +0200

With this mindset, Kevin spent the next decade at Meta bringing his skills to exciting new places: the world of AI.

Solving integrated challenges

Building infrastructure for the future of AI

Collaborating on a cross-functional mission

Collaboration at Meta extends to the open source community as well. “The commitment to open source at Meta is tried and true. We’ve been doing it, and we’ll continue to do it,” Kevin says.

Introducing the next-gen Meta Training and Inference Accelerator

Wed, 10 Apr 2024 17:11:00 +0200

[...]

Bringing HDR photo support to Instagram and Threads

Tue, 26 Mar 2024 17:00:00 +0100

Meta’s family of apps serves trillions of image download requests every day. And if you’re into high-quality images, you’ve probably noticed that Instagram and Threads have added support for high dynamic range (HDR) photos. Now people on Threads and Instagram can upload and share images that are more true-to-life, with the full color and range their device is capable of capturing.

Zuzanna Mroczek, a software engineer on Meta’s Media Platform Team, joins Pascal Hartig (@passy) on the Meta Tech Podcast to talk about how her team, which owns the entire flow from serving images from the CDN to displaying them on your device, is driving up image quality across apps, platforms, and devices.

Hear how the Media Platform Team brought HDR to Instagram and Threads, and how they partnered with major phone manufacturers (including Google and Samsung) on the rollout!

Download or listen to the episode below:

[embedded content]

You can also find the episode wherever you get your podcasts, including:

The Meta Tech Podcast is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.

Send us feedback on Instagram, Threads, or X.

And if you’re interested in learning more about career opportunities at Meta visit the Meta Careers page.

Threads has entered the fediverse

Thu, 21 Mar 2024 19:05:00 +0100

Threads has entered the fediverse! As part of our beta experience, now available in a few countries, Threads users aged 18+ with public profiles can now choose to share their Threads posts to other ActivityPub-compliant servers.
People on those servers can now follow federated Threads profiles and see, like, reply to, and repost posts from the fediverse.
We’re sharing how we’re continuing to integrate Threads with the fediverse, the technical challenges, the solutions we’ve come up with along the way, and what’s next as we move toward making Threads fully interoperable.

Threads’ initial launch came together in only a few short months. A nimble team of engineers, leveraging Meta’s existing scalable infrastructure, was able to make Threads Meta’s most successful app launch of all time.

Now, we’re integrating Threads with the fediverse. With our beta experience, now available in a few countries, including the US, Threads users aged 18+ with public profiles can now choose to federate their profiles – allowing them to share their Threads posts to other ActivityPub-compliant servers, and enabling people on those servers to follow them, and like, reply to, and repost their posts.

Building a federated platform – Meta’s first app for open social networking – has meant new engineering challenges and opportunities. Designing for the fediverse comes with unique interoperability considerations and hurdles to overcome on the server side.

What is the fediverse?

When we set out to build Threads our goal was always to build a decentralized social networking app within the fediverse, where federated networking gives people greater control over their online identity and the content they see, regardless of their chosen platform.

One way to think about the fediverse is to compare it to email. You can send an email from a Gmail account to a Yahoo account, for example, because those services support the same protocols. Similarly, in the fediverse you can connect with people who use different social networking services that are built on the same protocol, removing the silos that confine people and their followers to any single platform. But unlike email, your fediverse conversations and profile are public and can be shared across servers.

Building Threads on an open social networking protocol gives people more freedom and choice in the online communities they inhabit. Every fediverse server can set its own community standards and content moderation policies, meaning people have the freedom to choose spaces that align with their values.

We believe this decentralized approach, similar to the protocols governing email and the web itself, will play an important role in the future of online platforms. The fediverse promotes innovation and competition by fostering a more diverse and vibrant ecosystem of social media platforms that can easily connect with a wider audience.

What is ActivityPub?

Threads leverages ActivityPub – a decentralized, open social networking protocol built by the World Wide Web Consortium(W3C) – that is premised on a straightforward, fundamental idea: creating a social networking structure based on open protocols that allow people to communicate and network with each other regardless of the server they choose.

ActivityPub acts as a server-to-server protocol where the API allows decentralized servers to communicate with one another to deliver content and activities.

The protocol plays a key role in allowing Threads to be interoperable with other servers that also use it. Eventually, people on Threads will be able to interact with people on platforms like Mastodon and WordPress without having to sign up for accounts on those apps.

The current state of fediverse integration in Threads

With our beta experience, Threads users aged 18+ with public profiles can now choose to enable sharing to the fediverse. If they do, they’ll be able to publish posts on Threads that will be viewable on other ActivityPub-compliant servers. Threads users will also be able to see aggregated like counts on their posts from other fediverse servers directly from the Threads app. If people on other fediverse servers follow federated Threads profiles they’ll be able to see, reply to, and repost Threads posts (if their server allows it).

What types of content are federated?

In this initial phase federated Threads users will not be able to see who liked their posts or any replies from people in the fediverse on Threads. For now, people who want to see replies on their posts on other fediverse servers will have to visit those servers directly.

Certain types of posts and content are also not federated, including:

Posts with restricted replies.
Replies to non-federated posts.
Post with polls (until future updates).
Reposts of non-federated posts.

For posts that contain links, a link attachment will be appended as a link at the end of the post if it is not already included in the post.

Building more federated features for Threads

More federated features for Threads will come once we have addressed other technical hurdles in a way that we feel is safest and offers the best possible user experience. Within all of this, it’s also important to us that, as we build these solutions, we do so alongside the open and decentralized fediverse developer community.

As we federate new features in Threads, we have to look at how to address the disparity in the availability and implementation of these features across servers.

Federating quote posts

Take quote posts as an example. They’re a popular feature across all social media, but ActivityPub does not have a formal specification for how to handle them yet. Thus, fediverse servers have come up with their own methods of integrating and handling quote posts. Some servers allow for creating and viewing quote posts; others don’t support the function at all.

There are a handful of unofficial methods for handling quote posts in ActivityPub. One fediverse enhancement proposal (FEP), FEP-e232, proposes a way to represent inline quotes and other text-based links to ActivityPub in a manner similar to mentions on other social media platforms. Another method would be to use the quoteURL property within ActivityPub, which would assign posts an ID that could then be pulled into other posts that want to quote them. Misskey created its own solution with its _misskey_quote property, which builds on FEP-e232.

Many fediverse servers also append extra syntax (RE:) to post content to make it compatible with servers that haven’t implemented any of the structured methods for handling quote posts.

After exploring different options pursued by the fediverse community, we chose to implement both FEP-e232 and _misskey_quote to federate quote posts on Threads. As of now, none of these methods are official keys in the ActivityPub namespace. We chose _misskey_quote because its naming makes it clear that it’s not an official ActivityPub method, and because we know that it’s supported by Misskey, Firefish, and potentially other servers that use quote posts.

In our current implementation, if a Threads user creates a quote post from a federated post, the quote post will contain a permalink URL (e.g. “RE: “) to the post along with a structured representation of the post. Platforms outside of Threads can display the quote post similar to how it’s displayed on Threads by using the structured representation to fetch the post and display it within the quote post.

If the post being quoted is not federated, the quote post’s content will only contain the permalink URL and not the structured representation.

Federated and non-federated interactions

If a federated Threads user is replying to, quoting, or reposting a post from another federated Threads user it makes perfect sense to federate that reply, quote, or repost (which we do).

However, we had to take a careful look at the complexities that arise since not every Threads user will opt in to turn on sharing to the fediverse. Prioritizing the user experience for both those who federate and those who choose not to is important to us. Which also means federated and non-federated users on Threads should still be able to interact with one another seamlessly.

Unlike other federated platforms, Threads doesn’t simply federate every post. Given that features like replies may or may not be federated, we had to build UI/UX treatments and notices to help people understand what is happening and what to expect when posting.

Our phased approach to the fediverse

We’re taking a phased approach to Threads’ fediverse integration to ensure we can continue to build responsibly and get valuable feedback from our users and the fediverse community.

In the future, we expect content to flow from the fediverse into Threads. Federated Threads users will be able to see and engage with replies to their posts coming from other servers, or follow people on other fediverse servers and engage with their content directly in Threads. Our plan is for fediverse-enabled Threads profiles to ultimately have one consolidated number of followers that combines users that followed them from Threads and users from other servers.

Building a federated social networking app is a complex and delicate process if it is to be done safely. While we don’t have exact dates or details on our milestones just yet, we’re committed to a fully interoperable experience, and we’ll take the time to get this right and grow the fediverse responsibly.

This is another step in our journey to make Threads fully interoperable. We will continue to collaborate with developers and policy makers so that people across services have the opportunity to experience the benefits the fediverse offers via a fully interoperable experience, including reaching new audiences and fostering their community.

Optimizing RTC bandwidth estimation with machine learning

Wed, 20 Mar 2024 22:50:00 +0100

Bandwidth estimation (BWE) and congestion control play an important role in delivering high-quality real-time communication (RTC) across Meta’s family of apps. We’ve adopted a machine learning (ML)-based approach that allows us to solve networking problems holistically across cross-layers such as BWE, network resiliency, and transport. We’re sharing our experiment results from this approach, some of [...]

Better video for mobile RTC with AV1 and HD

Wed, 20 Mar 2024 20:40:00 +0100

At Meta, we support real-time communication (RTC) for billions of people through our apps, including Messenger, Instagram, and WhatsApp. We’ve seen significant benefits by adopting the AV1 codec for RTC. Here’s how we are improving the RTC video quality for our apps with tools like the AV1 codec, the challenges we face, and how we [...]

The post Better video for mobile RTC with AV1 and HD appeared first on Engineering at Meta.

Logarithm: A logging engine for AI training workflows and services

Mon, 18 Mar 2024 17:00:00 +0100

Systems and application logs play a key role in operations, observability, and debugging workflows at Meta.
Logarithm is a hosted, serverless, multitenant service, used only internally at Meta, that consumes and indexes these logs and provides an interactive query interface to retrieve and view logs.
In this post, we present the design behind Logarithm, and show how it powers AI training debugging use cases.

Logarithm indexes 100+GB/s of logs in real time, and thousands of queries a second. We designed the system to support service-level guarantees on log freshness, completeness, durability, query latency, and query result completeness. Users can emit logs using their choice of logging library (the common library at Meta is the Google Logging Library [glog]). Users can query using regular expressions on log lines, arbitrary metadata fields attached to logs, and across log files of hosts and services.

Logarithm is written in C++20 and the codebase follows modern C++ patterns, including coroutines and async execution. This has supported both performance and maintainability, and helped the team move fast – developing Logarithm in just three years.

Logarithm’s data model

Logarithm represents logs as a named log stream of (host-local) time-ordered sequences of immutable unstructured text, corresponding to a single log file. A process can emit multiple log streams (stdout, stderr, and custom log files). Each log line can have zero or more metadata key-value pairs attached to it. A common example of metadata is rank ID in machine learning (ML) training, when multiple sequences of log lines are multiplexed into a single log stream (e.g., in PyTorch).

Logarithm supports typed structures in two ways – via typed APIs (ints, floats, and strings), and extraction from a log line using regex-based parse-and-extract rules – a common example is metrics of tensors in ML model logging. The extracted key-value pairs are added to the log line’s metadata.

Figure 1: Logarithm data model. The boxes on text represent typed structures.

AI training debugging with Logarithm

Before looking at Logarithm’s internals, we present support for training systems and model issue debugging, one of the prominent use cases of Logarithm at Meta. ML model training workflows tend to have a wide range of failure modes, spanning data inputs, model code and hyperparameters, and systems components (e.g., PyTorch, data readers, checkpointing, framework code, and hardware). Further, failure root causes evolve over time faster than traditional service architectures due to rapidly-evolving workloads, from scale to architectures to sharding and optimizations. In order to triage such a dynamic nature of failures, it is necessary to collect detailed telemetry on the systems and model telemetry.

Since training jobs run for extended periods of time, training systems and model telemetry and state need to be continuously captured in order to be able to debug a failure without reproducing the failure with additional logging (which may not be deterministic and wastes GPU resources).

Given the scale of training jobs, systems and model telemetry tend to be detailed and very high-throughput – logs are relatively cheap to write (e.g., compared to metrics, relational tables, and traces) and have the information content to power debugging use cases.

We stream, index and query high-throughput logs from systems and model layers using Logarithm.

Logarithm ingests both systems logs from the training stack, and model telemetry from training jobs that the stack executes. In our setup, each host runs multiple PyTorch ranks (processes), one per GPU, and the processes write their output streams to a single log file. Debugging distributed job failures leads to ambiguity due to lack of rank information in log lines, and adding it means that we modify all logging sites (including third-party code). With the Logarithm metadata API, process context such as rank ID is attached to every log line – the API adds it into thread-local context and attaches a glog handler.

We added UI tools to enable common log-based interactive debugging primitives. The following figures show screenshots of two such features (on top of Logarithm’s filtering operations).

Filter–by-callsite enables hiding known log lines or verbose/noisy logging sites when walking through a log stream. Walking through multiple log streams side-by-side enables finding rank state that is different from other ranks (i.e., additional lines or missing lines), which typically is a symptom or root cause. This is directly a result of the single program, multiple data nature of production training jobs, where every rank iterates on data batches with the same code (with batch-level barriers).

Figure 2: Logarithm UI features for training systems debugging (Logs shown are for demonstration purposes).

Logarithm ingests continuous model telemetry and summary statistics that span model input and output tensors, model properties (e.g., learning rate), model internal state tensors (e.g., neuron activations) and gradients during training. This powers live training model monitoring dashboards such as an internal deployment of TensorBoard, and is used by ML engineers to debug model convergence issues and training failures (due to gradient/loss explosions) using notebooks on raw telemetry.

Model telemetry tends to be iteration-based tensor timeseries with dimensions (e.g., model architecture, neuron, or module names), and tends to be high-volume and high-throughput (which makes low-cost ingestion in Logarithm a natural choice). Collocating systems and model telemetry enables debugging issues that cascade from one layer to the other. The model telemetry APIs internally write timeseries and dimensions as typed key-value pairs using the Logarithm metadata API. Multimodal data (e.g., images) are captured as references to files written to an external blob store.

Model telemetry dashboards typically tend to be a large number of timeseries visualizations arranged in a grid – this enables ML engineers to eyeball spatial and temporal dynamics of the model external and internal state over time and find anomalies and correlation structure. A single dashboard hence needs to get a significantly large number of timeseries and their tensors. In order to render at interactive latencies, dashboards batch and fan out queries to Logarithm using the streaming API. The streaming API returns results with random ordering, which enables dashboards to incrementally render all plots in parallel – within 100s of milliseconds to the first set of samples and within seconds to the full set of points.

Figure 3: TensorBoard model telemetry dashboard powered by Logarithm. Renders 722 metric time series at once (total of 450k samples).

Logarithm’s system architecture

Our goal behind Logarithm is to build a highly scalable and fault tolerant system that supports high-throughput ingestion and interactive query latencies; and provides strong guarantees on availability, durability, freshness, completeness, and query latency.

Figure 4: Logarithm’s system architecture.

At a high level, Logarithm comprises the following components:

Application processes emit logs using logging APIs. The APIs support emitting unstructured log lines along with typed metadata key-value pairs (per-line).
A host-side agent discovers the format of lines and parses lines for common fields, such as timestamp, severity, process ID, and callsite.
The resulting object is buffered and written to a distributed queue (for that log stream) that provides durability guarantees with days of object lifetime.
Ingestion clusters read objects from queues, and support additional parsing based on any user-defined regex extraction rules – the extracted key-value pairs are written to the line’s metadata.
Query clusters support interactive and bulk queries on one or more log streams with predicate filters on log text and metadata.

Logarithm stores locality of data blocks in a central locality service. We implement this on a hosted, highly partitioned and replicated collection of MySQL instances. Every block that is generated at ingestion clusters is written as a set of locality rows (one for each log stream in the block) to a deterministic shard, and reads are distributed across replicas for a shard. For scalability, we do not use distributed transactions since the workload is append-only. Note that since the ingestion processing across log streams is not coordinated by design (for scalability), federated queries across log streams may not return the same last-logged timestamps between log streams.

Our design choices center around layering storage, query, and log analytics and simplicity in state distribution. We design for two common properties of logs: they are written more than queried, and recent logs tend to be queried more than older ones.

Design decisions

Logarithm stores logs as blocks of text and metadata and maintains secondary indices to support low latency lookups on text and/or metadata. Since logs rapidly lose query likelihood with time, Logarithm tiers the storage of logs and secondary indices across physical memory, local SSD, and a remote durable and highly available blob storage service (at Meta we use Manifold). In addition to secondary indices, tiering also ensures the lowest latencies for the most accessed (recent) logs.

Lightweight disaggregated secondary indices. Maintaining secondary indices on disaggregated blob storage magnifies data lookup costs at query time. Logarithm’s secondary indices are designed to be lightweight, using Bloom filters. The Bloom filters are prefetched (or loaded on-query) into a distributed cache on the query clusters when blocks are published on disaggregated storage, to hide network latencies on index lookups. We later added support for data blocks in the query cache when executing a query. The system tries to collocate data from the same log stream in order to reduce fan outs and stragglers during query processing. The logs and metadata are implemented as ORC files. The Bloom filters currently index log stream locality and metadata key-value information (i.e., min-max values and Bloom filters for each column of ORC stripes).

Logarithm separates compute (ingestion and query) and storage to rapidly scale out the volume of log blocks and secondary indices. The exception to this is the in-memory memtable on ingestion clusters that buffer time-ordered lists of log streams, which is a staging area for both writes and reads. The memtable is a bounded per-log stream buffer of the most recent and long enough time window of logs that are expected to be queried. The ingestion implementation is designed to be I/O-bound and not compute or host memory bandwidth-heavy to handle close to GB/s of per-host ingestion streaming. To minimize memtable contention, we implement multiple memtables, for staging, and an immutable prior version for serializing to disk. Ingestion implementation follows zero-copy semantics.

Similarly, Logarithm separates ingestion and query resources to ensure bulk processing (ingestion) and interactive workloads do not impact each other. Note that Logarithm’s design uses schema-on-write, but the data model and parsing computation is distributed between the logging hosts (which scales ingestion compute), and optionally, the ingestion clusters (for user-defined parsing). Customers can add additional anticipated capacity for storage (e.g., increased retention limits), ingestion and query workloads.

Logarithm pushes down distributed state maintenance to disaggregated storage layers (instead of replicating compute at ingestion layer). The disaggregated storage in Manifold uses read-write quorums to provide strong consistency, durability and availability guarantees. The distributed queues in Scribe use LogDevice for maintaining objects as a durable replicated log. This simplifies ingestion and query tier fault tolerance. Ingestion nodes stream serialized objects on local SSDs to Manifold in 20-min. epochs, and checkpoint Scribe offsets on Manifold. When a failed ingestion node is replaced, the new node downloads the last epoch of data from Manifold, and restarts ingesting raw logs from the last Scribe checkpoint.

Ingestion elasticity. The Logarithm control plane (based on Shard Manager) tracks ingestion node health and log stream shard-level hotspots, and relocates shards to other nodes when it finds issues or load. When there is an increase in logs written in a log stream, the control plane scales out the shard count and allocates new shards on ingestion nodes with available resources. The system is designed to provide resource isolation at ingestion-time between log streams. If there is a significant surge in very short timescales, the distributed queues in Scribe absorb the spikes, but when the queues are full, the log stream can lose logs (until elasticity mechanisms increase shard counts). Such spikes typically tend to result from logging bugs (e.g., verbosity) in application code.

Query processing. Queries are routed randomly across the query clusters. When a query node receives a request, it assumes the role of an aggregator and partitions the request across a bounded subset of query cluster nodes (balancing between cluster load and query latency). The aggregator pushes down filter and sort operators to query nodes and returns sorted results (an end-to-end blocking operation). The query nodes read their partitions of logs by looking up locality, followed by secondary indices and data blocks – the read can span the query cache, ingestion nodes (for most recent logs) and disaggregated storage. We added 2x replication of the query cache to support query cluster load distribution and fast failover (without waiting for cache shard movement). Logarithm also provides a streaming query API with randomized and incremental sampling that returns filtered logs (an end-to-end non-blocking operation) for lower-latency reads and time-to-first-log. Logarithm paginates result sets.

Logarithm can tradeoff query result completeness or ordering to maintain query latency (and flag to the client when it does so). For example, this can be the case when a partition of a query is slow or when the number of blocks to be read is too high. In the former, it times out and skips the straggler. In the latter scenario, it starts from skipped blocks (or offsets) when processing the next result page. In practice, we provide guarantees for both result completeness and query latency. This is primarily feasible since the system has mechanisms to reduce the likelihood of root causes that lead to stragglers. Logarithm also does query admission control at client or user-level.

The following figures characterize Logarithm’s aggregate production performance and scalability across all log streams. They highlight scalability as a result of design choices that make the system simpler (spanning disaggregation, ingestion-query separation, indexes, and fault tolerance design). We present our production service-level objectives (SLOs) over a month, which are defined as the fraction of time they violate thresholds on availability, durability (including completeness), freshness, and query latency.

Figure 5: Logarithm’s ingestion-query scalability for the month of January 2024 (one point per day).

Figure 6: Logarithm SLOs for the month of January 2024 (one point per day).

Logarithm supports strong security and privacy guarantees. Access control can be enforced on a per-log line granularity at ingestion and query-time. Log streams can have configurable retention windows with line-level deletion operations.

Next steps

Over the last few years, several use cases have been built over the foundational log primitives that Logarithm implements. Systems such as relational algebra on structured data and log analytics are being layered on top with Logarithm’s query latency guarantees – using pushdowns of search-filter-sort and federated retrieval operations. Logarithm supports a native UI for interactive log exploration, search, and filtering to aid debugging use cases. This UI is embedded as a widget in service consoles across Meta services. Logarithm also supports a CLI for bulk download of service logs for scripting analyses.

The Logarithm design has centered around simplicity for scalability guarantees. We are continuously building domain-specific and agnostic log analytics capabilities within or layered on Logarithm with appropriate pushdowns for performance optimizations. We continue to invest in storage and query-time improvements, such as lightweight disaggregated inverted indices for text search, storage layouts optimized for queries and distributed debugging UI primitives for AI systems.

Acknowledgements

We thank Logarithm team’s current and past members, particularly our leads: Amir Alon, Stavros Harizopoulos, Rukmani Ravisundaram, and Laurynas Sukys, and our leadership: Vinay Perneti, Shah Rahman, Nikhilesh Reddy, Gautam Shanbhag, Girish Vaitheeswaran, and Yogesh Upadhay. Thank you to our partners and customers: Sergey Anpilov, Jenya (Eugene) Lee, Aravind Ram, Vikram Srivastava, and Mik Vyatskov.

Building Meta’s GenAI Infrastructure

Tue, 12 Mar 2024 16:00:00 +0100

Marking a major investment in Meta’s AI future, we are announcing two 24k GPU clusters. We are sharing details on the hardware, network, storage, design, performance, and software that help us extract high throughput and reliability for various AI workloads. We use this cluster design for Llama 3 training. We are strongly committed to open [...]

The post Building Meta’s GenAI Infrastructure appeared first on Engineering at Meta.

Making messaging interoperability with third parties safe for users in Europe

Wed, 06 Mar 2024 10:00:00 +0100

To comply with a new EU law, the Digital Markets Act (DMA), which comes into force on March 7th, we’ve made major changes to WhatsApp and Messenger to enable interoperability with third-party messaging services.
We’re sharing how we enabled third-party interoperability (interop) while maintaining end-to-end encryption (E2EE) and other privacy guarantees in our services as far as possible.

On March 7th, a new EU law, the Digital Markets Act (DMA), comes into force. One of its requirements is that designated messaging services must let third-party messaging services become interoperable, provided the third-party meets a series of eligibility, including technical and security requirements.

This allows users of third-party providers who choose to enable interoperability (interop) to send and receive messages with opted-in users of either Messenger or WhatsApp – both designated by the European Commission (EC) as being required to independently provide interoperability to third-party messaging services.

For nearly two years our team has been working with the EC to implement interop in a way that meets the requirements of the law and maximizes the security, privacy and safety of users. Interoperability is a technical challenge – even when focused on the basic functionalities as required by the DMA. In year one, the requirement is for 1:1 text messaging between individual users and the sharing of images, voice messages, videos, and other attached files between individual end users. In the future, requirements expand to group functionality and calling.

To interoperate, third-party providers will sign an agreement with Messenger and/or WhatsApp and we’ll work together to enable interoperability. Today we’ll publish the WhatsApp Reference Offer for third-party providers which will outline what will be required to interoperate with the service. The Reference Offer for Messenger will follow in due course.

While Meta must be ready to enable interoperability with other services within three months of receiving a request, it may take longer before the functionality is ready for public use. We wanted to take this opportunity to set out the technical infrastructure and thinking that sits behind our interop solution.

A privacy-centric approach to building interoperable messaging services

Our approach to compliance with the DMA is centered around preserving privacy and security for users as far as is possible. The DMA quite rightly makes it a legal requirement that we should not weaken security provided to Meta’s own users.

The approach we have taken in terms of implementing interoperability is the best way of meeting DMA requirements, whilst also creating a viable approach for the third-party providers interested in becoming interoperable with Meta and maximizing user security and privacy.

Implementing an end-to-end encrypted protocol

First, we need to protect the underlying security that keeps communication on Meta E2EE messaging apps secure: the encryption protocol. WhatsApp and Messenger both use the tried and tested Signal protocol as a foundational piece for their encryption.

Messenger is still rolling out E2EE by default for personal communication, but on WhatsApp, this default has been the case since 2016. In both cases, we are using the Signal protocol as the foundation for these E2EE communications, as it represents the current gold standard for E2EE chats.

In order to maximize user security, we would prefer third-party providers to use the Signal protocol. Since this has to work for everyone however, we will allow third-party providers to use a compatible protocol if they are able to demonstrate it offers the same security guarantees as Signal.

To send messages, the third-party providers have to construct message protobuf structures which are then encrypted using the Signal Protocol and then packaged into message stanzas in eXtensible Markup Language (XML).

Meta servers push messages to connected clients over a persistent connection. Third-party servers are responsible for hosting any media files their client applications send to Meta clients (such as image or video files). After receiving a media message, Meta clients will subsequently download the encrypted media from the third-party messaging servers using a Meta proxy service.

It’s important to note that the E2EE promise Meta provides to users of our messaging services requires us to control both the sending and receiving clients. This allows us to ensure that only the sender and the intended recipient(s) can see what has been sent, and that no one can listen to your conversation without both parties knowing.

While we have built a secure solution for interop that uses the Signal protocol encryption to protect messages in transit, without ownership of both clients (endpoints) we cannot guarantee what a third-party provider does with sent or received messages, and we therefore cannot make the same promise.

Our technical solution builds on Meta’s existing client / server architecture

We think the best way to deliver interoperability is through a solution which builds on Meta’s existing client / server architecture [Figure 1]. In particular, the requirement that clients connect to Meta infrastructure has the following benefits, it:

Enables Meta to maximize the level of security and safety for all users by carrying out many of the same integrity checks as it does for existing Meta users
Constitutes a “plug-and-play” model for third-party providers, lowering the barriers for potential new entrants and costs for third-party providers
Helps maximize protection of user privacy by limiting the exposure of their personal data to Meta servers only
Improves overall reliability of the interoperable service as it benefits from Meta’s infrastructure, which is already globally scaled to handle over 100 billion messages each day

Figure 1: A simplified illustration of WhatsApp’s technical architecture.

Taking the example of WhatsApp, third-party clients will connect to WhatsApp servers using our protocol (based on the Extensible Messaging and Presence Protocol – XMPP). The WhatsApp server will interface with a third-party server over HTTP in order to facilitate a variety of things including authenticating third-party users and push notifications.

WhatsApp exposes an Enlistment API that third-party clients must execute when opting in to the WhatsApp network. When a third-party user registers on WhatsApp or Messenger, they keep their existing user-visible identifier, and are also assigned a unique, WhatsApp-internal identifier that is used at the infrastructure level (for protocols, data storage, etc.)

WhatsApp requires third-party clients to provide “proof” of their ownership of the third-party user-visible identifier when connecting or enlisting. The proof is constructed by the third-party service cryptographically signing an authentication token. WhatsApp uses the standard OpenID protocol (with some minor modifications) alongside a JSON Web Token (JWT Token) to verify the user-visible identifier through public keys periodically fetched from the third-party server.

WhatsApp uses the Noise Protocol Framework to encrypt all data traveling between the client and the WhatsApp server. As part of the Noise Protocol, the third-party client must perform a “Noise Handshake” every time the client connects to the WhatsApp server. Part of this Handshake is providing a payload to the server which also contains the JWT Token.

Once the client has successfully connected to the WhatsApp server, the client must use WhatsApp’s chat protocol to communicate with the WhatsApp server. WhatsApp’s chat protocol uses optimized XML stanzas to communicate with our servers.

As we continue to discuss this architecture with third-party providers, we think there is also an approach to implementing interop where we could give third-party providers the option to add a proxy or an “intermediary” between their client and the WhatsApp server. A proxy could potentially give third-party providers more flexibility and control over what their client can receive from the WhatsApp server and also removes the requirement that third-party clients must implement WhatsApp’s client-to-server protocol, i.e. maintain their existing “chat channel” on their clients.

The challenge here is that WhatsApp would no longer have direct connection to both clients and, as a result, would lose connection level signals that are important for keeping users safe from spam and scams such as TCP fingerprints. We would therefore anticipate implementing additional requirements for third-party providers who take up this option under our Reference Offer. This approach also exposes all the chat metadata to the proxy server, which increases the likelihood that this data could be accidentally or intentionally leaked.

Clearly explaining how interop works to users

We believe it is essential that we give users transparent information about how interop works and how it differs from their chats with other WhatsApp or Messenger users. This will be the first time that users have been part of an interoperable network on our services, so giving them clear and straightforward information about what to expect will be paramount. For example, users need to know that our security and privacy promise, as well as the feature set, won’t exactly match what we offer in WhatsApp chats.

Privacy and security is a shared responsibility

As is hopefully clear from this post, preserving privacy and security in an interoperable system is a shared responsibility, and not something that Meta is able to do on its own. We will therefore need to continue collaborating with third-party providers in order to provide the safest and best experience for our users.

How DotSlash makes executable deployment simpler

Mon, 26 Feb 2024 20:46:00 +0100

Andres Suarez and Michael Bolin, two software engineers at Meta, join Pascal Hartig (@passy) on the Meta Tech Podcast to discuss the ins and outs of DotSlash, a new open source tool from Meta. DotSlash takes the pain out of distributing binaries and toolchains to developers. Instead of committing large, platform-specific executables to a repository, [...]

The post How DotSlash makes executable deployment simpler appeared first on Engineering at Meta.

Aligning Velox and Apache Arrow: Towards composable data management

Tue, 20 Feb 2024 18:00:00 +0100

We’ve partnered with Voltron Data and the Arrow community to align and converge Apache Arrow with Velox, Meta’s open source execution engine.
Apache Arrow 15 includes three new format layouts developed through this partnership: StringView, ListView, and Run-End-Encoding (REE).
This new convergence helps Meta and the larger community build data management systems that are unified, more efficient, and composable.

Meta’s Data Infrastructure teams have been rethinking how data management systems are designed. We want to make our data management systems more composable – meaning that instead of individually developing systems as monoliths we identify common components, factor them out as reusable libraries, and leverage common APIs and standards to increase the interoperability between them.

As we decompose our large, monolithic systems into a more modular stack of reusable components, open standards, such as Apache Arrow, play an important role for interoperability of these components. To further our efforts in creating a more unified data landscape for our systems as well as those in the larger community, we’ve partnered with Voltron Data and the Arrow community to converge Apache Arrow’s open source columnar layouts with Velox, Meta’s open source execution engine.

The result combines the efficiency and agility offered by Velox with the widely-used Apache standard.

Why we need a composable data management system

Meta’s data engines support large-scale workloads that include processing large datasets offline (ETL), interactive dashboard generation, ad hoc data exploration, and stream processing. More recently, a variety of feature engineering, data preprocessing, and training systems were built to support our rapidly expanding AI/ML infrastructure. To ensure our engineering teams can efficiently maintain and enhance these engines as our products evolve, Meta has started a series of projects aimed at increasing our engineering efficiency by minimizing the duplication of work, improving the experience of internal data users through more consistent semantics across these engines, and, ultimately, accelerating the pace of innovation in data management.

An introduction to Velox

Velox is the first project in our composable data management system program. It’s a unified execution engine, implemented as a C++ library, aimed at replacing the very processing core of many of these data management systems – their execution engine.

Velox improves the efficiency of these systems by providing a unified, state-of-the-art implementation of features and optimizations that were previously only available in individual engines. It also improves the engineering efficiency of our organization since these features can now be written once, in a single library, and be (re-)used everywhere.

Velox is currently in different stages of integration in more than 10 of Meta’s data systems. We have observed 3-10x efficiency improvements in integrations with well-known systems in the industry like Apache Spark and Presto.

We open-sourced Velox in 2022. Today, it is developed in collaboration with more than 200 individual contributors around the world from more than 20 companies.

Open standards and Apache Arrow

In order to enable interoperability with other components, a composable data management system has to understand common storage (file) formats, network serialization protocols, table APIs, and have a unified way of expressing computation. Oftentimes these components have to directly share in-memory datasets with each other, for example, when transferring data across language boundaries (C++ to Java or Python) for efficient UDF support.

Our focus is to use open standards in these APIs as often as possible. Apache Arrow is an open source in-memory layout standard for columnar data that has been widely adopted in the industry. In a way, Arrow can be seen as the layer underneath Velox: Arrow describes how columnar data is represented in memory; Velox provides a series of execution and resource management primitives to process this data.

Although the Arrow format predates Velox, we made a conscious design decision while creating Velox to extend and deviate from the Arrow format, creating a layout we call Velox Vectors. The purpose was to accelerate the data processing operations commonly found in our workloads in ways that were not possible using Arrow. Velox Vectors provided the efficiency and agility we need to move fast, but in return created a fragmented space with limited component interoperability.

To bridge this gap and create a more unified data landscape for our systems and the community, we partnered with Voltron Data and the Arrow community to align and converge these two formats. After a year of work, the new Apache Arrow release, Apache Arrow 15.0.0, includes three new format layouts inspired by Velox Vectors: StringView, ListView, and Run-End-Encoding (REE).

Arrow 15 not only enables efficient (zero-copy) in-memory communication across components using Velox and Arrow, but also increases Arrow’s applicability in modern execution engines, unlocking a variety of use cases across the industry.

Details of the Arrow and Velox layout

Both Arrow and Velox Vectors are columnar layouts whose purpose is to represent batches of data in memory. A column is usually composed of a sequential buffer where row values are stored contiguously and an optional bitmask to represent the nullability/validity of each value:

(a) Logical and (b) physical representation of an example dataset.

The Arrow and Velox Vectors formats already had compatible layout representations for scalar fixed-size data types (such as integers, floats, and booleans) and dictionary-encoded data. However, there were incompatibilities in string representation and container types such as arrays and maps, and a lack of support for constant and run-length-encoded (RLE) data.

StringView – strings

Arrow’s typical string representation uses the variable-sized element layout, which consists of one contiguous buffer containing the string contents (the data), and one buffer marking where each string starts (the offsets). The size of a string i can be obtained by subtracting offsets[i+1] by offsets[i]. This is equivalent to representing strings as an array of characters:

Arrow original string representation.

While Arrow’s representation stands out in simplicity, we found through a series of experiments that the following alternate string representation (which is now referred to as StringView) provides compelling properties that are important for efficient string processing:

New StringView representation in Arrow 15.

In the new representation, the first four bytes of the view object always contain the string size. If the string is short (up to 12 characters), the contents are stored inline in the view structure. Otherwise, a prefix of the string is stored in the next four bytes, followed by the buffer ID (StringViews can contain multiple data buffers) and the offset in that data buffer.

The benefits of this layout are:

Small strings of up to 12 bytes are fully inlined within the views buffer and can be read without dereferencing the data buffer. This increases memory locality as the typical cache miss of accessing the data buffer is avoided, increasing performance.
Since StringViews store a small (four bytes) prefix with the view object, string comparisons can fail-fast and, in many cases, avoid accessing the data buffer. This property speeds up common operations such as highly selective filters and sorting.
StringView gives developers more flexibility on how string data is laid out in memory. For example, it allows for certain common string operations, such as ????() and ??????(), to be executed zero-copy by only updating the view object.
Since StringView’s view object has a fixed size (16 bytes), StringViews can be written out of order (e.g., first writing StringView at position 2, then 0 and 1).

Besides these properties, we have found that other modern processing engines and libraries like Umbra and DuckDB follow a similar string representation approach, and, consequently, also used to deviate from Arrow. In Arrow 15, StringView has been added as a supported layout and can now be used to efficiently transfer string batches across these systems.

ListView – variable-sized containers

Variable-size containers like arrays and maps are represented in Arrow using one buffer containing the flattened elements from all rows, and one offsets buffer marking where the container on each row starts, similar to the original string representation. The number of elements a container on row i stores can be obtained by subtracting offsets[i+1] by offsets[i]:

Arrow original list representation.

To efficiently support execution of vectorized conditionals (e.g., IF and SWITCH operations), the Velox Vectors layout has to allow developers to write columns out of order. This means that developers can, for example, first write all even row records then all odd row records without having to reorganize elements that have already been written.

Primitive types can always be written out of order since the element size is constant and known beforehand. Likewise, strings can also be written out of order using StringView because the string metadata objects have a constant size (16 bytes), and string contents do not need to be written contiguously. To increase flexibility and support out-of-order writes for the remaining variable-sized types in Velox, we decided to keep both lengths and offsets buffers:

New ListView representation in Arrow 15.

To bridge the gap, a new format called ListView has been added to Arrow 15. It allows the representation of variable-sized elements that have both lengths and offsets buffers.

Beyond allowing for efficient execution of conditionals, ListView gives developers more flexibility to slice and rearrange containers (e.g., operations like slice() and trim_array() can be implemented zero-copy), other than allowing for containers with overlapping ranges of elements.

REE – more encodings

We have also added two additional encoding formats commonly found in data warehouse workloads into Velox: constant encoding, to represent that all values in a column are the same, typically used to represent literals and partition keys; and RLE, to compactly represent consecutive runs of the same element.

Upon discussion with the community, it was decided to add the REE format to Arrow. The REE format is a slight variation of RLE that, instead of storing the lengths of each run, stores the offset in which each run ends, providing better random-access support. With REEs it is also possible to represent constant encoded values by encoding them as a single run whose size is the entire batch.

Composability is the future of data management

Converging Arrow and Velox’s memory layout is an important step towards making data management systems more composable. It enables systems to combine the power of Velox’s state-of-the-art execution with the widespread industry adoption of Arrow’s standard, resulting in a more efficient and seamless cooperation. The new extensions are already seeing adoption in libraries like PyArrow and Polars and within Meta. In the future, it will allow more efficient interplay between projects like Apache Gluten (which uses Velox internally) and PySpark (which consumes Arrow), for example.

We envision that fragmentation and duplication of work can be reduced by decomposing data systems into reusable components which are open source and built based on open standards and APIs. Ultimately, we hope this work will help provide the foundation required to accelerate the pace of innovation in data management.

Acknowledgments

This format alignment was only possible due to a broad collaboration across different groups. A special thank you to Masha Basmanova, Orri Erling, Xiaoxuan Meng, Krishna Pai, Jimmy Lu, Kevin Wilfong, Laith Sakka, Wei He, Bikramjeet Vig, and Sridhar Anumandla from the Velox team at Meta; Felipe Carvalho, Ben Kietzman, Jacob Wujciak-Jens, Srikanth Nadukudy, Wes McKinney, and Keith Kraus from Voltron Data; and the entire Apache Arrow community for the insightful discussions, feedback, and receptivity to new ideas.

Meta loves Python

Mon, 12 Feb 2024 15:00:00 +0100

By now you’re already aware that Python 3.12 has been released. But did you know that several of its new features were developed by Meta?

Meta engineer Pascal Hartig (@passy) is joined on the Meta Tech Podcast by Itamar Oren and Carl Meyer, two software engineers at Meta, to discuss their teams’ contributions to the latest Python release, including new hooks that allow for custom JITs like Cinder, Immortal Objects, improvements to the type system, faster comprehensions, and more.

Learn how and why they built these new features for Python and how they worked with and engaged with the Python community.

Download or listen to the episode below:

[embedded content]

You can also find the episode wherever you get your podcasts, including:

Spotify
Apple Podcasts
PocketCasts
Castro
Overcast

The Meta Tech Podcast is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.

Send us feedback on Instagram, Threads, or X.

And if you’re interested in learning more about career opportunities at Meta visit the Meta Careers page.

Simple Precision Time Protocol at Meta

Wed, 07 Feb 2024 18:00:00 +0100

While deploying Precision Time Protocol (PTP) at Meta, we’ve developed a simplified version of the protocol (Simple Precision Time Protocol – SPTP), that can offer the same level of clock synchronization as unicast PTPv2 more reliably and with fewer resources.
In our own tests, SPTP boasts comparable performance to PTP, but with significant improvements in CPU, memory, and network utilization.
We’ve made the source code for the SPTP client and server available on GitHub.

We’ve previously spoken in great detail about how Precision Time Protocol is being deployed at Meta, including the protocol itself and Meta’s precision time architecture.

As we deployed PTP into one of our data centers, we were also evaluating and testing alternative PTP clients. In doing so, we soon realized that we could eliminate a lot of complexity in the PTP protocol itself that we experienced during data center deployments while still maintaining complete hardware compatibility with our existing equipment.

This is how the idea of Simple Precision Time Protocol (SPTP) was born.

But before we dive under the hood of SPTP we should explore why the IEEE 1588 G8265.1 and G8275.2 unicast profiles (here, we just call them PTP) weren’t a perfect fit for our data center deployment.

PTP and its limitations

Excessive network communication

A typical IEEE 1588-2019 two-step PTPv2 unicast UDP flow consists of the following exchange:

Figure 1: Typical two-step PTPv2 exchange.

This sequence repeats either in full or in part depending on the negotiation result. The exchange shown is one of many possible combinations. It may involve additional steps such as grant cancellation, grand cancellation acknowledgements, and so on.

The frequency of these messages may vary depending on the implementation and configuration. After completing negotiation, the frequency of some messages can change dynamically.

This design allows for a lot of flexibility, especially for less powerful equipment where resources are limited. In combination with multicast, it allows us to support a relatively large number of clients using either very old or embedded devices. For example, a PTP server can reject the request or confirm a less frequent exchange if the resources are exhausted.

This design, however, leads to excessive network communication, which is particularly visible on a time appliance serving a large number of clients.

State machine

Due to the “subscription” model, both the PTP client and the server have to keep the state in memory. This approach comes with the tradeoffs such as:

Excessive usage of resources such as memory and CPU.
Strict capacity limits that mean multicast support is required for large numbers of clients.
Code complexity.
Fragile state transitions.

These issues can manifest, for example, in so-called abandoned syncs – situations where the work of a PTP client is interrupted (either forcefully stopped or crashed). Because the PTP server didn’t receive a cancellation signaling message it will keep sending sync and followup packets until the subscription expires (which may take hours). This leads to additional complexity and fragility in the system.

There are additional protocol design side effects such as:

An almost infinite Denial of Service Attack (DoS) amplification factor.
Server-driven communication with little control by the client.
Complete trust in the validity of server timestamps.
Asynchronous path delay calculations.

In data centers, where communication is typically driven by hundreds of thousands of clients and multicast is not supported, these tradeoffs are very limiting.

SPTP

True to its name, SPTP significantly reduces the number of exchanges between a server and client, allowing for much more efficient network communication.

Exchange

Figure 2: Typical SPTP exchange.

In a typical SPTP exchange:

The client sends a delay request.
The server responds with a sync.
The server sends a followup/announce.

The number of network exchanges is drastically reduced. Instead of 11 different network exchanges as shown on Figure 1 and the requirement for client and server state machines for the duration of the subscription, there are only three packets exchanged and no state needs to be preserved on either side. In the simplified exchange, every packet has an important role:

Delay request

A delay request initiates the SPTP exchange. It’s interpreted by a server not only as a standard delay request containing the correction field (CF1) of the transparent clock, but also as a signal to respond with sync and followup packets. Just like in a two-step PTPv2 exchange, it generates T3 upon departure from the client side and T4 upon arrival on the server side.

To distinguish between a PTPv2 delay request and a SPTP delay request, the PTP profile Specific 1 flag must be set by the client.

Sync

In response to a delay request, a sync packet would be sent containing the T4 generated at an earlier stage. Just like in a regular two-step PTPv2 exchange, a sync packet will generate a T1 upon departure from the server side. While in transit, the correction field of the packet (CF2) is populated by the network equipment.

Followup/announce

Following the sync packet, an announce packet is immediately sent containing T1 generated at a previous stage. In addition, the correction filed from the Delay Request field is populated by the CF1 value collected at an earlier stage.

The announce packet also contains typical PTPv2 information such as clock class, clock accuracy, and so on. On the client side, the arrival of the packet generates the T2 timestamp.

After a successful SPTP exchange, default two-step PTPv2 formulas for mean path delay and clock offset must be applied:

mean_path_delay = ((T4 – T3) + (T2-T1) – CF1 -CF2)/2

clock_offset = T2 – T1 – mean_path_delay

After every exchange the client has access to the announce message attributes such as time source, clock quality, etc., as well as the path delay and a calculated clock offset after every exchange with every server. And, because the exchange is client-driven, the offsets could be calculated at the exact same time. This avoids a situation where a client is following a faulty server and has no chance of detecting it.

Figure 3: Client following faulty Time Server 2 based on announce.

Reliability

We can also provide stronger reliability guarantees by using multi-clock reliance.

In our implementation for precision time synchronization, we provide time as well as a window of uncertainty (WOU) to the consumer application via the fbclock API. As we described in a previous blog post on how PTP is being deployed at Meta the WOU is based on the observation of time sync errors for the minimum duration to have stationarity of the state of the system.

In addition, we’ve established a method based on a collection of clocks that each client can access for timing information that we call a clock ensemble. The clock ensemble operates in two modes, steady state and transient; where steady state is during normal operation and transient is in the case of holdover.

However, with a pool of N clocks, C, forming the clock ensemble, the question becomes which clocks to select for determining robustness and accurate timing information. Clocks that are not accurate are rejected (C_reject) and, thus, our ensemble size falls to N = C_total – C_reject. We employ two stages, one that is based on each individual clock, and the second that acts on the collection of valid clocks in the ensemble.

The first stage observes the previous measurements of each individual clock, where the main criteria is to reject outliers in the previous states of the clock. Once this criterion threshold is exceeded, the entire clock is rejected from the valid clock ensemble pool. This is based off Chauvenet’s criterion, where the criterion is a probability band that is centered on the mean of the clock outputs (assuming a normal distribution during steady state). Based on the stationarity tests, we use a sample size of 400 previous clock outputs and calculate a maximum allowable deviation.

For example:

, where is the current clock output, is the clock sample mean, and is the clock set standard deviation.

We find the probability that the current clock output is in disagreement with the previous 400 samples:

Based on a window size of 400 previous samples, the maximum allowed deviation is:

Now, the clock outputs are tested against this value. If they exceed the they are rejected, an alert is raised, and a threshold counter is incremented. Once the rejection threshold is reached for an individual clock, this clock is entirely rejected.

Now, we enter the second stage of verifying the clock ensemble composed of the valid clocks. The second stage forms a weighted average of the non-rejected clocks in the valid clock ensemble, where each clock in the ensemble is reported as its sample size, mean, and variance. The average of the clocks’ means is the weighted average, where the weights are inversely proportional to the mean absolute deviations reported by each clock after applying Chauvenet’s criterion.

Now we can report the mean and variance of the clock ensemble, ensuring the clocks contained therewith are valid and not providing erroneous values. The confidence interval is scaled with the number of good clocks in the ensemble, where the higher the number of valid clocks out of the total clocks provides greater reliability.

For a number of hosts, we show that the distribution of clocks falls within the following heatmap:

Figure 4: Offset distribution overlay of multiple clocks.

We calculate the variance, , of each individual clock’s observations, then we calculate a weighted mean, , taking into consideration the reciprocal of each clock’s variance as the weight.

Due to independence of clocks, the variance of the weighted sum, , is:

In summary, we collect samples from a number of clock sources that form our clock ensemble. The overall precision and reliability of the provided data by SPTP is a function of the number of reliable and in distribution clocks forming the clock ensemble.

A future post will focus on this specifically.

SPTP’s performance

Let’s explore performance of the SPTP versus PTP.

Initial deployments to a single client confirmed no regression in the precision of the synchronization:

Figure 5: Clock offset after switching from ptp4l and SPTP.

Repeating the same measurement after migration to SPTP produces a very similar result, only marginally different due to a statistical error:

Figure 6: P99.99 offset collected from over 100000 SPTP clients.

With large-scale deployment of our implementations, we can confirm resource utilization improvements.

We noticed that due to the difference in multi-server support, the performance gains vary significantly depending on the number of tracked time servers.

For example, with just a single time appliance serving the entire network there are significant improvements across the board. Most notably over 40 percent CPU, 70 percent memory, and 50 percent network utilization improvements:

Figure 7: Packets per second with ptp4l (green) vs SPTP (blue).

The next steps for SPTP at Meta

Since SPTP can offer the exact same level of synchronization with a lot fewer resources consumed, we think it’s a reasonable alternative to the existing unicast PTP profiles.

In a large-scale data center deployment, it can help to combat frequently changing network paths and create savings in terms of network traffic, memory usage, and number of CPU cycles.

It will also eliminate a lot of complexity inherited from multicast PTP profiles, which is not necessarily useful in the trusted networks of the modern data centers.

It should be noted that SPTP may not be suitable for systems that still require subscription and authentication. But this could be solved by using PTP TLVs (type-length-value).

Additionally, by removing the need for subscriptions, it’s possible to observe multiple clocks – which allows us to provide higher reliability by comparing the time sync from multiple sources at the end node.

SPTP can offer significantly simpler, faster, and more reliable synchronization. Similar to G.8265.1 and G.8275.2 it provides excellent synchronization quality using a different set of parameters. Simplification comes with certain tradeoffs, such as missing signaling messages, that users need to be aware of and decide which profile is the best for them.

Having it standardized and assigned a unicast profile identifier will encourage wider support, adoption, and popularization of PTP as a default precise time synchronization protocol.

The source code for the SPTP client and the server can be accessed on our GitHub page.

Acknowledgements

We would like to thank Alexander Bulimov, Vadim Fedorenko, and Mike Lambeta for their help implementing the code and the math for this article.

DotSlash: Simplified executable deployment

Tue, 06 Feb 2024 15:00:00 +0100

We’ve open sourced DotSlash, a tool that makes large executables available in source control with a negligible impact on repository size, thus avoiding I/O-heavy clone operations.
With DotSlash, a set of platform-specific executables is replaced with a single script containing descriptors for the supported platforms. DotSlash handles transparently fetching, decompressing, and verifying the appropriate remote artifact for the current operating system and CPU.
At Meta, the overwhelming majority of DotSlash files are generated and committed to source control via automation, so we are also releasing a complementary GitHub Action to assemble a comparable setup outside of Meta.
DotSlash is written in Rust for performance and is cross-platform.

At Meta, we have a vast array of first-party and third-party command line tools that need to be available across a diverse range of developer environments. Reliably getting the appropriate version of each tool to the right place can be a challenging task.

For example, the source code for many of our first-party tools lives alongside the projects that leverage them inside our massive monorepo. For such tools, the standard practice is to use buck2 run to build and run executables from source, as necessary. This has the advantage that tools and the projects that use them can be updated atomically in a single commit.

While we use extensive caching and remote execution to provide our developers with fast builds, there will always be cases where buck2 run is going to be considerably slower than running the prebuilt binary directly. While we leverage a virtual filesystem that reduces the drawbacks of checking large binaries into source control compared to a traditional physical filesystem, there are still pathological cases that are best avoided by keeping such files out of the repository in the first place. (This practice also eliminates a large class of code provenance issues.)

Further, not everything we use is built from source, nor do all of our tools live in source control. For example, there is the case of buck2 itself, which needs to be pre-built for developers and readily available on the $PATH for convenience. For core developer tools like Buck2 and Sapling, we use a Chef recipe to deploy new versions, installing them in /usr/local/bin (or somewhere within the appropriate %PATH$% on Windows) across a variety of developer environments.

While this approach is reasonable for commonly-used executables, it is not a great fit for the long tail of tools. That is, while it might be convenient to install everything a developer might need in /usr/local/bin by default, this could easily add up to tens or hundreds of gigabytes of disk, very little of which will end up being executed, in practice. In turn, this makes Chef runs more expensive and prone to failure.

Introducing DotSlash

DotSlash attempts to solve many of the problems described in the previous section. While we do not claim it is a silver bullet, we have found it to be the right solution for many of our internal use cases. At Meta, DotSlash is executed hundreds of millions of times per day to deliver a mix of first-party and third-party tools to end-user developers as well as hermetic build environments.

The idea is fairly simple: we replace the contents of a set of platform-specific, heavyweight executables with a single lightweight text file that can be read by the dotslash command line tool (which must be installed on the user’s $PATH). We call such a file a DotSlash file. It contains the information DotSlash needs to fetch and run the executable it replaces for the host platform. By convention, a DotSlash file maintains the name of the original file rather than calling attention to itself via a custom file extension. Instead, it aspires to be a transparent wrapper for the original executable. To that end, a DotSlash file is required to start with #!/usr/bin/env dotslash (even on Windows) to help maintain this illusion.

The following is a hypothetical DotSlash file named node that is designed to run v18.19.0 of Node.js. Note that users across x86 Linux, x86 macOS, and ARM macOS can all run the same DotSlash file, as DotSlash will take care of doing the work to select the appropriate executable for the host on which it is being run. In this way, DotSlash simplifies the work of cross-platform releases:

In this example, the workflow DotSlash runs through when executing node looks like:

See the How DotSlash Works documentation for details.

Because of how #! works on Mac and Linux, when a user runs ./node --version, the invocation effectively becomes dotslash ./node --version. DotSlash requires that its first argument is a file that starts with #!/usr/bin/env dotslash, as mentioned above. Once it verifies the header, it uses a lenient JSON parser to read the rest of the file. DotSlash finds the entry in the "platforms" section that corresponds to the host it is running on.

DotSlash uses the information in this entry and hashes it to compute a corresponding file path (that doubles as a key) in the user’s local DotSlash cache. DotSlash attempts to exec the corresponding file, replacing argv0 with the path to the DotSlash file and forwarding the remaining command line arguments (--version, in this example) to the exec invocation.

If the target executable is in the cache, the user immediately runs Node.js as originally intended. In the event of a cache miss (indicated by exec failing with ENOENT), DotSlash uses the information from the DotSlash file to determine the URL it should use to fetch the artifact containing the executable as well as the size and digest information it should use to verify the contents. If this succeeds, the verified artifact is atomically mv‘d into the appropriate location in the DotSlash cache and the exec invocation is performed again. Note that DotSlash uses advisory file locking to avoid making duplicate requests even if DotSlash files requiring the same artifact are run concurrently.

Note that it is common to have multiple DotSlash files refer to the same artifact, such as a .tar.zst file, while each DotSlash file maps to a distinct entry within the archive. For example, suppose node-v18.19.0-darwin-arm64.tar.gz is a compressed tar file that contains many entries, including node , npm , and npx. The DotSlash file for node would be as follows:

#!/usr/bin/env dotslash
{
  "name": "node-v18.19.0",
  "platforms": {
    "macos-aarch64": {
      "size": 40660307,
      "hash": "blake3",
      "digest": "6e2ca33951e586e7670016dd9e503d028454bf9249d5ff556347c3d98c347c34",
      // Note the difference from the previous example where "format": "zst" has been
      // replaced with "format": "tar.gz", which specifies what type of decompression
      // logic to use as well as the path within the decompressed archive to run when
      // this DotSlash file is executed.
      "format": "tar.gz",
      // Assuming node-v18.19.0-darwin-arm64.tar.gz contains node, npm, and npx in the
      // node-v18.19.0-darwin-arm64/bin/ folder within the the archive, the following
      // is the only line that has to change in the DotSlash file that represents
      // those other executables.
      "path": "node-v18.19.0-darwin-arm64/bin/node",
      "providers": [
        {
          "url": "https://nodejs.org/dist/v18.19.0/node-v18.19.0-darwin-arm64.tar.gz"
        }
      ]
    },
    /* other platforms omitted for brevity */
  }
}

As noted in the comments, the only change in the DotSlash files for npm and npx would be the "path" entry. Because the artifact for all three DotSlash files would be the same, whichever DotSlash file was run first would fetch the artifact and put it in the cache whereas all subsequent runs of any of the three DotSlash files would leverage the cached entry.

This technique is often used to ensure that a set of complementary executables is released together. Further, because the archive will be decompressed in its own directory, it may also contain resource files (or library files, such as .dll files that need to live alongside .exe files on Windows) that will be unpacked using the directory structure specified by the archive. This also makes DotSlash a good fit for distributing executables that are not binaries, but trees of script files, which is common for Node.js or Python.

Generating DotSlash files

At Meta, most DotSlash files are produced as part of an automated build pipeline. Our continuous integration (CI) system supports special configuration for DotSlash jobs where a user must specify:

A set of builds to run (these can span multiple platforms).
The resulting generated artifacts to publish to an internal blobstore.
The DotSlash files in source control to update with entries for the new artifacts.
The conditions under which the job should be triggered (this is analogous to workflow triggers on GitHub).

The result of such a job is a proposed change to the codebase containing the updated DotSlash files. At Meta, we call such a change a “diff,” though on GitHub, this is known as a pull request. Just like an ordinary human-authored diff at Meta, putting it up for review triggers a number of jobs that include linters, automated tests, and other tools that provide signal on the proposed change. For a DotSlash diff, if all of the signals come back clean, the diff is automatically committed to the codebase without further human intervention.

See the Generating DotSlash Files at Meta documentation for details.

The script we use to generate DotSlash files injects metadata about the build job that makes it straightforward to trace the provenance of the underlying artifacts. The following is a hypothetical example of a generated DotSlash file for the CodeCompose LSP built from source at a specific commit in clang-opt mode. Note the "metadata" entries in the DotSlash file will be ignored by the dotslash CLI, but we include them as structured data so they can be parsed by other tools to facilitate programmatic audits:

#!/usr/bin/env dotslash
// @generated SignedSource<>
// https://yarnpkg.com/package?name=signedsource can be used to
// generate and verify the above signature to flag tampering
// in generated code.
{
  "name": "code-compose-lsp",
  // Added by automation.
  "metadata": {
    "build-info": {
      "job-repo": "fbsource",
      "job-src": "dotslash/code-compose-lsp.star",
      // It is considered best practice to build the artifacts for
      // all platforms from the same commit within a DotSlash file.
      "commit": {
        "repo": "fbsource",
        "scm": "sapling",
        "hash": "0f9e3d9e189bf393f7f9d0b6879361cd76fcdcd0",
        "date": "2024-01-03 20:07:54 PST",
        "timestamp": 1704341274
      }
    }
  },
  "platforms": {
    "linux-x86_64": {
      "size": 2740736,
      "hash": "blake3",
      "digest": "fc8a3ade56a97a6e73469ade1575e8f8e33fda99fbf6df429d555e480d6453d0",
      "format": "zst",
      "providers": [
        {
          "type": "meta-cas",
          "key": "fc8a3ade56a97a6e73469ade1575e8f8e33fda99fbf6df429d555e480d6453d0:2740736"
        }
      ]
      // Added by automation.
      "metadata": {
        "build-command": [
          "buck2",
          "build",
          "--config-file",
          "//buildconfig/clang-opt",
          "//codecompose/lsp/cli:code-compose-lsp"
        ]
      }
    },
    // additional platforms...
  }
}

Without DotSlash, a developer would have to run buck2 build --config-file //buildconfig/clang-opt //codecompose/lsp/cli:code-compose-lsp to build and run the LSP from source, which could be a slow operation depending on the size of the build, the state of the build cache, etc. With DotSlash, the developer can run the optimized LSP as quickly as they can fetch and decompress it from the specified URL, which is likely much faster than doing a build.

Another thing you may have noticed about this example is that the "key" is not an ordinary URL, but an identifier that happens to be the concatenation of the BLAKE3 hash and the size of the specified artifact. This is because "type": "meta-cas" indicates that this artifact must be fetched via a custom provider in DotSlash, which is specialized fetching logic built into DotSlash that has its own identifier scheme. In this case, the artifact would be fetched from Meta’s in-house content-addressable storage (CAS) system, which uses the artifact hash+size as a key.

While we do not provide the code for the meta-cas provider in the open source version of DotSlash, we do include one custom provider out-of-the-box beyond the default http provider.

Using DotSlash with GitHub releases

While DotSlash is generally useful for fetching an executable from an arbitrary URL and running it, we have found the combination of DotSlash and CI to be particularly powerful. To that end, we include custom tooling to facilitate generating DotSlash files for GitHub releases. To ensure DotSlash can fetch artifacts from private GitHub repositories as well as GitHub Enterprise instances, DotSlash includes a custom provider for GitHub releases that includes an appropriate authentication token when fetching artifacts.

For example, suppose you have existing workflows for building your release artifacts and publish them via gh release upload. For simplicity, let’s assume these are named linux-release, macos-release, and windows-release. To create a single DotSlash file that includes the artifacts from all three platforms you would introduce a new GitHub Action that leverages the workflow_run trigger so it fires whenever one of these release workflows succeeds. (Note that GitHub’s documentation states: “You can’t use workflow_run to chain together more than three levels of workflows,” so check the depth of your workflow graph if your workflow is not firing.)

The .yml file to define the new workflow would look like this:

name: Generate DotSlash File
on:
  workflow_run:
    # These must match the names of the workflows that publish
    # artifacts to your GitHub Release.
    workflows: [linux-release, macos-release, windows-release]
    types:
      - completed
jobs:
  create-dotslash-file:
    name: Generating DotSlash File
    runs-on: ubuntu-latest
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    steps:
      - uses: facebook/dotslash-publish-release@v1
        env:
          # This is necessary because the action uses
          # `gh release upload` to publish the generated DotSlash file(s)
          # as part of the release.
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          # Additional file that lives in your repo that defines
          # how your DotSlash file(s) should be generated.
          config: .github/workflows/dotslash-config.json
          # Tag for the release to to target.
          tag: ${{ github.event.workflow_run.head_branch }}

Because inputs to GitHub Actions are limited to string values, facebook/dotslash-publish-release takes config, which is a path to a JSON file in the repo that supports a rich set of configuration options for generating the DotSlash files. The other required input is the ID of the release, which in GitHub, is defined by a Git tag. When the action is run, it will check to see whether all of the artifacts specified in the config are present in the release, and if so, will generate the appropriate DotSlash files and add them to the release.

For example, consider an open source project like Hermes where a release includes a number of platform-specific .tar.gz files, each containing a handful of executables (hermes, hdb, etc.). To create a separate an individual DotSlash file for each executable, the JSON configuration for the action would be:

{
  "outputs": {
    "hermes": {
      "platforms": {
        "macos-x86_64": {
          "regex": "^hermes-cli-darwin-",
          "path": "hermes"
        },
        "macos-aarch64": {
          "regex": "^hermes-cli-darwin-",
          "path": "hermes"
        },
        "linux-x86_64": {
          "regex": "^hermes-cli-linux-",
          "path": "hermes"
        },
        "windows-x86_64": {
          "regex": "^hermes-cli-windows-",
          "path": "hermes.exe"
        }
      }
    },
    "hdb": {
      "platforms": {
        "macos-x86_64": {
          "regex": "^hermes-cli-darwin-",
          "path": "hdb"
        },
        "macos-aarch64": {
          "regex": "^hermes-cli-darwin-",
          "path": "hdb"
        },
        "linux-x86_64": {
          "regex": "^hermes-cli-linux-",
          "path": "hdb"
        },
        "windows-x86_64": {
          "regex": "^hermes-cli-windows-",
          "path": "hdb.exe"
        }
      }
    },
    // Additional entries for hvm, hbcdump, and hermesc...
  }
}'

Each entry in "outputs" corresponds to the name of a DotSlash file that will be added to the release. The "platforms" for each entry defines the "platforms" that should be present in the generated DotSlash file. The action uses the "regex" to identify the file in the GitHub release that should be used as the backing artifact for the entry. Assuming the artifact is an “archive” of some sort (.tar.gz, .tar.zst, etc.), the "path" indicates the path within the archive that the DotSlash file should run.

In this particular case, Hermes does not provide an ARM-specific binary for macOS, so the "macos-aarch64" entry is the same as the "macos-x86_64"one. Though if that changes in the future, a simple update to "regex" to distinguish the two binaries is all that is needed.

Note that the action will take responsibility for computing the digest for each binary. In this example, the resulting DotSlash file for hermes would be:

#!/usr/bin/env dotslash
{
  "name": "hermes",
  "platforms": {
    "linux-x86_64": {
      "size": 47099598,
      "hash": "blake3",
      "digest": "8d2c1bcefc2ce6e278167495810c2437e8050780ebb4da567811f1d754ad198c",
      "format": "tar.gz",
      "path": "hermes",
      "providers": [
        {
          "url": "https://github.com/facebook/hermes/releases/download/v0.12.0/hermes-cli-linux-v0.12.0.tar.gz"
        },
        {
          "type": "github-release",
          "repo": "facebook/hermes",
          "tag": "v0.12.0",
          "name": "hermes-cli-linux-v0.12.0.tar.gz"
        }
      ],
    },
    // additional platforms...
  }
}

Note that there are two entries in the "providers" section for the Linux artifact. When DotSlash fetches an artifact, it will try the providers in order until one succeeds. Regardless of which provider is used, the downloaded binary will be verified against the specified "hash", "digest", and "size" values.

In this case, the first provider is an ordinary, public URL that can be fetched using curl --location, but the second is an example of a custom provider discussed earlier. The "type": "github-release" line indicates that the GitHub provider for DotSlash should be used, which shells out to the GitHub CLI (gh, which must be installed separately from DotSlash) to fetch the artifact instead of curl. Because facebook/hermes is a public GitHub repository, the first provider should be sufficient here. However, if the repository were private and the fetch required authentication, we would expect the first provider to fail and DotSlash would fallback to the GitHub provider. Assuming the user had run gh auth login in advance to configure credentials for the specified repo, DotSlash would be able to fetch the artifact using gh release download.

By publishing DotSlash files as part of GitHub releases, users can copy them to their own repositories to “vendor in” a specific version of your tool with minimal effect on their repository size, regardless of how large your releases might be.

Try DotSlash Today

Visit the DotSlash site for more in-depth documentation and technical details. The site includes instructions on Installing DotSlash so you can start playing with it firsthand.

We also encourage you to check out the DotSlash source code and provide feedback via GitHub issues. We look forward to hearing from you!

Improving machine learning iteration speed with faster application build and packaging

Mon, 29 Jan 2024 18:00:00 +0100

Slow build times and inefficiencies in packaging and distributing execution files were costing our ML/AI engineers a significant amount of time while working on our training stack.
By addressing these issues head-on, we were able to reduce this overhead by double-digit percentages.

In the fast-paced world of AI/ML development, it’s crucial to ensure that our infrastructure can keep up with the increasing demands and needs of our ML engineers, whose workflows include checking out code, writing code, building, packaging, and verification.

In our efforts to maintain efficiency and productivity while empowering our ML/AI engineers to deliver cutting-edge solutions, we found two major challenges that needed to be addressed head-on: slow builds and inefficiencies in packaging and distributing executable files.

The frustrating problem of slow builds often arises when ML engineers work on older (“cold”) revisions for which our build infrastructure doesn’t maintain a high cache hit rate, requiring us to repeatedly rebuild and relink many components. Moreover, build non-determinism further contributes to rebuilding by introducing inefficiencies and producing different outputs for the same source code, making previously cached results unusable.

Executable packaging and distribution was another significant challenge because, historically, most ML Python executables were represented as XAR files (self-contained executables) and it is not always possible to leverage OSS layer-based solutions efficiently (see more details below). Unfortunately, creating such executables can be computationally costly, especially when dealing with a large number of files or substantial file sizes. Even if a developer modifies only a few Python files, a full XAR file reassembly and distribution is often required, causing delays for the executable to be executed on remote machines.

Our goal in improving build speed was to minimize the need for extensive rebuilding. To accomplish this, we streamlined the build graph by reducing dependency counts, mitigated the challenges posed by build non-determinism, and maximized the utilization of built artifacts.

Simultaneously, our efforts in packaging and distribution aimed to introduce incrementality support, thereby eliminating the time-consuming overhead associated with XAR creation and distribution.

How we improved build speeds

To make builds faster we wanted to ensure that we built as little as possible by addressing non-determinism and eliminating unused code and dependencies.

We identified two sources of build non-determinism:

Non-determinism in tooling. Some compilers, such as Clang, Rustc, and NVCC, can produce different binary files for the same input, leading to non-deterministic results. Tackling these tooling non-determinism issues proved challenging, as they often required extensive root cause analysis and time-consuming fixes.
Non-determinism in source code and build rules. Developers, whether intentionally or unintentionally, introduced non-determinism by incorporating things like temporary directories, random values, or timestamps into build rules code. Addressing these issues posed a similar challenge, demanding a substantial investment of time to identify and fix.

Thanks to Buck2, which sends nearly all of the build actions to the Remote Execution (RE) service, we have been able to successfully implement non-determinism mitigation within RE. Now we provide consistent outputs for identical actions, paving the way for the adoption of a warm and stable revision for ML development. In practice, this approach will eliminate build times in many cases.

Though removing the build process from the critical path of ML engineers might not be possible in all cases, we understand how important it is to handle dependencies for controlling build times. As dependencies naturally increased, we made enhancements to our tools for managing them better. These improvements helped us find and remove many unnecessary dependencies, making build graph analysis and overall build times much better. For example, we removed GPU code from the final binary when it wasn’t needed and figured out ways to identify which Python modules are actually used and cut native code using linker maps.

Adding incrementality for executable distribution

A typical self-executable Python binary, when unarchived, is represented by thousands of Python files (.py and/or .pyc), substantial native libraries, and the Python interpreter. The cumulative result is a multitude of files, often numbering in the hundreds of thousands, with a total size reaching tens of gigabytes.

Engineers spend a significant amount of time dealing with incremental builds where packaging and fetching overhead of such a large executable surpasses the build time. In response to this challenge, we implemented a new solution for the packaging and distribution of Python executables – the Content Addressable Filesystem (CAF).

The primary strength of CAF lies in its ability to operate incrementally during content addressable file packaging and fetching stages:

Packaging: By adopting a content-aware approach, CAF can intelligently skip redundant uploads of files already present in Content Addressable Storage (CAS), whether as part of a different executable or the same executable with a different version.
Fetching: CAF maintains a cache on the destination host, ensuring that only content not already present in the cache needs to be downloaded.

To optimize the efficiency of this system, we deploy a CAS daemon on the majority of Meta’s data center hosts. The CAS daemon assumes multiple responsibilities, including maintaining the local cache on the host (materialization into the cache and cache GC) and organizing a P2P network with other CAS daemon instances using Owl, our high-fanout distribution system for large data objects. This P2P network enables direct content fetching from other CAS daemon instances, significantly reducing latency and storage bandwidth capacity.

In the case of CAF, an executable is defined by a flat manifest file detailing all symlinks, directories, hard links, and files, along with their digest and attributes. This manifest implementation allows us to deduplicate all unique files across executables and implement a smart affinity/routing mechanism for scheduling, thereby minimizing the amount of content that needs to be downloaded by maximizing local cache utilization.

While the concept may bear some resemblance to what Docker achieves with OverlayFS, our approach differs significantly. Organizing proper layers is not always feasible in our case due to the number of executables with diverse dependencies. In this context, layering becomes less efficient and its organization becomes more complex to achieve. Additionally direct access to files is essential for P2P support.

We opted for Btrfs as our filesystem because of its compression and ability to write compressed storage data directly to extents, which bypasses redundant decompression and compression and Copy-on-write (COW) capabilities. These attributes allow us to maintain executables on block devices with a total size similar to those represented as XAR files, share the same files from cache across executables, and implement a highly efficient COW mechanism that, when needed, only copies affected file extents.

LazyCAF and enforcing uniform revisions: Areas for further ML iteration improvements

The improvements we implemented have proven highly effective, drastically reducing the overhead and significantly elevating the efficiency of our ML engineers. Faster build times and more efficient packaging and distribution of executables have reduced overhead by double-digit percentages.

Yet, our journey to slash build overhead doesn’t end here. We’ve identified several promising improvements that we aim to deliver soon. In our investigation into our ML workflows, we discovered that only a fraction of the entire executable content is utilized in certain scenarios. Recognizing that, we intend to start working on optimizations to fetch executable parts on demand, thereby significantly reducing materialization time and minimizing the overall disk footprint.

We can also further accelerate the development process by enforcing uniform revisions. We plan to enable all our ML engineers to operate on the same revision, which will improve the cache hit ratios of our build. This move will further increase the percentage of incremental builds since most of the artifacts will be cached.

Lazy is the new fast: How Lazy Imports and Cinder accelerate machine learning at Meta

Thu, 18 Jan 2024 18:00:00 +0100

At Meta, the quest for faster model training has yielded an exciting milestone: the adoption of Lazy Imports and the Python Cinder runtime.
The outcome? Up to 40 percent time to first batch (TTFB) improvements, along with a 20 percent reduction in Jupyter kernel startup times.
This advancement facilitates swifter experimentation capabilities and elevates the ML developer experience (DevX).

Time is of the essence in the realm of machine learning (ML) development. The milliseconds it takes for an ML model to transition from conceptualization to processing the initial training data can dramatically impact productivity and experimentation.

At Meta, we’ve been able to significantly improve our model training times, as well as our overall developer experience (DevX) by adopting Lazy Imports and the Python Cinder runtime.

The time to first batch challenge

Batch processing has been a game changer in ML development. It handles large volumes of data in groups (or batches) and allows us to train models, optimize parameters, and perform inference more effectively and swiftly.

But ML training workloads are notorious for their sluggish starts. When we look to improve our batch processing speeds, time to first batch (TTFB) comes into focus. TTFB is the time elapsed from the moment you hit the “start” button on your ML model training to the point when the first batch of data enters the model for processing. It is a critical metric that determines the speed at which an ML model goes from idle to learning. TTFB can vary widely due to factors like infrastructure overhead and scheduling delays. But reducing TTFB means reducing the development waiting times that can often feel like an eternity to engineers – waiting periods that can quickly amass as expensive resource wastage.

In the pursuit of faster TTFB, Meta set its sights on reducing this overhead, and Lazy Imports with Cinder emerged as a promising solution.

The magic of Lazy Imports

Previously, ML developers explored alternatives like the standard LazyLoader in importlib or lazy-import`, to defer explicit imports until necessary. While promising, these approaches are limited by their much narrower scope, and the need to manually select which dependencies will be lazily imported (often with suboptimal results). Using these approaches demands meticulous codebase curation and a fair amount of code refactoring.

In contrast, Cinder’s Lazy Imports approach is a comprehensive and aggressive strategy that goes beyond the limitations of other libraries and delivers significant enhancements to the developer experience. Instead of painstakingly handpicking imports to become lazy, Cinder simplifies and accelerates the startup process by transparently deferring all imports as a default action, resulting in a much broader and more powerful deferral of imports until the exact moment they’re needed. Once in place, this method ensures that developers no longer have to navigate the maze of selective import choices. With it, developers can bid farewell to the need of typing-only imports and the use of TYPE_CHECKING. It allows a simple from __future__ import annotations declaration at the beginning of a file to delay type evaluation, while Lazy Imports defer the actual import statements until required. The combined effect of these optimizations reduced costly runtime imports and further streamlined the development workflow.

The Lazy Imports solution delivers. Meta’s initiative to enhance ML development has involved rolling out Cinder with Lazy Imports to several workloads, including our ML frameworks and Jupyter kernels, producing lightning-fast startup times, improved experimentation capabilities, reduced infrastructure overhead, and code that is a breeze to maintain. We’re pleased to share that Meta’s key AI workloads have experienced noteworthy improvements, with TTFB wins reaching up to 40 percent. Resulting time savings can vary from seconds to minutes per run.

These impressive results translate to a substantial boost in the efficiency of ML workflows, since they mean ML developers can get to the model training phase more swiftly.

The challenges of adopting Lazy Imports

While Lazy Imports’ approach significantly improved ML development, it was not all a bed of roses. We encountered several hurdles that tested our resolve and creativity.

Compatibility

One of the primary challenges we grappled with was the compatibility of existing libraries with Lazy Imports. Libraries such as PyTorch, Numba, NumPy, and SciPy, among others, did not seamlessly align with the deferred module loading approach. These libraries often rely on import side effects and other patterns that do not play well with Lazy Imports. The order in which Python imports could change or be postponed, often led to side effects failing to register classes, functions, and operations correctly. This required painstaking troubleshooting to identify and address import cycles and discrepancies.

Balancing performance versus dependability

We also had to strike the right balance between performance optimization and code dependability. While Lazy Imports significantly reduced TTFB and enhanced resource utilization, it also introduced a considerable semantic change in the way Python imports work that could make the codebase less intuitive. Achieving the perfect equilibrium was a constant consideration, and was ensured by limiting the impact of semantic changes to only the relevant parts that could be thoroughly tested.

Ensuring seamless interaction with the existing codebase required meticulous testing and adjustments. The task was particularly intricate when dealing with complex, multifaceted ML models, where the implications of deferred imports needed to be thoroughly considered. We ultimately opted for enabling Lazy Imports only during the startup and preparation phases and disabling it before the first batch started.

Learning curve

Adopting new paradigms like Lazy Imports can introduce a learning curve for the development team. Training ML engineers, infra engineers, and system engineers to adapt to the new approach, understand its nuances, and implement it effectively is a process in itself.

What is next for Lazy Imports at Meta?

The adoption of Lazy Imports and Cinder represented a meaningful enhancement in Meta’s AI key workloads. It came with its share of ups and downs, but ultimately demonstrated that Lazy Imports can be a game changer in expediting ML development. The TTFB wins, DevX improvements, and reduced kernel startup times are all tangible results of this initiative. With Lazy Imports, Meta’s ML developers are now equipped to work more efficiently, experiment more rapidly, and achieve results faster.

While we’ve achieved remarkable success with the adoption of Lazy Imports, our journey is far from over. So, what’s next for us? Here’s a glimpse into our future endeavors:

Streamlining developer onboarding

The learning curve associated with Lazy Imports can be a challenge for newcomers. We’re investing in educational resources and onboarding materials to make it easier for developers to embrace this game-changing approach.

Enhancing tooling

Debugging code with deferred imports can be intricate. We’re working on developing tools and techniques that simplify the debugging and troubleshooting process, ensuring that developers can quickly identify and resolve issues.

Community collaboration

The power of Lazy Imports lies in its adaptability and versatility. We’re eager to collaborate with the Python community – sharing insights, best practices, and addressing challenges together. Building a robust community that helps supporting paradigms and patterns that play well with Lazy Imports is one of our future priorities.

How Meta is advancing GenAI

Thu, 11 Jan 2024 18:00:00 +0100

What’s going on with generative AI (GenAI) at Meta? And what does the future have in store?

In this episode of the Meta Tech Podcast, Meta engineer Pascal Hartig (@passy) speaks with Devi Parikh, an AI research director at Meta. They cover a wide range of topics, including the history and future of GenAI and the most interesting research papers that have come out recently.

And, of course, they discuss some of Meta’s latest GenAI innovations, including:

Audiobox, a foundational model for generating sound and soundscapes using natural language prompts.
Emu, Meta’s first foundational model for image generation.
Purple Llama, a suite of tools to help developers safely and responsibly deploy GenAI models.

Download or listen to the episode below:

[embedded content]

You can also find the episode on various podcast platforms:

Spotify
PocketCasts
Apple Podcasts
Google Podcasts

The Meta Tech Podcast is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.

Send us feedback on Instagram, Threads, or X.

And if you’re interested in AI career opportunities at Meta visit the Meta Careers page.

How Meta built the infrastructure for Threads

Tue, 19 Dec 2023 18:01:00 +0100

On July 5, 2023, Meta launched Threads, the newest product in our family of apps, to an unprecedented success that saw it garner over 100 million sign ups in its first five days. A small, nimble team of engineers built Threads over the course of only five months of technical work. While the app’s production [...]

The post How Meta built the infrastructure for Threads appeared first on Engineering at Meta.

AI debugging at Meta with HawkEye

Tue, 19 Dec 2023 18:00:00 +0100

HawkEye is the powerful toolkit used internally at Meta for monitoring, observability, and debuggability of the end-to-end machine learning (ML) workflow that powers ML-based products.
HawkEye supports recommendation and ranking models across several products at Meta. Over the past two years, it has facilitated order of magnitude improvements in the time spent debugging production issues.
In this post, we will provide an overview of the end-to-end debugging workflows supported by HawkEye, components of the system, and the product surface for Meta product and monetization teams to debug AI model and feature issues.

Many of Meta’s products and services leverage ML for various tasks such as recommendations, understanding content, and generating content. Workflows to productionize ML models include data pipelines to get the information needed to train the models, training workflows to build and improve the models over time, evaluation systems to test the models, and inference workflows to actually use the models in Meta’s products. At any point in time multiple versions (snapshots) of a model could be hosted as A/B experiments to test their performance and accuracy.

Ensuring the robustness of predictions made by models is crucial for providing engaging user experiences and effective monetization. However, several factors can affect the accuracy of these predictions, such as the distribution of data used for training, inference-time inputs, the (hyper)parameters of the model, and the systems configuration. Identifying the root cause of any issue is a complex problem, especially given the scale of Meta’s models and data.

At Meta, we created the Prediction Robustness program to innovate on tools and services to ensure the quality of Meta products relying on ML model predictions. HawkEye is the powerful toolkit we built as a part of this effort for the monitoring, observability, and debuggability of ML-based products. HawkEye includes infrastructure for continuously collecting data on serving and training models, data generation, and analysis components for mining root causes. It supports UX workflows for guided exploration, investigation, and initiating mitigation actions.

HawkEye’s components.

Our approach to AI debugging

Prior to HawkEye’s development, identifying and resolving issues in production workflows for features and models required specialized knowledge and familiarity with the processes and telemetry involved. Additionally, it required substantial coordination across different organizations. To address these challenges, model and feature on-call engineers, who debug models and features, would utilize shared notebooks and code to do root cause analyses for small parts of the overall debugging process.

HawkEye implements a decision tree that streamlines this process while building the necessary components for continuous instrumentation and analysis layers to build the tree. HawkEye enables users to efficiently navigate the decision tree and quickly identify the root cause of complex issues. As a result, HawkEye has significantly reduced the time spent on debugging complex production issues, simplified operational workflows, and enabled non-experts to triage complex issues with minimal coordination and assistance.

HawkEye’s debugging workflows.

Operational debugging workflows typically begin with an alert triggered by a problem with a key metric for a product, or an anomaly in a model or feature (at serving or training time). Examples of detection mechanisms include model validation failures, model explosions in gradient/loss, prediction anomalies, and shifts in feature distribution. It’s worth noting that a top line anomaly debugging workflow may encompass all of the other workflows. HawkEye supports this workflow by providing guided exploration surfaces layered on top of the necessary components, allowing users to efficiently investigate and resolve issues.

Isolating top-line product issues to model snapshots

The initial step in investigating an anomaly in a product’s top-line metrics is to pinpoint the specific serving model, infrastructure, or traffic-related factors responsible for the degradation. This can be difficult because multiple models might be used for different product and user segments, A/B experiments, or composed predictions. Additionally, model traffic distribution can vary as experiments are scaled up or down.

HawkEye performs analysis and detection to identify models with prediction degradation correlated with the anomaly in the top-line metrics across all experiments. This enables on-call personnel to assess the quality of predictions for all models powering the affected product and/or user segment.

Product anomaly to model isolation.

Next, HawkEye enables on-calls to correlate top-line degradation with a timeline of snapshot rollouts for each model. The outcome from this step is typically a small set of suspect model snapshots in serving.

Model anomaly to snapshot isolation.

Once a model has been isolated, the next step is to determine if the root cause is a bad snapshot. Rolling back to an older snapshot can provide immediate mitigation, but outdated models can also lead to prediction robustness issues that deteriorate over time. It is essential to identify the underlying cause of a bad snapshot, such as training data or model problems.

Isolating prediction anomalies to features

Serving models can consume thousands of features (each with their own data pipelines) at very high request rates. At such scales, identifying the list of features responsible for prediction anomalies requires real-time analyses of model inputs and outputs. HawkEye uses model explainability and feature importance algorithms to localize prediction changes to subsets of features.

HawkEye samples model inputs and predictions for each serving model and computes correlations between time-aggregated feature distributions and prediction distributions to identify significant correlation structure during periods of degradation. This allows for real-time feature isolation.

In addition, HawkEye computes feature importance changes using feature ablation algorithms on serving model snapshots, which provides a stronger signal than correlational analysis but requires longer processing times due to the need to cover the entire feature distribution hyperspace. When the change in feature importance is significant for a particular feature, it indicates that the feature has a different impact on model predictions in the current snapshot compared to previous snapshots and is a candidate for investigation.

HawkEye presents the on-call with a ranked list of features (or the absence thereof) responsible for prediction anomalies, derived from its model explainability and feature importance analyses. This approach has reduced the time from triage to serving features by several orders of magnitude.

Model to feature isolation.

Isolating upstream causes of feature issues

A large-scale data processing setup consists of complex data transformation pipelines (which may include real-time joins), which create features, which are then stored in systems belonging to multiple infrastructure teams. HawkEye tracks the lineage of upstream data and pipelines, keeping track of statistical information about each feature’s upstream data dependencies and transformations that have occurred, including changes to transform code or configurations.

HawkEye facilitates root-cause analysis of feature problems through a visual workflow that moves upstream along the lineage graph and examines the statistics of related lineage nodes. This helps pinpoint the source of a problematic feature by correlating it with upstream data statistics along with a measure of confidence. Without relying on a model anomaly, HawkEye enables investigations into feature upstream issues from feature health alerts, enabling engineers to detect and correct faults before they impact the live models.

Feature upstream debugging.

Diagnosing model snapshots

Sometimes prediction errors are not due to the features, but result from issues with the model itself, which are a result of model training (e.g., hyperparameter-, architectural-, or training data-related). To determine whether there is an issue with the model, HawkEye compares the model’s current snapshot (specifically, its weights and biases) with previous snapshots that have been operationally stable. In normal scenarios, model parameters across snapshots show some degree of stationarity. However, significant differences indicate problems with either the training data or loss divergence (e.g., loss or gradient explosion) in the bad snapshot.

When a snapshot is published, HawkEye also runs inference on the snapshot using recent feature data, and captures neuron output and activation tensors in the forward pass. If it finds outputs that are NaNs or extreme values, it uses the activations to walk upstream and downstream on the model graph to associate the problem with feature(s) at the input layer, or the effect on predictions at the output layer. Walking the model graph also enables finding opportunities to improve model architecture (e.g., adding layer normalization, clipping, or other operators).

Analyzing and walking through the large model graph helps identify the cause of bad snapshots quickly and proactively. HawkEye provides model graph visualization with tensor level stats, along with graph walking visualizations. Lowering and post-quantization analyses, as well as analyses while training, are beyond the scope of this discussion.

Model snapshot debugging.

Diagnosing training data issues

HawkEye enables on-calls to easily navigate from a suspect snapshot to the training pipeline that produced it and inspect statistical issues with training data (at a partition granularity) and training-time metrics (e.g., learning curve, loss function evolution). This feature enables the on-call to identify data drift between training and serving label distributions, and anomalies with labels (e.g., label imbalance). Such issues can happen for several hard-to-diagnose reasons, ranging from the complex data pipelines behind training data, to data corruptions. HawkEye provides observability into the upstream data pipelines and their health and helps locate the root cause of bad training data. It also provides quick mitigation capabilities, such as pausing affected pipelines if upstream data is bad, and prevents further production impact due to training snapshots.

Training data debugging.

Next steps for HawkEye

Moving forward, we will continue to keep track of emerging root causes in production issues, adding detailed analyses to HawkEye workflows and the product surface. We are also piloting extensibility features in the product and backend components, so that product teams can add generic and specialized debugging workflows to HawkEye, while the community benefits from some of these workflows.

Acknowledgements

We would like to thank all current and past members of the HawkEye team and its partners that helped make this effort a success. A special thank you to Girish Vaitheeswaran, Atul Goyal, YJ Liu, Shiblee Sadik, Peng Sun, Adwait Tumbde, Karl Gyllstrom, Sean Lee, Dajian Li, Yu Quan, Robin Tafel, Ankit Asthana, Gautam Shanbhag, and Prabhakar Goyal.

Building end-to-end security for Messenger

Thu, 07 Dec 2023 03:00:00 +0100

We are beginning to upgrade people’s personal conversations on Messenger to use end-to-end encryption (E2EE) by default
Meta is publishing two technical white papers on end-to-end encryption:
- Our Messenger end-to-end encryption whitepaper describes the core cryptographic protocol for transmitting messages between clients.
- The Labyrinth encrypted storage protocol whitepaper explains our protocol for end-to-end encrypting stored messaging history between devices on a user’s account.

Today, we’re announcing that we’ve begun to upgrade people’s personal conversations on Messenger to use E2EE by default. Our aim is to ensure that everyone’s personal messages on Messenger can only be accessed by the sender and the intended recipients, and that everyone can be sure the messages they receive are from an authentic sender.

This is the most significant milestone yet for this project, which began in earnest after Mark Zuckerberg outlined his vision for it in 2019. Bringing E2EE to Messenger has been a complex process, with every feature and product goal revealing further challenges that required careful consideration.

Enabling E2EE on Messenger meant fundamentally rebuilding many aspects of the application its protocols to improve privacy, security, and safety while simultaneously maintaining the features that have made Messenger so popular.

Why we’re bringing E2EE to Messenger

Messenger first built end-to-end encrypted chats in 2016 as a feature called Secret Conversations. Since then, we’ve learned a great deal in regards to rolling out E2EE for a wider user base. For example, we recently published an updated white paper, “Meta’s Approach to Safer Private Messaging on Messenger and Instagram Direct Messaging,” that sets out the industry-leading safety systems and tools available on Messenger.

End-to-end encryption isn’t about the technology at its core. It’s about protecting people’s communications, so they can feel safe expressing themselves with their friends and loved ones. To achieve this, we typically focus on two aims:

Only the sender and recipients of an E2EE message can see its contents.
Nobody (not even Meta) should be able to forge messages to appear to have been sent from someone they weren’t.

In other words, the aim is that only you and the people you’re corresponding with can read your messages – not even the app’s provider (in this case, Meta) could interfere with their contents – and you can be confident in who sent the messages.

Understanding these goals

These two aims are broad. However, when we reflect on our approach to addressing them, they end up breaking down into eight overlapping concepts that we believe achieve a cohesive approach to meaningful E2EE:

1. Confidentiality in transit

Message contents are authentically and securely transmitted between your devices and those of the people you’re talking to. This is, perhaps, the primary goal of E2EE, and is where much E2EE research and design work is targeted, such as the Signal protocol we use in our products (such as WhatsApp, Messenger, and Instagram Direct), or the IETF’s Messaging Layer Security protocol, which we helped to design and was recently standardized.

2. Confidentiality in storage

Typically, E2EE messaging services rely on local storage and encryption keys to secure encrypted messages. Messenger, however, has a long history of storing people’s messages for them so that they can access them whenever they need without having to store them locally. That’s why we’ve designed a server-based solution where encrypted messages can be stored on Meta’s servers while only being readable using encryption keys under the user’s control.

3. Control over endpoints

For something to be “end-to-end encrypted,” it is necessary to have a notion of what the “ends” are. For an E2EE messaging app this means that users should have the ability to verify and manage their set of endpoint devices that are receiving their messages, as well as visibility into when this set of devices changes.

4. Private feature designs

Product features in an E2EE setting typically need to be designed to function in a device-to-device manner, without ever relying on a third party having access to message content. This was a significant effort for Messenger, as much of its functionality has historically relied on server-side processing, with certain features difficult or impossible to exactly match with message content being limited to the devices.

5. Logging limitations

Maintaining the confidentiality of message content extends to avoiding accidentally leaking it back to us in telemetry. In a product of Messenger’s scale, complexity, and iteration speed, this creates particular challenges as telemetry is vital in ensuring that the product is working well for people, and in debugging when things go wrong.

6. Application security

It’s a common saying that, “You can’t have privacy without security,” and this is absolutely true in the end-to-end encrypted domain. Security is important for any consumer product, but E2EE exacerbates the challenges in two important ways: it reduces the provider’s ability to protect the user from attacks, and, in fact, it expands the threat model to include the service provider itself. Our security team is keenly aware of these challenges and works closely with product teams to secure design and implementation of E2EE functionality. For example, we’ve been working to improve the memory safety of our apps; and our E2EE surfaces are covered by our bug bounty program.

7. Being deliberate about what’s being protected

E2EE protects message content. However, this is a complex term to define, and, while certain things are relatively clear – such as the strings contained in a text message, or a photograph sent from your camera roll – in a sufficiently complex messaging application, it turns out there’s a surprisingly large grey area. Our focus is on determining the appropriate boundaries, ensuring that we remain true to our commitments, setting the correct user expectations, and avoiding creating meaningful privacy risks, while still ensuring that the product retains its usefulness to our users.

8. Third-party scrutiny

E2EE implies confidentiality even if the provider wants to access the contents of a communication. We aim for this to be verifiable externally, and, to this end, have published two white papers to provide transparency into our operations. We describe the properties of some features in our Help Center, and we encourage submissions to our bug bounty program. Throughout the project, we have consulted with a diverse range of external parties to ensure that we’re making the right set of tradeoffs. To improve people’s ability to scrutinize us, we also support the Code Verify browser extension for our web-based end-to-end encrypted messaging, to give security researchers greater confidence that the code version that they are assessing is being used globally.

High-level approach

With all of this in mind, our high-level approach was to build off of Meta’s prior learnings in E2EE, from both WhatsApp and Messenger’s Secret Conversations, and then to iterate on our most challenging problems.

Working from the baseline of these two approaches, we then had to address a series of significant technical challenges, including:

Multi-device capability: Messenger’s model of multi-device reflects the Facebook network, which allows people to authenticate on new devices with a username and password, in order to send and receive messages. Since WhatsApp’s multi-device capability relies on a single primary device that must cryptographically authenticate companion devices, we adopted the Secret Conversations model of multi-device, while ensuring that it functions well for all of our users.
Feature support: Messenger has a number of messaging features that either don’t exist in WhatsApp, or function differently. Some of these just had to be rebuilt from scratch, while others required deploying new applied privacy technology. For example, we used OHAI and Anonymous Credentials to support searches of Facebook’s first-party sticker library, without revealing to us who is sending them.
Message history: Messenger has always allowed clients to operate off of a small stored local cache, relying on a server-side database for their message history. Neither WhatsApp nor Secret Conversations operated in this manner, and we didn’t want all users to have to rely on a device-side storage system. Instead, we designed an entirely new encrypted storage system called Labyrinth, with ciphertexts uploaded to our servers and loaded on-demand by clients, while operating in a multi-device manner and supporting key rotation when clients are removed.
Web support: We needed to support E2EE within our existing web surfaces, including the main Facebook site. The Web platform carries significantly different constraints from native apps, meaning that we needed to take custom approaches to many different aspects of the product. Further, Web users often add and remove devices in very different patterns from mobile-only users, increasing the complexity of our multi-device challenge.

Learn more about E2EE on Messenger

Today, we are sharing two white papers:

Our Messenger end-to-end encryption whitepaper, which describes the core cryptographic protocol for transmitting messages between clients.
The Labyrinth encrypted storage protocol whitepaper, describing our protocol for end-to-end encrypting stored messages history between devices on a user’s account.

These add to a number of publications that we have shared which cover Messenger’s E2EE, including:

Our recently updated Safety whitepaper
The independent E2EE Human Rights Impact Assessment
Our Security Principles whitepaper
The Code Verify browser extension

Beyond E2EE for Messenger

The journey to bring E2EE to Messenger has been a long one, but it’s not yet finished. While we are globally launching default E2EE for personal one-to-one messages on Messenger, we are still in the testing phase for group messaging and some other products, like Instagram Direct Messages. On Instagram, we are currently testing “disappearing messages” for one-to-one Instagram Direct conversations in select countries. Disappearing messages are ephemeral and, as with those in Messenger, expire 24 hours after being sent. They are built leveraging our E2EE infrastructure and provide an increased level of privacy. We plan to expand this work as well as conduct additional testing around E2EE on Instagram over the next year.

Writing and linting Python at scale

Tue, 21 Nov 2023 18:00:00 +0100

Python plays a big part at Meta. It powers Instagram’s backend and plays an important role in our configuration systems, as well as much of our AI work.

Meta even made contributions to Python 3.12, the latest version of Python.

On this episode of the Meta Tech Podcast, Meta engineer Pascal Hartig (@passy) is joined by Amethyst Reese, a production engineer at Meta, to discuss all things Python at Meta.

They discuss: how Meta’s Python Foundation Team works to improve the developer experience of everyone working with Python at Meta; Fixit 2, Meta’s recently open-sourced linter framework; and what exactly the role of production engineer at Meta entails.

For more from Amethyst, be sure to read her blog post: Fixit 2: Meta’s next-generation auto-fixing linter

Download or listen to the episode below:

[embedded content]
You can also listen to the episode wherever you get your podcasts:

Apple Podcasts
Google Podcasts
Spotify
PocketCasts
Overcast
Castro

The Meta Tech Podcast is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.

Watch: Meta’s engineers on building network infrastructure for AI

Wed, 15 Nov 2023 18:00:00 +0100

Meta is building for the future of AI at every level — from hardware like MTIA v1, Meta’s first-generation AI inference accelerator to publicly released models like Llama 2, Meta’s next-generation large language model, as well as new generative AI tools like Code Llama.

Delivering next-generation AI products and services at Meta’s scale also requires a next-generation infrastructure.

The 2023 edition of Networking at Scale focused on how Meta’s engineers and researchers have been designing and operating the network infrastructure over the last several years for Meta’s AI workloads, including our numerous ranking and recommendation workloads and the immense Generative AI models. They cover a wide range of topics, including physical and logical network design, custom routing and load balancing solutions, performance tuning/debugging/benchmarking, and workload simulation and planning. We also look ahead to the requirements of Generative AI models coming in the next several years.

Networking for GenAI Training and Inference Clusters

Jongsoo Park, Research Scientist, InfrastructurePetr Lapukhov, Network Engineer

Developing new GenAI technologies and incorporating them into product features is a top priority at Meta. But the sheer scale and complexity of GenAi models means new challenges for Meta’s network infrastructure.

Jongsoo Park and Petr Lapukhov discuss the unique requirements of new large language models, and how Meta’s infrastructure is changing for the new GenAI landscape.

Meta’s Network Journey to Enable AI

Hany Morsy, Network EngineerSusana Contrera, Network Engineer

Over the years, Meta’s AI infrastructure has transitioned from CPU-based to GPU-based training due to growing AI workloads. As a result, we have deployed large-scale, distributed, network-interconnected systems to support these systems and workloads. .

Today, our training models use a RoCE-based network fabric with a CLOS topology, where leaf switches are connected to GPU hosts and spine switches provide the Scale-Out connectivity to GPUs in the cluster.

Hany Morsy and Susana Contrera delve into how Meta’s network builds have evolved to support the needs of AI services. Along the way, they share challenges encountered, new solutions that were implemented, and the strategic considerations that have gone into building Meta’s high-performance, efficient network fabric for AI workloads.

Scaling RoCE Networks for AI Training

Adi Gangidi, Production Network Engineer

Adi Gangidi provides an overview of Meta’s RDMA deployment based on RoCEV2 transport for supporting our production AI training infrastructure. He sheds light on how Meta’s infrastructure is designed to both maximize the raw performance and consistency that is fundamental for AI-related workloads.

The talk also covers challenges in the routing, transport, and hardware layers that were solved along the way to scale Meta’s infrastructure, as well as opportunities for further progress over the next few years.

Traffic Engineering for AI Training Networks

Shuqiang Zhang, Software EngineerJingyi Yang, Software Engineer

Meta has been operating RoCE-based distributed training clusters to serve its internal AI training workloads since 2020. But those early days saw challenges around maintaining job performance consistency.

Shuqiang Zhang and Jingyi Yang discuss centralized traffic engineering, one of Meta’s solutions to this challenge, which dynamically places traffic over all available paths in a load-balanced manner. They go over the centralized traffic engineering solution’s design, development, evaluation, and operational experience.

Network Observability for AI/HPC Training Workflows

Shengbao Zheng, Research Scientist

Having high-performance and reliable collective communication over Meta’s AI-Zone RDMA network is foundational for enabling and scaling Meta’s AI training and inference workloads. To facilitate this, it’s necessary to capture top-down observability from workload to network for collective communication – this allows us to attribute performance regression and training failures to the backend network when appropriate.

Meta has introduced two important tools for this: The ROCET and PARAM benchmarks and Chakra ecosystems. We build ROCET to associate the job to RDMA network metrics and provide analysis on top. In addition, we built the PARAM benchmark to allow for analyzing and tuning collective communication operations through workload trace. We recently shared these systems with the community via Chakra, which allows for co-designing efficient distributed ML systems. In this talk, Shengbao Zheng discusses the design and use cases for these tools.

Arcadia: End-to-end AI System Performance Simulator: Fostering data-driven decision-making processes and promoting the future evolution of AI systems

Zhaodong Wang, Research ScientistSatyajeet Singh Ahuja, Networking Modeling and Optimization Engineer

Arcadia is a unified system designed to simulate the compute, memory, and network performance of AI training clusters. By providing a multi-disciplinary performance analysis framework, Arcadia aims to facilitate the design and optimization of various system levels, including application, network, and hardware.

With Arcadia, researchers and practitioners can gain valuable insights into the performance of future AI models and workloads on specific infrastructures and make data-driven decisions around how models and hardware will evolve in the future.

Arcadia allows Meta’s engineers to simulate the performance impact of scheduled operational tasks on AI-models that are running in production and helps them make job-aware decisions during day-to-day operational activity.

Zhaodong Wang and Satyajeet Singh Ahuja discuss Arcadia’s capabilities and its potential impact in advancing the field of AI systems and infrastructure.

Enhancing the security of WhatsApp calls

Wed, 08 Nov 2023 15:00:00 +0100

New optional features in WhatsApp have helped make calling on WhatsApp more secure.
“Silence Unknown Callers” is a new setting on WhatsApp that not only quiets annoying calls but also blocks sophisticated cyber attacks.
“Protect IP Address in Calls” is a new setting on WhatsApp that helps hide your location from other parties on the call.

Privacy and security are at the core of WhatsApp. In addition to protecting personal messages with end-to-end encryption, WhatsApp empowers users to control their own privacy settings: from what you share, how you show up online, or who can reach out to you or add you to groups.

In June 2023, WhatsApp announced an additional privacy feature: Silence Unknown Callers. We launched this feature for the benefits it has for not only privacy but also security. The experience is simple: with the setting turned on, calls from unknown numbers do not ring your phone. Having carefully built this feature to minimize attack surface and external data processing, we are able to help protect users from not only unwanted contact, but also cyber attacks and spyware.

Then in October 2023, WhatsApp began rolling out “Protect IP Address in Calls” which hides your IP from the other party by relaying calls through WhatsApp Servers.

Stop cyber attacks and hackers with “Silence Unknown Callers”

Across the software industry, calling products are an attractive vector for cyber attacks. Popular software projects in this space, such as WebRTC and PJSIP, have documented numerous vulnerabilities. Because of the complexity and large number of protocols involved, attackers have many opportunities to find a bug to exploit. Furthermore, calling software often automatically processes incoming packets from callers to optimize call setup and improve performance. This means calling vulnerabilities can often lead to “zero-click” attacks; the victim may not need to even accept the call for the attack to succeed.

In most calling products, devices exchange information and setup state without user interaction.

Many calling products offer ways to silence calls. However, traditional methods of silencing retain the same network protocols and message flow of a normal call which merely silences the call on the recipient’s device. This leaves many risks for call recipients unmitigated.

The recipient’s device may still perform complicated processing of attacker-controlled data
This gives an attacker ways to load data into the recipient’s memory
The recipient may leak device information back to the attacker to increase exploit delivery reliability

One could attempt to mitigate these risks by adding state machines, firewalls, and sandboxes on the recipient. However, there are many examples in the industry of these techniques failing to protect users.

Instead, WhatsApp built a specialized protocol for delivering stripped-down, silenced call notifications to recipients. The server enforces this protocol, protecting the recipient device from the complexity of normal call setup and from processing attacker-controlled data.

Our implementation of silenced calls, with WhatsApp servers enforcing separation.

This approach took more effort than a client-only method. How can the server know if the call should be silenced without asking the recipient? In end-to-end encrypted messengers like WhatsApp, clients are the source of truth. We don’t keep logs of who everyone is messaging or calling: While traditionally mobile carriers and operators store this information, we believe that keeping these records for two billion users would be both a privacy and security risk and we don’t do it.

WhatsApp developed a new technology, named privacy tokens, to solve this problem. Each client locally decides which other user it trusts and distributes tokens to them. When a call is placed, the caller includes the privacy token of the recipient in the protocol message. Next, the server checks the token’s validity along with a few other factors to determine if the intended recipient allows this sender to ring them. Crucially, for our user’s privacy, the server does not learn anything about the exact relationship between the caller and the recipient from the token.

With our design of this feature, calling becomes a much less attractive vector for attackers.

Protect your IP address metadata in calls

Two common methods of connecting call participants: peer-to-peer and via a relay.

Most calling products people use today have peer-to-peer connections between participants. This direct connection allows for faster data transfers and better call quality, but it also means that participants need to know each other’s IP addresses so that call data packets can be delivered to the correct device – meaning that the IP addresses are visible to both callers on a 1:1 call. IP addresses may contain information that some of our most privacy-conscious users are mindful of, such as broad geographical location or internet provider.

To address this concern, we introduced a new feature on WhatsApp that allows you to protect your IP address during calls. With this feature enabled, all your calls will be relayed through WhatsApp’s servers, ensuring that other parties in the call cannot see your IP address and subsequently deduce your general geographical location. This new feature provides an additional layer of privacy and security particularly geared towards our most privacy-conscious users. As always, your calls are end-to-end encrypted, so even if a call is relayed through WhatsApp servers, WhatsApp cannot listen to your calls.

Visit the WhatsApp Help Center learn more about this feature – which is being rolled out currently to iOS and Android users – and how to activate it.

Conclusion

WhatsApp built and launched “Silence Unknown Callers” and “Protect IP Address in Calls” this year as part of our ongoing comprehensive work to keep users safe. These features respect and improve user privacy while also reducing the effectiveness of real-world attacks.

Protecting user privacy and security is absolutely necessary for WhatsApp to accomplish its mission to enable private communication for the world. These new security features combine with many other protections to keep people safe on WhatsApp.

How Meta built Threads in 5 months

Mon, 06 Nov 2023 18:00:00 +0100

In about five short months, a small team of engineers at Meta took Threads, the new text-based conversations app, from from an idea to the most successful app launch of all time, pulling in over 100M users in its first five days.

But this achievement wouldn’t have been possible without Meta’s existing systems and infrastructure.

On the latest episode of the Meta Tech Podcast, Meta engineer Pascal Hartig (@passy) is joined by Joy Qiu, Cameron Roth, and Richard Zadorozny, three engineers from the Threads team, who worked on backend, iOS, and Android respectively.

Learn more of the inside story behind Threads and the challenges their team faced along the way.

Download or listen to to the episode below:

[embedded content]

You can also listen to the episode wherever you get your podcasts:

The Meta Tech Podcast is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.

Automating data removal

Tue, 31 Oct 2023 17:00:00 +0100

Meta’s Systematic Code and Asset Removal Framework (SCARF) has a subsystem for identifying and removing unused data types.
SCARF scans production data systems to identify tables or assets that are unused and safely removes them.
SCARF avoids tedious manual work and ensures that product data is correctly removed when a product is shut down.

This is the third and final post in our series on Meta’s Systematic Code and Asset Removal Framework (SCARF). SCARF contains a combination of subsystems that analyze code and data usage throughout Meta to aid in the safe and efficient removal of deprecated products. In our first post on automating product deprecation, we discussed the complexities of product deprecations and introduced SCARF’s workflow management tools that guide engineers through a coordinated process to safely deprecate products. In the second post on automating dead code cleanup, we discussed SCARF’s dead code subsystem and its ability to analyze both static and dynamic usage of code to automatically generate change requests to remove unused code. Throughout this series, we have referred to the example of the deprecation of a photo sharing app called Moments, which Meta launched in 2015 and eventually shut down in 2019.

In this post, we introduce the subsystem responsible for automating the identification and safe removal of unused data types at Meta. This process can be unexpectedly difficult, because large software systems are inevitably interconnected. Moments relied on several pieces of shared Facebook functionality and infrastructure, and deleting it was more complicated than simply turning off servers and removing data tables.

Unused data cleanup

SCARF implements an unused data type cleanup system for Meta engineers to leverage when they want to ensure consistent removal of unused data. SCARF scans data systems to identify each type of data stored (for example, identifying all the tables in a relational database) and, for each of these, determines whether the data is being used. If any of the assets are not in use, SCARF initiates a process to safely remove them.

The types of data tracked by SCARF vary and include things like database tables, partitioned “use cases” in shared storage systems, or object classes. Each represents a class of data — not individual records. Meta has a separate system, DELF, for deleting individual records, rows, and objects.

SCARF coordinates several kinds of tasks for each data system: metadata collection (e.g., data quantity, field types), usage collection, analysis, and actions. These tasks share some common components and adhere to a standardized format; however, the implementation is inherently specific to each supported data system.

Measuring usage

SCARF tracks two primary metrics to determine if deletion is safe: It measures static usage by identifying code that appears to query a type of data; and it measures runtime usage by observing access patterns in production.

As mentioned in our blog post, Automating dead code cleanup, SCARF statically analyzes Meta’s codebases using Glean. Glean presents static analysis facts extracted from the compiler in an indexed, standardized format. SCARF queries Glean to locate code that appears to reference each type of data. For example, each type of data stored in Meta’s object graph data system TAO is referenced by an enum value: We can locate usages of each enum value in code across multiple languages.

viewed_photo = TAO.fetch(
    id=objectId,
    type=TAOType.USER_VIEWED_PHOTO,
)

SCARF also measures the usage of each type of data at runtime. We instrument Meta’s data systems to publish counters indicating how many reads each type of data has received from relevant sources, like production traffic from our webserver fleet, while ignoring traffic created by backup infrastructure.

Some of our data systems, like TAO, receive billions of requests per second. Instrumenting a data system at that scale, while ensuring we only measure certain types of usage, presents complex engineering challenges to avoid costly performance degradations.

Orchestrating removal

Once SCARF detects that a type of data is completely unused by combining the signals from our metadata, traffic analysis, and code references, it notifies the engineering team responsible for that data type, via an internal ticket, that cleanup will proceed .

After a configured time, SCARF blocks all reads and writes via a data system specific mechanism. This period of time is important as it acts as a dry run of what would happen when the data is truly deleted. Once this period elapses, the data will be deleted. The system keeps an internal log of actions it has performed for our records.

For example, if a type of data in TAO has no code references or production traffic, after a notification and waiting to see if there are objections, SCARF will instruct TAO to raise an error any time a service attempts to read from or write to that type of data. After a further monitoring phase, it will instruct TAO to delete the data.

Note that SCARF does not wait for the engineer’s acknowledgement when the internal ticket is filed – the system biases towards the automatic removal of unused data types and relies on its thorough analyses to ensure that only unused data is removed. If something is unused for long enough, it is less and less likely that problems will arise when automation cleans it up. Biassing towards automation improves efficiency and allows the system to scale beyond a process that requires manual approvals for every action.

Should a mistake be made, the access restriction period acts as a buffer where any mistakes can be caught before any data deletion occurs and our analysis can be updated to account for any missing signals. In the worst case, if data is deleted but should not have been, many systems at Meta provide backups to protect against data loss; and while the backups are available, they can act as a final safeguard to protect against erroneous deletions.

Engineers can interact with SCARF’s deletion process in various ways. As mentioned earlier, they can override the usage signals that SCARF detects in order to proceed with the deletion, if they determine those signals to be false-positives. They can also accelerate the process by shortening the waiting periods. Finally, engineers can highlight problems they have noticed back to our team who build SCARF itself: Often engineers will notice false-positives (cases when SCARF detects something as used, but it isn’t), and rarely false-negatives (when SCARF detects no usage, but there actually is usage), in the usage signals collected by SCARF.

Coping with cross-system dependencies

Meta has many different systems for storing data, many of which are specialized for a certain use-case. A high-quality product will often require the features of multiple different data systems. For example, TAO is a graph database that excels at serving many small, fast queries, but it wouldn’t be used for tasks like machine learning, ranking, or aggregation. As such, Meta frequently leverages multiple data storage systems for a single product, including data pipelines that move data between systems. SCARF hence has to understand the interconnections between these systems to ensure data is removed from each place it is stored and to prevent deletions from occurring out of order.

SCARF models these through a curated set of generated asset relationships. For a given asset and its corresponding inbound and outbound dependencies, SCARF determines the nature of the relationship, which dictates which asset must be deleted first and whether the deletion of one asset necessitates the deletion of the other. For example, some assets exist solely as the result of moving data between systems and must be removed together in a multi-step process. This modeling of system relationships within SCARF enables more thorough orchestration of data cleanup and prevents the system from attempting to delete assets out of order.

Coping with code usage

Thinking of Meta’s code and data definitions as a single dependency graph, SCARF can be seen as a system that prunes leaves and isolated nodes in this graph. This dependency graph changes over time, as new nodes and edges come and go with every piece of engineering work.

SCARF is unable to automatically remove data if it identifies code that could use this data, even if that code is not being run: SCARF, by design, will not break edges in this graph. For example, if an engineer commits a script referencing a type of data in TAO for debugging purposes, SCARF would correctly identify that as a reference to the use case and prevent deletion — even if the script is no longer used.

As mentioned in part two of this series, SCARF’s dead code subsystem works to help solve this problem through the automated removal of known dead code. If the dead code system is able to remove the unused script, the unused data can then be removed.

Removing data at scale

Removing unused data types not only simplifies our internal infrastructure, but also saves material capacity costs. In the last year, it has removed petabytes of unused data, across 12.8M different data types stored in 21 different data systems. While, in many products, an individual piece of data may consist of an identifier (primary key) and a small amount of data, at Meta’s scale there are billions of such rows. Data logged during routine usage of our services to provide analytics, operational logging, or analysis will also consist of billions of rows, multiplied by the retention of historical data in our warehouse, and our backups.

SCARF concurrently operates on millions of assets each day and drastically reduces the overhead on our teams from having to manually intervene and clean up unused data. The team that maintains SCARF has developed strong partnerships with our colleagues that build and maintain these various data systems to leverage their expertise and to work together to provide the APIs that SCARF invokes to safely restrict access to and eventually clean up data.

SCARF runs on a daily cadence: The lifecycle of products and features means that there are new types of data being created every day as well as types of data that become unused every day. Running the system regularly ensures that as the final references to assets are deleted, SCARF picks up these changes quickly and can trigger the relevant automation.

A summary of SCARF

SCARF provides a powerful suite of tools to the company’s engineering teams. Its workflow management tooling powers thousands of human-led deprecation projects alongside the millions of code and data assets it has cleaned automatically.

SCARF also serves useful purposes for Meta’s privacy teams: We can use the tool to monitor the progress of ongoing product deprecations and ensure that they are completed in a timely manner. When there’s work that our automation is unable to do, SCARF’s internal tooling educates engineers about what they need to do and how to do it safely. The information it provides is not generic: It is tailored to the specific code and data that an engineer is deleting, empowering them to make the right decisions in the most efficient manner.

By discussing privacy innovations like SCARF, we hope to create transparency about our continuous investment in infrastructure that delivers privacy protections for the people that use our products and services. Our dedication to automating and orchestrating unused code and data deletion in a comprehensive manner is just one example of the substantive privacy by design measures we focus on at Meta.

Product deprecation can be safe, efficient, and thorough, even with infrastructure as complex and vast as Meta’s. Combining automation with engineering tooling is a tried and tested strategy that we have found to be very effective. We are investing in this space for the long-term since product deprecation is a continuous part of the data lifecycle, which contributes to the sustained success of any large tech company like Meta.

Automating dead code cleanup

Tue, 24 Oct 2023 18:00:00 +0200

Meta’s Systematic Code and Asset Removal Framework (SCARF) has a subsystem for identifying and removing dead code.
SCARF combines static and dynamic analysis of programs to detect dead code from both a business and programming language perspective.
SCARF automatically creates change requests that delete the dead code identified from the program analysis, minimizing developer costs.

In our last blog post on automatic product deprecation, we talked about the complexities of product deprecations, and a solution Meta has built called the Systematic Code and Asset Removal Framework (SCARF). As an example, we looked at Moments, the photo sharing app Meta launched in 2015 and eventually shut down in 2019, and how SCARF can help with the deprecation process through its workflow management capabilities. We discussed how SCARF saves engineering time by identifying the correct order of tasks for cleaning up a product and how it can be blocked from automating the cleanup when there are intersystem dependencies. This naturally leads to the question: How do we automatically unblock SCARF when there is code that references an asset?

Dead code removal in SCARF

SCARF contains a subsystem that automatically identifies dead code through a combination of static, runtime, and application analysis. It leverages this analysis to submit change requests to remove this code from our systems. This automated dead code removal improves the quality of our systems and also unblocks unused data removal in SCARF when the dead code includes references to data assets that prevent automated data cleanup.

Code analysis

SCARF’s code analysis subsystem gathers information from a variety of sources. First, a code dependency graph for each language is extracted from our compilers via Glean. This is then augmented with further information, like the usage of API endpoints from operational logs that determine whether an endpoint is used at runtime. Additional examples of domain-specific usage encoded include:

Script invocations for internal developer tools and system management commands.
Template hooks for dynamically rendering pages in the Instagram Django backend and URI handler and routing.
Async’s dynamically referenced dispatch methods (Meta’s deferred job execution service).

SCARF must be capable of introspecting any and all types of dynamic usage in addition to the static dependency graph to make accurate determinations of whether a piece of code is truly safe to remove. These are combined and form an augmented dependency graph.

SCARF supports multiple programming languages. This is very important, as products at Meta may have client code written in Java, Objective-C, and JavaScript, with server code written in Hack, and some backend infrastructure written in Python. All of these pieces of code should be deleted as they all combine to form the same dependency graph since they are associated via APIs and other known forms of dynamic and language-spanning references.

SCARF operates at a symbol level as opposed to a file level, which allows for more granular analysis and cleanup. For example, an individual variable that is unused in a function will have its own fully qualified symbol, which allows for more granular cleanup than is possible at the file level.

Garbage collection

SCARF analyzes the augmented dependency graph to identify unreachable nodes and subgraphs that can be deleted and will automatically generate code change requests to delete the corresponding code on a daily basis. A key benefit of analyzing the complete graph is that we can detect and delete cycles, where different parts of the codebase depend on each other. Deleting entire subgraphs accelerates the deletion of dead code and provides a better experience for the engineers leveraging this automation in their deprecations.

It’s important that the graph contains the augmented information, as static analysis alone may not reveal links between components created through dynamic references or runtime language features. There is a trade-off, though, in that augmenting the graph with dynamic usage information requires the full processing of the indexed code and the subsequent data analysis pipelines that provide the metrics. This increases the end to end duration of the entire process which can make prototyping new features or capabilities more difficult.

Earlier versions of SCARF avoided this upfront cost by taking a different approach. It analyzed each discoverable symbol individually and at runtime would run classifiers that queried for static and dynamic references in order to find dead root nodes — pieces of code with no inbound dependencies. This did not require the upfront construction of the complete dependency graph and simplified the process of running the system over small subsets of the codebase. As a result, it was trivial to prototype new classifiers that identified potential dynamic references without requiring time-consuming indexing or data analysis.

However, this longer end-to-end development cycle led to a dramatic improvement in coverage. The transition from analyzing individual symbols to the entire graph led to a nearly 50% increase in dead code removed from one of Meta’s largest codebases. The new approach improves visibility into the state of our codebases: how much is alive, how much is dead, and how much of that we are removing in any given pass of SCARF.

Fine-tuning the dependency graph

Many of the dependencies that we index using Glean are for patterns of code invocation which do not necessarily block the deletion of that code. For example, let’s say we had a class PhotoRenderer, and the only dependency on it was in code like this:

if isinstance(renderer, PhotoRenderer):
    return renderer.render_photo()
else:
    return renderer.render_generic()

In this case, the references to PhotoRenderer and render_photo() can be removed, and the code changed to this:

return renderer.render_generic()

In this example, the class, PhotoRenderer, was inlined based on a rule derived from the semantics of Python: if there are no places where the PhotoRenderer class is instantiated, we can be confident that this code cannot take the first branch and it is therefore dead.

In some cases, we derive these rules based on our application semantics as opposed to language semantics. Imagine this code:

uri_dispatch = {
  '/home/': HomeController,
  '/photos/': PhotosController,
  ...
}

If we only analyzed a language-level dependency graph, it would be impossible to determine whether or not PhotosController is ever referenced as it can be invoked via this URI dispatch mechanism. However, if we know from our application analysis that the ‘/photos/’ endpoint never receives any requests in production, then we could remove the corresponding entry from this dictionary.

There’s no inherent way to infer this given Python’s language semantics, but our domain-specific logging and graph augmentation allow us to inform SCARF that this operation is safe.

Automating code changes

At Meta, we heavily automate changes to code. We built an internal service, called CodemodService, which empowers engineers to deploy configurations to automate code changes at scale. SCARF was the first instance of company-wide, fully automated code changes at Meta, and was built hand-in-hand alongside CodemodService. Today, CodemodService also powers hundreds of other types of automated code changes at Meta, from automating the formatting of code, automatically removing completed experiments, empowering large-scale API migrations, to improving coverage of strong types in partially-typed languages like Python and Hack.

Dead code removal at scale

SCARF uses CodemodService to create code change requests for engineers to review. These change requests incorporate human-readable descriptions informing engineers about the analysis that determined the targeted code is provably dead.

SCARF has grown to analyze hundreds of millions of lines of code; and five years on, it has automatically deleted more than 100 million lines of code in over 370,000 change requests. False-positives caught by engineers during code review are triaged and used to improve the analysis that SCARF performs and typically reflect new sources of dynamic usage that our augmented graphs must account for. Sometimes these misunderstood dynamic references can lead to incorrect deletion of code, and these deletions can make it to production. Meta has other mechanisms in place to catch these problems and we take such incidents very seriously.

In some languages, we have such high confidence in our analysis that we can automatically accept and merge the change requests without human intervention to make better use of engineers’ valuable time.

Is dead code removal sufficient?

SCARF’s automated dead code removal accelerates the process of shutting down and removing the code and data for deprecated products, but it does not solve it fully. Beyond the problems caused by interconnectivity, we are constantly improving our ability to integrate across all languages, systems, and frameworks at Meta. It is difficult to accurately cover every type of usage of code and data that enables our systems to determine what is truly dead.

Our systems also err on the side of caution, by searching for textual references to code and data through our BigGrep system and not solely relying on the curated graphs produced through Glean and our dynamic usage augmentations. This is a fallback safety mechanism that helps avoid accidentally deleting MySQL tables that are referenced by name in other languages and preventing deletions of dynamically invoked code in languages like Hack, Python, and JavaScript that can call code through string references or use eval. This approach can cause false negatives, but avoids false positives. When automating the removal of dead code, those are a more serious problem.

As mentioned in our first post of this series, SCARF provides workflow management features that work together with the dead code subsystem to provide a cohesive experience for fully deprecating products and features. Crucially, our engineers can iterate on code changes faster than our automation! If an engineer understands that a change has rendered a branch of code (and therefore an entire subgraph) unreachable, they can easily incorporate that deletion into their changes without waiting for our infrastructure to index the new code, analyze it, and eventually get around to submitting its automated changes. Engineers sometimes find it more productive to manually delete things rather than waiting to see if the automated systems will clean it up for them later.

In the next and final blog post in this series, we will look at SCARF’s unused data type subsystem that Meta has built that, in conjunction with the dead code subsystem, amplifies Meta’s data minimization capabilities by automating the removal of dead and unused assets.

5 Things you didn’t know about Buck2

Mon, 23 Oct 2023 18:00:00 +0200

Meta has a very large monorepo, with many different programming languages. To optimize build and performance, we developed our own build system called Buck, which was first open-sourced in 2013.

Buck2 is the recently open-sourced successor. In our internal tests at Meta, we observed that Buck2 completed builds approximately 2x as fast as Buck1. Below are five interesting facts you might not have known about Buck2.

Fact1: Buck2 is written in Rust

The core of Buck2 is written in Rust, with the rules written in Starlark (a Python-like language) and interpreted by our open-source Starlark library. Using Starlark was a natural choice since Buck1 and Bazel both already use it. We debated between using Java (like Buck1), Haskell (like the Shake build system) or Go for the core programming language. In the end we chose Rust because of its high performance, control over memory usage (garbage collection isn’t a great fit for big graphs), and its focus on correct concurrency (important for a highly parallel build system).

Fact2: Buck2 can avoid downloading intermediate outputs

When configured using remote execution, Buck2 can run actions remotely. If the output of those actions is then used as the input for a further remote action, we avoid ever downloading the intermediate output and just remember the resulting hash. In some cases a build will proceed entirely remotely, with only minimal local activity, followed by downloading the final results. This technique can save both bandwidth and disk space.

Fact3: Buck2 has 11 different types of file path internally

In addition to the built-in Rust Path type, and the relative-path crate’s type RelativePath, we have further defined nine additional path types. We use these distinct types to identify which paths are absolute and, for relative paths, what they are relative to (e.g., relative to project root, enclosing cell, package, a given label, etc). While it was fairly tedious to write out so many types of path, the bugs it prevented made it worth it. We take our type safety very seriously!

Fact4: Buck2 has no knowledge of any programming language built in

Everything language-specific in Buck2 is described in Starlark. Furthermore, while Buck2 comes with a Starlark “prelude” that supports a number of programming languages (e.g., C++, Rust, Python, OCaml, etc.) you can write your own prelude from scratch (several open source users have done so).

Because we kept the core of Buck2 free from language concepts, we had to make the Starlark APIs powerful enough to write things like dep files and conditional relinking– important performance optimisations in C++. Happily, the APIs required for those advanced features tend to be useful in other languages for purposes we hadn’t even imagined (e.g. Erlang dependency trimming, Android package grouping).

Fact5: Buck2 provides BXL (Buck2 Extension Language) for inspecting and interacting with the Buck graph

This feature, which is unique in the build system space as far as we are aware, gives access to the graph with Starlark API, and also lets you define new build actions native to BXL. The build graph often serves as the source of truth for a project, and BXL lets that information be easily reused. We have used BXL to generate IDE projects but think this could have lots of future opportunities.

Further resources:

Build faster with Buck2: Our open source build system (Buck2 announcement blog)
Buck2 website
Buck2 GitHub

We hope you found this article insightful to show you what Buck2 is capable of and where you can use it for your own workflow. Stay tuned for our upcoming blog posts, where we will explore other projects from Meta’s DevInfra teams, and interesting facts on how you can use them in your own projects.

About this series

This blog is a part of our 5 Things you didn’t know series, where we go over the top five interesting facts about a DevInfra project from Meta that can help you simplify your projects or workflows. Look out for more of these blogs where we discuss other DevInfra projects and interesting facts to help you get started with using them in your own projects.

To learn more about Meta Open Source, visit our website, subscribe to our YouTube channel, or follow us on X and Facebook.

How Meta is creating custom silicon for AI

Wed, 18 Oct 2023 18:00:00 +0200

With the recent launches of MTIA v1, Meta’s first-generation AI inference accelerator, and Llama 2, the next generation of Meta’s publicly available large language model, it’s clear that Meta is focused on advancing AI for a more connected world. Fueling the success of these products are world-class infrastructure teams, including Meta’s custom AI silicon team, led by Olivia Wu, a leader in the silicon industry for 30 years.

In the conversation below, Olivia explains how she led the silicon design team to deliver Meta’s AI silicon, allowing the company to improve the compute efficiency of the infrastructure, and enable software developers to create AI models that will provide more relevant content and better user experiences.

Tell us about your role at Meta.

Olivia Wu: I lead design development of the next generation of Meta’s AI silicon. My team is responsible for the design and development of Meta’s in-house machine learning (ML) accelerator, and I partner closely with our co-design, architecture, verification, implementation, emulation, validation, system, firmware, and software teams to successfully build and deploy the silicon in our data centers.

What led you to this role?

OW: I’ve been working in the silicon industry for 30 years and have experience working at a variety of large companies leading both architecture and design for multiple ASICs and IPs, and for startups focused on training AI. In 2018, I saw a social media post from Yann LeCun, our Chief AI Scientist, that Meta was looking for someone to help build AI silicon in-house. I knew of just a few other companies designing their own custom AI silicon, but they were mainly focused only on silicon and not the software ecosystem and products.

The opportunity for Meta (known as Facebook back then) was to bring in silicon developers to work directly with the software teams to reimagine end-to-end systems allowing for greater efficiency and larger degrees of freedom in optimizing across hardware and software boundaries.

This was very enticing to me. I knew this was a rare opportunity and I had to jump on it to have the chance to build a design team from the ground up.

How was the transition from working at two different startups to working at Meta?

OW: My transition from startup to Meta was super easy. We had a very small team, so it almost feels like a startup within a large company. I was able to get involved in many parts of the project. It gave me the opportunity to be very hands-on in all aspects of ASIC development.

Meta also has a very open culture. The freedom to innovate and experiment with new ideas is ingrained into Meta’s DNA. I was able to have whiteboard sessions with members of co-design, software, hardware, and other cross-functional teams to brainstorm features that would go into the silicon. These discussions gave me a lot of insights into Meta’s critical AI workloads, the challenges that our software teams had encountered with the current solutions, and their future directions. Coming from a startup, where we had very limited visibility into customer workloads and the roadmap outside of what is open sourced, this was very enlightening and refreshing,

What are some of the challenges you face in your current role?

OW: The silicon development cycle typically is fairly long. It usually spans anywhere from one and a half to two years, though it can take as long as four years in some cases. With AI advancing at a much faster clip, we are really designing hardware for software that doesn’t yet exist. So the silicon has to be able to handle not just the demands of AI today, but future AI as well. To do this, we have to understand what our software team needs – AI workload trends they see, features they will need – and incorporate that into our design.

This is where we at Meta have an advantage. Because our silicon and software teams are both in-house, we have a front row seat into what’s happening in software, and we are able to incorporate it into our silicon from the beginning.

MTIA v1 was the very first silicon that we built at Meta, so one of the really challenging things was having to build out the entire design and verification flow from scratch, as well as the silicon development infrastructure itself. This was a lot of work in the beginning, but it’s really paid off in the long run for the team.

Meta announced MTIA v1 earlier this year. What is the significance of this milestone to you and the company?

OW: MTIA v1 is Meta’s first generation ML accelerator. It’s customized for our deep learning recommendation model, which is an important component for Meta technologies – including Facebook, Instagram, WhatsApp, Meta Quest, Horizon Worlds, and Ray-Ban Stories. While we will continue to purchase silicon chips from our partners, designing our own silicon allows us to optimize specifically for our critical workloads and gain complete control over the entire stack – from silicon, to the system, to software and the application.

This was such a fun and unique experience, especially when I first started and the team was really, really small. We were able to fit into a conference room along with the software team and whiteboard all the different ideas and features we wanted to implement. I don’t think I’ve ever had that kind of experience anywhere else. Even though the team has grown quite a bit since then, we still try to maintain that scrappy culture.

What did you and the team learn from this process?

OW: I learned how important it is to have a hands-on team capable of jumping into other roles to get the job done. We operate in many ways like a startup in that we have to wear many hats and take on other challenges beyond our usual work. So even though I’m the design lead, in addition to leading the project development, I also roll up my sleeves to code and help out wherever is needed.

What are you looking forward to next? What’s next for the AI silicon design team?

OW: AI is central to our work at Meta. The recommendation system is obviously a big part of our AI models, but beyond that, we also have GenAI and video processing use cases that have different requirements. This brings us a lot of opportunities to create products tailored for each need.

With MTIA in-house, it gives us a tremendous amount of learnings we can incorporate in our products. In addition, we maintained the user experience and developer efficiency offered by PyTorch eager-mode development. Developer efficiency is a journey as we continue to support PyTorch 2.0, which supercharges how PyTorch operates at the compiler level — under the hood. We’re continuing to gather feedback and input from our AI software teams to shape the features of our future AI silicon.

As we work on the next generations of MTIA chips, we’re constantly looking at bottlenecks in the system, such as memory and communication across different chips so that we can put together a well-balanced solution to scale and future-proof our silicon.

What advice might you give to women or other historically underrepresented groups interested in pursuing a career as engineers?

OW: I would encourage them to actively participate and not shy away from speaking up in meetings or discussions so people can know what they can accomplish. The other thing is to look for mentors within the team. They don’t have to be the same as you. Having a mentor is always good, particularly early in your career, to help guide you and prioritize what will help you advance.

Meta’s Infra team, as well as Meta more widely, has a mentor program for women engineers and underrepresented people. We offer both a group coaching program as well as one-on-one coaching. I’ve done both of these and really enjoy having the opportunity to mentor. I’ve found that it’s very helpful for junior engineers to have the opportunity to get coaching and mentoring from senior people in the company.

What about Meta’s culture and technical advancements make it such a prime time for engineers, researchers, and developers to be at the company?

OW: Meta is an amazingly open company with a truly collaborative culture and a great place to learn and grow. We provide resources to help people quickly become familiar with the entire stack, even if they have no prior exposure to certain parts. This includes everything from the silicon to the firmware, the compiler, the application, as well as large scale system design that we are putting into the data center. The sheer scale to which Meta has been deploying the application also creates a dimension of challenges that makes it interesting and rewarding to work here.

Automating product deprecation

Tue, 17 Oct 2023 18:00:00 +0200

Systematic Code and Asset Removal Framework (SCARF) is Meta’s unused code and data deletion framework.
SCARF guides engineers through deprecating a product safely and efficiently via an internal tool.
SCARF combines this tooling with automation to reduce load on engineers.

At Meta, we are constantly innovating and experimenting by building and shipping many different products, and those products comprise thousands of individual features. As part of this healthy technology lifecycle, it is inevitable that certain products or features will be deprecated. For example, in 2015 we launched a photo-sharing app called Moments, which was later deprecated in 2019. So, how did we efficiently and safely remove all of the code and data related to Moments without adversely affecting Meta’s other products and services?

In this three-part blog series, we will discuss the complexities involved in removing a product from a complex portfolio of products and the framework Meta has built to drive the automation of this process, our Systematic Code and Asset Removal Framework (SCARF). SCARF has had an important impact at Meta. In the last year, it has removed petabytes of unused data across 12.8M different data types stored in 21 different data systems. Over the last five years it has deleted over 100M lines of code.

The first post will introduce the complexities faced when systematically deprecating products safely in a large organization and the internal workflow tools we have developed. The second post will explain how SCARF automates the removal of dead code and the infrastructure that powers it. The third post will discuss SCARF’s orchestration for safely identifying and deleting unused data types across various data systems.

Failure modes

Without established guidance detailing the process for determining when and how to remove a product or feature, a few failure modes might emerge. Consider the example of launching a time-specific feature for a large event that happens once every few years. Does it make sense to keep all the code and data related to it until the next event? Most of the time, maintaining code for a number of years that is unused is less desirable than building a new experience for the next event.

Engineers who do attempt the cleanup might find that doing so is a very time-consuming job. Correctly identifying all the pieces of code and data associated with the product and only taking action on those specific pieces of code and data is a laborious process. It’s possible that a table that is still being used, or code that is still required for a shared use case, could be included in the scope of a deletion effort. For example, some tables might be shared between products. Hypothetically, Moments may have started life as an extension to Facebook photos before it became a separate app.

Most importantly, it’s crucial that any cleanup efforts only remove things that are actively being deprecated or are already entirely unused. Deleting something that is actively being used in production could cause bad experiences for users. The interconnected nature of features within a large product like Facebook makes this a very real possibility.

How did we solve this?

Meta has developed playbooks describing how to safely deprecate a product. These playbooks describe how to notify people and give them time to download their data, how to disable the product safely, and when to eventually delete the underlying code and data. They describe how and when to perform a product or feature deprecation, but actually removing the code and data for a product or feature is an engineering problem with an engineering solution. The engineering solution we have built does not replace these guides, but enables engineers to more safely and efficiently complete the product deprecation process.

In this post, we’ll describe Meta’s SCARF and how it guides engineers through this cleanup process. In subsequent posts, we will discuss its limitations and how they are mitigated by our automated dead code and data removal platforms. To start, we introduce Meta’s suite of internal tools developed to orchestrate both of these systems in guiding engineers to remove large products with complex dependencies.

Introducing workflow management

To simplify the task of removing a product, Meta has built a product deprecation workflow management tool into SCARF to help engineers delete a product’s dead code and unused data safely and efficiently. This tool lets engineers understand and break down the steps they’ll go through during the deprecation and coordinates the actions of SCARF to bring automation to bear with an engineer’s guidance.

Engineers can import their product or feature into SCARF, which then determines the constituent pieces of code and data, and identifies internal and external dependencies on these assets. SCARF automatically processes this information to guide engineers on the correct order of operations to delete assets safely and shows their progress relative to the desired end-state.

Any assets that are safe to be deleted immediately will be handled by SCARF’s automated code and data cleanup systems (which will be covered in more depth in our subsequent blog posts). Engineers are able to track this automation and accelerate it by using their domain knowledge combined with SCARF’s scoping analysis, which determines the assets that are safe to remove.

Scoping the deprecation

Understanding when a piece of code or data is used by other components of a product is important, and detecting when an asset is referenced in the codebase of other products is crucial for a safe deprecation (the internal and external dependencies, respectively). For example, we would not want to leave a dead web link between two different products because that could lead to a bad user experience. An engineer deprecating a product must think carefully about each such dependency in the external boundary of their product.

To avoid these complications, engineers begin their deprecations by scoping the project, recursively adding any assets that should be deleted, and flagging dependencies that should be severed. For example, if Moments had an integration with the Facebook app’s Sharing feature, this dependency must be broken because the Facebook Sharing feature itself may not be in scope for removal.

However, if the Sharing feature was unique to Moments, that would change the scope of removal as we would want to delete that component as well. Adding new assets into the scope of the project as it progresses requires further analysis to discover the new boundaries of the internal and external dependencies. These dependencies are expected to change over time as an engineer discovers extra components to delete. Alternatively, we may identify code that is actually a shared component and should not be deleted. Attempting to finalize this boundary from the start is very difficult, so we allow developers to redefine the boundary over time as the deprecation progresses to reflect this growing understanding.

The graph of components, their internal and external dependencies, and related data assets, can grow extremely complex in large products. SCARF simplifies this problem by only requiring engineers to make “Flag Dependency” or “Add to Project” decisions at the boundaries. It then internally computes the correct deletion order for everything inside the project.

Creating a deletion roadmap

Once the set of unused assets to be deleted has been determined, SCARF will analyze the internal and external dependencies and create a deletion roadmap that outlines the correct sequence of steps for deleting everything safely. The roadmap is refreshed each day as changes are made, either from targeted assets being deleted or from modifications to the component graph through the Flag and Add actions. This roadmap is one of the workflow management tool’s most crucial features.

Without guidance, engineers may attempt to delete everything in one fell swoop by removing an entire code directory without accounting for external systems’ dependencies, where changes cannot be committed atomically. Another example includes deleting data before the code that reads and writes it, which may lead to new data being created that must be cleaned again. Deprecations must be staggered to account for these various requirements and this entails that every deprecation be performed in a coordinated, multi-step process.

The key implementation detail is the encoding of business logic that allows products and apps to function in Meta’s systems. By encoding how users are able to engage with products and how those products communicate with other Meta services, we can ensure that upstream assets are always deleted before their dependencies. For example, typically a product will comprise code in many different languages across multiple repositories. An engineer needs to delete their mobile code (Java, Objective-C) in order to free up and delete their server-side GraphQL definitions. Deleting those GraphQL definitions makes it possible to delete business logic; deleting business logic makes it possible to delete data schema definitions, which in turn allows unused data to be deleted.

The following diagram is a hypothetical example of the type of information that is presented to an engineer in their deletion roadmap.

This sequencing isn’t obvious at the get-go! The links between these system boundaries are often weaker than the link between, for example, two classes in the same language; and the link may often only be discovered by an engineer during continuous testing on their code change requests. SCARF’s encoding of these inter-system boundaries enables the deletion roadmap which in turn enables engineers to stagger their removal actions safely.

Powering the analysis

SCARF’s workflow management tool is powered by detailed metrics from a combination of both static and dynamic analysis. Information about code is gathered from SCARF’s unified code dependency graph and information about data is gathered from SCARF’s asset usage analysis. These mechanisms will be discussed in more detail in our subsequent blog posts.

These metrics give the workflow management tool a very important property: If metrics can be found for an asset, it is in use and should be blocked from automated deletion. Correspondingly, any asset for which no metrics are available is ready for automatic deletion. With these metrics in hand, SCARF can show engineers a precise explanation of exactly what is blocking the automation and what steps they must manually perform in order for the automation to proceed.

This completeness property allows SCARF to automatically begin removing code and data that is unused, while simultaneously showing engineers which things must be handled manually. At every step of the process, both engineers and automation work together to complete the deprecation.

Once an engineer has acted upon this information and removed more dead code and unused data in accordance with the deletion roadmap, the continuous indexing and analysis of SCARF will detect these changes and automatically trigger any code/data cleanup it can. Finally, the workflow management tool will update its deletion roadmap to identify the next set of items requiring manual intervention, and the process is repeated.

Is automation sufficient?

Some of the usage signals SCARF highlights (such as API endpoints that still receive traffic) do not necessarily cause compilation errors if they are ignored and are more subjective as to whether they should stop automated deletion. If the endpoint for a product receives a single request each day, should we delete it right away? We have already identified that the endpoint belongs to a product we are removing so we know it should be removed eventually, but the answer to when it is correct to delete it is ultimately a business decision. Further, we understand that mistakes can happen and that the automation is not expected to be 100 percent perfect. Therefore, SCARF errs on the side of caution as a mistaken deletion could lead to unrecoverable errors or data loss.

Correspondingly, SCARF offers the ability to override signals that prevent automation from taking action. In the example above, an engineer might decide we wish to proceed with deleting the API endpoint despite small amounts of traffic. This is extremely valuable when cleaning up internal products, tools, and services. Internal tools are built and deprecated with a much faster cadence than an external product often is and SCARF can help determine each individual internal user of a tool who needs to be asked if they’re agreeable with it being removed.

This feature creates a feedback loop between engineers and automation. Our automation highlights usage signals; engineers then triage those usage signals (either by modifying code to remove them or, if applicable, marking the signal as not sufficient to block deletion); then our automation can proceed to progress the SCARF project.

To further accelerate manual changes performed by engineers, SCARF also provides an instant code change feature. SCARF suggests a means for breaking a dependency through a targeted code change and an engineer is able to select one of a predetermined set of possible ways to break it. The change is immediately previewable and a change request can be immediately generated for review by other engineers.

After an engineer has progressed through a majority of the deletion roadmap and removed the entry points, client code, server code, and any other in-scope assets, they will be left with unused tables of data that can now be automatically cleaned up by SCARF. This automation will be explained in more detail in our future blog posts.

Integration with other tools at Meta

SCARF is useful for more than just product deprecation. We surface deprecation as an option when notifying engineers about routine maintenance or upgrade work on individual assets. For example, when asking developers to migrate away from a legacy internal API, we can include an option for them to deprecate the callsite or affected product and import their code into SCARF. Deprecating something can, in many cases, be more work than performing a simple upgrade, but deprecation means that no future maintenance work is necessary. Once something has been removed, future maintenance costs go to zero.

We hope you’ll look forward to our subsequent posts in the series, which will cover SCARF’s automated code and data deletion, respectively.

Meta contributes new features to Python 3.12

Thu, 05 Oct 2023 18:00:00 +0200

Python 3.12 is out! It includes new features and performance improvements – some contributed by Meta – that we believe will benefit all Python users.
We’re sharing details about these new features that we worked closely with the Python community to develop.

This week’s release of Python 3.12 marks a milestone in our efforts to make our work developing and scaling Python for Meta’s use cases more accessible to the broader Python community. Open source at Meta is an important part of how we work and share our learnings with the community.

For several years, we have been sharing our work on Python and CPython through our open source Python runtime, Cinder. We have also been working closely with the Python community to introduce new features and optimizations to improve Python’s performance and to allow third parties to experiment with Python runtime optimization more easily.

For the Python 3.12 release, we collaborated with the Python community on several categories of features:

Immortal Objects
Type system improvements
Performance optimizations
New benchmarks
Cinder hooks

Immortal Objects

Immortal Objects – PEP 683 makes it possible to create Python objects that don’t participate in reference counting, and will live until Python interpreter shutdown. The original motivation for this feature was to reduce memory use in the forking Instagram web-server workload by reducing copy-on-writes triggered by reference-count updates.

Immortal Objects are also an important step towards truly immutable Python objects that can be shared between Python interpreters with no need for locking, for example, via the global interpreter lock (GIL) This can enable improved Python single-process parallelism, whether via multiple sub-interpreters or GIL-free multi-threading.

Type system improvements

The engineering team behind Pyre, an open source Python type-checker, authored and implemented PEP 698 to add a @typing.override decorator, which helps avoid bugs when refactoring class inheritance hierarchies that use method overriding.

Python developers can apply this new decorator to a subclass method that overrides a method from a base class. As a result, static type checkers will be able to warn developers if the base class is modified such that the overridden method no longer exists. Developers can avoid accidentally turning a method override into dead code. This improves confidence in refactoring and helps keep the code more maintainable.

Performance optimizations

Faster comprehensions

In previous Python versions, all comprehensions were compiled as nested functions, and every execution of a comprehension allocated and destroyed a single-use Python function object.

In Python 3.12, PEP 709 inlines all list, dict, and set comprehensions for better performance (up to two times better in the best case).

The implementation and debugging of PEP 709 also uncovered a pre-existing bytecode compiler bug that could result in silently wrong code execution in Python 3.11, which we fixed.

Eager asyncio tasks

While Python’s asynchronous programming support enables single-process concurrency, it also has noticeable runtime overhead. Every call to an async function creates an extra coroutine object, and the standard asyncio library will often bring additional overhead in the form of Task objects and event loop scheduling.

We observed that, in practice, in a fully async codebase, many async functions are often able to return a result immediately, with no need to suspend. (This may be due to memoization, for example.) In these cases, if the result of the function is immediately awaited (e.g., by await some_async_func(), the most common way to call an async function), the coroutine/Task objects and event loop scheduling can be unnecessary overhead.

Cinder eliminates this overhead via eager async execution. If an async function call is awaited immediately, it is called with a flag set that allows it to return a result directly, if possible, without creating a coroutine object. If an asyncio.gather() is immediately awaited, and all the async functions it gathers are able to return immediately, there’s no need to ever create a Task or schedule it to the event loop.

Fully eager async execution would be an invasive (and breaking) change to Python, and doesn’t work as well with the new Python 3.11+ TaskGroup API for managing concurrent tasks. So in Python 3.12 we added a simpler version of the feature: eager asyncio tasks. With eager tasks, coroutine and Task objects are still created when a result is available immediately, but we can sometimes avoid scheduling the task to the event loop and instead resolve it right away.

This is more efficient, but it is a semantic change, so this feature is opt-in via a custom task factory.

Other asyncio improvements

We also landed a faster C implementation of asyncio.current_task and an optimization to async task creation that shows a win of up to 5 percent on asyncio benchmarks.

Faster super() calls

The new LOAD_SUPER_ATTR opcode optimizes code of the form super().attr and super().method(…). Such code previously had to allocate, and then throw away, a single-use “super” object each time it ran. Now it has little more overhead than an ordinary method call or attribute access.

Other performance optimizations

We also landed two hasattr optimizations and a 3.8x performance improvement to unittest.mock.Mock.

New benchmarks

When we optimize Python for internal use at Meta, we are usually able to test and validate our optimizations directly against our real-world workloads. Optimization work on open-source Python doesn’t have such a production workload to test against and needs to be effective (and avoid regression) on a variety of different workloads.

The Python Performance Benchmark suite is the standard set of benchmarks used in open-source Python optimization work. During the 3.12 development cycle, we contributed several new benchmarks to it so that it more accurately represents workload characteristics we see at Meta.

We added:

A set of async_tree benchmarks that better model an asyncio-heavy workload.
A pair of benchmarks that exercise comprehensions and super() more thoroughly, which were blind spots of the existing benchmark suite.

Cinder hooks

Some parts of Cinder (our JIT compiler and Static Python) wouldn’t make sense as part of upstream CPython (because of limited platform support, C versus C++, semantic changes, and just the size of the code), so our goal is to package these as an independent extension module, CinderX.

This requires a number of new hooks in the core runtime. We landed many of these hooks in Python 3.12:

An API to set the vectorcall entrypoint for a Python function. This gives the JIT an entry point to take over execution for a given function.
We added dictionary watchers, type watchers, function watchers, and code object watchers. All of these allow the Cinder JIT to be notified of dynamic changes that might invalidate its assumptions, so its fast path can remain as fast as possible.
We landed extensibility in the code generator for CPython’s core interpreter that will allow Static Python to easily re-generate an interpreter with added Static Python opcodes, and a C API to visit all GC-tracked objects, which will allow the Cinder JIT to discover functions that were created before it was enabled.
We also added a thread-safe API for writing to perf-map files. Perf-map files allow the Linux perf profiler to give a human-readable name to dynamically-generated sections of machine code, e.g. from a JIT compiler. This API will allow the Cinder JIT to safely write to perf map files without colliding with other JITs or with the new Python 3.12 perf trampoline feature.

These improvements will be useful to anyone building a third party JIT compiler or runtime optimizer for CPython. There are also plans to use the watchers internally in core CPython.

Beyond Python 3.12

Python plays a significant role at Meta. It’s an important part of our infrastructure, including the Instagram server stack. And it’s the lingua franca for our AI/ML work, highlighted by our development of PyTorch, a machine learning framework for a wide range of use cases including computer vision, natural language processing, and more.

Our work with the Python community doesn’t end with the 3.12 release. We are currently discussing a new proposal, PEP 703, with the Python Steering Council to remove the GIL and allow Python to run in multiple threads in parallel. This update could greatly help anyone using Python in a multi-threaded environment.

Meta’s involvement with the Python community also goes beyond code. In 2023, we continued supporting the Developer in Residence program for Python and sponsored events like PyCon US. We also shared our learnings in talks like “Breaking Boundaries: Advancements in High-Performance AI/ML through PyTorch’s Python Compiler” and posts on the Meta Engineering blog.

We are grateful to be a part of this open source community and look forward to working together to move the Python programming language forward.

Acknowledgements

The author would like to acknowledge the following people for their work in contributing to all of these new features: Eddie Elizondo, Vladimir Matveev, Itamar Oren, Steven Troxler, Joshua Xu, Shannon Zhu, Jacob Bower, Pranav Thulasiram Bhat, Ariel Lin, Andrew Frost, and Sam Gross.

Meta Quest 2: Defense through offense

Tue, 12 Sep 2023 18:00:00 +0200

Meta’s Native Assurance team regularly performs manual code reviews as part of our ongoing commitment to improve the security posture of Meta’s products.
In 2021, we discovered a vulnerability in the Meta Quest 2’s Android-based OS that never made it to production but helped us find new ways to improve the security of Meta Quest products.
We’re sharing our journey to get arbitrary native code execution in the privileged VR Runtime service on the Meta Quest 2 by exploiting a memory corruption vulnerability from an unprivileged application over Runtime IPC.

In 2021, the Native Assurance team at Meta (part of the Product Security organization) performed a code review on a privileged service called VR Runtime which provides VR services to client applications on VROS, the Android Open Source Project (AOSP)-based OS for the Meta Quest product line. In the process they found multiple memory corruption vulnerabilities that could be triggered by any installed application.

This vulnerability never made it into production. But to get a better understanding of how exploitation could happen on VROS we decided to use this opportunity to write an elevation-of-privilege exploit that could execute arbitrary native code in VR Runtime. Doing so gave us an even better understanding of what exploitation could look like on VROS and gave us actionable items we’re using to improve the security posture of Meta Quest products.

An introduction to VROS

VROS is an in-house AOSP build that runs on the Meta Quest product line up. It contains customizations on top of AOSP to provide the VR experience on Quest hardware, including firmware, kernel modifications, device drivers, system services, SELinux policies, and applications.

As an Android variant, VROS has many of the same security features as other modern Android systems. For example, it uses SELinux policies to reduce the attack surfaces exposed to unprivileged code running on the device. Because of these protections, modern Android exploits typically require chains of exploits against numerous vulnerabilities to gain control over a device. Attackers attempting to compromise VROS must overcome similar challenges.

Image source: https://source.android.com/docs/core/architecture

On VROS, VR applications are essentially regular Android applications. However, these applications communicate with a variety of system services and hardware to provide the VR experience to users.

VR Runtime

VR Runtime is a service that provides VR features such as time warp and composition to client VR applications. The service is contained within the com.oculus.vrruntimeservice process as part of the com.oculus.systemdriver (VrDriver.apk) package. The VrDriver package is installed to /system/priv-app/ in VROS making com.oculus.vrruntimeservice a privileged service with SELinux domain priv_app. This gives it permissions beyond what are given to normal Android applications.

The VR Runtime service is built on a custom IPC called Runtime IPC that is developed by Meta. Runtime IPC uses UNIX pipes and ashmem shared memory regions to facilitate communication between clients and servers. A native broker process called runtimeipcbroker sits in the middle between clients and servers and manages the initial connection, after which clients and servers communicate directly with one another.

VR application / VR Runtime connections

All VR applications use Runtime IPC to connect to the VR Runtime server running in the com.oculus.vrruntimeservice process using either the VrApi or OpenXR API. The VrApi and OpenXR interfaces load a library dynamically from VrDriver.apk containing the client side of the VR Runtime implementation and use this under the hood to perform various VR operations supported by VR Runtime such as time warp.

This process can be summarized in a sequence of steps:

A loader is linked to all VR applications at build time. This makes it so VR apps can run on multiple products/versions.
When a VR app starts, the loader uses dlopen to load the vrapiimpl.so library installed as part of VrDriver.apk. The loader will obtain the addresses of functions within vrapiimpl.so associated with the public VrApi or OpenXR interface.
After the loader’s execution:
1. The VR application will create a Runtime IPC connection to the VR Runtime server running inside of com.oculus.vrruntimeservice.
2. This process is mediated by the native runtimeipcbroker process, which performs permissions checks and other hand-off responsibilities so that the client and server can communicate directly.
3. From this point forward the connection uses UNIX pipes and shared memory regions for client/server communication.

The VR Runtime attack surface

The default SELinux domain for most applications on VROS is untrusted_app. These applications include those that are installed from the Meta Quest Store as well as those that are sideloaded onto the device. The untrusted_app domain is restrictive and meant to contain the minimum SELinux permissions that an application should need.

Since untrusted applications can communicate with the more privileged VR Runtime server this introduces an elevation of privilege risk. If an untrusted application is able to exploit a vulnerability in the VR Runtime code it will be able to perform operations on the device reserved for privileged applications. Because of this, all inputs from untrusted applications to VR Runtime should be scrutinized heavily.

The most important inputs that VR Runtime processes from untrusted applications are those that originate from RPC requests and from read/write shared memory. The code that processes these inputs consists of the attack surface of VR Runtime, as shown below:

Exploiting VR Runtime

Before diving into the vulnerability and its exploitation, let us explain the exploitation scenario that we considered.

Anyone who owns a Meta Quest headset is able to turn on developer mode, which allows users to sideload applications and have adb / shell access. This doesn’t mean users are able to get root on their devices, but it does give them a large amount of flexibility for interacting with the headset that they would not have otherwise.

We chose to pursue exploitation from the perspective of an application that escalates its privileges on the headset. Such an application could be intentionally malicious or be sideloaded by a user for jailbreaking purposes.

The vulnerability

The vulnerability that we chose for exploitation never made it into a production release, but it was introduced in a code commit in 2021. The commit added processing code for a new type of message that the VR Runtime could receive over Runtime IPC. Here is a redacted code snippet of what the vulnerability looked like:

 REGISTER_RPC_HANDLER(
    SetPerformanceIdealFeatureState,
    [=](const uint32_t clientId,
      const SetPerformanceIdealFeatureStateRequest request,
      bool& response) {
// ...  
PerformanceManagerState->IdealFeaturesState.features_[static_cast(request.Feature)]
          .status_ = request.Status;     
PerformanceManagerState->IdealFeaturesState.features_[static_cast(request.Feature)]
          .fidelity_ = request.Fidelity;
// ...
      response = true;
      return reflect::RPCResult_Complete;
    })

The request parameter is an object that is built based on what is received over Runtime IPC. This means both request.Feature and request.Status are attacker controlled. The PerformanceManagerState->IdealFeaturesState.features_ variable is a statically-sized array and lives in the .bss section of the libvrruntimeservice.so module. PerformanceManagerState->IdealFeaturesState.features_ is structured as follows:

enum class FeatureFidelity : uint32_t { ... };
enum class FeatureStatus : uint32_t { ... };
struct FeatureState {
  FeatureFidelity fidelity_;
  FeatureStatus status_;
};
struct FeaturesState {
  std::array features_;
};

Since request.Feature and request.Status are attacker controlled and PerformanceManagerState->IdealFeaturesState.features_ is a statically-sized array, the vulnerability gives an attacker the ability to perform arbitrary 8-byte-long corruptions at arbitrary offsets (32-bit limit). Any VR application can trigger this vulnerability by sending a specially crafted SetPerformanceIdealFeatureState Runtime IPC message. Moreover, the vulnerability is stable and can be repeated.

Hijacking control-flow

The end goal for our exploit was arbitrary native code execution. We needed to turn this 8-byte write vulnerability into something useful for an attacker. The first step was to find a corruption target to take control of the program counter.

Thankfully for us, VR Runtime is a complex stateful piece of software and there are a lot of interesting potential targets inside its .bss section. The ideal corruption target for us was a function pointer that:

Is stored at an arbitrary offset right after the global array. This is important because it means we can use the 8-byte write primitive to corrupt and control its value.
Has an attacker-reachable call site that invokes it. This is important because without a call site invoking the function pointer, we can’t take over the control flow.

To enumerate the corruption targets that were reachable from the write primitive, we used Ghidra to manually analyze the layout of the .bss section of the libvrruntimeservice.so binary. First, we located where the array is stored in the section. This location corresponds to the beginning of the PerformanceManagerState->IdeaFeatureState.features_ array that you can see below.

We then searched for forward reachable corruption targets that were contained within the libvrruntimservice.so binary. Lucky for us, we found an array of function pointers that are dynamically resolved at runtime and stored within a global instance of an ovrVulkanLoader object. The function pointers contained within ovrVulkanLoader point into the libvulkan.so module providing the Vulkan interface. The Vulkan interface function pointer calls are invokable indirectly from attacker-controlled inputs over RPC. These two properties satisfy the two exploitation criteria we mentioned earlier.

With that in mind, we looked for a function pointer that we knew could be invoked indirectly from an RPC command. We chose to overwrite the vkGetPhysicalDeviceImageFormatProperties function pointer, which can be called from a control flow originating from the CreateSwapChain Runtime IPC RPC command.

Below is a decompilation output of the CreateTextureSwapChainVulkan function that invokes the vkGetPhysicalDeviceImageFormatProperties function pointer:

To hijack control flow, we first used the write primitive to corrupt the vkGetPhysicalDeviceImageFormatProperties function pointer and then crafted an RPC command that triggered the CreateTextureSwapChainVulkan function. This eventually allowed us to control the program counter:

Bypassing Address Space Layout Randomization (ASLR)

We turned this corruption primitive into something that allowed us to control the program counter of the target. Address Space Layout Randomization (ASLR) is an exploit mitigation that makes it difficult for exploits to predict the address space of the target. Because of ASLR, we had no knowledge of the target address space: We didn’t know where libraries were loaded and didn’t know where the heap or stack was. Knowing these locations is extremely useful for an attacker because they can redirect the execution flow to loaded libraries and reuse some of their code. This is a technique known as jump-oriented programming (JOP) or return-oriented programming (a specific case of JOP).

Bypassing ASLR is a common problem in modern exploitation and the answer is usually to:

Find or manufacture a way to leak hints about the address-space (function addresses, saved-return addresses, heap pointers, etc.).
Find another way.

We explored both of those options and eventually stumbled upon something rather interesting:

$ adb shell ps -A
USER           PID  PPID     VSZ    RSS WCHAN            ADDR S NAME                       
root           694     1 5367252 128760 poll_schedule_timeout 0 S zygote64
u0_a5         1898   694 5801656 112280 ptrace_stop         0 t com.oculus.vrruntimeservice
u0_a80        7519   694 5383760 104720 do_epoll_wait       0 S com.oculus.vrexploit

In the above, you can see that our application and our target have been forked off the zygote64 process. The result is that our process inherits the same address space from the zygote64 process as the VR Runtime process. This means that the loaded libraries in the zygote64 process at fork time will be loaded at the same addresses in both of those processes.

This is extremely useful because it means that we don’t need to break ASLR anymore since we have detailed knowledge of where numerous libraries reside in memory. Below shows an example where the libc.so module is loaded at 0x7dae043000 in both processes:

$ adb shell cat /proc/1898/maps | grep libc.so
7dae043000-7dae084000 r--p 00000000 fd:00 286     /apex/com.android.runtime/lib64/bionic/libc.so
7dae084000-7dae11e000 --xp 00040000 fd:00 286     /apex/com.android.runtime/lib64/bionic/libc.so
7dae11e000-7dae126000 r--p 000d9000 fd:00 286     /apex/com.android.runtime/lib64/bionic/libc.so
7dae126000-7dae129000 rw-p 000e0000 fd:00 286     /apex/com.android.runtime/lib64/bionic/libc.so
$ adb shell cat /proc/7519/maps | grep libc.so
7dae043000-7dae084000 r--p 00000000 fd:00 286     /apex/com.android.runtime/lib64/bionic/libc.so
7dae084000-7dae11e000 --xp 00040000 fd:00 286     /apex/com.android.runtime/lib64/bionic/libc.so
7dae11e000-7dae126000 r--p 000d9000 fd:00 286     /apex/com.android.runtime/lib64/bionic/libc.so
7dae126000-7dae129000 rw-p 000e0000 fd:00 286     /apex/com.android.runtime/lib64/bionic/libc.so

Using this knowledge, we enumerated all shared libraries in both address spaces and looked for code reuse gadgets in them. At this point there were literally millions of code reuse gadgets in a file that we needed to sift through to assemble a JOP chain and accomplish our goal.

...
0x240b4: ldr x8, [x0]; ldr x8, [x8, #0x40]; blr x8; 
0x23ad0: ldr x8, [x0]; ldr x8, [x8, #0x48]; blr x8; 
0x23ab0: ldr x8, [x0]; ldr x8, [x8, #0x50]; blr x8; 
0x24040: ldr x8, [x0]; ldr x8, [x8, #0x70]; blr x8; 
0x23100: ldr x8, [x0]; ldr x8, [x8, #8]; blr x8; 
0x23ae0: ldr x8, [x0]; ldr x8, [x8]; blr x8; 
0x22ba8: ldr x8, [x0]; ldr x9, [x8, #0x30]; add x8, sp, #8; blr x9; 
0x231e0: ldr x8, [x0]; mov x19, x0; ldr x8, [x8, #0x58]; blr x8; 
0x208fc: ldr x8, [x0]; rev x0, x8; ret; 
0x231f0: ldr x8, [x19]; mov w20, w0; mov x0, x19; ldr x8, [x8, #0x60]; blr x8; 
0x22de4: ldr x8, [x1]; mov x0, x1; ldr x8, [x8, #0x70]; blr x8; 
0x179e4: ldr x8, [x20], #0x10; sub x19, x19, #1; ldr x8, [x8]; blr x8; 
0x17ea4: ldr x8, [x21]; mov x0, x21; ldr x8, [x8, #0x10]; blr x8; 
0x23b0c: ldr x8, [x21]; mov x0, x21; mov x1, x20; ldr x8, [x8, #0x48]; blr x8; 
0x17b38: ldr x8, [x22], #0x10; mov x0, x21; ldr x8, [x8]; blr x8; 
0x17ad8: ldr x8, [x22], #0xfffffffffffffff0; mov x0, x21; ldr x8, [x8]; blr x8; 
0x23be0: ldr x8, [x22]; mov w23, w0; mov x0, x22; ldr x8, [x8, #0x60]; blr x8;

We now had control over the execution flow, knew where a large subset of libraries loaded in the VR Runtime are placed in memory, and had a list of code reuse gadgets. The next step was to actually write the exploit to execute a payload of our choosing in the VR Runtime process.

Exploitation

As a reminder, our exploitation scenario was from the perspective of an already installed untrusted application. Our approach for exploitation was to get the VR Runtime process to load a shared library using dlopen from our application APK. When VR Runtime loaded the library, our payload would be executed automatically as part of the loaded library’s initialization function.

Accomplishing this meant we needed a JOP chain that performed the following sequence of operations:

Assign a pointer to $x0 (the first function argument in the ARM64 ABI) pointing to a path of a shared module we placed in our exploit APK.
Redirect the program counter to dlopen.

To build our JOP chain we filtered the list of gadgets based on the registers and memory we controlled at the time of hijack. The state at the time of the hijack is illustrated below:

Recall that the $x0 register at the time of the control flow transfer to dlopen corresponds to the path argument. The problem we now had to solve was how do we load $x0 with a pointer to a string we control? This is tricky because the only place we were able to insert controlled data is the .bss section of the target. But we didn’t know its location in memory, so we couldn’t hardcode its address.

One thing that was very helpful for us is that there happened to be a pointer to the .bss section (ovrVulkanLoader) in the $x21 register at the time of control flow hijack. This meant that in theory we could simply move $x21 or a value offset from $x21 into $x0. This would give us our controlled path argument to dlopen, solving our problem.

After hours of sifting through gadgets, we eventually found one that did exactly what we needed and also allowed us to keep control flow:

ldr        x2,[x21 , #0x80 ]
mov        w1,#0x1000
mov        x0,x21
blr        x2

We could then use another gadget to set $x1 (the second function argument in the ARM64 ABI) to a sane value and invoke dlopen:

mov        w1,#0x2
bl         ::dlopen undefined dlopen()

Luckily, the write vulnerability we used in the exploit was also repeatable. This meant that we could overwrite multiple locations in memory offset from $x21 (ovrVulkanLoader). We ended up using multiple RPC commands to overwrite memory in the way we needed for setting up our gadget state and only afterwards triggering the control flow hijack.

Using this approach, we set up the gadget state to combine the two gadgets above and were able to load our shared module giving us arbitrary native code execution:

  // Corrupt the `vulkanLoader.vkGetPhysicalDeviceImageFormatProperties` pointer which is
  // at +0x68. We hijack control flow by triggering a function call in
  // ovrSwapChain::CreateTextureSwapChainVulkan.
  // First gadget in eglSubDriverAndroid.so
  //  0010b3ac a2  42  40  f9    ldr        x2,[x21 , #0x80 ]
  //  0010b3b0 e1  03  14  32    mov        w1,#0x1000
  //  0010b3b4 e0  03  15  aa    mov        x0,x21
  //  0010b3b8 40  00  3f  d6    blr        x2
  const uint64_t vkGetPhysicalDeviceImageFormatPropertiesOffset = VulkanLoaderOffset + 0x68;
  const uint64_t FirstGadget = ModuleMap.at("eglSubDriverAndroid.so") + 0xb3'ac;
  Corruptions.emplace_back(vkGetPhysicalDeviceImageFormatPropertiesOffset, FirstGadget);
  // Second gadget in libcutils.so:
  //  0010bc78 41  00  80  52    mov        w1,#0x2
  //  0010bc7c ad  0d  00  94    bl         ::dlopen undefined dlopen()
  const uint64_t SecondGadget = ModuleMap.at("/system/lib64/libcutils.so") + 0xbc'78;
  Corruptions.emplace_back(VulkanLoaderOffset + 0x80, SecondGadget);

And below is what it looked like from GDB (GNU Debugger):

(gdb) break *0x7c98012c78
Breakpoint 1 at 0x7c98012c78
(gdb) c
Continuing.
Thread 41 "Thread-15" hit Breakpoint 1, 0x0000007c98012c78 in ?? ()
(gdb) x/s $x0
0x7bb11633e8:   "/data/app/com.oculus.vrexploit-OjL813hdSAtlc3fEkJKdrg==/lib/arm64/libinject-arm64.so"
(gdb) c
Continuing.
warning: Could not load shared library symbols for /data/app/com.oculus.vrexploit-OjL813hdSAtlc3fEkJKdrg==/lib/arm64/libinject-arm64.so.

At that point, we accomplished our goal and were able to execute arbitrary native code in the VR Runtime process.

What we learned

We tried to derive as much value out of the exercise as possible with a focus on actionable items we could use to improve the security posture of Meta products. We won’t list all the outcomes in this post but here are some of the most notable.

RELRO for function pointers in RW global memory

One of the patterns we noticed early in the exercise was that the VR Runtime service contained many function pointers in global memory. The VR Runtime process loads these function pointers early in its initialization by first calling dlopen on certain system installed libraries and then using dlsym to assign a given function pointer with its associated address.

This approach provides flexibility to developers to use vendor libraries providing a common API across products (e.g., libvulkan.so). The downside is that the function pointers are stored in readable and writable memory, making them prime targets for memory corruption-based overwrites. In VR Runtime’s case, they were stored in global readable writable memory that happened to be reachable from our out-of-bounds write exploitation primitive. Additionally, these function pointers are not protected by compiler mitigations such as control flow integrity.

As an outcome of our exploitation exercise, we explored different strategies to protect these function pointers after their initial assignment. One strategy was to try and mirror the well-known full relocation read-only (RELRO) mitigation that is used to protect pointers to functions in other libraries computed by the dynamic linker at load time. In full RELRO, the mappings containing these pointers are made read-only after they are initialized, which prevents malicious writes from overwriting their contents.

We made multiple changes to the VR Runtime code to mark function pointers in global memory to be read only after we initialized them. Had this protection been in place it would have made our exploitation much more difficult. We are now working on generalizing this approach by building an LLVM compiler pass that implements the technique.

Thoughts on SELinux

One of the most frustrating things for us during exploit development was the constraints imposed on us by SELinux. With that said, we were pleasantly surprised that we could load a .so library out of an untrusted application’s data directory as a privileged application. This is because Android’s default SELinux policy enables privileged applications (typically installed to platform_app, system_app, or priv_app) to execute code under /data/app, which is where untrusted applications are commonly installed.

Android supports this behavior because it allows for updates to privileged applications outside of OTA updates. This allows privileged applications signed with the same certificate as the original to be updated in a more lightweight manner. An updated privileged application is installed to /data/app, but retains its privileged SELinux context.

While we did not develop a solution to this issue, we feel it’s worth calling out as a potential area for improvement on Android. In general, we don’t believe that privileged applications should be able to execute code owned by lesser privileged applications.

About Meta’s Native Assurance team

The Meta Native Assurance team that performed this exploit exercise is part of a larger product security group that performs proactive security work on Meta’s products. Some examples of this work include fuzzing, static analysis, architecture/implementation reviews, attack surface reduction, exploit mitigations, and more. In addition, Meta also offers a bug bounty program to incentivize security research across its entire external attack surface, including the VR and AR products.

Using Chakra execution traces for benchmarking and network performance optimization

Thu, 07 Sep 2023 21:35:00 +0200

Meta presents Chakra execution traces, an open graph-based representation of AI/ML workload execution, laying the foundation for benchmarking and network performance optimization.
Chakra execution traces represent key operations, such as compute, memory, and communication, data and control dependencies, timing, and resource constraints.
In collaboration with MLCommons, we are seeking industry-wide adoption for benchmarking.
Meta open sourced a set of tools to enable the collection, analysis, generation, and adoption of Chakra execution traces by a broad range of simulators, emulators, and replay tools.

At Meta, our endeavors are not only geared towards pushing the boundaries of AI/ML but also towards optimizing the vast networks that enable these computations. Our agile, reproducible, and standardized benchmarking system plays an important role in this. Through our collaboration with MLCommons, and our deep insights into traditional benchmarking constraints, we have initiated the Chakra execution traces—a graph-based representation of AI/ML workloads. This approach aims to unify diverse execution trace schemas, seeking industry-wide adoption for enhanced AI efficiency analysis tools and holistic performance benchmarking.

The limitations of traditional AI benchmarking methodology

Traditionally, benchmarking AI systems has largely relied on running full ML workloads. Established benchmarking approaches, such as MLPerf, have provided invaluable insights into the behavior and performance of AI workloads and systems. However, traditional full workload benchmarking presents several challenges:

Difficulty in forecasting future system performance: When designing an AI system, engineers frequently face the challenge of predicting the performance of future systems. Such predictions become even more complex when the compute engines aren’t ready or when changes in network topology and bandwidth become necessary. Relying on full workloads to evaluate the performance of these not-yet-realized systems is not feasible.
High compute cost: Executing full workload benchmarks comes at a substantial compute cost. Given that training contemporary ML models often requires thousands of graphics processing units (GPUs), these benchmarks should ideally be executed on a similarly vast number of GPUs. Additionally, gauging the performance of a system using this method can be time-consuming.
Inability to adapt to evolving workloads: The landscape of ML workloads and their requirements is rapidly evolving. Traditional full workload benchmarks fall short when it comes to addressing these changing needs, primarily because they necessitate significant efforts to standardize workloads as benchmarks.

An overview of Chakra

Building upon our insights into the constraints of traditional benchmarking, we present the Chakra execution traces. This new approach provides an open, interoperable graph-based depiction of AI/ML workload execution. The Chakra execution trace captures core operations—including compute, memory, and communication—along with their dependencies, timing, and metadata.

Though execution traces are a valuable representation of an ML task, the structure and metadata of the resulting traces can differ based on the ML framework utilized. Recognizing this, Chakra introduces a standardized schema for performance modeling, termed the Chakra execution trace. The below figure outlines the Chakra ecosystem, with execution traces as its central component. As depicted in the figure, Chakra also offers a range of tools to convert, visualize, generate, and simulate these execution traces.

How Meta leverages Chakra execution traces

At Meta, we collect execution traces from our production servers every day. These execution traces serve multiple purposes: Benchmarking, visualization, and performance optimization.

Benchmarking

Benchmarking is essential for improving current AI systems and planning future networks. We specifically utilize Chakra execution traces for this task. We have developed several benchmarking tools, including Mystique and PARAM. Mystique allows us to replicate the performance of an ML workload by replaying both compute and communication operators found in execution traces. It leverages the Chakra execution trace to record runtime details of a model at the operator level and then replays them to reproduce the original performance. In line with our vision, the MLCommons Chakra working group is curating the ‘Chakra trace benchmark suite’ by gathering execution traces from various industry players.

Visualization and performance optimization

One example of visualization and performance optimization is the analysis of collective message sizes. We analyze production execution traces using an automated system. The visual data generated aids us in identifying any balance or imbalance in collective message sizes across different ranks. Our visualization tool can precisely highlight these imbalances, as shown by the below figure.

With this information at hand, Meta engineers are equipped to craft appropriate solutions, ensuring a balanced message size, as demonstrated in the below figure.

Future plans

Enhancing the benchmarking capability of Chakra execution traces

While the execution trace replayer enables replay of execution traces, it brings forth challenges. A primary challenge is the intrinsic linkage of collected execution traces to specific systems. Because traces are gathered from actual machine runs, the kernels executed are optimized for the specific system at play. As a result, traces sourced from one system might not accurately simulate on another with a different GPU, network topology, and bandwidth.

We’re addressing this constraint in collaboration with the MLCommons Chakra working group. We aspire to gather execution traces prior to the operator optimization phase for any target system, as shown in the figure. These are termed pre-execution traces. In parallel, to enable benchmarking next-gen AI systems, we’re streamlining the process from trace collection to simulation on a simulator.

Using AI to generate representative execution traces

Chakra execution traces are capable of identifying network bottlenecks in ML workload execution. However, optimizing SW/HW stacks with production execution traces presents a practical challenge. The main challenge arises when trying to globally optimize our production systems. Given the sheer volume of production traces, exhaustively running them for system optimization is neither feasible nor efficient. Doing so would be both time-consuming and computationally expensive. Thus, selecting a representative subset of production execution traces becomes imperative.

However, there’s a risk: The chosen traces might not holistically represent the global characteristics, potentially skewing optimization efforts towards only specific ML workloads. We envision a generative AI model that can identify and generate execution traces that are representative of the primary characteristics observed. We also plan to incorporate an obfuscation mechanism within the AI model. This will facilitate trace sharing without jeopardizing intellectual property, fostering SW/HW co-design between different companies.

Taking the leap with industry collaboration

For such an ecosystem to flourish, industry consensus is paramount. Our collaboration with the MLCommons consortium, an open engineering assembly of over 50 leading companies, is a testament to our commitment. This collaboration aims to establish Chakra within its fold, providing a framework for broad adoption.

Chakra’s working group under MLCommons will spearhead efforts to create and develop:

A standardized schema that can capture and convert execution traces from diverse frameworks.
ML models for creating representative Chakra execution traces – protecting proprietary information while also projecting future AI workloads.
An open ecosystem of tools for benchmarks, simulations, and emulations.
Comprehensive benchmarks with Chakra execution traces based on MLCommons/MLPerf guidelines.

Join us on this journey

Our vision is to forge an agile, reproducible benchmarking and co-design system for AI. Collaboration with peers, academic institutions, and consortiums will be pivotal. We invite interested individuals and companies to become a part of the Chakra working group, to help contribute to the paradigm shift in benchmarking and network performance optimization.

Read the research paper

Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces

Acknowledgements

We would like to thank all contributors to the Chakra project within Meta: Taekyung Heo, Srinivas Sridharan, Brian Coutinho, Hiwot Kassa, Matt Bergeron, Parth Malani, Shashi Gandham, Omar Baldonado, our external partners in Georgia Tech and MLCommons, as well as external collaborators in AMD, CMU, Cornell, Enfabrica, Google, Harvard, HP Labs, Intel, Keysight Technologies, Microsoft, NVIDIA, OCP, and Stanford.

Arcadia: An end-to-end AI system performance simulator

Thu, 07 Sep 2023 21:10:00 +0200

We’re introducing Arcadia, Meta’s unified system that simulates the compute, memory, and network performance of AI training clusters.
Extracting maximum performance from an AI cluster and increasing overall efficiency warrants a multi-input system that accounts for various hardware and software parameters across compute, storage, and network collectively.
Arcadia gives Meta’s researchers and engineers valuable insights into the performance of AI models and workloads in an AI cluster – enabling data-driven decision making in the design of AI clusters.

AI plays an important role in the work we do at Meta. We leverage AI to deliver more personalized experiences and recommendations for people across our family of apps. We’re also committed to advancing the state-of-the-art in generative AI, computer vision, new augmented reality (AR) tools, natural language processing (NLP), and other core areas of AI for a wide range of applications.

Delivering on these commitments means maximizing the performance of every GPU within our AI clusters across three performance pillars: Compute, memory, and network.

Within these pillars, AI cluster performance can be influenced by multiple factors, including

model parameters, workload distribution, job scheduler logic, topology, and hardware specs. But focusing on these pillars in isolation leads to local performance optimization efforts that are unable to tap into the full extent of cluster performance. From an organizational perspective, this further leads to decreased efficiencies because multiple efforts with the same goal of increasing cluster performance aren’t being holistically prioritized. These challenges will only grow as large language models (LLMs) become more prevalent.

We need a systemized source of truth that can simulate various performance factors across compute, storage, and network collectively. That’s where Arcadia, Meta’s end-to-end AI system performance simulator, comes in. Arcadia is designed to create a unified simulation framework that accurately models the performance of compute, memory, and network components within large-scale AI training clusters. Using insights from Arcadia, our engineers and developers can make data-driven design decisions for AI clusters and infrastructure that supports it while they are being developed.

Challenges of optimizing AI clusters

When we think about optimizing our AI clusters, there are several factors to take into consideration:

Our large-scale distributed system: Advancement in any area of AI, whether it is computer vision, speech, or NLP requires training large and complex models. At Meta, this is facilitated by multiple high performing computing infrastructure clusters. For instance, the AI Research SuperCluster for AI research.
Our multi-layered system: At Meta, we control the stack from physical network to applications. This translates into multiple tunable parameters across network, compute, memory, application, and scheduling to achieve the desired model performance. Finding the right set of parameters and inputs for achieving good model performance is an iterative task that can increase training time significantly.
Our operational workflows: The availability of the underlying infrastructure is a major factor that can influence model training time. For instance, a component failure can trigger a job to be rolled-back to a previous checkpoint and progress made would be lost. At our scale, operating such clusters without operational awareness data can lead to performance losses.
AI workload characteristics: Our training clusters cater to workloads across multiple use cases that may exhibit different sets of characteristics ranging from memory and compute-intensive, to latency-sensitive, and parallelizable. Keeping track of these characteristics across multiple workloads is already challenging. But the problem’s complexity increases by an order of magnitude due to uncertainty around future workloads and predicting various workloads for optimum performance.
The need for a common source of truth: Interdisciplinary research efforts, such as AI cluster-performance optimization, span multiple teams across network, compute, and storage. These teams may be working on outdated assumptions about other pillars as they drive their own local optimization efforts. Lack of a holistic approach in such cases often leads to organizational inefficiencies such as decision-making challenges and duplicative efforts.

The Arcadia system

Our primary objective with Arcadia is to develop a multi-disciplinary performance analysis system that enables design and joint optimization across various system levels, including application, network, and hardware.

Arcadia empowers stakeholders to examine and enhance different aspects such as machine learning (ML) model architectures, collective algorithms, job scheduling, hardware, and network architecture design. By providing insights into the impact of these factors on system performance, Arcadia facilitates data-driven decision-making processes and fosters the evolution of models and hardware.

Inputs

As shown in the architecture design above, the input to the Arcadia system encompasses a range of essential parameters, including the long-range plans of AI systems and models, network topology and routing protocols, data center floor plans, AI workload distributions, and hardware specifications. Additionally, Arcadia considers failure domains to provide a holistic view of the system’s performance and reliability.

Core

At the core of Arcadia is an orchestrator that coordinates the simulation of various components, including job scheduling, compute and memory, and network behavior at different levels. The system employs an AI workload synthesizer that learns from production distributions and generates representative workloads as inputs, ensuring the simulation reflects real-world conditions.

Outputs

Arcadia offers a wide range of outputs, including AI training and inference performance metrics, resource utilizations, and reliability and availability metrics. This comprehensive set of metrics empowers stakeholders to analyze the impact of different factors and make informed decisions to optimize system performance.

Feedback Loop

Unlike analytical roofline estimates, Arcadia takes into account the network and compute feedback loop, providing accurate estimations of performance that align with real-world production measurements. This capability allows for more precise predictions and a better understanding of the expected performance of AI models and workloads on a given infrastructure.

Arcadia’s benefits

Arcadia provides operational insights and a level of flexibility in simulation that allows us to address several challenges around optimizing our clusters.

Operational workflows benefit significantly from Arcadia’s simulation capabilities, providing enhanced visibility and a deeper understanding of risk and mitigation plans. Simulation-based audits for AI/HPC network maintenance can be conducted to identify potential issues and devise appropriate solutions. Maintenance scheduling can be optimized by leveraging Arcadia’s insights, ensuring minimal disruption to AI/HPC jobs. Furthermore, Arcadia aids in debugging and root-causing production events, enabling efficient troubleshooting and preventing recurrence of issues.

Arcadia offers flexibility in terms of simulation detail levels, catering to different user requirements and purposes. Users who focus solely on the application level can disregard lower-level details, enabling faster simulation runs. On the other hand, for example, users requiring in-depth analysis of low-level network hardware behaviors can leverage Arcadia’s packet-level network simulation to extract detailed insights.

Furthermore, Arcadia serves as a single source of truth that is agreed upon by all stakeholders. This unified approach helps ensure consistent and reliable performance analysis across teams and disciplines, establishing a common framework for hardware, network, job-scheduling, and AI systems co-design.

Use cases for Arcadia

There are several use cases for Arcadia system in pursuit of large-scale, efficient high-performance clusters:

Cluster utilization and fragmentation insights
Measuring the impact of network and hardware on AI/HPC job performance
AI/HPC job profile analysis in the training clusters
Assessing the reliability, availability, and efficiency of training clusters
Optimization of training cluster maintenances
Optimization of AI/HPC job scheduling and configurations

Next steps for Arcadia

As we build out more use cases for Arcadia we’re also developing additional frameworks to expand on its capabilities. This will include a framework to support operational cases in production networks, such as optimizing training cluster maintenance and AI/HPC job scheduling and configurations.

We’re also investigating a framework to provide design insights for different topology/routing designs given a set of known models. This would be used to surface key bottlenecks in compute, memory, or network and provide insights on how different model parameters can be optimized for a given cluster.

Finally, we’re aiming for Arcadia to support inputs from Chakra, an open, graph-based representation of AI/ML workloads being developed in a working group in MLCommons.

Acknowledgments

Many people contributed to this project but we’d particularly like to thank Naader Hasani, Petr Lapukhov, Mikel Jimenez Fernandez, Thomas Fuller, Xin Liu, Greg Steinbrecher, Yuhui Zhang, Max Noormohammadpour, Mustafa Ozdal, Kapil Bisht, Phil Buonadonna, Josh Gilliland, Abishek Gopalan, Biao Lu, Gaya Nagarajan, Steve Politis, Kevin Quirk, Jimmy Williams, Yi Zhang, and Ying Zhang.

Threads: The inside story of Meta’s newest social app

Thu, 07 Sep 2023 20:00:00 +0200

Earlier this year, a small team of engineers at Meta started working on an idea for a new app. It would have all the features people expect from a text-based conversations app, but with one very key, distinctive goal – being an app that would allow people to share their content across multiple platforms. We wanted to build a decentralized (or federated) app that would enable people to post content that is viewable by anyone on other social apps, and vice versa.

On July 5, people were greeted with a surprise when they logged into Instagram – an invitation to try a brand new app for sharing text and joining public conversations – Threads.

Five days later, over 100 million people had joined Threads, making it the most successful app launch of all time.

It took a small, nimble engineering team working alongside Meta’s infrastructure teams to take Threads from zero to 100 million people in record time with no major downtime. The story of its development and initial launch is the story of Meta’s ability to scale with speed and impact. But Threads is still evolving. We are working toward making Threads compatible with the open, interoperable social networks that we believe can shape the future of the internet – where peoples’ content exists in the fediverse and is platform-agnostic.

Planning and developing Threads

Threads was developed in an environment more akin to a startup. Creating a new app with such a small team meant assembling a group with a high level of trust – where everyone was aligned toward a singular goal and there was close alignment with our leadership, like the Head of Instagram, Adam Mosseri. People had to move fast and work independently, even as the team grew to roughly 60 engineers over the course of a few months. Daily burndown syncs, where we met to prioritize tasks, and setting monthly milestones were crucial.

Instagram uses Python (Django) for its backend. By using the same backend for Threads, we could leverage a lot of the existing tech stack for Threads and reuse most of our existing data models, business logic, security features, and server infrastructure. This also meant users could sign in to the app with their existing Instagram account, making it super simple to onboard and set up your Threads app.

The Threads mobile apps themselves were built primarily with Swift on iOS, and Jetpack Compose on Android. But building an app that needed to be ready to launch at any time, and moving on such a strict timeline, meant we had to be very intentional in determining the core set of minimum viable features the app could realistically roll out with. Features like keyword search, which rolled out to more countries today, and private messaging were put on hold for later updates, as were the decentralized features.

You also can’t go from zero to 100 million users without having an infrastructure in place that can handle that level of growth with efficiency and reliability. Threads scaled successfully to 100 million users without any major downtime thanks to Meta’s underlying infrastructure and engineering foundations, which were critical to the successful launch.

Over the years, Meta’s larger infrastructure, foundation, production, and engineering teams have already done a lot of the heavy lifting to build the infrastructure that has allowed us to scale Threads. Without this level of close collaboration we wouldn’t have been able to support scaling the app to so many people so quickly.

The Threads launch

With everything moving ahead at full speed toward a July launch, our team saw and took an opportunity to release the app early. On the upside, this meant Meta would be able to offer a new social app at a time when people were eager for a new experience. But the earlier rollout also meant challenges around reduced time to dark test and ensure distribution channels were properly configured.

Once the decision was made to launch, we set up launch rooms with teams across San Francisco, Menlo Park, New York, and London. Engineers from all parts of the Infrastructure team, the product teams, as well as on-call engineers were pulled in. At Meta, we have internal monitoring tools like ODS and Scuba that help us track important metrics that were displayed on dashboards throughout launch to monitor app health. Having tools like these in place was an important part of addressing scaling challenges that arose with the launch.

Some of the engineers on the Threads team had worked on a launch of this capacity previously, while others had not. But for the infra engineers this was just another day at the office. On the day of the launch, we had engineers, especially production engineers, that were incredible throughout the process – herding us all together and coordinating what we needed to do.

The future is decentralized

Shortly after launch, we signaled that we are taking open standards seriously with Threads by allowing people to use their Threads profile to verify their identity on supported platforms like Mastodon.

Some people aren’t familiar with the idea of the fediverse (federated universe), but, for me, the easiest analogy is to think of it like email. Let’s say someone uses Gmail and another person uses Yahoo Mail. We don’t think twice that they can send and receive emails to each other, even though one is on a Google platform and the other is on Yahoo. That’s because email is interoperable and these two companies both conform to the SMTP protocol for delivering emails. But you can’t do that today on social media. People on Instagram can’t follow other creators on Tumblr, for example.

Our goal with Threads is to make social content as interoperable as email. We are working on the ability for Threads to integrate with ActivityPub, the open, decentralized social networking protocol. Once that happens people will be able to enjoy the best features of Threads across platforms. More importantly, they’ll be able to have more control over their online social presence, regardless of any app or platform. They’ll have the ability to distribute their posts to other social media apps, and consume content from creators on other apps on Threads.

It’s all about connecting with more people and helping the world be more open.

Code wins arguments

Now that Threads is out in the world, we’ve shifted our focus to improving the product, continuing to build new features and closing the feature gaps that people expect from an app like Threads. We recently launched Threads on Web and other new feature updates have already been rolled out, including Following feed, the ability to edit alt-text, and the ability to share a Thread to Instagram DMs.

Last week, Meta announced that we began testing keyword search, which allows people to search specific keywords they’re interested in. Today, we announced we’ll start rolling out the feature in English and Spanish, in countries where most people post in these languages – such as Argentina, India, Mexico, the United Kingdom and the United States – on both mobile and web.

We could have built and designed the Threads app a million different ways. But we were able to complete the technical work for a new app in five months because we were given the freedom to operate in a completely greenfield space, where we could prototype things, test our ideas, and get internal feedback very quickly.

Code wins arguments. If we tested a feature and didn’t like it, we rebuilt it until we landed on the best version. For some surfaces, like the activity feed, we rewrote it three times before we finally landed on an implementation that felt good enough.

My ultimate hope for Threads is that it becomes the zeitgeist of the internet. It’ll be the place you go to where you want to have positive conversations about the latest cultural events. It’ll be where you want to go to see the latest conversations between creators, or between other people that you’re interested in. I really hope to see that happen over the next few months as we continue to improve the product for you all.

What’s it like to write code at Meta?

Tue, 05 Sep 2023 18:00:00 +0200

Ever wonder what it’s like to write code at Meta’s scale?

On the latest episode of the Meta Tech Podcast, Meta engineer Pascal Hartig (@passy) sits down with Dustin Shahidehpour and Katherine Zak, two software engineers at Meta, about their careers and what it’s really like to ship code at Meta.

Why does Meta have a monorepo? What’s it like doing pre-commit code reviews? And what does Meta’s CI infrastructure look like? And, “How is stuff not constantly on fire?” We cover these questions and many more!

Dustin also recently authored a blog on The evolution of Facebook’s iOS app architecture.

Download or listen to to the episode below:

[embedded content]

You can also listen to the episode wherever you get your podcasts:

The Meta Tech Podcast is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.

Scheduling Jupyter Notebooks at Meta

Tue, 29 Aug 2023 18:00:00 +0200

At Meta, Bento is our internal Jupyter notebooks platform that is leveraged by many internal users. Notebooks are also being used widely for creating reports and workflows (for example, performing data ETL) that need to be repeated at certain intervals. Users with such notebooks would have to remember to manually run their notebooks at the required cadence – a process people might forget because it does not scale with the number of notebooks used.

To address this problem, we invested in building a scheduled notebooks infrastructure that fits in seamlessly with the rest of the internal tooling available at Meta. Investing in infrastructure helps ensure that privacy is inherent in everything we build. It enables us to continue building innovative, valuable solutions in a privacy-safe way.

The ability to transparently answer questions about data flow through Meta systems for purposes of data privacy and complying with regulations differentiates our scheduled notebooks implementation from the rest of the industry.

In this post, we’ll explain how we married Bento with our batch ETL pipeline framework called Dataswarm (think Apache Airflow) in a privacy and lineage-aware manner.

The challenge around doing scheduled notebooks at Meta

At Meta, we’re committed to improving confidence in production by performing static analysis on scheduled artifacts and maintaining coherent narratives around dataflows by leveraging transparent Dataswarm Operators and data annotations. Notebooks pose a special challenge because:

Due to dynamic code content (think table names created via f-strings, for instance), static analysis won’t work, making it harder to understand data lineage.
Since notebooks can have any arbitrary code, their execution in production is considered “opaque” as data lineage cannot be determined, validated, or recorded.
Scheduled notebooks are considered to be on the production side of the production-development barrier. Before anything runs in production, it needs to be reviewed, and reviewing notebook code is non-trivial.

These three considerations shaped and influenced our design decisions. In particular, we limited notebooks that can be scheduled to those primarily performing ETL and those performing data transformations and displaying visualizations. Notebooks with any other side effects are currently out of scope and are not eligible to be scheduled.

How scheduled notebooks work at Meta

There are three main components for supporting scheduled notebooks:

The UI for setting up a schedule and creating a diff (Meta’s pull request equivalent) that needs to be reviewed before the notebook and associated dataswarm pipeline gets checked into source control.
The debugging interface once a notebook has been scheduled.
The integration point (a custom Operator) with Meta’s internal scheduler to actually run the notebook. We’re calling this: BentoOperator.

How BentoOperator works

In order to address the majority of the concerns highlighted above, we perform the notebook execution state in a container without access to the network. We also leverage input & output data annotations to show the flow of data.

The overall design for BentoOperator.

For ETL, we fetch data and write it out in a novel way:

Supported notebooks perform data fetches in a structured manner via custom cells that we’ve built. An example of this is the SQL cell. When BentoOperator runs, the first step involves parsing metadata associated with these cells and fetching the data using transparent Dataswarm Operators and persisting this in local csv files on the ephemeral remote hosts.
Instances of these custom cells are then replaced with a call to pandas.read_csv() to load that data in the notebook, unlocking the ability to execute the notebook without any access to the network.
Data writes also leverage a custom cell, which we replace with a call to pandas.DataFrame.to_csv() to persist to a local csv file, which we then process after the actual notebook execution is complete and upload the data to the warehouse using transparent Dataswarm Operators.
After this step, the temporary csv files are garbage-collected; the resulting notebook version with outputs uploaded and the ephemeral execution host deallocated.

Custom SQL cell supported for scheduled notebooks.

Structured custom cell for data uploads.

Our approach to privacy with BentoOperator

We have integrated BentoOperator within Meta’s data purpose framework to ensure that data is used only for the purpose it was intended. This framework ensures that the data usage purpose is respected as data flows and transmutes across Meta’s stack. As part of scheduling a notebook, a “purpose policy zone” is supplied by the user and this serves as the integration point with the data purpose framework.

Overall user workflow

Let’s now explore the workflow for scheduling a notebook:

We’ve exposed the scheduling entry point directly from the notebook header, so all users have to do is hit a button to get started.

The first step in the workflow is setting up some parameters that will be used for automatically generating the pipeline for the schedule.

The next step involves previewing the generated pipeline before a Phabricator (Meta’s diff review tool) diff is created.

In addition to the pipeline code for running the notebook, the notebook itself is also checked into source control so it can be reviewed. The results of attempting to run the notebook in a scheduled setup are also included in the test plan.

Once the diff has been reviewed and landed, the schedule starts running the next day. In the event that the notebook execution fails for whatever reason, the schedule owner is automatically notified. We’ve also built a context pane extension directly in Bento to help with debugging notebook runs.

What’s next for scheduled notebooks

While we’ve addressed the challenge of supporting scheduled notebooks in a privacy-aware manner, the notebooks that are in scope for scheduling are limited to those performing ETL or those performing data analysis with no other side effects. This is only a fraction of the notebooks that users want to eventually schedule. In order to increase the number of use cases, we’ll be investing in supporting other transparent data sources in addition to the SQL cell.

We have also begun work on supporting parameterized notebooks in a scheduled setup. The idea is to support instances where instead of checking in many notebooks into source control that only differ by a few variables, we instead just check in one notebook and inject the differentiating parameters during runtime.

Lastly, we’ll be working on event-based scheduling (in addition to the time-based approach we have here) so that a scheduled notebook can also wait for predefined events before running. This would include, for example, the ability to wait until all data sources the notebook depends on land before notebook execution can begin.

Acknowledgments

Some of the approaches we took were directly inspired by the work done on Papermill.

Code Llama: Meta’s state-of-the-art LLM for coding

Thu, 24 Aug 2023 09:00:00 +0200

How Code Llama works

Code Llama is a code-specialized version of Llama 2 that was created by further training Llama 2 on its code-specific datasets, sampling more data from that same dataset for longer. Essentially, Code Llama features enhanced coding capabilities, built on top of Llama 2. It can generate code, and natural language about code, from both code and natural language prompts (e.g., “Write me a function that outputs the fibonacci sequence.”) It can also be used for code completion and debugging. It supports many of the most popular languages being used today, including Python, C++, Java, PHP, Typescript (Javascript), C#, and Bash (see our research paper for a full list).

We are releasing three sizes of Code Llama with 7B, 13B, and 34B parameters respectively. Each of these models is trained with 500B tokens of code and code-related data. The 7B and 13B base and instruct models have also been trained with fill-in-the-middle (FIM) capability, allowing them to insert code into existing code, meaning they can support tasks like code completion right out of the box.

The three models address different serving and latency requirements. The 7B model, for example, can be served on a single GPU. The 34B model returns the best results and allows for better coding assistance, but the smaller 7B and 13B models are faster and more suitable for tasks that require low latency, like real-time code completion.

The Code Llama models provide stable generations with up to 100,000 tokens of context. All models are trained on sequences of 16,000 tokens and show improvements on inputs with up to 100,000 tokens.

Aside from being a prerequisite for generating longer programs, having longer input sequences unlocks exciting new use cases for a code LLM. For example, users can provide the model with more context from their codebase to make the generations more relevant. It also helps in debugging scenarios in larger codebases, where staying on top of all code related to a concrete issue can be challenging for developers. When developers are faced with debugging a large chunk of code they can pass the entire length of the code into the model.

Additionally, we have further fine-tuned two additional variations of Code Llama: Code Llama - Python and Code Llama - Instruct.

Code Llama - Python is a language-specialized variation of Code Llama, further fine-tuned on 100B tokens of Python code. Because Python is the most benchmarked language for code generation – and because Python and PyTorch play an important role in the AI community – we believe a specialized model provides additional utility.

Code Llama - Instruct is an instruction fine-tuned and aligned variation of Code Llama. Instruction tuning continues the training process, but with a different objective. The model is fed a “natural language instruction” input and the expected output. This makes it better at understanding what humans expect out of their prompts. We recommend using Code Llama - Instruct variants whenever using Code Llama for code generation since Code Llama - Instruct has been fine-tuned to generate helpful and safe answers in natural language.

We do not recommend using Code Llama or Code Llama - Python to perform general natural language tasks since neither of these models are designed to follow natural language instructions. Code Llama is specialized for code-specific tasks and isn’t appropriate as a foundation model for other tasks.

When using the Code Llama models, users must abide by our license and acceptable use policy.

Evaluating Code Llama’s performance

To test Code Llama’s performance against existing solutions, we used two popular coding benchmarks: HumanEval and Mostly Basic Python Programming (MBPP). HumanEval tests the model’s ability to complete code based on docstrings and MBPP tests the model’s ability to write code based on a description.

Our benchmark testing showed that Code Llama performed better than open-source, code-specific LLMs and outperformed Llama 2. Code Llama 34B, for example, scored 53.7% on HumanEval and 56.2% on MBPP, the highest compared with other state-of-the-art open solutions, and on par with ChatGPT.

As with all cutting edge technology, Code Llama comes with risks. Building AI models responsibly is crucial, and we undertook numerous safety measures before releasing Code Llama. As part of our red teaming efforts, we ran a quantitative evaluation of Code Llama’s risk of generating malicious code. We created prompts that attempted to solicit malicious code with clear intent and scored Code Llama’s responses to those prompts against ChatGPT’s (GPT3.5 Turbo). Our results found that Code Llama answered with safer responses.

Details about our red teaming efforts from domain experts in responsible AI, offensive security engineering, malware development, and software engineering are available in our research paper.

Releasing Code Llama

Programmers are already using LLMs to assist in a variety of tasks, ranging from writing new software to debugging existing code. The goal is to make developer workflows more efficient, so they can focus on the most human centric aspects of their job, rather than repetitive tasks.

At Meta, we believe that AI models, but LLMs for coding in particular, benefit most from an open approach, both in terms of innovation and safety. Publicly available, code-specific models can facilitate the development of new technologies that improve peoples' lives. By releasing code models like Code Llama, the entire community can evaluate their capabilities, identify issues, and fix vulnerabilities.

Code Llama’s training recipes are available on our Github repository.

Model weights are also available.

Responsible use

Our research paper discloses details of Code Llama’s development as well as how we conducted our benchmarking tests. It also provides more information into the model’s limitations, known challenges we encountered, mitigations we’ve taken, and future challenges we intend to investigate.

We’ve also updated our Responsible Use Guide and it includes guidance on developing downstream models responsibly, including:

Defining content policies and mitigations.
Preparing data.
Fine-tuning the model.
Evaluating and improving performance.
Addressing input- and output-level risks.
Building transparency and reporting mechanisms in user interactions.

Developers should evaluate their models using code-specific evaluation benchmarks and perform safety studies on code-specific use cases such as generating malware, computer viruses, or malicious code. We also recommend leveraging safety datasets for automatic and human evaluations, and red teaming on adversarial prompts.

The future of generative AI for coding

Code Llama is designed to support software engineers in all sectors – including research, industry, open source projects, NGOs, and businesses. But there are still many more use cases to support than what our base and instruct models can serve.

We hope that Code Llama will inspire others to leverage Llama 2 to create new innovative tools for research and commercial products.

Try Code Llama today

Code Llama GitHub repository

Download the Code Llama Model

Read the research paper

Code Llama: Open foundation models for code

Introducing Immortal Objects for Python

Tue, 15 Aug 2023 18:00:00 +0200

Instagram has introduced Immortal Objects – PEP-683 – to Python. Now, objects can bypass reference count checks and live throughout the entire execution of the runtime, unlocking exciting avenues for true parallelism.

At Meta, we use Python (Django) for our frontend server within Instagram. To handle parallelism, we rely on a multi-process architecture along with asyncio for per-process concurrency. However, our scale – both in terms of business logic and the volume of handled requests – can cause an increase in memory pressure, leading to efficiency bottlenecks.

To mitigate this effect, we rely on a pre-fork web server architecture to cache as many objects as possible and have each separate process use them as read-only structured through shared memory. While this greatly helps, upon closer inspection we saw that our processes’ private memory usage grew over time while our shared memory decreased.

By analyzing the Python heap, we found that while most of our Python Objects were practically immutable and lived throughout the entire execution of the runtime, it ended up still modifying these objects through reference counts and garbage collection (GC) operations that mutate the objects’ metadata on every read and GC cycle – thus, triggering a copy on write on the server process.

The effect of copy on writes is increasing private memory and a reduction of shared memory from the main process.

Immortal Objects for Python

This problem of state mutation of shared objects is at the heart of how the Python runtime works. Given that it relies on reference counting and cycle detection, the runtime requires modifying the core memory structure of the object, which is one of the reasons the language requires a global interpreter lock (GIL).

To get around this issue, we introduced Immortal Objects – PEP-683. This creates an immortal object (an object for which the core object state will never change) by marking a special value in the object’s reference count field. It allows the runtime to know when it can and can’t mutate both the reference count fields and GC header.

A comparison of standard objects versus immortal objects. With standard objects, a user can guarantee that it will not mutate its type and/or its data. Immortality adds an extra guarantee that the runtime will not modify the reference count or the GC Header if present, enabling full object immutability.

While implementing and releasing this within Instagram was a relatively straightforward process due to our relatively isolated environment, sharing this to the community was a long and arduous process. Most of this was due to the solution’s implementation, which had to deal with a combination of problems such as backwards compatibility, platform compatibility, and performance degradation.

First, the implementation had to guarantee that, even after changing the reference count implementation, applications wouldn’t crash if some objects suddenly had different refcount values.

Second, it changes the core memory representation of a Python object and how it increases its reference counts. It needed to work across all the different platforms (Unix, Windows, Mac), compilers (GCC, Clang, and MSVC), architectures (32-bit and 64-bit), and hardware types (little- and big-endian).

Finally, the core implementation relies on adding explicit checks in the reference count increment and decrement routines, which are two of the hottest code paths in the entire execution of the runtime. This inevitably meant a performance degradation in the service. Fortunately, with the smart usage of register allocations, we managed to get this down to just a ~2 percent regression across every system, making it a reasonable regression for the benefits that it brings.

How Immortal Objects have impacted Instagram

For Instagram, our initial focus was to achieve improvements in both memory and CPU efficiency of handling our requests by reducing copy on writes. Through immortal objects, we managed to greatly reduce private memory by increasing shared memory usage.

Increasing shared memory usage through immortal Objects allows us to significantly reduce private memory. Reducing the number of copy on writes.

However, the implications of these changes go far beyond Instagram and into the evolution of Python as a language. Until now, one of Python’s limitations has been that it couldn’t guarantee true immutability of objects on the heap. Both the GC and the reference count mechanism had unrestricted access to both of these fields.

Contributing immortal objects into Python introduces true immutability guarantees for the first time ever. It helps objects bypass both reference counts and garbage collection checks. This means that we can now share immortal objects across threads without requiring the GIL to provide thread safety.

This is an important building block towards a multi-core Python runtime. There are two proposals that leverage immortal objects to achieve this in different ways:

PEP-684: A Per-Interpreter GIL
PEP-703: Making the Global Interpreter Lock Optional in CPython

Try Immortal Objects today

We invite the community to think of ways they can leverage immortalization in their applications as well as review the existing proposals to anticipate how to improve their applications for a multi-core environment. At Meta, we are excited about the direction in the language’s development and we are ready to keep contributing externally while we keep experimenting and evolving Instagram.

Meta Connect 2023: September 27 – 28

Mon, 14 Aug 2023 18:00:00 +0200

[...]

The post Meta Connect 2023: September 27 – 28 appeared first on Engineering at Meta.

Scaling the Instagram Explore recommendations system

Wed, 09 Aug 2023 18:00:00 +0200

Explore is one of the largest recommendation systems on Instagram.
We leverage machine learning to make sure people are always seeing content that is the most interesting and relevant to them.
Using more advanced machine learning models, like Two Towers neural networks, we’ve been able to make the Explore recommendation system even more scalable and flexible.

AI plays an important role in what people see on Meta’s platforms. Every day, hundreds of millions of people visit Explore on Instagram to discover something new, making it one of the largest recommendation surfaces on Instagram.

To build a large-scale system capable of recommending the most relevant content to people in real time out of billions of available options, we’ve leveraged machine learning (ML) to introduce task specific domain-specific language (DSL) and a multi-stage approach to ranking.

As the system has continued to evolve, we’ve expanded our multi-stage ranking approach with several well-defined stages, each focusing on different objectives and algorithms.

Retrieval
First-stage ranking
Second-stage ranking
Final reranking

By leveraging caching and pre-computation with highly-customizable modeling techniques, like a Two Towers neural network (NN), we’ve built a ranking system for Explore that is even more flexible and scalable than ever before.

The stages funnel for Explore on Instagram.

Readers might notice that the leitmotif of this post will be clever use of caching and pre-computation in different ranking stages. This allows us to use heavier models in every stage of ranking, learn behavior from data, and rely less on heuristics.

Retrieval

The basic idea behind retrieval is to get an approximation of what content (candidates) will be ranked high at later stages in the process if all of the content is drawn from a general media distribution.

In a world with infinite computational power and no latency requirements we could rank all possible content. But, given real-world requirements and constraints, most large-scale recommender systems employ a multi-stage funnel approach – starting with thousands of candidates and narrowing down the number of candidates to hundreds as we go down the funnel.

In most large-scale recommender systems, the retrieval stage consists of multiple candidates’ retrieval sources (“sources” for short). The main purpose of a source is to select hundreds of relevant items from a media pool of billions of items. Once we fetch candidates from different sources, we combine them together and pass them to ranking models.

Candidates’ sources can be based on heuristics (e.g., trending posts) as well as more sophisticated ML approaches. Additionally, retrieval sources can be real-time (capturing most recent interactions) and pre-generated (capturing long-term interests).

The four types of retrieval sources.

To model media retrieval for different user groups with various interests, we utilize all these mentioned source types together and mix them with tunable weights.

Candidates from pre-generated sources could be generated offline during off-peak hours (e.g., locally popular media), which further contributes to system scalability.

Let’s take a closer look at a couple of techniques that can be used in retrieval.

Two Tower NN

Two Tower NNs deserve special attention in the context of retrieval.

Our ML-based approach to retrieval used the Word2Vec algorithm to generate user and media/author embeddings based on their IDs.

The Two Towers model extends the Word2Vec algorithm, allowing us to use arbitrary user or media/author features and learn from multiple tasks at the same time for multi-objective retrieval. This new model retains the maintainability and real-time nature of Word2Vec, which makes it a great choice for a candidate sourcing algorithm.

Here’s how the Two Tower retrieval works in general with schema:

The Two Tower model consists of two separate neural networks – one for the user and one for the item.
Each neural network only consumes features related to their entity and outputs an embedding.
The learning objective is to predict engagement events (e.g., someone liking a post) as a similarity measure between user and item embeddings.
After training, user embeddings should be close to the embeddings of relevant items for a given user. Therefore, item embeddings close to the user’s embedding can be used as candidates for ranking.

How we train our Two Tower neural network for Explore.

Given that user and item networks (towers) are independent after training, we can use an item tower to generate embeddings for items that can be used as candidates during retrieval. And we can do this on a daily basis using an offline pipeline.

We can also put generated item embeddings into a service that supports online approximate nearest neighbors (ANN) search (e.g., FAISS, HNSW, etc), to make sure that we don’t have to scan through an entire set of items to find similar items for a given user.

During online retrieval we use the user tower to generate user embedding on the fly by fetching the freshest user-side features, and use it to find the most similar items in the ANN service.

It’s important to keep in mind that the model can’t consume user-item interaction features (which are usually the most powerful) because by consuming them it will lose the ability to provide cacheable user/item embeddings.

The main advantage of the Two Tower approach is that user and item embeddings can be cached, making inference for the Two Tower model extremely efficient.

How the Two Towers model handles retrieval.

User interactions history

We can also use item embeddings directly to retrieve similar items to those from a user’s interactions history.

Let’s say that a user liked/saved/shared some items. Given that we have embeddings of those items, we can find a list of similar items to each of them and combine them into a single list.

This list will contain items reflective of the user’s previous and current interests.

User interaction history for Explore.

Compared with retrieving candidates using user embedding, directly using a user’s interactions history allows us to have a better control over online tradeoff between different engagement types.

In order for this approach to produce high-quality candidates, it’s important to select good items from the user’s interactions history. (i.e., If we try to find similar items to some randomly clicked item we might risk flooding someone’s recommendations with irrelevant content).

To select good candidates, we apply a rule-based approach to filter-out poor-quality items (i.e., sexual/objectionable images, posts with high number of “reports”, etc.) from the interactions history. This allows us to retrieve much better candidates for further ranking stages.

Ranking

After candidates are retrieved, the system needs to rank them by value to the user.

Ranking in a high load system is usually divided into multiple stages that gradually reduce the number of candidates from a few thousand to few hundred that are finally presented to the user.

In Explore, because it’s infeasible to rank all candidates using heavy models, we use two stages:

A first-stage ranker (i.e., lightweight model), which is less precise and less computationally intensive and can recall thousands of candidates.
A second-stage ranker (i.e., heavy model), which is more precise and compute intensive and operates on the 100 best candidates from the first stage.

Using a two-stage approach allows us to rank more candidates while maintaining a high quality of final recommendations.

For both stages we choose to use neural networks because, in our use case, it’s important to be able to adapt to changing trends in users’ behavior very quickly. Neural networks allow us to do this by utilizing continual online training, meaning we can re-train (fine-tune) our models every hour as soon as we have new data. Also, a lot of important features are categorical in nature, and neural networks provide a natural way of handling categorical data by learning embeddings

First-stage ranking

In the first-stage ranking our old friend the Two Tower NN comes into play again because of its cacheability property.

Even though the model architecture could be similar to retrieval, the learning objective differs quite a bit: We train the first stage ranker to predict the output of the second stage with the label:

PSelect = { media in top K results ranked by the second stage}

We can view this approach as a way of distilling knowledge from a bigger second-stage model to a smaller (more light-weight) first-stage model.

Two Tower inference with caching on the both the user and item side.

Second-stage ranking

After the first stage we apply the second-stage ranker, which predicts the probability of different engagement events (click, like, etc.) using the multi-task multi label (MTML) neural network model.

The MTML model is much heavier than the Two Towers model. But it can also consume the most powerful user-item interaction features.

Applying a much heavier MTML model during peak hours could be tricky. That’s why we precompute recommendations for some users during off-peak hours. This helps ensure the availability of our recommendations for every Explore user.

In order to produce a final score that we can use for ordering of ranked items, predicted probabilities for P(click), P(like), P(see less), etc. could be combined with weights W_click, W_like, and W_see_less using a formula that we call value model (VM).

VM is our approximation of the value that each media brings to a user.

Expected Value = W_click * P(click) + W_like * P(like) – W_see_less * P(see less) + etc.

Tuning the weights of the VM allows us to explore different tradeoffs between online engagement metrics.

For example, by using higher W_like weight, final ranking will pay more attention to the probability of a user liking a post. Because different people might have different interests in regards to how they interact with recommendations it’s very important that different signals are taken into account. The end goal of tuning weights is to find a good tradeoff that maximizes our goals without hurting other important metrics.

Final reranking

Simply returning results sorted with reference to the final VM score might not be always a good idea. For example, we might want to filter-out/downrank some items based on integrity-related scores (e.g., removing potentially harmful content).

Also, in case we would like to increase the diversity of results, we might shuffle items based on some business rules (e.g., “Do not show items from the same authors in a sequence”).

Applying these sorts of rules allows us to have a much better control over the final recommendations, which helps to achieve better online engagement.

Parameters tuning

As you can imagine, there are literally hundreds of tunable parameters that control the behavior of the system (e.g., weights of VM, number of items to fetch from a particular source, number of items to rank, etc.).

To achieve good online results, it’s important to identify the most important parameters and to figure out how to tune them.

There are two popular approaches to parameters tuning: Bayesian optimization and offline tuning.

Bayesian optimization

Bayesian optimization (BO) allows us to run parameters tuning online.

The main advantage of this approach is that it only requires us to specify a set of parameters to tune, the goal optimization objective (i.e., goal metric), and the regressions thresholds for some other metrics, leaving the rest to the BO.

The main disadvantage is that it usually requires a lot of time for the optimization process to converge (sometimes more than a month) especially when dealing with a lot of parameters and with low-sensitivity online metrics.

We can make things faster by following the next approach.

Offline tuning

If we have access to enough historical data in the form of offline and online metrics, we can learn functions that map changes in offline metrics into changes in online metrics.

Once we have such learned functions, we can try different values offline for parameters and see how offline metrics translate into potential changes in online metrics.

To make this offline process more efficient, we can use BO techniques.

The main advantage of offline tuning compared with online BO is that it requires a lot less time to set up an experiment (hours instead of weeks). However, it requires a strong correlation between offline and online metrics.

The growing complexity of ranking for Explore

The work we’ve described here is far from done. Our systems’ growing complexity will pose new challenges in terms of maintainability and feedback loops. To address these challenges, we plan to continue improving our current models and adopting new ranking models and retrieval sources. We’re also investigating how to consolidate our retrieval strategies into a smaller number of highly customizable ML algorithms.

How Meta is improving password security and preserving privacy

Tue, 08 Aug 2023 18:00:00 +0200

Meta is developing new privacy-enhancing technologies (PETs) to innovate and solve problems with less data. These technologies enable teams to build and launch privacy-enhanced products in a way that’s verifiable and safeguards user data. Using state-of-the-art cryptographic techniques, we have developed Private Data Lookup (“PDL”) that allows users to privately query a server-side data set. PDL is based on a secure multiparty computation mechanism called Private Set Intersection, where two parties holding sets can compute the intersection of the two sets without revealing their sets to the counterpart. With PDL, we further ensure that only one party (i.e., Meta users) can see the result, disabling Meta from learning the result of the intersection and thus enhancing the privacy of users’ data.

We use PDL for data minimization and we began with supporting first party passwords in Enterprise Center, Meta’s new platform to enable collaboration between external partners and Meta. With PDL, we encourage the use of stronger passwords while minimizing the information revealed to the server in the password precheck process.

Creating a password is the first step in the authentication cycle for most users. Hence, identifying weak passwords in this step offers a stronger security stance than checking weak passwords while they are already in use. While traditional password guidance includes a list of best practices, good passwords satisfying these requirements can still be leaked through breaches. Thus, proactive checking for compromised passwords complements password strength guidelines and helps users choose strong, secure passwords.

Specifically, PDL supports the breached password check feature in Enterprise Center’s password creation flows, including account creation and password reset. Enterprise Center users now receive an alert if they attempt to use a password that was previously exposed in a data breach collected by third parties (e.g., FlashPoint.io, HoldSecurity.com). Compared with the traditional server-side password hash check that reveals all of the users’ password creation attempts to the server, PDL helps to deliver the alert in a way that preserves privacy, or in other words without revealing to Meta Enterprise Center what passwords were attempted by the user, and whether the password was previously exposed. The goal is to minimize the final information collected by the Enterprise Center to be just the strong password picked by the user.

How PDL supports private password precheck

The challenge of privately checking password entered by a user against a set of passwords known to have been exposed in third party data breaches falls into an area of applied cryptography known as Private Set Intersection. It allows two parties, each holding a set of sensitive data (passwords in this case), to compute the items common to each party’s set without either party revealing the contents of their set to the other party. PDL provides the functionality of Private Set Intersection and its design is inspired by the research paper authored by Thomas et al. One distinction with previous work is we check if the password appears anywhere in the breach, whereas previous solutions alerts the user only when the specific (username, password) pair appears in the breach. We designed our solution this way since it is more relevant for targeted attack scenarios for highly sensitive accounts: for such attacks, the malicious actors are likely to use all passwords in breaches in conjunction with the target’s username. For example, if a strong password associated with a specific username appears in a breach, then all users should also avoid using this password.

Initial implementation

In a simplified version of our password precheck workflow over PDL, when making a request, a client calculates the hash H(p) of its password p and then blinds the hash output with a secret key a that is randomly generated for each request. After that, the client sends this blinded hash value, denoted by H(p)^a, to our service.

Upon receiving the request, the password precheck service (“the service”) in the Meta Enterprise Center will first blind the client’s request with a long term secret key b. The resulting value is a double-blinded hash of the original password from the client, denoted by H(p)^ab. Then the server will apply the same hash algorithm and blinding operation with secret key b to all the passwords from the leaked password dataset. This will result in a list of blinded hash values denoted by H(p1)^b, H(p2)^b, …, H(pn)^b. The server sends back the double blinded query and the list of single-blinded hash values.

After receiving the response, the client applies her secret key a to unblind the double blinded hash, resulting in a hash value that is only blinded by the service’s secret key b, i.e., q^b. Now the client is able to match q^b with the list of blinded hash values. If the client’s password p matches a leaked password pi, then there will be a matched blinded hash value because H(q)^b will be equal to H(pi)^b.

In this implementation, the privacy of the user’s data is well protected because the user’s password is one-way hashed and encrypted by the user’s one-time secret key, revealing no information to the service. In addition, the service learns nothing about the matching result because the matching happens entirely locally at the client.

As one may already have noticed, there are several issues in this initial version. First, hashing and blinding each password in the leaked password dataset at runtime cause a lot of latency at the server side. Second, it is impractical with regards to latency and bandwidth usage for the client to download all the blinded hash values of leaked passwords because there can be millions of them.

Performance optimization

It was determined that the default implementation would adversely impact user experience, due to the increase in processing time and amount of data that would need to be transferred between the client and server. To address this challenge the following optimization was adopted:

Pre-processing of compromised password data into blinded hash values. To avoid having to perform expensive cryptographic operations at run time and to increase performance, the compromised password dataset is pre-processed into a format that can be directly replied to the client.
Sharding the leaked password dataset. Instead of returning blinded hash values for the entire leaked password dataset, we let the client generate a small sharding index from the first couple of bytes of the password hash. The increased leakage and privacy risk is negligible as millions of passwords potentially share the same index and we choose the index size carefully to balance privacy and performance. The index now enables the server to return a smaller subset of the dataset in response to the blinded hash values.
Compression of the blinded hash values replied by the service. To reduce the bandwidth overhead of the service’s response, we truncate each blinded hash value into a smaller size while preserving its uniqueness for matching.

The user experience

Foundational to Private Password Precheck’s success is the ability to perform the check in a manner that is transparent to users, avoiding any disruption to user experience.

The entire workflow for Private Password Precheck consists of the following steps:

User enters a new password during account creation or password reset.
If the password checks through local requirements (e.g. minimum length requirement), it is sent to a client library to go through Private Password Precheck.
The client library generates a PDL request, sends it to the server and gets the PDL response.
The client library will perform the local match; if a match is found, the user gets an alert on the page suggesting to use a stronger password.

The following sequence diagram demonstrates the workflow:

Offering more privacy value with PDL

Looking ahead, PDL has several interesting extensions and potential applications to further minimize data collection efforts. Some of these are briefly mentioned below.

In addition to passwords, PDL can be used to lookup other pieces of information from clients such as user contacts on the service leading to private contact discovery.
PDL can be applied to systems looking to detect malicious content and downloads within apps without revealing the content to servers.
PDL can be extended to support key-value lookups.

PDL can also be combined with other Private Enhancing Technologies to optimize the trade-off between privacy and efficiency. For example, PDL can also be used together with Anonymous Credential Service (ACS) to additionally hide the identity of the client which improves privacy and enables more flexibility in designing our shards.

Fixit 2: Meta’s next-generation auto-fixing linter

Mon, 07 Aug 2023 18:00:00 +0200

Fixit is dead! Long live Fixit 2 – the latest version of our open-source auto-fixing linter.
Fixit 2 allows developers to efficiently build custom lint rules and perform auto-fixes for their codebases.
Fixit 2 is available today on PyPI.

Python is one of the most popular languages in use at Meta. Meta’s production engineers (PEs) are specialized software engineers (SWEs) who focus on reliability, efficiency, and scalability. They work on various projects, including debugging production services, rewriting inefficient libraries, orchestrating project deployments at scale, or capacity planning and scheduling. And Python is often one of the first tools that PEs reach for, as it offers rapid development, easy to read syntax, and a massive array of open source libraries.

Meta’s Python Language Foundation team — a hybrid team of both PEs and traditional SWEs — helps own and maintain the infrastructure and tooling for Python at Meta. The team supports engineers, data scientists, researchers, and anyone else at Meta using Python to get their work done.

One of the ways we accomplish this is building tools that enable Python developers to write better, and more reliable code more efficiently. This includes tools like automatic formatting and import sorting that eliminate tedium, or linters that guide engineers toward maintainable code with fewer bugs.

This year, we have been building a new linter, Fixit 2, designed from the ground up to make developers more efficient and capable, both in open source projects and the diverse landscape of our internal monorepo. At Meta, we are using Fixit 2 with a few early adopters, and plan to roll it out to the rest of our monorepo soon. But any developer can use it to perform auto-fixing more efficiently and make faster improvements to their own codebases.

Why a new linter? (why not X?)

There are a variety of excellent linters in the Python ecosystem, many of which have a large community of third-party plugins providing a diverse array of lint rules. We have used Flake8 internally at Meta since 2016, and it has been very successful in helping developers reduce bugs and keep a clean codebase. The popular flake8-bugbear plugin was even created by Łukasz Langa (author of Black, PSF developer-in-residence, and release manager for Python 3.8 and 3.9) while working at Meta (then Facebook), as a home for more opinionated lint rules that we could both use internally and share with the rest of the Python developer community.

We also have a large number of internal plugins built by various teams, and Flake8 allows them to write and enable custom lint rules directly in the codebase without getting sign-off from a central gatekeeper, and without waiting for a new deployment of Flake8 to roll out.

But while Flake8 has long been a cornerstone of our linting solution, it also has some rough edges. Writing new lint rules requires building entire plugins (each claiming a portion of the “namespace” for error codes) and encourages developers to build complicated plugins covering multiple classes of errors. When those lint errors are found, Flake8 can only point to a line and column number where it occurred, but has no way of suggesting changes to the developer looking at a list of lint results, leaving them in a state of trial and error to find changes that make the linter happy. Also, Flake8 uses the stdlib ast module, making it unable to parse future syntax features and forcing developers to wait for tools to upgrade before they can use the shiny new hotness.

There are alternatives to Flake8 of course, but many of them suffer from one or more drawbacks:

A lack of support for “local” in-repo plugins or custom lint rules.
Limited or no support for hierarchical configuration for different projects within a monorepo.
No option for auto-fixes when errors are found.
Slow performance on large codebases.

While some of those features aren’t critical, the most important for developer efficiency is offering auto-fixes – automatic suggested changes that would satisfy the lint rule. This takes the guesswork out of using a linter, and allows users to quickly review and accept those changes when possible, eliminating the need to re-run the linter until the code is finally clean. Combining these auto-fixes with in-repo, custom lint rules provides a level of tailored code quality improvements that is hard to beat.

Unfortunately, even Fixit, the auto-fixing linter that we built for Instagram and open sourced, did not support local lint rules or hierarchical configuration – core requirements for our monorepo that is home to thousands of projects, many of which are themselves open source projects with their own distinct needs for linting and CI. We received many requests from developers to support Fixit in our monorepo, but there were enough hurdles that we were only able to support a small set of security lint rules, reducing the direct benefits to our Python codebase.

Meet Fixit 2

After discussions with other teams, especially in the rapidly growing AI/ML space, we considered our options and decided upon a partial rewrite of Fixit. We intentionally designed the new version with an open source-first mindset, while incorporating the needs and requirements of our own monorepos and open source projects from day one.

The framework and linting engine would be rebuilt from the ground up while leaving the core design of lint rules largely untouched. The new system provides a hierarchical configuration based on the TOML format; support for local, in-repo lint rules similar to Flake8; and a much improved CLI and API for integration with other tools and automation.

Fixit itself builds on top of another Instagram open source project, LibCST, a concrete syntax tree for Python with a tree and node API following the patterns of the ast module in the standard library. The “concrete” part of CST means that LibCST includes every part of the source file in the resulting tree after parsing, including whitespace, comments, and formatting elements that are ignored by the ast module. This is what allows Fixit (and other tools we built, like µsort) to safely modify source files, without using regular expressions or the risk of producing broken syntax, and provides the foundation for Fixit to offer auto-fixes suggested by the lint rules themselves.

Writing a new lint rule can be done with less than a dozen lines of code, and test cases are defined inline. You can even place it right next to the code that it will be linting:

# teambread/rules/hollywood.py
import fixit
import libcst
class HollywoodName(fixit.LintRule):
    VALID = [...] # no lint errors here
    INVALID = [...] # bad code samples here
    def visit_SimpleString(self, node: libcst.SimpleString):
        if node.value in "'Paul'" or '"Paul"':
            self.report(node, "It's underbaked!")

Suggesting auto-fixes for the user is as easy as including a new CST node when reporting an error:

def visit_SimpleString(self, node: libcst.SimpleString):
    if node.value in "'Paul'" or '"Paul"':
        new_node = libcst.SimpleString('"Mary"')
        self.report(node, replacement=new_node)

Enabling this new rule within the project’s codebase can be done with a simple config change:

# teambread/sourdough/fixit.toml
[tool.fixit]
enable = [".rules.hollywood"]

Now we can run our linter against our project:

# teambread/sourdough/baker.py
name = "Paul"
print(f"hello {name}!")

$ fixit lint --diff sourdough/baker.py
sourdough/baker.py@7:11 HollywoodName: It's underbaked! (has autofix)
--- a/baker.py
+++ b/baker.py
@@ -6,3 +6,3 @@
def main():
-    name = "Paul"
+    name = "Mary"
    print(f"hello {name}")
?️  1 file checked, 1 file with errors, 1 auto-fix available ?️
[1]

The `lint` command only shows errors and suggested changes. If we use the `fix` command, we can apply those suggested changes back to the codebase:

$ fixit fix --automatic sourdough/baker.py
sourdough/baker.py@7:11 HollywoodName: It's underbaked! (has autofix)
?️  1 file checked, 1 file with errors, 1 auto-fix available, 1 fix applied ?️

Now that our auto-fixes have been applied, we can confirm that the project is now clean and lint-free:

$ fixit lint sourdough/baker.py
? 1 file clean ?

When running Fixit 2 with auto-fixing lint rules, any code that triggers the lint rule is an opportunity to get an automatic replacement, improving the codebase with less effort from the developer. Applied more broadly, Fixit 2 can even be used as a tool to enact sweeping codemods against a large codebase, while leaving a lint rule in place to handle any matching code in the future.

Try Fixit 2

Fixit 2 is available today on PyPI. You can install and test Fixit 2 with pip install fixit.

We have a roadmap with plans for future improvements and features, and a rich set of documentation and user guides to help you get started with Fixit 2 in your own projects or repositories. We hope it proves useful in your projects, and we look forward to hearing your feedback!

Using short-lived certificates to protect TLS secrets

Mon, 07 Aug 2023 15:00:00 +0200

Short-lived certificates (SLCs) are part of our latest efforts to further secure our Transport Layer Security (TLS) private keys on our edge networks.
SLCs have a very short exposure compared to traditional certificates and lower the chances of a compromised private key being abused.
Implementing SLCs has required us to address tradeoffs between operability and reliability, while satisfying the strict security requirements of our edge environment.

To create an optimal experience for the people who use our products and services all over the world, Meta runs a widely distributed network with many points of presence spread across different geographies. We call this the edge of our network, and it is also the entry point into Meta’s infrastructure for our end user requests.

On the edge, our Transport Layer Security (TLS) deployment helps ensure end-to-end security for our applications over the internet.

Now, we’ve introduced short-lived certificates (SLCs) as part of our latest efforts for securing our TLS private keys. As the name indicates, SLCs have a very short exposure compared to traditional certificates (i.e., in order of days, as opposed to validity that lasts months or years). With a shorter validity period the potential for abuse, if the private key is compromised, is lower. This approach also preserves all the benefits of doing all sign operations at the edge, such as capacity and latency, and improves the reliability of the setup.

The evolution of Meta’s public-facing TLS infrastructure

Let’s go over the evolution of our TLS infrastructure over the years. When a client (an app or browser), wants to send a request to a dynamic application, it first needs to complete a TLS handshake with our L7 load balancer Proxygen in the PoP to obtain a secure connection. We also run an OffloadService on the edge in a locked-down separate environment that is not accessible from the external internet. The OffloadService loads the private key in memory. To facilitate TLS termination, Proxygen would offload the TLS sign operations to the OffloadService. Previously, Proxygen used to load both the certification and private key in memory to complete TLS handshakes. However, loading the private keys directly in Proxygen made us potentially vulnerable to Heartbleed types of attacks.

The certificate rotation still required a Proxygen service push, and this was starting to slow us down. We decided to utilize the OffloadService to enable certificate rotation during runtime. This architecture improved security by decoupling load balancer from TLS infrastructure and also improved reliability and operational agility.

The challenges of securing TLS private keys

We examined several methods for better securing our TLS private keys or reacting quickly if a key is compromised.

Protocol layer revocation

Classic revocation protocols such as certificate revocation lists (CRLs) and online certificate status protocols (OCSPs) offer ways to revoke certificates, but they have some drawbacks. Each setup of a secure channel may require access to CRL distribution point (DP) or the OCSP responder. This adds latency and increases failure scenarios for the system. Overall, a system that includes revocation checking is more complex and less reliable than a system that does not include it.

Remote-offload

Remote offload refers to the TLS certificate infrastructure for PoPs where private keys for high-value certificates would be kept in our datacenter regions. During a TLS handshake, the sign operation is completed by an RPC from the edge to the OffloadService running in our datacenter. As expected, this approach increases the latency of the TLS handshake. We deployed this to a few clusters for six months and it ran smoothly. However, these clusters are relatively well connected to their nearest datacenter. Remote offload is vulnerable to datacenter outages and backbone issues. As a result, we decided not to roll it out to the rest of the edge infrastructure.

Delegated credentials

To improve reliability and security tradeoffs, we also experimented with delegated credentials. But this needs further adoption from browsers to have the desired effect across our entire user base.

How we built SLCs

We have been charting a path towards reducing the lifetime of certificates. Years ago, we started with certificates with a years-long lifespan, then improved it to a few months. Currently, our certificates are only valid for a few days. Given the step function increase in security, and the days-long lifespan of certificates, we have built robust automation to have a reliable certificate pipeline

We built a ConfigBuilder with reliability as our top priority, as certificates have only days’ worth of lifespan. ConfigBuilder picks the recently issued certificates, runs proper validation and testing, and updates the OffloadService configuration that gets picked at runtime. OffloadService loads the new certificate-key pairs and starts serving it for user traffic right away. To eliminate any transient sign errors during certificate rotation, OffloadService will successfully do sign operations with both private keys for a short duration.

In our implementation of SLC, we decided to limit the private key exposure to 10 days and rotate the certificates every day. Previously, with longer, 90-day certifications, we would request renewal 30-45 days before expiration. It wasn’t a big deal if certificate rotation was blocked for a few days. On the other hand, SLC requires certification issuance and rotation of all our public certificates every single day, which means our pipeline has to operate almost flawlessly.

Here are main requirements of this system:

External dependencies: We use trusted and well-known third-party certificate authorities (CAs) to issue our certificates. Given that we need to rotate certs every day, we wanted to keep certificate issuance dependency out of the critical path. We decided to issue the certificates with a lifespan of a few months, but only distribute them 10 days before expiration. With this pipeline in place, in case of some unforeseen CA operational issues, the certificates are available well in advance before they need to be used.
Locking down private keys: During certificate issuance the private keys are stored in a locked down secret store, which cannot be accessed by the edge services. Ten days before expiration, the necessary secrets are moved to another store that is only accessible by the OffloadService service on the edge.
Backup certificate: If somehow our private keys are exposed, we need to do a fast rotation. As mentioned above, the future credentials that are issued well in advance can also be used as a backup.
Backfill: We wanted to keep certificate validity details transparent to non-security teams. This required a major update in our certificate management service. (More details in the section on “Certificate issuance overhaul”).
Reliability: Given the high volume of certificates and time sensitive rotation, we need high availability and reliability for the end-to-end flow. At this scale, we need to minimize manual intervention.
Clock skew: Since the certificates are issued well in advance, clock skew should not be a problem for ‘valid_from’ date. We assumed that 10 days of remaining validity should give us enough buffer in case of clock skew. As part of careful deployment, there was no noticeable increase in TLS handshake errors. After the deployment was complete, we analyzed the client data further to understand the effects of shorter cert validity.

Scaling our infrastructure for SLCs

Certificate issuance overhaul

Our internal system for managing all our public certificates includes many users (systems and engineers) internally beyond our load balancers. It provides an easy interface for requesting certificates and supports automated renewals. We wanted to keep management of SLCs transparent to all our users. To set up a new site that needs a certificate, our management system bootstraps the corresponding SLC series without additional effort from the requestor. It also ensures that we always have a certificate available for any upcoming expiration date and backfill any missing certificate.

The issued certificates are then grouped together by their expiration date into “cert bundles”. For example, all certs expiring on April 11th, will be added to a bundle named, “expires_2023_04_11.”

Certificate rotation

The volume of certificates was a challenge on the distribution side as well. For longer validity certs, we renew usually fewer than 10 certs at a time. With SLC, we renew each cert every single day. This increases our load 10x. During daily SLC rotation, the OffloadService fetches the private key from the secret store to load it in memory. The simultaneous fetch requests (~100x) started to overload the secret store due to a sudden spike. We improved the caching and staggered secret fetch requests in the OffloadService to prevent request spikes to the secret store. We use the prev-current-next model to handle any synchronization delays during certificate rotation.

Staged rollout

Every day, ConfigBuilder picks a candidate cert bundle with an exposure of 10 days (for example on April 1, the bundle “expires_2023_04_11” will be selected). Before distribution, all candidate certs need to be validated. ConfigBuilder canaries the new configuration to a limited set of OffloadService, and hence, Proxygen instances.

Data from these canaries is used to establish the status of TLS handshakes. This test, however, is not exhaustive due to the breadth of our use cases. Some cases may not be detected, for example, certificates missing from the bundle or incompatible with the Proxygen version running on a different machine. To gather more confidence on the validity of our certificate pipeline, the cert bundle is pushed to a wider (order of 100 machines) deployment. After baking for a day, the cert bundle is then rolled out to the rest of the edge fleet.

Future work for protecting TLS private keys

We have come a long way since the early days of our TLS infra and have dramatically improved our security posture by reducing the lifetime of certificates from the order of years to days. We continue making improvements to the reliability of this pipeline, improving security by further hardening secret management using hardware based techniques and also are evaluating further reduction in lifetime of certificates

Acknowledgements

This effort could not have been possible without our amazing partners and especially Xiangyu Bu, Puneet Mehra, Anirudh Ramachandran, and Kai Yuan Thng.

Bringing HDR video to Reels

Mon, 17 Jul 2023 18:00:00 +0200

Meta has made it possible for people to upload high dynamic range (HDR) videos from their phone’s camera roll to Reels on Facebook and Instagram.
To show standard dynamic range (SDR) UI elements and overlays legibly on top of HDR video, we render them at a brightness level comparable to the video itself.
We solved various technical challenges to ensure a smooth transition to HDR video across the diverse range of old and new devices that people use to interact with our services every day.

Over the past year, the Video Infrastructure teams at Facebook and Instagram have seen a significant increase in the amount of HDR content being uploaded to our apps, with millions of HDR videos uploaded every day. As a result, we have been working to bring true HDR video support to our family of apps, starting with Reels. Today, people now have the ability to upload HDR videos from their phone’s camera roll to Reels and to have that video playback in full HDR . To make this possible, we needed to overcome a few technical challenges to ensure a smooth transition to HDR video across the diverse range of old and new devices that people use to interact with our services every day.

The journey to HDR can be better understood if we look at how the introduction of color television, for instance, was a game changer, allowing viewers to watch programs in full color, a vastly different experience from the black-and-white broadcasting of the past. This marked a step change in bringing video closer to reality, setting the stage for future advancements, such as high-definition (HD) content, that continue raising the bar for the viewer experience.

Now we are at the dawn of the next generation of video with the transition to HDR video. Unlike SDR video, HDR video has a wider range of luminosity and color, which results in brighter whites, darker blacks, and a larger potential number of visible colors for more vivid and true-to-life images. The growing adoption of HDR cameras and displays, especially on mobile devices, allows a wider audience to experience its benefits.

Challenge: Differing device support for HDR

The rollout of HDR cameras and displays has resulted in an extensive range of devices with disparate capabilities, which makes implementing end-to-end HDR video creation and delivery challenging. It’s more complicated than simply enabling HDR video creation and upload for supported devices. From a product perspective, we need to ensure that HDR is correctly preserved through our entire media pipeline to provide a quality user experience all the way from the creator’s device through our server-side video processing/storage and, ultimately, view-side delivery and playback.

One of the main challenges with having a wide range of devices with various capabilities is ensuring compatibility across different devices. HDR video standards have specific hardware requirements for both creation and consumption, and not all devices support these new standards. This means that some consumers may not be able to view HDR content. Even worse, they may see a degraded version of the video with incorrect colors, resulting in a less satisfying viewing experience.

HDR video also comes with the challenge of different HDR formats, such as HLG and PQ, either of which may contain HDR10+ or Dolby Vision dynamic metadata. Different device manufacturers may choose to implement their own standards, with each having unique capabilities in terms of the range of luminosity and color gamut they support. Different sensors also result in different colors. These distinctions all result in additional characteristics for us to account for to ensure that a given video appears consistently and correctly across all devices. For example, if a video is encoded in HDR10 and played on a device with an 8-bit display, in some cases the video may look worse than if it were an SDR video, with washed-out or unnatural-looking colors. If we didn’t address this, it would be difficult for creators to accurately showcase their work and guarantee the viewing experience of their audiences.

Challenge: Accurate tone mapping

Since not all devices support HDR displays, we need to provide backward compatibility by offering an SDR representation of the HDR video. Tone mapping is the process of scaling down the dynamic range and color space of an image or video while aiming to preserve the original appearance of the image. Improper tone mapping may cause color shifts that look unnatural, such as trees that appear blue. Tone-mapped videos may be too bright or too dark. In a professional studio, manual tone mapping or color grading involves adjusting the parameters until the results look pleasing. However, we need a solution that can run automatically on billions of videos without needing human review with the tone-mapped SDR encoding closely resembling the original HDR video.

Sophisticated tone mapping algorithms can be used to preserve quality and minimize any artifacts that were introduced. Objective visual quality metrics for tone mapping remain an open research problem in the industry, but based on our internal testing, we were able to tune the popular Hable tone mapping operator to produce results that reasonably represent the creator’s original intent.

Tone mapping on the client

When creators first began uploading HDR video content, our media pipeline was not prepared to handle 10-bit colors. So HDR videos had overexposed colors, resulting in a dissatisfying experience. iOS devices were the first to enable HDR by default and, as such, were where we began to see HDR uploads coming from our users. To mitigate this, we converted all compositions containing HDR content to SDR on the device prior to upload with client-side tone mapping.

We used Apple’s native tone mapping APIs to quickly release a fix, mapping HDR videos to SDR color space prior to uploading to our servers. This way all uploads were guaranteed to look good across all devices, even if they were coming from newer phones that captured HDR video.

On Android the story was a bit different. With a more diverse device landscape that lacked standardization for HDR in its early phases, we weren’t able to rely on OS-level tone mapping to maintain a consistent appearance. Some devices were creating HDR videos with the PQ transfer function, while others used HLG. These different standards resulted in different representations of color and, as a result, there was no one-size-fits-all solution for tone mapping all Android HDR uploads into accurate SDR representations.

Bhumi Sabarwal, a software engineer at Meta, discusses the challenges around enabling tone mapping on Android. (from: Video @Scale 2022)

For Android, we needed to implement a deeper solution. The early experience for Android HDR Reels resulted in washed out colors for users with newer devices. So we built custom tone mapping shaders to accurately convert both PQ and HLG HDR into accurate SDR. First, we extracted the video metadata while decoding the frames to determine which transfer function was used (e.g., PQ or HLG). Next, once we had each frame in YUV colorspace, we could apply appropriate transformation matrices to convert into the target SDR colorspace (BT.709). Once we had the SDR rendition, the rest of the creation process, including creative effects, AR filters, and complex media composition, were able to function appropriately.

With client-side tone mapping in place, we had alleviated the issue with washed out colors, but we were still not delivering a true HDR experience for those creators who had HDR content.

Tone mapping on the server

With client-side tone mapping, we were able to mitigate the visual quality degradation associated with HDR video processed in a media pipeline that only supported SDR. This fixed the issues we were observing, but our ultimate goal was still to unlock the power of newer devices with HDR displays and deliver a full HDR experience to them. This meant building robust end-to-end HDR support while also supporting a satisfying user experience for older devices that may not support HDR.

As part of the creation process, we already transcode all uploaded videos into different resolutions and bitrates to provide a smooth playback experience for all devices and network conditions with adaptive bitrate streaming (ABR). With HDR videos, however, this can get a bit more complicated since they require 10 bits per color component per pixel, and are often encoded with newer codecs, such as HEVC, VP9, or AV1. These characteristics increase the decode complexity, and thus require higher performance devices to support smooth decoding and playback. If we delivered HDR content to all devices, including those without adequate support, we could introduce degraded performance as the higher requisite bitrates result in wasted bandwidth, which leads to increased buffering, more frequent in-play stalls, and lower battery life.

Therefore, to build an experience optimized for all users, we need to have a way to deliver SDR encodings to devices that can’t take advantage of the benefits of HDR. To address the challenge of delivering HDR video across a diverse device ecosystem, we built tone mapping into our server-side processing.

With server-side tone mapping, we upload content to our servers with the original HDR color information intact, and generate both HDR and SDR representations for delivery. Doing both HDR and SDR encodes doubles our compute for the HDR videos. Leveraging our Meta Scalable Video Processor for HDR processing, we are able to handle this load without increased energy requirements.

If a device does not support HDR, only the SDR representation will be delivered for playback. Additionally, the tone-mapped SDR variants of HDR videos are useful for mixed scenarios, like the Instagram Explore page or the Facebook Feed, where the user experience is best with a uniform brightness across all thumbnails and previews.

Challenge: Managing brightness

Within our apps, we also faced challenges around maintaining the consistency of the app experience when introducing HDR brightness and colors in the context of the user interfaces designed around SDR video. This often resulted in inconsistent or unsatisfying experiences in our testing.

As mentioned above, HDR does not only expand the range of colors, but also contrast by enabling higher levels of brightness. This begs the question of how an already bright display can become even brighter to sufficiently accommodate the dynamic range needed for HDR, while also ensuring that SDR content remains correctly and relatively represented. On iOS devices, the default system behavior is quite interesting. The brightest ranges of the display become reserved for HDR content, and the SDR content actually becomes dimmer.

Inconsistent brightness when displaying HDR and SDR in a mixed scenario. The leftmost video in the third row is an HDR video and is much brighter than the other videos.

What is happening may not be noticeable in an app where the entirety of the screen is occupied by a single video. However, in an app like Instagram, where a video is displayed alongside others, this effect can be quite dissatisfying and challenges us to define a new standard for an optimal user experience.

One case is the Explore tab, which presents a mix of photos and videos displayed in a grid. If we simply enabled HDR in this setting the HDR videos would draw extra attention, leading to an unbalanced experience. In this scenario we’d use the same client-side tone mapping that we had used during video creation to tone map videos on the fly.

Another case is Reels, where we display a single full-screen video at a time, overlaid by UI indicating the author, caption, comments, and other metadata. These interface elements are SDR and, as such, they are subject to the same dimming behavior when they’re displayed alongside HDR content. This led to issues in our early experiments where white text appeared gray when rendered alongside HDR video. We wanted to ensure that our overlay text would always be rendered in true white because the gray – being dimmer and completely illegible in some cases – posed a usability concern.

Because our objective is to actually show HDR videos, simply tone mapping back to SDR was not the ideal option. Instead, we worked the problem from the other side, rendering the overlays in HDR when accompanying an HDR video. To do this, we apply a brightness multiplier to the overlay colors, extending into the HDR parts of the spectrum and thus rendering at the same brightness levels.

Chris Ellsworth, a software engineer at Meta, discusses our work supporting HDR on iOS. (from: Video @Scale 2022)

Acknowledgements

This work is the result of a collaborative effort between the entire Video Infrastructure and Instagram Media Platform teams at Meta. The authors would like to extend special thanks to the following people: Anthony Dito, Rex Jin, Ioannis Katsavounidis, Ryan Lei, Wen Li, Richard Liu, Denise Noyes, Ryan Peterman, David Ronca, Bhumi Sabarwal, Moisés Ferrer Serra, Ravi Shah, Zafar Shahid, Haixia Shi, Nidhi Singh, and Kyle Yoon.

Meta’s Evenstar is transitioning to OCP to accelerate open RAN adoption

Thu, 29 Jun 2023 18:00:00 +0200

Meta is transferring its IP for Evenstar, a program to accelerate the adoption of open RAN technologies, to the Open Compute Project (OCP).
Meta will contribute Evenstar’s radio unit design to OCP, giving the telecom industry its first open, white box radio unit solution.
The TIP Open RAN community will leverage the Evenstar radio unit reference designs to drive productization, validation, and commercialization of new Open RAN hardware variants.
The OCP community will evolve Evenstar with the requirements of the broader ecosystem in mind and develop a multi-vendor supply chain.

Over the past three years, Meta engineers have been working alongside our ecosystem partners on Evenstar, a collaborative effort to accelerate the adoption of open radio access networks (RANs) through simplified, flexible, and efficient RAN technologies. By driving open source standards, we’ve been helping to deliver innovative radio products and lower the total cost of ownership for operators.

Now that the project has matured to a point where the wider telecom community can fully benefit, we’re officially opening up Evenstar’s design to the wider open RAN community. We will, however, be continuing our collaboration with OCP to help our ecosystem of partners develop and deploy new open RAN solutions for the industry. Earlier this year we contributed Open M-Plane, the software component of Evenstar for configuring and managing RANs, to the Linux Foundation as part of our collaboration on the LF Connectivity project.

“OCP is very pleased to be the home of the hardware specification along with firmware for [Evenstar], a completely open RAN radio unit,” said Bijan Nowroozi, CTO of the Open Compute Project Foundation. “It is an absolute milestone towards a truly open RAN…OCP strives to foster a community where anyone can contribute, anyone can improve, and anyone can innovate.”

Contributing Evenstar’s design to the open RAN community

Evenstar began as a way to bring together a set of industry players to work toward the common goal of building a stable, open, and interoperable infrastructure supply chain for RAN. We’ve since made significant progress in building a modular reference design for a disaggregated radio unit platform, and now we’re sharing this with OCP as a hardware base specification to enable the ecosystem to build modularized and efficient radio unit platforms.

“Key elements of the specification feature an advanced 4T4R MIMO transceiver capable of 40W [per] stream of RF power and a highly-modular design that is useful in real world use cases,” said Nowroozi. “The hardware specification addresses deployment challenges by reducing the total number of SKUs required for worldwide availability.”

Evenstar’s design is based on several novel solutions that address electrical and mechanical design challenges associated with the high-power, dual-band radio units that are emerging in the open RAN ecosystem. Evenstar’s firmware, which has also been contributed to OCP and open sourced, handles complex signal processing algorithms such as digital pre-distortion (DPD) and crest factor reduction (CFR).

Evenstar’s radio unit architecture was designed by Meta with modularity in mind. The modular assembly is a critical disaggregation tenet that enables the mixing and matching of radio unit components (power amplifiers, power supply units, etc.) from different suppliers and provides flexibility in rebanding for a range of radio units.

The design relies on off-the-shelf discrete components, which is expected to result in a lower bill of materials (BOM) cost, with the additional benefit of being able to independently control each of the power amplifier modules (PAMs) for optimized power savings and increased radio unit efficiency.

All modules have been designed, manufactured, and tested in the lab on an individual basis for viability. This hardware design can support the commercial manufacturing of a range of radio units, with the platform anchoring a modular concept to assemble power amplifiers (PAs), radio silicon, and other components from a best-of-breed supply chain. We expect the modular design based on open interfaces will lead to faster innovation and thereby reduce adoption barriers.

In addition to the hardware base specification and firmware, a comprehensive set of design files and test scripts will be accessible via the OCP website, including:

Power amplifier modules (PAMs)
Power supply unit (PSU)
Digital control module (board)
Resonant cavity duplexer filter module
Mechanical design
DPD firmware
CFR firmware
Test automation scripts (including Massive MIMO)

Accelerating open RAN commercialization

Evenstar will be the industry’s first open, white box radio unit solution as the industry continues to pave the way for further adoption of open RAN technologies.

Jabil, a manufacturing solutions provider, has been a key partner in the Evenstar program. It took the lead on developing and producing modular and standards-based radio units based on operator-specified requirements. We’ve been working closely with Jabil to scale our contribution, helping to commercialize the technologies that will drive global adoption. Jabil has been a trusted manufacturing partner for 5G radios and has been providing the telecommunications industry with best-in-class execution for more than two decades.

Vodafone has also been a strong supporter and key contributor to Evenstar and open RAN technologies, having previously announced plans to deploy at least 2,500 open RAN sites in rural areas across the UK by 2027. After contributing requirements and supporting the Evenstar radio unit development program with Jabil, Vodafone is now finalizing plans to take the first units into the lab for testing and validation before entering the field trial phase. With its first open RAN sites already live in the South West of England, we are excited to see Vodafone pave the way for disaggregated RAN solutions in the network.

“This major contribution from Meta will provide the open RAN ecosystem with the right level of stimulus and encouragement to scale the radio unit portfolio in the industry,” said Yago Tenorio, Vodafone fellow and director of network architecture, and TIP Chairman. “We look forward to how this contribution will enable some of the best radio solutions out there for upcoming open RAN deployments.”

Continuing our collaboration with OCP

Open sourcing Evenstar’s radio unit reference design is an exciting next step towards a healthy, robust, and sustainable ecosystem for open RAN infrastructure offerings. We expect field trials and scaled deployments for open RAN to accelerate significantly in 2023. We believe this contribution to the ecosystem will provide a significant time to market advantage to radio unit supply partners as they develop products that will further increase efficiencies around cost, size, and power for RAN components.

We look forward to seeing this reference design moving beyond specifications into a deployable radio platform that will accelerate the adoption of the technology in networks as operators roll out their open RAN solutions.

Meta developer tools: Working at scale

Tue, 27 Jun 2023 18:00:00 +0200

Every day, thousands of developers at Meta are working in repositories with millions of files. Those developers need tools that help them at every stage of the workflow while working at extreme scale. In this article we’ll go through a few of the tools in the development process. And, as an added bonus, those we talk about below are open source so you can try them yourself.

Sapling: Scaling version control

Sapling is a version control system that can scale to huge sizes, but also emphasizes usability. There are three main components to Sapling – a server, a client, and a virtual file system.

The server stores all the data and is a careful mix of clever storage formats, wire protocols, and algorithms, mostly implemented in Rust and architected to scale.

The client then talks to that server, providing all the familiar operations (check out, rebase, commit, amend, etc). In addition, the client is also capable of talking to a git server, meaning that open source GitHub repositories can be worked on using our open-source Sapling release.

The final component is the virtual file system. When checking out a repository of our scale, just the disk I/O of writing all the files can take a significant amount of time. One solution is sparse checkouts, where a developer declares what subset of the repo they wish to see in advance. A more ergonomic alternative is EdenFS, which checks out everything in a few seconds but then only actually downloads the files from the server when they are accessed.

Buck2: Build system

After making changes, many developers at Meta use Buck2 to compile the results and test out their changes. Buck2 is designed to work at large scale, supporting remote caching and execution, so that developers can share each other’s compilations and a single developer can have access to thousands of machines to run compilations in parallel. Buck2 is also designed to support multiple programming languages simultaneously – so if you want your OCaml program to depend on a Rust library that uses a C++ library whose source code was generated by Erlang, that will work just fine.

Buck2 works well without Sapling, but has specific design considerations to enable Sapling and EdenFS. Buck2 uses Watchman to find out which files have changed, and Watchman supports EdenFS so that it can integrate smoothly with files that aren’t on the disk. Buck2 can also use specific EdenFS operations to access the file without going via the disk, optimizing performance on systems where virtual file systems can be slower.

Infer, RacerD, and Jest: Testing and static analysis

Handwritten tests and static analysis play an important role in making sure all of our code functions as intended. Working with the quantity of code we do at Meta means that we need tools that provide a high-quality signal, and do so very quickly.

For general static analysis, we use a platform called Infer, which is interprocedural and supports multiple languages, including Java and C++. We also have more tailor-made analysis tools such as RacerD, which detects Java concurrency bugs. RacerD played a big role in our project to move Facebook’s News Feed on Android from single-threaded to multi-threaded.

We also have language-specific testing frameworks. For example, Jest is our Javascript testing framework. In 2022 we officially transferred Jest to the OpenJS Foundation to further support its growth within the wider industry.

Finally, there are tools that sit between static analysis and manual test cases. Our Sapienz tool, for example, automatically tests mobile apps by allowing developers to simulate the user experience to seek out crashes and other potential issues.

Learn more about Meta’s developer workflow

In addition to our open-source tools, our developers also use a number of proprietary tools in their day-to-day workflows as well. For example, Phabricator (Phab for short), our CI and reviewing tool, helps our developers review and submit stacks of diffs. You can find more about these tools (along with the ones covered above) in the article on Meta developer’s workflow.

Bombyx is being licensed for product development

Mon, 22 May 2023 18:00:00 +0200

When we first conceived of our aerial fiber deployment solution, Bombyx (the Latin name for a silk moth), we imagined a robot weaving strands of fiber-optic cables over powerlines, helping human workers quickly connect communities even in very rural or remote locations. Now, after years of successful research, Bombyx is taking the next steps in its development. Meta will license the technology to Japanese robotics company, Hibot.

Hibot will leverage their decades of robotic development experience to make this technology ready for deployment and will be developing it further with their partners. “We see this as a great opportunity for our business expansion. It is also an honor to have been selected by Meta to carry on this exciting endeavor,” said Michele Guarnieri, CEO of Hibot. “We will be moving with our current global partners, while seeking new and valuable alliances to help us leverage this great innovation.”

In addition to the fiber deployment use case, Hibot will utilize the core innovations we have developed for fiber installation to accelerate their current work in the infrastructure inspection space and also open up new opportunities, such as upgrading electrical grid infrastructure to support grid modernization and the wider use of renewable energy.

What is Bombyx?

Fiber optic cables underpin the global internet. With a capacity orders of magnitude greater than any other technology, fiber deployments enable abundant low-cost connectivity. But while the cost of fiber cables is low, the high costs and complexities of installing fiber have hampered widespread deployment.

Bombyx was developed as a means to address the cost and complexity of fiber installation. It is an evolution of an older, less common fiber installation technique called helical wrapping, where a fiber optic cable is wrapped around an existing powerline conductor. However, the need to shut the power off for extended periods of time to allow for fiber installation, the cumbersome fiber wrapping process, and the limited contiguous spans of fiber that can be installed are all key challenges for adopting the helical wrapping technique.

Bombyx addresses these challenges through a number of key innovations. By combining a novel, miniature fiber cable and pairing that with a unique, spool-free cable geometry, Bombyx allows long spans of fiber to be packed into a small volume and for it to be spun around a powerline conductor without the need for counterweights. Bombyx also introduces the first robot capable of traversing energized conductors and automatically crossing obstacles, including pin and post insulators that must be passed from above. It does this through a combination of machine vision and custom sensors, combined with thruster fans, drive/lift/rotation subsystems, and advanced stabilization control that allows it to balance itself and pass over obstacles like a tightrope walker.

The next frontier

While rural areas are making gains in terms of connectivity, they’re still lagging behind more dense urban areas. We always envisioned Bombyx as a frontier technology to help bring underserved people online all over the world. Achieving this will require close collaboration at all levels, from the engineers developing the technology to the linemen in the field deploying the cables with Bombyx. We believe Bombyx will help facilitate new large-scale fiber deployments that will ultimately benefit every consumer by providing higher capacity and lower costs. We look forward to seeing Hibot build on our efforts with Bombyx and what exciting things they’ll do with the technology next.

MSVP is Meta’s first video processing ASIC

Thu, 18 May 2023 20:39:00 +0200

Here are the primary stages of the transcoding process:

Support for a variety of elementary input video stream formats, including H.264, HEVC, VP9, and AV1
All profiles, 8/10-bit pixel sample depths, and YUV420 color format

Format conversion, including video overlays
Frame resampling operation from arbitrary input resolutions to multiple resolutions (up to 4x) with high precision and wide filters
Shot detection

Support for H.264 (AVC) and VP9 coding standards
8-bit pixel sample depth, YUV420 color format

Full reference: SSIM, MS-SSIM, VIF, PSNR (Luma and Chroma)
No-reference blurriness
QM at multiple viewport resolutions

These stages are implemented as memory-to-memory operations, meaning intermediate buffers are stored back to DRAM and refetched as needed by the downstream operation.

Power and performance

Each MSVP ASIC can offer a peak transcoding performance of 4K at 15fps at the highest quality configuration with 1-in, 5-out streams and can scale up to 4K at 60fps at the standard quality configuration. Performance scales linearly with resolution. This performance is achieved at ~10W of PCIe module power. We achieved a throughput gain of ~9x for H.264 when compared against libx264 SW encoding. For VP9, we achieved a throughput gain of ~50x when compared with libVPX speed 2 preset.

~9x

faster throughput for H.264 compared to libx264 SW

~50x

faster throughput for VP9 compared to libVPX speed 2

In video coding, we deploy a method to assess and compare compression efficiency, called the Bjontegaard delta rate (BD-Rate), which estimates the number of bits saved (if the BD-Rate is a negative figure) in order to deliver the same objective quality for a given video over a baseline configuration.

MSVP’s video encoding algorithms

The MSVP encoder has two main goals: to be highly power efficient and to deliver the same or better video quality as software encoders. There are existing video encoder IPs, but most of them are targeted at mobile devices with tight area/power constraints and cannot meet the quality bar set by current software encoders.

Because software encoders offer very flexible control and fast evolution over time, it is quite challenging for ASIC video encoders to meet the same performance bar as software encoders.

Here’s a simplified version of the data flow of modern hybrid (hardware and software) video encoders:

Simplified video encoder modules.

These encoders use intra-coding to reduce spatial redundancy and inter-coding to remove temporal redundancy. Different stages of motion estimation are applied in inter-coding to find out the best prediction among all possible block positions in available reference frames. Entropy coding is the lossless compression part that squeezes the statistical redundancy of all syntax elements, including encoding modes, motion vectors and quantized residual coefficients.

For MSVP’s algorithms to perform the way we wanted, we had to find hardware-friendly alternatives for each of the above key modules. We mainly focused on three levels: block level, frame level, and group of picture (GOP) level.

At the block level, we looked for coding tools with the highest return on investment, that were easy/economical (in terms of silicon area and power requirements) to implement in hardware, and that met our performance targets while maximizing compression efficiency. At frame level, we studied the best algorithms to make intelligent frame type decisions among I/P/B frames, and the best rate-control algorithms based on statistics collected from hardware. And at the GOP level, we had to figure out whether to use multiple-pass encoding with look-ahead, or to insert intra (key) frames at a given shot boundary.

Motion estimation

Motion estimation is one of the most computationally intensive algorithms in video encoding. To find accurate motion vectors that closely match the block currently being encoded, a full motion estimation pipeline often includes a multistage search to balance among large search range, computing complexity, and accuracy.

MSVP’s motion search algorithm needs to be one that identifies which potential neighboring blocks can contribute more to quality and only searches around highly correlated neighbors with a limited cycle budget. Although we lack the flexibility of iterative software motion search algorithms, such as diamond or hexagon shapes, hardware motion estimation can search multiple blocks in parallel. Thus, it allows us to search more candidates, cover a larger search range and more reference frames in both single direction and bidirectional mode, and search all supported block partition shapes in parallel.

Rate distortion optimization (RDO)

Achieving high video encoding quality also requires RDO support. Since there are so many decisions to make in video encoding (intra/inter modes, partition block size, transform block types/sizes, etc.), RDO is one of the best practices in video compression to determine which mode is optimal given the current rate or quality target.

MSVP supports exhaustive RDO at almost all mode decision stages. Distortion calculation is intensive but both straightforward and easily parallelizable. But the unique challenge is the bit rate estimation. Entropy coding for the final bitstream is sequential in nature, and each context model is dependent on the previously encoded ones. In a hardware encoder implementation, rate distortion (RD) cost for different blocks/partitions might be evaluated in parallel; thus, it is impossible to have very accurate bit rate estimation. We implemented a pretty accurate bit rate estimation model in MSVP. The model is hardware friendly, in that it is easy to evaluate multiple coding modes in parallel.

Smart quantization

Quantization is the only lossy part of video compression, and it is also the dominant bit rate control knob in any video coding standard. The corresponding parameter is called the quantization parameter (QP), and it is inversely related to quality: Low QP values result in small quantization errors, creating low distortion levels and, subsequently, high quality at the expense of higher bit rates. By making smart quantization choices, encoding bits can be allocated to areas that impact visual quality the most. We perform smart quantization using optimal QP selection and rounding decisions.

Modern video coding standards allow different QP values to be applied to different coding units. In MSVP’s hardware encoder, block-level QP values are determined adaptively based on both spatial and temporal characteristics.

In spatial adaptive QP (AQP) selection, since the human visual system is less sensitive to quality loss at high texture or high motion areas, a larger QP value can be applied to these coding blocks. In temporal AQP, coding blocks that are referenced more in the future can be quantized with a lower QP to get higher quality, such that future coding blocks that reference these blocks will benefit from it.

Smart rounding tries to make a joint optimization on rounding decisions for all coefficients in each coding block. Since the choices of rounding at different coefficient positions are dependent on one another, we need better algorithms that remove the dependency while maintaining the rounding decision accuracy. To reduce compute cost, we’ve applied smart rounding to the final stage after the coding mode for each block is determined. This feature alone can achieve a ~1 percent to 2 percent BD-Rate improvement.

H.264

The frame-level algorithm for the MSVP H.264 encoder can be configured to be either two-pass or one-pass, depending on whether it is a VOD or live streaming use case. In the high quality (longer latency) VOD two-pass mode, MSVP looks ahead N frames and collects statistics, such as intra/inter cost and motion vectors, from these frames. Then, based on the statistics collected in the look-ahead, frame level control applies back-propagation on the reference tree in the look-ahead buffers for each reference frame to assign an importance to frames. Then, finally, the accumulated reference importance of the frame to be coded is modulated using temporal AQP of each block. Finally, the delta QP map is passed to the final encoding pass to be used as the encoding QP, also captured in the output bitstream.

MSVP H.264 encoder frame level control flow.

VP9

In MSVP’s VP9 encoder, multiple-pass encoding is also enabled for high-quality VOD use cases. An analysis pass (the first pass) is performed up front to capture the video characteristics into a set of statistics, and the statistics are used to determine the frame level parameters for filtering and encoding. Since VP9’s frame type is different from H.264’s, the strategy for making frame level decisions is also different, as shown in the following figure:

VP9 encoder frame level algorithm flow.

Meta introduces its first-generation AI inference accelerator

Thu, 18 May 2023 20:39:00 +0200

AI workloads are ubiquitous at Meta — forming the basis for a wide range of use cases, including content understanding, Feeds, generative AI, and ads ranking. These workloads run on PyTorch with first-class Python integration, eager-mode development, and the simplicity of APIs. Deep learning recommendation models (DLRMs), in particular, are important for improving experiences across Meta’s services and applications. But as these models increase in size and complexity, the underlying hardware systems need to provide exponentially more memory and compute while remaining efficient.

We found that GPUs were not always optimal for running Meta’s specific recommendation workloads at the levels of efficiency required at our scale. Our solution to this challenge was to design a family of recommendation-specific Meta Training and Inference Accelerator (MTIA) ASICs. We co-designed the first-generation ASIC with next-generation recommendation model requirements in mind and integrated it into PyTorch to create a wholly optimized ranking system. In addition, we maintained the user experience and developer efficiency offered by PyTorch eager-mode development. Developer efficiency is a journey as we continue to support PyTorch 2.0, which supercharges how PyTorch operates at the compiler level — under the hood.

What is MTIA v1?

The MTIA v1 (inference) die.

In 2020, we designed the first-generation MTIA ASIC for Meta’s internal workloads. This inference accelerator is a part of a co-designed full-stack solution that includes silicon, PyTorch, and the recommendation models. The accelerator is fabricated in TSMC 7nm process and runs at 800 MHz, providing 102.4 TOPS at INT8 precision and 51.2 TFLOPS at FP16 precision. It has a thermal design power (TDP) of 25 W.

At a high level, the accelerator consists of a grid of processing elements (PEs), on-chip and off-chip memory resources, and interconnects.

The accelerator is equipped with a dedicated control subsystem that runs the system’s firmware. The firmware manages available compute and memory resources, communicates with the host through a dedicated host interface, and orchestrates job execution on the accelerator.

The memory subsystem uses LPDDR5 for the off-chip DRAM resources and can scale up to 128 GB.

The chip also has 128 MB of on-chip SRAM shared among all the PEs, which provides higher bandwidth and much lower latency for frequently accessed data and instructions.

The grid contains 64 PEs organized in an 8x8 configuration. The PEs are connected to one another and to the memory blocks via a mesh network. The grid can be utilized for running a job as a whole, or it can be divided into multiple subgrids that can run independent jobs.

Each PE is equipped with two processor cores (one of them equipped with the vector extension) and a number of fixed-function units that are optimized for performing critical operations, such as matrix multiplication, accumulation, data movement, and nonlinear function calculation. The processor cores are based on the RISC-V open instruction set architecture (ISA) and are heavily customized to perform necessary compute and control tasks.

Each PE also has 128 KB of local SRAM memory for quickly storing and operating on data. The architecture maximizes parallelism and data reuse, which are foundational for running workloads efficiently.

The chip provides both thread and data level parallelism (TLP and DLP), exploits instruction level parallelism (ILP), and enables abundant amounts of memory-level parallelism (MLP) by allowing numerous memory requests to be outstanding concurrently.

MTIA v1 system design

The MTIA accelerators are mounted on small dual M.2 boards, which allows for easier aggregation into a server. These boards are connected to the host CPU on the server using PCIe Gen4 x8 links and consume as little as 35 W.

A sample test board with an MTIA.

The servers that host these accelerators use the Yosemite V3 server specification from the Open Compute Project. Each server contains 12 accelerators that are connected to the host CPU and to one another using a hierarchy of PCIe switches. Thus, the communication between different accelerators does not need to involve the host CPU. This topology allows workloads to be distributed over multiple accelerators and run in parallel. The number of accelerators and the server configuration parameters are carefully chosen to be optimal for executing current and future workloads.

The MTIA software stack

The MTIA software (SW) stack aims to provide developer efficiency and high performance. It integrates fully with PyTorch, providing a familiar developer experience. Using PyTorch with MTIA is as easy as using PyTorch for CPUs or GPUs. The MTIA SW stack benefits from the flourishing PyTorch developer ecosystem and tooling. The compiler performs model-level transformations and optimizations using PyTorch FX IR and low-level optimizations using LLVM IR, with extensions to support the custom architecture and ISA of the MTIA accelerator.

The PyTorch runtime for MTIA manages on-device execution and features such as MTIA tensors, memory management, and the APIs for scheduling operators on the accelerator. The runtime and firmware perform communication to the accelerator device. The SW stack supports different modes of execution, such as eager mode and graph mode, and allows workloads to be partitioned across multiple accelerator cards. In the latter case, the SW stack also provides the necessary synchronization and communication between multiple accelerator boards.

The MTIA software stack.

There are multiple ways to author compute kernels that can run on the accelerator, including using PyTorch, C/C++ (for hand-tuned, very optimized kernels), and a new domain-specific language called KNYFE, which takes a short, high-level description of an ML operator as input and generates optimized, low-level C++ kernel code that is the implementation of this operator for MTIA.

Low-level code generation and optimizations leverage the open source LLVM compiler toolchain with MTIA extensions. The LLVM compiler then takes care of the next level of optimization and code generation to produce efficient executables that run on the processor cores within the PEs.

As part of the SW stack, we have also developed a library of hand-tuned and highly optimized kernels for performance-critical ML kernels, such as fully connected and embedding-bag operators. The higher levels of the SW stack can choose to instantiate and use these highly optimized kernels during the compilation and code generation process.

The MTIA SW stack continues to evolve with integration to PyTorch 2.0, which is faster and more Pythonic, yet as dynamic as ever. This will enable new features such as TorchDynamo and TorchInductor. We are also extending Triton DSL to support MTIA accelerators and using MLIR for internal representations and advanced optimizations.

Performance results for MTIA

While our SW stack continues to evolve, we collected some results comparing the performance of MTIA with that of other accelerators. The comparison is based on the end-to-end performance of running five different DLRMs, representing low- to high-complexity workloads.

We used five different DLRMs, ranging from low to high complexity, to evaluate MTIA with representative production workloads.

Efficiency is one of the most important factors for deploying accelerators in the data center, and TCO is a measure of efficiency. Our comparison is focused on the performance-per-watt metric (TFLOPS/W) which is a key component of TCO.

Our study compared MTIA with an NNPI accelerator and a GPU. For low-complexity models, MTIA relies on handling small shapes and batch sizes more efficiently. For the medium- and high-complexity models, MTIA is running larger shapes that are much more optimized on the GPU’s SW stack (this is where MTIA’s SW stack is currently being optimized and is expected to achieve similar efficiency levels over time).

Perf/W of DLRMs

TFLOPS/W

0.040

Low complexity 1

0.033

Low complexity 2

0.035

Medium complexity 1

0.052

Medium complexity 2

0.129

High complexity

Our evaluation found that MTIA handled low-complexity (LC1 and LC2) and medium-complexity (MC1 and MC2) models more efficiently compared with an NNPI and a GPU. We also recognized that we have not yet optimized MTIA for high-complexity (HC) models.

Lessons for the future

Building custom silicon solutions, especially for the first time, is a significant undertaking. From this initial program, we have learned invaluable lessons that we are incorporating into our roadmap, including architectural insights and software stack enhancements that will lead to improved performance and scale of future systems.

The challenges we need to address are becoming increasingly complicated. Looking at historical trends in the industry for scaling compute, as well as memory and interconnect bandwidth, we can see that memory and interconnect bandwidth are scaling at a much lower pace compared with compute over the last several generations of hardware platforms.

Scaling trends for compute, memory, and interconnect bandwidth (source).

The lagging performance of memory and interconnect bandwidth has also manifested itself in the final performance of our workloads as well. For example, we see a significant portion of a workload’s execution time spent on networking and communication.

Moving forward, as part of building a better and more efficient solution, we are focused on striking a balance between these three axes (compute power, memory bandwidth, and interconnect bandwidth) to achieve the best performance for Meta’s workloads. This is an exciting journey, and we’re just getting started.

Acknowledgments:

This project is the result of the work of many talented teams and individuals at Meta. Hence, we would like to especially thank the following teams, whose contributions were instrumental in the success of this project: Infra Silicon, AI & Systems Co-Design, MTIA SW, Emulation, ASIC Platform Software, Hardware Platforms, Release to Production (RTP), and Sourcing and Operations Engineering (SOE).

Building and deploying MySQL Raft at Meta

Tue, 16 May 2023 18:00:00 +0200

We’re rolling out MySQL Raft with the aim to eventually replace our current MySQL semisynchronous databases.
The biggest win of MySQL Raft was simplification of the operation and making MySQL servers take care of promotions and membership. This gave the provable safety of Raft and reduced significant operational pain.
Making MySQL server a true distributed system also has opened up possibilities in downstream systems to leverage it. Some of these ideas are starting to take shape.

At Meta, we run one of the largest deployments of MySQL in the world. The deployment powers the social graph along with many other services, like Messaging, Ads, and Feed. Over the last few years, we have implemented MySQL Raft, a Raft consensus engine that was integrated with MySQL to build a replicated state machine. We have migrated a large portion of our deployment to MySQL Raft and plan to fully replace the current MySQL semisynchronous databases with it. The project has delivered significant benefits to the MySQL deployment at Meta, including higher reliability, provable safety, significant improvements in failover time, and operational simplicity — all with equal or comparable write performance.

Background

To allow for high availability, fault tolerance, and scaling reads, Meta’s MySQL datastore is a massively sharded, geo-replicated deployment with millions of shards, holding petabytes of data. The deployment includes thousands of machines running over several regions and data centers across multiple continents.

Previously, our replication solution used the MySQL semisynchronous (semisync) replication protocol. This was a data path–only protocol. The MySQL primary would use semisynchronous replication to two log-only replicas (logtailers) within the primary region but outside of the primary’s failure domain. These two logtailers would act as semisynchronous ACKer (An ACK is an acknowledgment to the primary that the transaction has been locally written). This would allow the data path to have very low latency (sub-millisecond) commits and would provide high availability/durability for the writes. Regular MySQL primary-to-replica asynchronous replication was used for wider distribution to other regions.

The control plane operations (e.g., promotions, failover, and membership change) would be the responsibility of a set of Python daemons (henceforth called automation). Automation would do the necessary orchestration to promote a new MySQL server in a failover location as a primary. The automation would also point the previous primary and the remaining replicas to replicate from the new primary. Membership change operations would be orchestrated by another piece of automation called MySQL pool scanner (MPS). To add a new member, MPS would point the new replica to the primary and add it to the service discovery store. A failover would be a more complex operation in which the tailing threads of the logtailers (semisynchronous ACKers) would be shut down to fence the previous dead primary.

Why was MySQL Raft necessary?

In the past, to help guarantee safety and avoid data loss during the complex promotion and failover operations, several automation daemons and scripts would use locking, orchestration steps, a fencing mechanism, and SMC, a service discovery system. It was a distributed setup, and it was difficult to accomplish this atomically. The automation became more complex and harder to maintain over time as more and more corner cases needed to be patched.

We decided to take a completely different approach. We enhanced MySQL and made it a true distributed system. Realizing that control plane operations like promotions and membership changes were the trigger of most issues, we wanted the control plane and data plane operations to be part of the same replicated log. For this, we used the well-understood consensus protocol Raft. This also meant that the source of truth of membership and leadership moved inside the server (mysqld). This was the single biggest contribution of bringing in Raft because it enabled provable correctness (safety property) across promotions and membership changes into the MySQL server.

The Raft library and the MySQL Raft plugin

Our implementation of Raft for MySQL is based on Apache Kudu. We enhanced it significantly for the needs of MySQL and our deployment. We published this fork as an open source project, kuduraft.

Some of the key features that we added to kuduraft are:

FlexiRaft — support for two different intersecting quorums: the data quorum and the leader election quorum
Proxying — the ability to use a proxy intermediate node to reduce network bandwidth
Compression — where we compress binary log (transaction) payloads once before distribution
Log abstraction — to support different physical logfile implementations
Primary ban — the ability to prevent some entities from being primary temporarily

We also had to make relatively big changes to MySQL replication to interface with Raft. For this, we created a new closed source MySQL plugin called MyRaft. MySQL would interface with MyRaft through the plugin APIs (similar APIs had been used for semisync as well), while we created a separate API for MyRaft to interface back with MySQL server (callbacks).

MySQL Raft replication topologies

A Raft ring would consist of several MySQL instances (four in the diagram) in different regions. The communication round-trip time (RTT) between these regions would range from 10 to 100 milliseconds. A few of these MySQLs (typically three) were allowed to become primaries, while the rest of them were only allowed to be pure read replicas (non-primary-capable). The MySQL deployment at Meta also has a long-standing requirement for extremely low latency commits. The services that use MySQL as a store (e.g., the social graph) need or have been designed to such extremely fast writes.

To meet this requirement, the configuration of FlexiRaft would use only in-region commits (single region dynamic mode). To enable this, each primary capable region would have two additional logtailers (witnesses or log-only entities). The data quorum for writes would be 2/3 (2 ACKs out of the 1 MySQL + 2 logtailers). Raft would still manage and run a replicated log across all the entities (1 primary-capable MySQL + 2 logtailers ) * 3 regions + (non-primary-capable MySQL) * 3 regions = 12 entities.

Raft roles: The leader, as the name suggests, is the leader in a term of the replicated log. A leader in Raft would also be the primary in MySQL and the one accepting client writes. The follower is a voting member of the ring and passively receives messages (AppendEntries) from the leader. A follower would be a replica in MySQL’s point of view and would be applying the transactions to its engine. It would not allow direct writes from user connections (read_only=1 is set). A learner would be a non-voting member of the ring, e.g., the three MySQLs in non-primary-capable regions (above). It would be a replica in MySQL’s point of view.

Replicated log

For replication, MySQL has historically used the binary log format. This format is central to MySQL’s replication, and we decided to preserve this. From the Raft perspective, the binary log became the replicated log. This was done via the log abstraction improvement to kuduraft. The MySQL transactions would be encoded as a series of events (e.g., Update Rows event) with a start and end for each transaction. The binary log would also have appropriate headers and would typically end with an ending event (Rotate event).

We had to tweak how MySQL manages its logs internally. On a primary, Raft would write to a binlog. This is no different from what happens in standard MySQL. In a replica, Raft would also write to a binlog instead of to a separate relay log in standard MySQL. This created simplicity for Raft as there was only one namespace of log files that Raft would be concerned about. If a follower were promoted to leader, it could seamlessly go back into its history of logs to send transactions to lagging members. The replica’s applier threads would pick up transactions from the binlog and then apply them to the engine. During this process, a new log file, the apply log, would be created. This apply log would play an important role in crash recovery of replicas but is otherwise a nonreplicated log file.

So, in summary:

In standard MySQL:

Primary writes to binlog and sends binlog to replicas.
Replicas receive in relay log and apply the transactions to the engine. During apply, a new replica-only binlog is created.

In MySQL Raft:

Primary writes to binlog via Raft, and Raft sends binlog to followers/replicas.
Replicas/followers receive in binlog and apply the transactions to the engine. An apply log is created during apply.
Binlog is the replicated log from the Raft point of view.

Write transaction on MySQL primary using Raft

The transaction would first be prepared in the engine. This would happen in the thread of the user connection. The act of preparing the transaction would involve interactions with the storage engine (e.g., InnoDB or MyRocks) and generate an in-memory binlog payload for the transaction. At the time of commit, the write would pass through group commit/ordered_commit flow. GTIDs would be assigned, and then Raft would assign an OpId (term:index) to the transaction. At this point, Raft would compress the transaction, store it in its LogCache, and write through the transaction to a binlog file. It would asynchronously start shipping the transaction to other followers to get ACKs and reach consensus.

The user thread, which is in “commit” of the transaction, would be blocked, waiting for consensus from Raft. When Raft would get two out of three in-region votes, consensus commit would be reached. Raft would also ship the transaction to all out-of-region members but would ignore their votes because of an algorithm called FlexiRaft (described below). On consensus commit, the user thread would be unblocked, and the transaction would proceed and commit to the engine. After engine commit, the write query would finish and return to the client. Soon after, Raft would also asynchronously send a commit marker (OpId of current commit) to downstream followers so that they can also apply the transactions to their database.

Crash recovery

Changes had to be made to crash recovery to make it work seamlessly with Raft. Crashes can happen at any time in the lifetime of a transaction and hence the protocol has to ensure consistency of members. Here are some key insights on how we made it work.

Transaction was not flushed to binlog: In this case, the in-memory transaction payload (still in mysqld process memory as an in-memory buffer) would be lost and the prepared transaction in engine would be rolled back on process restart. Since there was no extra uncommitted transaction in the Raft log, no reconciliation with other members needs to be done.
Transaction was flushed to binlog but never reached other members: Mysqld acts as a transaction coordinator and runs a two-phase commit protocol between the engine and the replicated binlog as the participants. On crash recovery, the prepared transaction in engine (e.g., InnoDB or MyRocks) would be rolled back (engine had not reached commit). Raft would go through failover, and a new leader would be elected. This leader would not have this transaction in its binlog and henceforth would truncate this transaction from the erstwhile leader’s binlog because of to a higher term (by pushing a No-Op message), when the erstwhile leader joins back the ring.
Transaction was flushed to binlog and reached to next leader. Current leader died before committing to the engine: Similar to no. 2 above, the prepared transaction in the engine would be rolled back. The erstwhile leader would join the Raft ring as a follower. In this case, the new leader would have this transaction in its binlog and hence no truncation would happen, since the logs would match. When the commit marker is sent by the new leader, the transaction would be reapplied again from scratch.

Raft-initiated state machine transitions

Failover and regular maintenance operations can trigger leadership changes in Raft. After a leader is elected, the MyRaft plugin would try to to transition the accompanying MySQL into primary mode. For this, the plugin would orchestrate a set of steps. These callbacks from Raft → MySQL would abort in-flight transactions, roll back in-use GTIDs, transition the engine side log from apply-log to binlog, and eventually set the proper read_only settings. This mechanism is complex and currently not open sourced.

FlexiRaft

Since the Raft paper and Apache Kudu supported only a single global quorum, it would not work well at Meta, where rings were large but the data path quorum needed to be small.

To circumvent this issue, we innovated on FlexiRaft, borrowing ideas from Flexible Paxos.

At a high level, FlexiRaft allows Raft to have a different data commit quorum (small) but take a corresponding hit on the leader election quorum (large). By following provable guarantees of quorum intersection, FlexiRaft ensures that the longest log rules of Raft and the appropriate quorum intersection will guarantee provable safety.

FlexiRaft supports single region dynamic mode. In this mode, members are grouped together by their geo-region. The current quorum of Raft depends on who the current leader is (hence the name “single region dynamic”). The data quorum is the majority of voters in the leader’s region. During promotions, if terms are continuous, the Candidate will intersect with the last known leader’s region. FlexiRaft would also ensure that the quorum of the Candidate’s region is also attained, otherwise the subsequent No-Op message could get stuck. If in the rare case the terms are not continuous, Flexi Raft would try to figure out a growing set of regions which need to be intersected with for safety or, in the worst case, would fall back to the N region intersection case of Flexible Paxos. Thanks to pre-elections and mock elections, the incidences of term gaps are rare.

Control plane operations (promotions and membership changes)

In order to serialize promotion and membership change events in the binlog, we hijacked the Rotate Event and Metadata event of the MySQL binary log format. These events would carry the equivalent of No-Op messages and add-member/remove-member operations of Raft. Apache Kudu did not have support for joint consensus, hence we only allow one-at-a-time membership changes (you can change the membership by only one entity in one round to follow the rules of implicit quorum intersection).

Automation

With the implementation of MySQL Raft, we reached a very clean separation of concerns for the MySQL deployment. The MySQL server would be responsible for safety via Raft’s replicated state machine. The no-data-loss guarantee would be provably enshrined in the server itself. Automation (Python scripts, daemons) would initiate control plane operations and monitor the health of the fleet. It would also replace members or do promotions via Raft during maintenance or when a host failure was detected. Once in a while, automation could also change the regional placement of MySQL topology. Changing the automation to adapt to Raft was a massive undertaking, spanning multiple years of development and rollout effort.

During prolonged maintenance events, automation would set leadership ban information on Raft. Raft would disallow those banned entities from becoming leader or evacuate them promptly on inadvertent election. The automation would also promote away from those regions into other regions.

Learning from rollouts and challenges encountered along the way

Rolling out Raft to the fleet was a huge learning for the team. We initially developed Raft on MySQL 5.6 and had to migrate to MySQL 8.0.

One of the key learnings was that while correctness was easier to reason with Raft, the Raft protocol in itself does not help much in the concern of availability. Since our MySQL data quorum was very small (two out of three in-region members), two bad entities in the region could pretty much shatter the quorum and bring down availability. The MySQL fleet undergoes a good amount of churn every day (due to maintenance, host failures, rebalancing operations), so initiating and doing membership changes promptly and correctly were a key requirements for constant availability. A large part of the rollout effort was focused at doing logtailer and MySQL replacements promptly so that the Raft quorums were healthy.

We had to enhance kuduraft to make it more robust for availability. These improvements were not part of the core protocol but can be considered as engineering add-ons to it. Kuduraft has the support for pre-elections, but pre-elections are done only during a failover. During a graceful transfer of leadership, the designated Candidate moves directly to a real election, bumping the term. This leads to stuck leaders (kuduraft does not do auto step-down). To address this problem, we added a mock elections feature, which was similar to pre-elections but happened only upon a graceful transfer of leadership. Since this was an async operation, it did not increase promotion downtimes. A mock election would weed out cases where a real election would partially succeed and get stuck.

Handling byzantine failures: Raft’s membership list is considered to be blessed by Raft itself. But during the provisioning of new members, or because of races in automation, there could be bizarre cases of two different Raft rings intersecting. These zombie membership nodes had to be weeded out and should not be able to communicate with each other. We implemented a feature to block RPCs from such zombie members to the ring. This was, in some ways, a handling of a byzantine actor. We enhanced the Raft implementation after noticing these rare incidents that happened in our deployment.

Monitoring the MySQL Raft rollout

While launching MySQL Raft, one of the goals was to reduce operational complexity for on-calls, so that engineers could root-cause and mitigate issues. We built several dashboards, CLI tools, and scuba tables to monitor Raft. We added copious logging to MySQL, especially around the area of promotions and membership changes. We created CLIs for quorum and voting reports on a ring, which help us quickly identify when and why a ring is unavailable (shattered quorum). The investment in the tooling and automation infrastructure went hand-in-hand and might have been a bigger investment than the server changes. This investment paid off big-time and reduced operational and onboarding pain.

Quorum Fixer

Although it is undesirable, quorums do get shattered every now and then, leading to availability loss. The typical case is when automation does not detect unhealthy instances/logtailers in the ring and does not replace them quickly. This can happen because of poor detection, worker queue overload, or a lack of spare host capacity. Correlated failures, when multiple entities in the quorum go down at the same time, are less typical. These don’t happen often, because the deployments try to isolate failure domains across critical entities of the quorum through proper placement decisions. Long story short: At scale, unexpected things happen, despite existing safeguards. Tools need to be available to mitigate such situations in production. We built Quorum Fixer in anticipation of this.

Quorum Fixer is a manual remediation tool authored in Python that squelches the writes on the ring. It does out-of-band checks to figure out the longest log entity. It forcibly changes the quorum expectations for a leader election inside Raft, so that the chosen entity becomes a leader. After successful promotion, we reset the quorum expectation back, and the ring typically becomes healthy.

It was a conscious decision to not run this tool automatically, because we want to root cause and identify all cases of quorum loss and fix bugs along the way (not have them silently be fixed by automation).

Rolling out MySQL Raft

Transitioning from semisynchronous to MySQL Raft over a massive deployment is difficult. For this we created a tool (in Python) called enable-raft. Enable-raft orchestrates the transition from semisynchronous to Raft by loading the plugin and setting the appropriate configs (mysql sys-vars) on each of the entities. This process involves a small downtime for the ring. The tool was made robust over time and can roll out Raft at scale very quickly. We have used it to safely roll out Raft.

Testing and shadow workflow

Needless to say, doing a change in the core replication pipeline of MySQL is a very difficult project. Since data safety is at stake, testing was key for confidence. We leveraged shadow testing and failure injection significantly during the project. We would inject thousands of failovers and elections on test rings before every RPM package manager rollout. We would trigger replacements and membership changes on the test assets to trigger the critical code paths.

Long-running testing with data correctness checks were also key. We have automation that runs nightly on the shards, ensuring consistency of primaries and replicas. We are alerted to any such mismatch, and we debug it.

Performance

The performance of the write path latency for Raft was equivalent to semisync. The semisync machinery is slightly simpler and hence expected to be leaner, however we optimized Raft to get the same latencies as semi-sync. We optimized kuduraft to not add any more CPU to the fleet in spite of pulling in many more responsibilities that previously had been outside the server binary.

Raft made order-of-magnitude improvements to promotions and failover times. Graceful promotions, which are the bulk of leadership changes in the fleet, improved significantly, and we can typically finish a promotion in 300 milliseconds. In the semisync setups, since the service discovery store would be the source of truth, the clients noticing the finish of promotion would be much longer, leading to more elevated end user downtimes on a shard.

Raft typically does a failover within 2 seconds. This is because we heartbeat for Raft health every 500 milliseconds and start an election when three successive heartbeats fail. In the semisync world, this step was orchestration heavy and would take 20 to 40 seconds. Raft thereby gave a 10x improvement in downtimes for failover cases.

Next steps

Raft has helped solve problems with the operational management of MySQL at Meta by providing provable safety and simplicity. Our goals of having a hands off-management of MySQL consistency, and having tools for the rare cases of availability loss, are mostly met. Raft now opens up significant opportunities in the future, as we can focus on enhancing the offering to the services that use MySQL. One of the asks’ from our service owners is to have configurable consistency. Configurable consistency will allow the owners at the time of onboarding, to select whether the service needs X-region quorums or quorums that ask for copies in some specific geographies (e.g., Europe and the United States). FlexiRaft has seamless support for such configurable quorums, and we plan to start rolling out this support in the future. Such quorums will correspondingly lead to higher commit latencies, but use cases have to be able to trade-off between consistency and latency (e.g., PACELC theorem).

Because of the proxying feature (ability to send messages using a multihop distribution topology), Raft can also save network bandwidth across the Atlantic. We plan to use Raft to replicate from the United States to Europe only once, and then use Raft’s proxying feature to distribute within Europe. This will increase latency, but it will be nominal given that the bulk of the latency is in the cross-Atlantic transfer and the extra hop is much shorter.

Some of the more speculative ideas in Meta’s database deployments and distributed consensus space are about exploring leaderless protocols, like Epaxos. Our current deployments and services have worked with the assumptions that come with strong leader protocols, but we are starting to see a trickle of requirements where services would benefit from more uniform write latency in the WAN. Another idea that we are considering is to disentangle the log from the state machine (the database) into a disaggregated log setup. This will allow the team to manage the concerns of the log and replication separately from the concerns of the database storage and SQL execution engine.

Acknowledgements

Building and deploying MySQL Raft at Meta scale needed significant teamwork and management support. We would like to acknowledge the following people for their role in making this project a success. Shrikanth Shankar, Tobias Asplund, Jim Carrig, Affan Dar and David Nagle for supporting the team members during this journey. We would also like to thank the able Program Managers of this project Dan O and Karthik Chidambaram who kept us on track.

The engineering effort involved key contributions from several current and past team members including Vinaykumar Bhat, Xi Wang, Bartlomiej Pelc, Chi Li, Yash Botadra, Alan Liang, Michael Percy, Yoshinori Matsunobu, Ritwik Yadav, Luqun Lou, Pushap Goyal, Anatoly Karp and Igor Pozgaj.

The malware threat landscape: NodeStealer, DuckTail, and more

Wed, 03 May 2023 14:00:00 +0200

We’re sharing our latest threat research and technical analysis into persistent malware campaigns targeting businesses across the internet, including threat indicators to help raise our industry’s collective defenses across the internet.
These malware families – including Ducktail, NodeStealer and newer malware posing as ChatGPT and other similar tools– targeted people through malicious browser extensions, ads, and various social media platforms with an aim to run unauthorized ads from compromised business accounts across the internet.
We’ve detected and disrupted these malware operations, including previously unreported malware families, and have already seen rapid adversarial adaptation in response to our detection, including some of them choosing to shift their initial targeting elsewhere on the internet.

Today, we’re sharing our latest work to detect and disrupt malware campaigns targeting business users across the internet.

We know that malicious groups behind malware campaigns are extremely persistent, and we fully expect them to keep trying to come up with new tactics and tooling in an effort to survive disruptions by any one platform where they spread. That’s why our security teams tackle malware – one of the most persistent threats online – as part of our defense-in-depth approach through multiple efforts at once. It includes: malware analysis and targeted threat disruption, continuously improving detection systems to block malware at scale, security product updates, community support and education, threat information sharing with other companies and holding threat actors accountable in court. This helps raise the cost for these malicious groups and limits the lifecycle of any single strain of malware – forcing threat actors to continue to invest time and resources into constantly adapting to stay afloat

With much malware we’ve seen and countered over the years being hosted outside of social media, including our services, we encourage people to be cautious when downloading new software like browser extensions or mobile apps, or downloading files across the internet. For more security tips, visit our Newsroom.

The malware threat landscape

Before we dive into the technical analysis of one of the new malware families we recently detected – NodeStealer, we’re sharing the latest trends we’ve seen across this threat landscape more broadly to help inform our collective defenses across the internet.

While many malware campaigns use off-the-shelf tooling available powered by a booming marketplace, the focus of our analysis today is on malware families that are custom-built to target business users on particular internet services. Here is what stood out to us in our threat research into these tailored operations and their tooling.

Adversarial adaptation in response to disruptions: Ducktail malware in focus

With more security teams across our industry publicly reporting and sharing threat indicators into various malware operations, we’ve seen operators invest in a number of tactics to enable persistence and adapt to enforcements.

Many of them try to spread across many internet services, including social media, ad platforms, file-sharing and file-hosting services, link shorteners, and even niche websites for creators and their fans. This is likely an attempt to ensure that a complex, multi-pronged malware campaign can withstand takedowns by any one of these services because they each only have limited visibility into the entire malicious operation.

A long-running malware family known in the security community as Ducktail is a good example. For several years, we’ve tracked and blocked iterations of Ducktail originating from Vietnam that have evolved as a result of enforcements by Meta and our industry peers. Ducktail is known to target a number of platforms across the internet, including:

LinkedIn to socially engineer people into downloading malware;
Browsers like Google Chrome, Microsoft Edge, Brave, and Firefox to gain access to people’s information on desktop; and
File-hosting services such as Dropbox and Mega, to host malware.

In addition, many malware families are very astute to the detection of their actions which constantly forces them to adjust in hopes of buying a short advantage window over the defender community.

As an example, in its latest iteration, Ducktail operators, likely in response to our round-the-clock detection terminating stolen sessions, began automatically granting business admin permissions to requests for ad-related actions sent by attackers as an attempt to speed up their operations before we block them. However, our continued detection and mitigations provide protections to businesses against these latest adaptations. In addition, as we learn from these investigations, we keep innovating product security approaches. Today, we’re sharing a number of new product features making business accounts more resilient to these attacks.

Finally, we also issued a cease and desist letter to individuals behind it in Vietnam, referred to law enforcement, and will consider all appropriate additional enforcement options against malicious actors behind targeting people on our services.

Malware lures follow popular trends

Our research and that of security researchers has shown time and again that malware operators, just like spammers, try to latch onto hot-button issues and popular topics to get people’s attention. With an ultimate goal to trick people into clicking on malicious links or downloading malicious software, the latest wave of malware campaigns have taken notice of generative AI tools becoming popular.

Over the past several months, we’ve investigated and taken action against malware strains taking advantage of people’s interest in OpenAI’s ChatGPT to trick them into installing malware pretending to provide AI functionality.

These latest attempts, just like Ducktail, targeted a number of platforms across the internet, including file-sharing services Dropbox, Google Drive, Mega, MediaFire, Discord, Atlassian’s Trello, Microsoft OneDrive, and iCloud to host this malware. Its ultimate goal is to compromise businesses with access to ad accounts across the internet.

Since March 2023 alone, we have found around ten malware families using ChatGPT and other similar themes to compromise accounts across the internet. In one case, we’ve seen threat actors create malicious browser extensions available in official web stores that claim to offer ChatGPT-based tools. They would then promote these malicious extensions on social media and through sponsored search results to trick people into downloading malware. In fact, some of these extensions did include working ChatGPT functionality alongside malware, likely to avoid suspicion from official web stores. We’ve blocked over 1,000 unique ChatGPT-themed malicious URLs from being shared on our platforms and shared them to our industry peers so they, too, can take action, as appropriate.

Similar to Ducktail, we’ve seen blocking and public reporting of these malicious strains force their operators to rapidly evolve tactics to try and stay afloat. We’ve seen them use cloaking in an attempt to circumvent automated ad review systems, and leverage popular marketing tools like link-shorteners to disguise the ultimate destination of these links. Many of them also changed their lures to other popular themes like Google’s Bard and TikTok marketing support. Some of these campaigns, after we blocked malicious links to file-sharing and site hosting platforms, began targeting smaller services, such as Buy Me a Coffee – a service used by creators to accept support from their audiences – to host and deliver malware.

An example of malware hosted on a third-party website disguised as a ChatGPT download.

Building custom malware to target specific internet platforms

Our industry continues to detect and disrupt custom-built novel malware that targets business for advertising fraud. By tailoring these operations to be used for attempted business account compromise on a particular service – like Facebook or Google or others – threat actors are able to focus their tooling to use more sophisticated forms of account compromise, like capturing session tokens in an attempt to circumvent two factor authentication requirements. They can also include functionality that can automatically detect connections between the compromised user and business accounts they might be connected to.

A novel malware strain we named NodeStealer that we recently uncovered and disrupted early in its operation is a good example of this trend. We’re sharing a deep dive into how this particular custom-built malware operates, including our malware analysis.

Novel NodeStealer malware: An in-depth analysis

In late January 2023, our security team identified a new malware NodeStealer that targeted internet browsers on Windows with a goal of stealing cookies and saved usernames and passwords to ultimately compromise Facebook, Gmail, and Outlook accounts. NodeStealer is custom-written in JavaScript and bundles the Node.js environment. We assessed the malware to be of Vietnamese origin and distributed by threat actors from Vietnam.

We identified NodeStealer early – within two weeks of it being deployed – and took action to disrupt it and help people who may have been targeted to recover their accounts. As part of this effort, we submitted takedown requests to third-party registrars, hosting providers, and application services such as Namecheap, which were targeted by these threat actors to facilitate distribution and malicious operations. These actions led to a successful disruption of the malware. We have not observed any new samples of malware in the NodeStealer family since February 27 of this year and continue monitoring for any potential future activity.

We are sharing threat indicators and information about how this malware works to enable further security research by our industry to help us all strengthen our collective defense.

Analyzing the NodeStealer malware

NodeStealer samples are typically disguised as PDF and XLSX files with an appropriate corresponding icon and a filename meant to trick people into opening malicious files. This tactic makes it difficult for people to see that they are opening a potentially malicious executable instead of an innocuous document:

An example of malware icons.

File metadata and packaging

Here’s an example of a NodeStealer file. At the time of discovery, this file only had one detection on VirusTotal. It is likely because the file is almost entirely comprised of the Node.js environment and contains novel malicious code.

A screenshot of VirusTotal scanning results at the time of detection.

While the file is a Windows executable file (with an .exe extension), it is disguised as a PDF file with a PDF icon. We also observed metadata on the file that attempts to disguise the file as a product of “MicrosoftOffice:”

An example of file metadata.

Diving a bit more into the file structure, we found that this malware is written in Javascript, executed using Node.js, and is compiled into a Windows executable with a tool from the Node Package Manager (NPM) called pkg. This particular sample is around 46 MB in size, however we have seen files ranging from 46-51 MB. The file is large because it bundles the entire Node.js environment and all third-party package dependencies.

For context, Node.js is a cross-platform, open-source Javascript runtime environment, which provides various Javascript libraries and is often used to develop web applications. Pkg is a command-line tool that packages node.js code into an executable file for various platforms including Linux, macOS, and Windows.

Malware behaviors

Persistence

When executed, the malware first establishes persistence to ensure that it continues to operate after the victim restarts the machine. The malware uses the auto-launch module on Node.js to do so*

A screenshot of the persistence-enabling code snippet.

In this example, there is a new registry key added under “HKCU\Software\Microsoft\Windows\CurrentVersion\Run\” to execute the malware upon startup.

Stealing browser data

The ultimate goal of this malware is to steal stored password and cookie session information from Chromium-based browsers on the target’s computer. The malware targets Chrome, Opera, Microsoft Edge and Brave browsers. For each of them, the malware will:

First, reference the file paths to files that store sensitive user information such as cookies and credentials (username/password) for various sites:

The malware then decrypts the sensitive data from the browser data stores. Since the browser encrypts the user’s information before storing it, the malware performs the following steps to decrypt the user data:

It will read the encrypted_key from the “Local State” file, Base64 decode it, and retrieve the decryption key by using the win32crypt Node.js library:

Data decryption routine.

After retrieving the decryption key, the malware reads data from the “Cookies” file, which is an SQLite database containing cookie values. The malware looks for a Facebook session cookie and will only continue if one is found. If no Facebook session cookie is found, the malware does not extract more information:

Extracting cookie data and decrypting it.

If a Facebook session cookie is found, the malware starts reading data from the “Login Data” file, which is an SQLite database containing saved usernames and passwords. The malware specifically targets user credentials for Facebook, Gmail, and Outlook. We hypothesize that the malware steals email credentials to compromise the user’s contact point and potentially to access other online accounts connected to that email account:

Retrieving the stored usernames and passwords from the Browser saved password database.

With the decryption key now extracted, the malware decrypts the encrypted data read from the “Login Data” file using AES decryption.

Account reconnaissance

After retrieving the Facebook credentials from the target’s browser data, the malware uses it to make several unauthorized requests to Facebook URLs to enumerate account information related to advertising. The malware gains access to this information by making requests from the targeted user’s computer to the APIs used by our Facebook web and mobile apps, which masquerades its activity behind the user’s actual IP address, cookie values, and system configuration – appearing like a legitimate user and their session. This makes detection of this activity significantly more difficult. The stolen information then enables the threat actor to assess and then use users’ advertising accounts to run unauthorized ads.

Command and control mechanisms

After retrieving the stored browser information and performing the Facebook account reconnaissance, the malware exfiltrates all stolen data to the threat actor’s command-and-control (C2) server hosted at: hxxps://bot2q.advertiser-noreplysupport[.]dev. This C2 server URL is hard-coded into the malware.

The malware aggregates the stolen data in a JSON object which is then Base64 encoded. In an attempt to evade detection, the malware makes a GET request to: hxxps://bot2q.advertiser-noreplysupport[.]dev/avatar.png, with the Base64 data placed in the “Authorization” HTTP header:

Exfiltration of stolen information.

Based on publicly available information, the malware C2 domain was registered with Namecheap on December 27th, 2022. At the time of this analysis, the domain name resolved to the OVH VPS IP 15[.]235[.]187[.]170. We also observed a published DNS mail exchange (MX) record on that domain using Namecheap’s “Private Email” service. The C2 server appears to be a Node.js “Express”-based web application hosted by Nginx, judging by the server’s response header values.

We reported this domain to Namecheap and it is no longer resolving (as of January 25th, 2023).

Threat Indicators

These indicators are available in machine readable formats on our Malware Detection repository on GitHub.

*Please note that we have reformatted some of the source code contained in this blog in order to make it easier to read and understand. We have also added comments to the source code to provide context and explain how it works.

A fine-grained network traffic analysis with Millisampler

Mon, 17 Apr 2023 18:15:00 +0200

What the research is:

Millisampler is one of Meta’s latest characterization tools and allows us to observe, characterize, and debug network performance at high-granularity timescales efficiently. This lightweight network traffic characterization tool for continual monitoring operates at fine, configurable timescales. It collects time series of ingress and egress traffic volumes, number of active flows, incoming ECN marks, and ingress and egress retransmissions. Additionally, Millisampler is also able to identify in-region traffic and cross-region traffic (longer RTT). Millisampler runs on our server fleet collecting short, periodic snapshots of this data at 100us, 1ms, and 10ms time granularities, stores it in local disk, and makes it available for several days for on-demand analysis. Since the data is only aggregated flow-level header information, it does not contain any personally identifiable information (PII). Even with the minimal amount of information it collects, Millisampler data has proven very useful in practice, particularly when combined with existing coarser-grained data — we are able to see clearly how switch buffers or host NICs, for example, might be unable to handle the ingress traffic pattern.

How it works:

Millisampler comprises userspace code to schedule runs, store data, and serve data, and an eBPF-based tc filter that runs in the kernel to collect fine-timescale data. The user code attaches the tc filter and enables data collection. A tc filter is among the first programmable steps on the receipt of a packet and near the last step on transmission. On ingress, this means that the eBPF code executes on the CPU core that is processing the soft irq (bottom half) as the packet is directed toward the owning socket. Because processing happens on many CPU cores, to avoid locks, we use per-CPU variables, which increase the memory requirement to eliminate risk of contention. To minimize overhead, we sample periodically and for short periods of time. Userspace therefore configures two parameters in Millisampler: the sampling interval and the number of samples. We schedule runs with three sampling intervals: 10ms, 1ms, and 100μs, with a fixed number of samples to 2,000 for all sampling intervals. This means that our observation periods range from 200ms (100μs sampling rate) to 20s (10ms sampling rate), allowing us to observe events at sub-RTT to cross-region RTT time scales, and, at the same time, fix the memory footprint of each run to 2,000 64-bit counters per CPU core for each value we measure.

Millisampler collects a variety of metrics. It computes ingress and egress total bytes and ingress ECN-marked bytes from the lengths and CE bits of the packets. Millisampler also soundsTTLd marked retransmits. Millisampler uses a 128-bit sketch to estimate the number of active (incoming and outgoing) connections. Using the sketch results in an approximation of the connection count that is precise up to a dozen connections and saturates at around 500 connections per sampling interval. Although there is space for additional precision, in practice, more than the actual number of connections, the qualitative variation between a few connections to dozens or hundreds of connections has been helpful toward identifying patterns of traffic with more connections (heavy incast) as opposed to more traffic with fewer connections.

Why it matters:

Millisampler is a powerful tool for troubleshooting and performance analysis. Two contrasting network performance faults that we solved at Meta in the last few years relied on our needing a fine-grained view of traffic. The first problem featured synchronized traffic bursts at fine time scales, and seeing this motivated us to build and deploy Millisampler to catch it quickly if it happened again. The second, which an early Millisampler prototype helped root-cause, featured a NIC driver bug that caused it to stop delivering packets for milliseconds at a time, thereby proving the value of Millisampler in complex investigations. While Millisampler (or Millisampler-like data) played an important role in these investigations, it was only as part of our rich ecosystem of data collection tools that track a dizzying array of metrics across hosts and a network.

Beyond such incidents, Millisampler data has also proven useful in characterizing and analyzing traffic characteristics of services, allowing us to design and deploy a range of solutions to help improve their performance. For example, we have been able to characterize the nature of bursts across a number of services in order to understand the intensity of incast and tune transport performance accordingly. We have also been able to look at complex interactions between short-RTT and long-RTT flows and understand how bursts of either affect fairness for the other. In a following post, we will look at an extension of Millisampler — Syncmillisampler — where we run Millisampler synchronously across all hosts in a rack and use that data to identify buffer contention in the top-of-rack ASICs.

Read the full paper:

A microscopic view of bursts, buffer contention, and loss in data centers

Acknowledgements:

Ehab Ghabashneh, Cristian Lumezanu, Raghu Nallamothu, and Rob Sherwood also contributed to the design and implementation of Millisampler.

Deploying key transparency at WhatsApp

Thu, 13 Apr 2023 14:59:00 +0200

WhatsApp has launched a new cryptographic security feature to automatically verify a secured connection based on key transparency.
The feature requires no additional actions or steps from users and helps ensure that a conversation is secure.
Key transparency solutions help strengthen the guarantee that end-to-end encryption provides to private, personal messaging applications in a transparent manner available to all.
We have published an open-source library called Auditable Key Directory (AKD). This enables anyone to verify audit proofs of the directory’s correctness. This underpins our key transparency deployment.

End-to-end encryption is the foundation of private messaging on WhatsApp, helping to ensure that only you and the person you’re communicating with can read what’s sent, and nobody in between, not even WhatsApp. It is among the most widely used deployments of end-to-end encryption and relies on public key cryptography first developed in the 1970s. From a technical point of view, for end-to-end encryption to be trusted, the “ends” of a conversation need to know that one another’s encryption keys are authentic and valid.

To do so, our most security conscious users have always been able to take advantage of our security code verification feature available under a user’s contact info. When in person, keys can be validated with a quick QR code scan or, if remote, sharing the unique 60-digit code.

This is the one of the strongest ways of verifying if a connection is secure. But in reality we know that double checking a long code is cumbersome, and our team has been looking at ways to make this easier for some time.

We’re excited to introduce a new cryptographic security feature to automatically verify a secure connection without the need for this long code. To do so, we’re building on key transparency by developing a new Auditable Key Directory (AKD), which is based on an open-sourced library. The AKD will enable WhatsApp clients to automatically validate that a user’s encryption key is genuine and enables anyone to verify audit proofs of the directory’s correctness.

Our approach to key transparency is two-pronged and introduces two new components:

The server (WhatsApp) maintains an append-only AKD of public keys mapped to user accounts.
A third-party audit record, wherein any change in the server directory is recorded in a publicly available, privacy-preserving audit record for anyone to verify.

With these two additions, users can automatically verify their conversation security thanks to the WhatsApp directory. As this is rolled out, security-conscious users who utilize the verify security code page will notice this verification process occurs quickly and automatically.

This system is a new service provided by WhatsApp that relies on public auditing to verify the end-to-end encryption status of personal conversations. While this system provides easy and convenient verification tools to our users, those who wish to verify their end-to-end encrypted sessions without utilizing WhatsApp servers at all are encouraged to utilize the traditional security code verification process in addition to this new automated process.

The public keys are only a tool that users have to encrypt their messages. The private key – which is used to decrypt messages – is on user devices. Nobody – not even WhatsApp – has access to those private keys. A list of public keys alone cannot provide access to anyone’s content.

How the “Verify Security Code” page works

The crux of end-to-end encrypted messaging is public/private key pairs. The private key is what you utilize to decrypt your messages sent from another party and never leaves your device. The public key, however, is what you give to others so they can encrypt messages. This is done by first giving the key to WhatsApp, where we store it on your behalf and give it to users who wish to message you.

The classic concern that end-to-end encryption was designed to guard against is a person-in-the-middle attack where you think you’re talking to just one user; however, you’re actually talking to a middle-man attacker, who provides an incorrect public key so that they hold the private key and can read your messages. The attacker may then use the correct public key for your contact, re-encrypt the message with it, and send it to the user. What stops this today? WhatsApp has a Security Page for each contact that has a QR code and a 60-digit number that can be verified outside of WhatsApp to make sure it matches what your contact sees on their device. In short, it’s a unique hash of both your public keys and their public keys, so if either of you have the wrong value, the hashes won’t match. When they do match this confirms a secure, end-to-end encrypted conversation.

What’s the problem key transparency is fixing?

While providing a strong guarantee of security, the QR code scanning/number matching feature requires communicating with your contacts outside of WhatsApp – whether it’s over a video-call, in real-life, on the phone, etc. This is:

Difficult to do in 1:1 communications, especially as users change devices (and therefore encryption keys) over time;
Even harder in small groups, since each pair of participants has a unique code (there are no “group” codes);
Is near-impossible to perform in large groups. Every time someone joins or leaves, enrolls a new companion device, changes their phone, etc. this needs to be redone for all participants. For example, in a group of 100 people, that’s 4950 pairs of security verifications.

Ideally, this wouldn’t be a manual process and could be verified through some kind of automated flow.

Enter key transparency: A protocol in which we establish an AKD on WhatsApp that maintains a record of public key changes. Additionally, we’ve established a third-party public repository of auditable change logs to the directory that updates whenever there’s additions to the directory. This is vital for transparency and to further strengthen our end-to-end encrypted guarantee. In effect, this confirms that the same public keys a user uses to contact a recipient are the same ones that everybody else also uses to communicate with the recipient.

Although key transparency does not substitute QR code scanning, it enhances and complements it in the following ways:

QR code scanning requires two people to coordinate out-of-band verification. In contrast, key transparency requires only a single client to initiate and perform a check against the directory, thus improving accessibility of the check process;
Key transparency serves as a public key consistency mechanism when manual QR code verification is impractical (for example in large group communication scenario);
It also serves as a lightweight first-check of end-to-end encryption, which improves adoption of end-to-end encryption checks to more users, benefiting messaging security at-large.

In the event that the automatic check returns a result showing that the connection may not be secure, we recommend users proceed with the manual security verification check.

The history of key transparency

Key transparency describes a protocol in which the server maintains an append-only record of the mapping between a user’s account and their public identity key. This allows the generation of inclusion proofs to assert that a given mapping exists in the directory at the time of the most recent update.

WhatsApp’s realization of key transparency is based on the original academic works on key transparency, starting with CONIKS and SEEMless, with extensions from a recent paper called Parakeet. Together, this resulted in the Rust AKD crate, which serves as the foundation for maintaining a key transparency solution along with generating inclusion and key history proofs from the directory. WhatsApp is hosting this AKD directory as an infrastructure available to all of our users.

Public keys cannot be used to decrypt a user’s messages or determine who you’ve been talking to. They are, however, necessary to make sure that someone is sending a message to the intended recipient by encrypting messages that only the holder of the public key’s associated private key can read.

A user may have many entries as they update their key over time. At WhatsApp’s scale this equates to billions of entries continually growing over time. When a user deletes their account, we remove all of the public keys for that account, but the fact a key existed at a point in time is immutable (we just can’t say what the key was).

How does key transparency work?

Security on principle

From a core design choice, multiple factors helped us decide to enhance the openness and security of this project. First off, the AKD, with all of its proof generation and verification logic, is open-source code. This is a Rust-based crate (library) for any entity that wants to manage an append-only directory with a publicly verifiable log or verify append-only audit proofs and participate as a public auditor of WhatsApp’s key transparency solution. A list of public keys alone cannot provide access to anyone’s content.

This library allows for the system to provide a significant guarantee on the correctness of the directory entries while not compromising security by being vulnerable to memory-based attacks. Additionally, we stuck with the decision to utilize Rust in most of the internal components outlined below.

Applying AKD to WhatsApp

High-volume key changes

WhatsApp deals with tens of thousands of key changes (registration, re-registration, etc.) per minute. This kind of volume is difficult to deal with when trying to insert into an append-only log.

Therefore, we decided to implement a distributed, high-throughput queue where “pending changes” live prior to being gathered together into a batch and inserted to form the next epoch. This allows us to do far larger batch inserts and greatly limits the number of database operations we need to make.

Since the changes to the AKD are additive based on the previous epoch we need to make sure that only a single update occurs at a time. A single processor, sequentially handling each update one-by-one, wouldn’t be able to keep up with the rate of changes within WhatsApp (no matter the database implementation). This adds some latency from the time a key is added or updated to when it’s “published” in the directory.

By batching keys together and making an epoch a collection of changes committed atomically, we can benefit from a lot of query optimizations due to many shared paths in the Merkle Tree stored in the database. The frequency to publish and emit new epochs is a tunable parameter that may be adjusted over time.

Public auditing at scale

The general requirement for all transparency solutions is to be publicly auditable, meaning that anyone, should they want to, can verify the transactions on the directory to assert that:

The history hasn’t been changed (existing records aren’t deleted or updated).
Changes are append-only.

When publishing a new change to the AKD, we emit an audit proof of those changes that is put into public storage for anyone interested. These audit records guarantee the properties of immutable history for anyone to verify should they want to while preserving the privacy of all users in the directory.

This does not risk anyone’s actual info from being public, nor does it reveal any patterns of behavior for any users. You can read more about how this privacy guarantee works as outlined in SEEMless and Parakeet, the academic works from which key transparency is based off.

WhatsApp’s key transparency rollout

Key transparency solutions help strengthen the guarantee that end-to-end encryption provides to private personal messaging applications in a transparent manner available to all. This technology underpins WhatsApp commitment and leadership in the security domain.

WhatsApp is already hosting and operating an AKD for all of our users, regardless of the version or platform of the application you’re utilizing. Users who utilize the verify security code function will start to notice that the verification is automatic as this rolls out on Android in the coming months. This is an important mechanism that empowers security-conscious users to verify an end-to-end encrypted personal conversation quickly.

A more technical deep-dive whitepaper that goes through potential attacks, additional details on data-flows and formats, and more will be released soon.

How Device Verification protects your WhatsApp account

Thu, 13 Apr 2023 14:58:00 +0200

WhatsApp has launched a new security feature that further helps prevent attackers from using vectors like on-device malware.
This security feature, called Device Verification, requires no action or additional steps from users and helps protect your account.
This feature is part of our broader work to increase security for our users from the growing threat of malware.

WhatsApp’s top priority is ensuring that users can communicate privately, simply, and securely. One of the strongest tools at our disposal is end-to-end encryption – meaning that nobody, not even WhatsApp, can read personal messages sent between users. This protects messages from interception, however, we’ve increasingly seen attackers are targeting the end points of communication – mobile devices themselves – and we are increasing our security mechanisms to keep user accounts safe.

In particular, we are concerned about malware that infects a mobile phone in much the same way a virus infects a computer. Malware is used to advance account takeover (ATO) attacks that send messages without the user’s knowledge or permission.

In our ongoing effort to safeguard peoples’ accounts and information on WhatsApp, we’re introducing a new security measure – called Device Verification – to help prevent ATO attacks. Device Verification blocks the attacker’s connection, while allowing the victim to use their WhatsApp account uninterrupted.

Why do we need Device Verification?

WhatsApp uses several cryptographic keys to ensure that communications across the app are end-to-end encrypted. One of these is the authentication key, which allows a WhatsApp client to connect to the WhatsApp server to re-establish a trusted connection. This authentication key allows people to use WhatsApp without having to enter a password, PIN, SMS code, or other credential every time they turn on the app.

This mechanism is secure because the authentication key cannot be intercepted by any third party including WhatsApp. If a device is infected with malware, however, the authentication key can be stolen.

We are primarily concerned about the popularity of unofficial WhatsApp clients that contain malware designed for this purpose. These unofficial apps put users’ security at risk – and it is why we encourage everyone using WhatsApp to use the official WhatsApp app.

Once malware is present on user devices, attackers can use the malware to capture the authentication key and use it to impersonate the victim to send spam, scams, phishing attempts, etc. to other potential victims.

Device Verification will help WhatsApp identify these scenarios and protect the user’s account without interruption.

How Device Verification works

WhatsApp has built Device Verification to benefit from how people typically read and react to messages sent to their device. When someone receives a message their WhatsApp client wakes up and retrieves the offline message from WhatsApp server. This process cannot be impersonated by malware that steals the authentication key and attempts to send messages from outside the users` device.

Device Verification introduces three new parameters:

A security-token that’s stored on the users` device.
A nonce that is used to identify if a client is connecting to retrieve a message from WhatsApp server.
An authentication-challenge that is used to asynchronously ping the users` device.

These three parameters help prevent malware from stealing the authentication key and connecting to WhatsApp server from outside the users` device

How a security-token gets bootstrapped

Every time someone retrieves an offline message, the security-token is updated to allow seamless reconnection attempts in future. This process is called bootstrapping the security-token.

How a new client connection is validated

Every time a WhatsApp client connects to the WhatsApp server, we require the client to send us the security-token that’s on their device. This allows us to detect suspicious connections from malware that is trying to connect to the WhatsApp server from outside the users` device.

What is an authentication-challenge?

An authentication-challenge is an invisible ping from the WhatsApp server to a user’s device. We only send these challenges on suspicious connections. There are three possible responses to the challenge:

Success: The client responds to the challenge from the connecting device.
Failure: The client responds to the challenge from a different device. This means the connection being challenged is very likely from an attacker and the connection will be blocked.
No Response: The client doesn’t respond to the challenge. This situation is rare and indicates that the connection being challenged is suspicious. We retry sending the challenge a few more times. If the client still doesn’t respond, the connection will be blocked.

What’s next

Malware is an issue that increasingly threatens everyone’s security and privacy. Device Verification has been rolled out to 100% of WhatsApp users on Android and is in the process of being rolled out to iOS users. It enables us to increase our users’ security without interrupting their service or adding an additional step they need to take. Device Verification will serve as an important and additional tool at WhatsApp’s disposal to address rare key-theft security challenges. We will continue to evaluate new security features to protect the privacy of our users.

Why xHE-AAC is being embraced at Meta

Tue, 11 Apr 2023 18:00:00 +0200

We’re sharing how Meta delivers high-quality audio at scale with the xHE-AAC audio codec.
xHE-AAC has already been deployed on Facebook and Instagram to provide enhanced audio for features like Reels and Stories.

At Meta, we serve every media use case imaginable for billions of people across the world — from short-form, user-generated content, such as Reels, to premium video on demand (VOD) and live broadcasts. Given this, we need a next-generation audio codec that supports a range of operating points with excellent compression efficiency and modern, system-level audio features.

To address these needs now and into the future, Meta has embraced xHE-AAC as the vehicle for delivering high-quality audio at scale.

The benefits of xHE-AAC

xHE-AAC is the latest member of the MPEG AAC audio codec family. The Fraunhofer Institute for Integrated Circuits IIS played a substantial role in the development of xHE-AAC and the MPEG-D DRC standard.

Today, xHE-AAC is already providing a superior audio experience on Facebook and Instagram — including on Reels and Stories — and has a number of valuable features.

Loudness management

With hundreds of millions of uploads per day across Facebook and Instagram, we receive audio tracks with loudness levels ranging from silence to full scale, and everything in between.

When people play these videos sequentially, they can perceive some audio as being too loud or too quiet. This creates listener fatigue from having to constantly adjust the volume.

xHE-AAC’s integrated loudness management system solves for loudness inconsistency while meticulously preserving creator intent by bringing the average loudness of all sessions to the same target level and managing the dynamic range of each session to fit the playback environment.

Instead of burning in a specific target level and dynamic range compression (DRC) profile during encoding, xHE-AAC allows us to leave the original audio characteristics untouched and delegate loudness management processing to the client via loudness metadata, for the optimal audio experience based on context.

As a result of xHE-AAC’s loudness management, people can spend more time immersed in their favorite content and less time fiddling with the volume control.

Adaptive bit rate audio

Most people who use our apps consume media on mobile devices and expect the highest audio quality without interruption. This presents a challenge for streaming media because connection quality varies on mobile and can result in a very uneven user experience.

To optimize quality under dynamic bandwidth constraints, we produce multiple video and audio qualities to match varying network conditions at playback time. Even though we produce multiple audio lanes, we have historically only employed adaptive bit rate (ABR) algorithms to switch video qualities during playback because it’s difficult to enable adaptive bit rate audio without compromising quality during lane transitions.

In order to enable seamless audio ABR, xHE-AAC introduces the concept of immediate playout frames (IPFs) that contain all the data necessary to start playing a new audio lane without relying on data from other frames. By placing an IPF at the beginning of each Dynamic Adaptive Streaming over HTTP (DASH) segment and aligning the segment durations of each lane, we can seamlessly switch between audio lanes during playback to provide the highest-quality audio at any available bandwidth while avoiding playback stalls.

After launching audio ABR on Facebook for Android, we were able to improve user experience by reducing the number of sessions where playback stalls.

How we deployed xHE-AAC

We generate xHE-AAC bitstreams using an encoder SDK provided by the Fraunhofer Institute for Integrated Circuits IIS, and then prepare the resulting audio files for DASH streaming with shaka-packager. The xHE-AAC encoder’s two-pass encoding mode is used to measure the input loudness envelope and average program loudness on the first pass and perform the actual audio data compression on the second pass. As an added benefit, two-pass encoding allows us to use loudness range control (LRAC) DRC, which mitigates pumping artifacts otherwise introduced by single-pass DRC algorithms.

To prepare an xHE-AAC audio adaptation set for ABR delivery, IPFs are inserted at constant time intervals, audio configuration parameters such as sample rate and channel configuration are kept constant, and unique stream identifiers are selected for each lane in the audio adaptation set.

At playback time, we custom-fit the audio to the listening environment by configuring a target loudness level and DRC effect type based on context, and thanks to the embedded loudness metadata, we can adapt a single xHE-AAC bitstream to a variety of audio consumption use cases, from headphones to device speakers and various levels of background noise. Finally, if the client is starved for data or bandwidth is plentiful, audio ABR will automatically switch audio qualities to ensure that the highest audio quality is played without interrupting the playback session.

Where can you experience xHE-AAC today?

You can experience xHE-AAC audio on Facebook for iOS and Android, as well as on targeted surfaces on Instagram, such as Reels and Stories. We encourage you to install the latest version of Facebook and Instagram apps on iOS 13+ and Android 9+ to ensure that you can experience it.

Acknowledgements

This work is the collective result of the entire Video Infrastructure and Instagram Media Platform teams at Meta in collaboration with Fraunhofer Institute for Integrated Circuits IIS. The author would like to extend special thanks to Abhishek Gera, Tim Harris, Arun Kotidath, Edward Li, Meng Li, Srinivas Lingutla, Denise Noyes, Mohanish Penta, David Ronca, Haixia Shi, Mike Starr, Cosmin Stejerean, Simha Venkataramaiah, Juehui Zhang, Runshen Zhu, and the engineering team at Fraunhofer Institute for Integrated Circuits IIS.

Build faster with Buck2: Our open source build system

Thu, 06 Apr 2023 18:00:00 +0200

Buck2, our new open source, large-scale build system, is now available on GitHub.
Buck2 is an extensible and performant build system written in Rust and designed to make your build experience faster and more efficient.
In our internal tests at Meta, we observed that Buck2 completed builds 2x as fast as Buck1.

Buck2, Meta’s open source large-scale build system, is now publicly available via the Buck2 website and the Buck2 GitHub repository. While it shares some commonalities with other build systems (like Buck1 and Bazel), Buck2 is a from-scratch rewrite. Buck2 features a complete separation of the core and language-specific rules, with increased parallelism, integration with remote execution and virtual file systems, and a redesigned console output. All of these changes are aimed at helping engineers and developers spend less time waiting, and more time iterating on their code.

Thousands of developers at Meta are already using Buck2 and performing millions of builds per day, with builds completing twice as fast as with Buck1. Our own internal analysis has shown that engineers were able to produce meaningfully more code when their builds were executed by Buck2, and we hope the wider industry will also see benefits.

Why rebuild Buck?

Build systems stand between a programmer and running their code, so anything we can do to make the experience quicker or more productive directly impacts how effective a developer can be. The goal of Buck2 was to keep what we liked about Buck1 (the core concepts and workflows), draw inspiration from innovations after Buck1 (including Bazel, Adapton, and Shake), and focus on speed and enabling new experiences.

Buck2’s design is based on the following principles:

The core build system has no knowledge of any language-specific rules. Having the rules separated from the core means that the rules are easier to change and understand. The core of Buck2 is written in Rust, and its language rules (like how to build C++) are written in Starlark. This separation is in contrast to Buck1 (where all rules are written in the core) and Bazel (where C++/Java are written in the core).
The build system is powered by a single incremental dependency graph, avoiding any phases (in contrast to Buck1 or Bazel). This decision eliminates many types of bugs and increases parallelism.
The rules API is designed to contain advanced features for performance, along with dynamic (or monadic) dependency features for expressibility. At the same time, these features are carefully restricted to ensure other properties (for example, fast queries or hermeticity) are not harmed.
The open source release is almost identical to our internal version. The only pieces swapped out are the toolchains (which point at the internal copies of our compilers) and remote execution (which points at our internal servers) — both have open source alternatives supplied. We are also releasing all the rules exactly as they are used internally. Furthermore, we have separated some of the logical components into separate crates (e.g. Starlark, Superconsole, Allocative, Gazebo) so that they can be used outside Buck2.
Buck2 is written to integrate with remote execution, with the ability to run actions on remote machines. We use the same API as Bazel, and have been testing remote execution with Buildbarn and EngFlow. While not required (and not really expected for people starting out with the open source version), we are able to efficiently compute recursive digests and send them to remote execution efficiently.
Buck2 is written to integrate with virtual file systems, where the entire repository is not all checked out, but fetched on demand as the files are accessed. In particular, we support Sapling-based file systems. To integrate well, we watch for file notifications (with Watchman) and request both files and file-digests without direct file operations. The benefit is that we can make virtual file systems as fast as a full checkout, but with the benefits of much faster checkout and much lower disk usage.

The key takeaway from all these improvements is that we have designed Buck2 to be fast. In real world usage, depending on the build, Buck2 is significantly faster than Buck1. If there are no source code changes, Buck2 is almost instant on subsequent builds. If there is a lot of work to do, Buck2 starts executing faster and has greater parallelism. This increase in speed is both a consequence of many of the factors above, but also care and attention.

The user view

For end users, Buck2 works mostly the same as Buck1 (which, to a first approximation, is fairly similar to Bazel). A user defines targets in a BUCK file:

rust_binary(
    name = “my_binary”,
    srcs = [“main.rs”],
    deps = [“:my_library”],
)

A user can then build with buck2 build //:my_binary. The value main.rs is a source file, and :my_library is a dependency defined in the same BUCK file. It’s worth noting that Buck2 is mostly compatible with the BUCK files of Buck1.

As well as the increase in speed, there are two major additional user-visible differences compared to Buck1.

First, the console output has been redesigned on top of the Superconsole library, which we specifically developed for Buck2. The console shows a few more details and feels a lot nicer to use:

Second, there is a persistent daemon that maintains a single dependency graph. When you change a BUCK file, a dependency, or a source file, we invalidate the appropriate things on the dependency graph, then request the output artifacts per the command line. In Buck1 there are multiple distinct dependency graphs, which result in phases like target graph construction, action graph construction, and then action graph execution. There are also some operations that aren’t performed on the graph. If certain things change in Buck1, then entire graphs are thrown away, rather than the minimum pieces being invalidated. With a single dependency graph, Buck2 is simpler, avoids more redundant work, and avoids explicit phases. Everything on the dependency graph has a key (how it is identified) and a value, along with a function to compute the value from the key and other related keys (following the model in the paper, “Build Systems a la Carte”).

The rule author view

While the user model follows Buck1 very closely, the approach for rules is completely different. There are lots of rules in Buck, for example rust_binary used above. While a rule in Buck1 was a Java class, baked into Buck1, a rule in Buck2 is entirely decoupled. Buck2 also ships with a “prelude” of rules that implement most of the Buck1 rules.

Buck1 rules were tuned over time, had lots of performance optimizations and powerful features like graph traversal, but those rules were also expected to obey a lot of complex invariants—sometimes breaking those rules. For Buck2, the rule API is entirely in Starlark, which forced us to abstract those features as generically reusable APIs, aiming to make them safe, expressive, and performant—a tricky balance. We’ll touch on two such examples.

OCaml dependencies

The dependency structure of the OCaml library is hard to express in Buck1. An OCaml library consists of a number of OCaml files. These must be compiled in dependency order—so if A.ml uses B.ml, you must compile B.ml first. Bazel requires the dependency of A.ml on B.ml to be written explicitly in the BUCK file. Buck1 and Buck2 both leave that internal dependency implicit and run the tool ocamldep to infer it, which requires less maintenance as the structure changes. What Buck1 did is run ocamldep just after parsing the BUCK file, which wasn’t really allowed, and it didn’t track dependencies, so if you changed the imports too much Buck1 gave spurious compilation failures. With Buck2, we can use the new primitive dynamic_output, which lets you run a command, read the output of the file, then wire up the rest of the graph—putting in the correct dependencies between the .ml files automatically.

C++ link dependencies

Consider the C++ linking model: To produce a library, you usually need to link together its build output, along with the transitive closure of the build output of its dependencies. If you simply duplicate the set of dependencies at each layer as you move up the graph, you end up with O(n²) memory usage. In Buck1, there was custom code in many rules to capture this pattern, relying on the ability to share Java values in memory and for the dependencies to be represented in place within the rule structure (as there was no reified dependency graph). In Buck2, there are much stronger abstraction boundaries, so such reuse has to be made more explicit. Therefore, we introduced transitive-sets (tsets) to capture this pattern of sets representing a transitive closure. By making tsets more abstract, we were also able to wire the tset directly into the underlying dependency graph, meaning this representation is efficient in both memory and computation time.

Try Buck2 now

We’re keen for people to give Buck2 a try, and we would be happy to hear any feedback (GitHub issues are the best way). We expect Buck2 will be most interesting to moderately sized multi-language projects. Visit the Buck2 getting started page for more information.

Introducing Velox: An open source unified execution engine

Thu, 09 Mar 2023 14:00:00 +0100

Meta is introducing Velox, an open source unified execution engine aimed at accelerating data management systems and streamlining their development.
Velox is under active development. Experimental results from our paper published at the International Conference on Very Large Data Bases (VLDB) 2022 show how Velox improves efficiency and consistency in data management systems.
Velox helps consolidate and unify data management systems in a manner we believe will be of benefit to the industry. We’re hoping the larger open source community will join us in contributing to the project.

[embedded content]

Meta’s infrastructure plays an important role in supporting our products and services. Our data infrastructure ecosystem is composed of dozens of specialized data computation engines, all focused on different workloads for a variety of use cases ranging from SQL analytics (batch and interactive) to transactional workloads, stream processing, data ingestion, and more. Recently, the rapid growth of artificial intelligence (AI) and machine learning (ML) use cases within Meta’s infrastructure has led to additional engines and libraries targeted at feature engineering, data preprocessing, and other workloads for ML training and serving pipelines.

However, despite the similarities, these engines have largely evolved independently. This fragmentation has made maintaining and enhancing them difficult, especially considering that as workloads evolve, the hardware that executes these workloads also changes. Ultimately, this fragmentation results in systems with different feature sets and inconsistent semantics — reducing the productivity of data users that need to interact with multiple engines to finish tasks.

In order to address these challenges and to create a stronger, more efficient data infrastructure for our own products and the world, Meta has created and open sourced Velox. It’s a novel, state-of-the-art unified execution engine that aims to speed up data management systems as well as streamline their development. Velox unifies the common data-intensive components of data computation engines while still being extensible and adaptable to different computation engines. It democratizes optimizations that were previously implemented only in individual engines, providing a framework in which consistent semantics can be implemented. This reduces work duplication, promotes reusability, and improves overall efficiency and consistency.

Velox is under active development, but it’s already in various stages of integration with more than a dozen data systems at Meta, including Presto, Spark, and PyTorch (the latter through a data preprocessing library called TorchArrow), as well as other internal stream processing platforms, transactional engines, data ingestion systems and infrastructure, ML systems for feature engineering, and others.

Since it was first uploaded to GitHub, the Velox open source project has attracted more than 150 code contributors, including key collaborators such as Ahana, Intel, and Voltron Data, as well as various academic institutions. By open-sourcing and fostering a community for Velox, we believe we can accelerate the pace of innovation in the data management system’s development industry. We hope more individuals and companies will join us in this effort.

An overview of Velox

While data computation engines may seem distinct at first, they are all composed of a similar set of logical components: a language front end, an intermediate representation (IR), an optimizer, an execution runtime, and an execution engine. Velox provides the building blocks required to implement execution engines, consisting of all data-intensive operations executed within a single host, such as expression evaluation, aggregation, sorting, joining, and more — also commonly referred to as the data plane. Therefore, Velox expects an optimized plan as input and efficiently executes it using the resources available in the local host.

Data management systems like Presto and Spark typically have their own execution engines and other components. Velox can function as a common execution engine across different data management systems. (Diagram by Philip Bell.)

Velox leverages numerous runtime optimizations, such as filter and conjunct reordering, key normalization for array and hash-based aggregations and joins, dynamic filter pushdown, and adaptive column prefetching. These optimizations provide optimal local efficiency given the available knowledge and statistics extracted from incoming batches of data. Velox is also designed from the ground up to efficiently support complex data types due to their ubiquity in modern workloads, and hence extensively relies on dictionary encoding for cardinality-increasing and cardinality-reducing operations such as joins and filtering, while still providing fast paths for primitive data types.

The main components provided by Velox are:

Type: a generic type system that allows developers to represent scalar, complex, and nested data types, including structs, maps, arrays, functions (lambdas), decimals, tensors, and more.
Vector: an Apache Arrow–compatible columnar memory layout module supporting multiple encodings, such as flat, dictionary, constant, sequence/RLE, and frame of reference, in addition to a lazy materialization pattern and support for out-of-order result buffer population.
Expression Eval: a state-of-the-art vectorized expression evaluation engine built based on vector-encoded data, leveraging techniques such as common subexpression elimination, constant folding, efficient null propagation, encoding-aware evaluation, dictionary peeling, and memoization.
Functions: APIs that can be used by developers to build custom functions, providing a simple (row by row) and vectorized (batch by batch) interface for scalar functions and an API for aggregate functions.
- A function package compatible with the popular PrestoSQL dialect is also provided as part of the library.
Operators: implementation of common SQL operators such as TableScan, Project, Filter, Aggregation, Exchange/Merge, OrderBy, TopN, HashJoin, MergeJoin, Unnest, and more.
I/O: a set of APIs that allows Velox to be integrated in the context of other engines and runtimes, such as:
- Connectors: enables developers to specialize data sources and sinks for TableScan and TableWrite operators.
- DWIO: an extensible interface providing support for encoding/decoding popular file formats such as Parquet, ORC, and DWRF.
- Storage adapters: a byte-based extensible interface that allows Velox to connect to storage systems such as Tectonic, S3, HDFS, and more.
- Serializers: a serialization interface targeting network communication where different wire protocols can be implemented, supporting PrestoPage and Spark’s UnsafeRow formats.
Resource management: a collection of primitives for handling computational resources, such as CPU and memory management, spilling, and memory and SSD caching.

Velox’s main integrations and experimental results

Beyond efficiency gains, Velox provides value by unifying the execution engines across different data computation engines. The three most popular integrations are Presto, Spark, and TorchArrow/PyTorch.

Presto — Prestissimo

Velox is being integrated into Presto as part of the Prestissimo project, where Presto Java workers are replaced by a C++ process based on Velox. The project was originally created by Meta in 2020 and is under continued development in collaboration with Ahana, along with other open source contributors.

Prestissimo provides a C++ implementation of Presto’s HTTP REST interface, including worker-to-worker exchange serialization protocol, coordinator-to-worker orchestration, and status reporting endpoints, thereby providing a drop-in C++ replacement for Presto workers. The main query workflow consists of receiving a Presto plan fragment from a Java coordinator, translating it into a Velox query plan, and handing it off to Velox for execution.

We conducted two different experiments to explore the speedup provided by Velox in Presto. Our first experiment used the TPC-H benchmark and measured close to an order of magnitude speedup in some CPU-bound queries. We saw a more modest speedup (averaging 3-6x) for shuffle-bound queries.

Although the TPC-H dataset is a standard benchmark, it’s not representative of real workloads. To explore how Velox might perform in these scenarios, we created an experiment where we executed production traffic generated by a variety of interactive analytical tools found at Meta. In this experiment, we saw an average of 6-7x speedups in data querying, with some results increasing speedups by over an order of magnitude. You can learn more about the details of the experiments and their results in our research paper.

Prestissimo results on real analytic workloads. The histogram above shows relative speedup of Prestissimo over Presto Java. The y-axis indicates the number of queries (in thousands [K]). Zero on the x-axis means Presto Java is faster; 10 indicates that Prestissimo is at least 10 times faster than Presto Java.

Prestissimo’s codebase is available on GitHub.

Spark — Gluten

Velox is also being integrated into Spark as part of the Gluten project created by Intel. Gluten allows C++ execution engines (such as Velox) to be used within the Spark environment while executing Spark SQL queries. Gluten decouples the Spark JVM and execution engine by creating a JNI API based on the Apache Arrow data format and Substrait query plans, thus allowing Velox to be used within Spark by simply integrating with Gluten’s JNI API.

Gluten’s codebase is available on GitHub.

TorchArrow

TorchArrow is a dataframe Python library for data preprocessing in deep learning, and part of the PyTorch project. TorchArrow internally translates the dataframe representation into a Velox plan and delegates it to Velox for execution. In addition to converging the otherwise fragmented space of ML data preprocessing libraries, this integration allows Meta to consolidate execution-engine code between analytic engines and ML infrastructure. It provides a more consistent experience for ML end users, who are commonly required to interact with different computation engines to complete a particular task, by exposing the same set of functions/UDFs and ensuring consistent behavior across engines.

TorchArrow was recently released in beta mode on GitHub.

The future of database system development

Velox demonstrates that it is possible to make data computation systems more adaptable by consolidating their execution engines into a single unified library. As we continue to integrate Velox into our own systems, we are committed to building a sustainable open source community to support the project as well as to speed up library development and industry adoption. We are also interested in continuing to blur the boundaries between ML infrastructure and traditional data management systems by unifying function packages and semantics between these silos.

Looking at the future, we believe Velox’s unified and modular nature has the potential to be beneficial to industries that utilize, and especially those that develop, data management systems. It will allow us to partner with hardware vendors and proactively adapt our unified software stack as hardware advances. Reusing unified and highly efficient components will also allow us to innovate faster as data workloads evolve. We believe that modularity and reusability are the future of database system development, and we hope that data companies, academia, and individual database practitioners alike will join us in this effort.

In-depth documentation about Velox and these components can be found on our website and in our research paper “Velox: Meta’s unified execution engine.”

Acknowledgements

We would like to thank all contributors to the Velox project. A special thank-you to Sridhar Anumandla, Philip Bell, Biswapesh Chattopadhyay, Naveen Cherukuri, Wei He, Jiju John, Jimmy Lu, Xiaoxuang Meng, Krishna Pai, Laith Sakka, Bikramjeet Vigand, Kevin Wilfong from the Meta team, and to countless community contributors, including Frank Hu, Deepak Majeti, Aditi Pandit, and Ying Su.

Meta’s head of AR hardware on the future of AR

Fri, 24 Feb 2023 15:00:00 +0100

While VR headsets have been with us for at least a decade, AR hardware barely exists today; indeed, the very components that will comprise the hardware scarcely exist, making it a truly zero-to-one innovation challenge. Meta’s Head of AR Glasses Hardware, Caitlin Kalinowski is helping to lead that charge. Kalinowski hails from Portsmouth, NH and studied mechanical engineering at Stanford University. She is a veteran of Meta’s VR team and led product design and integration of the Meta Quest.

Kalinowski spoke recently with Tech@ about the differences between VR and AR and how in her current role she is developing hardware that needs to be smaller, lighter, and similar or equally powerful to the VR headsets she helped develop earlier in her career. Kalinowski also explores how diversity – of skills, of temperaments, of sensibilities, of ethnicity, of gender, and sexuality – is critical for innovation. Finally, she discusses the role of collaboration and creating a safe space for failures that helps navigating the massive space between zero and one.

Meta’s head of AR glasses on the future of AR hardware

Fri, 24 Feb 2023 15:00:00 +0100

Kalinoswki spoke recently with Tech@ about the differences between VR and AR and how in her current role she is developing hardware that needs to be smaller, lighter, and similar or equally powerful to the VR headsets she helped develop earlier in her career. Kalinoswki also explores how diversity – of skills, of temperaments, of sensibilities, of ethnicity, of gender, and sexuality – is critical for innovation. Finally, she discusses the role of collaboration and creating a safe space for failures that helps navigating the massive space between zero and one.

How Meta brought AV1 to Reels

Tue, 21 Feb 2023 14:00:00 +0100

We’re sharing how we’re enabling production and delivery of AV1 for Facebook Reels and Instagram Reels.
We believe AV1 is the most viable codec for Meta for the coming years. It offers higher quality at a much lower bit rate compared with previous generations of video codecs.
Meta has worked closely with the open source community to optimize AV1 software encoder and decoder implementations for real-world, global-scale deployment.

As people create, share, and consume an ever-increasing volume of online videos, Meta is working to develop the most bandwidth-efficient ways to transcode content while maintaining reasonable compute and power consumption levels. Choosing the most appropriate video coding formats — the algorithms for compressing and decompressing the file — is crucial. Over the past two decades, researchers have developed video coding standards with ever-higher compression efficiency, including AVC, HEVC, and VVC, developed by MPEG/JVET, and VP9 and AV1, developed by Google and the Alliance for Open Media (AOM). Newer-generation standard typically can reduce the bandwidth by about 30 percent to 50 percent compared with its predecessor while maintaining similar visual quality. At the same time, however, each new standard has consumed substantially more energy and compute than the last, while necessitating encoders that were many times more complex.

We believe AV1 will be the most viable codec for Meta over the next several years. AV1 is the first-generation royalty-free video coding standard developed by AOM, of which Meta is a founding member. It delivers about 30 percent better coding efficiency than VP9 and HEVC — allowing people who use our apps to enjoy high-quality video at much lower bandwidth, and enabling us to maximize storage efficiency and reduce egress traffic, CDN prefetching/caching, and network congestion. AV1 also has a much richer feature set than other video coding standards and can support most of Meta’s typical production usages. AV1 is royalty-free, and both the encoder and decoder implementations are open sourced, with very active development and good support.

Over the past few years, Meta has worked closely with the open source community to optimize AV1 software encoder and decoder implementations for real-world, global-scale deployment. Our goal is to improve playback from what we currently offer with AVC and VP9. We want to ensure that as we roll out AV1, it delivers real value to the people who use our apps.

Finding the right AV1 encoders and decoders

Several open source and closed-source encoder implementations are ready for production, all almost as efficient as the AV1 reference encoder. In a paper, “Towards much better SVT-AV1 quality-cycles tradeoffs for VOD applications,” jointly published with Intel at last year’s SPIE conference, we benchmarked multiple open source encoders — including x264, x265, libvp9, libaom, SVT AV1, and VVC reference encoder (vvenc) — for a video on demand (VOD) use case. The graph below illustrates the trade-off between encoder quality (vertical axis) and complexity (horizontal axis). Every point on the graph corresponds to an encoder preset. The y-axis represents the average BD-rate relative to libaom cpu-used=0; lower values indicate better coding efficiency. The x-axis represents the encoding time in seconds in logarithmic scale.

A few highlights from this graph:

SVT-AV1, the productization encoder for the AV1 coding standard, maintains consistent performance across a wide range of complexity levels. With a total of 13 presets, SVT-AV1 can cover a complexity range that extends from the higher quality AV1 to the higher speeds AVC presets corresponding to more than 1000x change in complexity. This complexity range covers all open source software encoders used in production systems.
At any given point on the x-axis, SVT-AV1 can maximize coding efficiency compared with any other production encoder. For example, the M12 preset has similar complexity performance to the x264 veryfast preset, but M12 is about 30 percent more efficient.
At any given point on the y-axis, SVT-AV1 can maximize encoding speed compared with any other production encoder. For example, the M8 preset is about as efficient as libvp9 preset 0, but M8 is almost 10 times faster.

SVT-AV1 offers 13 presets, allowing a fine-grained trade-off between quality and speed. More importantly, SVT-AV1 now includes a “-fast-decode” option, which accelerates software decoding — with only a slight drop in efficiency — by automatically limiting or disabling the use of AV1 coding tools that are not software-decoder friendly. SVT-AV1 also provides thread management parameters to balance density and speed — critical for large-scale production — potentially enabling a one- or two-second delay for live video streaming. Many parameters can be adjusted to improve coding efficiency or to support certain production scenarios. Some AV1 coding tools that were proposed for use cases in deployment, such as reference frame scaling, super resolution, film grain synthesis, and switch frames, are also supported in SVT-AV1.

Our biggest challenge will be client-side decoding of AV1. Many hardware vendors, including Intel and NVIDIA, have begun to support AV1 hardware decoding on PC. However, we are serving video primarily to mobile phones, most of which don’t include AV1 hardware decoders. For now, we must rely primarily on software decoders. Two major open source software decoders are compatible with multiple platforms: dav1d was developed by VideoLAN and the open source community and can serve as an app-level decoder, while Google’s libgav1 is integrated into the Android SDK.

After extensively benchmarking the decoders’ performance, focusing on facets such as resource requirements, crashes and responsiveness, and frame drops, we decided to integrate dav1d into the player for both iOS and Android platforms. We have been working closely with the open source community to optimize dav1d’s performance. In the last year, we also worked with Ittiam to conduct a benchmark test on Android phones. dav1d can support 720p30 real-time playback on most of the devices in our sample, achieving 1080p30 on certain mid-range and high-end models.

Some Android phones, such as the Google Pixel 6 Pro and Samsung Galaxy S21, already support hardware AV1 decoding. In the near future, we expect that a growing number of high-end Android models will support AV1 hardware decoding, with mid-tier devices following eventually.

Deploying AV1 encoding on Facebook Reels and Instagram Reels

Early in 2022, we deployed AV1 encoding for Facebook and Instagram Reels. When someone uploads a video, the platform generates multiple bit-rate encodings tailored to the video’s projected watch time. To prevent stalling caused by changes in bandwidth, clients can select the version that best fits their connection speed — a technique called adaptive bit rate (ABR) streaming. For videos with high projected watch time, we use advanced ABR encoding based on the convex hull dynamic optimizer algorithm. For each uploaded video, we produce multiple down-scaled versions and encode each with multiple quantization parameters (QPs) and Constant Rate Factors (CRFs). For example, for a 1080p video, we might create seven resolutions and five CRFs, for a total of 35 encodings. After encoding, the system upscales decoded videos to the original resolution and calculates the quality score.

In the graph of rate distortion (RD) curves below, the x-axis represents the encoding bit rate and the y-axis the quality score, expressed in FB-MOS units on a scale of 0 to 100.

From these 35 RD points, we calculate the convex hull, a curve that connects the RD points on the upper left boundary. (Theoretically, if we could use all possible encoding resolutions and CRFs to produce a much denser plot, any point on the convex hull will be the most optimal encoding option for this video in terms of resolution and CRF value.) As illustrated above, we can then select the best encoding for delivery based on the target quality or bit rate.

We have simplified this complicated process. In previous studies, we found that we could use the high-speed preset for first-pass encoding and to produce the convex hull, and then take a second pass to encode the selected (resolution, CRF) points with the high-quality preset. Even though this approach requires additional encoding, it’s faster because the first pass can be done much more quickly. (Coding efficiency drops only slightly.) This approach works even if the first and second passes use different encoders. For example, we can use AVC or VP9 in the first pass and AV1 in the second. We can also leverage the hardware encoder in our internally designed ASICs to accelerate this process.

In the end, we decided on a two-stage hybrid hardware/software ABR encoding approach. Hardware AVC encoding is triggered at video upload time; for this stage, we store only the quality and bit rate information but not encoded bitstreams. When projected watch time of the video exceeds the threshold, second stage encoding is triggered with software AVC, VP9 or AV1 encoder based on the selected (resolution, CRF) on the convex hull.

We can easily add AV1 as one of the second-stage encoders; it is already deployed for Facebook Reels. We have implemented a similar heuristic-based approach for Instagram Reels. For one example video shown in the graph above, three encoding families with AVC, VP9, and AV1 have been produced. Their RD curves closely follow the convex hull from the first-stage encoding. For this particular video example, the best-quality AV1 encoding rivals those of the other two standards, but with a bit rate 65 percent lower than AVC’s and 48 percent lower than VP9’s. In addition, AV1 achieves the desired quality within a very narrow bit rate range, so we can further reduce compute and storage costs by producing fewer encodings during the second stage. As a result, people who use our products can enjoy high-quality video at much lower bandwidth.

AV1 decoder integration and testing

It was relatively easy to enable AV1 decoding and playback on the iOS devices. After just a few rounds of tests, we started delivery. To integrate the dav1d decoder on iOS, we found that two to four threads would meet most of our production needs; any additional threads would waste memory and power without boosting performance.

dav1d has two modes: synchronous and asynchronous. In synchronous mode, dav1d decodes one frame at a time but enables low-latency decoding for each frame. In asynchronous mode, dav1d decodes multiple compressed frames in parallel, postponing rendering until all frames are decoded. In theory, asynchronous mode provides higher throughput and faster decoding. For now, we adopt synchronous mode on iOS since it fits the existing player stack, but we are looking into migrating to asynchronous mode in the future.

To support the decoding of 10-bit AV1-encoded HDR video, we built a single dav1d binary that supports both 8- and 10-bit decoding and ensures that color information is preserved in the transcoding process.

The Android platform presented bigger challenges. First, because people engage with our apps on a vast number of Android models, we had to run local and large-scale A/B tests on various devices to find the optimal decoder configurations. To help debug and triage problems from the AV1 decoder library, we added extensive logging that propagated back error messages from throughout the player stack. This critical step helped us quickly identify and resolve issues in the integration process.

Second, because we are using app level software decoders, we used the hardware VP9 decoder and software AV1 decoder together when playing the same video stream, to correctly support mixed codec manifest and in-stream ABR lane switch. We needed to make sure they interacted with the render engine correctly.

We also needed to support devices with low performance and display resolution. (This was not a problem with iPhones.) Although AV1 can encode high-resolution videos at a much lower bit rate than VP9, bit rate reduction is smaller for low-resolution videos. That makes it difficult to show improvement in top-line delivery metrics for low-performance Android phones. We responded by using higher-quality encoding presets to boost coding efficiency in low-resolution ABR lanes.

Another challenge was that memory allocation and thread creation increased the decoding latency of the first few video frames, prolonging the software decoder start time, delaying player startup, and causing in-play stalls. This was most challenging with Reels, because people typically scroll across multiple Reels videos in quick succession. To improve scrolling performance, we prefetched multiple Reels videos earlier, before they were played.

Before we conduct a large-scale A/B delivery test, we have to check whether the end device is powerful enough for real-time decoding and playback of AV1 bitstreams. However, there is no easy way to classify Android phone performance. We cannot test every model that exists, as there are thousands of them. And characteristics such as core counts, chipset vendors, RAM size, and year and model are not sufficient indicators of capability. We eventually decided to run a small benchmarking test to measure performance and give each phone a performance score. This benchmarking test consisted of basic compute operations, including Gaussian blur, memory allocation, memory copy, and 3D rendering. With this approach, we could assign scores to any existing or upcoming mobile phones and group them based on those numbers. Our A/B tests then identified the models that could support 720p, 1080p, and 10-bit HDR playback.

After the initial Android rollout, we started to enable AV1 hardware decoding for the few Android phones that support it. We expect hardware decoding to improve AV1 performance, and we plan to perform large-scale tests when a larger number of capable phones become available.

Latest delivery status

We started the AV1 delivery for Facebook Reels on iPhone in early 2022 and observed the benefits within the first week of the rollout.

The following graph shows the week-over-week average playback FB-MOS for all Facebook Reels videos played on iPhones. Playback FB-MOS improved by about 0.6 points after we deployed AV1.

This second graph shows the average bit rate for all Facebook Reels videos played on iPhones. AV1 reduced the average bit rate by 12 percent.

This last graph shows the watch time of different codecs for Facebook Reels on iPhone. AV1 watch time rose to about 70 percent during the first week of rollout.

We have continued to enable new features for iPhone, including 1080p30 8-bit AV1 delivery for iPhone 8 and beyond, 10-bit HDR delivery up to 1080p30 for models of iPhone X and beyond that support HDR display, and 1080p60 8-bit AV1 delivery for iPhone 11 and beyond. AV1 encodes a high percentage of the Facebook Reels and Instagram Reels videos watched on iPhones. We have also enabled 8-bit AV1 delivery to select midrange to high-end Android phones. The watch time percentage on Android for AV1 is relatively small but growing.

What’s next for AV1 at Meta?

AV1 delivers real value to the people who use our products. It offers higher quality at a much lower bit rate compared with previous generations of video codecs. For example, in the video below, there is an obvious difference in quality between AVC, VP9, and AV1 at roughly the same bit rate.

https://engineering.fb.com/wp-content/uploads/2023/02/Alley-Oop-AVI-VP9-H264.mp4

Going forward, we will continue to expand AV1 delivery for Android phones and enable hardware decoding in new devices that support it.

For low-end Android phones, it remains challenging to play back high-resolution AV1 bitstreams. To address this, we are currently experimenting with mixed codec manifest support. On the server side, the ABR delivery algorithm generates a mixed codec manifest that contains multiple video adaptation sets with bitstreams encoded using different codecs, such as VP9 and AV1. It also specifies which AV1 and VP9 lanes the device should choose from based on its performance score. For example, a low-end phone can play AV1 up to 540p and switch to VP9 for higher resolution lanes.

With more and more hardware vendors implementing AV1 decoders in mobile SOCs, we expect the number of AV1 capable devices to continue to grow in the next few years, allowing more end users to enjoy the benefits of AV1.

Acknowledgements

This work is a collective effort by the Video Infra team and Instagram team at Meta, along with external partners, including the Intel SVT team, VideoLAN, Ittiam, Two Orioles, and the open source community. The authors would like to thank Jamie Chen, Syed Emran, Xinyu Jin, Ioannis Katsavounidis, Denise Noyes, Mohanish Penta, Nam Pham, Srinath Reddy, Shankar Regunathan, David Ronca, Zafar Shahid, Nidhi Singh, Yassir Solomah, Cosmin Stejerean, Wai Lun Tam, Hassene Tmar, and Haixiong Wang for their contributions and support.

Inside Meta’s first smart glasses

Thu, 16 Feb 2023 15:00:00 +0100

What’s new:

Meta is sharing the inside story of how it developed the Ray-Ban Stories smart glasses.

Why it matters:

Creating Ray-Ban Stories meant Meta’s engineers had to take on new challenges to build smart glasses that married complex engineering dynamics. How do you make something that features cameras, microphones, audio, and touch controls, all while fitting into a form factor similar to a standard pair of Ray-Ban glasses?

Meta’s engineers knew the final smart glasses product had to be fashionable, comfortable, technologically advanced, lightweight, and have a long battery life. The end result, Ray-Ban Stories, are an important stepping stone on Meta’s journey towards building augmented reality (AR) glasses.

Read the full story on the making of Ray-Ban Stories and how Meta tackled these challenges.

Building a cross-platform runtime for AR

Mon, 13 Feb 2023 15:00:00 +0100

Meta’s augmented reality (AR) platform is one of the largest in the world, helping the billions of people on Meta’s apps experience AR every day and giving hundreds of thousands of creators a means to express themselves Meta’s AR tools are unique because they can be used on a wide variety of devices — from mixed reality headsets like Meta Quest Pro to phones, as well as lower-end devices that are much more prevalent in low-connectivity parts of the world.

How it works:

To achieve this, we’re focused on performance optimization. We give all creators the option to mix and match various AR capabilities as they please, somewhat like LEGO bricks, to create and deliver unique experiences. As creators focus on building amazing experiences, we focus on the complexity of optimizing assets and runtime so these experiences can run everywhere, from mobile all the way to advanced hardware and VR. One way we do so is by deliberately splitting our monolithic runtime into smaller plugins. This way, if an app doesn’t require a specific capability it can easily be excluded using just a quick configuration toggle. We’re continuously looking for new opportunities like this to further expand our platform’s reach and support more use cases with a single AR engine at the core.

Why it Matters:

At Meta, our AR engine group works to ensure that our AR services are available for everyone, regardless of the device they’re using. AR and VR experiences shouldn’t be restricted to the most sophisticated devices but should be widely accessible to all.

Take a Deeper Dive:

And watch the Products @Scale talk below on “Building a Cross-platform Runtime for AR Experience.”

Improving Meta’s global maps

Tue, 07 Feb 2023 15:00:00 +0100

A lot has changed since the initial launch of our basemap in late 2020. We’re Meta now, but our mission remains the same: Giving people the power to build community and bring the world closer together.
Across Meta, our family of applications (Facebook, Instagram, WhatsApp, among others) are using our basemap to connect people through functions like status updates, location sharing, and location-based searching.

In late 2020, Meta launched its basemap. This global, multi-scale map serves as a foundational layer — showing a variety of geographic features onto which we can blend other data, such as local points of interest or locally important features. Meta’s applications (Facebook, Instagram, and WhatsApp, among others) use the basemap to connect people through functions like status updates, location sharing, and location-based searching.

We want our maps to be living documents that adapt to the needs of the people who use our apps, all while keeping up to date with data sources and trends in cartographic design. We’ve made some major upgrades to the basemap over the past two years, from how we store information to the appearance of the maps themselves.

Below are examples from the hundreds of surfaces where our display maps are used.

Instagram maps on Android

Actus (from Meta’s New Product Experimentation team)

Facebook Crisis Response

Facebook check-ins

Mapillary (iOS, Android, Web)

Meta Quest Pro demo finder

WhatsApp business directory on Android

Fast rendering and up-to-date data

We’re now serving several basemaps.

In the fall of 2021, we launched a dark-mode variant to accompany our dark mode interface.

In early 2022 we also launched raster-first basemaps. Raster images are built from discrete pixels, rendering fine gradations of color and shade. But a raster image’s resolution is fixed, so if you expand or zoom in, it begins to look blurry or pixelated. Vector files store equations that describe the lines, curves, shapes, and color that define an image. When the image is expanded or resized, the formulas automatically adjust without compromising resolution.

Rendering maps on mobile and web clients is complex, so we serve both vector and raster maps. We can quickly serve compressed raster maps that work on a variety of devices, old and new, across mobile and web. These basemaps include the same features as our vector maps, but they are designed specifically to be rasterized and served as a tiled map service, which composites sections of the map to display a larger image. Removing cartographic effects like opacity, blur, and shadowing allows for a crisper display when the vector map is rasterized. These raster basemaps look and behave so much like their vector counterparts that most people would not recognize the difference.

All maps across Meta continue to use the Daylight distribution from the open-source database OpenStreetMap (OSM). Daylight ensures that our maps are up-to-date and free of geometry errors, vandalism, and profanity. Recently, we partnered with the Daylight team at Meta to create Earth Tables, an extension of Daylight that simplifies OSM tags. This new data schema was born partly out of our cartographic tiling logic, and it includes everything necessary to make a map of the world. To learn more, check out this presentation we did recently at the annual North American Cartographic Information Society (NACIS) conference.

In 2023, we are looking forward to incorporating new open map data coming from the recently announced Overture Maps Foundation, which Meta co-founded along with Microsoft, AWS, and TomTom.

In some regions, we’ve introduced AI-based translations to increase the readability of map labels, like street names, for people around the world. We’ve also started introducing map features from open data sources. In select cities high resolution pavement data, tree locations, and even pavement markings bring an unparalleled level of detail to our maps.

A toned-down palette to focus attention on important features

Whether someone shares a map of their trip from New York to L.A. on Facebook or searches on Instagram for the best Detroit-style pizza in their neighborhood, the map should provide location context without drowning out the other information. But we noticed that on some surfaces, our maps’ saturated color palette distracted from the main story.

We decided to soften the color palette, allowing for a more complementary background. Because many of the maps’ features are densely interwoven, our task was more complex than changing one or two colors. But the results allow each app’s most important information to shine.

Where we’re going, we don’t need roads

Well, actually we do, but our new design has deprioritized roads to bring focus to community spaces and popular places. Now, roads serve as a background element connecting these feature types.

In select cities, pavement features — sourced from open data sources — provide immediate visual clarity at high zooms. These features are often more helpful than an intricate road map.

We parsed OSM’s complicated building and building:part tags to refashion our building features from the ground up. We’re now also color-coding buildings in our “areas of interest” layer, which highlights features in high traffic areas with a creamy orange tint. At high zooms, to keep the background clean, we turn this layer off and indicate high-traffic areas using a color to highlight buildings. Transportation buildings, such as airport terminals, parking garages, and metro stations, have a bluish tint to correspond with our new iconography layer.

Icon versus icon

Our initial basemaps eschewed icons. When we first introduced maps, a number of our use cases inserted contextual icons, like restaurants, into the map at the client level, so we did not include icons or labels for similar features. Since then, the variety of use cases has grown, and how the map is used across Meta has changed. Recently, we introduced iconography for parks, universities, and entertainment features like amusement parks and zoos. We also amended our existing icons for airports and train stations so that they fit with the new theme.

Mapping out the future

We’re constantly trying to push the boundaries of data and design with our basemaps. We are exploring global land cover, various resolutions of topography, and many other ideas to improve our map experiences. We’re also continuing to work with partners and communities to improve our maps by building reliable, open 3D geospatial map data for the metaverse and digital twins. Visit Maps at Meta for more information.

The evolution of Facebook’s iOS app architecture

Mon, 06 Feb 2023 15:00:00 +0100

Facebook for iOS (FBiOS) is the oldest mobile codebase at Meta. Since the app was rewritten in 2012, it has been worked on by thousands of engineers and shipped to billions of users, and it can support hundreds of engineers iterating on it at a time.

After years of iteration, the Facebook codebase does not resemble a typical iOS codebase:

It’s full of C++, Objective-C(++), and Swift.
It has dozens of dynamically loaded libraries (dylibs), and so many classes that they can’t be loaded into Xcode at once.
There is almost zero raw usage of Apple’s SDK — everything has been wrapped or replaced by an in-house abstraction.
The app makes heavy use of code generation, spurred by Buck, our custom build system.
Without heavy caching from our build system, engineers would have to spend an entire workday waiting for the app to build.

FBiOS was never intentionally architected this way. The app’s codebase reflects 10 years of evolution, spurred by technical decisions necessary to support the growing number of engineers working on the app, its stability, and, above all, the user experience.

Now, to celebrate the codebase’s 10-year anniversary, we’re shedding some light on the technical decisions behind this evolution, as well as their historical context.

2014: Establishing our own mobile frameworks

Two years after Meta launched the native rewrite of the Facebook app, News Feed’s codebase began to have reliability issues. At the time, News Feed’s data models were backed by Apple’s default framework for managing data models: Core Data. Objects in Core Data are mutable, and that did not lend itself well to News Feed’s multithreaded architecture. To make matters worse, News Feed utilized bidirectional data flow, stemming from its use of Apple’s de facto design pattern for Cocoa apps: Model View Controller.

Ultimately, this design exacerbated the creation of nondeterministic code that was very difficult to debug or reproduce bugs. It was clear that this architecture was not sustainable and it was time to rethink it.

While considering new designs, one engineer investigated React, Facebook’s (open source) UI framework, which was becoming quite popular in the Javascript community. React’s declarative design abstracted away the tricky imperative code that caused issues in Feed (on web), and leveraged a one-way data flow, which made the code much easier to reason about. These characteristics seemed well suited for the problems News Feed was facing. There was only one problem.

There was no declarative UI in Apple’s SDK.

Swift wouldn’t be announced for a few months, and SwiftUI (Apple’s declarative UI framework) wouldn’t be announced until 2019. If News Feed wanted to have a declarative UI, the team would have to build a new UI framework.

Ultimately, that’s what they did.

After spending a few months building and migrating News Feed to run on a new declarative UI and a new data model, FBiOS saw a 50 percent performance improvement. A few months later, they open-sourced their React-inspired UI framework for mobile, ComponentKit.

To this day, ComponentKit is still the de facto choice for building native UIs in Facebook. It has provided countless performance improvements to the app via view reuse pools, view flattening, and background layout computation. It also inspired its Android counterpart, Litho, and SwiftUI.

Ultimately, the choice to replace the UI and data layer with custom infra was a trade-off. To achieve a delightful user experience that could be reliability maintained, new employees would have to shelve their industry knowledge of Apple APIs to learn the custom in-house infra.

This wouldn’t be the last time FBiOS would have to make a decision that balanced end user experience with developer experience and speed. Going into 2015, the app’s success would trigger what we refer to as a feature explosion. And that presented its own set of unique challenges.

2015: An architectural inflection point

By 2015, Meta had doubled down on its “Mobile First” mantra, and the FBiOS codebase saw a meteoric rise in the number of daily contributors. As more and more products were integrated into the app, its launch time began to degrade, and people began to notice. Toward the end of 2015, startup performance was so slow (nearly 30 seconds!) that it risked being killed by the phone’s OS.

Upon investigation, it was clear that there were many contributing factors to degraded startup performance. For the sake of brevity, we’ll focus only on the ones that had a long-term effect on the app’s architecture:

The app’s ‘pre-main’ time was growing at an unbounded rate, as the app’s size grew with each product.
The app’s ‘module’ system gave each product ungoverned access to all the app’s resourcing. This led to a tragedy of the commons issue as each product leveraged its ‘hook’ into startup to perform computationally expensive operations so that initial navigation to that product would be snappy.

The changes that were needed to mitigate and improve startup would fundamentally alter the way product engineers wrote code for FBiOS.

2016: Dylibs and modularity

According to Apple’s wiki about improving launch times, a number of operations have to be performed before an app’s ‘main’ function can be called. Generally, the more code an app has, the longer this will take.

While ‘pre-main’ contributed only a small subset of the 30 seconds being spent during launch, it was a particular concern because it would continue to grow at an unbounded rate as FBiOS continued to amass new features.

To help mitigate the unbounded growth of the app’s launch time, our engineers began to move large swaths of product code into a lazily loaded container known as a dynamic library (dylib). When code is moved into a dynamically loaded library, it isn’t required to load before the app’s main() function.

Initially, the FBiOS dylib structure looked like this:

Two product dylibs (FBCamera and NotOnStartup) were created, and a third dylib (FBShared) was used to share code between the various dylibs and the main app’s binary.

The dylib solution worked beautifully. FBiOS was able to curb the unbounded growth of the app’s startup time. As the years went by, most code would end up in a dylib so that startup performance stayed fast and was unaffected by the constant fluctuation of added or removed products in the app.

The addition of dylibs triggered a mental shift in the way Meta’s product engineers wrote code. With the addition of dylibs, runtime APIs like NSClassFromString() risked runtime failures because the required class lived in unloaded dylibs. Since many of the FBiOS core abstractions were built on iterating through all the classes in memory, FBiOS had to rethink how many of its core systems worked.

Aside from the runtime failures, dylibs also introduced a new class of linker errors. In the event the code in Facebook (the startup set) referenced code in a dylib, engineers would see a linker error like this:

Undefined symbols for architecture arm64:
  "_OBJC_CLASS_$_SomeClass", referenced from:
      objc-class-ref in libFBSomeLibrary-9032370.a(FBSomeFile.mm.o)

To fix this, engineers were required to wrap their code with a special function that could load a dylib if necessary:

Suddenly:

int main() {
  DoSomething(context);
}

Would look like this:

int main() {
  FBCallFunctionInDylib(
    NotOnStatupFramework,
    DoSomething,
    context
  );
}

The solution worked, but had quite a few code smells:

The app-specific dylib enum was hard-coded into various callsites. All apps at Meta had to share a dylib enum, and it was the reader’s responsibility to determine whether that dylib was used by the app the code was running in.
If the wrong dylib enum was used, the code would fail, but only at runtime. Given the sheer amount of code and features in the app, this late signal led to a lot of frustration during development.

On top of all that, our only system to safeguard against the introduction of these calls during startup was runtime-based, and many releases were delayed while last-minute regressions were introduced into the app.

Ultimately, the dylib optimization curbed the unbounded growth of the app’s launch time, but it signified a massive inflection point in the way the app was architected. FBiOS engineers would spend the next few years re-architecting the app to smooth some of the rough edges introduced by the dylibs, and we (eventually) shipped an app architecture that was more robust than ever before.

2017: Rethinking the FBiOS architecture

With the introduction of dylibs, a few key components of FBiOS had to be rethought:

The ‘module registration system’ could no longer be runtime-based.
Engineers needed a way to know whether any codepath during startup could trigger a dylib load.

To address these issues, FBiOS turned to Meta’s open source build system, Buck.

Within Buck, each ‘target’ (app, dylib, library, etc.) is declared with some configuration, like so:

apple_binary(
  name = "Facebook",
  ...
  deps = [
    ":NotOnStartup#shared",
    ":FBCamera#shared",
  ],
)
apple_library(
  name = "NotOnStartup",
  srcs = [
    "SomeFile.mm",
  ],
  labels = ["special_label"],
  deps = [
    ":PokesModule",
    ...
  ],
)

Each ‘target’ lists all information needed to build it (dependencies, compiler flags, sources, etc.), and when ‘buck build’ is called, it builds all this information into a graph that can be queried.

$ buck query “deps(:Facebook)”
> :NotOnStartup
> :FBCamera
$ buck query “attrfilter(labels, special_label, deps(:Facebook))”
> :NotOnStartup

Using this core concept (and some special sauce), FBiOS began to produce some buck queries that could generate a holistic view of the classes and functions in the app during build. This information would be the building block of the app’s next generation of architecture.

2018: The proliferation of generated code

Now that FBiOS was able to leverage Buck to query for information about code in the dependency, it could create a mapping of “function/classes -> dylibs” that could be generated on the fly.

{
  "functions": {
    "DoSomething": Dylib.NotOnStartup,
    ...
  },
  "classes": {
    "FBSomeClass": Dylib.SomeOtherOne
  }
}

Using that mapping as input, FBiOS used it to generate code that abstracted away the dylib enum from callsites:

static std::unordered_map functionToDylib {{
  { "DoSomething", Dylib.NotOnStartup },
  { "FBSomeClass", Dylib.SomeOtherOne },
  ...
}};

Using code generation was appealing for a few reasons:

Because the code was regenerated based on local input, there was nothing to check in, and there were no more merge conflicts! Given that the engineering body of FBiOS could double every year, this was a big development efficiency win.
FBCallFunctionInDylib no-longer required an app-specific dylib (and thus could be renamed to ‘FBCallFunction’). Instead, the call would read from static mapping generated for each application during build.

Combining Buck query with code generation proved to be so successful that FBiOS used it as bedrock for a new plugin system, which eventually replaced the runtime-based app-module system.

Moving signal to the left

With the new Buck-powered plugin system. FBiOS was able to replace most runtime failures with build-time warnings by migrating bits of infra to a plugin-based architecture.

When FBiOS is built, Buck can produce a graph to show the location of all the plugins in the app, like so:

From this vantage point, the plugin system can surface build-time errors for engineers to warn:

“Plugin D, E could trigger a load of a dylib. This is not allowed, since the caller of these plugins lives in the app’s startup path.”
“There is no plugin for rendering Profiles found in the app … this means that navigating to that screen will not work.”
“There are two plugins for rendering Groups (Plugin A, Plugin B). One of them should be removed.”

With the old app module system, these errors would be “lazy” runtime assertions. Now, engineers are confident that when FBiOS is built successfully, it won’t fail because of missing functionality, dylibs loading during app startup, or invariants in the module runtime system.

The cost of code generation

While migrating FBiOS to a plugin system has improved the app’s reliability, provided faster signals to engineers, and made it possible for the app to trivially share code with our other mobile apps, it came at a cost:

Plugin errors are not on Stack Overflow and can be confusing to debug.
A plugin system based on code generation and Buck is a far cry from traditional iOS development.
Plugins introduce a layer of indirection to the codebase. Where most apps would have a registry file with all features, these are generated in FBiOS and can be surprisingly difficult to find.

There is no doubt that plugins led FBiOS farther away from idiomatic iOS development, but the trade-offs seem to be worth it. Our engineers can change code used in many apps at Meta and be sure that if the plugin system is happy, no app should crash because of missing functionality in a rarely tested codepath. Teams like News Feed and Groups can build an extension point for plugins and be sure that product teams can integrate into their surface without touching the core code.

2020: Swift and language architecture

While most of this article has focused on architectural changes stemming from scale issues in the Facebook app, changes in Apple’s SDK have also forced FBiOS to rethink some of its architectural decisions.

In 2020, FBiOS began to see a rise in the number of Swift-only APIs from Apple and a growing sentiment for more Swift in the codebase. It was finally time to reconcile with the fact that Swift was an inevitable tenant in FB apps.

Historically, FBiOS had used C++ as a lever to build abstraction, which saved on code size because of C++’s zero overhead principle. But C++ does not interop with Swift (yet). For most FBiOS APIs (like ComponentKit), some kind of shim would have to be created to use in Swift — creating code bloat.

Here’s a diagram outlining the issues in the codebase:

With this in mind, we began to form a language strategy about when and where various bits of code should be used:

Ultimately, the FBiOS team began to advise that product-facing APIs/code should not contain C++ so that we could freely use Swift and future Swift APIs from Apple. Using plugins, FBiOS could abstract away C++ implementations so that they still powered the app but were hidden from most engineers.

This type of workstream signified a bit of shift in the way FBiOS engineers thought about building abstractions. Since 2014, some of the biggest factors in framework building have been contributions to app size and expressiveness (which is why ComponentKit chose Objective-C++ over Objective-C).

The addition of Swift was the first time these would take a backseat to developer efficiency, and we expect to see more of that in the future.

2022: The journey is 1 percent finished

Since 2014, FBiOS architecture has shifted quite a bit:

It introduced countless in-house abstractions, like ComponentKit and GraphQL.
It uses dylibs to keep ‘pre-main’ times minimal and contribute to a blazing-fast app startup.
It introduced a plugin system (powered by Buck) so that dylibs are abstracted away from engineers, and so code is easily shareable between apps.
It introduced language guidelines about when and where various languages should be used and began to shift the codebase to reflect those language guidelines.

Meanwhile, Apple has introduced exciting improvements to their phones, OS, and SDK:

Their new phones are fast. The cost of loading is much smaller than it was before.
OS improvements like dyld3 and chain fixups provide software to make code loading even faster.
They’ve introduced SwiftUI, a declarative API for UI that shares a lot of concepts with ComponentKit.
They’ve provided improved SDKs, as well as APIs (like interruptible animations in iOS8) that we could have built custom frameworks for.

As more experiences are shared across Facebook, Messenger, Instagram, and WhatsApp, FBiOS is revisiting all these optimizations to see where it can move closer to platform orthodoxy. Ultimately, we’ve seen that the easiest ways to share code are to use something that the app gives you for free or build something that is virtually dependency-free and can integrate between all the apps.

We’ll see you back here in 2032 for the recap of the codebase’s 20-year anniversary!

Facebook Code

IPLS: Privacy-preserving storage for your WhatsApp contacts

What is Identity Proof Linked Storage?

The components of IPLS

The AKD and Cloudflare integration

HSM-based key storage

Privacy-preserving contacts storage at WhatsApp scale

OCP Summit 2024: The open future of networking hardware for AI

DSF: Scheduled fabric that is disaggregated and open

DSF platforms for next-generation AI fabrics

Arista 7700R4 series

51T switches for next-generation 400G/800G fabrics

Optics: 2x400G FR4 optics for 400G/800G optical interconnection

Evolving FBOSS and SAI for DSF

FBNIC: A multi-host foundational NIC designed by Meta

The future is open

Meta’s open AI hardware vision

Introducing Catalina: Open Architecture for AI Infra

The Grand Teton Platform now supports AMD accelerators

​Open Disaggregated Scheduled Fabric

Meta and Microsoft: Driving Open Innovation Together

The open future of AI infra

How open source AI can improve population estimates, sustainable energy, and the delivery of climate change interventions

Why we need better population maps

Background on Meta’s AI-powered population maps

Open-sourcing training data for our AI population maps

Applications in sustainable electrification, clean water, and climate change adaptation

React at Meta Connect 2024

Instagram and Facebook For Meta Quest

Meta Horizon mobile app

Meta Horizon Store

Meta Spatial Editor

This is how Meta builds React

Inside Bento: Jupyter Notebooks at Meta

Simulator-based reinforcement learning for data center cooling optimization

How cooling works at Meta’s data centers

A reinforcement learning approach to data center cooling

Results of our RL approach

Future work for AI in data center optimization

Acknowledgements

Read Meta’s 2024 Sustainability Report

Meta is getting ready for post-quantum cryptography

How Meta enforces purpose limitation via Privacy Aware Infrastructure at scale

Why invest in Policy Zones?

How Policy Zones works

How we applied PAI to existing systems at scale

Step 1 – Identify relevant assets

Step 2 – Discover relevant data flows

Step 3 – Remediate data flow violations

Step 4 – Continuously enforce and monitor data flows

Lessons learned from adoption at scale across Meta

Focus on solving one specific end-to-end use case first

Streamline integration complexity

Invest in computational and developer efficiency early on

Simplified and independent annotations are a must to scale to a wide range of requirements

Build tools; they are required

Durable privacy protection for everyone

Acknowledgements

RETINAS: Real-Time Infrastructure Accounting for Sustainability

How we measure greenhouse gas emissions at Meta

Introducing real-time server fleet utilization effectiveness

Example (current static state)

Example (with proposed dynamic accounting)

Depreciation in action:

Effective change in depreciation with extension (from 4y UL to 5y UL):

Characteristics of the metric:

How PyTorch powers AI training and inference

Inside the hardware and co-design of MTIA

Bringing Llama 3 to life

Aparna Ramani discusses the future of AI infrastructure

How Meta animates AI-generated images at scale

Optimizing latency for generating image animations

Halving floating-point precision

Improving temporal-attention expansion

Leveraging DPM-Solver to reduce sampling steps

Combining guidance and step distillation

PyTorch optimizations

Deploying and running image animation at scale

Read more

A RoCE network for distributed AI training at scale

Open Disaggregated Scheduled Fabric