<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:webfeeds="http://webfeeds.org/rss/1.0" version="2.0">
  <channel>
    <atom:link href="http://pubsubhubbub.appspot.com/" rel="hub"/>
    <atom:link href="https://f43.me/yelp-engineering.xml" rel="self" type="application/rss+xml"/>
    <title>Yelp Engineering</title>
    <description>News from the Yelp Engineering and Product Teams</description>
    <link>http://engineeringblog.yelp.com</link>
    <webfeeds:icon>https://s2.googleusercontent.com/s2/favicons?alt=feed&amp;domain=engineeringblog.yelp.com</webfeeds:icon>
    <webfeeds:logo>https://engineeringblog.yelp.com/css/assets/img/structural/biz_header_logo.png</webfeeds:logo>
    <webfeeds:accentColor>BE0E02</webfeeds:accentColor>
    <generator>f43.me</generator>
    <lastBuildDate>Fri, 13 Mar 2026 05:06:50 +0100</lastBuildDate>
    <item>
      <title><![CDATA[How Yelp Built a Back-Testing Engine for Safer, Smarter Ad Budget Allocation]]></title>
      <description><![CDATA[<p>Modern advertising platforms are fast-paced and interconnected: even small adjustments can have ripple effects on how ads are shown, how budgets are spent, and the value advertisers get from their ad spend.</p><p>At Yelp, Ad Budget Allocation means splitting each campaign’s spend between on‑platform inventory (our website, mobile site, and app) and off‑platform inventory (the Yelp Ad Network). We optimize this split to meet advertisers’ performance goals while growing overall revenue. Due to the complexity of the budget allocation system and its feedback loop, even small changes can lead to unexpected system‑wide effects.</p><p>To help us safely evaluate changes, we developed a Back-Testing Engine. This tool allows us to simulate the entire Ad Budget Allocation ecosystem with proposed algorithm changes, giving us a preview of real-world effects before we run full A/B tests or launch new code. All simulations use aggregated campaign data, with no personal user information involved.</p><p>In this post, we’ll share why we built this Engine, explain how it works, and reflect on how it’s improving our decision-making process.</p><h2 id="what-is-a-back-testing-engine">What is a Back-Testing Engine?</h2><p>A Back-Testing Engine allows us to simulate “what if” scenarios by applying alternative algorithms or parameters against historical campaign data. Instead of testing changes live, where mistakes could impact real budgets and advertisers, we can safely preview the effects of updates in a controlled environment.</p><p>For the Yelp Ad Budget Allocation team, this means virtually rerunning past campaigns with proposed allocation strategies and measuring outcomes like spend, leads, or revenue. 
This approach offers a key advantage over traditional simulation methods or “back-of-the-envelope” calculations using aggregate data, which often miss important day-to-day dynamics and interactions.</p><p>As our allocation logic and partner integrations have become more sophisticated, rapid and safe innovation has become essential. The Back-Testing Engine gives us the confidence to explore improvements, validate ideas, and iterate faster, while keeping advertiser trust and system performance front and center.</p><p>Yelp’s advertising system handles budget allocation for hundreds of thousands of campaigns each month. Advertisers typically set a monthly budget, but behind the scenes, our infrastructure makes daily decisions on how much to spend, and where.</p><p>In particular, a campaign goes through the following steps:</p><ol><li><strong>Beginning of the day</strong>: Our system calculates how much of the campaign’s budget to allocate that day, and how to split it between Yelp and our ad network based on the campaign’s goals.</li>
<li><strong>Throughout the day</strong>: Once the budget is set, the campaign generates outcomes (such as impressions, clicks, and leads) as the day progresses. While we can’t directly control the number of these outcomes, we closely monitor them as the ad budget is spent.</li>
<li><strong>End of day</strong>: Our system collects the day’s results and uses them to bill the campaign.</li>
</ol><p>Importantly, each day’s budget decisions depend on the outcomes of previous days, so the system constantly adapts as new outcomes come in. This feedback loop is a fundamental property our Back-Testing Engine must capture: even small changes can have cascading, system-wide impacts over the billing period.</p><p>Below is a visual example of this day-by-day process (here, using December 2025 as the billing period) for two campaigns:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2026-02-02-how-yelp-built-a-back-testing-engine-for-safer-smarter-ad-budget-allocation/campaign_journey.png" alt="Figure 1. Campaign journey" /><p class="subtle-text"><small>Figure 1. Campaign journey</small></p></div><p>Our Back-Testing Engine is designed to replay this daily process using historical data and simulated changes together, helping us forecast the effects of changes before we ever touch production systems.</p><h2 id="system-overview">System overview</h2><p>The Back-Testing Engine is built from eight interconnected components, each playing a distinct role in the simulation process:</p><ol><li><strong>Parameter search space</strong>: Defines the parameters and values to explore.</li>
<li><strong>Optimizer</strong>: Selects the most promising candidates to test.</li>
<li><strong>Candidate</strong>: Represents a specific set of parameter values to be tested (one value for each parameter).</li>
<li><strong>Production repositories</strong>: Mirror production code (e.g., budgeting, billing).</li>
<li><strong>Historical daily campaign data</strong>: Actual historical data used for simulation.</li>
<li><strong>Machine-learning (ML) Models for clicks, leads, etc.</strong>: Predict daily outcomes such as impressions, clicks, and leads.</li>
<li><strong>Metrics</strong>: Store main KPIs for each candidate.</li>
<li><strong>Logging and visualization</strong>: Collects and displays all results.</li>
</ol><p>The diagram below shows how they interact during the simulation process.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2026-02-02-how-yelp-built-a-back-testing-engine-for-safer-smarter-ad-budget-allocation/system_architecture.png" alt="Figure 2. System architecture" /><p class="subtle-text"><small>Figure 2. System architecture</small></p></div><p>Below, we break down each component in more detail.</p><h3 id="component-1---parameter-search-space-yaml">Component 1 - Parameter search space [YAML]</h3><p>To run a back-test, we first define which parameters we want to tune or evaluate. These might include algorithm choices, thresholds, or weights, all specified in a YAML file—a human-readable format widely used for configuration.</p><p>The file includes:</p><ul><li>A <strong>date range</strong> for the simulation.</li>
<li>A <strong>run name</strong> to identify the test.</li>
<li>The <strong>search space</strong> for each parameter: allowed values or intervals.</li>
</ul><p>For example, suppose our budget allocation system currently uses a standard allocation approach, but we want to experiment with a new method called Algorithm X. We’re also interested in tuning a constant (called parameter Alpha) which we believe will impact allocation performance, with reasonable values ranging between -10 and +10.</p><p>To run this back-test for December 2025, we’d configure the YAML file as follows:</p><div class="language-yaml highlighter-rouge highlight"><pre>date_interval:
  - '2025-12-01'
  - '2025-12-31'
experiment_name: 'algorithm_x_vs_status_quo'
searches:
  - search_type: 'scikit-opt'
    minimize_metric: 'average-cpl'
    max_evals: 25
    search_space:
      allocation_algo: skopt.space.Categorical(['status-quo', 'algorithm_x'])
      alpha: skopt.space.Real(-10, 10)
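
# Illustrative sketch only: the exact YAML keys for the other supported
# search types are assumptions, shown here to make the alternatives concrete.
#
# A grid search would back-test every combination of the listed values
# (here 2 x 5 = 10 candidates):
#   - search_type: 'grid'
#     search_space:
#       allocation_algo: ['status-quo', 'algorithm_x']
#       alpha: [-10, -5, 0, 5, 10]
#
# A listed search would back-test exactly the candidates given:
#   - search_type: 'listed'
#     candidates:
#       - {allocation_algo: 'algorithm_x', alpha: 0.0}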
</pre></div><p>Once this configuration is set, the optimizer can begin exploring different combinations of these parameters during the simulation.</p><h3 id="component-2---optimizer-scikit-optimize">Component 2 - Optimizer [Scikit-Optimize]</h3><p>To efficiently explore the parameter space, our Back-Testing Engine uses an optimizer: specifically, a Bayesian optimizer from the Scikit-Optimize library. The optimizer’s goal is to propose parameter combinations (candidates) that are likely to improve a chosen metric, defined in the YAML file as <code class="language-plaintext highlighter-rouge">minimize_metric</code>, in this case <code class="language-plaintext highlighter-rouge">average-cpl</code> (cost per lead).</p><p>The process begins with the optimizer suggesting an initial candidate, which is typically a random sample since no prior data exists. For example, the first candidate might be <code class="language-plaintext highlighter-rouge">{'allocation_algo': 'status-quo', 'alpha': 3.53}</code>. The Engine simulates this candidate and returns its performance metrics. In turn, the optimizer uses this feedback to select the next candidate, learning from previous results to propose combinations more likely to optimize the target metric.</p><p>This iterative loop continues until a specified number of candidates (<code class="language-plaintext highlighter-rouge">max_evals</code> in the YAML file, in this case 25) have been evaluated.</p><p>Scikit-Opt search is just one possible search strategy. Two other strategies are also supported:</p><ul><li><strong>Grid search</strong>: All the possible combinations of parameter values are back-tested. 
This approach requires limiting the number of values to be tested, as the number of possible combinations grows quickly. For instance, if we have a parameter with 5 values, another parameter with 3 values, and a third parameter with 10 values, the total number of candidates would be 5 × 3 × 10 = 150.</li>
<li><strong>Listed search</strong>: Each candidate is directly specified by the user in the YAML file.</li>
</ul><p>Note that for every search type except Scikit-Opt, the optimizer doesn’t actually optimize; it simply acts as a wrapper that yields the next candidate to try.</p><h3 id="component-3---candidate">Component 3 - Candidate</h3><p>As we have seen, each candidate is a specific combination of parameter values. Concretely, a candidate is a key-value dictionary. In the example above (see <em>Component 2 - Optimizer [Scikit-Optimize]</em>), <em>Candidate #1</em> is a dictionary: <code class="language-plaintext highlighter-rouge">{'allocation_algo': 'status-quo', 'alpha': 3.53}</code>.</p><h3 id="component-4---production-repositories-git-submodules">Component 4 - Production repositories [Git Submodules]</h3><p>To support accurate back-testing, our Engine uses the same code as production by including key repositories (like Budgeting and Billing) as Git Submodules. This lets us simulate current logic or proposed changes by pointing to specific Git branches.</p><p>For example, to test a new budgeting algorithm, we add it on a separate branch, configure the Back-Testing Engine to use that branch, and run simulations. This setup enables our tests to closely match production and allows us to validate code changes in a controlled environment before rollout.</p><h3 id="component-5---historical-daily-campaign-data-redshift">Component 5 - Historical daily campaign data [Redshift]</h3><p>For the back-test, the system needs to retrieve historical campaign and advertiser data from Redshift, limited to the selected simulation period (e.g., December 1–31, 2025). This data is relevant because:</p><ul><li>The budgeting logic may vary depending on specific campaign attributes.</li>
<li>These attributes also serve as input features for the ML models (see <em>Component 6 - ML models for clicks, leads, etc. [CatBoost]</em>), improving the accuracy of predicted outcomes.</li>
</ul><p>All data is ingested at the campaign and date level to match the granularity of our production environment.</p><h3 id="component-6---ml-models-for-clicks-leads-etc-catboost">Component 6 - ML models for clicks, leads, etc. [CatBoost]</h3><p>Once daily budget allocations are set (see <em>Component 4 - Production repositories [Git Submodules]</em>) and campaign characteristics are loaded (see <em>Component 5 - Historical daily campaign data [Redshift]</em>), the next step is to estimate each campaign’s outcomes, such as impressions, clicks, and leads. Accurately predicting these results is challenging because:</p><ol><li>These outcomes depend on external systems we don’t directly control (e.g., partner ad networks).</li>
<li>There is intrinsic randomness in user behavior, such as whether someone chooses to click on an ad.</li>
</ol><p>To address this, we leverage ML models (specifically, CatBoost) trained to predict expected impressions, clicks, and leads based on daily budget and campaign features.</p><p>Using a non-parametric ML approach, instead of making simplistic assumptions (e.g., a constant cost per click), allows us to accurately capture complex effects such as diminishing returns on budget, resulting in simulations that more closely reflect real-world behavior.</p><p>Using the same ML models for all candidates promotes fair comparisons. To further improve reliability, we monitor these models to prevent overfitting, checking that performance is consistent between training and hold-out datasets.</p><p>Because our models output average expected values (not integers), we apply a Poisson distribution to simulate integer outcomes. This approach captures the randomness seen in live systems.</p><p>Note: The use of ML models to predict counterfactual outcomes means this is not a pure back-testing approach, but rather a hybrid that combines elements of both simulation and back-testing.</p><h3 id="component-7---metrics">Component 7 - Metrics</h3><p>For each candidate, we track a set of metrics that serve as important indicators of campaign performance or economic results for Yelp: for instance, per-campaign average cost-per-click, average cost-per-lead, and Yelp margin. These metrics are calculated from the raw simulation results for each campaign and day, including daily budgets, impressions, clicks, leads, and billing.</p><p>As already mentioned (see <em>Figure 1. Campaign journey</em>), the raw simulation results of each candidate are obtained from “replaying” each campaign for each day. This simulation process works as follows:</p><ul><li><strong>Beginning of the day</strong>: The Engine, using the Budgeting repository (configured with the candidate parameters), determines each campaign’s daily budget and allocates spend across channels.</li>
<li><strong>Throughout the day</strong>: ML models predict the campaign’s impressions, clicks, and leads based on the allocated budget and campaign features.</li>
<li><strong>End of day</strong>: The Billing repository (configured with the candidate parameters) computes each campaign’s billing using the simulated outcomes and candidate parameters.</li>
</ul><p>This process is repeated for each campaign and for each day in the period.</p><p>At the end, we aggregate these raw results into summary metrics, stored as key-value pairs for each candidate (e.g., <code class="language-plaintext highlighter-rouge">{'avg_cpc': 1.39, 'avg_cpl': 18.48, 'margin': 0.35}</code>). These global metrics make it easier to compare candidates.</p><h3 id="component-8---logging-and-visualization-mlflow">Component 8 - Logging and Visualization [MLFlow]</h3><p>For every candidate, we log both the input parameters and the resulting metrics to MLFlow, which runs on a remote server.</p><p>This setup offers two main advantages:</p><ul><li><strong>Centralized collaboration</strong>: All experiment results are stored in one place, making it easy for developers and applied scientists to access, review, and share findings.</li>
<li><strong>Effortless visualization</strong>: MLFlow’s built-in tools allow users to quickly compare and visualize candidate results without extra coding, streamlining analysis and decision-making.</li>
</ul><h2 id="insights--learnings">Insights &amp; Learnings</h2><p>Since adopting the Back-Testing Engine, we’ve seen clear improvements in the accuracy, speed, and safety of our experimentation. Here are the key ways it’s changed our workflow and decision-making.</p><h3 id="the-impact-on-our-experimentation-process">The impact on our experimentation process</h3><p>Before the Back-Testing Engine, we’d typically test algorithmic changes by running A/B experiments. We’d split campaigns into control and treatment groups, measuring results and assessing risk after the fact.</p><p>While statistically sound, this approach has major limitations in our setting:</p><ul><li><strong>Limited data</strong>: We experimented at the advertiser (not user) level, so sample sizes were often too small for some effects to be detected.</li>
<li><strong>Slow results</strong>: Since most advertisers set monthly budgets, we had to wait one month to fully measure the effect of an A/B test.</li>
<li><strong>High risk</strong>: Mistakes or unintended consequences could have affected real advertisers.</li>
</ul><p>The Back-Testing Engine changes this dynamic. Instead of relying solely on A/B tests, we can affordably and safely simulate a wide range of changes using historical data. This allows us to quickly filter out weaker candidates and focus A/B tests only on the most promising ideas, preserving A/B testing for final validation rather than discovery.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2026-02-02-how-yelp-built-a-back-testing-engine-for-safer-smarter-ad-budget-allocation/back_testing_and_a_b_testing.png" alt="Figure 3. How back-testing fits into our experimentation workflow" /><p class="subtle-text"><small>Figure 3. How back-testing fits into our experimentation workflow</small></p></div><h3 id="operational-benefits">Operational benefits</h3><p>The introduction of back-testing has provided several additional advantages:</p><ul><li><strong>Faster productionization</strong>: By allowing teams to implement changes directly in dedicated Git branches and immediately simulate their impact, we’re able to move promising ideas into production much more quickly. This effectively blurs the line between prototyping and production, streamlining our workflows.</li>
<li><strong>Improved collaboration</strong>: Scientists and engineers can now work side-by-side with production code, turning experiments into reusable, production-ready artifacts, rather than disconnected notebooks.</li>
<li><strong>Increased prediction accuracy</strong>: Our ML-driven simulations provide more realistic estimates of the business impact of each change, capturing complexities, like varying cost per click and cost per lead at different budget levels, that simplistic estimates often miss.</li>
<li><strong>System fidelity</strong>: By replaying the daily budgeting process, our Engine closely mirrors real-world operations, avoiding naive extrapolations and making results far more trustworthy.</li>
<li><strong>Early bug detection</strong>: Running simulations across a broad set of real data helps us catch code bugs or edge cases that would be tricky to find with unit tests alone.</li>
</ul><p>Overall, the Back-Testing Engine acts as both a safety net and a launchpad, empowering us to explore, evaluate, and improve our ad system with confidence.</p><h3 id="caveats-risks-and-limitations">Caveats, risks, and limitations</h3><p>While back-testing brings significant benefits, it’s important to acknowledge its limitations:</p><ul><li><strong>Not a perfect predictor</strong>: Back-testing relies on historical data and model assumptions, which may not capture major shifts in user, market, or partner behavior.</li>
<li><strong>Risk of overfitting to history</strong>: Relying too heavily on historical simulations could bias development toward optimizations that perform well on past data, potentially limiting innovation.</li>
<li><strong>ML model dependency</strong>: The accuracy of this methodology depends heavily on the quality and generalizability of the underlying ML models.</li>
</ul><p>Being aware of these caveats helps us use back-testing more effectively, complementing it with A/B tests and real-world monitoring to ensure robust, reliable improvements.</p><h2 id="conclusion">Conclusion</h2><p>The introduction of our Back-Testing Engine has transformed the way we experiment and optimize Ad Budget Allocation at Yelp. By leveraging production code and historical data, we can evaluate changes safely and efficiently, enabling faster iteration and more informed decision-making. This approach has reduced the risks associated with live experimentation, improved collaboration between teams, and provided a more accurate picture of the impact any proposed update can have on our ad ecosystem.</p><p>While there are limitations, such as reliance on historical data and ML model accuracy, acknowledging these caveats ensures that back-testing remains a reliable tool in our experimentation toolkit. Throughout this process, we ensure that all campaign simulations use aggregated, anonymized data, prioritizing the privacy of our users and advertisers.</p><p>Altogether, the Back-Testing Engine has proven to be both a safety net and an accelerator, empowering our team to drive continuous improvement and deliver greater value to advertisers.</p><div class="island job-posting"><h3>Join Our Team at Yelp</h3><p>We're tackling exciting challenges at Yelp. Interested in joining us? Apply now!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2026/02/how-yelp-built-a-back-testing-engine-for-safer-smarter-ad-budget-allocation.html</link>
      <guid>https://engineeringblog.yelp.com/2026/02/how-yelp-built-a-back-testing-engine-for-safer-smarter-ad-budget-allocation.html</guid>
      <pubDate>Mon, 02 Feb 2026 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[S3 server access logs at scale]]></title>
      <description><![CDATA[Introduction Yelp heavily relies on Amazon S3 (Simple Storage Service) to store a wide variety of data, from images, logs, database backups, and more. Since data is stored on the cloud, we need to carefully manage how this data is accessed, secured, and eventually deleted—both to control costs and uphold high standards of security and compliance. One of the core challenges in managing S3 buckets is gaining visibility into who is accessing your data (known as S3 objects), how frequently, and for what purpose. Without robust logging, it’s difficult to troubleshoot access issues, respond to security incidents, and ensure we...]]></description>
      <link>https://engineeringblog.yelp.com/2025/09/s3-server-access-logs-at-scale.html</link>
      <guid>https://engineeringblog.yelp.com/2025/09/s3-server-access-logs-at-scale.html</guid>
      <pubDate>Fri, 26 Sep 2025 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Exploring CHAOS: Building a Backend for Server-Driven UI]]></title>
      <description><![CDATA[<p>A little while ago, we published a blog post on <a href="https://engineeringblog.yelp.com/2024/03/chaos-yelps-unified-framework-for-server-driven-ui.html">CHAOS: Yelp’s Unified Framework for Server-Driven UI</a>. We strongly recommend reading that post first to gain a solid understanding of SDUI and the goals of CHAOS. This post builds on those concepts to delve into the inner workings of the CHAOS backend and how it generates server-driven content. To briefly recap, CHAOS is a server-driven UI framework used at Yelp. When a client wants to display CHAOS-powered content, it sends a GraphQL query to the CHAOS API. The API processes the query, requests the CHAOS backend to construct the configuration, formats the response, and returns it to the client for rendering.</p><p>The CHAOS backend accepts client requests through the GraphQL-based CHAOS API. At Yelp, we have adopted <a href="https://www.apollographql.com/docs/graphos/schema-design/federated-schemas/federation">Apollo Federation</a> for our GraphQL architecture, utilizing <a href="https://strawberry.rocks/">Strawberry</a> for federated Python subgraphs to leverage type-safe schema definitions and Python’s type hints. The CHAOS-specific GraphQL schema resides in its own CHAOS Subgraph, hosted by a Python service.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-07-08-chaos-inside-yelps-sdui-framework/chaos_backend_overview.png" alt="Diagram of CHAOS API and Backend Architecture" /></p><p>This federated architecture allows us to manage our CHAOS-specific GraphQL schema independently while seamlessly integrating it into Yelp’s broader Supergraph.</p><p>Behind the GraphQL layer, we support multiple CHAOS backends that implement a CHAOS REST API to serve CHAOS content in the form of CHAOS Configurations. 
This architecture allows different teams to manage their CHAOS content independently on their own services, while the GraphQL layer provides a unified interface for client requests. The CHAOS API authenticates requests and routes them to the relevant backend service, where most of the build logic is handled.</p><p>The primary goal of a CHAOS backend is to construct a CHAOS SDUI Configuration. This data model encompasses all the information needed for a client to configure a CHAOS-powered SDUI view. Below is an example of a CHAOS view called “consumer.welcome” and its configuration:</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-07-08-chaos-inside-yelps-sdui-framework/chaos-view.png" alt="CHAOS View Example" /></p><div class="language-json highlighter-rouge highlight"><pre>{"data":{"chaosView":{"views":[{"identifier":"consumer.welcome","layout":{"__typename":"ChaosSingleColumn","rows":["welcome-to-yelp-header","welcome-to-yelp-illustration","find-local-businesses-button"]},"__typename":"ChaosView"}],"components":[{"__typename":"ChaosJsonComponent","identifier":"welcome-to-yelp-header","componentType":"chaos.text.v1","parameters":"{\"text\": \"Welcome to Yelp\", \"textStyle\": \"heading1-bold\", \"textAlignment\": \"center\"}"},{"__typename":"ChaosJsonComponent","identifier":"welcome-to-yelp-illustration","componentType":"chaos.illustration.v1","parameters":"{\"dimensions\": {\"width\": 375, \"height\": 300}, \"url\": \"https://media.yelp.com/welcome-to-yelp.svg\"}"},{"__typename":"ChaosJsonComponent","identifier":"find-local-businesses-button","componentType":"chaos.button.v1","parameters":"{\"text\": \"Find local businesses\", \"style\": \"primary\", \"onClick\": [\"open-search-url\"]}"}],"actions":[{"__typename":"ChaosJsonAction","identifier":"open-search-url","actionType":"chaos.open-url.v1","parameters":"{\"url\": \"https://yelp.com/search\"}"}],"initialViewId":"consumer.welcome","__typename":"ChaosConfiguration"}}}</pre></div><p>The 
configuration includes a list of views, each with a unique identifier and a layout. If there are multiple views, the initialViewId specifies which view should be displayed first. The layout, such as the single-column layout in this example, organizes components into sections based on their component IDs, helping the client determine the positioning of components within the CHAOS view.</p><p>Additionally, the configuration lists components and actions, detailing their settings as referenced by their respective IDs. Each component may have its own action, such as an onClick action for a button. A screen may also have actions triggered at specific stages, such as onView, for purposes like logging.</p><p>In CHAOS, components and actions are the fundamental building blocks. Instead of defining individual schemas for each element in the GraphQL layer, we use JSON strings for element content. This approach maintains a stable GraphQL schema and allows for rapid iteration on new elements or versions.</p><p>To ensure proper configuration, each element is defined as a Python dataclass, providing a clear interface. Type hinting guides developers on the expected parameters. These components and actions are available through a shared CHAOS Python package. For example, a text component could be structured as follows:</p><div class="language-python highlighter-rouge highlight"><pre>@dataclass
class TextV1(_ComponentData):
    value: str
    style: TextStyle
    color: Optional[Color] = None
    textAlignment: Optional[TextAlignment] = None
    margin: Optional[Margin] = None
    onView: Optional[List[Action]] = None
    onClick: Optional[List[Action]] = None
    component_type: str = "chaos.text.v1"
</pre></div><div class="language-python highlighter-rouge highlight"><pre>text = Component(
  component_data=TextV1(
      value="Welcome to Yelp!",
      style=TextStyle.HEADING_1_BOLD,
      textAlignment=TextAlignment.CENTER,
  )
)
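
# Illustrative sketch only (an assumption, not the shared CHAOS package's
# actual code): serialization conceptually drops unset optional fields and
# JSON-encodes the remaining dataclass fields under "parameters", keeping
# "component_type" at the top level.
import dataclasses
import json

def serialize_component_data(component_data) -> dict:
    # Convert the dataclass to a plain dict of its fields.
    fields = dataclasses.asdict(component_data)
    component_type = fields.pop("component_type")
    # Omit optional fields that were left unset.
    parameters = {k: v for k, v in fields.items() if v is not None}
    return {"component_type": component_type, "parameters": json.dumps(parameters)}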
</pre></div><p>These dataclasses internally handle serialization to JSON strings, as shown below:</p><div class="language-json highlighter-rouge highlight"><pre>{"component_type":"chaos.text.v1","parameters":"{\"text\": \"Welcome to Yelp\", \"textStyle\": \"heading1-bold\", \"textAlignment\": \"center\"}"}</pre></div><p>These basic components and actions, when combined with container-like components such as vertical and horizontal stacks—which organize elements in a vertical or horizontal sequence—enable powerful UI building capabilities.</p><p>In this section, we will explore how CHAOS constructs a configuration. Although the process can be complex, the shared CHAOS Python Package, which also contains CHAOS elements, provides Python classes that manage most of the build process in the background. This allows backend developers using the CHAOS SDUI framework to focus on configuring their content. Below is a high-level overview of the build process, with subsequent sections examining each step in detail.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-07-08-chaos-inside-yelps-sdui-framework/chaos_build_flow.png" alt="Configuration Build Flow" /></p><h2 id="step-1-request">Step 1: Request</h2><p>When a client sends a GraphQL query to the CHAOS API, it provides a view name and context. The view name is used to route the request to the relevant CHAOS backend to build the configuration. The context, a JSON object, is forwarded to the backend and includes information such as client specifications or feature specifications. This allows the backend to customize the build for each client request.</p><p>To illustrate this process, here is a simplified request from a mobile device that demonstrates how to retrieve a CHAOS view named “consumer.welcome”:</p><div class="language-plaintext highlighter-rouge highlight"><pre>POST /graphql
Request Body:
{
  "query": "
  query GetChaosConfiguration($viewName: String!, $context: ContextInput!) {
    chaosConfiguration(name: $viewName, context: $context) {
      # The actual fields of ChaosConfiguration would be specified here
      ...ChaosConfiguration Schema...
    }
  }
  ",
  "variables": {
    "viewName": "consumer.welcome",
    "context": "{\"screen_scale\": \"foo\", \"platform\": \"bar\", \"app_version\": \"baz\"}"
  }
}
</pre></div><p>Upon receiving the request, the CHAOS subgraph routes it to a CHAOS backend service for further processing.</p><h2 id="step-2-view-selection">Step 2: View Selection</h2><p>An individual CHAOS backend can support various CHAOS views. The <code class="language-plaintext highlighter-rouge">ChaosConfigBuilder</code> allows backend developers to register their <code class="language-plaintext highlighter-rouge">ViewBuilder</code> classes, which manage individual view builds. Upon receiving a request, the encapsulated logic in <code class="language-plaintext highlighter-rouge">ChaosConfigBuilder</code> selects the relevant <code class="language-plaintext highlighter-rouge">ViewBuilder</code> based on the request’s view name and executes the view build steps, constructing the final configuration. Here is a simplified example of using <code class="language-plaintext highlighter-rouge">ChaosConfigBuilder</code> in practice:</p><h4 id="simplified-example-to-illustrate-the-use-of-chaosconfigbuilder">Simplified example to illustrate the use of ChaosConfigBuilder</h4><div class="language-python highlighter-rouge highlight"><pre>from chaos.builders import ChaosConfigBuilder
from chaos.utils import get_chaos_context
from .views.welcome_view import ConsumerWelcomeViewBuilder
def handle_chaos_request(request):
    # Obtain the context for the CHAOS request
    context = get_chaos_context(request)
    # Register the view builders supported by this service.
    ChaosConfigBuilder.register_view_builders([
        ConsumerWelcomeViewBuilder,
        # Add other view builders here
    ])
    # Build and return the final configuration
    return ChaosConfigBuilder(context).build()
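# Illustrative sketch only: the dispatch performed by ChaosConfigBuilder can
# be pictured as a registry keyed by each builder's view_id(). The names
# below (_SketchConfigBuilder, build_view, the "view_name" context key) are
# assumptions for illustration, not the real CHAOS internals.
class _SketchConfigBuilder:
    _builders = {}

    @classmethod
    def register_view_builders(cls, builders):
        # Key each registered ViewBuilder by the view name it declares.
        for builder in builders:
            cls._builders[builder.view_id()] = builder

    def __init__(self, context):
        self._context = context

    def build(self):
        # Route on the requested view name and run that builder's build.
        builder_cls = self._builders[self._context["view_name"]]
        return builder_cls(self._context).build_view()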
</pre></div><h2 id="step-3-layout-selection">Step 3: Layout Selection</h2><p>Each view has a ViewBuilder class, which selects the appropriate layout and manages the construction of the view.</p><p>CHAOS supports different layouts. For example, a single-column layout, as shown in the previous example, has only one “main” section. Other layouts, such as a basic mobile layout, include additional sections like a toolbar and footer. This flexibility allows content to be presented differently across various clients, such as web and mobile, to accommodate different client characteristics.</p><p>Each supported layout type in CHAOS has a corresponding LayoutBuilder. This class accepts a list of <code class="language-plaintext highlighter-rouge">FeatureProvider</code> classes (described in detail later) for each section. The order of FeatureProviders within each section determines their order when rendered on the client.</p><p>Continuing with the welcome_consumer example, the ViewBuilder looks like this:</p><h4 id="simplified-and-illustrative-example-of-a-viewbuilder-in-chaos">Simplified and illustrative example of a ViewBuilder in CHAOS.</h4><div class="language-python highlighter-rouge highlight"><pre>
from typing import List, Type

from chaos.builders import ViewBuilderBase, LayoutBuilderBase, SingleColumnLayoutBuilder
from .features import WelcomeFeatureProvider
class ConsumerWelcomeViewBuilder(ViewBuilderBase):
    @classmethod
    def view_id(cls) -&gt; str:
        return "consumer.welcome"
    def subsequent_views(self) -&gt; List[Type[ViewBuilderBase]]:
        """Refer to the 'Advanced Features - View Flows' section for details."""
        return []
    def _get_layout_builder(self) -&gt; LayoutBuilderBase:
        """
        Logic to select the appropriate layout builder based on the context.
        """
        return SingleColumnLayoutBuilder(
            main=[
                WelcomeFeatureProvider,
            ],
            context=self._context
        )
</pre></div><p>When the <code class="language-plaintext highlighter-rouge">ChaosConfigBuilder</code> executes the <code class="language-plaintext highlighter-rouge">ViewBuilder</code>’s build steps, it internally invokes the _get_layout_builder() method to determine the appropriate <code class="language-plaintext highlighter-rouge">LayoutBuilder</code> and execute its build steps. In this example, the method returns a SingleColumnLayoutBuilder, which is structured with a single section named “main”. This section contains only one feature provider: WelcomeFeatureProvider. The LayoutBuilder will then execute the FeatureProvider’s build process, which constructs the configuration for the feature’s SDUI.</p><h2 id="step-4-build-features">Step 4: Build Features</h2><p>A feature’s SDUI comprises one or more components and actions that collectively fulfill a product purpose, allowing users to view and interact with it on the Yelp app. Feature developers define each feature by inheriting from the FeatureProvider class, which encapsulates all the logic required to load feature data and configure the user interface appropriately.</p><p>Each FeatureProvider builds its feature by going through the following major steps:</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-07-08-chaos-inside-yelps-sdui-framework/provider-build-flow.png" alt="Feature Provider Build Flow" /></p><div class="language-python highlighter-rouge highlight"><pre>class FeatureProviderBase:
    def __init__(self, context: Context):
        self.context = context
    @property
    def registers(self) -&gt; List[Register]:
        """Sets platform conditions and presenter handler."""
    def is_qualified_to_load(self) -&gt; bool:
        """Checks if data loading is allowed."""
        return True
    def load_data(self) -&gt; None:
        """Initiates asynchronous data loading."""
    def resolve(self) -&gt; None:
        """Processes data for SDUI component configuration."""
    def is_qualified_to_present(self) -&gt; bool:
        """Checks if configuration of the feature is allowed."""
        return True
    def result_presenter(self) -&gt; List[Component]:
        """Defines component configurations."""
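# Sketch (an assumption, not the real CHAOS build loop) of the two-pass
# build described below: pass 1 starts async loading for every qualified
# provider; pass 2 resolves results and collects the presented components.
def build_features(providers):
    started = []
    for provider in providers:
        if provider.is_qualified_to_load():
            provider.load_data()  # kick off async requests, do not block
            started.append(provider)
    components = []
    for provider in started:
        provider.resolve()  # wait for the responses
        if provider.is_qualified_to_present():
            components.extend(provider.result_presenter())
    return components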
</pre></div><p>A view can contain multiple features, and during the build process, all features are built in parallel to enhance performance. To achieve this, the feature providers are iterated over twice. In the first loop, the build process is initiated, triggering any asynchronous calls to external services. This includes the steps: <code class="language-plaintext highlighter-rouge">registers</code>, <code class="language-plaintext highlighter-rouge">is_qualified_to_load</code>, and <code class="language-plaintext highlighter-rouge">load_data</code>. The second loop waits for responses and completes the build process, encompassing the steps: <code class="language-plaintext highlighter-rouge">resolve</code>, <code class="language-plaintext highlighter-rouge">is_qualified_to_present</code>, and <code class="language-plaintext highlighter-rouge">result_presenter</code>. (It is worth mentioning that the latest CHAOS backend framework introduces the next generation of builders using Python asyncio, which simplifies the interface. This will be explored in a future blog post.)</p><h3 id="check-registers">Check Registers</h3><p>The <code class="language-plaintext highlighter-rouge">Register</code> class in CHAOS is crucial for ensuring that any SDUI content returned to the client is supported. Each register specifies:</p><ul><li><strong>Platform</strong>: The platforms (e.g., iOS, Android, web) for which the registered configuration is intended.</li>
<li><strong>Elements</strong>: The required components and actions in this configuration that the client must support. Internally, we maintain information about which components and actions are supported by a given client platform type and app version, which is used for verification.</li>
<li><strong>Presenter Handler</strong>: The associated handler (e.g., result_presenter) responsible for constructing the configuration if all conditions are met.</li>
</ul><p>During setup, developers can define multiple registers, each linked to a different handler. Based on the client information provided to the backend, the presenter handler of the first qualifying register is selected to build the configuration. If no register qualifies, the feature is omitted from the final response.</p><h3 id="check-qualification-to-load">Check Qualification to Load</h3><p>The qualification step, <code class="language-plaintext highlighter-rouge">is_qualified_to_load</code>, allows developers to perform additional checks to decide whether the feature building process should continue and if feature data should be loaded. This is typically where feature toggles are applied or experimental checks are conducted. If this step returns false, the feature will be excluded from the final configuration.</p><h3 id="async-data-loading-and-resolve">Async Data Loading and Resolve</h3><p>During the <code class="language-plaintext highlighter-rouge">load_data</code> stage, we initiate asynchronous requests to upstream services in parallel. We defer resolving and blocking for results to the <code class="language-plaintext highlighter-rouge">resolve</code> stage. This approach enables efficient dispatch of requests and data sharing in all feature providers, optimizing performance by resolving data at a later stage.</p><h3 id="check-qualification-to-present">Check Qualification to Present</h3><p>The qualification step, <code class="language-plaintext highlighter-rouge">is_qualified_to_present</code>, allows developers to perform additional checks to determine whether a feature should be included in the configuration. This is especially useful when data fetched during the loading step is needed to decide if the feature should be displayed. 
If this returns false, the feature will be dropped from the final configuration.</p><h3 id="configure-the-feature">Configure the Feature</h3><p>This is the stage where we configure the components and actions that constitute the feature. In the <code class="language-plaintext highlighter-rouge">FeatureProvider</code> code, this is represented by the <code class="language-plaintext highlighter-rouge">result_presenter</code> method. Developers can define multiple presenter handlers. The one selected in the registers will serve as the final handler for the feature.</p><p>Back to the example, the <code class="language-plaintext highlighter-rouge">WelcomeFeatureProvider</code> feature is shown to users when it meets the following conditions: the requesting client is on an iOS or Android platform, and the client supports the required CHAOS elements (<code class="language-plaintext highlighter-rouge">TextV1</code>, <code class="language-plaintext highlighter-rouge">IllustrationV1</code>, <code class="language-plaintext highlighter-rouge">ButtonV1</code>). If satisfied, an asynchronous request fetches button text in the <code class="language-plaintext highlighter-rouge">load_data</code> method, which is then processed in the <code class="language-plaintext highlighter-rouge">resolve</code> method. The <code class="language-plaintext highlighter-rouge">result_presenter</code> method configures and displays the welcome text, illustration, and button with the fetched text.</p><div class="language-python highlighter-rouge highlight"><pre>class WelcomeFeatureProvider(ProviderBase):
    @property
    def registers(self) -&gt; List[Register]:
        return [
            Register(
                condition=Condition(
                    platform=[Platform.IOS, Platform.ANDROID],
                    library=[TextV1, IllustrationV1, ButtonV1],
                ),
                presenter_handler=self.result_presenter,
            )
        ]
    def is_qualified_to_load(self) -&gt; bool:
        return True
    def load_data(self) -&gt; None:
        self._button_text_future = AsyncButtonTextRequest()
    def resolve(self) -&gt; None:
        button_text_results = self._button_text_future.result()
        self._button_text = button_text_results.text
    def result_presenter(self) -&gt; List[Component]:
        return [
            Component(
                component_data=TextV1(
                    text="Welcome to Yelp!",
                    style=TextStyleV1.HEADER_1,
                    text_alignment=TextAlignment.CENTER,
                )
            ),
            Component(
                component_data=IllustrationV1(
                    dimensions=Dimensions(width=375, height=300),
                    url="https://media.yelp.com/welcome-to-yelp.svg",
                ),
            ),
            Component(
                component_data=ButtonV1(
                    text=self._button_text,
                    button_type=ButtonType.PRIMARY,
                    onClick=[
                        Action(
                            action_data=OpenUrlV1(
                                url="https://yelp.com/search"
                            )
                        ),
                    ],
                )
            )
        ]
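# Hypothetical helper showing the register selection described earlier: the
# first register whose platform and element requirements the client
# satisfies wins; None means the feature is omitted. Plain dicts stand in
# for the real Register/Condition classes, purely for illustration.
def select_presenter(registers, client_platform, supported_elements):
    for register in registers:
        if client_platform in register["platforms"] and all(
            element in supported_elements for element in register["elements"]
        ):
            return register["presenter_handler"]
    return None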
</pre></div><p>In an SDUI view with multiple features, error handling is essential. In a data-intensive backend, upstream requests might fail, or unexpected issues could occur. To prevent a complete CHAOS configuration failure due to a single feature’s issue, each FeatureProvider is wrapped in an error-handling wrapper during the CHAOS build process. If an exception occurs, the individual feature is dropped and the rest of the view remains unaffected, unless developers choose to mark the feature as “essential,” in which case its failure affects the entire view.</p><h4 id="simplified-pseudo-code-example-for-error-handling-in-a-feature-provider">Simplified pseudo-code example for error handling in a feature provider.</h4><div class="language-python highlighter-rouge highlight"><pre>def error_decorator(f: F) -&gt; F:
    @wraps(f)
    def wrapper(self, *args, **kwargs):
        try:
            return f(self, *args, **kwargs)
        except Exception as e:
            if self._is_essential_provider:
                raise
            log_error(exception=e, context=self._context)
        return []
    return cast(F, wrapper)
class ErrorHandlingExecutionContext:
    def __init__(self, wrapped_element: ProviderBase) -&gt; None:
        self._wrapped_element: ProviderBase = wrapped_element
        self._context: Context = self._wrapped_element.context
        self._is_essential_provider: bool = self._wrapped_element.IS_ESSENTIAL_PROVIDER
    # Other methods are omitted for brevity.
    @error_decorator
    def final_result_presenter(self) -&gt; List:
        ...
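# Self-contained illustration of the drop-vs-raise behavior described above
# (simplified from the pseudo-code: type hints removed, log_error stubbed
# out as a comment; the real wrapper also records request context).
from functools import wraps

def error_decorator(f):
    @wraps(f)
    def wrapper(self, *args, **kwargs):
        try:
            return f(self, *args, **kwargs)
        except Exception:
            if self._is_essential_provider:
                raise
            # log_error(exception=..., context=...) would be called here
        return []
    return wrapper

class FailingProvider:
    def __init__(self, essential):
        self._is_essential_provider = essential

    @error_decorator
    def final_result_presenter(self):
        raise RuntimeError("upstream request failed")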
</pre></div><p>When an error occurs, we record details such as the feature name, ownership info, exception specifics, and additional request context. This logging facilitates the monitoring of issues, the generation of alerts, and the automatic notification of the responsible team when problems reach a specified threshold.</p><p>The example above covers a pretty basic configuration build example. Now, here’s a quick look at some advanced CHAOS features.</p><h2 id="view-flows">View Flows</h2><p>In the CHAOS configuration schema, the “ChaosView - views” is defined as a list, with the initial view specified by “ChaosView - initialViewId.”</p><p>The CHAOS framework is engineered to allow a view to be linked with multiple “subsequent views.” The configurations for these subsequent views are also contained within “ChaosView - views,” with each view having its own unique ViewId.</p><p>Subsequent views are accessed through the “CHAOS Action - Open Subsequent View.” This action enables navigation to another view using its associated ViewId. This action can be attached to the onClick event of a component, such as a button, thereby allowing users to navigate seamlessly.</p><div class="language-python highlighter-rouge highlight"><pre>@dataclass
class OpenSubsequentView(_ActionData):
    """`"""
    viewId: str
    """The name of subsequent view this action should open."""
    action_type: str = field(init=False, default="chaos.open-subsequent-view.v1", metadata=excluded_from_encoding)
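# Hypothetical sketch of how an action dataclass could serialize to the
# {"actionType", "parameters"} wire shape used in this post. The standalone
# dataclass and encode_action helper are illustrative stand-ins, not the
# real _ActionData / excluded_from_encoding machinery.
import json
from dataclasses import dataclass, field, asdict

@dataclass
class OpenSubsequentViewSketch:
    viewId: str
    action_type: str = field(init=False, default="chaos.open-subsequent-view.v1")

def encode_action(action):
    # Everything except the type tag is JSON-encoded into "parameters".
    params = {k: v for k, v in asdict(action).items() if k != "action_type"}
    return {"actionType": action.action_type, "parameters": json.dumps(params)}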
</pre></div><p>The process for constructing subsequent views is identical to that of the primary view builder. To register a view builder as a subsequent view to the primary one, the ViewBuilder class provides the subsequent_views method.</p><div class="language-python highlighter-rouge highlight"><pre>def subsequent_views(self) -&gt; List[Type[ViewBuilderBase]]:
    return [
        # Add View Builders for Subsequent Views here.
    ]
</pre></div><p>Each view builder in this list is constructed alongside the primary view builder and stored in the “ChaosView - views” list within the final configuration. This design allows developers to define sequences of views, known as “flows,” which are interconnected using the “OpenSubsequentView” action. This approach is particularly beneficial in scenarios where users need to navigate quickly through a series of closely related content. By preloading these views, we eliminate the need for additional network requests for each view configuration, thereby enhancing the user experience by reducing latency.</p><p>Below is an example of a CHAOS Flow utilized in our Yelp for Business mobile app, specifically designed to support a customer support FAQ menu.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-07-08-chaos-inside-yelps-sdui-framework/chaos-flow-example.png" alt="CHAOS Flow Example" /></p><h4 id="simplified-flow-chaos-config">Simplified Flow CHAOS Config</h4><p>This basic configuration demonstrates a three-view flow. Each view contains a button that, when clicked, triggers an action to open the next view. 
In this example, the views will navigate sequentially from View 1 to View 2 to View 3, and then loop back to View 1.</p><div class="language-json highlighter-rouge highlight"><pre>{"data":{"chaosView":{"views":[{"identifier":"consumer.view_one","layout":{"__typename":"ChaosSingleColumn","rows":["button-one"]},"__typename":"ChaosView"},{"identifier":"consumer.view_two","layout":{"__typename":"ChaosSingleColumn","rows":["button-two"]},"__typename":"ChaosView"},{"identifier":"consumer.view_three","layout":{"__typename":"ChaosSingleColumn","rows":["button-three"]},"__typename":"ChaosView"}],"components":[{"__typename":"ChaosJsonComponent","identifier":"button-one","componentType":"chaos.button.v1","parameters":"{\"text\": \"Next\", \"style\": \"primary\", \"onClick\": [\"open-subsequent-one\"]}"},{"__typename":"ChaosJsonComponent","identifier":"button-two","componentType":"chaos.button.v1","parameters":"{\"text\": \"Next\", \"style\": \"primary\", \"onClick\": [\"open-subsequent-two\"]}"},{"__typename":"ChaosJsonComponent","identifier":"button-three","componentType":"chaos.button.v1","parameters":"{\"text\": \"Back to start\", \"style\": \"primary\", \"onClick\": [\"open-subsequent-three\"]}"}],"actions":[{"__typename":"ChaosJsonAction","identifier":"open-subsequent-one","actionType":"chaos.open-subsequent-view.v1","parameters":"{\"viewId\": \"consumer.view_two\"}"},{"__typename":"ChaosJsonAction","identifier":"open-subsequent-two","actionType":"chaos.open-subsequent-view.v1","parameters":"{\"viewId\": \"consumer.view_three\"}"},{"__typename":"ChaosJsonAction","identifier":"open-subsequent-three","actionType":"chaos.open-subsequent-view.v1","parameters":"{\"viewId\": \"consumer.view_one\"}"}],"initialViewId":"consumer.view_one","__typename":"ChaosConfiguration"}}}</pre></div><h2 id="view-placeholders">View Placeholders</h2><p>In CHAOS, we allow a CHAOS view to be nested within another CHAOS view, which the client loads once the parent view is displayed. 
This is achieved using a special CHAOS component called a view placeholder. When rendering this component, the parent view initially shows a loading spinner by default until the nested view’s CHAOS configuration is successfully loaded asynchronously. Once loaded, the nested view is seamlessly integrated with the surrounding content of the parent view.</p><p>This approach enables the main content to be displayed to the user more quickly, while additional content is loaded in the background as the user engages with other items on the screen.</p><p>The view placeholder component can also be optionally configured to handle different states during the loading process, including loading, error, and empty states.</p><div class="language-python highlighter-rouge highlight"><pre>@dataclass
class ViewPlaceholderV1(_ComponentData):
    """
    Used to provide a placeholder that clients should use to fetch the indicated CHAOS Configuration and then load the retrieved content in the location of this component.
    """
    viewName: str
    """The name of the CHAOS view to fetch, e.g. "consumer.inject_me"."""
    featureContext: Optional[ChaosJsonContextData]
    """
    A feature-specific JSON object to be passed to the backend for the view building process by view placeholder.
    """
    loadingComponentId: Optional[ComponentId]
    """An optional component that provides a custom loading state."""
    errorComponentId: Optional[ComponentId]
    """An optional component that provides a custom error state."""
    emptyComponentId: Optional[ComponentId]
    """An optional component that provides a custom empty state."""
    headerComponentId: Optional[ComponentId]
    """An optional component that provides a static header."""
    footerComponentId: Optional[ComponentId]
    """
    An optional component that provides a static footer.
    Use the footer to provide a separator between the component and content below it.
    If the view is closed, the separator will be removed along with the view content.
    """
    estimatedContentHeight: Optional[int]
    """An optional estimate for the height of the content so that space can be allocated when loading."""
    defaultLoadingComponentPadding: Optional[Padding]
    """Specifies whether padding should be added around the shimmer."""
    component_type: str = field(init=False, default="chaos.component.view-placeholder.v1")
</pre></div><p>Here’s an example of the View Placeholder in action on our Yelp for Business home screen. The full home screen is supported by CHAOS. The “Reminders” feature is another standalone CHAOS view supported by a different CHAOS backend service. A ViewPlaceholder is used to asynchronously fetch the Reminders after the home screen has loaded and position it in the appropriate location.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-07-08-chaos-inside-yelps-sdui-framework/view-placeholder-example.png" alt="CHAOS View Placeholder Example" /></p><p>This post provided a high-level overview of how the backend build process for CHAOS comes together. We walked through how configurations are built, how features are composed and validated, and how advanced capabilities like view flows and nested views help create dynamic, responsive user experiences.</p><p>In upcoming posts, our client engineering teams will take a deeper dive into how CHAOS is implemented across Web, iOS, and Android, and how each platform adapts the server-driven configurations to deliver a seamless experience to users. We’ll also explore more advanced topics, such as strategies for making CHAOS even more dynamic, optimizing performance, and scaling the framework to support increasingly complex product needs.</p><p>We’re excited to continue sharing what we’ve learned as we evolve CHAOS to power even richer, faster, and more flexible user experiences across Yelp. Stay tuned!</p><div class="island job-posting"><h3>Join Our Team at Yelp</h3><p>We're tackling exciting challenges at Yelp. Interested in joining us? Apply now!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2025/07/chaos-inside-yelps-sdui-framework.html</link>
      <guid>https://engineeringblog.yelp.com/2025/07/chaos-inside-yelps-sdui-framework.html</guid>
      <pubDate>Tue, 08 Jul 2025 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Revenue Automation Series: Testing an Integration with Third-Party System]]></title>
      <description><![CDATA[<h2 id="background"><strong>Background</strong></h2><p>As described in the <a href="https://engineeringblog.yelp.com/2025/02/revenue-automation-series-building-revenue-data-pipeline.html">second blog post</a> of Revenue Automation series, Revenue Data Pipeline processes a large amount of data via complex logic transformations to recognize revenue. Thus, developing a robust production testing and integration strategy was essential to the success of this project phase.</p><p>The status quo testing process utilized the <a href="https://engineeringblog.yelp.com/2016/10/redshift-connector.html">Redshift Connector</a> for data synchronization once the report was generated and published to the data warehouse (Redshift). This introduced a latency of approximately 10 hours before the data was available in the data warehouse for verification. This delay impacted our ability to verify whether the changes were accurately reflected and the data was updated as required. Additionally, the initial process involved manual data verification, increasing the risk of human error.</p><p>To enhance efficiency and minimize manual effort, we implemented a new testing strategy leveraging the concept of a “staging pipeline” which is discussed further in this blog. This improvement significantly accelerated the testing process as the data was available immediately after the reports were generated. This allowed the pipeline to detect errors earlier in the process.</p><p>Due to the unique nature of Yelp’s product implementation, we faced some challenges in testing the pipeline:</p><ul><li>Since the development environments have limited data, the different edge cases that occur in production could not be covered during dev testing. This was discovered when the data pipeline was executed in production for the first time.</li>
<li>We needed to ensure that the changes implemented in the development region do not affect data correctness in the production pipeline.</li>
<li>We had to find a way to test the pipeline behavior with production data before release.</li>
</ul><h2 id="execution-plan"><strong>Execution Plan</strong></h2><p>Glossary</p><ul><li>Billed Revenue — This is the amount that Yelp invoices the customer for.</li>
<li>Earned Revenue / Estimated Revenue — This revenue is calculated when Yelp fulfills the delivery of a purchased product.</li>
<li>Revenue Period — Time period during which we recognize revenue for a product.</li>
<li>Revenue Contract — This defines the terms under which services are delivered, and revenue is recognized.</li>
<li>Redshift Connector — A <a href="https://engineeringblog.yelp.com/2016/10/redshift-connector.html">Data Connector</a> that loads data from Data Pipeline streams into <a href="https://aws.amazon.com/redshift/">AWS Redshift</a>.</li>
<li>Latency — The time it takes for the Redshift connector to process input data.</li>
<li>Data Pipeline - Data Pipeline ecosystem refers to Yelp’s infrastructure for <a href="https://en.wikipedia.org/wiki/Streaming_data">streaming data</a> across services.</li>
</ul><p>To ensure data correctness, we followed the steps discussed below:</p><h2 id="step-1-staging-pipeline-setup"><strong>Step 1: Staging Pipeline Setup</strong></h2><p>We set up a separate data pipeline configuration, referred to as the staging pipeline, in parallel with the production pipeline. Its results were published to <a href="https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html">AWS Glue</a> tables, which enabled us to solve the Redshift Connector latency problem by making data immediately available for querying via <a href="https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-overview.html">Redshift Spectrum</a>. When new changes to the code were implemented, the staging pipeline was modified to reflect them and was executed using production data. This approach allowed us to validate changes in a production environment by easily comparing staging and production results, eliminating the need to revert changes in the production pipeline when bugs were introduced.</p><p>The diagram below illustrates the parallel pipeline.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-05-20-revenue-automation-series-testing-an-integration-with-third-party-system/staging_pipeline_setup-parallel_pipeline.png" alt="Staging Pipeline Setup - Parallel Pipeline" /></p><p>With this approach, the production pipeline and its data were left untouched until the new changes had been verified.</p><p>The diagram below shows both pipelines after staging verification.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-05-20-revenue-automation-series-testing-an-integration-with-third-party-system/staging_pipeline_setup-after_verification_stage.png" alt="Staging Pipeline Setup - After Verification Stage" /></p><h2 id="step-2-test-data-generation"><strong>Step 2: Test Data Generation</strong></h2><p>Test data in the development environment was very limited as Yelp has various products with different edge cases and revenue calculation requirements in production, thus increasing the gap between the two data sets.</p><p>We had to outline different scenarios and edge cases that might occur in production and create data in the dev environment to mimic those scenarios. Once an edge case was discovered via a manual verification and integrity checker (as explained in <a href="https://engineeringblog.yelp.com/2025/05/revenue-automation-series-testing-an-integration-with-third-party-system.html#step-3-data-integrity-checkers">step 3</a>), new data points were created in the dev environment to replicate that scenario.</p><p>The diagram below illustrates the process.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-05-20-revenue-automation-series-testing-an-integration-with-third-party-system/test_data_generation_process.png" alt="Test Data Generation Process" /></p><p>The large number of database tables required as input to the pipeline made this process very tedious as it involved manual creation of data points. Hence, for the development and testing of some scenarios, we relied heavily on the staging pipeline for verification. 
In the future, we plan to automate the creation of test data in the development environment, as discussed in the <a href="https://engineeringblog.yelp.com/2025/05/revenue-automation-series-testing-an-integration-with-third-party-system.html#future-improvements">Future Improvements</a> section.</p><h2 id="step-3-data-integrity-checkers"><strong>Step 3: Data Integrity Checkers</strong></h2><p>Ideally, integrity checks are developed to compare the output of a process against its source data. In our case, this was a comparison between the revenue contract data pipeline results, which include estimated revenue, and the billed revenue produced by Yelp’s billing system. Various metrics were explored to improve the reliability of the pipeline results. Some of the critical metrics include:</p><ul><li>
<p>Number of contracts matching invoice (billed transaction) gross revenue: We wanted to verify that at least 99.99 percent of the contracts matched the billed revenue as this level of accuracy was required to ensure the reliability of the system.</p>
</li>
<li>
<p>Number of contracts with a mismatched invoice discount: With Yelp offering different discounts and promotions applied at the business and product offering level, we had to ensure that the pipeline predicted the discount application as closely as possible to the billing system.</p>
</li>
<li>
<p>Number of contract lines with no equivalent invoice: Some of Yelp’s product offerings are billed differently. Some are manually billed and don’t necessarily pass through the automated billing process. We wanted to figure out how many of these contracts were being reported by the pipeline and if an invoice generated through the billing system was not recognized by the automated pipeline.</p>
</li>
<li>
<p>Number of duplicate or past contract lines: We also wanted to confirm that we sent unique contracts and prevented redundant past contracts from entering the Revenue Recognition system.</p>
</li>
</ul><p>After finalizing the required reporting metrics, the next steps in developing the integrity checker presented further challenges:</p><ul><li>
<p>The pipeline used snapshots of the database as its daily source, that is, the daily state of the source database is saved in a data lake, but the database itself is constantly changing.</p>
</li>
<li>
<p>Products are mostly billed at the end of the monthly period. Thus, we would have to rely on monthly comparisons for confirming data accuracy.</p>
</li>
</ul><p>The Revenue Data Pipeline is designed to send data for revenue recognition on a daily basis and the ideal data source of truth is the revenue data produced each month by Yelp’s billing system. To resolve this, we decided to develop both monthly and daily checks.</p><ul><li>Monthly Integrity Check: Revenue Recognition pipeline data was published to Redshift tables. Data from the billing system and contract pipeline were compared via SQL queries. Determining the right SQL queries was done iteratively, accounting for different edge cases based on the variety of products offered at Yelp.</li>
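<li><p>The monthly comparison boils down to one aggregate query joining the pipeline output to billed invoices. Here is a hedged sketch of its shape; the table and column names are hypothetical, and the real queries were refined iteratively for Yelp’s product edge cases.</p>

```python
# Hypothetical shape of the monthly integrity query; the table and
# column names are illustrative, not the real Redshift schema.
MONTHLY_CHECK_SQL = """
SELECT
    COUNT(*)                                   AS contracts_reviewed,
    SUM(CASE WHEN c.gross_revenue = i.gross_revenue
              AND c.net_revenue  = i.net_revenue
             THEN 1 ELSE 0 END)                AS matched_gross_and_net,
    SUM(CASE WHEN c.discount != i.discount
             THEN 1 ELSE 0 END)                AS discount_mismatches,
    SUM(CASE WHEN i.contract_id IS NULL
             THEN 1 ELSE 0 END)                AS no_equivalent_invoice
FROM contract_pipeline_results c
LEFT JOIN billing_invoices i USING (contract_id)
WHERE c.report_month = %(report_month)s
"""

def line_match_pct(matched: int, reviewed: int) -> float:
    """Line match % as reported in the monthly check output."""
    return round(100.0 * matched / reviewed, 3) if reviewed else 0.0
```
</li>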
</ul><div class="language-plaintext highlighter-rouge highlight"><pre>Sample Monthly Check Result format
---- Metrics ----
Number of contracts reviewed: 100000
Gross Revenue:  xxxx
Net Revenue:  xxxx
Number of contracts matching both gross &amp; net:  99799
Gross Revenue: xxxx
Net Revenue:  xxxx
Number of contracts with mismatch discount: 24
Gross Revenue: xx
Net Revenue:  xx
Number of contracts with mismatch gross revenue:  6
Gross Revenue: xx
Number of contracts with no equivalent invoice:  1
Gross Revenue:  xx
Net Revenue:  xx
Number of invoice with no equivalent contract:  10
Gross Revenue:  xx
Net Revenue:  xx
Line match %: 99.997
Net revenue mismatch difference:  1883.12
</pre></div><ul><li>Daily Integrity Check: Waiting for the end of the month to determine data accuracy would lead to challenges when implementing quick code modifications and data correction backfills. Data published by the pipeline also takes more than 10 hours to become available in the Redshift database. To perform checks as quickly as possible after the pipeline run completes, we implemented daily checks in two steps.
<ul><li>
<p>We published data to the data lake and created <a href="https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html">AWS Glue</a> data catalog tables to read from the data storage as soon as the data became available.</p>
</li>
<li>
<p>We implemented SQL queries on the staging pipeline results, as explained in the <a href="https://engineeringblog.yelp.com/2025/05/revenue-automation-series-testing-an-integration-with-third-party-system.html#step-1-staging-pipeline-setup">Staging Pipeline Setup</a> section.</p>
</li>
</ul></li>
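<li><p>Once the Glue catalog tables are queryable, the daily check reduces to a handful of count queries plus a decision about whether to investigate and rerun. A hypothetical sketch of that decision; the metric names mirror the daily check output, and the thresholds are our own illustrative values.</p>

```python
# Decide whether a daily pipeline run needs investigation and a rerun.
# Metric names mirror the daily check output; thresholds are hypothetical.
DAILY_THRESHOLDS = {
    "contracts_with_negative_revenue": 0,    # must always be zero
    "contracts_with_unknown_program": 0,     # must always be zero
    "contracts_passed_or_expired": 100,      # small counts are expected
    "contracts_missing_parent_category": 100,
}

def needs_rerun(daily_metrics: dict) -> list:
    """Return the metrics that exceed their allowed threshold."""
    return [
        name for name, limit in DAILY_THRESHOLDS.items()
        if daily_metrics.get(name, 0) > limit
    ]
```
</li>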
</ul><p>If any discrepancies were discovered, we would swiftly determine the cause, correct it, and rerun the pipeline for the day.</p><div class="language-plaintext highlighter-rouge highlight"><pre>Sample Daily Check Result Format
        ---- Metrics ----
number of contracts with negative revenue:  0
number of contracts passed or expired:  28
number of contracts with unknown program:   0
number of contracts missing parent category:   66
</pre></div><h2 id="step-4-data-format-validation"><strong>Step 4: Data Format Validation</strong></h2><p>To support a streamlined data upload process, we had to verify that the data required for upload to a third-party system met system-specific validation rules. We faced a few challenges where the internal file data format, column mapping sequence, or differences in the external table schema led to upload failures.</p><p>To avoid such failures, we developed a Schema Validation Batch, which retrieves the field mapping from the external system using a REST API and compares it against the revenue data schema before proceeding with the data upload.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-05-20-revenue-automation-series-testing-an-integration-with-third-party-system/schema_validation_batch_process.png" alt="Schema Validation Batch Process" /></p><p>This validation layer minimized failure risks and improved system reliability. Here is an example of a request/response to retrieve the required details.</p><div class="language-plaintext highlighter-rouge highlight"><pre>-- Request
curl -X GET --header "token: RandomBasicToken" "https://yourHost/api/integration/v1/upload/mapping?templatename=SAMPLE_TEMPLATE"
-- Response
{
        "Dateformat":"MM/DD/YYYY",
        "Mapping":[
                {
                        "sample_column_id":1,
                        "sample_column_name":"random",
                        "sample_column_data_type":"string"
                }
        ]
}
</pre></div><h2 id="step-5-data-ingestion"><strong>Step 5: Data Ingestion</strong></h2><p>We developed an Upload Batch in the integration layer, which compressed large data files for a specific report date and uploaded them to the external system. Logging and monitoring were enabled to identify issues and failures in a timely manner, contributing to building a more resilient process.</p><p>We implemented two methods for data upload:</p><ul><li>Upload using REST APIs: It is common practice to opt for standard HTTP methods, which are easy to use and scalable. This option was initially explored and used for uploads during the development phase. An example of an upload request/response is shown below:</li>
</ul><div class="language-plaintext highlighter-rouge highlight"><pre>-- Request
curl -X POST https://yourHost/api/integration/v1/upload/file \
-H 'cache-control: no-cache' \
-H 'content-type: multipart/form-data; boundary=----WebKitFormBoundary' \
-H 'templatename: Sample Template' \
-H 'token: &lt;generated token&gt;' \
-F 'file=sample_data_file.csv'
--Response
{"Message":"File consumed successfully","Result":{"file_name":"sample_data_file.csv","file_request_id":"1000"},"Status":"Success"}
</pre></div><ul><li>SFTP Upload: This method uses the SSH protocol, which provides encrypted data channels and protects sensitive data from unauthorized access. We implemented individual flag-level options for each environment to use the SFTP upload feature.</li>
</ul><p>While testing these upload methods, we observed a few differences which are briefly discussed below:</p><ul><li>
<p>Reliability issue: Testing revealed reliability concerns with the availability and performance of the REST APIs. Responses were found to be flaky, which led to inconsistent uploads and multiple retries.</p>
</li>
<li>
<p>File size limitations: The REST APIs have a predefined limit of 50,000 records per file, which resulted in approximately 15 files being generated on a daily basis and approximately 50 files during the month-end closing period. By using the SFTP upload method, we increased the file size limit to between 500,000 and 700,000 records. As a result, the pipeline now generates only 4-5 files, making the upload process more efficient and easier to maintain.</p>
</li>
<li>
<p>Setup process: The SFTP upload setup process was found to be less complex than the API setup. To maintain a standardized and more efficient process across multiple pipelines at Yelp, such as the revenue contract and invoice data pipelines, the SFTP upload approach emerged as our preferred method.</p>
</li>
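<li><p>The file-count difference above follows directly from the per-file record caps. A small illustrative sketch of the batching arithmetic; the helper name and record counts are ours, not Yelp’s actual code.</p>

```python
# Split a day's records into upload files under a per-file record cap.
# The caps come from the limits discussed above; helper names are ours.
def plan_upload_files(records, max_records_per_file):
    """Yield successive slices of at most max_records_per_file records."""
    for start in range(0, len(records), max_records_per_file):
        yield records[start:start + max_records_per_file]

daily_records = list(range(730_000))                          # ~730k contract lines
rest_files = list(plan_upload_files(daily_records, 50_000))   # REST API cap
sftp_files = list(plan_upload_files(daily_records, 500_000))  # SFTP cap
```
</li>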
</ul><p>After analyzing Yelp’s system and daily data volume requirements, we decided to enable SFTP upload as the primary method for the daily batch upload process.</p><h2 id="step-6-external-system-support"><strong>Step 6: External System Support</strong></h2><p>We confirmed that files were successfully uploaded to the external Revenue Recognition System for the production region on a daily basis. In cases where the file became stuck on the external SFTP server, we relied on the external third-party team to resolve the issue on their end before services could resume. These issues could occur for several reasons, such as:</p><ul><li>SFTP Server downtime or service disruption</li>
<li>Failure of the external upload job trigger after files were uploaded to the SFTP server</li>
<li>Insufficient table space in the external system</li>
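<li><p>Transient external failures like these are a good fit for retries with backoff before escalating to the third-party support team. A generic sketch, not Yelp’s actual tooling; <code>upload_fn</code> stands in for whatever performs the SFTP or API upload.</p>

```python
import time

def upload_with_retry(upload_fn, attempts=3, base_delay=1.0):
    """Call upload_fn, retrying transient failures with exponential backoff.
    Re-raises the last error so the failure can be escalated."""
    for attempt in range(attempts):
        try:
            return upload_fn()
        except OSError:                       # e.g. SFTP connection errors
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```
</li>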
</ul><p>To streamline the integration process, it is essential to maintain clear documentation and support channels to enable faster communication and issue resolution. We established internal guidelines for escalating high-priority issues and contacting the appropriate support teams.</p><h2 id="future-improvements"><strong>Future Improvements</strong></h2><p>As we continue to enhance the existing implementation, future opportunities could include:</p><ul><li>
<p>Automate test data generation: Manually creating test data is impractical due to the number of tables and unforeseeable edge cases. Automating test data creation could reduce development time and improve the overall testing process.</p>
</li>
<li>
<p>Optimize data generation pipeline: To reduce redundant contracts, discount calculations could be moved to the raw data generation phase, followed by an eligibility filter. This would enhance pipeline performance and improve report clarity.</p>
</li>
<li>
<p>Improve maintainability: Unexpected product types in reports can cause failures in external systems. Allowing users to include or exclude specific product types could ease new product onboarding, aid debugging, and prevent processing failures.</p>
</li>
<li>
<p>Enhance integrity checker: Improving discount matching could better align billed and earned revenue, increasing data accuracy.</p>
</li>
<li>
<p>Streamline third-party data upload: Creating a user interface for data uploads would give downstream teams better control across environments. Including upload history tracking could also reduce manual coordination, team dependencies, and delays.</p>
</li>
</ul><h2 id="acknowledgements"><strong>Acknowledgements</strong></h2><p>We have come a long way with this project, and it has been a valuable and rewarding experience. Special thanks to everyone on the Revenue Recognition Team for their commitment to delivering this project, and to all the stakeholders from the Commerce Platform and Financial Systems teams for their support.</p><div class="island job-posting"><h3>Join Our Team at Yelp</h3><p>We're tackling exciting challenges at Yelp. Interested in joining us? Apply now!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2025/05/revenue-automation-series-testing-an-integration-with-third-party-system.html</link>
      <guid>https://engineeringblog.yelp.com/2025/05/revenue-automation-series-testing-an-integration-with-third-party-system.html</guid>
      <pubDate>Tue, 27 May 2025 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Nrtsearch 1.0.0: Incremental Backups, Lucene 10, and More]]></title>
<description><![CDATA[<p>It has been over 3 years since we published our <a href="https://engineeringblog.yelp.com/2021/09/nrtsearch-yelps-fast-scalable-and-cost-effective-search-engine.html">Nrtsearch blog post</a> and over 4 years since we started using Nrtsearch, our Lucene-based search engine, in production. We have since migrated over 90% of Elasticsearch traffic to Nrtsearch. We are excited to announce the release of <a href="https://github.com/Yelp/nrtsearch">Nrtsearch 1.0.0</a> with several new features and improvements over the initial release.</p><p>First, a quick glossary of terms used in this post:</p><ul><li><a href="https://aws.amazon.com/ebs/">EBS</a> (Elastic Block Store): Network-attached block storage volumes in AWS.</li>
<li><a href="https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world">HNSW</a> (Hierarchical Navigable Small World): A graph-based approximate nearest neighbor search technique.</li>
<li><a href="https://lucene.apache.org/core/">Lucene</a>: An open-source search library used by Nrtsearch.</li>
<li><a href="https://aws.amazon.com/s3/">S3</a>: Cloud object storage offered in AWS.</li>
<li><a href="https://www.enterpriseintegrationpatterns.com/patterns/messaging/BroadcastAggregate.html">Scatter-gather</a>: A pattern where a request is sent to multiple nodes, and their responses are combined to create the result.</li>
<li><a href="https://lucene.apache.org/core/10_1_0/core/org/apache/lucene/codecs/lucene101/package-summary.html#Segments">Segment</a>: Sub-index in a Lucene index which can be searched independently.</li>
<li>SIMD (Single Instruction, Multiple Data): CPU instructions that perform the same operation on multiple data points, which can make some operations like vector search faster and more efficient.</li>
</ul><h2 id="incremental-backup-on-commit">Incremental Backup on Commit</h2><p>Nrtsearch now does an incremental backup to S3 on every commit, and this backup is used to start replicas. The motivation behind this change and more details are discussed below.</p><h3 id="initial-architecture">Initial Architecture</h3><p>A quick refresher of the Nrtsearch architecture from the first <a href="https://engineeringblog.yelp.com/2021/09/nrtsearch-yelps-fast-scalable-and-cost-effective-search-engine.html">Nrtsearch blog</a> - the primary flushes Lucene index segments to locally-mounted network storage (Amazon EBS in our case) when commit is called. This guarantees that all data up to the last commit is available on the EBS volume. If a primary node restarts and moves to a different instance, we don’t need to download the index to the new disk; instead, we only need to attach the EBS volume, allowing the primary node to start up in under a minute and reducing the indexing downtime. We call the backup endpoint occasionally on the primary to back up the index to S3. When replicas start, they download the most recent backup of the index from S3 and then sync the updates from the primary. When the primary indexes documents, it publishes updates to all replicas, which can then pull the latest segments from the primary to stay up to date.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2025-05-08-nrtsearch-v1-release/initial_nrtsearch_architecture.png" alt="Initial Nrtsearch architecture" /><p class="subtle-text"><small>Initial Nrtsearch architecture</small></p></div><h3 id="drawbacks-of-architecture">Drawbacks of Architecture</h3><p>This architecture mostly worked well for us, but there were some drawbacks:</p><ol><li>The EBS volume was the source of truth. If the EBS was lost, corrupted, or took too long to resize, we would have to reindex all data.</li>
<li>The EBS movement was not as smooth as expected. At times, the EBS volume would not be correctly dismounted from the old node, and then the new node would take some time to mount it.</li>
<li>Ingestion-heavy clusters would need to back up the entire index frequently so that replicas did not have to spend too much time catching up with the primary after downloading the index.</li>
</ol><h3 id="switching-to-ephemeral-local-disks-and-incremental-backup-on-commit">Switching to Ephemeral Local Disks and Incremental Backup on Commit</h3><p>To avoid these drawbacks, we wanted to use an ephemeral local disk instead of an EBS volume. There were two blockers for using local disk:</p><ol><li>Ensuring that restarting the primary wouldn’t result in losing changes made between the last backup and the most recent commit</li>
<li>Making sure the primary could download the index quickly enough to remain competitive with the speed of mounting an EBS volume.</li>
</ol><p>To ensure that the primary retained all changes until the last commit, we needed to back up the index after every commit instead of doing it periodically. To do this in a feasible manner, we switched to incrementally backing up individual files instead of creating an archive of the index. Lucene segments are immutable, so when we perform a backup, we only need to upload the new files since the last backup. On every commit, Nrtsearch checks the files in S3, determines the missing files, and uploads them. This makes a commit slightly slower, as we are now uploading files to S3 where we were previously only flushing them to EBS. The additional time is generally a few milliseconds to 20 seconds depending on the size of the data, which is small enough not to cause any issues.</p><p>To address the second blocker, we started downloading multiple files from S3 in parallel to make full use of the available network bandwidth. Combined with a local SSD, this yielded a 5x increase in download speed. With both blockers resolved, we were able to stop using EBS volumes in favor of local disks.</p><p>We still have the ability to take full consistent backups (snapshots) to use in case the index gets corrupted. Instead of having the primary perform this operation, we now directly copy the latest committed data between locations in S3. Since these full backups are not involved in replica bootstrapping, they can be less frequent than before.</p><p>The updated Nrtsearch architecture can be seen below. Large Nrtsearch indices are split into multiple clusters, and the Nrtsearch coordinator directs all requests to the correct Nrtsearch primaries and replicas. It also does scatter-gather for search requests if needed. On the ingestion side, the Nrtsearch coordinator receives index, commit, and delete requests from indexing clients, and forwards the requests to the correct Nrtsearch primaries. 
Check out <a href="https://engineeringblog.yelp.com/2023/10/coordinator-the-gateway-for-nrtsearch.html">Coordinator - The Gateway For Nrtsearch</a> blog for more information about sharding in Nrtsearch.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2025-05-08-nrtsearch-v1-release/updated_nrtsearch_architecture.png" alt="Updated Nrtsearch architecture" /><p class="subtle-text"><small>Updated Nrtsearch architecture</small></p></div><h2 id="lucene-10">Lucene 10</h2><p>We used the latest version of the Lucene library available during the initial development of Nrtsearch, version 8.4.0. We have now updated to use the latest release of Lucene 10 (10.1.0). This update includes a host of improvements, optimizations, bug fixes, and new features. The most notable new feature is vector search using the HNSW algorithm. Additionally, combined with our update to Java 21, Lucene 10 can leverage newer java features such as SIMD vector instructions and the foreign memory API.</p><h3 id="legacy-state-management">Legacy State Management</h3><p>The original management of cluster and index state was simplistic and difficult to work with.</p><p>The cluster state only contained the names of the indices that had been created. The servers did not know which of these indices should be started, so it was up to the deployment manager (<a href="https://kubernetes.io/docs/concepts/extend-kubernetes/operator/">Kubernetes operator</a>) to start the necessary indices during server bootstrapping. This resulted in more complexity for the operator.</p><p>Index state is composed of three main sections:</p><ul><li>Settings: properties that can only be configured before an index is started</li>
<li>Live settings: properties that can be updated dynamically</li>
<li>Fields: index schema specifying the field types and their properties</li>
</ul><p>The process for updating index state on a cluster was time consuming and error prone:</p><ol><li>Issue requests to the primary to change index state</li>
<li>Issue an index commit request on the primary, which makes the data and state durable on local storage (EBS)</li>
<li>Issue a backup request on the primary, which makes the latest committed state and index data durable in remote storage (S3)</li>
<li>Restart all cluster replicas to load the latest remote state and data</li>
</ol><h3 id="pain-points">Pain Points</h3><p>There were several issues with the state update process:</p><ol><li>The commit of index state and data were coupled together. This was unnecessary, since the only allowed state changes were backwards compatible with previous data. Specifically, new fields can be added to the index, but existing fields cannot be removed or modified.</li>
<li>State changes were not durable until after a commit request. This meant that an update could be lost if the primary server restarted between the state modification and the commit.</li>
<li>The local disk (EBS) on the primary was the source of truth for cluster state. However, there is only a single primary, which is unavailable during restarts and re-deployments. Since the state was sometimes inaccessible, building tools around it was difficult.</li>
<li>Needing to backup an index and restart all replicas significantly extended the time needed to fully propagate state changes across a cluster.</li>
</ol><p>Internally, when a state change was applied, it was not done in an isolated way. As a result, state values could change when sampled multiple times during the processing of a single request. This could lead to inconsistencies and more edge cases to handle.</p><h3 id="new-state-system">New State System</h3><p>The state management system was redesigned to address the above issues.</p><p>The cluster state was updated to include additional information about the indices. The ‘started’ state of each index is tracked. This allows the necessary indices to be automatically started during server bootstrapping, removing the need for external coordination. The state also contains a unique identifier for each index to isolate data/state in the case that an index is recreated.</p><p>Committing state changes is decoupled from committing data. Because of this, state changes are now committed within the life of the update request. This prevents changes made by successful requests from being lost. Clients will also no longer see state values applied to the primary that have not yet been committed.</p><p>The location of state data can be set to either local (EBS) or remote (S3). Setting the location to remote means that the local data is no longer the source of truth for state, and the primary disk no longer needs to be durable to maintain cluster state.</p><p>The ability to hot reload state was added to replicas, allowing the application of changes without needing a server restart. The state update process is now simplified to:</p><ol><li>Issue requests to the primary to change state</li><li>Hot reload state on all replicas</li></ol><p>This greatly reduces the time needed to apply a change to the whole cluster.</p><p>Internally, the index state was reworked into an immutable representation. When a change is processed, it is merged into the existing state to produce a new immutable representation. 
After the change is committed to the state backend (EBS or S3), the new state atomically replaces the reference to the old state. This prevents changes from being observed before they are committed. Client requests retrieve the current state once, and reference it for the remainder of the request. Since the state object is immutable, changes will not be visible during the processing of a single request.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2025-05-08-nrtsearch-v1-release/nrtsearch_state_management.png" alt="Legacy vs modern state management" /><p class="subtle-text"><small>Legacy vs modern state management</small></p></div><h2 id="vector-search">Vector Search</h2><p>Starting in major version 9, Lucene added support for vector data and nearest neighbor search with the HNSW algorithm. Nrtsearch leverages this API to provide kNN vector search for float and byte vector data with up to 4096 elements. A number of different similarity types are available: cosine (with or without auto normalization), dot product, Euclidean, and maximum inner product.</p><p>Nrtsearch also exposes several additional advanced features:</p><ul><li>Float vectors may be configured to use scalar quantized values for search, allowing a tradeoff between accuracy and memory usage</li>
<li>Vector search is available for fields in nested documents</li>
<li>Intra merge parallelism may be configured to speed up merging vector graph data</li>
<li>Optimized SIMD instruction support, provided by Java, can be enabled to accelerate vector computations</li>
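<li><p>For intuition, kNN vector search ranks documents by a similarity function over their vectors. Below is a brute-force Python sketch of cosine-similarity ranking; it is illustrative only, since Nrtsearch itself searches Lucene’s HNSW graphs in Java rather than scanning linearly.</p>

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length float vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def knn(query, docs, k=2):
    """Top-k doc ids by cosine similarity; brute force for illustration."""
    ranked = sorted(docs, key=lambda d: cosine(query, d["vector"]), reverse=True)
    return [d["id"] for d in ranked[:k]]
```
</li>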
</ul><p>See <a href="https://nrtsearch.readthedocs.io/en/latest/vector_search.html">Vector Search and Embeddings in Nrtsearch</a> for more details.</p><h2 id="aggregations">Aggregations</h2><p>One of the requirements for migrating our existing search applications from Elasticsearch was providing limited support for aggregations. We initially tried to replicate any needed functionality by using and extending Lucene facets. This worked for some of the simpler use cases, but was not suitable for more complex and/or nested aggregations.</p><p>There were two main issues with using faceting. First, complex aggregation logic could be added using the Nrtsearch plugin system, but this required creating custom code for every use case, which was not a scalable or maintainable process. Second, facet processing was not integrated with parallel search. Collecting and ranking documents is done in parallel by dividing the index into slices. However, facet result processing happened after collection and was single-threaded. For large indices with complex aggregations, this could noticeably add to request latency.</p><p>As an alternative to facets, we added an aggregation system that integrates with parallel search. Aggregations are tracked independently for each index slice. When a slice document is recalled and collected for ranking, it is also processed by the aggregations to update internal state (such as term document counts).</p><p>When parallel search finishes for each index slice, it performs a reduce operation to merge all the slice top hits together to form the global document ranking. This reduction also happens for the aggregations. The aggregation state from each slice is merged together to produce the global state. The global state is used to produce the aggregation results returned to the client in the search response. 
In the case where aggregations are nested, this merge happens recursively.</p><p>Currently, this system supports the following aggregations:</p><ul><li>term (text and numeric) - creates bucket results with counts of documents that contain the terms</li>
<li>filter - filter documents to nested aggregations based on a given criteria</li>
<li>top hits - top k ranked documents based on relevance score or sorting</li>
<li>min - minimum observed value</li>
<li>max - maximum observed value</li>
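<li><p>The per-slice collect-then-reduce flow described above can be sketched as follows for a term aggregation. This Python version is purely illustrative; the real implementation lives in Nrtsearch’s Java code.</p>

```python
from collections import Counter

def collect_slice(docs, field):
    """Per-slice aggregation state: term -> document count."""
    state = Counter()
    for doc in docs:
        state[doc[field]] += 1
    return state

def reduce_slices(slice_states):
    """Merge per-slice states into the global aggregation result,
    mirroring the reduce step that also merges top hits."""
    merged = Counter()
    for state in slice_states:
        merged.update(state)
    return merged
```
</li>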
</ul><h2 id="support-for-more-plugins">Support for More Plugins</h2><p>Nrtsearch is highly extensible and supports a variety of plugins, enabling customization and enhanced functionality. Some of the key plugins include:</p><ol><li>Script - custom scoring logic written in Java (simple custom logic can also be specified using Lucene JavaScript expressions without a plugin)</li>
<li>Rescorer - allows custom rescore operations to refine search results by recalculating scores for a subset of documents.</li>
<li>Highlight - provides custom highlighting to emphasize relevant sections in search results.</li>
<li>Hits Logger - facilitates custom logging of search result hits, useful for collecting data to train machine learning models.</li>
<li>Fetch Task - enables custom processing of search hits to retrieve and enrich data as needed.</li>
<li>Aggregation - custom aggregation implementation. Together, these plugins empower users to tailor Nrtsearch to their specific search and retrieval requirements.</li>
</ol><p>We have exposed more Lucene query types in our search request. We are still missing some queries though, and we plan to add more query types on an as-needed basis. You can find the currently available query types in our <a href="https://nrtsearch.readthedocs.io/en/latest/querying_nrtsearch.html">documentation</a>.</p><p>We are planning to do NRT replication via S3 instead of gRPC to allow replicas to scale without the primary being a bottleneck. We are also planning to replace virtual sharding with parallel search in a single segment, a feature added in Lucene 10.</p><div class="island job-posting"><h3>Become a Data Backend Engineer at Yelp</h3><p>Do you love building elegant and scalable systems? Interested in working on projects like Nrtsearch? Apply to become a Data Backend Engineer at Yelp.</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2025/05/nrtsearch-v1-release.html</link>
      <guid>https://engineeringblog.yelp.com/2025/05/nrtsearch-v1-release.html</guid>
      <pubDate>Thu, 08 May 2025 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Journey to Zero Trust Access]]></title>
<description><![CDATA[<h2 id="glossary">Glossary</h2><ul><li><strong>ZTA:</strong> Zero Trust Architecture</li>
<li><strong>SAML:</strong> Security Assertion Markup Language (an SSO facilitation protocol)</li>
<li><strong>Devbox:</strong> a remote server used to develop software</li>
</ul><h2 id="zero-trust-access"><strong>Zero Trust Access</strong></h2><h3 id="remote-future"><strong>Remote Future</strong></h3><p>Yelp is now a <strong>fully remote company</strong>, which means our employee base has become increasingly distributed across the world, making secure access to resources from anywhere a critical business function. Yelp historically used Ivanti Pulse Secure as its employee VPN, but the need for a more reliable solution made it clear that a change was necessary to ensure secure and consistent access to internal resources. The Corporate Systems and Client Platform Engineering teams began looking for alternative connectivity options to replace Pulse in late 2023. Early discussions raised the question of what role a VPN should play at Yelp. We knew that a large population of daily VPN users did not require full network access and needed only select web applications to perform their job duties. Work was already underway to shift less sensitive applications to alternate access methods like our mTLS-based Edge Gateway. However, this was not an immediate solution for all employees and required significant effort for widespread implementation. This led us to understand that we needed a solution that could support all Yelpers today, with the goal of reducing its use to more granular use cases in the future.</p><p>We also recognized that the vast majority of use cases that would be difficult to migrate off VPN were engineering oriented. Whether it was SSH access to a devbox or downloading files from internal servers, these activities involved complex and diverse access patterns. Additionally, they would benefit from improved throughput, something significantly constrained by the Pulse solution. Finally, we understood that Zero Trust Architecture (ZTA) was the future. 
ZTA was not only becoming an industry trend, but also aligned with our long-term goals of reducing VPN utilization and creating more fine-grained access control structures in the future, as opposed to broad, binary policies on huge subnets and network segments. Implementing a secure, modern, high-availability solution was paramount to maintaining productivity for our employees.</p><h3 id="wireguard-and-netbird"><strong>WireGuard and Netbird</strong></h3><p>WireGuard debuted in 2015, and since then it has quickly become the protocol of choice for secure network access. When alternatives to Ivanti were being evaluated, we found ourselves gravitating towards WireGuard-based solutions. To say the least, there are plenty. To narrow down our options, we started compiling a list of must-haves that we considered to be pillars of a great user experience.</p><p>The core pillars we identified as essential for a solution that met our operational and user needs are listed below. Netbird ultimately stood strong, supporting all five, which made it our solution of choice:</p><ol><li>Support for Okta as an identity and authentication provider</li>
<li>A simple and intuitive user interface</li>
<li>Open source and extensible</li>
<li>Capable of high throughput and low latency</li>
<li>Fault tolerant</li>
</ol><p>Let’s expand on some items in this feature wishlist and how Yelp engineering worked towards realizing them using Netbird and WireGuard.</p><p>At Yelp, Okta is used to employ a Zero Trust Authentication model as part of our overall Zero Trust Access strategy. Our previous solution initially used LDAP for authentication, which lacked advanced user and device trust verification. Later, we transitioned to SAML, but the implementation within Ivanti’s product led to a suboptimal user experience due to a cumbersome browser-to-VPN client handoff for session authentication. To address these issues, we sought a solution that supported OpenID Connect (OIDC), specifically with Okta integration. This approach empowers us to enforce policies that ensure only users on managed devices with a strong security posture are granted access. In today’s environment, it’s not enough to simply verify the identity of the user.</p><p>While intuitive authentication was not our sole goal, OIDC was the first step to ensure a great user experience. Yelp is proud to employ a very diverse workforce with different levels of technical expertise and abilities. Therefore, we wanted an application that was simple and intuitive. Netbird’s implementation of their client largely met this goal, but we found that simplicity is key when supporting less-technical users. We began to tailor the client experience for our specific needs, removing some of the more advanced options from the UI. Additionally, we added elements that empowered users to quickly and easily self-repair, access user-friendly documentation, and request assistance from our awesome helpdesk teams. We didn’t stop there though. We personalized the icons and added feedback for each specific stage of the connection process. 
All these modifications would not have been possible without a code base that was approachable, well thought out, and open source.</p><p>Additionally, open source products would allow us to keep a finger on the pulse of a project by tracking its commit and issue histories. What’s more, if critical security issues ever arose, we would not be beholden to the maintainers alone - we ourselves could provide fixes if need be. Open source also means Yelp has the opportunity to contribute back to the community, enhancing the software for everyone’s benefit. To date, multiple changes have been pushed upstream to Netbird’s main branch from Yelpers who encountered, debugged, and ultimately solved issues along the way.</p><p>While most users do not demand high throughput and low latency to complete their day-to-day business functions, improving these aspects was a clear quality-of-life enhancement we aimed to achieve with a new solution. Nothing drags on like downloading large logs at a snail’s pace, cloning big Git repos and watching the commits trickle in, or connecting to a terminal or remote desktop and feeling like you are moving in slow motion. Pulse, optimistically, had a peak download speed in the low tens of megabits per second for most users. With Netbird being plugged into a 10-gigabit backbone and supported by the blazing fast cryptography of WireGuard, testing showed users could achieve speeds upwards of 1 gigabit per second - mostly restricted by their home internet speed limits. Latency was also close to the pure cost of the wire users traversed, with single-digit milliseconds of overhead added by wrapping packets in the WireGuard protocol. Simply put, it was fast, and this is exactly what we wanted.</p><p>Finally, the worst user experience arises when things simply don’t work. We needed a solution that was robust and fault tolerant. Relying on users to connect to their optimal endpoint was a needless complication. 
Furthermore, outages, regardless of their cause, should not require user intervention to mitigate and ideally should be totally transparent to the end user. WireGuard’s mesh topology aspires to add this redundancy by creating multiple paths a user can take and be simultaneously connected to. Each of these paths or endpoints is referred to in Netbird as a router peer. All members of the mesh are peers, but router peers serve the special role of being able to accept and egress traffic from other peers. Clients intrinsically have a one-to-many relationship with the router peers they are permitted to use. This allows for maintenance or service interruption on one router peer without causing a user to reconnect to the network or experience noticeable degradation. Our testing with Netbird showed that a router peer that was actively handling traffic for a given peer could suddenly halt operation, and the client would experience a sub-2-second connectivity interruption while their traffic was rerouted to another host. This addressed the final pillar in our user experience aspirations, as we could respond to incidents, whether security, operational, or otherwise, with ease and confidence that Yelpers could continue working without disruption.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-04-15-journey-to-zero-trust-access/rollout-plan.png" alt="netbird-rollout-plan" /></p><h2 id="implementation-and-outlook"><strong>Implementation and Outlook</strong></h2><p>In summary, our desire to learn from our previous challenges with a legacy VPN solution led to a dramatic shift in corporate connectivity. 
We aimed to provide Yelpers with secure and consistent access to internal resources, ultimately enhancing the daily experiences of Yelpers who work tirelessly to connect people with great local businesses.</p><p>Despite encountering hurdles and bumps along the road to delivering this next generation of connectivity, the outcome shows the juice was well worth the squeeze. We look forward to our next post, where we talk about the implementation, challenges, and initial architecture of Yelp’s deployment of Netbird.</p><div class="island job-posting"><h3>Join Our Team at Yelp</h3><p>We're tackling exciting challenges at Yelp. Interested in joining us? Apply now!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2025/04/journey-to-zero-trust-access.html</link>
      <guid>https://engineeringblog.yelp.com/2025/04/journey-to-zero-trust-access.html</guid>
      <pubDate>Tue, 15 Apr 2025 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Revenue Automation Series: Building Revenue Data Pipeline]]></title>
      <description><![CDATA[<h2 id="background"><strong>Background</strong></h2><p>As Yelp’s business continues to grow, the revenue streams have become more complex due to the increased number of transactions, new products and services. These changes over time have challenged the manual processes involved in <a href="https://en.wikipedia.org/wiki/Revenue_recognition">Revenue Recognition</a>.</p><p>As described in the <a href="https://engineeringblog.yelp.com/2024/12/modernization-of-yelp's-legacy-billing-system.html">first post</a> of the Revenue Automation Series, Yelp invested significant resources in modernizing its Billing System to fulfill the pre-requisite of automating the revenue recognition process.</p><p>In this blog, we would like to share how we built the Revenue Data Pipeline that facilitates the third party integration with <strong>a Revenue Recognition SaaS solution</strong>, referred to hereafter as the <strong>REVREC</strong> <strong>service</strong>.</p><h2 id="our-journey-to-automated-revenue-recognition"><strong>Our Journey to Automated Revenue Recognition</strong></h2><p>The REVREC service provides the following benefits in recognizing revenue:</p><ul><li>Recognize any <a href="https://en.wikipedia.org/wiki/Revenue_stream">revenue stream</a> with minimal cost and risk for one time purchases and subscriptions, offered at either flat rates or variable prices.</li>
<li>Unblock a continuous accounting process and reconcile revenue data in real time, so the Accounting team can <a href="https://en.wikipedia.org/wiki/Financial_close_management">close the books</a> up to 50% faster.</li>
<li>Forecast revenue in real time, which gives us access to out-of-the-box reports and dashboards for revenue analytics.</li>
</ul><p>In order to get all the benefits above, we needed to ensure the REVREC service had the right data to recognize revenue. Therefore, a centerpiece of this project was to build a data pipeline that collects and produces quality revenue data from Yelp’s ecosystem to the REVREC service.</p><h3 id="step-1-handle-ambiguous-requirement"><strong>Step 1: Handle Ambiguous Requirements</strong></h3><p>We started with a list of product requirements around a standard revenue recognition process as below:</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-02-19-revenue-automation-series-building-revenue-data-pipeline/revrec-flow.drawio.png" alt="Handling Ambiguous Requirements" /></p><p>Let’s take a look at one of the example requirements under the <strong>Define Transaction Amount</strong> section: <strong>The total contract value (TCV) should be captured during booking as revenue to be recognized based on fair value allocation by each line will be different from invoice amount.</strong></p><p>These requirements were written for accountants but were quite challenging for engineers to comprehend. In order to translate them into engineering-friendly requirements and make sure we reached cross-functional alignment, we followed the below methodology to break them down:</p><p>First, we needed to create a <strong>Glossary Dictionary</strong> that mapped business terms to engineering terms at Yelp. Having such a mapping quickly aligned the understanding among cross-functional teams. For the example requirement, the Glossary Dictionary is illustrated as below:</p><p><strong>Requirement: The total contract value (TCV) should be captured during booking as revenue to be recognized based on fair value allocation by each line will be different from invoice amount.</strong></p><p><strong>Glossary Dictionary</strong></p><ul><li><strong>Revenue Contract:</strong> A subscription order placed on Yelp’s Biz Site.</li>
<li><strong>Booking:</strong> The moment when a user completes a subscription purchase request.</li>
<li><strong>Fair Value:</strong> The gross amount if we sell this product standalone.</li>
<li><strong>Each Line:</strong> The minimum granularity at which we track fulfillment and billing of a product feature.</li>
<li><strong>Invoice Amount:</strong> The net amount on the ledger for a product feature in a revenue period.</li>
</ul><p>Second, we needed to extract the <strong>purpose</strong> of this requirement so that we understood why we needed certain data in the first place and what data was closest to the business need. It was also helpful to explain it with an <strong>example calculation</strong>.</p><p><strong>Purpose:</strong></p><ul><li><strong>Make sure revenue allocation is based on the price at which we would sell this product alone, and not based on how much we actually bill for this product.</strong></li>
<li><strong>Example Calculation:</strong>
<ul><li><strong>Product A</strong> standalone selling price is $100</li>
<li><strong>Product B</strong> standalone selling price is $20</li>
<li>A subscription bundle of <strong>Products A and B</strong> sells Product A for $100 and Product B as a free add-on.</li>
<li>The subscription order’s total revenue of $100 should be split based on the standalone selling prices: A’s share is 100/120 * 100 = $83.333 and B’s share is 20/120 * 100 = $16.667</li>
</ul></li>
</ul><p>Finally, we translated the original requirement into an engineering-friendly requirement as below:</p><p><strong>The REVREC service requires a purchased product’s gross amount to be sent over whenever a user completes a subscription order; this amount may differ from the amount we actually bill the user.</strong></p><h3 id="step-2-data-gap-analysis"><strong>Step 2: Data Gap Analysis</strong></h3><p>The project team also faced challenges integrating Yelp’s custom order-to-cash system (details can be found in this <a href="https://engineeringblog.yelp.com/2024/12/modernization-of-yelp's-legacy-billing-system.html">blog post</a>) with a standard ETL (Extract, Transform, and Load) architecture that usually works well in such projects. This led to a data gap analysis to align Yelp’s data with the required template for integration.</p><p>Some of the main challenges and the solutions included:</p><ul><li><strong>Data Gaps</strong>
<ul><li>Issue: No direct mapping between fields in Yelp’s system and the 3rd Party System.</li>
<li>Immediate Solutions:
<ul><li>Use approximations, e.g., using gross billing amounts as list prices.</li>
<li>Composite data, e.g., creating unique monthly contract identifiers by combining product unique IDs with revenue periods.</li>
</ul></li>
<li>Long-term Solution: Develop a Product Catalog system for common product attributes.</li>
</ul></li>
<li><strong>Inconsistent Product Implementation</strong>
<ul><li>Issue: Data attributes were scattered across various databases for different products.</li>
<li>Immediate Solutions:
<ul><li>Pre-process data from different tables into a unified schema.</li>
<li>Categorize and preprocess data by type (e.g., subscription, billing, fulfillment).</li>
</ul></li>
<li>Long-term Solution: Propose unification of billing data models into a centralized schema for future needs.</li>
</ul></li>
</ul><p>The outcome of the data gap analysis not only clarified the immediate solutions to support automated revenue recognition with status quo data, but also gave us better direction on long-term investments that would bring the custom order-to-cash system closer to industry standards so that the data mapping would be more straightforward.</p><h3 id="step-3-system-design-evaluations"><strong>Step 3: System Design Evaluations</strong></h3><p>At Yelp, there are many options for processing, streaming, and storing data at a large scale. We considered the available options and evaluated their pros and cons before picking the option that would suit our requirements.</p><p>We had the following design choices for the storage and data processing framework:</p><p><strong>Option 1: MySQL + Python Batch</strong>:</p><ul><li>Traditional method for generating financial reports.</li>
<li>Rejected due to inconsistent rerun results from changing production data and slow batch processing times during peak data volumes.</li>
</ul><p><strong>Option 2</strong>: <strong>Data Warehouse + <a href="https://www.getdbt.com/product/what-is-dbt">dbt</a></strong>:</p><ul><li>Uses SQL for data transformation, allowing non-engineers to update jobs.</li>
<li>Rejected due to difficulty representing complex logic in SQL and insufficient production use cases for confidence in its reliability.</li>
</ul><p><strong>Option 3</strong>: <strong>Event Streams + Stream Processing</strong>:</p><ul><li>Proven technology with near-real-time data processing capability.</li>
<li>Rejected because immediate data presentation isn’t necessary, and third-party interfaces don’t support stream integration, adding complexity without benefits.</li>
</ul><p><strong>Option 4</strong>: <strong>Data Lake + Spark ETL</strong>:</p><ul><li>MySQL tables are snapshotted daily and stored in a Data Lake, then processed with Spark ETL.</li>
<li>Preferred choice, benefits include independent reproducibility of data sources and processing, scalability during peak times, and strong community support.</li>
</ul><p>Ultimately, we chose the <strong>Data Lake + Spark ETL</strong> option. However, it presented challenges such as managing complex data processing DAGs and translating existing business logic from Python to PySpark efficiently. We will discuss how those challenges are addressed in the following section.</p><h2 id="address-technical-challenges"><strong>Address Technical Challenges</strong></h2><h3 id="manage-complex-spark-etl-pipeline"><strong>Manage Complex Spark ETL Pipeline</strong></h3><p>Reporting Revenue Contract data requires comprehensive data from different sources, including raw MySQL table snapshots and pre-computed data sources from external systems.</p><p>It also involves multiple stages of data transformation, where we need to aggregate data into categories, add additional fields by applying transformation logic, and then join and map them into final data templates. You can get a sense of the Spark ETL pipeline by viewing a simplified diagram below:</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-02-19-revenue-automation-series-building-revenue-data-pipeline/simplified_revpro_diagram.png" alt="Spark ETL Pipeline Structure" /> Managing such complex ETL pipelines can be a daunting task, especially when dealing with intricate revenue recognition logic. At Yelp, we used an internal package called <em>spark-etl</em> to streamline this process. In this section, we’ll explain how spark-etl helps us manage and maintain our ETL pipelines effectively.</p><h4 id="building-blocks---spark-features"><strong>Building Blocks - Spark Features</strong></h4><p>The building blocks of a spark-etl program are Spark features, which define the input, transformation logic, and output. 
These features resemble web APIs with their request-response schemas.</p><p>In our design, we classified Spark features into two categories: source data snapshot features and transformation features.</p><h4 id="source-data-snapshot-features"><strong>Source Data Snapshot Features</strong></h4><p>Source data snapshot features read database snapshots from S3 and pass the data downstream without any transformation. This raw data can then be reused by various transformation features. Here’s an example of how source data is retrieved from an S3 location:</p><div class="language-py highlighter-rouge highlight"><pre>class ARandomDatabaseTableSnapshotFeature(SparkFeature):
    alias = f"{TABLE_NAME}_snapshot"
    def __init__(self) -&gt; None:
        self.sources = {
            TABLE_NAME: S3PublishedSource(
                base_s3_path=get_s3_location_for_table(TABLE_NAME),
                source_schema_id=get_schema_ids_for_data_snapshot(TABLE_NAME),
                date_col="_dt",
                select_cols=TABLE_SCHEMA,
            )
        }
    def transform(
        self, spark: SparkSession, start_date: date, end_date: date, **kwargs: DataFrame
    ) -&gt; DataFrame:
        return kwargs[TABLE_NAME]
</pre></div><h4 id="transformation-features"><strong>Transformation Features</strong></h4><p>Transformation features take in source data snapshot features or other transformation features as pyspark.DataFrame objects. They perform various transformations like projection, filtering, joining, or applying user-defined functions. Here’s an example of pyspark.DataFrame operations in a transform function:</p><div class="language-py highlighter-rouge highlight"><pre>class ARandomTransformationFeature(SparkFeature):
    def __init__(self) -&gt; None:
        self.sources = {
            "feature_x": ConfiguredSparkFeature(),
            "feature_y": ConfiguredSparkFeature(),
        }
    def transform(
        self, spark: SparkSession, start_date: date, end_date: date, **kwargs: DataFrame
    ) -&gt; DataFrame:
        feature_x = kwargs["feature_x"]
        feature_y = kwargs["feature_y"]
        # Transform DataFrame based on needs
        feature_x = feature_x.withColumn(
            "is_flag", lit(False).cast(BooleanType())
        )
        feature_y = feature_y.withColumn(
            "time_changed", lit(None).cast(IntegerType())
        )
        aggregated_feature = feature_x.unionByName(feature_y).drop("alignment")
        return aggregated_feature.filter(
            (
                aggregated_feature.active_period_start
                &lt;= aggregated_feature.active_period_end
            )
            | aggregated_feature.active_period_end.isNull()
            | aggregated_feature.active_period_start.isNull()
        )
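
# Note on the filter above: rows with a valid active period
# (active_period_start &lt;= active_period_end) are kept, as are open-ended rows
# where either bound is NULL; only rows where both bounds are present but
# inverted (start &gt; end) are dropped.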
</pre></div><h4 id="dependency-management"><strong>Dependency Management</strong></h4><p>The dependency relationship is handled by a user-defined YAML file which contains all the related Spark features. There is no need to draw a complex diagram of dependency relationships in the YAML file. At runtime, spark-etl figures out the execution sequence according to the topology.</p><p>For instance, if the relationship is presented as the below DAG, the corresponding YAML configuration only needs to specify the nodes but not the edges to keep the configuration DRY, since the edges are already defined in SparkFeature.sources.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-02-19-revenue-automation-series-building-revenue-data-pipeline/dependency.png" alt="Complex Feature Dependency Relationships" /></p><div class="language-plaintext highlighter-rouge highlight"><pre>features:
    &lt;feature1_alias&gt;:
        class: &lt;path.to.my.Feature1Class&gt;
    &lt;feature2_alias&gt;:
        class: &lt;path.to.my.Feature2Class&gt;
    &lt;feature3_alias&gt;:
        class: &lt;path.to.my.Feature3Class&gt;
    &lt;feature4_alias&gt;:
        class: &lt;path.to.my.Feature4Class&gt;
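    # note: edges between features are omitted on purpose; at runtime spark-etl
    # infers the execution order from each SparkFeature's `sources` mapping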
publish:
    s3:
        - &lt;feature4_alias&gt;:
            path: s3a://bucket/path/to/desired/location
            overwrite: True
</pre></div><h4 id="debugging"><strong>Debugging</strong></h4><p>Given the distributed nature and <a href="https://medium.com/@john_tringham/spark-concepts-simplified-lazy-evaluation-d398891e0568">lazy evaluation</a> of Spark programs, it’s hard to set a breakpoint and debug interactively. Since this new data pipeline often deals with data frames that have thousands or even millions of rows, checkpointing intermediate data frames to a scratch path is a convenient way to inspect data for debugging and to resume the pipeline faster by specifying computationally expensive features’ paths.</p><p>Yelp’s spark-etl package enables features to be checkpointed to a scratch path, for example:</p><div class="language-plaintext highlighter-rouge highlight"><pre>spark-submit \
        /path/to/spark_etl_runner.py \
        --team-name my_team \
        --notify-email my_email@example.com \
        --feature-config /path/to/feature_config.yaml \
        --publish-path s3a://my-bucket/publish/ \
        --scratch-path s3a://my-bucket/scratch/ \
        --start-date 2024-02-29 \
        --end-date 2024-02-29 \
        --checkpoint feature1,feature2,feature3
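
# Features named via --checkpoint are persisted under --scratch-path, so later
# runs can inspect the intermediate DataFrames or resume without recomputing them.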
</pre></div><p>Then <a href="https://engineeringblog.yelp.com/2023/07/overview-of-jupyterhub-ecosystem.html">Jupyterhub</a> came in handy for reading that checkpointed data, making the debugging experience more straightforward and shareable among the team.</p><h3 id="translating-python-logic-to-pyspark"><strong>Translating Python Logic to PySpark</strong></h3><p>At Yelp, our revenue recognition process involves numerous business rules due to the variety of products we offer. Converting such logic into a PySpark transformation function requires careful design and possibly several iterations. While PySpark’s SQL-like expressions (such as selects, filters, and joins) are powerful, they may not be flexible enough for complex business logic. In such cases, PySpark UDFs (User Defined Functions) provide a more flexible solution for implementing intricate rules, like discount applications. We used PySpark UDFs for various pieces of logic that were too complex for SQL-like expressions to handle.</p><p>There are two types of UDFs in PySpark: pyspark.sql.functions.udf and pyspark.sql.functions.pandas_udf. While both serve similar purposes, this demonstration will focus on the former for simplicity.</p><h4 id="udf-example-business-rules-for-discount-application"><strong>UDF Example: Business Rules for Discount Application</strong></h4><p>Consider the following simplified business rules for applying discounts:</p><ul><li>A product can consider a discount as applicable if its active period completely covers the discount’s period.</li>
<li>Each product can receive only one discount.</li>
<li>Type A products receive discounts before Type B products.</li>
<li>If products have the same type, the one with the smaller ID receives the discount first.</li>
<li>Discounts with smaller IDs are applied first.</li>
</ul><h4 id="udf-example-implementation"><strong>UDF Example: Implementation</strong></h4><p>After grouping both products and discounts by business_id, we need to determine the discount application for each product. Here’s a Python function that encapsulates this logic, how it can be applied to the grouped data frames and how to retrieve the results:</p><div class="language-py highlighter-rouge highlight"><pre>@udf(ArrayType(DISCOUNT_APPLICATION_SCHEMA))
def calculate_discount_for_products(products, discounts):
    # Sort products and discounts based on priority
    products = sorted(products, key=lambda x: (x['type'], x['product_id']))
    # Rows are immutable inside a UDF, so copy discounts into dicts before
    # decrementing their remaining amounts below
    discounts = sorted((d.asDict() for d in discounts), key=lambda x: x['discount_id'])
    results = []
    for product in products:
        for discount in discounts:
            if period_covers(discount['period'], product['period']) and discount['amount'] &gt; 0:
                amount = min(product['budget'], discount['amount'])
                discount['amount'] -= amount
                results.append((product['product_id'], product['type'], product['period'], amount, discount['discount_id']))
                break
    return results
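
# Worked example with hypothetical values: products
#   {'product_id': 1, 'type': 'A', 'budget': 100} and
#   {'product_id': 2, 'type': 'B', 'budget': 50}
# compete for one discount {'discount_id': 10, 'amount': 120} covering both
# periods. Product 1 sorts first (Type A before B) and takes
# min(100, 120) = 100, leaving 20 on the discount; product 2 then takes
# min(50, 20) = 20.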
# Say that we have previously grouped products and discounts by business_id (code not shown here)
# Apply the UDF above
result_df = (
    grouped_contracts_with_biz_discounts.withColumn(
        "results",
        calculate_discount_for_products("products", "discounts"),
    )
)
# Then explode the discount application by selecting the key of grouping products and discounts,
# in this case business_id
result_exploded = result_df.select(
    "business_id", explode("results").alias("exploded_item")
)
# Retrieve product and the applied discount amount/id from exploded items
result_exploded = result_exploded.select(
    "business_id",
    "exploded_item.product_id",
    "exploded_item.amount",
    "exploded_item.discount_id",
)
</pre></div><p>Without using UDFs, implementing this logic would require multiple window functions, which can be hard to read and maintain. UDFs provide a more straightforward and maintainable approach to applying complex business rules in PySpark.</p><h3 id="future-improvements"><strong>Future Improvements</strong></h3><p>We understand that the solutions mentioned above don’t come without a cost:</p><ul><li>We still have 50+ features in the two-stage Spark jobs, which can be really challenging for a single team to maintain and develop in the future. Adding a new product would require changes and testing across the whole Spark job.</li>
<li>We rely heavily on UDFs for calculating revenue periods and contract amounts. These UDFs are expensive to run and can degrade the performance and reliability of the job over time.</li>
</ul><p>We are exploring future improvements to address these challenges:</p><ul><li><strong>Enhanced Data Interfaces and Ownership</strong>: Feature teams will own and manage standardized data interfaces for offline consumption, ensuring consistent data availability for analysis and reporting. These interfaces abstract implementation details, offering flexibility for teams to make changes without disrupting reporting processes.</li>
<li><strong>Simplified Data Models</strong>: Simplifying source data models minimizes the need for custom UDFs, as mappings can be handled with standard PySpark functions.</li>
<li><strong>Unified Implementation</strong>: Standardizing implementation across products and leveraging high-level data interfaces reduces input tables, streamlining Spark feature topology and lowering maintenance overhead.</li>
</ul><h2 id="more-on-revenue-automation-series"><strong>More on Revenue Automation Series</strong></h2><p>In this blog, we talked about how we handled ambiguous requirements in building a data pipeline for automating revenue recognition at Yelp. We also discussed system design decisions we made and technical challenges we addressed.</p><p>In the next article in the series, we will discuss topics related to ensuring data integrity, reconciling data discrepancies, and all the learnings from working with third-party systems. Stay tuned!</p><h2 id="acknowledgement"><strong>Acknowledgement</strong></h2><p>This is a multi-year, cross-organization project that relied on all teams’ tenacity and collaboration to scope, design, implement, test, and take it to the finish line. As of today, we have automated nearly all of Yelp’s revenue using this new system. A huge thank you to everyone in the project team for the great work. Special thanks for the support from Commerce Platform and Financial Systems Leadership and all stakeholder teams involved in this project.</p><div class="island job-posting"><h3>Join Our Team at Yelp</h3><p>We're tackling exciting challenges at Yelp. Interested in joining us? Apply now!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2025/02/revenue-automation-series-building-revenue-data-pipeline.html</link>
      <guid>https://engineeringblog.yelp.com/2025/02/revenue-automation-series-building-revenue-data-pipeline.html</guid>
      <pubDate>Wed, 19 Feb 2025 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Search Query Understanding with LLMs: From Ideation to Production]]></title>
      <description><![CDATA[<p>From the moment a user enters a search query to when we present a list of results, understanding the user’s intent is crucial for meeting their needs. Were they looking for a general category of business for that evening, a particular dish or service, or one specific business nearby? Does the query contain nuanced location or attribute information? Is the query misspelled? Is their phrasing unusual, so that it might not align well with our business data? All of the above questions represent Natural Language Understanding tasks where Large Language Models (LLMs) might well do better than traditional techniques. In this blog post, we detail our development process and elucidate the steps we’ve taken at Yelp to enhance our query understanding using LLMs, from ideation to full-scale rollouts in production.</p><p>Yelp has integrated Large Language Models (LLMs) into a wide array of features, from creating <a href="https://blog.yelp.com/news/winter-product-release-2024/">business summaries</a><sup id="fnref:1"><a href="https://engineeringblog.yelp.com/2025/02/search-query-understanding-with-LLMs.html#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> that highlight what a business is best known for based on first-hand reviews, to <a href="https://blog.yelp.com/news/spring-product-release-2024/">Yelp Assistant</a><sup id="fnref:2"><a href="https://engineeringblog.yelp.com/2025/02/search-query-understanding-with-LLMs.html#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> which intelligently guides consumers through the process of requesting a quote from a service provider with personalized relevant questions. Among these applications, query understanding was the pioneering project and has become the most refined, laying the groundwork for Yelp’s innovative use of LLMs to improve user search experiences. 
In particular, query understanding tasks such as spelling correction, segmentation, canonicalization, and review highlighting all share a few common and advantageous features: (1) they can be cached at the query level, (2) the amount of text to be read and generated is relatively low, and (3) the query distribution follows a power law, meaning a small number of queries are very popular. These features make query understanding a particularly efficient ground for applying LLMs.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-01-31-search-query-understanding-with-LLMs/image0.png" alt="Running examples" /></p><p>In this post, we will discuss our generic approach for leveraging LLMs across these query understanding tasks. To showcase this approach, we use the following two running examples:</p><ul><li><strong>Query Segmentation</strong>: Given a query, we want to segment and label semantic parts of that query. For example, “pet-friendly sf restaurants open now” might have the following segmentation: {topic} pet-friendly {location} sf {topic} restaurants {time} open now. This can be used to further refine the search location when suitable, implicitly rewriting the geographic bounding box (geobox) to match the user’s intent.</li>
<li><strong>Review Highlights</strong>: Given a query, we want a creatively expanded list of phrases to match on – particularly to help us find interesting “review snippets” for each business. Review snippets help the user see how each shown business is relevant to their search query. For example, if a user searched for “dinner before a broadway show,” bolding the phrase “pre-show dinner” in a short review snippet is a very helpful hint for their decision making.</li>
</ul><p>Yelp had older pre-LLM systems for both of these tasks, but they were fragmented (i.e. several different systems stitched together) and often lacked intelligence, leaving room for improvement to provide an exceptional user experience. As we progress, we’ll continue to refer to these examples to highlight our path from conceptualization to full-scale production rollouts.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-01-31-search-query-understanding-with-LLMs/image1.png" alt="Running examples" /><em><strong>[Figure 1] Generic approach for leveraging LLMs across two query understanding tasks.</strong> We formulate and scope the use case for LLMs, build and validate a small proof of concept, and then aggressively scale if the POC indicates a positive impact.</em></p><p>In this step, our initial goals are to: (1) determine if an LLM is the appropriate tool for the problem, (2) define the ideal scope and output format for the task, and (3) assess the feasibility of combining multiple tasks into a single prompt. Here, we also consider the potential for Retrieval Augmented Generation (RAG) to assist in the task, by identifying extra information (besides the query text) that could help the model make better decisions. This typically entails quick prototyping with the most powerful LLM available to us, such as the latest stable GPT-4 model, and creating many iterations of the prompt. At this stage, we also welcome changes to the task’s formulation itself, as we gain a deeper understanding of how the LLM perceives the task.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-01-31-search-query-understanding-with-LLMs/image2.png" alt="Formulation" /></p><h2 id="query-segmentation">Query Segmentation</h2><p>Compared to traditional Named Entity Recognition techniques, LLMs excel at segmentation tasks and are flexible enough to allow for easy customization of the individual classes. 
After several iterations, we settled on six classes for query segmentation: topic, name, location, time, question, and none. This involved a number of small but important decisions:</p><ol><li>Our legacy models had several subclasses all akin to “topic,” but this would have required the LLM to understand intricate details of our internal taxonomy that are both unintuitive and subject to change.</li>
<li>We introduced a new “question” tag for searches that want an answer beyond just “a list of businesses.” For example, the query “magic kingdom upcoming events” might be classified as {name} magic kingdom {question} upcoming events.</li>
<li>We aligned the model outputs with potential downstream applications that can benefit from a more intelligent labeling of these tags, such as implicit location rewrite, improved name intent detection, and more accurate auto-enabled filters.</li>
</ol><p><strong>Few-shot examples within query segmentation prompt:</strong></p><div class="language-plaintext highlighter-rouge highlight"><pre>1) chicago riverwalk hotels
   =&gt; {location} chicago riverwalk {topic} hotels
2) grand chicago riverwalk hotel
   =&gt; {name} grand chicago riverwalk hotel
3) healthy fod near me
   =&gt; {topic} healthy food {location} near me [spell corrected - high]
</pre></div><p>We also took note that spell correction is not only a prerequisite for segmentation, but is also a conceptually related task. Throughout the process we learned that spell-correction and segmentation can be done together by a sufficiently powerful model, so we added a meta tag to mark spell corrected sections and decided to combine these two tasks into a single prompt. On the RAG side, we augment the input query text with the names of businesses that have been viewed for that query. This helps the model learn and distinguish the many facets of business names from common topics, locations, and misspellings. This is highly useful for both segmentation and spell correction (so was another reason for combining the two tasks).</p><p><strong>RAG examples (using businesses names):</strong></p><div class="language-plaintext highlighter-rouge highlight"><pre>1) barber open sunday [Fade Masters, Doug's Barber Shop]
   =&gt; {topic} barber {time} open sunday
2) buon cuon [Banh Cuon Tay Ho, Phuong Nga Banh Cuon]
   =&gt; {topic} banh cuon [spell corrected - high]
</pre></div><h2 id="review-highlights">Review Highlights</h2><p>LLMs also excel at creative tasks by expanding on concepts using their world knowledge. In this task, we used the LLM to generate terms that are suitable to be highlighted, and we agreed on a low bar for inclusion – opting to include any phrase that would be better than showing no snippet at all.</p><p>The hardest part of this task was devising great examples of phrase lists. Using only the words in the query text, we have very limited options as to what to highlight in the review snippet. Not only that, there are also many subtleties within this complex task that make it difficult for traditional text similarity models to solve, such as:</p><ol><li>What does a query mean in the context of Yelp - regarding reservations and pick ups, food near me searches, and/or Yelp guaranteed searches for services.</li>
<li>If a user searches for seafood, it would be too limited to only consider reviews containing the “seafood” term. However, we can highlight adjacent terms such as “fresh fish,” “fresh catch,” “salmon roe,” “shrimp,” etc., which are interesting and sufficiently relevant to the business.</li>
<li>In contrast, we might also need to go up the semantic tree when appropriate, generalizing searches like “vegan burritos” to “vegan,” “vegan options,” and so on.</li>
<li>Generating multi-word or casual phrases like “watch the game,” which are highly relevant to searches like “best bar to watch lions games.”</li>
<li>Being cognizant of whether phrases generated are likely to produce spurious matches in actual reviews, or what to prioritize when a query contains multiple concepts, such as “ayce korean bbq for under $10 near me.”</li>
</ol><p>In essence, the way we define phrase expansions requires critical reasoning to resolve such subtleties, and we taught the LLMs to replicate that thought process through carefully curated examples.</p><p>On the RAG side, we enhanced the input raw query text with the most relevant business categories with respect to that query (from our in-house predictive model). This helps the LLM to generate more relevant phrases for our needs, especially for searches with a non-obvious topic (like the name of a specific restaurant) or ambiguous searches (like pool - swimming vs billiards).</p><p><strong>Evolutions of curated examples within review highlighting prompt:</strong></p><div class="language-plaintext highlighter-rouge highlight"><pre>May 2022
Query: healthy food
-&gt; Key concepts: healthy food, healthy, organic
March 2023
healthy food
-&gt; healthy food, healthy, organic, low calorie, low carb
September 2023
healthy food
-&gt; healthy food, healthy options, healthy | nutritious, organic, low calorie, low carb, low fat, high fiber | fresh, plant-based, superfood
December 2023 (with RAG)
search: healthy food, categories: [healthmarkets, vegan, vegetarian, organicstores]
-&gt; healthy food, healthy options, healthy | nutritious, organic, low calorie, low carb, low fat, high fiber | fresh, plant-based, superfood
</pre></div><p>After formulating the task and defining our input/output formats, our focus shifts to building a proof of concept to demonstrate the effectiveness of the new approach in practice. Up to this point, we iterated on our ideas using the most powerful LLM model available, which typically entails significant latency and cost. However, this setup is not conducive to a real-time system dealing with a vast array of distinct queries.</p><p>To address this challenge, we leverage the fact that distribution of query frequencies can be estimated by the power-law. By caching (pre-computing) high-end LLM responses for only head queries above a certain frequency threshold, we can effectively cover a substantial portion of the traffic and run a quick experiment without incurring significant cost or latency. We then integrated the cached LLM responses to the existing system and performed offline and online (A/B) evaluations.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-01-31-search-query-understanding-with-LLMs/image3.png" alt="Proof of Concept" /></p><h2 id="query-segmentation-1">Query Segmentation</h2><p>To evaluate offline, we observed the impact of the new segmentation on downstream tasks as well as specialized datasets. We compared the accuracy of LLM provided segmentation with the status quo system on human labeled datasets of name match and location intent. Among the different applications of this segmentation signal, we were able to (a) leverage token probabilities for (name) tags to improve our query to business name matching and ranking system and (b) achieve online metric wins with implicit location rewrite using the (location) tags.</p><table><thead><tr><th class="c1"><strong>Original Query Text</strong></th>
<th class="c1"><strong>Original Location</strong></th>
<th class="c1"><strong>Rewritten Location (only used by search backend)</strong></th>
</tr></thead><tbody><tr><td class="c2">Restaurants near Chase Center</td>
<td class="c2">San Francisco, CA</td>
<td class="c2">1 Warriors Way, San Francisco, CA 94158</td>
</tr><tr><td class="c2">Ramen Upper West Side</td>
<td class="c2">New York, NY</td>
<td class="c2">Upper West Side, Manhattan, NY</td>
</tr><tr><td class="c2">Epcot restaurants</td>
<td class="c2">Orlando, FL</td>
<td class="c2">Epcot, Bay Lake, FL</td>
</tr></tbody></table><h4 id="status-quo">Status Quo</h4><p><img src="https://engineeringblog.yelp.com/images/posts/2025-01-31-search-query-understanding-with-LLMs/image4.png" alt="Status Quo" /></p><h4 id="rewritten">Rewritten</h4><p><img src="https://engineeringblog.yelp.com/images/posts/2025-01-31-search-query-understanding-with-LLMs/image5a.png" alt="Rewritten" /><em><strong>[Figure 2] Better understanding of location intent lets us return more relevant results to the users.</strong> One of our POCs leverages query segmentation to implicitly rewrite text within location boxes to a refined location within 30 miles of the user’s search if we have high confidence in the location intent. For example, the segmentation “epcot restaurants =&gt; {location} epcot {topic} restaurant” helps us to understand the user’s intent in finding businesses within the Epcot theme park at Walt Disney World. By implicitly rewriting the location text from “Orlando, FL” to “Epcot” in the search backend, our geolocation system was able to narrow down the search geobox to the relevant latlong.</em></p><h2 id="review-highlights-1">Review Highlights</h2><p>Offline evaluation of the quality of generated phrases is subjective and requires very strong human annotators with good product, qualitative, and engineering understanding. We cross-checked the opinions of human annotators and also applied quantitative checks to the snippets, such as measuring how common the generated phrases are. After a thorough review, we carried out online A/B experiments using the new highlight phrases.</p>
By highlighting the relevant phrase to the user’s query, we increased Session / Search CTR across our platforms. Further iteration from GPT3 to GPT4 also improved Search CTR on top of previous gains. The impact was also higher for less common queries in the tail query range. A large portion of the wins came from incremental quality improvement as we addressed all of the nuances listed above for the task.</em></p><p>If an online experiment for the proof of concept indicates a meaningful positive impact, it’s time to improve the model, and also expand its utilization to a larger volume of queries. However, scaling to millions of queries (or to a real-time model, in order to support never-before-seen queries) poses cost and infra challenges. For example, we’re building out new signal datastores to support larger pre-computed signals. And though we’d love for next-gen technology to work on all searches, the investment can get harder to justify. Particularly, given the distribution of queries, scaling up to millions of queries may require a disproportionately high investment to achieve only a marginal increase in traffic coverage.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-01-31-search-query-understanding-with-LLMs/image7.png" alt="Scaling Up" /></p><p>Furthermore, as queries get further into the long tail, understanding user intent also becomes more challenging. So to scale up effectively, we need a more precise model that is also more cost-effective. At the moment, we’ve landed on a multi-step process for scaling from the prototype stage to a model that serves 100% of traffic:</p><ol><li>
<p><strong>Iterate on the prompt using the “expensive” model (GPT-4/o1).</strong> This is mainly testing the prompt against real or contrived example inputs, looking for errors that could be teachable moments, and then augmenting the examples in the prompt. One approach we used to narrow the search for problematic responses was tracking query-level metrics to find queries with nontrivial traffic whose metrics were clearly worse than the status quo.</p>
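<p>As a rough illustration of this metric-based triage (a minimal sketch with hypothetical field names and thresholds, not our actual pipeline):</p>

```python
# Sketch: flag queries whose metric clearly regressed versus the status quo,
# ignoring low-traffic noise. Field names and thresholds are hypothetical.

def flag_problem_queries(rows, min_traffic=1000, max_rel_drop=-0.05):
    """rows: iterable of (query, traffic, metric_new, metric_baseline)."""
    flagged = []
    for query, traffic, new, base in rows:
        if traffic < min_traffic or base == 0:
            continue  # too little traffic to trust the comparison
        rel_change = (new - base) / base
        if rel_change < max_rel_drop:  # clearly worse than status quo
            flagged.append((query, rel_change))
    return sorted(flagged, key=lambda x: x[1])  # worst regressions first

examples = [
    ("plumber near me", 50000, 0.30, 0.32),    # regressed -> flagged
    ("healthy food", 80000, 0.33, 0.32),       # improved -> ignored
    ("rare misspelled qery", 12, 0.10, 0.40),  # low traffic -> ignored
]
```

<p>Each flagged query is then a candidate “teachable moment” to turn into a prompt example.</p>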
</li>
<li>
<p><strong>Create a golden dataset for fine tuning smaller models.</strong> We ran the GPT-4 prompt on a representative sample of input queries. The sample size should be large (but not unmanageably so, since quality &gt; quantity) and it should cover a diverse distribution of inputs. For newer and more complex tasks that require logical reasoning, we have begun using o1-mini and o1-preview in some use cases, depending on the difficulty of the task.</p>
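<p>One simple way to draw such a representative sample is to stratify by query frequency so the head cannot crowd out the tail. The log-frequency bucketing below is a hypothetical illustration, not our exact scheme:</p>

```python
import math
import random

# Sketch: stratified sampling of queries for a fine-tuning "golden" dataset.
# Bucketing by order of magnitude of frequency is an illustrative assumption.

def stratified_sample(query_counts, per_bucket=2, seed=0):
    """query_counts: dict of query -> observed frequency."""
    rng = random.Random(seed)
    buckets = {}
    for q, c in query_counts.items():
        buckets.setdefault(int(math.log10(max(c, 1))), []).append(q)
    sample = []
    for _, qs in sorted(buckets.items(), reverse=True):
        rng.shuffle(qs)
        sample.extend(qs[:per_bucket])  # cap per bucket so the head can't dominate
    return sample
```

<p>Capping each frequency bucket keeps the sample diverse while staying small enough that quality can beat quantity.</p>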
</li>
<li>
<p><strong>Improve the quality of the dataset if possible, prior to using it for fine tuning.</strong> With hard work here, it can be possible (for many tasks) to improve upon GPT-4’s raw output. Try to isolate sets of inputs that are likely to have been mislabeled and target these for human re-labeling or removal.</p>
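<p>The post does not spell out the exact filtering method, but one common heuristic for isolating likely mislabels is to compare labels from two independent runs (or two prompt variants) and route disagreements to human review. A minimal sketch:</p>

```python
# Sketch: agreement-based filtering of a machine-labeled dataset.
# Examples where two labeling passes agree are kept; disagreements are
# sent for human re-labeling or removal. Purely illustrative.

def split_by_agreement(labels_a, labels_b):
    agreed, review = {}, []
    for query, label in labels_a.items():
        if labels_b.get(query) == label:
            agreed[query] = label   # keep for fine-tuning
        else:
            review.append(query)    # re-label by a human, or drop
    return agreed, review
```
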
</li>
<li>
<p><strong>Fine tune a smaller model (GPT4o-mini)</strong> that we can run offline at the scale of tens of millions, and utilize this as a pre-computed cache to support the vast bulk of all traffic. Because fine-tuned query understanding models only require very short inputs and outputs, we have seen up to a 100x savings in cost, compared to using a complex GPT-4 prompt directly.</p>
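<p>The economics of this cache follow from the power-law query distribution mentioned earlier. Under a Zipf-like model (the exponent below is illustrative, not a measured value), pre-computing a modest fraction of distinct queries covers most traffic:</p>

```python
# Sketch: fraction of traffic covered by caching the most frequent queries,
# assuming query frequency ~ 1 / rank**s (a Zipf-like power law).

def zipf_coverage(n_cached, n_total, s=1.0):
    """Coverage from caching the n_cached most frequent of n_total queries."""
    weights = [1.0 / rank**s for rank in range(1, n_total + 1)]
    return sum(weights[:n_cached]) / sum(weights)
```

<p>For example, with an exponent of 1, caching the top 10% of 100 distinct queries already covers more than half of the simulated traffic; at larger scales the head’s share is similarly outsized, which is what makes the pre-computed cache so cost-effective.</p>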
</li>
<li>
<p><strong>Optionally, fine tune an even smaller model</strong> that is less expensive and fast (to run in real-time only for long-tail queries). Specifically, at Yelp, we have used BERT and T5 models for real-time serving. These models are optimized for speed and efficiency, allowing us to process user queries rapidly and accurately during the complete rollout phase. As the cost and latency of LLMs improve, as seen with GPT4o-mini and smaller prompts, real-time calls to OpenAI’s fine-tuned models may also become achievable in the near future.</p>
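<p>At serving time, the two tiers can be combined with a cache-first lookup. The sketch below uses hypothetical names; the cache stands in for the pre-computed signal store and the callable stands in for the small real-time model:</p>

```python
# Sketch: cache-first retrieval of a query understanding signal.
# Head/torso queries hit the pre-computed cache; long-tail queries fall
# back to a small real-time model. Names are hypothetical.

def get_query_signal(query, cache, realtime_model, default=None):
    key = query.strip().lower()    # normalize before lookup
    if key in cache:               # pre-computed offline by the larger model
        return cache[key]
    if realtime_model is not None: # small, fast model for unseen queries
        return realtime_model(key)
    return default
```
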
</li>
</ol><h2 id="review-highlights-2">Review Highlights</h2><p>After fine-tuning our model and validating the responses on a diverse and random test set, we scaled to 95% of traffic by pre-computing snippet expansions for those queries using OpenAI’s batch calls. The generated outputs were quality checked and uploaded to our query understanding datastores. Cache-based systems such as key/value DBs were used to improve retrieval latency due to the power law distribution of search queries. With this pre-computed signal, we further leveraged the “common sense” knowledge embedded in these expansions in other downstream tasks. For instance, we used CTR signals for the relevant expanded phrases to further refine our ranking models, and additionally used the phrases (averaged over business categories) as heuristics to get highlight phrases for the remaining 5% of traffic not covered by our pre-computations.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-01-31-search-query-understanding-with-LLMs/image8.png" alt="Review Highlights 2" /><em><strong>[Figure 4] Example of a tail query in our review highlights system.</strong> For the query “dinner before a broadway show,” the model outputs a creative list of phrases that can be used to match relevant and interesting phrases within real reviews written by real users. This not only enhances user trust by aligning with their search intentions, but also enables quick decision-making by allowing users to easily assess the experiences of others and determine the business that can meet their needs.</em></p><p>The deeper integration of LLMs into our search systems holds great potential for transforming user search experiences. As the landscape of LLMs evolves, we continue to adapt to those new capabilities, which can unlock new ways to use our content. 
For some search tasks that require complex logical reasoning, we’re starting to see large benefits in the quality of outputs generated by the latest reasoning models compared to previous generative models. As we aim to develop even more advanced use cases, considering the trade-offs, we will continue to follow a multi-step validation and gradual scaling strategy. By staying agile and responsive to these new advancements, we can better showcase and highlight the best and most authentic content from our data, enhancing the overall user experience within the app.</p><p>LLMs hold immense potential for transforming user search experiences. To realize these possibilities, a strategic approach involving ideation, proof of concept testing, and full-scale production rollout is essential. This requires continuous iteration and adaptability to new advances in foundation models, as some query understanding tasks may require more complex logical reasoning while others require a deeper knowledge base. Thus far, Yelp has successfully leveraged LLMs and our depth of content to improve the user experience and bring greater value to our business. 
We remain committed to staying at the forefront of LLM advancements and rapidly adapting these innovations to our use cases.</p><p>For more information on this topic, check out our more detailed talks at Haystack this year <sup id="fnref:3"><a href="https://engineeringblog.yelp.com/2025/02/search-query-understanding-with-LLMs.html#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup><sup id="fnref:4"><a href="https://engineeringblog.yelp.com/2025/02/search-query-understanding-with-LLMs.html#fn:4" class="footnote" rel="footnote" role="doc-noteref">4</a></sup>.</p><h2 id="acknowledgement">Acknowledgement</h2><p>The authors would like to acknowledge the Search Quality team for their exceptional contributions to this initiative, especially Cem Aksoy, Akshat Gupta, Alexander Levin, Brian Johnson, Arthur Cruz de Araujo, and Ashwani Braj. This blog reflects the collaborative effort and technical expertise that each member has brought to the table. Your dedication and innovative approach have been crucial in advancing our engineering goals. Thank you!</p><h2 id="footnotes"><strong>Footnotes</strong></h2><div class="footnotes" role="doc-endnotes"><ol><li id="fn:1">
<p><a href="https://blog.yelp.com/news/winter-product-release-2024/">Yelp releases a series of new discovery, contribution, services and AI-powered features</a> <a href="https://engineeringblog.yelp.com/2025/02/search-query-understanding-with-LLMs.html#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2">
<p><a href="https://blog.yelp.com/news/spring-product-release-2024/">Yelp launches new AI assistant to help consumers easily find and connect with service professionals</a> <a href="https://engineeringblog.yelp.com/2025/02/search-query-understanding-with-LLMs.html#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3">
<p><a href="https://haystackconf.com/eu2024/talk-11/">Relevance Proof in Yelp Search: LLM-Powered Annotations</a> <a href="https://engineeringblog.yelp.com/2025/02/search-query-understanding-with-LLMs.html#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4">
<p><a href="https://haystackconf.com/us2024/talk-12/">Search Query Understanding with LLMs: From Ideation to Production.</a> <a href="https://engineeringblog.yelp.com/2025/02/search-query-understanding-with-LLMs.html#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol></div><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp. If you're interested, apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2025/02/search-query-understanding-with-LLMs.html</link>
      <guid>https://engineeringblog.yelp.com/2025/02/search-query-understanding-with-LLMs.html</guid>
      <pubDate>Tue, 04 Feb 2025 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Enhancing Neural Network Training at Yelp: Achieving 1,400x Speedup with WideAndDeep]]></title>
<description><![CDATA[<p>At Yelp, we encountered challenges that prompted us to enhance the training time of our ad-revenue generating models, which use a <a href="https://arxiv.org/abs/1606.07792">Wide and Deep</a> Neural Network architecture for predicting ad click-through rates (pCTR). These models handle large tabular datasets with small parameter spaces, requiring innovative data solutions. This blog post delves into our journey of optimizing training time using <a href="https://www.tensorflow.org/">TensorFlow</a> and <a href="https://horovod.ai/">Horovod</a>, along with the development of ArrowStreamServer, our in-house library for low-latency data streaming and serving. Together, these components have allowed us to achieve a 1,400x speedup in training for business-critical models compared to using a single GPU with <a href="https://github.com/uber/petastorm">Petastorm</a>.</p><p>At Yelp, we encountered several challenges in optimizing the training time of our machine learning models, particularly the pCTR model. Our primary tool for machine learning tasks is Spark, and most of our datasets are stored in S3 in <a href="https://parquet.apache.org/">Parquet</a> format. Initially, training on 450 million tabular samples took 75 hours per epoch on a p3-2xlarge instance. Our goal was to scale this efficiency to 2 billion tabular samples while also achieving a per epoch training time of less than 1 hour. This required innovative solutions to address several key challenges:</p><ul><li><strong>Data Storage</strong>: Efficiently manage large<sup id="fnref:1"><a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> distributed tabular datasets stored in Parquet format on S3 to ensure compatibility with our Spark ecosystem. 
Petastorm was inefficient for our needs because of the tabular nature of our data, leading to the development of ArrowStreamServer, which improved streaming performance.</li>
<li><strong>Distributed Training</strong>: Efficiently scale across multiple GPUs by transitioning from TensorFlow’s MirroredStrategy to Horovod.</li>
</ul><h2 id="data-storage"><strong>Data Storage</strong></h2><p>We have opted to store materialized training data in a Parquet dataset on S3 for the following reasons:</p><ul><li><strong>Compatibility</strong>: It integrates seamlessly with Yelp’s Spark ecosystem.</li>
<li><strong>Efficiency</strong>: Parquet is highly compressed and optimized for I/O operations.</li>
<li><strong>Scalability</strong>: S3 offers virtually unlimited storage capacity.</li>
<li><strong>Performance</strong>: S3 can support very high throughput<sup id="fnref:2"><a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> when accessed in parallel.</li>
</ul><p>After materializing the training data in S3, it needs to be read and converted into a TensorFlow Dataset. Our first approach was to use Petastorm, which is very easy to use and integrates well with Spark. However, we found that Petastorm did not fit our use case as it was much slower when using the rebatch approach to achieve the desired batch size. This inefficiency is due to Yelp’s focus on tabular datasets with hundreds of features and hundreds of thousands of rows per row group<sup id="fnref:3"><a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup>, where unbatching causes tensor explosions and creates millions of tensors.</p><p>To address this challenge, we explored alternative solutions and discovered that <a href="https://github.com/tensorflow/io">Tensorflow I/O</a> has an implementation called ArrowStreamDataset. By using TensorFlow Dataset directly, we can avoid relying on a Python generator, which is known for <a href="https://www.tensorflow.org/guide/data#consuming_python_generators">limited portability and scalability</a> issues. Upon testing, we found that this approach performed significantly better for our use case.</p><table><thead><tr><th class="c1">Method</th>
<th class="c1">Time</th>
</tr></thead><tbody><tr><td class="c2">boto3 read raw</td>
<td class="c2">18.5 s</td>
</tr><tr><td class="c2">Petastorm(a batch per row group)</td>
<td class="c2">76.4 s</td>
</tr><tr><td class="c2">Petastorm(rebatch to 4096)</td>
<td class="c2">815 s</td>
</tr><tr><td class="c2">ArrowStream(batch to 4096)</td>
<td class="c2">19.2 s</td>
</tr><tr><td class="c2"><em>Time to convert 9 million samples</em></td>
<td class="c2"> </td>
</tr></tbody></table><p>Unfortunately, ArrowStreamDataset only supports datasets stored on local disk in <a href="https://arrow.apache.org/docs/python/feather.html">Feather</a> format, which is incompatible with our ecosystem and inhibits scaling to our necessary dataset sizes. To work with ArrowStreamDataset, and to more effectively stream data from S3, we implemented an ArrowStreamServer with PyArrow. ArrowStreamServer reads and batches datasets, serving them as RecordBatch streams over a socket. Each ArrowStreamServer runs in a separate process for better parallelism.</p><p>On the consumer side, we utilized ArrowStreamDataset to read RecordBatches from endpoints, enabling efficient batching and interleaving of datasets.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-22-enhancing-neural-network-training-at-yelp/image3.png" alt="System diagram of ArrowStream Dataset and Server" /><em>System diagram of ArrowStream Dataset and Server</em></p><h2 id="distributed-training"><strong>Distributed Training</strong></h2><p>As training data grew from hundreds of millions to billions of samples, we adopted distributed training across multiple GPUs<sup id="fnref:4"><a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fn:4" class="footnote" rel="footnote" role="doc-noteref">4</a></sup> to improve training time. Ultimately, we chose Horovod as our default distribution method.</p><p>Initially, we used TensorFlow’s built-in MirroredStrategy for distributed training. We got an almost linear speedup on 4 GPUs. However, as we scaled from 4 GPUs to 8 GPUs, we discovered that MirroredStrategy was not optimal, resulting in low GPU, CPU, and I/O utilization. 
The bottleneck was identified in the Keras data handler, which struggled with sharding the dataset across devices.</p><p>After we observed bottlenecks with TensorFlow’s built-in strategies, we decided to try Horovod’s more sophisticated distributed training capabilities. In our tests, we found Horovod provided linear performance scaling up to 8 GPUs compared to using a single GPU. We believe the reasons for Horovod’s superior performance are as follows:</p><ul><li><strong>Efficient Process Management</strong>: Horovod uses one process per device, which optimizes resource utilization and avoids contention on the Python <a href="https://wiki.python.org/moin/GlobalInterpreterLock">GIL</a>.</li>
<li><strong>Gradient Conversion</strong>: By converting sparse gradients to dense gradients, Horovod significantly improves memory efficiency and speed during all-reduce operations. We recommend converting sparse gradients to dense gradients for models with a large batch size and numerous categorical features<sup id="fnref:5"><a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fn:5" class="footnote" rel="footnote" role="doc-noteref">5</a></sup>.</li>
</ul><p>Switching to Horovod presented several challenges, primarily due to the use of Keras’s premade WideDeepModel and the complexities of resource management on a multi-GPU machine.</p><p>To support Keras, Horovod implemented DistributedOptimizer to wrap and override the <em>_compute_gradients</em> method, but Keras’s premade WideDeepModel calls <em>GradientTape</em> directly instead of calling <em>minimize</em>, in order to support its two-optimizer setup. To address this issue, we had to override the <em>train_step</em> method of the WideDeepModel.</p><p>We also encountered a “threading storm” and out-of-memory issues as thousands of threads were created. To prevent oversubscription of cores and memory, we devised a way to downsize the thread pools used by these libraries from all available resources to only the resources available per GPU.</p><table><thead><tr><th class="c1">setting</th>
<th class="c1">value</th>
</tr></thead><tbody><tr><td class="c2">tf.data.Options.threading.private_threadpool_size</td>
<td class="c2">number of cpu cores per GPU</td>
</tr><tr><td class="c2">tf.data.Options.autotune.ram_budget</td>
<td class="c2">available host memory per GPU</td>
</tr><tr><td class="c2">OMP_NUM_THREADS</td>
<td class="c2">number of cpu cores per GPU</td>
</tr><tr><td class="c2">TF_NUM_INTEROP_THREADS</td>
<td class="c2">1</td>
</tr><tr><td class="c2">TF_NUM_INTRAOP_THREADS</td>
<td class="c2">number of cpu cores per GPU</td>
</tr><tr><td class="c2"><em>Settings for splitting resources based on GPU</em></td>
<td class="c2"> </td>
</tr></tbody></table><h2 id="summarize"><strong>Summary</strong></h2><p>To put all of this together and incorporate it into Yelp’s existing Spark ML ecosystem, we designed a KerasEstimator to act as the Spark Estimator in our ML pipeline. As shown in the following figure, we first materialize transformed features into S3 and serve them with ArrowStreamServer on the training Spark Executors, then we make use of ArrowStreamDataset to stream training data from the ArrowStreamServer. TFMirrorRunner<sup id="fnref:6"><a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fn:6" class="footnote" rel="footnote" role="doc-noteref">6</a></sup> and HorovodRunner<sup id="fnref:7"><a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fn:7" class="footnote" rel="footnote" role="doc-noteref">7</a></sup>, as two concrete implementations of the SparkRunner, are used to set up and train the Keras model in Spark executors.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-22-enhancing-neural-network-training-at-yelp/image2.png" alt="Overview of KerasEstimator fits into Spark Pipeline" /><em>Overview of how the KerasEstimator fits into the Spark pipeline</em></p><h2 id="results-and-benefits"><strong>Results and Benefits</strong></h2><h3 id="performance-improvements"><strong>Performance Improvements</strong></h3><p>Our benchmarks with the pCTR model on a dataset of 2 billion samples demonstrated substantial improvements. By optimizing data storage with ArrowStream, we achieved an 85.8x speedup from a starting point of 75 hours with 450 million samples. Additionally, implementing distributed training provided a further 16.9x speedup, resulting in a total speedup of approximately 1,400x. These results underscore the effectiveness of our approach in optimizing both speed and cost. 
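The per-GPU resource split from the settings table in the previous section can be sketched as follows (a minimal illustration with hypothetical host sizes; real values would come from the training machine):

```python
import os

# Sketch of the per-GPU resource split from the settings table.
# The helper name and the host sizes below are illustrative.
def per_gpu_settings(cpu_cores: int, host_memory_bytes: int, num_gpus: int) -> dict:
    cores_per_gpu = cpu_cores // num_gpus
    return {
        # tf.data.Options.threading.private_threadpool_size
        "private_threadpool_size": cores_per_gpu,
        # tf.data.Options.autotune.ram_budget
        "ram_budget": host_memory_bytes // num_gpus,
        "OMP_NUM_THREADS": cores_per_gpu,
        "TF_NUM_INTEROP_THREADS": 1,
        "TF_NUM_INTRAOP_THREADS": cores_per_gpu,
    }

# Example: a 32-core host with 244 GiB of RAM and 4 GPUs.
settings = per_gpu_settings(32, 244 * 2**30, 4)
for var in ("OMP_NUM_THREADS", "TF_NUM_INTEROP_THREADS", "TF_NUM_INTRAOP_THREADS"):
    # Each per-GPU worker process exports its own copy of these variables.
    os.environ[var] = str(settings[var])
```

The key idea is that every worker process sees only its share of the host, so no library can spawn threads or reserve memory against the full machine.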
<img src="https://engineeringblog.yelp.com/images/posts/2024-11-22-enhancing-neural-network-training-at-yelp/image1.png" alt="" /></p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-22-enhancing-neural-network-training-at-yelp/image5.png" alt="" /></p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-22-enhancing-neural-network-training-at-yelp/image6.png" alt="" /><em>note: all tests are done using AWS P3 instances except A10G, which is a G5dn instance.</em></p><h2 id="conclusion"><strong>Conclusion</strong></h2><p>Our transition to using TensorFlow with Horovod has significantly accelerated our machine learning training processes, reducing costs and improving developer velocity. This approach not only addresses the challenges we faced but also sets a foundation for future scalability and efficiency improvements. IO is often the bottleneck for Neural Network training, especially for tabular datasets and relatively simple models. Improving IO equates to improving training performance. For training small to medium-sized models, AWS G series instances often outperform P series instances especially if we take cost into account.</p><h2 id="acknowledgements"><strong>Acknowledgements</strong></h2><p>Thanks Nathan Sponberg for implementing the KerasEstimator.</p><h2 id="footnotes"><strong>Footnotes</strong></h2><div class="footnotes" role="doc-endnotes"><ol><li id="fn:1">
<p>Up to 10s of terabytes. <a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2">
<p>AWS S3 can provide up to 100Gb/s to a single EC2 instance. <a href="https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/introduction.html">Introduction - Best Practices Design Patterns: Optimizing Amazon S3 Performance</a> <a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3">
<p>Parquet row group is the default batch size in Petastorm <a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4">
<p>We are using p3.8xlarge and p3.16xlarge <a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5">
<p>In our case we have hundreds of categorical features and a batch size of 32K-256K. <a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6">
<p>Wrapper for Tensorflow MirrorStrategy on Spark <a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7">
<p>Wrapper for Horovod on Spark <a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol></div><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp. If you're interested, apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html</link>
      <guid>https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html</guid>
      <pubDate>Wed, 22 Jan 2025 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Revisiting Compute Scaling]]></title>
      <description><![CDATA[<p>As mentioned in our earlier blog post <a href="https://engineeringblog.yelp.com/2024/05/fine-tuning-AWS-ASGs-with-attribute-based-instance-selection.html">Fine-tuning AWS ASGs with Attribute Based Instance Selection</a>, we recently embarked on an exciting journey to enhance our Kubernetes cluster’s node autoscaler infrastructure. In this blog post, we’ll delve into the rationale behind transitioning from our internally developed Clusterman autoscaler to AWS Karpenter. Join us as we explore the reasons for our switch, address the challenges with Clusterman, and embrace the opportunities with Karpenter.</p><p>At Yelp, we used <a href="https://engineeringblog.yelp.com/2019/11/open-source-clusterman.html">Clusterman</a> to handle autoscaling of nodes in Kubernetes clusters. It is an open-source tool we initially designed for Mesos clusters and later adapted for Kubernetes. Instead of managing entire clusters, Clusterman focuses on the management of pools, wherein each pool comprises groups of nodes backed by <a href="https://docs.aws.amazon.com/autoscaling/ec2/userguide/auto-scaling-groups.html">AWS Auto Scaling Groups</a> (ASGs). These pools are governed by a config called the setpoint, representing the desired reservation ratio between “Requested Resources by Workloads” and “Total Allocatable Resources.” Clusterman actively maintains this reservation ratio by adjusting the desired capacity of ASGs. It has also got some nifty features for safely <a href="https://engineeringblog.yelp.com/2023/01/recycling-kubernetes-nodes.html">recycling nodes</a>, using custom signals to scale clusters, and a simulator to experiment with different scaling parameters.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-12-06-revisiting-compute-scaling/pool-asgs.jpg" alt="An example Clusterman pool" /></p><p>Despite its capabilities, Clusterman had its challenges. 
If your clusters are frequently scaled up or down, finding the perfect setpoint can be tricky. A lower setpoint ensures cluster stability but at a higher cost. Conversely, a higher setpoint can make the desired resource ratio difficult to maintain. When Clusterman attempted to maintain the setpoint, it would sometimes delete nodes (and the pods on those nodes) to increase the ratio, causing the deleted pods to become unschedulable due to insufficient resources in the pool. To address these unschedulable pods, Clusterman would launch new instances. However, this process could lead to an endless cycle of scaling up and down as Clusterman continuously tried to balance the cluster, resulting in inefficiencies and making it difficult to achieve a stable and cost-effective scaling strategy.</p><p>Another hitch was workload requirements. Clusterman relies on ASGs and doesn’t consider the specific needs of your pending pods. If those pods demand certain resources like ‘R category EC2s,’ whether the ASG will actually provide them is often a matter of luck. Consequently, when dealing with such specific requirements, you’d often find yourself creating new pools and specifying the instance types in the ASGs, which also meant managing another Clusterman instance for each pool. This way of managing pools and their requirements eventually increases the operational cost of running Kubernetes clusters.</p><p>Furthermore, Clusterman had speed issues. Due to its interval-based logic that iterated every ‘X’ minutes, it sometimes struggled to keep up with rapidly changing workload demands, making it less than ideal for dynamic clusters. 
Lastly, customizing recycling criteria meant tinkering with the code, making it less flexible for unique requirements like recycling <a href="https://aws.amazon.com/ec2/instance-types/g5/">G5 family instances</a>.</p><h2 id="requirements">Requirements</h2><p>As the challenges with Clusterman became increasingly apparent, we embarked on a mission to gather input from various teams that run workloads on Kubernetes. We discovered numerous specific requirements that highlighted the burden of managing Clusterman. These requirements included:</p><ul><li>The need for an autoscaler capable of identifying the availability zone of stateful pending pods’ volumes and launching new instances in the corresponding zone.</li>
<li>Diverse machine learning workloads, each with different GPU requirements.</li>
<li>The importance of accommodating workload constraints, such as topology spread constraints and affinities. The overarching theme among the teams’ requirements: ‘Find the right instances for my dynamic workload requirements’.</li>
</ul><p>In addition to these, we had our own set of demands:</p><ul><li>The ability for the autoscaler to react to pending pods in a matter of seconds.</li>
<li>The guarantee that autoscaling remains cost-efficient.</li>
</ul><h2 id="alternative-options">Alternative options</h2><p>In our quest for an alternative to Clusterman, we explored two primary options: <a href="https://github.com/kubernetes/autoscaler">Kubernetes Cluster Autoscaler</a> and <a href="https://karpenter.sh/">Karpenter</a>.</p><p>We opted against Kubernetes Cluster Autoscaler because we would have faced problems similar to those we experienced with Clusterman. Notably, it organizes nodes into groups where all nodes must be identical, posing a challenge for our diverse workloads with varying requirements.</p><p>Karpenter is an open-source node lifecycle management project built for Kubernetes, developed by AWS. Karpenter not only launches the right boxes for our requirements but also brings some features to the table:</p><ul><li><strong>Better Bin Packing:</strong> Clusterman relied on ASGs to decide instance types and sizes for workloads, which resulted in higher cost and lower resource utilization. Karpenter, on the other hand, takes a far better approach: it batches pending pods and considers their resource requirements before launching instances.</li>
<li><strong>Pool Based Segregation:</strong> Karpenter’s nodepools preserve the same pool-based segregation we already had, giving us a better migration path (as other Yelp infrastructure relies on the pool-based approach).</li>
<li><strong>Customizable TTL:</strong> Karpenter empowers you to specify a Time-to-Live (TTL) for your nodes, enabling graceful node recycling after a designated time (e.g., ‘Please recycle my nodes after 10 days’) while respecting Pod Disruption Budgets (PDBs).</li>
<li><strong>Cost Optimization:</strong> Karpenter can be your partner in cost efficiency. It offers features like:
<ul><li>Automatically deleting a node if all of its pods can run on available capacity of other nodes in the cluster.</li>
<li>Replacing on-demand instances with more cost-effective options when available.</li>
<li>Terminating a couple of instances to replace them with larger, more cost-efficient ones.</li>
</ul></li>
<li><strong>Enhanced Scheduling:</strong> Karpenter enriches nodes with useful labels, contributing to smarter workload scheduling. Now workloads don’t need to create a new pool for their specific EC2 requirements. They can just add requirements (EC2 category, family, generation etc. visit <a href="https://karpenter.sh/docs/concepts/scheduling/#well-known-labels">Well-Known Labels</a> for the list) to node selector/affinity.</li>
<li><strong>Fall-back mechanism for spot market:</strong> With Clusterman, managing periods of high spot market demand (e.g., Black Friday week) was challenging, requiring us to manually adjust our ASG specifications. But Karpenter can run spot and on-demand instances in the same pool. It will launch on-demand instances if there is no spot capacity in the region/availability zone. It helped us handle Black Friday week.</li>
<li><strong>Insightful Metrics:</strong> Karpenter provides a suite of useful metrics that give a deeper understanding of autoscaling state and compute costs; previously, understanding the cost impact of a change to our autoscaling specifications was difficult without real-time cost metrics.</li>
</ul><p>In choosing Karpenter, we found a solution that not only meets our fundamental requirements but also equips us with the tools and features to navigate the ever-evolving landscape of infrastructure management.</p><p>As we began our transition from Clusterman to Karpenter, a key consideration was Clusterman’s reliance on <a href="https://aws.amazon.com/blogs/aws/new-attribute-based-instance-type-selection-for-ec2-auto-scaling-and-ec2-fleet/">ASGs (with Attribute-Based Instance Type Selection)</a>, where you can specify a set of instance attributes that describe your compute requirements instead of manually choosing instance types. Our ASGs were attribute-based, making it relatively straightforward to convert ASG requirements to Karpenter nodepools. However, it’s essential to note that nodepools in Karpenter don’t mirror all the attributes present in ASGs. For instance, we couldn’t directly match attributes like CPU manufacturers.</p><p>Early in the migration process, we explored the possibility of transferring ownership of nodes from ASGs to Karpenter to ensure a seamless transition. Unfortunately, we discovered that this approach <a href="https://github.com/aws/karpenter-provider-aws/issues/4176">wasn’t feasible</a> with Karpenter.</p><p>Instead, we decided to gradually replace nodes by scaling down ASG capacity, which allowed Clusterman to delete nodes at a slower pace. While node deletions led to unschedulable pods in the pool, Karpenter efficiently detected them and quickly provisioned the necessary capacity, ensuring the pods were smoothly scheduled on the newly provisioned nodes.</p><p>Overall, the transition was smoother than anticipated thanks to some strategic decisions we had made previously. 
We had proactively added <a href="https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-budgets">Pod Disruption Budgets</a> (PDBs) for all our workloads to protect workloads from voluntary disruptions. These PDBs played a pivotal role in mitigating any potential issues for our workloads during the migration.</p><p>To closely monitor and track the migration process, we developed a comprehensive dashboard. This dashboard included various charts and metrics, such as:</p><ul><li>ASGs’ capacity compared to Karpenter’s capacity.</li>
<li>Hourly resource cost to keep a close eye on spending.</li>
<li>Spot interruption rate for monitoring instance stability.</li>
<li>Autoscaler spending efficiency to ensure cost-effectiveness.</li>
<li>Scaling up and down events.</li>
<li>Insights into unschedulable pods and pod scheduling times.</li>
<li>Workload error rates.</li>
</ul><h2 id="allocation-strategies-for-spot-instances">Allocation strategies for Spot Instances</h2><p>One valuable lesson we learned during the migration process revolved around allocation strategies for Spot Instances. In our previous setup, we had a few AWS Auto Scaling Groups (ASGs) configured with the lowest-price allocation strategy. However, AWS Karpenter utilizes a price-capacity-optimized allocation strategy, and what’s more, it’s not configurable.</p><p>At the outset of our migration journey, this seemingly rigid strategy prompted some concerns. We worried that it might lead to increased, less predictable costs. However, as our experience with Karpenter unfolded, we were pleasantly surprised. Its capabilities ensured not only a significant reduction in spot interruptions but also cost-effectiveness.</p><h2 id="keeping-free-room-for-critical-services-hpas">Keeping free room for critical services’ HPAs</h2><p>One key lesson we learned during our Clusterman journey was the need to keep some free resources available for our critical workloads. These workloads could require more replicas in a short amount of time, especially during sudden spikes in demand. With Clusterman, we could easily solve this by lowering the setpoint. However, AWS Karpenter, while highly efficient, didn’t provide a built-in feature to reserve free capacity explicitly.</p><p>To address this challenge, we decided to run dummy pods with a specific ‘PriorityClass.’ This PriorityClass allows the Kubernetes scheduler to preempt those dummy pods if there are any unschedulable pods, creating space for our critical services’ Horizontal Pod Autoscalers (HPAs). 
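A minimal sketch of this headroom trick, assuming a pause-container Deployment and a negative-priority PriorityClass (all names, replica counts, and resource sizes here are hypothetical, not our production manifests):

```python
# Hypothetical sketch of the placeholder-pod headroom trick: a low-priority
# PriorityClass plus pause pods that the scheduler can preempt on demand.
def headroom_manifests(replicas: int, cpu: str, memory: str) -> tuple:
    priority_class = {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": "headroom"},
        # Below the default priority (0), so any normal pending pod
        # preempts these placeholders.
        "value": -10,
        "globalDefault": False,
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "headroom-placeholder"},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": "headroom"}},
            "template": {
                "metadata": {"labels": {"app": "headroom"}},
                "spec": {
                    "priorityClassName": "headroom",
                    "containers": [{
                        "name": "pause",
                        "image": "registry.k8s.io/pause:3.9",
                        # Each placeholder reserves this much schedulable room.
                        "resources": {"requests": {"cpu": cpu, "memory": memory}},
                    }],
                },
            },
        },
    }
    return priority_class, deployment

pc, deploy = headroom_manifests(replicas=10, cpu="1", memory="2Gi")
```

When real workloads spike, the scheduler evicts the pause pods first, and the evicted placeholders themselves become pending, prompting the autoscaler to replace the buffer.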
This approach effectively ensured that we always had some buffer capacity available for our high-priority workloads.</p><h2 id="aligning-karpenter-with-cluster-configuration-practices">Aligning Karpenter with Cluster Configuration Practices</h2><p>A crucial lesson we gained from our migration experience was the significance of addressing workloads with high ephemeral-storage requirements. As stated above, our setup uses ASGs, and ASGs use launch templates for their EC2 configuration (e.g., AMI ID, network, storage). We had increased the instance root volume size in our <a href="https://docs.aws.amazon.com/autoscaling/ec2/userguide/launch-templates.html">launch templates</a>. However, Karpenter cannot discover this volume size, even though we specified our launch template name in the Karpenter configuration. Karpenter wasn’t aware of these modifications and assumed that all instances had the standard 17 GB of storage. This caused those workloads to stay unschedulable indefinitely.</p><p>This misunderstanding presented a roadblock, as Karpenter struggled to find suitable instances for workloads with high ephemeral-storage needs. To overcome this challenge, we initiated an evaluation process to explore the use of blockDeviceMappings in each NodeClass. By utilizing this Karpenter feature, we aim to provide Karpenter with the necessary information about our instances’ actual volume sizes.</p><p>We also encountered another ephemeral-storage issue with our <a href="https://flink.apache.org/">Flink</a> workloads pool. Since Flink workloads demand fast and large storage operations, we selected EC2 instances with local NVMe-based SSD block-level storage (e.g., c6id, m6id, etc.). However, Karpenter was unable to recognize the local SSD storage as ephemeral storage for pods, which blocked our Flink pool migration. 
Fortunately, the Karpenter team <a href="https://github.com/aws/karpenter-provider-aws/pull/4735">addressed</a> our concern and released a new version that introduced the <a href="https://karpenter.sh/docs/concepts/nodeclasses/#specinstancestorepolicy">instanceStorePolicy</a> setting to resolve this issue.</p><p>Another noteworthy aspect of our migration involved the realization that Karpenter wasn’t inherently aware of our kubelet settings, which reside in our configuration management system. For instance, we had specified that 2 vCPUs and 4 GB of memory should be reserved for system processes using the ‘system-reserved’ configuration. Unfortunately, Karpenter lacked awareness of these specifics, leading to miscalculations in resource allocation. Additionally, we encountered disparities between our ‘max-pods’ setting and Karpenter’s internal calculations. The nuanced differences in how our configuration management system and Karpenter interpreted these settings highlighted the need for a more seamless integration between external configurations and Karpenter’s resource management algorithms.</p><p>These experiences taught us how important it is to follow Karpenter native configurations. When we make sure that AWS Karpenter understands our workload needs and the resources we have, it can do a much better job at managing everything efficiently.</p><p>AWS Karpenter is faster than Clusterman. The key distinction lies in their approaches to resource monitoring. Clusterman relies on periodic checks (minimum 1 minute), causing delays in detecting and responding to unschedulable pods. Instead, Karpenter leverages the power of Kubernetes events, allowing it to promptly detect and react to unschedulable pods in real-time (a couple of seconds). This event-driven model significantly enhances performance, ensuring a more responsive and dynamic scaling experience.</p><p>Karpenter not only outshines Clusterman in performance but also takes the lead in scalability. 
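The contrast between interval-based polling and event-driven reaction described above can be sketched in a few lines (a stdlib-only toy model with made-up event names; Karpenter itself reacts to real Kubernetes watch events, not this code):

```python
import queue

# Toy model of event-driven scaling: react to each pod event as it
# arrives instead of re-scanning the cluster every N minutes.
def event_driven_scaler(events):
    actions = []
    while True:
        event = events.get()
        if event is None:  # sentinel: event stream closed
            break
        if event["phase"] == "Pending" and event.get("unschedulable"):
            # Karpenter-style reaction: provision capacity immediately.
            actions.append(f"provision-node-for:{event['pod']}")
    return actions

q = queue.Queue()
q.put({"pod": "flink-taskmanager-0", "phase": "Pending", "unschedulable": True})
q.put({"pod": "web-1", "phase": "Running"})
q.put(None)
print(event_driven_scaler(q))  # → ['provision-node-for:flink-taskmanager-0']
```

An interval-based loop would only notice the pending pod on its next tick (up to a full interval later), while the event-driven loop handles it in the same iteration it arrives.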
Clusterman, with its memory-intensive approach of storing all pod and node information, faces challenges as the cluster size grows. The potential for Out-of-Memory errors looms, impacting its scalability. Conversely, Karpenter adopts a more streamlined approach by storing only essential information in memory. Moreover, Karpenter avoids the performance bottleneck of reading all resources from the kube-apiserver, making it a more scalable solution as your cluster expands. This dual focus on enhanced performance and scalability positions Karpenter as a reliable and efficient choice for managing Kubernetes clusters.</p><p>We created a new metric (spending efficiency) to track the computing cost improvements during the migration. Spending efficiency is the price of running one unit of resource (CPU or memory). Karpenter improved our spending efficiency by an average of 25% across all pools.</p><p>Initially, Clusterman was an optimal and practical solution for us at Yelp, especially during our transition from Mesos to Kubernetes. At that time, extending Clusterman’s capability from Mesos to scaling Kubernetes workloads was a strategic decision. This made Clusterman the only open-source autoscaler that supported both Kubernetes and Mesos workloads, simplifying our migration process.</p><p>However, as we moved all workloads to Kubernetes, maintaining Clusterman became an overhead, and it lacked key features required to run current workloads. This was particularly true when superior open-source autoscalers such as Karpenter became available, offering more advanced features and better support for Kubernetes.</p><p>This was a team project inside Yelp’s Compute Infrastructure team. Many thanks to Ajay Pratap Singh, Max Falk and Wilmer Bandres for being part of the project, and to the many engineering teams at Yelp that contributed to making the new system a success. 
Additionally, thanks to Matthew Mead-Briggs for his managerial support.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp. If you're interested, apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2024/12/revisiting-compute-scaling.html</link>
      <guid>https://engineeringblog.yelp.com/2024/12/revisiting-compute-scaling.html</guid>
      <pubDate>Fri, 13 Dec 2024 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Revenue Automation Series: Modernizing Yelp's Legacy Billing System]]></title>
<description><![CDATA[<p>This blog focuses on how Yelp successfully implemented a multi-year, cross-organizational initiative to modernize its billing processes. The goal was to automate its revenue recognition system by enhancing integration capabilities with third-party financial systems, all while maintaining the accuracy and reliability our users expect.</p><p>When Yelp first developed its billing system a decade ago, the database design was based on the requirements known at that time. These initial choices laid the foundation for the billing system, upon which multiple Yelp systems and processes were built. However, as the company evolved, it became evident that these design choices were not ideal and led to various challenges.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-25-modernization-of-yelps-legacy-billing-system/foundational.png" alt="Source: https://xkcd.com/2347/" /></p><p>The legacy design choices of Yelp’s billing system became a significant blocker when the company sought to integrate with a third-party revenue automation tool due to data format discrepancies. This integration was essential for scaling Yelp’s revenue system to match the company’s growth, with a target completion date of July 2024. However, Yelp’s unique handling of invoices led to misalignments, much like trying to fit mismatched pieces into a jigsaw puzzle.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-25-modernization-of-yelps-legacy-billing-system/jigsaw.png" alt="Jigsaw Puzzle" /></p><p>To address these challenges, Yelp decided to overhaul its billing system to align with industry standards. This initiative required making changes to business-critical systems, where any disruption could have severe consequences, such as the inability to bill or charge customers. Executing this initiative was about as complex as changing the tire of a car while it’s still running. 
To ensure a smooth transition, Yelp developed an execution plan, coordinating efforts across multiple teams over several years.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-25-modernization-of-yelps-legacy-billing-system/tire.jpg" alt="Changing Tire" /></p><p>Yelp’s legacy billing system introduced the concept of “Invoice Obviation” around a decade ago, which caused a customer’s unpaid balance to be carried over from one invoice to the next. This concept became central to the billing behavior, with the most recent invoice reflecting the total balance of the account.</p><p>The diagram below shows how, in Yelp’s legacy billing system, a payment could only be applied to the most recent invoice, requiring a one-to-one relationship between payment and invoice.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-25-modernization-of-yelps-legacy-billing-system/yelp-payment.png" alt="Payment Application in Yelp’s Legacy State:: Created on Canva by the author" /></p><p>A few years later, we realized that standard billing systems do not roll over balances from one invoice to another, which caused Yelp’s solution to behave very differently from industry norms.</p><p>The diagram below illustrates how, in a standard billing system, a single payment can be collected once but applied to multiple invoices, allowing one-to-many relationships between a payment and invoices.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-25-modernization-of-yelps-legacy-billing-system/industry-standard-payment.png" alt="Payment Application in a Standard Billing System: Created on Canva by the author" /></p><p>This caused data discrepancies between Yelp’s system and any standard billing system, as shown in the table below:</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-25-modernization-of-yelps-legacy-billing-system/table-in-summary.png" alt="Table Summary" /></p><p>In Yelp’s legacy system, payments intended for 
January, February, and March were incorrectly applied solely to March’s invoice, resulting in the revenue being recorded in the wrong revenue period and making it hard to determine invoice aging. This misapplication affected not only payments but also other concepts like credit and debit memos. These inconsistencies also hindered Yelp’s ability to integrate with third-party revenue automation and other financial tools.</p><p>To address this, Yelp decided to invest in a more robust billing model that aligns with current requirements and supports long-term growth. The new model aims to:</p><ul><li>Eliminate the concept of invoice obviation.</li>
<li>Enable one-to-many relationships, allowing payments and credit/debit memos to be applied across multiple invoices.</li>
<li>Correct data discrepancies by ensuring accurate revenue allocation within specific revenue periods, thereby improving financial accuracy and reporting.</li>
</ul><p>This blog post does not delve into the solution itself but rather explains how Yelp implemented an execution plan and delivered this initiative at scale.</p><p>Implementing such changes required over two years of collaborative effort from a team of more than 50 people because we had to fundamentally change the foundation and then rebuild the functionality on top of it. Therefore, developing a robust execution plan was the key to achieving the successful delivery of this initiative. This blog post focuses on the approach taken to execute this massive initiative and roll it out without impacting any of Yelp’s systems, ensuring users were billed seamlessly and accurately.</p><h2 id="step-1-requirement-gathering">Step 1: Requirement Gathering</h2><p>It was crucial to ensure that the system being redesigned using significant engineering resources was not short-sighted and truly matched the needs of the business. We maintained close collaboration with business stakeholders to ensure we gathered the right long-term requirements.</p><p>Cross-functional requirement gathering often mirrors the popular childhood game of Telephone. The game illustrates how a message can end up distorted as it is conveyed from person to person. The more people involved, the higher the likelihood of distortion. A common cause of this distortion is an individual’s choice of words and the difference in how the involved parties comprehend those words. Early on, we settled on the ubiquitous language concept from Domain-Driven Design principles, modeled after pre-existing accounting terminology and relying on domain experts as the single source of truth. 
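The one-to-many payment application described earlier can be sketched as follows (a simplified illustration with made-up invoice data and helper names, not Yelp's billing code):

```python
# Illustrative sketch of one-to-many payment application: a single payment
# is collected once, then spread across open invoices, oldest first, so
# revenue lands in the correct periods instead of all on the latest invoice.
def apply_payment(payment, open_invoices):
    remaining = payment
    applications = []
    for invoice in sorted(open_invoices, key=lambda inv: inv["period"]):
        if remaining <= 0:
            break
        applied = min(remaining, invoice["balance"])
        invoice["balance"] -= applied
        remaining -= applied
        applications.append({"invoice": invoice["id"], "applied": applied})
    return applications

invoices = [
    {"id": "INV-03", "period": "2024-03", "balance": 100.0},
    {"id": "INV-01", "period": "2024-01", "balance": 100.0},
    {"id": "INV-02", "period": "2024-02", "balance": 100.0},
]
# A single $250 payment covers January and February in full and March
# partially, leaving a $50 balance only on the March invoice.
print(apply_payment(250.0, invoices))
```

Under the legacy obviation model, the same $250 would have been applied solely to the March invoice, which is exactly the discrepancy the table above illustrates.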
Armed with a common vocabulary, we were able to reach a high-level consensus on the final state of the system.</p><h2 id="step-2-target-architecture">Step 2: Target Architecture</h2><p>To meet the specific set of requirements outlined by our stakeholders, we explored a variety of third-party solutions that are widely used in the industry. However, none of the off-the-shelf products could fully satisfy all the new requirements along with Yelp’s existing use case. After careful consideration, we decided to develop a custom billing architecture tailored specifically to our business model.</p><p>We had the option to patch the existing system, which might have seemed like a simpler solution. However, this approach carried the risk of introducing unintended side effects. Instead, we took inspiration from industry standards for billing architectures to define our target state. Although the target state represented a long roadmap, we committed to moving towards it gradually through multiple projects over the following quarters and years. This strategic decision was key to successfully delivering the project.</p><h2 id="step-3-project-planning">Step 3: Project Planning</h2><h3 id="identify-dependencies-and-order-of-execution">Identify Dependencies and Order of Execution</h3><p>After defining the target architecture, we mapped out all necessary processes and their dependencies. For instance, to support account-level payment collection, it was essential to enable the new payments functionality, implement account-level refunds, and ensure that the user interfaces accurately displayed both payments and refunds at the account level. We prioritized these processes based on stakeholder and business needs. 
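Sequencing deliverables this way is essentially a topological sort over the dependency graph. The sketch below is purely illustrative (the process names are simplified, and this is not Yelp's actual planning tooling); it shows how dependency ordering guarantees that prerequisites ship first:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each process maps to the processes it depends on.
dependencies = {
    "account-level payment collection": {
        "new payments functionality",
        "account-level refunds",
        "payments/refunds UI",
    },
    "account-level refunds": {"new payments functionality"},
    "payments/refunds UI": {"new payments functionality"},
    "new payments functionality": set(),
}

# static_order() yields each process only after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # "new payments functionality" first, payment collection last
```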
All features required to support a specific process were identified as a single deliverable, helping us manage interdependencies effectively.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-25-modernization-of-yelps-legacy-billing-system/dependencies.png" alt="Project Dependencies" /></p><p>Tackling projects of this scale required changes across multiple systems and the participation of several teams. Instead of having each team take ownership of work within their respective domains, we created multiple cross-functional teams of 4-6 engineers, each assigned to an ongoing project. We called them the Tiger Teams—focused, invincible, and goal-driven! The majority of the teams lasted for the duration of a project, and at any time, 2-4 teams worked in parallel, adapting to evolving priorities.</p><h3 id="coordination-across-teams">Coordination Across Teams</h3><p>An Engineering Manager was assigned to oversee these tiger teams, managing the timelines, staffing, and prioritization. Multiple new processes were introduced to help the EM manage these cross-functional teams:</p><ul><li>Sprint planning is typically done for individual teams at Yelp. For the tiger teams, however, we developed a new sprint planning process that facilitated cross-functional collaboration.</li>
<li>A Gantt chart was maintained by the EM to track timelines, staffing, and blockers. This chart was also used to update leadership and stakeholders, building trust in the path forward and keeping them informed.</li>
<li>Regular sync-up meetings were set up to track blockers, and if any arose, the teams worked proactively to resolve them.</li>
</ul><p>The Engineering Manager’s role was crucial in ensuring that each tiger team could deliver their projects within realistic timelines.</p><h2 id="step-4-incremental-delivery">Step 4: Incremental Delivery</h2><p>We learned the true meaning of iterative development by delivering smaller changes in multiple iterations, always ensuring we provided a minimum viable product. While we sometimes compromised on the number of features delivered, we never delivered unfinished ones. Initially, we limited rollouts to a small group of customers who didn’t require advanced functionality. This approach allowed us to validate correctness in a controlled environment while continuing to support these customers as additional features were developed.</p><p>Naming our rollouts proved to be a successful strategy. We chose F1 racetrack names for the rollouts, adding an element of fun and improving communication with stakeholders. This made it easy to convey that feature X was part of the Baku rollout, while feature Y would be available soon in the Suzuka rollout.</p><p>A significant achievement was our successful collaboration with business stakeholders. The Finance teams, accustomed to a 100% cutover strategy, initially lacked experience with iterative rollouts, which are more common in engineering. They were skeptical of its reliability due to the nature of their domain. However, through effective communication and demonstration of the approach’s benefits, they became confident in and appreciative of the iterative method.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-25-modernization-of-yelps-legacy-billing-system/incremental-delivery.jpg" alt="Incremental Delivery: Created on Canva by the authors" /></p><p>A/B testing is commonly used in the industry to measure the performance of new features by subjecting users to different experiences. 
By measuring differences in key metrics between the two groups, we can ensure that new features do not negatively impact our business.</p><p>Rather than A/B testing individual features introduced in this project, we opted to incrementally release new features to a single group of users. This avoided the complications of managing multiple experiments and simplified the measurement of key metrics.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-25-modernization-of-yelps-legacy-billing-system/stack.png" alt="Iterative Rollout Strategy: Created on Canva by the authors" /></p><h2 id="step-6-user-acceptance-testing">Step 6: User Acceptance Testing</h2><p>The impact of making such a fundamental change was widespread. We were unable to rely solely on automated tests, as the technical changes affected the business processes of many non-engineering teams, including Finance, Accounting, Customer Support, Analytics, and more. Therefore, we decided to be extremely thorough with user acceptance testing, allowing us to verify the correctness of both the system and the surrounding processes.</p><p>To ensure comprehensive coverage, we involved stakeholders from all the affected teams in the user acceptance testing process. We created detailed test plans and scenarios that covered every aspect of the new billing system. Each stakeholder was responsible for verifying that their systems and processes functioned correctly and as desired.</p><p>We documented our testing process in a Google Sheet, and stakeholders were asked to execute the test cases relevant to their domain and mark them as pass or fail. 
This collaborative approach helped us identify any unintentional impact on internal processes and systems early, ensuring a smooth transition to the new system.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-25-modernization-of-yelps-legacy-billing-system/uat.png" alt="UAT: Created on Google Sheet" /></p><h2 id="step-7-system-observability">Step 7: System Observability</h2><p>Ensuring system observability was crucial to roll out a foundational change quickly to all customers to meet the company’s revenue automation timeline of July 2024. We implemented a multi-layered approach to monitor and maintain the integrity of the system:</p><ul><li>
<p>Alerts/Logging and Monitoring: We set up comprehensive alerts, logging, and monitoring dashboards to provide real-time visibility into system performance and anomalies.</p>
</li>
<li>
<p>Integrity Checkers: We developed integrity checkers as our last line of defense. These checkers were designed to continuously validate production data for consistency and report any anomalies that deviated from the expected behavior.</p>
</li>
<li>
<p>Stakeholder Dashboards: We created dashboards for the stakeholders, providing them with relevant metrics and insights. This transparency helped build trust and allowed stakeholders to monitor the progress and stability of the system.</p>
</li>
</ul><p>By combining these observability practices, we were able to ensure that any discrepancies were caught and addressed before they could impact the customers.</p><p>Even though the initiative required a massive effort from over 50 people working collaboratively for more than two years, it was successfully delivered by following the structured execution plan outlined above. Unlike traditional rollouts at Yelp for critical systems, which usually involve a gradual release, we adopted an accelerated approach to meet our tight timelines for integrating with third-party revenue automation systems. We initiated the update with a small group of customers in October 2023, and by steadily increasing the rollout pace, we achieved 100% customer adoption by July 2024.</p><p>The rollout speed was unusually high for our organization, as the billing system requires high accuracy and customers typically receive only one bill per month, providing limited opportunities to verify the updates’ correctness.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-25-modernization-of-yelps-legacy-billing-system/rollout.png" alt="Rollout" /></p><p>This execution plan allowed us to meet our deadlines while ensuring a seamless transition without compromising system integrity or functionality, thanks to the dedication and cross-functional coordination of the team.</p><p>We would like to thank everyone across multiple organizations at Yelp for their continued support and tenacity in making this a reality. Their efforts have been crucial in helping Yelp automate 90% of its revenue by strengthening the data foundation.</p><p>Keep an eye out for our next blog posts on Yelp’s integration with the third party financial tool, where we’ll dive into how we automated revenue processes once the new billing system was in place.</p><div class="island job-posting"><h3>Join Our Team at Yelp</h3><p>We're tackling exciting challenges at Yelp. Interested in joining us? 
Apply now!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2024/12/modernization-of-yelp's-legacy-billing-system.html</link>
      <guid>https://engineeringblog.yelp.com/2024/12/modernization-of-yelp's-legacy-billing-system.html</guid>
      <pubDate>Fri, 06 Dec 2024 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Loading data into Redshift with DBT]]></title>
<description><![CDATA[<p>At Yelp, we embrace innovation and thrive on exploring new possibilities. With our consumers’ ever-growing appetite for data, we recently revisited how we could load data into Redshift more efficiently. In this blog post, we explore how DBT can be used seamlessly with Redshift Spectrum to read data from Data Lake into Redshift, significantly reducing runtime, resolving data quality issues, and improving developer productivity.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-06-loading-data-into-redshift-with-dbt/image1.png" alt="architecture before" /></p><p>Our method of loading batch data into Redshift had been effective for years, but we continually sought improvements. We primarily used Spark jobs to read S3 data and publish it to our in-house Kafka-based Data Pipeline (which you can read more about <a href="https://engineeringblog.yelp.com/2016/07/billions-of-messages-a-day-yelps-real-time-data-pipeline.html">here</a>) to get data into both Data Lake and Redshift. However, we began encountering a few pain points:</p><ol><li><strong>Performance</strong>: Larger datasets (100M+ rows daily) were beginning to face delays. This was mostly due to table scans to ensure that primary keys were not being duplicated upon upserts.</li>
<li><strong>Schema changes</strong>: Most tables were configured with an <a href="https://avro.apache.org/docs/1.11.1/specification/">Avro schema</a>. Schema changes were sometimes complex, as they required a multi-step process to create and register new Avro schemas.</li>
<li><strong>Backfilling</strong>: Correcting data with backfills was poorly supported, as there was no easy way to modify rows in-place. We often resorted to manually deleting data before writing the corrected data for the entire partition.</li>
<li><strong>Data quality</strong>: Writing to Data Lake and Redshift in parallel posed a risk of data divergence, such as differences in data typing between the two data stores.</li>
</ol><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-06-loading-data-into-redshift-with-dbt/image2.png" alt="architecture after" /></p><p>When considering how to move data around more efficiently, we chose to leverage <a href="https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-overview.html">AWS Redshift Spectrum</a>, a tool built specifically to make it possible to query Data Lake data from Redshift. Since Data Lake tables usually had the most up-to-date schemas, we decided to use the Data Lake as the data source instead of S3 for our Redshift batches. Not only did this help reduce data divergence, it also aligned with our best practice of treating the Data Lake as the single source of truth.</p><p>For implementation, Spectrum requires a defined schema, which already exists in Glue for our Data Lake tables. The only additional setup needed was to add the Data Lake tables as external tables, making them accessible from Redshift with a simple SQL query.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-06-loading-data-into-redshift-with-dbt/image3.png" alt="external schema snippet" /></p><p>We had already started adopting <a href="https://www.getdbt.com/product/what-is-dbt">DBT</a> for other datasets, and it also seemed like the perfect candidate to capture our Redshift Spectrum queries in our pipeline. DBT excels at transforming data and helps enforce writing modularized and version-controlled SQL. Instead of a Spark job reading from S3 to Redshift, we used DBT to simply copy the data from Data Lake directly to Redshift. 
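As a rough sketch of what such a copy can look like (the schema, table, and column names below are hypothetical, not Yelp's actual models), a dbt model can select directly from the Spectrum external schema:

```sql
-- models/staging/stg_events.sql (illustrative, assumed names)
-- Append one day of a Data Lake table, exposed to Redshift through a
-- Spectrum external schema, into a native Redshift table managed by dbt.
{{ config(materialized='incremental') }}

select
    event_id,
    event_time,
    payload
from spectrum_data_lake.events          -- external table backed by Glue/S3
where partition_date = '{{ var("run_date") }}'
```

With a shape like this, a single partition can be loaded by passing the date through `dbt run --vars`.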
Not only did DBT provide its usual trademark benefits of reproducibility, flexibility, and data lineage, but it also helped us combat some of the pain points mentioned above.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-06-loading-data-into-redshift-with-dbt/image4.png" alt="dbt model snippet" /></p><h2 id="simplified-schema-changes">Simplified schema changes</h2><p>To simplify schema changes, we took advantage of DBT’s <strong>on_schema_change</strong> configuration argument. By setting it to <strong>append_new_columns</strong>, we ensured that columns would not be deleted if they were absent from the incoming data. We also used DBT contracts as a second layer of protection to ensure that the data being written matched the model’s configuration.</p><h2 id="backfills-less-manual">Less manual backfills</h2><p>Backfilling also became a lot easier with DBT. By using DBT’s <strong>pre_hook</strong> configuration argument, we could specify a query to execute just before the model runs. This enabled us to automatically delete the data for the partition about to be written. Now that we could guarantee idempotency, backfills could be done without worrying about stale data not being removed.</p><h2 id="data-deduplication">Data deduplication</h2><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-06-loading-data-into-redshift-with-dbt/image5.png" alt="dbt test snippet" /></p><p>To tackle duplicate rows, we added a deduplication layer to the SQL, which was validated with a DBT test. While DBT has built-in unique column tests, they weren’t feasible for our large tables since they required scanning the entire table. Instead, we used the <strong>expect_column_values_to_be_unique</strong> test from the <strong>dbt_expectations</strong> package. 
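As an illustrative sketch of how a test from the dbt_expectations package might be wired up (the model and column names here are hypothetical), the test is declared in the model's schema file:

```yaml
# models/schema.yml (illustrative, assumed names)
version: 2
models:
  - name: stg_events
    columns:
      - name: event_id
        tests:
          - dbt_expectations.expect_column_values_to_be_unique:
              # limit the scan to the partition that was just written
              row_condition: "partition_date = '{{ var('run_date') }}'"
```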
This allowed us to specify a row condition to scan only the rows recently written.</p><h2 id="performance-gains">Performance Gains</h2><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-06-loading-data-into-redshift-with-dbt/image6.png" alt="performance gains" /></p><p>The most noticeable win was in performance, especially for our largest and most problematic Redshift dataset:</p><ul><li>Writing used to take about 2 hours, but now it typically runs in just 10 minutes.</li>
<li>Before, there were sometimes up to 6 hours of delays per month. Now we no longer experience any delays! This has greatly reduced the burden on our on-call incident response efforts.</li>
<li>Schema upgrades used to be a lengthy multi-step process. This has been improved to a 3-step process that takes only a few hours.</li>
</ul><h2 id="better-data-consistency">Better data consistency</h2><p>By eliminating the forking of data flows, we increased our confidence that data wouldn’t diverge between different data stores. Since any data entering Redshift must first pass through Data Lake, we could better ensure that Data Lake remained our single source of truth.</p><p>Following the success of the migration, we rolled out these changes to approximately a dozen other datasets and observed similar benefits across the board. By leveraging tools like AWS Redshift Spectrum and DBT, we better aligned our infrastructure with our evolving data needs, providing even greater value to our users and stakeholders.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp. If you're interested, apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2024/11/loading-data-into-redshift-with-dbt.html</link>
      <guid>https://engineeringblog.yelp.com/2024/11/loading-data-into-redshift-with-dbt.html</guid>
      <pubDate>Wed, 06 Nov 2024 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[How we improved our Android navigation performance by ~30%]]></title>
<description><![CDATA[<p>In 2019, Yelp’s Core Android team led an effort to boost navigation performance in Yelp’s Consumer app. We switched from building screens with multiple separate activities to using fragments inside a single activity. In this blog post, we’ll cover our solution, explain how we approached the migration, and share learnings from along the way as well as performance wins.</p><h2 id="where-we-started-circa-2018">Where we started circa 2018</h2><p>Navigating between screens in an Android app is often when the app and device are under the most strain. The new screen and its dependencies are created in quick succession, which can lead to slow or frozen frames. Prior to 2019, almost every page in Yelp’s Consumer application was in its own activity. Transitioning from one page to another was not always smooth, and the UI was often visibly slow while navigating around.</p><p>Clicking on one of the bottom tabs meant recreating everything from scratch each time a user navigated to a screen. We mitigated this in the short term by bringing an activity to the foreground instead of recreating it if it was present in the activity stack. Although this helped, it still didn’t result in the visibly silky-smooth navigation transitions we were hoping for.</p><p>To understand where we could have the most impact and to help focus our efforts, we first ran some local benchmarks, then monitored performance data from real users to verify our hypothesis that navigation performance was slow.</p><h2 id="local-benchmarks">Local Benchmarks</h2><p>In 2018, we performed some basic navigation tests on a Pixel device running Android 7.1.1 and measured how long it took from the moment a button was clicked on Screen 1 until Screen 2’s onCreate() lifecycle method completed. The following numbers are averages derived from 10 iterations per scenario:</p><table><thead><tr><th>Scenario</th>
<th>1st Time Navigation to Page (ms)</th>
<th>2nd Time Navigation to Page (ms)</th>
</tr></thead><tbody><tr><td>Plain Activity with no animation</td>
<td>152</td>
<td>65</td>
</tr><tr><td>Plain YelpActivity with no animation</td>
<td>420</td>
<td>116</td>
</tr></tbody></table><p>We had to ask ourselves, why is the Yelp base activity so much slower? It turned out it was a combination of many things.</p><ol><li>
<p>We were creating the Navigation Drawer as soon as the activity was created, instead of doing it lazily—only when the user opens it. This was true even for the bottom tab activities.</p>
</li>
<li>
<p>The layout hierarchy was deep, containing unused and unnecessary layers that could be removed.</p>
</li>
<li>
<p>Each page had to create the entire layout hierarchy, instead of just the content above the bottom navigation bar.</p>
</li>
<li>
<p>We had some slower calls during onCreate(), such as analytics calls and accidental disk I/O that should have been run on a background thread instead.</p>
</li>
<li>
<p>We had lots of small, cheap objects that were set up on each screen to ease development, the sum of which amounted to a significant portion of the slowdown.</p>
</li>
</ol><h2 id="production-data">Production Data</h2><p>We also collected some navigation performance data from real users over a five-month period. Below is the data for two high-traffic flows:</p><table><thead><tr><th>Flow</th>
<th>Average (ms)</th>
<th>P99 (ms)</th>
</tr></thead><tbody><tr><td>Home to Search Overlay</td>
<td>~200</td>
<td>~1000</td>
</tr><tr><td>Search List to the Business Page</td>
<td>~240</td>
<td>~380</td>
</tr></tbody></table><p>We learned the performance was lacking, verified our local benchmark numbers, and proved our seen-with-the-naked-eye hypothesis about the navigation transitions.</p><p>Based on this research, Yelp’s Core Android team decided it would be invaluable to tackle this problem. The recommended architecture for a bottom tab screen is to use a “single” activity with multiple fragments or views. The theory behind this is that creating a fragment or view is much faster and cheaper than creating an activity. However, there are many ways to implement this, so we had a big decision to make.</p><h3 id="fragments-vs-views">Fragments vs Views</h3><p>To determine what screens should be made of in our new single activity setup, we did some more local benchmarking. These measurements were also taken from a Pixel device running Android 7.1.1.</p><table><thead><tr><th>Scenario</th>
<th>1st Time Navigation to Page (ms)</th>
<th>2nd Time Navigation to Page (ms)</th>
</tr></thead><tbody><tr><td>Plain View with no animation</td>
<td>6</td>
<td>3</td>
</tr><tr><td>Plain View with animation</td>
<td>6</td>
<td>3</td>
</tr><tr><td>Plain Fragment with no animation</td>
<td>14</td>
<td>12</td>
</tr><tr><td>Plain Fragment with animation</td>
<td>15</td>
<td>11</td>
</tr></tbody></table><p>We found that either of these solutions resulted in significantly faster navigation performance than using activities. We also benchmarked with and without a shared element transition between screens to evaluate their impact on performance and found they had a negligible negative impact.</p><p>Based on the above, views were clearly the fastest in terms of the timings, but these numbers didn’t tell the whole story. Firstly, the difference in the timings is not visible to the naked eye, and both represent a significant performance gain over the status quo. Secondly, besides considering performance, we also had to consider the development experience for our Android community and the ongoing support available to us in the future from whatever solution we selected to build our new single activity.</p><p>Google didn’t directly support View-based navigation at the time (it has since become possible with Compose-based navigation). In order to use views, we would have needed to either find an existing view-based navigation architecture library or build one ourselves. There <em>were</em> some promising open-source solutions, such as Scoop by Lyft, Flow &amp; Mortar by Square, and the Conductor library by BlueLine Labs. However, third-party open-source libraries come with their own set of risks and challenges, such as being dropped or deprecated over time, as happened to two of the libraries mentioned above.</p><p>We evaluated Conductor, and it had many advantages, such as:</p><ol><li>Lightning-fast transitions</li>
<li>Great API</li>
<li>Support for shared element transitions out-of-box</li>
<li>Support for RxLifecycle and other architecture components via add-on libraries</li>
</ol><p>However, ultimately, we deemed the risk of using a 3rd party library for navigation to be too great. While views were technically faster, after taking everything into account, we decided to use fragments.</p><p>To use fragments, there are a variety of old and new options provided by Google. Unlike views, fragments are intended to be used as screens within a single activity flow. So by choosing fragments as our solution, we benefit from all the support that comes with it, such as documentation, testing, and lifecycle management.</p><p>The first fragment-based solution we evaluated was Google’s Jetpack Navigation Library. The library was quite new, but it seemed like it should have suited our needs. Developers define a navigation graph in XML and the library auto-generates code to make navigating between the screens defined in the graph simple. However, we quickly discovered various limitations and obstacles to using this.</p><h3 id="blocker-1-feature-modules">Blocker #1: Feature Modules</h3><p><a href="https://engineeringblog.yelp.com/2018/06/how-yelp-modularized-the-android-app.html">Yelp’s Android build is modularized</a> with each feature residing in its own Gradle module. To keep our build speed lean, we don’t allow Gradle modules in the same layer of the build hierarchy to depend on each other. This allows modules to build in parallel, which unlocks a slew of build performance wins.</p><p>Defining navigation routes in an XML file meant having an app-wide navigation graph in a build-layer higher than the feature layer in the build hierarchy. Fragment IDs also had to be declared down the hierarchy to be accessible in all modules and permit inter-module navigation.</p><h3 id="blocker-2-scalability">Blocker #2: Scalability</h3><p>Declaring all screens in a single XML file would also have led to a major scalability issue, where we would have one giant and hard-to-read file which all teams would iterate on frequently. 
XML is also not dynamic enough for our use-cases. Due to a performance issue in the Android Gradle Plugin, our build times also tripled when attempting to declare the fragment IDs in a lower-level module. Lastly, even with the above approach, inter-module navigation became tricky and negated most of the benefits the library provided.</p><p>After five years of improvements, the Jetpack navigation library can handle more use cases. It is now possible to create a navigation graph dynamically in Kotlin, which should help with some of the issues we faced. We reevaluate this regularly and may switch to using it at a later stage. Overall, this is a great navigation library and we currently use it for small flows within a larger screen.</p><h2 id="selected-approach-plain-old-fragments">Selected Approach: Plain Old Fragments</h2><p>We decided to use plain fragments without using the Jetpack navigation library. Fragments are a well-supported part of the Android ecosystem and are familiar to most developers. By using plain fragment navigation, we could get the performance benefit we wanted, get visually pleasing transitions, and solve the cross-Gradle module navigation issue we encountered in the Jetpack Navigation library.</p><p>Android provides a FragmentTransaction API for showing and hiding fragments, which is what we use under the hood. However, we added a layer of abstraction which hides FragmentTransaction and other navigation specific code from features. We use layers of abstractions when we can to great success. This gives our future selves (thanks, us!) a great advantage by allowing us to switch implementations if necessary, but without updating every navigation point in the app. 
This abstraction layer exists as an interface we imaginatively named “SingleActivityNavigator”.</p><p>Navigating from one screen to another in the single activity requires creating an instance of the new fragment and then calling <code class="language-plaintext highlighter-rouge">displayInSingleActivity</code>, which, at minimum, requires an Android context and a fragment tag.</p><p>We built the bottom tabs like a regular feature using <a href="https://engineeringblog.yelp.com/amp/2023/04/performance-for-free-on-android-with-our-mvi-library.html">our MVI library “auto-mvi”</a>, which is both performant and easily testable. Now in the single activity, there’s only one instance of the bottom tab bar, and it’s shared among many screens. This speeds up fragment creation, since fragments in the single activity only need to inflate the content above the bottom tab bar.</p><p>We removed the navigation drawer, as it was already an outdated Android design trend at the time, and instead moved its content to a “More Tab” accessible via the bottom navigation bar. This boosted performance both for the fragments within the single activity and for the single activity itself, as the drawer was no longer required on every screen.</p><p>We allow each fragment in our single activity to configure screen-level properties through the SingleActivityNavigator interface. These properties are applied when a fragment is displayed and are restored to the previous fragment’s requirements when navigating backwards. Configurable options include the status bar color, the status bar icon color, whether the fragment content should draw under the status bar, and the window background color.</p><p>We use dependency injection to retrieve fragments based on a dependency injection string key. This lets us keep fragments in separate feature modules and retrieve an instance of them from anywhere else in the app. 
One advantage of using the SingleActivityNavigator interface is that, while we mostly use dependency injection to retrieve the fragments, it’s not a hard requirement. We can retrieve fragments by other means, which for our use-case was important to allow backwards compatibility with some legacy code.</p><p>Another advantage of this approach is that it keeps build times fast with our modular Android Gradle build.</p><h3 id="handling-deeplinks">Handling Deeplinks</h3><p>In the Yelp Consumer app, each external deeplink first passes through an activity whose sole purpose is to process the deeplink’s URL parameters and decide if the URL is safe and/or correct. These activities are called URLCatcherActivities. Each deeplink destination has its own designated URLCatcherActivity. After processing the URL and parsing whatever relevant data there is, this activity is then responsible for navigating to the actual target destination within the app. While these intermediate activities during app launch are not ideal for our app’s cold start timings, we benefit from avoiding a monolithic deeplink-handling class, as well as from improved readability and testing.</p><p>This brings us to how we added support for deeplink navigation to fragments. Building on the above section, we know we can use dependency injection to retrieve a fragment based on a string key. The key is used to fetch an instance of a fragment from the dependency graph. To navigate to a fragment based on a deeplink, we use an intent extra that denotes the fragment to display. After parsing the data from the URL, we pass it in an Intent. The single activity then uses this Intent extra to fetch an instance of the fragment from the dependency graph. 
It passes data from the URL into the fragment’s arguments and then finally displays the fragment.</p><p>While this solution satisfies our requirements under the constraints (requiring a URLCatcherActivity), further performance improvements became possible once we introduced the single activity. To improve deeplink navigation and cold start performance further, we can now deeplink directly to the single activity and display a fragment, which is a significant improvement over the status quo.</p><h2 id="migration-path">Migration Path</h2><p>There were three phases to the migration to fragments. Before we could begin the actual fragment migration, we first had to address the navigation drawer and move it to the More Tab. Next, we migrated each activity to a fragment while leaving the original activities in place; these activities were mostly empty shells at this point and used the pre-existing navigation code. Then, we gradually rolled out a version where each fragment was displayed in the single activity. Lastly, we monitored navigation performance to verify that we achieved the expected improvements.</p><h2 id="performance-results">Performance Results</h2><p>When recording measurements from production, we focused on tracking the highest-traffic screens in the Consumer app. We only tracked the first time a transition occurred within each session, because this is where the change is most impactful and noticeable. We found that after screens are already created, navigating among them is exceptionally fast. So bear in mind that the following results include creating the fragments’ views too.</p><p>On average, across all Android versions and device models (low &amp; high end), we saw a ~30% navigation performance boost. Sometimes, we saw as high as a ~60% improvement in navigation time.
The performance improvement really depends on the screen: what it’s doing and how it’s built internally.</p><h2 id="conclusion">Conclusion</h2><p>We learned that multiple fragments in a single activity perform much faster than multiple separate activities. We accomplished visibly smooth animations between our screens while leaving our fragments in separate feature modules. Doing a migration like this gradually and safely is totally achievable.</p><p>Although the performance of the underlying Android components (activity, fragment, view) varied quite a bit, the performance gains in any project always depend on the use-case-specific code and solutions already in place. That’s why on the Core Android team, we try to tackle performance holistically with performant-by-default solutions when and where we can.</p><p>Our single activity implementation has been working well for many years now, with many teams that work on the Consumer app having adopted the pattern for their screens. Our business owner app also followed suit and migrated to a fragment-based single activity. While our apps are smoother now, we remain optimistic that the Jetpack navigation library will someday solve all of our requirements.</p><h2 id="acknowledgements">Acknowledgements</h2><p>A huge thanks to Core Android’s managers at the time before and during this project, David Brick and Antonio Hernández Niñirola, who helped us make a case to do this work and push it forward. A special thanks to all the feature teams and developers who migrated their screens and provided code reviews, such as Diego Waxemberg, Tyler Argo, Lasya Boddapati, and Sreenivasen Ramasubramanian. Finally, a big thank you to my fellow Core Android members for providing invaluable thoughts, feedback and insight along the way.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp.
If you're interested, apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2024/10/how-we-improved-our-android-navigation-performance-by-~30.html</link>
      <guid>https://engineeringblog.yelp.com/2024/10/how-we-improved-our-android-navigation-performance-by-~30.html</guid>
      <pubDate>Tue, 08 Oct 2024 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Migrating in-place from PostgreSQL to MySQL]]></title>
      <description><![CDATA[<p>The Yelp Reservations service (yelp_res) is the service that powers <a href="https://www.yelp.com/reservations">reservations on Yelp</a>. It was acquired along with <a href="https://blog.yelp.com/news/welcoming-seatme-to-yelp/">Seatme in 2013</a>, and is a Django service and webapp. It powers the reservation backend and logic for <a href="https://restaurants.yelp.com/products/yelp-guest-manager/">Yelp Guest Manager</a>, our iPad app for restaurants, and handles diner and partner flows that create reservations. Along with that, it serves a web UI and backend API for our Yelp Reservations app, which has been superseded by Yelp Guest Manager but is still used by many of our restaurant customers.</p><p>This service was built using a DB-centric architecture, and uses a “DB sync” paradigm – a method where clients maintain a local database with a copy of data relevant to them – to sync data with legacy clients. It also relies on database triggers to enforce some business logic. The DB used is PostgreSQL, which is not used anywhere else at Yelp; this meant that only a small rotation of long-tenured employees knew Postgres well enough to do outage response. This caused issues in maintenance, visibility, and outage response times. The teams working on the Restaurants products are not infra teams, and the Yelp-wide infra teams (understandably) focus on Yelp-standard infrastructure. As a result, when we did see issues with Postgres it was often a scramble to find people with relevant expertise.</p><p>So, we switched out this DB in-place with a Yelp-standard MySQL DB.</p><p>As restaurants rely on our product to run their business, this system can’t be taken offline for maintenance, and any data loss is unacceptable: we can’t have someone make a reservation and then have it disappear. This led to much of the complexity of this project, as switching gradually between two data stores on the fly introduced new challenges.
Much of the existing documentation we could find on this used toy examples or assumed a clean stop, migration, and restart, so this was also somewhat unexplored territory (hence this blog post!).</p><p>Django has MySQL support. As a proof of concept in mid-2022, we switched the development DB (which is local and set up as needed) to a MySQL DB and updated migration code. We got to the point where the service was starting, correctly setting up the DB in MySQL, and responding successfully to some requests. While this ended up being the easy part, it helped prove that the migration was feasible.</p><p>Postgres has a lot of functionality that isn’t supported in MySQL. We also used some features that, while supported by MySQL, are not supported by our infra teams.</p><p>One example: Postgres has native support for array columns. We used these to store the schedule for each table at a restaurant in our database as an array of integers. We re-implemented this behavior to pack the data into a string, which worked cleanly since both the length of the array and the size of each element are constant.</p><p>A more complicated set of changes was needed to get rid of database triggers. Triggers are supported by MySQL in general, but are not supported by our MySQL infrastructure. Our code used them to propagate data (triggering when certain database tables are changed) and to enforce constraints around preventing double-booking of tables.</p><p>For data propagation, our old system relied on DB changes for certain tables being published as Advanced Message Queuing Protocol (<a href="https://en.wikipedia.org/wiki/Advanced_Message_Queuing_Protocol">AMQP</a>) events into <a href="https://www.rabbitmq.com/">rabbitmq</a>, which were then consumed by multiple clients that subscribed to the changes relevant to them.
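As a sketch of that packing approach (the 4-digit slot width and 96-slot length are invented bounds for illustration; the real schema differs), a fixed-length array of bounded integers round-trips cleanly through a string column:

```python
# Sketch: replace a Postgres integer-array column with a packed string for
# MySQL. This works because the array length and the range of each element
# are constant. The widths below are assumptions for the example.
SLOT_WIDTH = 4   # digits per element
NUM_SLOTS = 96   # e.g. one slot per 15-minute interval in a day


def pack_schedule(values):
    assert len(values) == NUM_SLOTS
    assert all(0 <= v < 10 ** SLOT_WIDTH for v in values)
    # Zero-pad each element so every slot occupies a fixed byte range.
    return "".join(f"{v:0{SLOT_WIDTH}d}" for v in values)


def unpack_schedule(packed):
    return [int(packed[i:i + SLOT_WIDTH])
            for i in range(0, len(packed), SLOT_WIDTH)]


schedule = [0] * NUM_SLOTS
schedule[32] = 1250
packed = pack_schedule(schedule)
```

Because every slot sits at a fixed offset, the packed column can even be sliced in SQL if needed, though the application normally unpacks the whole string.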
This was powered by a Postgres-specific database extension that integrated with Postgres’ transaction management, ensuring clients never received a message until the corresponding transaction was committed. In our new system, we added logic to the Django model’s <code class="language-plaintext highlighter-rouge">save()</code> function to add a post-commit trigger to publish to a new AMQP topic, and refactored our code to eliminate “bulk” operations, which write to the DB without calling <code class="language-plaintext highlighter-rouge">save()</code>. This means our existing watchers could instead listen to this new AMQP topic even when updating MySQL tables. We also introduced transaction grouping by generating a universally unique identifier (UUID) for each transaction at the start of the transaction block. This identifier was written along with the change data to group changes by transaction. We then monitored the existing topic and the new topic and ensured that the data matched, before switching to using the new topic.</p><p>For preventing double-bookings, we had used a system called ‘Atomic Block Holds’. This used a DB trigger to raise an exception and prevent a write if a block (a ‘block’ is a reservation, or anything else, that means a table cannot be reserved at a certain time) on a table would overlap with an existing block. To replicate this behavior without triggers, we created a new table called <code class="language-plaintext highlighter-rouge">TableTimeSlotBlock</code> which contains rows keyed on both the table id and 15-minute timeslots for each existing blocked period. Then the application code checks for conflicts and locks the rows (even if they don’t exist yet) by performing a <code class="language-plaintext highlighter-rouge">SELECT … FOR UPDATE</code> query. 
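The core of that check, expanding a block into its 15-minute timeslot keys and refusing any overlap, can be sketched in pure Python as follows (an in-memory stand-in: the real implementation locks the corresponding rows with SELECT … FOR UPDATE inside a transaction, and the table/field names here are illustrative):

```python
from datetime import datetime, timedelta

SLOT = timedelta(minutes=15)


def timeslot_keys(table_id, start, end):
    """Expand a block into (table_id, slot_start) keys, one per 15-minute slot."""
    keys = []
    t = start
    while t < end:
        keys.append((table_id, t))
        t += SLOT
    return keys


def try_hold(existing, table_id, start, end):
    """Refuse the hold if any timeslot row already exists; otherwise record them.
    Stand-in for the row-locking SELECT ... FOR UPDATE conflict check."""
    keys = timeslot_keys(table_id, start, end)
    if any(k in existing for k in keys):
        return False
    existing.update(keys)
    return True


blocks = set()
dinner = datetime(2024, 10, 7, 19, 0)
ok = try_hold(blocks, table_id=7, start=dinner, end=dinner + timedelta(hours=1))
conflict = try_hold(blocks, table_id=7, start=dinner + timedelta(minutes=45),
                    end=dinner + timedelta(hours=2))
```

The second hold fails because its first timeslot overlaps the last slot of the first hold, which is exactly the double-booking the trigger used to prevent.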
We put this logic earlier in the code than the existing DB trigger, so by examining logs we could ensure that the existing trigger was no longer exercised, meaning that the new solution was at least as restrictive as the status quo.</p><p>To migrate to this system, we also had to create rows in this new table for all future reservations – and since each existing ‘block’ covered multiple timeslots, that meant adding millions of records to the new table.</p><p>This was the scary part. We wanted to be able to release the new DB gradually and be able to roll back to Postgres if needed. This meant that we needed to keep both DBs in sync for some period of time. Django has multi-DB support, but that is intended for writing/reading different things to different DBs, not keeping data exactly in sync across multiple DBs.</p><p>To achieve this for writes, we:</p><ul><li>
<p>Added a new model called <code class="language-plaintext highlighter-rouge">AlsoWriteToMysqlModel</code> in the inheritance hierarchy of all models in our code. This model redefined save() and other object-level DB write functions to write first to a ‘primary’ DB, and then save the object to the ‘secondary’ DB</p>
</li>
<li>Did the same with <code class="language-plaintext highlighter-rouge">AlsoWriteToMysqlQuerySet</code> and queryset operations for all querysets in our code
<ul><li>In Django, not all operations are performed on objects; for example, you can apply a filter to a QuerySet and then call <code class="language-plaintext highlighter-rouge">delete()</code> on it, which performs a single DB query with that filter and never loads the actual objects.</li>
</ul></li>
<li>Added post_save and pre_delete signal handlers for models we don’t control (like the <code class="language-plaintext highlighter-rouge">User</code> model or third party models) that do the same
<ul><li>We could have used this technique for all models, but we felt that having the logic inside the model where we could was easier to reason about and kept the DB writes as close together as possible.</li>
</ul></li>
<li>Replaced the default Django transaction decorator with a decorator that nested a Postgres transaction inside a MySQL transaction. This meant that almost any DB failure would roll back both DBs, as long as we were in a transaction.
<ul><li>The exception is for failures at MySQL commit-time; the logic here is that DB triggers made Postgres commits sometimes fail, while MySQL commits should always succeed unless there’s an infra issue. We learned this the hard way after originally having the order reversed in an attempt to reduce the risk of introducing new failing Postgres writes during the rollout, and then having some transactions fail due to DB triggers after committing writes to MySQL, leading to inconsistent data across the databases. This is an interesting example where “playing it safe” in one dimension actually caused a bug.</li>
</ul></li>
</ul><p>For reads, we:</p><ul><li>Added logic to the router (the Django class which determines which DB we read/write to) to separate the ‘read db’ and the ‘write db’</li>
<li>Added middleware to set a flag if and when we wanted a request to read from MySQL, which was respected by the router
<ul><li>This flag was set before any DB reads/writes in the middleware stack, to ensure each request only reads from one DB</li>
</ul></li>
</ul><p>During the release, we first kept reads on Postgres, to keep behavior identical to the status quo while also writing to MySQL. This let us cross-check the databases and fix inconsistencies and bugs at our leisure without affecting customers. We then gradually switched requests to read from MySQL, then switched the write logic to write to MySQL first, and finally (several months later) turned off Postgres writes entirely and cleaned up much of the code we had written.</p><p>The release process went relatively smoothly over the course of several months, with a few surprises we describe below.</p><ul><li>
<p>Originally, we planned a transition period where the ‘primary’ database could differ on a per-request basis. However, this causes issues with autoincrement primary keys. Specifically, PostgreSQL maintains a sequence that’s incremented only when a row is inserted without the primary key set. This means that you should either always set the key, or never set the key. Otherwise, each write with the key set (like when we write to MySQL and then save the object to Postgres) leaves the sequence behind the table’s actual maximum key, so a future Postgres write with the key unset will eventually generate a duplicate key and fail. This took some time to figure out during rollout, as the symptom was a small number of errors in status quo flows, but no errors in the MySQL-pinned requests.</p>
</li>
<li>
<p>Django names DB savepoints with random strings. ProxySQL, which in our infrastructure sits between clients and the databases, stores query digests for use in metrics. These digests are meant to be generic representations of queries and not depend on the actual data written or read, but savepoint names are included in the digests, leading to each query using a savepoint having a unique digest. This led to escalating ProxySQL memory usage and a few instances of production issues until we figured it out. We fixed this by changing a setting in our ProxySQL instances.</p>
</li>
<li>Switching from ‘bulk’ operations to individual object-level operations is significant and can lead to logic issues (since things like <code class="language-plaintext highlighter-rouge">save()</code> aren’t called in bulk operations) and performance issues (since an order of magnitude more DB queries could be executed).
<ul><li>In a single instance, this meant rewriting an archival batch job to use raw SQL, but otherwise it turned out that MySQL could easily handle the volume of writes we do.</li>
</ul></li>
<li>
<p>Performing an initial data load (backfilling) early is very useful for testing, but we should have done a full, clean, second backfill once we fixed all the bugs. Instead, we patched broken data as we discovered it, which led to bugs triggered by data written by code that had since been fixed; a clean second backfill would have avoided those.</p>
</li>
<li>
<p>Don’t overlook other users of the database. Our analytics pipeline was getting data directly from Postgres, and moving it over to use MySQL ended up being time-consuming. This was the final blocker to decommissioning the old database.</p>
</li>
<li>Using the Yelp-standard stack improved performance. This isn’t due to MySQL being inherently more performant, but by using the same stack the company uses, we benefit from many people’s efforts monitoring and optimizing our database performance.</li>
</ul><p>This was a large project that took the better part of a year to implement. In the interests of brevity, I’ve focused on a subset of the work Restaurants did, but it was all vital for the success of this project. Special thanks to the Database Reliability Engineering and Production Engineering teams, and everyone from Restaurants who worked on this, especially Boris Madzar, Carol Fu, and Daniel Groppe.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp. If you're interested, apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2024/10/migrating-from-postgres-to-mysql.html</link>
      <guid>https://engineeringblog.yelp.com/2024/10/migrating-from-postgres-to-mysql.html</guid>
      <pubDate>Mon, 07 Oct 2024 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Boosting ML Pipeline Efficiency: Direct Cassandra Ingestion from Spark]]></title>
      <description><![CDATA[Machine Learning Feature Stores ML Feature Store at Yelp Many of Yelp’s core capabilities such as business search, ads, and reviews are powered by Machine Learning (ML). In order to ensure these capabilities are well supported, we have built a dedicated ML platform. One of the pillars of this infrastructure is the Feature Store, which is a centralized data store for ML Features that are the input of ML models. Having a centralized dedicated datastore for ML Features serves a number of purposes: Data Quality and Data Governance Feature discovery Improved operational efficiency Availability of Features in every required environment...]]></description>
      <link>https://engineeringblog.yelp.com/2024/09/boosting-ml-pipeline-efficiency-direct-cassandra-ingestion-from-spark.html</link>
      <guid>https://engineeringblog.yelp.com/2024/09/boosting-ml-pipeline-efficiency-direct-cassandra-ingestion-from-spark.html</guid>
      <pubDate>Thu, 19 Sep 2024 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Boosting ML Pipeline Efficiency: Direct Cassandra Ingestion from Spark]]></title>
      <description><![CDATA[<h2 id="ml-feature-store-at-yelp">ML Feature Store at Yelp</h2><p>Many of Yelp’s core capabilities such as business search, ads, and reviews are powered by Machine Learning (ML). In order to ensure these capabilities are well supported, we have built a dedicated ML platform. One of the pillars of this infrastructure is the Feature Store, which is a centralized data store for ML Features that are the input of ML models.</p><p>Having a centralized dedicated datastore for ML Features serves a number of purposes:</p><ul><li>Data Quality and Data Governance</li>
<li>Feature discovery</li>
<li>Improved operational efficiency</li>
<li>Availability of Features in every required environment</li>
</ul><p>ML Models at Yelp are usually trained on historical data and used for inference in real-time systems. Thus we need to be able to serve the Features (which are the inputs for the model both during inference and during training) in real time during inference, and as a historical log of all previous values during training.</p><p>The Feature Store is an abstraction over real-time and historical datastores to provide a unified Feature API to the models.</p><p>Specifically, our historical Feature Store is implemented in our <a href="https://engineeringblog.yelp.com/2021/04/powering-messaging-enabledness-with-yelps-data-infrastructure.html">Data Lake</a> and the real-time Feature Stores are implemented in <a href="https://engineeringblog.yelp.com/2020/11/orchestrating-cassandra-on-kubernetes-with-operators.html">Cassandra</a> or <a href="https://engineeringblog.yelp.com/2021/09/nrtsearch-yelps-fast-scalable-and-cost-effective-search-engine.html">NrtSearch</a>.</p><p>Here, we will discuss how we improved the automated data sync from the historical Feature Store in the Data Lake to the online Feature Store in Cassandra. At a high level, the data movement is carried out by Sync jobs between the different data stores, as depicted below.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-08-27-boosting-ml-pipeline-efficiency-direct-cassandra-ingestion-from-spark/feature_store_sync_job.png" alt="Feature Store Sync Job" /></p><p>Our <a href="https://engineeringblog.yelp.com/amp/2022/08/spark-data-lineage.html">Spark ETL framework</a>, an in-house wrapper around PySpark, did not support direct interactions with any of the online datastores.
So any writes had to be routed through <a href="https://engineeringblog.yelp.com/2016/07/billions-of-messages-a-day-yelps-real-time-data-pipeline.html">Yelp’s Data Pipeline</a>.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-08-27-boosting-ml-pipeline-efficiency-direct-cassandra-ingestion-from-spark/flow_before.png" alt="Flow before" /></p><p>Thus, publishing ML Features to Cassandra required a longer route involving multiple steps:</p><ol><li>Create a <strong>Sync job</strong> that reads Features from Data Lake and republishes them to Data Pipeline.</li>
<li>Create and register an <a href="https://avro.apache.org/docs/1.11.1/specification/"><strong>Avro Schema</strong></a>, which is required for publishing data to the Data Pipeline.</li>
<li><strong>Schedule</strong> the Spark job in <a href="https://engineeringblog.yelp.com/2010/09/tron.html">Tron</a>, our centralized scheduling system.</li>
<li>Make <strong>Schema changes</strong> to add the new Feature columns in Cassandra. We have strict controls in place around Cassandra Schema changes at Yelp which require following a separate process.</li>
<li>Create a <strong>Cassandra Sink connection</strong> to push the data into Cassandra from the Data Pipeline.</li>
</ol><p><img src="https://engineeringblog.yelp.com/images/posts/2024-08-27-boosting-ml-pipeline-efficiency-direct-cassandra-ingestion-from-spark/dev_complexity_before.png" alt="Dev complexity before" /></p><p>This process had a few disadvantages.</p><ul><li>The data first needs to be duplicated into the Data Pipeline, which has some cost implications.</li>
<li>The engineer would have to ensure all five steps are executed successfully when publishing the Features.</li>
<li>Our <strong>Cassandra Sink Connector</strong> relies on eventual publishing of data from the Data Pipeline. This means engineers often have less visibility into when the Feature is completely published and available for reads from Cassandra.</li>
</ul><p>In order to deal with the above challenges, the Cassandra datastore was made a first-class citizen in the Spark ETL framework. This support is built on top of the <a href="https://github.com/datastax/spark-cassandra-connector">open source Spark Cassandra Connector</a>, allowing us to ingest Spark dataframes into Cassandra tables as well as extract data from Cassandra into Spark dataframes.</p><p>One of the key considerations when supporting Direct Feature Publication was to avoid any impact on the live traffic that our Cassandra clusters serve. One option we considered was spinning up a dedicated datacenter for Spark workloads. We ruled that out primarily for the following two reasons.</p><ol><li>Running Cassandra clusters would contribute additional costs.</li>
<li>Our Spark workloads rely more on writes to Cassandra than on reads. As data needs to be replicated across datacenters, having a dedicated datacenter doesn’t add much value.</li>
</ol><p>Throughout the rest of the article, we are going to focus on the Cassandra publisher aspects only. A number of design decisions made to ensure the reliability of our Cassandra production fleet are discussed below.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-08-27-boosting-ml-pipeline-efficiency-direct-cassandra-ingestion-from-spark/cassandra_perspective.png" alt="Cassandra perspective" /></p><h2 id="batch-mode-disabled-for-cassandra-writes">Batch Mode Disabled for Cassandra Writes</h2><p>From our experiments, we found that a Spark dataframe could be partitioned by a column that isn’t a partition key in Cassandra. This means that if we enabled batching without re-partitioning, a single request to Cassandra from a Spark job could involve multiple different partitions. Re-partitioning the Spark dataframe appeared excessive here, so we kept batching disabled for Cassandra writes.</p><h2 id="limiting-concurrent-writers">Limiting Concurrent Writers</h2><p>Another control we implemented was to limit the number of concurrent writers to Cassandra, to avoid putting pressure on Cassandra’s Native Transport Request (NTR) queue, instead letting Cassandra’s backpressure handle the load.</p><p>One of the major challenges was preventing Spark jobs from overloading the Cassandra cluster. The online nature of the datastore means the impact would be sudden and obvious. This was a challenge as there’s no adaptive rate control mechanism in the Spark Cassandra Connector (<a href="https://datastax-oss.atlassian.net/browse/SPARKC-594">SPARKC-594</a>). The Spark Cassandra Connector provides static rate-limiting configurations, but those are defined at the per-executor-core level (per Spark task). These configuration options look like:</p><div class="language-plaintext highlighter-rouge highlight"><pre>spark.cassandra.output.throughputMBPerSec
spark.cassandra.output.concurrent.writes
</pre></div><p>A couple of situations where a Cassandra cluster can be stressed include:</p><ol><li>A Spark job launched with a large number of cores/executors, which means there are a large number of parallel workers ingesting into or reading from the Cassandra cluster.</li>
<li>There are many Spark jobs launched in parallel interacting with a particular Cassandra cluster.</li>
</ol><p>To avoid these situations, we configured a few tuning parameters for Spark jobs. A major one was the capability to rate-limit a Spark job irrespective of the number of executors or cores launched. However, with Spark’s <a href="https://spark.apache.org/docs/3.4.1/job-scheduling.html/#dynamic-resource-allocation">Dynamic Resource Allocation</a> (DRA) enabled, it’s tricky to get the exact number of resources. Therefore, we computed the maximum possible executor cores as follows.</p><p><strong><em>max.executor.cores = min(max.executors * max.cores, max.spark.partitions)</em></strong></p><h2 id="limiting-number-of-concurrent-spark-jobs">Limiting Number of Concurrent Spark Jobs</h2><p>To effectively limit the number of concurrent Spark jobs accessing a Cassandra cluster, we needed a concurrency control mechanism. We implemented it with distributed locks in Zookeeper. In addition, we kept the <em>lock contention</em> time configurable so that Spark jobs can wait in case the semaphore lock is fully acquired. The positioning of the lock acquisition mattered: we deliberately acquired it just before initiating the Spark job, to prevent a scenario where resources are allocated but remain idle in a waiting state. The potential request-handling capacity of a Cassandra cluster is proportional to the computational resources allocated to it, so we kept the semaphore’s maximum count configurable.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-08-27-boosting-ml-pipeline-efficiency-direct-cassandra-ingestion-from-spark/concurrent_jobs.png" alt="Concurrent Spark jobs" /></p><p>Direct Feature publication to Cassandra yields some significant benefits, which are discussed below.</p><h2 id="infrastructure-cost-savings">Infrastructure Cost Savings</h2><p>We used to have 4 different components that contributed to the cost of moving a feature.
These included:</p><ul><li>The cost of computational resources allocated for executing a <strong>Spark</strong> job.</li>
<li>The cost of storing data inside Yelp’s <strong>Data Pipeline</strong>.</li>
<li>The cost associated with the Cassandra <strong>Sink Connection</strong> for ingesting data from the Data Pipeline into Cassandra.</li>
<li>The cost of I/O Operations on the <strong>Cassandra</strong> side for publishing data.</li>
</ul><p><img src="https://engineeringblog.yelp.com/images/posts/2024-08-27-boosting-ml-pipeline-efficiency-direct-cassandra-ingestion-from-spark/flow_after.png" alt="Flow after" /></p><p>Using direct Feature publication, we observe the following improvements:</p><ul><li>Spark jobs now take longer to complete, but they use far fewer executors.</li>
<li>The Data Pipeline is <strong>eliminated completely.</strong></li>
<li>The Cassandra Sink Connection is <strong>eliminated completely.</strong></li>
<li>The cost of I/O Operations in Cassandra remains almost unchanged.</li>
</ul><p>Overall, we observed around <strong>30% in ML Infrastructure Cost Savings</strong>.</p><h2 id="developer-velocity">Developer Velocity</h2><p>There were also benefits in terms of Engineering Efficiency. Compared to the previous mechanism, engineers can worry less about setting up the Sink Connections for Cassandra. The definition of Avro Schemas was also downgraded from a <em>hard requirement</em> to a <em>soft requirement</em>, mainly assisting the engineer in early data validation and verification. These Avro Schemas were primarily needed to define schemas for Yelp’s Data Pipeline (more details can be found in <a href="https://engineeringblog.yelp.com/2016/08/more-than-just-a-schema-store.html">our Schema Store blog</a>). In total, there’s a <strong>25% improvement in engineering effectiveness</strong> with respect to Feature publishing.</p><h2 id="developer-visibility">Developer Visibility</h2><p>As mentioned earlier, our Cassandra Sink Connector relies on eventual publication of data into the Data Pipeline. This made it slightly more complicated for developers to track when a Feature was completely published to Cassandra. Relying on direct ingestion to Cassandra means data is readily available for reads as soon as the Spark job succeeds.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-08-27-boosting-ml-pipeline-efficiency-direct-cassandra-ingestion-from-spark/dev_complexity_after.png" alt="Dev complexity after" /></p><p>Reducing the complexity of Feature publishing also improved the maintainability of the Feature Store systems.</p><p>The transition to direct publication to Cassandra has yielded considerable advantages, enhancing engineering effectiveness and reducing overall infrastructure costs. However, adaptive rate-limiting in the Spark Cassandra Connector would have helped us improve the developer experience further.
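The static cap described earlier, deriving a worst-case core count under Dynamic Resource Allocation and dividing a job-wide budget across it, can be sketched as follows (the 64 MB/s budget and the specific executor counts are invented figures for illustration):

```python
# Sketch of job-level rate limiting built from the formula in this post:
#   max.executor.cores = min(max.executors * max.cores, max.spark.partitions)
# The cluster budget below is an assumed number, not a Yelp setting.
def max_executor_cores(max_executors, max_cores_per_executor, max_spark_partitions):
    # Worst-case number of concurrently running executor cores under DRA.
    return min(max_executors * max_cores_per_executor, max_spark_partitions)


def per_core_throughput_mb(job_budget_mb_per_sec, total_cores):
    # spark.cassandra.output.throughputMBPerSec applies per executor core,
    # so the job-wide budget is divided by the worst-case core count.
    return job_budget_mb_per_sec / total_cores


cores = max_executor_cores(max_executors=10,
                           max_cores_per_executor=4,
                           max_spark_partitions=32)
setting = per_core_throughput_mb(job_budget_mb_per_sec=64.0, total_cores=cores)
```

Because the per-core setting is derived from the worst case, the job can never exceed the budget even if DRA scales it all the way up.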
A future potential improvement is a switch to <a href="https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics">reading/writing data with Spark Bulk Analytics</a>. This will allow us to bypass Cassandra’s Native Transport Request limits, and theoretically the read/write throughput can reach the maximum supported by the hardware (i.e., the disks).</p><p>We would like to thank Adel Atallah, Manpreet Singh and Talal Riaz for their contribution towards the successful completion of this work.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp. If you're interested, apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2024/08/boosting-ml-pipeline-efficiency-direct-cassandra-ingestion-from-spark.html</link>
      <guid>https://engineeringblog.yelp.com/2024/08/boosting-ml-pipeline-efficiency-direct-cassandra-ingestion-from-spark.html</guid>
      <pubDate>Wed, 28 Aug 2024 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[dbt Generic Tests in Sessions Validation at Yelp]]></title>
      <description><![CDATA[Sessions, Where Everything Started For the past few years, Yelp has been using dbt as one of the tools to develop data products that power data marts, which are one-stop shops for high-visibility dashboards pertaining to top-level business metrics. One of the key data products that’s owned by my team, Clickstream Analytics, is the Sessions Data Mart. This product is our in-house solution to understand what consumers do during their session interaction with Yelp products and provide insights on top of it. This blog post will walk you through how dbt is used as an important test...]]></description>
      <link>https://engineeringblog.yelp.com/2024/08/dbt-Generic-Tests-in-Sessions-Validation-at-Yelp.html</link>
      <guid>https://engineeringblog.yelp.com/2024/08/dbt-Generic-Tests-in-Sessions-Validation-at-Yelp.html</guid>
      <pubDate>Wed, 14 Aug 2024 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Implementing multi-metric scaling: making changes to legacy code safely]]></title>
      <description><![CDATA[<p>We’re excited to announce that multi-metric horizontal autoscaling is available for all services at Yelp. This allows us to scale services using multiple metrics, such as the number of in-flight requests and CPU utilization, rather than relying on a single metric. We expect this to provide us with better resilience and faster recovery during outages.</p><p>This year, PaaSTA (Yelp’s platform-as-a-service, which we use to manage all of the applications running on our infrastructure) turns eleven years old! The first commit was on August 20th, 2013, and the first <a href="https://github.com/Yelp/paasta/commit/201646347d4f8f630cbda979dadc15839f963008">public commit</a> was on October 22nd, 2015. That’s over half of Yelp’s lifetime! It’s quite remarkable that this tool has lasted for so long without being replaced by something else. We think its longevity really speaks to the vision and skill of the original PaaSTA authors. Of course, PaaSTA has changed a lot since then, and in this post we discuss how we were able to make a potentially risky change to a bit of legacy PaaSTA code without causing any downtime (our approach in this project was heavily inspired by <a href="https://gaultier.github.io/blog/you_inherited_a_legacy_cpp_codebase_now_what.html">Philippe Gaultier’s post</a> on making changes to legacy codebases, though PaaSTA is in a lot better shape than what he described).</p><p>The feature that we wanted to implement is called “multi-metric scaling.” PaaSTA serves as a platform on top of <a href="http://kubernetes.io">Kubernetes</a>, and as such, it uses the Kubernetes <a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/">Horizontal Pod Autoscaler</a> (HPA) to scale applications running on the platform in response to load. 
In essence, the HPA watches a variety of metrics such as CPU utilization, worker thread count, and others, and uses those input metrics to determine the number of pods (or replicas) that an application should run at any point in time.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-08-07-multi-metric-paasta/jira1.png" alt="A screenshot from Jira showing the ticket for supporting multiple metrics in PaaSTA was created on 2021-01-07" /></p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-08-07-multi-metric-paasta/jira2.png" alt="A screenshot from Jira showing the ticket for supporting multiple metrics in PaaSTA was created on 2021-01-07." /><p class="subtle-text"><small>A screenshot from Jira showing the ticket for supporting multiple metrics in PaaSTA was created on 2021-01-07.</small></p></div><p>But here we are in 2024, and PaaSTA still doesn’t support multi-metric scaling, despite having it on its roadmap since at least 2021! Why not?</p><p>The technical change here wasn’t hard. Internally, PaaSTA already represented the HPA metrics source <a href="https://github.com/Yelp/paasta/blob/d945ff1af51a0703379419881f9d9a2ae7c69bac/paasta_tools/kubernetes_tools.py#L815">as an array</a>: an array containing only a single element. “All” we needed to do was expose this functionality to our developers. The challenge here was twofold. Firstly, depending on the implementation, this could result in a non-backwards-compatible API change. Secondly, changes to autoscaling are always scary because they have the potential to cause an outsized impact on the systems relying on them. So how did we manage it?</p><p>In the first two weeks of the project, we didn’t write a line of code. We had a lot of experience with PaaSTA in the past, but our knowledge was several years out of date, so we spent most of our time reading code, asking questions, writing docs, and getting buy-in for the work. 
It was clear early on that the biggest concern people had was how this work would impact on-call load, particularly since we wouldn’t be on-call for the resulting changes. Having seen first-hand the damage bad autoscaling changes can cause, we weren’t about to argue with this! Additionally, we needed to ensure that several systems interacting with our autoscaling services could handle these changes. For example, Yelp has a system which we call Autotune that automatically rightsizes resource allocations for our workloads, and this system has special-cased behavior for the various types of service autoscaling that we support.</p><p>The plan we proposed had several steps. Since this was a legacy part of the PaaSTA codebase, the first change we wanted to make was to clean up some of the deprecated or old features. We hoped that doing so would make the codebase easier to work with and rebuild our familiarity with making changes to this code. Next, we suggested adding increased validation via the <code class="language-plaintext highlighter-rouge">paasta validate</code> command. This command is intended to verify that the configs in the services’ <a href="https://paasta.readthedocs.io/en/latest/soa_configs.html">soaconfigs</a> directory are correct, but the validation around the autoscaling configuration was fairly lax. By providing much stricter validation, we would be able to ensure that application owners couldn’t accidentally make incorrect changes to their service configuration, thereby improving safety and reliability overall. Lastly, we suggested that we spend time improving our dashboards and alerts around the HPA, to help us understand what “baseline” behavior was before making any changes.</p><p>The actual API change we agreed upon was straightforward. We would change this code:</p><div class="language-yaml highlighter-rouge highlight"><pre>autoscaling:
  metrics_provider: cpu
  setpoint: 0.8
</pre></div><p>to this:</p><div class="language-yaml highlighter-rouge highlight"><pre>autoscaling:
  metrics_providers:
    - type: cpu
      setpoint: 0.8
</pre></div><p>Since we control both PaaSTA and the soaconfigs directory, we can make a non-backwards-compatible change like this more easily. The procedure we proposed was to have PaaSTA temporarily support both the old and new autoscaling formats, then migrate everything in soaconfigs to the new format, and then remove support from PaaSTA for the old format. Once all of these changes were made, we were finally in a position to start adding multi-metric support to the handful of applications that needed it.</p><p>You might have noticed that 90% of the work we wanted to do was just to change the API. The underlying metrics sources wouldn’t change for any PaaSTA service until the very last step! This meant that, even though the code was old and potentially hard to reason about, we could use <a href="https://insta.rs/">snapshot testing</a> to ensure that behaviors didn’t change. (Snapshot testing is the process of recording the output, or “snapshot,” of a program, and then comparing future output from that program to the snapshot.) So that’s what we did.</p><p>PaaSTA uses the <a href="https://github.com/kubernetes-sigs/prometheus-adapter">Prometheus Adapter</a> to collect metrics from Prometheus (our timeseries metrics datastore) and forward them to the HPA. The Prometheus Adapter takes as input a large list of <a href="https://prometheus.io/docs/prometheus/latest/querying/basics/">PromQL</a> queries that configure the metric values seen by the HPA. So, in the first phase of the project, we figured out how to generate the Prometheus adapter config locally. Ultimately, if the Prometheus Adapter config didn’t change, then the HPA behavior wouldn’t change! We could simply compare the before-and-after output of this config file for each change that we wanted to make to ensure that the overall system would stay stable.</p><p>We also set up some dashboards using <a href="https://grafana.com/">Grafana</a> to monitor the HPA. 
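Temporarily supporting both autoscaling formats can be done by normalizing the legacy shape into the new list-based shape at parse time. The following is a hypothetical sketch, not PaaSTA’s actual implementation (the `normalize_autoscaling` helper and the default values are assumptions):

```python
from typing import Any, Dict, List


def normalize_autoscaling(config: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Accept either the legacy single-metric autoscaling format or the
    new list-based format, and always return the new list form.

    Legacy:  {"metrics_provider": "cpu", "setpoint": 0.8}
    New:     {"metrics_providers": [{"type": "cpu", "setpoint": 0.8}]}
    """
    if "metrics_providers" in config:
        # Already in the new format: pass through unchanged.
        return config["metrics_providers"]
    # Legacy format: wrap the single provider in a one-element list,
    # mirroring the single-element array already used internally.
    return [{
        "type": config.get("metrics_provider", "cpu"),
        "setpoint": config.get("setpoint", 0.8),
    }]
```

With a shim like this in place, soaconfigs can be migrated piecemeal; once everything emits the new shape, the legacy branch can be deleted.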
Since we knew that we would eventually have different services using different combinations of metrics sources, we used Grafana’s “<a href="https://grafana.com/blog/2020/06/09/learn-grafana-how-to-automatically-repeat-rows-and-panels-in-dynamic-dashboards/">Panel Repeat</a>” feature to automatically detect which metric sources a particular application was using and show only the panels relevant to those sources. Even though nothing was using multiple metrics yet, we wanted to have the dashboards in place for when we started.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-08-07-multi-metric-paasta/scaling.png" alt="A screenshot of a Grafana dashboard showing multiple input metric sources for a single application." /><p class="subtle-text"><small>A screenshot of a Grafana dashboard showing multiple input metric sources for a single application.</small></p></div><p>Once all that was in place, we were ready to make the actual changes. Because we were constantly testing against “what was in prod,” we didn’t want to use a static snapshot that might give us an incorrect output or test against old versions of soaconfigs. Instead, for each change, we followed this process: we checked out the “main” branch, generated a snapshot, checked out our feature branch, generated another snapshot, and finally compared the two.</p><p>The command that we used to generate our snapshots looked like this:</p><div class="language-plaintext highlighter-rouge highlight"><pre>paasta list-clusters | xargs -I{} bash -c "python -m paasta_tools.setup_prometheus_adapter_config -d \
    ~/src/yelpsoa-configs -c {} --dry-run &amp;&gt; ~/tmp/{}-prom-conf-rules"
</pre></div><p>This is the shortest part of the blog post, because it was the least eventful part of the project. Using the graphs and snapshot testing framework described above, we were able to get all of our soaconfigs migrated to the new API format, and a handful of services are now using the Kubernetes multi-metric scaling feature, all with no downtime or outages. As it turns out, there was nothing magical or particularly hard about the rollout; just a careful application of testing and a close eye on our graphs and charts.</p><div class="island job-posting"><h3>Become a Software Engineer at Yelp</h3><p>Want to help us make even better tools for our full stack engineers?</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/bd07a618-9b6f-4920-91c6-99280f1b268d?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2024/08/multi-metric-paasta.html</link>
      <guid>https://engineeringblog.yelp.com/2024/08/multi-metric-paasta.html</guid>
      <pubDate>Wed, 07 Aug 2024 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Fine-tuning AWS ASGs with Attribute Based Instance Selection]]></title>
      <description><![CDATA[This is the next installment of our blog series on improving our autoscaling infrastructure. In the previous blog posts (Open-sourcing Clusterman, Recycling kubernetes nodes) we explained the architecture and inner workings of Clusterman. This time we are discussing how attribute-based instance selection in the autoscaling group has helped us make our infrastructure more reliable and cost-effective, while also decreasing operational overhead. This will also cover how these changes enabled us to migrate from Clusterman to Karpenter. (Spoiler alert: Karpenter blog post is coming soon!) Motivation At Yelp we run most of our workload on AWS spot instances, and...]]></description>
      <link>https://engineeringblog.yelp.com/2024/05/fine-tuning-AWS-ASGs-with-attribute-based-instance-selection.html</link>
      <guid>https://engineeringblog.yelp.com/2024/05/fine-tuning-AWS-ASGs-with-attribute-based-instance-selection.html</guid>
      <pubDate>Wed, 01 May 2024 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Moderating Inappropriate Video Content at Yelp]]></title>
      <description><![CDATA[<p>One of Yelp’s top priorities is the <a href="https://trust.yelp.com/">trust and safety</a> of our users. Yelp’s platform is most well-known for its reviews, and its moderation practices have been recognised in <a href="https://blog.yelp.com/news/academic-research-finds-yelps-content-moderation-practices-mitigate-misinformation-and-build-consumer-trust/">academic research</a> for mitigating misinformation and building consumer trust. In addition to reviews, Yelp’s Trust and Safety team takes significant measures when it comes to protecting its users from inappropriate material posted through other content types. This blog post discusses how Yelp protects its users from inappropriate content in videos.</p><p>Recently, Yelp revamped its review experience by giving users the ability to <a href="https://blog.yelp.com/news/yelp-consumer-product-updates-april-2023/">upload videos</a> alongside their review text. This has led to a significant increase in the total number of videos uploaded to the platform.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-27-moderating-inappropriate-video-content-at-yelp/video-uploads-graph.png" alt="Starting April 2023, video uploads increased significantly at Yelp." /><p class="subtle-text"><small>Starting April 2023, video uploads increased significantly at Yelp.</small></p></div><p>Videos provide an immersive way to capture and share our experiences. However, this also opens the door to bad actors who may attempt to post disturbing videos to the platform. While such content is very rarely posted on Yelp’s platform, examples of such videos include:</p><ul><li>Nudity, sexual activity and suggestive material</li>
<li>Intense violence, graphic gore and disturbing scenes</li>
<li>Extremist imagery and hate symbols</li>
</ul><p>It is extremely important to Yelp to proactively prevent such videos from being displayed to users on our platform, which protects consumers and businesses alike.</p><p>Yelp has been committed to providing more value to consumers and businesses by leveraging AI. We recently announced how we are rapidly <a href="https://blog.yelp.com/news/yelp-enhances-ads-photos-search-waitlist-and-more-with-neural-networks-providing-more-value-for-consumers-and-businesses/">expanding the use of neural networks</a> to enhance ad relevance, search quality, and wait time estimates, among many others. AI-based systems also play a key role at Yelp to detect inappropriate content across various content types, from <a href="https://trust.yelp.com/recommendation-software/">reviews</a> to <a href="https://engineeringblog.yelp.com/2021/05/moderating-promotional-spam-and-inappropriate-content-in-photos-at-scale-at-yelp.html">photos</a>. Videos are no exception.</p><p>Any machine learning model has a non-zero chance of classifying a legitimate video as inappropriate; how often this happens is known as the false positive rate. On the other hand, a model’s recall — in this case the measure of how well it can correctly flag a problematic video — should be maximized. There is always a tradeoff between keeping recall high and the false positive rate low. While flagging and removing inappropriate content as swiftly as possible is extremely important, any model that incorrectly removes legitimate content can be extremely frustrating to users and can discourage them from actively participating on the platform. 
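To make this tradeoff concrete, here is a toy sketch (made-up scores and labels, not Yelp data or Yelp’s actual metrics code) showing how lowering the flagging threshold raises recall but also raises the false positive rate:

```python
def recall_and_fpr(scores, labels, threshold):
    """Recall and false positive rate for a given flagging threshold.

    scores: model scores in [0, 1]; labels: 1 = truly inappropriate.
    """
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return recall, fpr


# Toy data: a strict threshold misses one bad video; a loose
# threshold catches it but also flags more legitimate videos.
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]
strict = recall_and_fpr(scores, labels, 0.5)
loose = recall_and_fpr(scores, labels, 0.25)
```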
Therefore, in order to maintain a high recall and effectively handle false positives, we include human evaluation of flagged videos as part of our moderation pipeline.</p><p>Yelp’s <a href="https://trust.yelp.com/content-moderation/">User Operations team</a> strives to review flagged videos and promptly restore any false positives to enforce the <a href="https://www.yelp.com/guidelines">Content Guidelines</a> in a fair and effective manner. However, manual moderation can be time consuming and difficult to scale. On top of that, dealing with large volumes of false positives can be frustrating for employees. Therefore, even with human moderators in the loop, an effective content moderation system should keep the number of false positives to a minimum.</p><p>When a video is uploaded to the platform, the moderation pipeline kicks off in parallel to the video ingestion system. The video first gets checked by our matching service, which computes similarity hashes against other videos that were previously removed for violating content guidelines. Matched videos get automatically discarded, which helps manage overall moderation volume by blocking submissions from repeat offenders.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-27-moderating-inappropriate-video-content-at-yelp/video-moderation-pipeline.png" alt="An overview of the video moderation pipeline at Yelp." /><p class="subtle-text"><small>An overview of the video moderation pipeline at Yelp.</small></p></div><p>Videos that pass the check are then fed to a deep learning model, which returns a multi-label classification. If the classification scores are above our thresholds, the videos are hidden and sent to the User Operations team for review. These thresholds are carefully fine-tuned to keep false positives at a minimum, while still catching and flagging inappropriate content. 
Inappropriate videos are removed, whereas the ones that were incorrectly flagged are restored.</p><p>Moderating videos presents its own unique set of challenges. Videos are much larger in size than other common content types such as reviews and photos. As a result, it takes a lot more time to process and feed them through a neural network. However, it is important to have near real-time classification to remove inappropriate content as quickly as possible. One solution to this challenge is simply to reduce the number of videos going through the neural network by pre-emptively blocking uploads from users with suspicious activity patterns.</p><p>Another strategy to overcome this problem involves selectively sampling frames to pass through the deep learning model instead of passing all video frames. We ran experiments to find the optimal frame sampling technique and frequency that would minimize the inference time without sacrificing classification performance. The classification scores for the sampled frames are combined to give a final score.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-27-moderating-inappropriate-video-content-at-yelp/video-ml-model.png" alt="Sampled frames are fed into the model. The individual scores are combined to give a final score." /><p class="subtle-text"><small>Sampled frames are fed into the model. The individual scores are combined to give a final score.</small></p></div><p>The model used for classifying video frames is built upon the model currently in use for moderating photos, given the close similarities between the photos and video frames classification tasks. 
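The frame sampling and score combination described above could look roughly like the following sketch; the fixed sampling interval and the max-score aggregation are illustrative assumptions, since the post doesn’t specify the exact scheme Yelp chose:

```python
def sample_frame_indices(num_frames: int, every_nth: int) -> list:
    """Select every Nth frame index instead of scoring all frames."""
    return list(range(0, num_frames, every_nth))


def video_score(frame_scores: dict, sampled: list) -> float:
    """Combine per-frame scores into a single video-level score.

    Taking the max is one conservative aggregation choice: a single
    highly inappropriate sampled frame flags the whole video.
    """
    return max(frame_scores[i] for i in sampled)


# Toy example: a 300-frame video, scoring one frame out of every 30.
sampled = sample_frame_indices(300, 30)
scores = {i: 0.02 for i in sampled}  # mostly benign frames
scores[120] = 0.97                   # one sampled frame looks bad
final = video_score(scores, sampled)
```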
The photo moderation model has an excellent track record when it comes to protecting Yelp from inappropriate photos, and building on top of it helps us minimize engineering development costs and maintenance burden.</p><p>At Yelp, trust and safety is a top priority and we are committed to protecting our consumers and business owners. As video submissions to the platform grow, a robust and efficient moderation system is more important than ever, which is why Yelp combines automated and human moderation to protect our platform from inappropriate videos. The Trust &amp; Safety team continuously strives to improve its moderation pipelines to keep Yelp one of the most trusted review platforms on the web.</p><p>This project would not have been possible without the support and collaboration from the Yelp Connect and Consumer Contributions teams. Special thanks to Marcello Tomasini, Gouthami Senthamaraikkannan, Jonathan Wang, Jiachen Zhao, Sandhya Giri, Curtis Wong, and Anka Granovskaya for contributing to the design and implementation of the pipeline.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp. If you're interested, apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2024/03/moderating-inappropriate-video-content-at-yelp.html</link>
      <guid>https://engineeringblog.yelp.com/2024/03/moderating-inappropriate-video-content-at-yelp.html</guid>
      <pubDate>Wed, 27 Mar 2024 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Phone Number Masking for Yelp Services Projects]]></title>
      <description><![CDATA[<p>In this blog post, we highlight how phone number masking helps build consumer trust in the services marketplace at Yelp, decreases the friction in communication with service professionals, and allows for seamless switching between the Yelp app and a user’s phone. We present a high-level overview of our in-house phone masking system and dive into the details of the engineering challenge of optimizing the usage of proxy phone number resources at Yelp’s scale.</p><p>Every year, millions of requests for quotes, consultations or other messages are sent to businesses on Yelp. These users choose Yelp to connect with local services professionals for their projects because they value our <a href="https://trust.yelp.com/">trustworthy reviews</a> and seamless search experience. Yelp also provides users with a dedicated <a href="https://blog.yelp.com/news/yelp-introduces-projects/">project workspace</a> where they can outline their request, use our Request a Quote product to get matched with relevant businesses, and use our in-app messaging platform to easily compare quotes and communicate with pros.</p><p>While the messaging platform is a convenient tool for communication, we’ve observed the following pain points:</p><ul><li>Some customers, especially new users, may not be in the habit of checking the Yelp app for new business replies.</li>
<li>Businesses may sometimes feel that communicating via a phone call is more engaging and personal.</li>
<li>When it comes to more urgent or complex projects, it can be inefficient to describe the issue in a message.</li>
</ul><p>To remedy these pain points, we’ve seen a lot of businesses simply ask for the consumer’s phone number. While this solves the problems above, many customers may feel reluctant to share their contact information out of concern for receiving spam calls and unsolicited promotional messages. They want to feel confident that they can trust the business before providing a phone number.</p><p>To facilitate communication between customers and businesses via phone calls, while providing peace of mind that the user’s number is protected, Yelp recently introduced an evolution of the <a href="https://blog.yelp.com/news/yelp-launches-request-a-call-to-help-consumers-connect-quickly-and-seamlessly-with-services-businesses/">Request a Call</a> feature where customers can communicate with pros via masked phone numbers through both calls and SMS. Upon submitting a Request a Quote, the user can opt in to receiving calls and texts about their project. If they opt in to sharing their number, Yelp assigns a temporary masked number to the customer and the business, which allows both parties to communicate seamlessly through calls, SMS, and the Yelp app.</p><div class="two-images-parent"><div class="image-caption two-images-child"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-26-phone-number-masking-for-yelp-services-projects/phone-masking-opt-in-screen.png" alt="image" /></div><div class="image-caption two-images-child"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-26-phone-number-masking-for-yelp-services-projects/call-text-in-app.png" alt="image" /></div><p class="subtle-text"><small>After the customer shares and enters their phone number (left), the business can call and text the customer’s masked phone number (right).</small></p></div><p>Masked phone numbers provide the following benefits over calling or texting directly:</p><ul><li><strong>Privacy</strong>: Neither party’s real phone number is shared with the other, and both can opt out of 
communicating via phone calls at any time.</li>
<li><strong>Protection</strong>: Masked numbers cannot be shared with third parties—only the business can reach the customer through the masked number, and vice versa.</li>
<li><strong>Continuity</strong>: The full history of texts and calls is mirrored on both the app and the user’s phone, which allows for easy switching between the communication channels.</li>
</ul><div class="two-images-parent"><div class="image-caption two-images-child"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-26-phone-number-masking-for-yelp-services-projects/sms-thread.png" alt="image" /></div><div class="image-caption two-images-child"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-26-phone-number-masking-for-yelp-services-projects/app-view.png" alt="image" /></div><p class="subtle-text"><small>The conversation history between the customer and the business is synced between the SMS messages and the Yelp messaging platform.</small></p></div><p>In the next sections, we’ll take you through a high level overview of Yelp’s phone masking process, and highlight the key technical design decisions that we made in order to build consumer trust and provide the convenient benefits outlined above, while minimizing system costs and enabling the system to be scaled to Yelp’s large user base.</p><p>Fortunately for us, when it comes to working with phone numbers there is no need to start from scratch. Telephony API providers make it easy to purchase phone numbers, send or receive SMS messages, and initiate or receive phone calls. Additionally, they allow a phone number’s owner to react immediately to any event that occurs on the number, like an incoming call, through sending webhooks to a custom URL and accepting a response with custom instructions on how to handle the event. For example, the incoming call can get an automatic response or could be redirected to another number.</p><p>Using these building blocks, setting up a phone number masking application is straightforward. Two parties can have a phone call or engage in an SMS conversation through a proxy number without revealing their real numbers. We can simply forward the messages from one number to the other when receiving a webhook. 
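A minimal, provider-agnostic sketch of that forwarding logic follows; the in-memory session store and field names here are assumptions for illustration, not Yelp’s actual schema or a real telephony provider’s API:

```python
# Hypothetical masking sessions keyed by proxy number: each proxy
# links one customer's real number to one business's real number.
SESSIONS = {
    "+15550100": {"customer": "+15551234", "business": "+15559876"},
}


def route_inbound(proxy_number: str, from_number: str):
    """Given an inbound call/SMS webhook, decide where to forward it.

    Only the two parties in the session may use the proxy number;
    events from anyone else are dropped.
    """
    session = SESSIONS.get(proxy_number)
    if session is None:
        return None  # unknown proxy number
    if from_number == session["customer"]:
        return session["business"]
    if from_number == session["business"]:
        return session["customer"]
    return None  # third parties cannot reach either side


# The business texts the proxy number; we forward to the customer.
target = route_inbound("+15550100", "+15559876")
```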
<strong>And it all works seamlessly as if you’re communicating directly with the other person.</strong></p><p>Below is a high-level architecture of how Yelp integrated with a telephony API provider to offer the utility of phone masking, while keeping all phone events in sync with the Yelp conversation.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-26-phone-number-masking-for-yelp-services-projects/high-level-phone-masking-flow.png" alt="For calls, we proxy the call immediately to the second number and reflect the call on the user’s inbox. For SMS, the flow looks very similar, except instead of a call we persist and forward a text message." /><p class="subtle-text"><small>For calls, we proxy the call immediately to the second number and reflect the call on the user’s inbox. For SMS, the flow looks very similar, except instead of a call we persist and forward a text message.</small></p></div><p>The only thing left is an appropriate data model to integrate phone masking with the Yelp conversation. Our key requirement is that only the Yelp business can call the proxy number to reach the customer, and vice versa. Therefore, we need a data model which encapsulates the customer and business numbers and links them via a proxy number, so that we can route messages and calls to the proxy number to the intended recipient. We call this model a “masking session” because it provides a temporary connection between the two real numbers while not exposing them directly to each other. It looks like this at a high level:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-26-phone-number-masking-for-yelp-services-projects/masking-session-model.png" alt="Minimal masking session data model. 
Each of Yelp’s Services conversations has an associated session which allows us to route messages and calls between the numbers seamlessly, while reflecting the message and call events on the conversation feed." /><p class="subtle-text"><small>Minimal masking session data model. Each of Yelp’s Services conversations has an associated session which allows us to route messages and calls between the numbers seamlessly, while reflecting the message and call events on the conversation feed.</small></p></div><p>The diagram above is a good high-level outline for how phone masking works. But in reality Yelp connects hundreds of thousands of businesses and customers every month, and this model requires that we allocate one number for every connection that we facilitate.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-26-phone-number-masking-for-yelp-services-projects/number-allocation-one-number-per-session.png" alt="The most basic proxy number allocation strategy is to assign a unique number for every masking session." /><p class="subtle-text"><small>The most basic proxy number allocation strategy is to assign a unique number for every masking session.</small></p></div><p>This can quickly get prohibitively expensive, not to mention that phone numbers are a finite resource and even the telephony API provider couldn’t provide us with that many numbers.</p><p>There is a solution though, and it is 2-fold: <strong>recycle</strong> and <strong>reuse</strong>.</p><p>In the following few sections, we walk through several possible phone number allocation strategies that recycle and reuse proxy numbers between sessions in various ways. 
Ultimately, we arrive at the one that minimizes the size of the proxy number pool that we need to maintain in order to support our system.</p><h2 id="phone-number-recycling">Phone number recycling</h2><p>For certain use cases such as delivery apps or connecting with your rideshare driver, the masking session needs to be relatively short-lived (typically several minutes). In those situations an application can get away with having a relatively small pool of proxy numbers by using a very aggressive recycling policy. The two determining factors for the size of the proxy number pool are the number of sessions needed per unit time and the average lifespan of a session.</p><p>For example, if a delivery app has on average 1000 deliveries per hour, typically lasting under 30 minutes, then it would need 500 proxy numbers on average at a given time.</p><p>Yelp’s phone masking system implements recycling, but we need to keep sessions active for a longer period of time, given that conversations between customers and businesses often last for weeks. There is a potential workaround where we recycle a number after N hours of inactivity and then allocate a new number if the conversation resumes. However, we would then risk breaking the continuity of the SMS conversation if the later messages start coming from a new number, and we may cause confusion when a conversation with a new business starts abruptly from the same number. Because of these considerations, we typically mark a proxy number as recyclable only after 30 days.</p><p>Therefore, we only ever need to maintain as many phone numbers as the number of connections per month, i.e., our costs scale as <strong>O(conversations per month)</strong>. 
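The sizing arithmetic above is an instance of Little’s law: the average number of proxy numbers in use equals the session arrival rate times the average session lifespan. A quick sketch using the numbers from the delivery-app example (the 30-day figure is the recycling window described above):

```python
def avg_pool_size(sessions_per_hour: float, avg_lifespan_hours: float) -> float:
    """Average number of proxy numbers in use at once (Little's law)."""
    return sessions_per_hour * avg_lifespan_hours


# Delivery-app example: 1000 sessions/hour lasting ~30 minutes each.
short_lived = avg_pool_size(1000, 0.5)       # 500 numbers on average

# With a 30-day recycling window, each session ties up a number for
# ~720 hours, so the pool scales with a month's worth of sessions.
long_lived = avg_pool_size(1000, 30 * 24)    # 720,000 numbers
```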
This is definitely an improvement, but it still requires purchasing millions of phone numbers, which means that we need further optimizations to our phone number use.</p><h2 id="phone-number-reuse">Phone number reuse</h2><p>The idea behind proxy phone number reuse is to use the same number in multiple masking sessions simultaneously instead of it only taking part in one session at a time. The tricky part is to assign the numbers in such a way that all phone calls and texts are routed unambiguously to the intended recipients. Below we describe some options we evaluated.</p><h3 id="unique-number-for-every-business">Unique number for every business</h3><p>One approach would be to not actually use a unique number for every conversation, but instead have a constant proxy number for each business on Yelp, such that two different customers see the same number for a given business, and we can disambiguate the conversation based on the sender/caller number. It is also quite natural that the business number doesn’t change between different users.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-26-phone-number-masking-for-yelp-services-projects/number-allocation-one-number-per-business.png" alt="With this proxy number allocation strategy we assign a unique proxy number to each business. Then all customers that have conversations (masking sessions) with the same business always interact with the same proxy number." /><p class="subtle-text"><small>With this proxy number allocation strategy we assign a unique proxy number to each business. Then all customers that have conversations (masking sessions) with the same business always interact with the same proxy number.</small></p></div><p>Unfortunately, this approach still has a couple of problems. First, it doesn’t allow the business to call or text the proxy number because we wouldn’t know which customer we should forward the call to. 
(This problem is actually solvable with the two-number-pools approach we describe later). And second, it scales as <strong>O(businesses using Request a Quote)</strong> which still doesn’t reduce costs sufficiently, but it’s closer to the optimal solution.</p><h3 id="each-party-sees-unique-numbers-single-proxy-pool">Each party sees unique numbers (single proxy pool)</h3><p>What’s interesting about the unique number per business approach is that it touches on the actual constraints at hand. Namely they are:</p><ul><li><strong>Constraint 1</strong>: Each customer should be interacting with a different number for each different business they are contacting.</li>
<li><strong>Constraint 2</strong>: Each business owner should be interacting with a different number for each different customer they are working with.</li>
</ul><p>However, there is no problem if two different customers see the same number for two different businesses because we can disambiguate who they are calling/texting based on the caller phone number. The same holds true for the business side. Therefore, we only need enough numbers to satisfy the above constraints for all of the customer-business connections every month at Yelp (with recycling).</p><p>We can demonstrate how this works out to be a small number with a hypothetical example. If most customers contact less than 10 businesses per month, and most businesses receive less than 100 requests per month<sup><a href="https://engineeringblog.yelp.com/2024/03/phone-number-masking-for-yelp-services-projects.html#footnote1">1</a></sup>, we only need 100 numbers (max of the two) to satisfy both constraints. We can also add a safety factor to account for outliers, but the number pool size still ends up being a small constant size.</p><p>More importantly, this allocation strategy minimizes our proxy number costs because the number pool size does not need to increase (it is <strong>O(1)</strong>) with the volume of customer quote requests sent on Yelp or the number of businesses we onboard on the platform.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-26-phone-number-masking-for-yelp-services-projects/number-allocation-reuse-numbers-single-pool.png" alt="In this example, there are 2 customers and 2 businesses having a total of 4 unique conversations (masking sessions), facilitated by a pool of 2 proxy numbers. Notice how the proxy numbers are mapped to participants such that each party sees unique numbers for each of their conversations (i.e. both constraints are satisfied)." /><p class="subtle-text"><small>In this example, there are 2 customers and 2 businesses having a total of 4 unique conversations (masking sessions), facilitated by a pool of 2 proxy numbers. 
Notice how the proxy numbers are mapped to participants such that each party sees unique numbers for each of their conversations (i.e. both constraints are satisfied).</small></p></div><h3 id="each-party-sees-unique-numbers-multiple-proxy-pools">Each party sees unique numbers (multiple proxy pools)</h3><p>As a final improvement, we actually use two pools of proxy numbers, one for the customer side and another for the business side. This way the masking is still seamlessly maintained because the customer always communicates with the same number and so does the business owner. They just happen to be different numbers. The final masking session model looks like this:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-26-phone-number-masking-for-yelp-services-projects/number-allocation-reuse-numbers-two-pools.png" alt="Like before, there are 2 customers and 2 businesses having a total of 4 unique conversations (masking sessions), but now they are facilitated by 2 pools of 2 proxy numbers each. Each party still sees unique numbers for each of their conversations, but the customer and business in a particular session see distinct proxy numbers (each from their respective pool)." /><p class="subtle-text"><small>Like before, there are 2 customers and 2 businesses having a total of 4 unique conversations (masking sessions), but now they are facilitated by 2 pools of 2 proxy numbers each. Each party still sees unique numbers for each of their conversations, but the customer and business in a particular session see distinct proxy numbers (each from their respective pool).</small></p></div><p>This strategy still satisfies both constraints and keeps costs constant, but it has the following benefits:</p><ul><li><strong>Less risk of exhausting numbers</strong>: Customer proxy numbers only need to satisfy constraint 1 from the previous section while business numbers only need to satisfy constraint 2. 
This makes it less likely that we run out of proxy numbers to assign to a session: the more sessions we create, the harder it becomes for a single number to satisfy both constraints at once.</li>
<li><strong>Simpler allocation and routing logic</strong>: The code is easier to maintain and understand.</li>
<li><strong>Greater flexibility</strong>: We can configure each number pool independently. For example, each pool can have a different size, a distinct path for webhooks, specific alerting, etc. We could even change the assignment strategy of each pool if necessary, or we can have additional pools if we needed a different assignment strategy for a new participant type. (E.g. having a constant number per business like mentioned above for the customer side for specific subsets of businesses).</li>
</ul><p>The only downside of this final strategy is that we need to purchase slightly more proxy numbers overall. However, this tradeoff is worth it given the added flexibility and ease of maintenance.</p><p>In this blog post we learned how Yelp’s engineering team developed an in-house phone masking system for the Services Marketplace. The feature helps us uphold our core value of “Protecting the Source” by prioritizing the privacy of consumers when connecting them with professionals over the phone, and maintains professionals’ trust that Yelp connects them with high-intent customers who are more eager to get their projects done.</p><p>At the same time, it poses an interesting technical challenge to prevent costs from increasing linearly with the volume of traffic. We managed to overcome this problem through good data modeling and intelligent allocation of resources, which allows us to offer the convenience and flexibility of masked phone communication for all Request a Quote projects.</p><p>This project required significant cross-team collaboration, and I would like to thank everyone in the Services group and other Yelp teams who contributed to the development and made it possible. Special thanks goes to Yi Qi, Billy Barbaro, James Coles-Nash, Michelle Tan, and Rich Schreiber for your technical and editorial reviews of this article.</p><p><a name="footnote1" id="footnote1">1</a>: These numbers are for illustrative purposes only.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2024/03/phone-number-masking-for-yelp-services-projects.html</link>
      <guid>https://engineeringblog.yelp.com/2024/03/phone-number-masking-for-yelp-services-projects.html</guid>
      <pubDate>Tue, 26 Mar 2024 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[CHAOS: Yelp's Unified Framework for Server-Driven UI]]></title>
      <description><![CDATA[<p>Yelp develops two major applications, <a href="https://yelp.com">Yelp</a> &amp; <a href="https://business.yelp.com">Yelp for Business</a>, for Web (Desktop &amp; Mobile), iOS, and Android platforms. That’s eight unique clients! Keeping a fresh, consistent UI on all these clients is a major challenge. Server-driven UI (SDUI) has become a standard industry technique for managing UI on multiple platforms. At Yelp, many product teams created SDUI frameworks for their features. Though successful, these frameworks were expensive to develop and maintain, and no single SDUI framework supported all our clients. In late 2021, we began building a unified SDUI framework called <strong>CHAOS</strong> or “Content Hosting Architecture with Optimization Strategies”.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/chaos-yelps-unified-framework-for-server-driven-ui/chaos.png" alt="" /></div><p>CHAOS is a backronym. Initially, we thought it would make a good blog post! But we found deeper meaning in the name. According to chaos theory, small changes to a system can dramatically alter its state. CHAOS would simplify the process of deploying major UI changes on our clients, leading to our slogan: “Small changes have big results”.</p><p>Though we chose CHAOS quickly, we went through many proposals for the phrase behind the acronym:</p><ul><li>Creative and Humorous Acronym for Our System</li>
<li>Content Helps Accelerate Our Success</li>
<li>Components Help Accelerate Our Screens</li>
</ul><p>We eventually settled on “Content Hosting Architecture with Optimization Strategies”.</p><p>“Content Hosting Architecture” made sense. UI is the content the user sees and interacts with. We were building an architecture for hosting interactive content. The content could be anything from an entire mobile screen or desktop browser page to a single UI element, often called a component.</p><p>We added “Optimization Strategies” because we planned to use machine learning (ML) to optimize content. For example, some consumers prefer to see photos when searching for businesses while others prefer to see reviews. Sometimes, the consumer’s preference changes depending on the type of business; photos might be more important for finding a good restaurant and reviews more important for finding a plumber. An ML model could select the best search experience automatically.</p><p>SDUI is a popular technique for managing UI on multiple platforms. In a standard UI, the client developer writes both presentation and data fetching logic. Updating the UI requires changing the client. For mobile clients, changes require going through the platform’s app release process and waiting for users to upgrade to the new version. If multiple clients require the same UI changes, the cost of making the changes increases dramatically.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/chaos-yelps-unified-framework-for-server-driven-ui/client-without-sdui.png" alt="" /></div><p>In SDUI, the backend developer writes the presentation and data fetching logic, returning the configured UI to the client. 
The backend code can be updated without requiring changes to the client, and a single backend change can update the UI on multiple clients.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/chaos-yelps-unified-framework-for-server-driven-ui/client-with-sdui.png" alt="" /></div><p>At Yelp, we’ve built many successful SDUI frameworks. <a href="https://engineeringblog.yelp.com/2021/11/building-a-server-driven-foundation-for-mobile-app-development.html">Building a server-driven platform for mobile app development</a> described one such framework, the Biz Native Foundation or BNF, for managing the UX on the iOS and Android versions of Yelp for Business.</p><p>The BNF has a very typical server-driven architecture for mobile clients. It supports server-driven mobile screens that host a list of <strong>components</strong>. Interacting with a component, such as tapping a button, triggers an <strong>action</strong> that updates the UI directly or indirectly through a <strong>property</strong> – a piece of observable application state.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/chaos-yelps-unified-framework-for-server-driven-ui/sdui-architecture.png" alt="" /></div><p>While the BNF was being developed, several other major SDUI frameworks were being developed for Yelp’s consumer clients, and more teams were considering SDUI for their use cases. We organized an internal SDUI community to foster knowledge sharing and collaboration. Still, each SDUI framework was an independent effort. Some clients even had multiple SDUI frameworks controlling different aspects of the UI. A single product request might require changes to multiple SDUI frameworks!</p><p>Having a single, cross-platform SDUI framework would eliminate duplicate effort and simplify UI changes across multiple clients. 
We started CHAOS as a community-driven effort to build that framework.</p><p>Historically, we’ve built and maintained multiple REST APIs for our clients. Having different APIs, each with its own <a href="https://swagger.io/">Swagger</a> spec and backend Python service, was a big reason why we couldn’t unify our SDUI frameworks.</p><p>Fortunately, for the last several years, we’ve been switching all Yelp clients to a unified GraphQL API. Therefore, using GraphQL was a requirement for CHAOS. Even if we wanted to use REST for SDUI, our clients would need to support both REST &amp; GraphQL. When Yelp introduced GraphQL, we wanted to replace REST entirely.</p><p>We were initially excited about using GraphQL for SDUI. We thought we could evolve our SDUI graph more easily than a REST API, which requires explicit versioning. We thought the explicitness of client queries would help maintain backwards compatibility because each request would document the supported types and fields. As we’ll discuss in the next section, GraphQL presented some challenges when designing the CHAOS API, and we ultimately embedded some REST objects for pragmatic reasons.</p><p>We’ll start by outlining the original requirements for CHAOS, then discuss the use model and how it was translated into a GraphQL API.</p><h2 id="requirements">Requirements</h2><ul><li>Use GraphQL</li>
<li>Support a variety of use cases on web and mobile clients</li>
<li>Handle forwards &amp; backwards compatibility when making changes</li>
</ul><h2 id="use-model">Use model</h2><p>A <strong>view</strong> is a piece of UI managed by CHAOS. Every view has a unique <strong>name</strong> and a <strong>layout</strong>, which arranges a set of <strong>components</strong>. Components can trigger <strong>actions</strong> to implement side-effects. Every layout, component, or action has a unique versioned type.</p><p>For example, a product manager wants a simple view to help new Yelp users find local businesses. The initial design requires a single column layout with text, illustration, and button components. Clicking the button opens a deep link to a Yelp search.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/chaos-yelps-unified-framework-for-server-driven-ui/chaos-view.png" alt="" /></div><p>We can easily extend CHAOS to support more use cases by adding more layouts, components, and actions. Layouts can be a single column, a row, or a full web page/mobile screen with multiple sections. Components can be a single piece of text, a button, or an entire section. Actions can open URLs, log analytics, or update application state.</p><p>A Yelp client queries the CHAOS GraphQL API for a view. The GraphQL API loads the view by calling a standardized REST API on a CHAOS backend implemented as a Python service.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/chaos-yelps-unified-framework-for-server-driven-ui/chaos-system.png" alt="" /></div><p>There’s no single CHAOS backend for all views. Rather, CHAOS backends are microservices for UI. They can be responsible for a single view or multiple related views, and the CHAOS API dispatches client queries based on the view name.</p><p>CHAOS provides React, Android, and iOS client libraries for making GraphQL queries and rendering views. 
CHAOS provides a Python package for building views in CHAOS backends.</p><h2 id="dream-query">Dream Query</h2><p>At Yelp, when building new GraphQL APIs, we start by writing a <a href="https://engineeringblog.yelp.com/2020/10/dream-query.html">Dream Query</a>. We need a query to fetch a CHAOS view by its unique name:</p><div class="language-graphql highlighter-rouge highlight"><pre>query GetChaosView($name: String!) {
  chaosView(name: $name) {
    views {
      identifier
      layout
    }
    initialViewId
    components
    actions
  }
}</pre></div><p>The query returns a <code class="language-plaintext highlighter-rouge">ChaosConfiguration</code> with an array of views and an initial view ID. Though many CHAOS use cases have a single view, some use cases have a sequence of related views. We could always fetch subsequent views with additional GraphQL queries, but they would require extra round trips over a potentially slow and unreliable network connection. Consequently, CHAOS supports returning multiple views within the same configuration for better performance and reliability.</p><p>Each view has a layout that arranges components by ID. 
Layouts are represented by the <code class="language-plaintext highlighter-rouge">ChaosLayout</code> union type:</p><div class="language-graphql highlighter-rouge highlight"><pre>union ChaosLayout = ChaosSingleColumn | ChaosMobilePhoneScreen</pre></div><p>CHAOS supports a single column layout that arranges components in a vertical stack, which is great for adding some SDUI to an existing web page or mobile screen.</p><div class="language-graphql highlighter-rouge highlight"><pre>type ChaosSingleColumn implements ChaosLayout {
  rows: [String!]!
}</pre></div><p>CHAOS also supports a layout for controlling an entire mobile phone screen, a common use case for many of our existing SDUI frameworks.</p><div class="language-graphql highlighter-rouge highlight"><pre>type ChaosMobilePhoneScreen implements ChaosLayout {
  toolBar: String
  main: [String!]!
  footer: String
}</pre></div><p>We’ve been experimenting with layouts for entire web pages and will report on those efforts in subsequent blog posts. More commonly, our web clients use single column layouts to add some SDUI content to a page that otherwise uses traditional data fetching and presentation logic.</p><p>Layouts refer to components by ID, and all components in a <code class="language-plaintext highlighter-rouge">ChaosConfiguration</code> are stored in the top-level <code class="language-plaintext highlighter-rouge">components</code> field. Similarly, components refer to actions by ID, and all actions are stored in the top-level <code class="language-plaintext highlighter-rouge">actions</code> field.</p><p>Storing components and actions in the top-level configuration has some practical benefits. First, it reduces response size when components or actions are referenced multiple times. 
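</p><p>To illustrate the first benefit: a component referenced from several rows or views is serialized once and looked up by ID wherever it appears. A minimal sketch using plain dicts and made-up identifiers:</p><div class="language-python highlighter-rouge highlight"><pre># Components live once in the top-level list; layouts reference them by ID.
configuration = {
    "views": [
        {"identifier": "welcome", "layout": {"rows": ["header", "cta"]}},
        {"identifier": "return-visit", "layout": {"rows": ["header"]}},
    ],
    "components": [
        {"identifier": "header", "componentType": "chaos.text.v1"},
        {"identifier": "cta", "componentType": "chaos.button.v1"},
    ],
}

by_id = {c["identifier"]: c for c in configuration["components"]}

# "header" appears in both views but was transferred a single time.
resolved = {
    view["identifier"]: [by_id[row] for row in view["layout"]["rows"]]
    for view in configuration["views"]
}</pre></div><p>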
Second, it improves readability because layouts are compact and focused on how components are arranged.</p><h2 id="modeling-components--actions">Modeling components &amp; actions</h2><p>Initially, we planned to use explicit GraphQL types to model each component and action. We defined interfaces that all components and actions must satisfy. Because we reference components and actions by ID, they must have a unique string <code class="language-plaintext highlighter-rouge">identifier</code>. The other fields depend on the particular component or action.</p><p>Let’s say CHAOS supports a single component (<code class="language-plaintext highlighter-rouge">ChaosButton</code>) and action (<code class="language-plaintext highlighter-rouge">ChaosOpenUrl</code>) with the following GraphQL types:</p><div class="language-graphql highlighter-rouge highlight"><pre>type ChaosButton implements ChaosComponent {
  identifier: String!
  text: String!
  onClick: [String!]!
}

type ChaosOpenUrl implements ChaosAction {
  identifier: String!
  url: String!
}</pre></div><p>The client’s query uses fragments to specify the supported component and action types:</p><div class="language-graphql highlighter-rouge highlight"><pre>query GetChaosView($name: String!) {
  chaosView(name: $name) {
    views {
      identifier
      layout {
        ... on ChaosSingleColumn {
          rows
        }
      }
    }
    components {
      ... on ChaosButton {
        identifier
        text
        onClick
      }
    }
    actions {
      ... on ChaosOpenUrl {
        identifier
        url
      }
    }
    initialViewId
  }
}</pre></div><p>Though this seems like a sensible approach, we found a number of issues in practice.</p><p>First, components and actions aren’t like traditional GraphQL types for data fetching. A main selling point for GraphQL is that clients fetch only the fields they require. Well, the client can’t query some button fields and not others; the button won’t work without <code class="language-plaintext highlighter-rouge">onClick</code>!</p><p>Second, adding new fields must be done carefully. 
Let’s add a new <code class="language-plaintext highlighter-rouge">style</code> parameter to control the appearance of the button:</p><div class="language-graphql highlighter-rouge highlight"><pre>type ChaosButton implements ChaosComponent {
  identifier: String!
  text: String!
  style: ChaosButtonStyle
  onClick: [String!]!
}</pre></div><p>Unfortunately, we’ve already released the original button to mobile clients, and there are older app versions that don’t support <code class="language-plaintext highlighter-rouge">style</code>. How do we communicate to the CHAOS backend that the mobile client supports the new field?</p><p>The GraphQL server knows whether the client’s query includes the new field. We use <a href="https://www.apollographql.com/docs/apollo-server/">Apollo Server</a>, and it supplies an <a href="https://www.apollographql.com/docs/apollo-server/data/resolvers#resolver-arguments">info</a> argument to the component’s resolver with an abstract syntax tree (AST) representing the query. But we need to traverse through several nested arrays and objects to find whether <code class="language-plaintext highlighter-rouge">style</code> is part of the <code class="language-plaintext highlighter-rouge">ChaosButton</code> fragment:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/chaos-yelps-unified-framework-for-server-driven-ui/info-new-field.png" alt="" /></div><p>We also need to communicate to the CHAOS backend that the field is available. We’ll be constantly adding and (less frequently) removing fields. Do we send a list of supported fields for every component and action to the backend? That would add a considerable amount of overhead to each request.</p><p>The third issue is that adding a type has the same problem. 
Let’s add a new component to represent a block of styled text:</p><div class="language-graphql highlighter-rouge highlight"><pre>type ChaosText implements ChaosComponent {
  identifier: String!
  text: String!
  textStyle: ChaosTextStyle
  textAlignment: ChaosTextAlignment
}</pre></div><p>The client’s query must be updated to support the new component type:</p><div class="language-graphql highlighter-rouge highlight"><pre>query GetChaosView($name: String!) {
  chaosView(name: $name) {
    views {
      identifier
      layout {
        ... on ChaosSingleColumn {
          rows
        }
      }
    }
    components {
      ... on ChaosButton {
        identifier
        text
        style
        onClick
      }
      ... on ChaosText {
        identifier
        text
        textStyle
        textAlignment
      }
    }
    actions {
      ... on ChaosOpenUrl {
        identifier
        url
      }
    }
    initialViewId
  }
}</pre></div><p>To determine if the query includes the <code class="language-plaintext highlighter-rouge">ChaosText</code> fragment, the component’s GraphQL resolver must delve deep into the AST, then pass that information along to the CHAOS backend in a list of supported components (and actions):</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/chaos-yelps-unified-framework-for-server-driven-ui/info-new-type.png" alt="" /></div><p>In the end we decided that explicit, unversioned GraphQL types weren’t practical. We’d spend too much time and effort maintaining our GraphQL layer without much real benefit. The clients would be writing large queries, and the server would be parsing them. Instead, we modeled each component or action as a versioned REST object in JSON format.</p><p>Every component or action has a unique type string with an integer version number, such as <code class="language-plaintext highlighter-rouge">chaos.button.v1</code> and <code class="language-plaintext highlighter-rouge">chaos.open-url.v1</code>. GraphQL doesn’t natively support JSON or map fields, so parameters are stored in a stringified JSON object.</p><div class="language-plaintext highlighter-rouge highlight"><pre>type ChaosJsonComponent implements ChaosComponent {
    identifier: String!
    componentType: String!
    parameters: String!
}
type ChaosJsonAction implements ChaosAction {
    identifier: String!
    actionType: String!
    parameters: String!
}
</pre></div><p>For example, a button component in our GraphQL response looks like:</p><div class="language-json highlighter-rouge highlight"><pre>{
  "identifier": "primacy-cta",
  "componentType": "chaos.button.v1",
  "parameters": "{\"text\": \"Find local businesses\", \"onClick\": [\"open-search-url\"]}",
  "__typename": "ChaosJsonComponent"
}</pre></div><p>Clearly, the stringified JSON isn’t very readable. We’ve created developer tools to edit and debug CHAOS configurations.</p><p>We still use GraphQL types for views and layouts. These types change less frequently and contain the high-level structure of the UI, so direct readability is more useful. Internally, we still associate layouts with a unique versioned type string, e.g. <code class="language-plaintext highlighter-rouge">chaos.single-column.v1</code>, and we may switch to embedded REST objects for layouts, too. We’re still figuring out the right balance between GraphQL and REST, but we’ve been using the approach in production for more than two years without revisiting the decision.</p><p>Here’s a complete CHAOS configuration to see how everything comes together:</p><div class="language-json highlighter-rouge highlight"><pre>{
  "data": {
    "chaosView": {
      "views": [
        {
          "identifier": "consumer.welcome",
          "layout": {
            "__typename": "ChaosSingleColumn",
            "rows": [
              "welcome-to-yelp-header",
              "welcome-to-yelp-illustration",
              "find-local-businesses-button"
            ]
          },
          "__typename": "ChaosView"
        }
      ],
      "components": [
        {
          "__typename": "ChaosJsonComponent",
          "identifier": "welcome-to-yelp-header",
          "componentType": "chaos.text.v1",
          "parameters": "{\"text\": \"Welcome to Yelp\", \"textStyle\": \"heading1-bold\", \"textAlignment\": \"center\"}"
        },
        {
          "__typename": "ChaosJsonComponent",
          "identifier": "welcome-to-yelp-illustration",
          "componentType": "chaos.illustration.v1",
          "parameters": "{\"dimensions\": {\"width\": 375, \"height\": 300}, \"url\": \"https://media.yelp.com/welcome-to-yelp.svg\"}"
        },
        {
          "__typename": "ChaosJsonComponent",
          "identifier": "find-local-businesses-button",
          "componentType": "chaos.button.v1",
          "parameters": "{\"text\": \"Find local businesses\", \"style\": \"primary\", \"onClick\": [\"open-search-url\"]}"
        }
      ],
      "actions": [
        {
          "__typename": "ChaosJsonAction",
          "identifier": "open-search-url",
          "actionType": "chaos.open-url.v1",
          "parameters": "{\"url\": \"https://yelp.com/search\"}"
        }
      ],
      "initialViewId": "consumer.welcome",
      "__typename": "ChaosConfiguration"
    }
  }
}</pre></div><h2 id="versioning-components--actions">Versioning components &amp; actions</h2><p>When changing a component or action, we increment the version. For example, adding <code class="language-plaintext highlighter-rouge">style</code> to the CHAOS button introduces <code class="language-plaintext highlighter-rouge">chaos.button.v2</code>.</p><p>Clients have their own internal component libraries and use factories associated with each component type to map the CHAOS component to the internal component’s interface. Actions go through a similar mapping process.</p><p>CHAOS backends use a YAML config file to determine what component or action types can be used in a CHAOS configuration. The GraphQL layer passes information about the platform (React, iOS, or Android) to the CHAOS backend. For mobile clients, the GraphQL layer also passes the app version.</p><p>For web, we can update all our React clients simultaneously using <a href="https://engineeringblog.yelp.com/2023/03/gondola-an-internal-paas-architecture-for-frontend-app-deployment.html">Gondola</a>, Yelp’s PaaS for front-end deployment. Therefore, we use <code class="language-plaintext highlighter-rouge">web: true</code> to indicate that a type is available for web clients.</p><p>For mobile clients, we can’t update older versions. We also have distinct apps for consumers &amp; business owners on each platform. 
Therefore, we use <code class="language-plaintext highlighter-rouge">start: &lt;app version&gt;</code> to indicate the first app version that supports a type, and each app/platform combination has its own value.</p><div class="language-yaml highlighter-rouge highlight"><pre>components:
  - type: chaos.button.v1
    web: true
    consumer-ios:
      start: 22.1.0
    consumer-android:
      start: 22.3.0
    biz-ios:
      start: 22.1.0
    biz-android:
      start: 22.6.0
actions:
  - type: chaos.open-url.v1
    web: true
    consumer-ios:
      start: 22.1.0
    consumer-android:
      start: 22.3.0
    biz-ios:
      start: 22.1.0
    biz-android:
      start: 22.6.0
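# A change to a type bumps its version and gets its own gated entry.
# For example, chaos.button.v2 (which adds the `style` parameter) would
# require newer app versions; the version numbers below are hypothetical:
#
# components:
#   - type: chaos.button.v2
#     web: true
#     consumer-ios:
#       start: 23.1.0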
</pre></div><p>We shipped the first CHAOS use case to production in early 2022, only a few months after starting development. Since then, we’ve been regularly shipping new use cases. CHAOS development is entirely use-case driven. We add new layouts, components, and actions when they are required.</p><p>CHAOS isn’t intended to replace traditional UI development. We use CHAOS where it makes sense. Usually, a good use case for CHAOS satisfies one or more of the following conditions:</p><ul><li>It must be consistent across multiple clients.</li>
<li>It has dynamic, highly contextual content.</li>
<li>It must be updated quickly on mobile clients.</li>
</ul><p>For example, CHAOS manages the Yelp for Business support flow on web and mobile clients. When a business owner opens the support flow, we show a CHAOS view with a list of support options:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/chaos-yelps-unified-framework-for-server-driven-ui/chaos-support-flow.png" alt="" /></div><p>Some business owners use multiple clients, and some businesses are managed by multiple owners who use different clients. Therefore, we want to show consistent support options on all clients.</p><p>Support options are also dynamic and highly contextual. Live chat or phone support isn’t available 24/7, and the phone number depends on location.</p><p>Finally, if there’s a technical issue such as an outage, we want to update our mobile clients quickly without waiting for an app release. By adding a note that we’re aware of the issue and working on it, we can keep business owners informed and avoid unnecessary support calls.</p><p>With CHAOS, the support options can be updated on all clients by deploying a change to a single backend service.</p><p>As we adopt CHAOS more broadly within Yelp, we’ve identified some key areas for future investment.</p><h2 id="automated-previews">Automated previews</h2><p>To verify changes to a CHAOS view, a backend developer tests each client manually.</p><p>Though testing web clients is relatively straightforward – everyone has access to a browser – testing mobile clients requires access to simulators or physical devices. Before Yelp switched to remote work, we maintained a mobile device library in each engineering office. After the switch, we integrated with a cloud-based testing solution from a vendor. Even so, manual testing is cumbersome for a backend developer who needs to verify multiple platforms or app versions.</p><p>In the future, we plan to support automated previews. 
When a backend developer publishes a GitHub PR with changes to a CHAOS view, we’ll automatically generate previews for each platform and attach them to the PR when ready.</p><p>Currently, when a product manager or designer wants to change a CHAOS view, they must ask a backend developer. The backend developer changes the Python code that configures the CHAOS view, creates a PR, gets it approved, and deploys the changes to production. Even simple changes, such as updating copy, can take 30 minutes to several hours.</p><p>In the future, we plan to support no-code configuration updates for product managers and designers through internal editing tools.</p><h2 id="optimization-strategies">Optimization strategies</h2><p>Although optimization is a core part of the CHAOS backronym, we haven’t implemented any optimization strategies for CHAOS content. Selecting, ordering, and configuring CHAOS content must be done manually in Python code.</p><p>In the future, we plan to use ML to automatically select, order, and configure some CHAOS content.</p><p>This is the first in a series of blog posts about CHAOS. In upcoming blog posts, our client engineers will explain how CHAOS works on Web, iOS, and Android clients, and our backend engineers will explain how to build a CHAOS backend in Python.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2024/03/chaos-yelps-unified-framework-for-server-driven-ui.html</link>
      <guid>https://engineeringblog.yelp.com/2024/03/chaos-yelps-unified-framework-for-server-driven-ui.html</guid>
      <pubDate>Thu, 14 Mar 2024 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Keeping track of engineering-wide goals and migrations]]></title>
      <description><![CDATA[<p>EE Metrics was envisioned as a hub that helps teams manage their technical debt. EE Metrics provides every team with a detailed web page that contains information about technical debt that needs to be addressed. It also serves as a platform to highlight top engineering initiatives at the organization level.</p><p>EE Metrics empowers infrastructure teams to surface important migrations or metrics that could improve the health of software projects. Organization-wide migrations of technologies can often be difficult to surface and keep track of.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-13-keeping-track-of-engineering-wide-goals-and-migrations/1_lifecycle.png" alt="Figure 1: Diagram showing how EE Metrics is interacted with and consumed" /><p class="subtle-text"><small>Figure 1: Diagram showing how EE Metrics is interacted with and consumed</small></p></div><p>Most users browse their team’s health reports within their team-specific page to understand which migrations and health metrics they need to address based on impact and priority.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-13-keeping-track-of-engineering-wide-goals-and-migrations/2_overview.png" alt="Figure 2: High level overview of the architecture of EE Metrics" /><p class="subtle-text"><small>Figure 2: High level overview of the architecture of EE Metrics</small></p></div><p>EE Metrics contains two key components - a backend service that collects and calculates audit results, and a frontend service that exposes a web application. The primary interface used by users is the web application. The web application allows audit authors to create audits. These audits can be viewed in full detail within their respective pages and are surfaced through Team Health Reports. 
Team Health Reports attempt to analyze a team’s health in various categories and identify areas of improvement, which is the primary purpose of EE Metrics.</p><p>The Team Health Reports act as a data-driven communication platform between infrastructure teams and product teams. There are two primary categories of metrics that comprise “Audits” in EE Metrics. First, there are org-wide initiatives called “Migrations” that are created by infrastructure teams. These initiatives include code and infrastructure updates that improve the health of software projects from a velocity, quality, reliability, and security perspective. Another set of org-wide initiatives that EE Metrics surfaces is called “Health Checks”. These tend to be recurring long-term metrics that teams attempt to keep within certain thresholds. An example would be Test Run Times. Keeping the run times of all owned services under a certain threshold gives the team confidence that it can continue to ship features reliably and quickly.</p><p>The EE Metrics Team Health Report allows teams to view the overall health of their developer velocity, code quality, reliability, and security, and gives them their top-priority action items to improve in each of these areas. This helps with balancing the pressure of shipping new product features versus maintenance work.</p><h2 id="how-do-team-health-reports-work">How do team health reports work?</h2><p>Team Health Reports are driven by a series of audits that are run against all of a team’s entities (services, libraries, files, directories, etc.). Entities can be any piece of technology or concept that can be owned by teams. To help assign audits to teams, we use the Ownership service to determine which entities fall under the team’s health report (for more information about ownership, check out our <a href="https://engineeringblog.yelp.com/2021/01/whose-code-is-it-anyway.html">blog post on Ownership</a>). 
Once the health report is generated for a team, it lists the action items teams can take to make improvements, ranked in order of impact and priority. The results of these audits are collected once a day and can be viewed in the EE Metrics web application, or through a monthly email report sent to the team and org leaders. The statuses of previous audits are also preserved so that users can view historical results and spot any trends.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-13-keeping-track-of-engineering-wide-goals-and-migrations/3_report.png" alt="Figure 3: This is a snapshot of a team’s health report as seen in the web application" /><p class="subtle-text"><small>Figure 3: This is a snapshot of a team’s health report as seen in the web application</small></p></div><h2 id="what-are-these-scores">What are these scores?</h2><p>The scores in the figure above (figure 3) represent how effective your team is based on the number of audits outstanding or completed. Audits have a weight assigned to them based on their priority. This helps users understand which audits require more immediate attention. These scores are primarily driven by the following weighting:</p><ul><li>60% of your score is attributed to audits weighted as HIGH.</li>
<li>30% of your score is attributed to audits weighted as MED.</li>
<li>10% of your score is attributed to audits weighted as LOW.</li>
</ul><p>There are other factors that affect scores, such as whether a migration is overdue, whether it’s an informational audit, or whether it is a pending new audit. Primarily, we came up with this scoring to ensure that if a team has completed all their high-weighted audits, they are deemed to be in good standing.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-13-keeping-track-of-engineering-wide-goals-and-migrations/4_email.png" alt="Figure 4: This is a snapshot of a team’s health report in the form of an email surfacing important notes" /><p class="subtle-text"><small>Figure 4: This is a snapshot of a team’s health report in the form of an email surfacing important notes</small></p></div><h2 id="audit-creation-and-guidelines">Audit Creation and Guidelines</h2><p>Audits are created by infrastructure teams. These can be one-time initiatives such as migrating off a deprecated service. Audits can also be long-term measurements of metrics that must pass a specific threshold or be within acceptable bounds. An example would be measuring how often a test fails during the release process. If the number of test failures exceeds a specific threshold, it suggests unreliable tests and needs to be addressed.</p><p>Infrastructure teams are empowered to add new audits to EE Metrics when they are trying to enact change in their areas of ownership. These audits are powered by various data sources collected by the EE Metrics Events Pipeline and additional platform services: these are called metrics and are required for audits to determine the state of an entity. Once a metric is tracked, writing a new audit to the platform is simple. 
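</p><p>As an illustration, an audit definition plus the weighting scheme described earlier (60% HIGH, 30% MED, 10% LOW) might be sketched as follows. The real EE Metrics data model and scoring internals aren’t public, so every name here is hypothetical:</p>

```python
from dataclasses import dataclass

# Share of the overall score attributed to each weight tier, per the
# scheme described above (hypothetical implementation).
WEIGHT_SHARE = {"HIGH": 0.60, "MED": 0.30, "LOW": 0.10}

@dataclass
class Audit:
    name: str
    weight: str   # "HIGH", "MED", or "LOW"
    passing: bool

def team_score(audits: list) -> float:
    """Each tier contributes its share of the score, scaled by the fraction
    of that tier's audits the team has completed."""
    score = 0.0
    for tier, share in WEIGHT_SHARE.items():
        tier_audits = [a for a in audits if a.weight == tier]
        if not tier_audits:
            score += share  # no audits in this tier: full credit
            continue
        score += share * sum(a.passing for a in tier_audits) / len(tier_audits)
    return round(100 * score, 1)

audits = [
    Audit("migrate-off-deprecated-service", "HIGH", passing=True),
    Audit("test-run-time-under-threshold", "MED", passing=False),
    Audit("readme-up-to-date", "LOW", passing=True),
]
print(team_score(audits))  # HIGH and LOW complete, MED outstanding -> 70.0
```

<p>Note how, under this weighting, completing all HIGH audits alone already puts a team at a respectable score, matching the intent that such teams are deemed to be in good standing.</p><p>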
After many iterations of audits, we came up with a set of guidelines for writing a new audit:</p><ul><li>Audits should contain enough context for teams to address and solve them - if an audit requires a lot of external context, teams should be directed to additional documentation to help them understand its requirements.</li>
<li>Audits should be actionable by the teams themselves. If improvements require heavy lifting from an infrastructure team, the infrastructure team should directly drive those improvements.</li>
<li>Audits should be targeted at the team level across the engineering organization. For example, checking for a particular antipattern one specific developer introduced is not the goal for audits.</li>
</ul><p>Once a new audit configuration is deployed, the Team Health Reports are updated to include the new audit.</p><p>We’ve taken a democratic approach, allowing infrastructure teams to define their audits’ thresholds and impact levels by establishing clear criteria and providing guidance. While we initially had concerns that infrastructure teams would view their audits as always having the highest impact, we found metric owners have a good understanding of how their audits fit into the bigger picture of a team’s overall health.</p><p>Required Migrations are any engineering efforts highlighted at the organizational level that are deemed important. These are engineering initiatives that are to be completed by their respective due dates. Some examples of a Required Migration could be an internal migration of services from a deprecated technology to a new one, or organization-level upgrades to repositories. These are migrations for technologies that pose the most risk or have outsized benefits across the entire engineering org.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-13-keeping-track-of-engineering-wide-goals-and-migrations/5_required.png" alt="Figure 5: Example of failing Required Migrations" /><p class="subtle-text"><small>Figure 5: Example of failing Required Migrations</small></p></div><h2 id="why-is-ee-metrics-important-for-required-migrations">Why is EE Metrics important for Required Migrations?</h2><p>It can be difficult to highlight and keep track of engineering initiatives that are important at the organization level. Since EE Metrics collects and displays audit results, it can provide an accurate assessment of an engineering initiative’s completion. It also provides a platform to keep track of and send detailed reports on the progress of these initiatives at the team and organization level. 
Teams often do not have the bandwidth to address all of the audits surfaced. To alleviate this, Required Migrations serve as a way to prioritize engineering initiatives. Required Migrations are part of the org-wide roadmap planning process, where teams must commit time to addressing these migrations. The goal of EE Metrics is to further increase visibility of these initiatives within the organization.</p><p>Determining whether a migration is a top initiative or not depends on several factors. Generally, the overall process is as follows:</p><ul><li>Migration authors work with their Engineering Manager to propose escalating migrations based on importance, severity, and the potential consequences of leaving them undone.</li>
<li>Various EMs, TPMs, and Directors coordinate the tentative list of required migrations.</li>
<li>VPs approve the list of migrations, which is then labeled as Required Migrations.</li>
<li>Migration authors are designated as the owners overseeing the completion of their migrations. A corresponding migration and audit are created in the EE Metrics services for each Required Migration.</li>
</ul><p>Once teams and organization leaders are aware of the required migrations, it becomes easier to ensure these migrations are completed by a specific date.</p><p>EE Metrics serves as a hub for employees to easily identify engineering initiatives and issues that need to be resolved. By handling these issues and performing migrations early on, teams reduce technical debt and improve developer effectiveness. As an organization grows and expands, identifying and communicating engineering initiatives and potential issues becomes harder without a centralized platform.</p><p>A team at Yelp had a cohorting issue with an experiment they were running. This caused a lot of headaches: the problem was difficult to identify. The team in question checked their EE Metrics Team Health Report and found the audit pointing out deficiencies in their experiment.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-13-keeping-track-of-engineering-wide-goals-and-migrations/6_experiment.png" alt="Figure 6: The audit that was pointing out that one of their experiments was deficient" /><p class="subtle-text"><small>Figure 6: The audit that was pointing out that one of their experiments was deficient</small></p></div><p>The team was able to solve their issue and strived to keep improving their EE Metrics scores. 
EE Metrics proved helpful enough that the team decided to share their experience with us and describe how it had helped them.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-13-keeping-track-of-engineering-wide-goals-and-migrations/7_thanks.png" alt="The team at Yelp provided a nice testimonial for our team" /><p class="subtle-text"><small>The team at Yelp provided a nice testimonial for our team</small></p></div><p>We’re delighted by all the internal usage of EE Metrics and we will continue to iterate and develop tools to better surface technical debt at the company. We hope to see EE Metrics continue evolving into a powerful tool for addressing technical debt.</p><p>We would like to send a warm thank you to all past, present and future individuals who have contributed to the development of EE Metrics.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2024/03/keeping-track-of-engineering-wide-goals-and-migrations.html</link>
      <guid>https://engineeringblog.yelp.com/2024/03/keeping-track-of-engineering-wide-goals-and-migrations.html</guid>
      <pubDate>Wed, 13 Mar 2024 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Yelp’s AI pipeline for inappropriate language detection in reviews]]></title>
      <description><![CDATA[<p>Yelp’s mission is to connect consumers with great local businesses by giving them access to reliable and useful information. Consumer trust is one of our top priorities, which is why we make significant investments in technology and human moderation to protect the integrity and quality of content on Yelp. As a platform for user-generated content, we rely on our community of users and business owners to help report reviews that they believe may violate our <a href="https://terms.yelp.com/tos/en_us/20240222_en_us/">Terms of Service</a> and <a href="https://www.yelp.com/guidelines">Content Guidelines</a>. Our User Operations team investigates all flagged content and, if it’s found to be in violation of our policies, may remove it from the platform.</p><p>Beyond user reporting, Yelp also has proactive measures in place that help mitigate hate speech and other forms of inappropriate content through the use of automated moderation systems. In this pursuit, Yelp recently enhanced its technology stack by deploying <strong>Large Language Models (LLMs)</strong> to help surface or identify egregious instances of threats, harassment, lewdness, personal attacks, or hate speech.</p><p>Automating inappropriate content detection in reviews is a complex task. Given the potential complexities of different contexts, several considerations go into creating a tool that can confidently flag content violating our policies. In the absence of high precision, such a tool can have significant consequences, including delays in evaluating reviews, while less stringent measures can result in the publication of inappropriate and unhelpful content to the public. To address this, we have iterated through several approaches to achieve higher precision and recall in the detection of inappropriate content. These precision-recall tradeoffs drove us to adopt LLMs, which have been largely successful in the field of natural language processing. 
In particular, we explored the efficacy of LLMs to identify egregious content, such as:</p><ul><li>Hate speech (including disparaging content targeting individuals or groups based on their race, ethnicity, religion, nationality, gender, sexual orientation, or disability)</li>
<li>Lewdness (including sexual innuendos, pickup lines, solicitation of sexual favors, as well as sexual harassment)</li>
<li>Threats, harassment, or other extreme forms of personal attacks</li>
</ul><p>Unrelated to this automated system, as previously mentioned, Yelp allows both consumers and business owners to report reviews they believe violate our content policies, including reviews that contain threats, harassment, lewdness, hate speech, or other displays of bigotry. In 2022, <a href="https://issuu.com/yelp10/docs/2022_yelp_trust_safety_report?fr=sZmZkYzU3NDM2NzY">26,500+ reported reviews were removed</a> from Yelp’s platform for containing threats, lewdness, and hate speech. These reported reviews, along with Yelp’s pre-existing systems that curb inappropriate reviews in real-time, provided us with a large dataset to fine-tune LLMs for the given binary classification task, where the goal was to classify reviews as appropriate or inappropriate, in real-time.</p><p>To train the LLM for classification, we had access to a sizeable dataset of reviews identified as inappropriate in the past. However, given the inherent complexity of language, especially in the presence of metaphors, sarcasm and other figures of speech, it was necessary to more precisely define the task of inappropriate language detection to the LLM. To accomplish this, we collaborated with Yelp’s User Operations team to curate a high-quality dataset comprising the most egregious instances of inappropriate reviews, as well as reviews that adhered to our content guidelines. A pivotal strategy here was the introduction of a scoring scheme that enabled moderators to signal to us the severity level of inappropriateness in a review. To further augment the dataset, we also implemented similarity techniques using sentence embeddings from LLMs, and identified additional reviews that were similar to the high-quality samples we obtained from moderator annotation.</p><p>Apart from this, we also applied sampling strategies on the training data specifically to increase model recall. 
In order to train a model that can recognize different forms of inappropriate content, it is necessary to have a dataset with enough samples from different sub-categories of inappropriate content. Unfortunately, a large number of reviews that we curated did not contain this information. To solve this problem, we leveraged the zero-shot and few-shot classification capabilities of LLMs to identify the sub-category of inappropriate content and performed under-sampling or over-sampling where needed.</p><p>Using the carefully curated data, we began investigating the effectiveness of large language models for the given text classification task. We downloaded LLMs from the <a href="https://huggingface.co/docs/hub/models-the-hub">HuggingFace model hub</a> and computed sentence embeddings on the preprocessed review samples. Using these embeddings, we determined the separation between appropriate and inappropriate samples by evaluating the silhouette score between the two groups, as well as by plotting them in a two-dimensional space after dimensionality reduction with t-SNE. 
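</p><p>The separation check can be illustrated with a small pure-Python sketch. The points below are made-up two-dimensional stand-ins; the real inputs are high-dimensional sentence embeddings produced by the LLM:</p>

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient for two groups: for each point,
    a = mean distance to its own group, b = mean distance to the other
    group, s = (b - a) / max(a, b). Values near 1 mean well separated."""
    scores = []
    for i, p in enumerate(points):
        same = [math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        other = [math.dist(p, q) for j, q in enumerate(points)
                 if labels[j] != labels[i]]
        a, b = sum(same) / len(same), sum(other) / len(other)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy stand-ins: one tight "appropriate" cluster, one tight "inappropriate" one.
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
labels = ["ok", "ok", "ok", "bad", "bad", "bad"]
print(round(silhouette(points, labels), 2))  # close to 1: clearly separated
```

<p>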
The separation was fairly apparent as can be seen in the figure below.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-12-ai-pipeline-inappropriate-language-detection/ham_spam_separation.png" alt="Visualizing separation between appropriate/inappropriate reviews on model embeddings" /><p class="subtle-text"><small>Visualizing separation between appropriate/inappropriate reviews on model embeddings</small></p></div><p>Encouraged by this, we minimally fine-tuned the same model on the dataset for the given classification task and saw successful results on the class-balanced dataset (see metrics below).</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-12-ai-pipeline-inappropriate-language-detection/balanced_data_model_metrics.png" alt="Trained model metrics on balanced test data" /><p class="subtle-text"><small>Trained model metrics on balanced test data</small></p></div><p>Although the metrics were promising, we still needed to assess the false positive rate generated by the model in real-time traffic. This is because the spam prevalence in actual traffic is very low, so we needed to be extremely careful in our assessment of the model’s performance in real-time and choose a threshold that helps generate high precision.</p><p>In order to simulate the model’s performance in real-time, we generated many sets of mock traffic data with different degrees of spam prevalence. The result of this analysis allowed us to determine the model threshold at which we can identify inappropriate reviews with an accepted range of confidence. Now we were ready to push the model’s deployment to actual traffic on Yelp.</p><p>The following flow diagram illustrates the deployment architecture. Historical reviews stored in Redshift were selected for labeling and similarity matching (as described in the data curation section). 
The curated dataset is stored in an S3 bucket and fed into the model training batch script. The model generated from the batch is registered in MLflow, from which it is loaded into MLeap for serving predictions inside a service container (the model server component in the picture below). Please refer to this <a href="https://engineeringblog.yelp.com/2020/07/ML-platform-overview.html">blog post</a> from 2020 for more details on Yelp’s ML platform.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-12-ai-pipeline-inappropriate-language-detection/deployment_architecture.png" alt="Model training &amp; deployment process" /><p class="subtle-text"><small>Model training &amp; deployment process</small></p></div><p>Since incorporating LLMs to help detect harmful and inappropriate content, we have enabled our moderators to proactively prevent <strong>23,600+ reviews from ever publishing to Yelp in 2023</strong>.</p><p>Yelp makes significant investments in its content moderation efforts to protect consumers and businesses. Recent advancements in Large Language Models have showcased their potential in understanding context, presenting us with a significant opportunity in the field of inappropriate content detection. Through a series of strategies, we have now deployed a Large Language Model to live traffic for the purpose of identifying reviews that contain egregious instances of hate speech, vulgar language, or threats and are thereby not in compliance with our Content Guidelines. The flagged reviews are manually reviewed by our User Operations team, and through this combined effort, we have proactively prevented several harmful reviews from ever being published on Yelp. However, we still continue to rely on our community of users to report inappropriate reviews. 
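</p><p>To see why the mock-traffic analysis above was necessary, note that for a fixed model, precision falls sharply as spam prevalence drops. A small sketch makes this concrete; the recall and false-positive-rate figures below are invented purely for illustration:</p>

```python
# With a fixed per-review recall and false positive rate, precision is a
# function of prevalence (Bayes' rule). Numbers are illustrative only.
def precision(recall: float, fpr: float, prevalence: float) -> float:
    true_pos = recall * prevalence
    false_pos = fpr * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

for prev in (0.5, 0.05, 0.001):
    print(f"prevalence={prev:>6}: precision={precision(0.95, 0.05, prev):.3f}")
```

<p>On a class-balanced test set such a model looks excellent, but at realistic prevalence most flags would be false positives, which is why the decision threshold had to be chosen against traffic-like class balance.</p><p>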
Based on the decisions made by moderators and subsequent retraining of the model, we anticipate further improvements in the model’s recall in the future.</p><p>I would like to acknowledge everyone who was involved in this project. Special thanks to Marcello Tomasini, Jonathan Wang, and Jiachen Zhao for contributing to the design and implementation of the work described here. I’d also like to thank members of the ML infra team, Yunhui Zhang, Ludovic Trottier, Shuting Xi, and Jason Sleight for enabling LLM deployment, and the members of the User Operations team.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2024/03/ai-pipeline-inappropriate-language-detection.html</link>
      <guid>https://engineeringblog.yelp.com/2024/03/ai-pipeline-inappropriate-language-detection.html</guid>
      <pubDate>Tue, 12 Mar 2024 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Building data abstractions with streaming at Yelp]]></title>
      <description><![CDATA[<p>Yelp relies heavily on streaming to synchronize enormous volumes of data in real time. This is facilitated by Yelp’s underlying <a href="https://engineeringblog.yelp.com/2016/07/billions-of-messages-a-day-yelps-real-time-data-pipeline.html">data pipeline infrastructure</a>, which manages the real-time flow of millions of messages originating from a plethora of services. This blog post covers how we leverage Yelp’s extensive streaming infrastructure to build robust data abstractions for our offline and streaming data consumers. We will use Yelp’s Business Properties ecosystem (explained in the upcoming sections) as an example.</p><p>Let’s start by covering certain key terms used throughout the post:</p><ul><li>
<p><strong>Offline systems</strong> - data warehousing platforms such as AWS Redshift or <a href="https://engineeringblog.yelp.com/2021/04/powering-messaging-enabledness-with-yelps-data-infrastructure.html">Yelp’s Data Lake</a>, which are intended for large-scale data analysis</p>
</li>
<li>
<p><strong>Online systems</strong> - systems designed around high-performance SQL and NoSQL database solutions like MySQL or Cassandra, specifically built to handle and serve live traffic in real time, typically via REST APIs over HTTP. These databases are optimized for swiftly processing and delivering data as it’s generated or requested, making them crucial for applications and services that require immediate access to up-to-date information</p>
</li>
</ul><p>Generally speaking, ‘Business Property’ can be any piece of data that is associated with a Yelp business. For example, if we’re talking about a restaurant, its business properties could include things like what payment methods it accepts, what amenities it provides, and when it is open for business.</p><p>There are two types of business properties: Business Attributes and Business Features. You may notice that the terms ‘attributes’ and ‘features’ are synonymous with each other, and that’s no accident. The primary distinction is that Business Attributes belong to the legacy system, <strong>yelp-main</strong>, while Business Features are in a dedicated microservice, aligning with Yelp’s transition to Service-Oriented Architecture.</p><p>We also gather additional metadata about business properties themselves, such as when they were last modified, how confident we are in their accuracy, and where they originated from. This additional information is referred to as “properties metadata.” We store this metadata in a separate table, which contains data about both Business Features and Business Attributes.</p><p>Business properties data is accessed via two primary methods: HTTP APIs for real-time online applications and streaming for offline data synchronization. This post mainly focuses on the streaming aspect.</p><h2 id="existing-business-properties-streaming-architecture">Existing Business Properties’ streaming architecture</h2><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/building-data-abstractions-with-streaming-at-yelp/existing_streaming_architecture.png" alt="Existing Business Properties' streaming architecture" /><p class="subtle-text"><small>Existing Business Properties' streaming architecture</small></p></div><ol><li>
<p>In yelp-main’s MySQL database, data for Business Attributes is scattered across more than a dozen tables. To share this data efficiently, we employ the <a href="https://engineeringblog.yelp.com/2016/08/streaming-mysql-tables-in-real-time-to-kafka.html">MySQL Replication Handler</a> to push it to <a href="https://kafka.apache.org/intro">Kafka</a></p>
</li>
<li>
<p>Business Features and metadata for business properties are stored in their respective tables in Cassandra, and we use the <a href="https://engineeringblog.yelp.com/2019/12/cassandra-source-connector-part-1.html">Cassandra Source Connector</a> to publish their data into Kafka</p>
</li>
<li>
<p>Ultimately, we use <a href="https://engineeringblog.yelp.com/2016/10/redshift-connector.html">Redshift Connector</a> to synchronize data from all these tables with their corresponding tables in Redshift. This process allows us to maintain an up-to-date dataset in Redshift for analysis and reporting</p>
</li>
</ol><h2 id="challenges-with-the-existing-workflow">Challenges with the existing workflow</h2><ul><li>
<p><strong>Weak Encapsulation</strong>: Storing data in offline systems exactly as it is stored in source databases forces our clients to understand the inner workings of the source data, which weakens data encapsulation. Ideally, we wanted to abstract away distinctions like ‘Business Features’ and ‘Business Attributes’ and hide implementation details from clients to simplify their interactions. Furthermore, exposing raw data to offline consumers can lead to the disclosure of outdated or incorrect information. Transformation layers via REST APIs prevented online users from facing data discrepancies. However, offline users analyzing raw data still had to grapple with data accuracy issues, such as managing soft-deleted entries.</p>
</li>
<li>
<p><strong>Discovery and consumption</strong>: The lack of proper abstractions also made data analysis and consumption challenging, as it meant that consumers, whether Product Managers, Data Analysts, or batch processing systems, had to create multiple workflows to collect data from various sources. Dealing with edge cases and transforming data into a consistent schema added significant effort and cost, increasing the friction of consumption and reducing the general utility of the data.</p>
</li>
<li>
<p><strong>Maintenance challenges</strong>: It also posed certain maintenance challenges as any alteration in the source schema necessitated corresponding changes in the destination store. Ideally, we would prefer the destination store’s schema to be more flexible, dynamic, and less susceptible to changes. This minimizes disruptions for users and mitigates the risk of infrastructure problems due to frequent schema upgrades. It also underscores the fact that a storage schema suitable for one database system might not be ideal for another.</p>
</li>
</ul><p>We did explore various alternatives, including a non-streaming solution that involved using Apache Spark for routine batch executions to generate data dumps in diverse formats. However, as some of the data consumer use cases required relatively real-time updates, we had to lean towards a streaming approach.</p><h2 id="building-robust-data-abstractions-for-both-offline-and-streaming-data-consumers">Building robust data abstractions for both offline and streaming data consumers</h2><p>We tackled the aforementioned challenges by treating both streaming and offline data consumption as just additional channels for accessing and utilizing data, much like online HTTP clients. Similar to how we simplify complexities for online data consumers through REST APIs, we aimed to provide a consistent experience for streamed data by abstracting away internal implementation details. This means that if a client service transitions from consuming data directly through REST APIs to an asynchronous streaming approach, it will encounter similar data abstractions. For example, just as online consumers won’t see stale or invalid data, the same principle applies to streamed data consumers.</p><p>In order to achieve the same, we implemented a unified stream that delivers all relevant business property data in a consistent and user-friendly format. This approach ensures that Business Property consumers are spared from navigating the nuances between Business Attributes and Features or understanding the intricacies of data storage in their respective online source databases.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/building-data-abstractions-with-streaming-at-yelp/new_streaming_architecture.png" alt="New consolidated business properties streaming architecture" /><p class="subtle-text"><small>New consolidated business properties streaming architecture</small></p></div><ol><li>
<p><strong>Business Attributes data collection and transformation</strong>: we utilize <a href="https://beam.apache.org/">Apache Beam</a> with <a href="https://flink.apache.org/">Apache Flink</a> as the distributed processing backend for data transformation and formatting Business attribute data. Apache Beam transformation jobs process data originating from various input streams generated by the MySQL replication handler. These streams contain replicated data from their corresponding MySQL tables. The transformation jobs are responsible for standardizing the incoming streaming data, transforming it into a consistent format across all business properties. The transformed data is then published into a single unified stream.</p>
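<p>As an illustration, the standardization step can be sketched in plain Python (the real jobs are Apache Beam transforms; the field names below are hypothetical):</p>

```python
# Plain-Python stand-in for the per-record standardization performed by the
# Beam transformation jobs. Field names are hypothetical.

def standardize_attribute(raw):
    """Map a replicated MySQL attribute row to the unified property format."""
    return {
        "business_id": raw["business_id"],
        "property_name": raw["attribute_name"],
        "property_value": raw["attribute_value"],
        "source": "business_attribute",  # lets consumers ignore the origin store
        "updated_at": raw["row_updated_at"],
    }

unified = standardize_attribute({
    "business_id": 42,
    "attribute_name": "accepts_credit_cards",
    "attribute_value": "true",
    "row_updated_at": "2024-01-15T10:00:00Z",
})
```

<p>In the real pipeline this mapping would run inside a Beam transform, with one such mapping per source table, all emitting into the single unified stream.</p>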
</li>
<li>
<p><strong>Streaming Business Features</strong>: in a similar fashion, the output stream for Business Features, sourced from Cassandra using a <a href="https://engineeringblog.yelp.com/2019/12/cassandra-source-connector-part-1.html">source connector</a>, also has its dedicated Apache Beam transformer job. This job formats the data to match the unified format used for Business Attributes, and the resulting data is published into the same unified output stream.</p>
</li>
<li>
<p><strong>Enrich data with properties metadata</strong>: we employed a <a href="https://engineeringblog.yelp.com/2018/12/joinery-a-tale-of-unwindowed-joins.html">Joinery Flink</a> job - a homegrown solution at Yelp commonly used for joining data across multiple Kafka topics - to amalgamate the business data for both Business Attributes and Features with the corresponding metadata. As a result, the data stream not only contains the business properties data but also the relevant metadata linked to each property.</p>
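<p>Conceptually, the join keeps the latest metadata seen for each property and enriches each property event with it. A minimal sketch (not the actual Joinery/Flink code; field names are hypothetical):</p>

```python
# Minimal keyed-join sketch: buffer the latest metadata per property name and
# enrich each property event with it. Joinery does this with Flink state over
# Kafka topics; the field names here are hypothetical.

metadata_by_property = {}

def on_metadata(event):
    # Remember the most recent metadata for this property name.
    metadata_by_property[event["property_name"]] = event["metadata"]

def on_property(event):
    # Attach whatever metadata we have seen for this property so far.
    return {**event, "metadata": metadata_by_property.get(event["property_name"], {})}

on_metadata({"property_name": "accepts_credit_cards",
             "metadata": {"display_name": "Accepts Credit Cards"}})
enriched = on_property({"business_id": 42,
                        "property_name": "accepts_credit_cards",
                        "property_value": "true"})
```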
</li>
<li>
<p><strong>Final data formatting</strong>: a final transformation job addresses data inconsistencies, removes invalid data entries, and adds any necessary supplementary fields before the consolidated business-properties-with-metadata stream is exposed for consumption.</p>
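<p>A sketch of what such a cleanup pass might do, assuming hypothetical field names and a soft-delete flag:</p>

```python
# Hypothetical cleanup pass: drop soft-deleted or malformed entries and keep
# only the latest record per (business_id, property_name).

def finalize(records):
    latest = {}
    for rec in records:
        if rec.get("is_deleted") or "property_value" not in rec:
            continue  # hide soft-deleted / invalid data from consumers
        key = (rec["business_id"], rec["property_name"])
        if key not in latest or rec["updated_at"] > latest[key]["updated_at"]:
            latest[key] = rec  # newer record wins, removing duplicates
    return list(latest.values())

cleaned = finalize([
    {"business_id": 1, "property_name": "wifi", "property_value": "free", "updated_at": 1},
    {"business_id": 1, "property_name": "wifi", "property_value": "paid", "updated_at": 2},
    {"business_id": 2, "property_name": "wifi", "property_value": "no", "updated_at": 1,
     "is_deleted": True},
])
```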
</li>
<li>
<p><strong>Offline data storage</strong>: the processed business properties data, complete with metadata, is made available for offline consumption and ends up in Redshift through the Redshift Connector. Additionally, it is ingested into Yelp’s Data Lake using a Data Lake connector, making it available for a broader range of analytics and data processing tasks.</p>
</li>
<li>
<p><strong>Real-time consumption and integration</strong>: the same consolidated data stream can cater to real-time consumption by other services within the organization. We use the same stream to sync business property data with Marketing systems, as they require timely syncs for their campaigns.</p>
</li>
</ol><p>To summarize, with the architecture described above, we have created a unified business properties stream addressing the challenges with the existing workflow mentioned above. This stream is utilized to sync business properties data into offline systems, enabling users to access all business properties through a singular schema, thereby facilitating data discovery, consumption, and overall ease of use.</p><p>Additionally, this approach allowed us to enrich business property data with associated metadata and resolve data inconsistencies, such as removing duplicate business properties etc. We used the <a href="https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80%93value_model">entity–attribute–value (EAV) model</a>, which accommodates the frequent introduction of new business properties without requiring modifications to the destination store schemas, hence reducing some of the maintenance overhead.</p><p>This post shows how Yelp’s robust data pipeline infrastructure can be leveraged to create sophisticated data pipelines that provide data in formats which are more suited and beneficial for both offline and streaming users. 
This doesn’t imply that streaming and exposing raw data is never appropriate; in such situations, it may be more effective to offer multiple streams: one with the raw data and others with processed data better suited to analysis and consumption.</p><p>I would like to thank the members of the Semantic Business Information team and the different streaming teams at Yelp that helped in making this project a reality.</p><p>Special thanks to Joshua Flank, Abhishek Agarwal, Ryan Irwin and Sudhakar Duraiswamy for providing insightful inputs and reviewing the blog.</p><div class="island job-posting"><h3>Become a Data Backend Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2024/03/building-data-abstractions-with-streaming-at-yelp.html</link>
      <guid>https://engineeringblog.yelp.com/2024/03/building-data-abstractions-with-streaming-at-yelp.html</guid>
      <pubDate>Fri, 08 Mar 2024 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Coordinator - The Gateway For Nrtsearch]]></title>
      <description><![CDATA[<p>While we once used Elasticsearch at Yelp, we have since built a replacement called Nrtsearch. The benefits and motivations of this switch can be found in our blog post: <a href="https://engineeringblog.yelp.com/2021/09/nrtsearch-yelps-fast-scalable-and-cost-effective-search-engine.html">Nrtsearch: Yelp’s Fast, Scalable and Cost Effective Search Engine</a>. However in this blog post, we will discuss the motivations behind building Nrtsearch Coordinator - a gateway for Nrtsearch clusters. We will also go over how Nrtsearch Coordinator adds sharding logic to Nrtsearch, handles scatter-gather queries, and adds support for dark/live launching cluster improvements.</p><p>We traditionally used a gateway to call Elasticsearch, which provides metrics, isolation rate-limiting per client, and geo sharding, and it also eases Elasticsearch upgrades (see <a href="https://www.youtube.com/watch?v=1D1ED4KxxWQ">Yelp’s Elasticsearch-based Ranking Platform - Indexing and Defense Mechanisms</a> for more details). However, we couldn’t use the same gateway for Nrtsearch for a few reasons:</p><ol><li>It was using the <a href="https://github.com/Netflix/Hystrix">Hystrix</a> library for rate-limiting and isolation which has been deprecated for a while.</li>
<li>It was running on Java 1.8 since Hystrix is not supported on newer Java versions.</li>
<li>It exposed a REST API with JSON while Nrtsearch uses gRPC and Protobuf. Converting the Protobuf messages to JSON would make the responses much larger and harder to parse for clients.</li>
<li>It was built for geo sharding but we needed to shard the data using multiple strategies.</li>
<li>It used a Cassandra-based system instead of our more recent <a href="https://engineeringblog.yelp.com/2018/06/fast-order-search.html">Flink-based Elasticpipe</a> for indexing.</li>
</ol><p>We considered modernizing the gateway and supporting the required features, but it would have required a lot of changes in the gateway and also in existing applications. Instead we decided to build <strong>Nrtsearch Coordinator</strong> to address all the issues with the previous gateway. It runs on the latest Java version, uses gRPC and Protobuf, and also has more required features. These features are discussed in detail below.</p><h2 id="sharding">Sharding</h2><p>Nrtsearch clusters have a single primary (which does all the indexing) and multiple replicas which serve search requests. The replicas start up by downloading a copy of the index from S3, and then connect to the primary to get the real-time indexing updates. We also have the replicas keep the docvalues (column-based per-field data structures that are read sequentially) for the entire index in memory using OS disk cache for faster retrieval for search requests. This design presents two challenges:</p><ol><li>Index size is limited by the amount of memory we can get in an instance. Larger instances are also more expensive.</li>
<li>Replicas will require more time to bootstrap the larger an index is – since the download from S3 will take longer – increasing the time it takes to scale up the number of replicas when there is an increase in search traffic.</li>
</ol><p>While these challenges won’t present issues for small indices (sized in 10s of GBs), they will for larger indices (100s of GBs). This is a typical problem faced by databases since data size can easily grow beyond the space available on a disk. Databases typically “shard” (create chunks of) large amounts of data and distribute them across multiple nodes so that each node has a manageable data size. The Nrtsearch Coordinator allows us to do the same for Nrtsearch, but instead of distributing data across multiple nodes in a cluster, we do it across multiple Nrtsearch clusters. We call this logical grouping of clusters a “cluster group.”</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/coordinator-the-gateway-for-nrtsearch/indexing_and_search_with_coordinator.png" alt="Interactions between Nrtsearch primaries and replicas of clusters in a cluster group, and Nrtsearch Coordinator" /><p class="subtle-text"><small>Interactions between Nrtsearch primaries and replicas of clusters in a cluster group, and Nrtsearch Coordinator</small></p></div><p>We can easily create the required number of Nrtsearch clusters, and then Nrtsearch Coordinator will direct both indexing (including add document, delete and commit requests) and search requests to the right clusters. All of these requests include a sharding parameter object which contains the required information for Nrtsearch Coordinator to send the request to the right cluster. Nrtsearch Coordinator also needs a sharding configuration which defines how the sharding will be performed. The information within the sharding parameter and the required configuration will depend on the type of sharding being used:</p><ol><li>
<p><strong class="c1">ID sharding</strong></p>
<p>ID sharding simply takes an integer modulo the number of clusters/shards to determine which cluster to index into or search. While the name implies that the integer must be an ID, it may or may not be the document ID. The sharding configuration needs to map the numbers 0 to n-1 (where n is the number of Nrtsearch clusters) to an Nrtsearch cluster and index name. Example ID sharding configuration:</p>
<div class="language-plaintext highlighter-rouge highlight"><pre>clusters_to_indices:
  0:
    cluster_1: index_name_1
  1:
    cluster_2: index_name_2
  2:
    cluster_3: index_name_3
</pre></div>
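<p>With a configuration like the one above, routing reduces to a modulo and a lookup. A sketch (assuming three clusters, as in the example):</p>

```python
# Sketch of ID-sharding routing: mod the sharding integer by the number of
# clusters and look up the target cluster/index, mirroring the config above.

CLUSTERS_TO_INDICES = {
    0: ("cluster_1", "index_name_1"),
    1: ("cluster_2", "index_name_2"),
    2: ("cluster_3", "index_name_3"),
}

def route_by_id(sharding_id):
    shard = sharding_id % len(CLUSTERS_TO_INDICES)
    return CLUSTERS_TO_INDICES[shard]

# Requests sharing a sharding integer always land on the same cluster,
# e.g. all reviews sharded on the same business ID.
cluster, index = route_by_id(1000003)  # 1000003 % 3 == 1
```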
</li>
<li>
<p><strong class="c1">Geo sharding</strong></p>
<p>With geo sharding, data in the same region is stored in a single cluster. The sharding parameter may contain a geo point (latitude and longitude) or a geo box (two geo points representing opposite corners of a rectangular area). The sharding configuration needs to contain a mapping from geo box to an Nrtsearch cluster and index name. A request is mapped to an Nrtsearch cluster if its point or box is contained in the corresponding geo box. We add some fudge factor to index businesses at the boundary to keep the search behavior consistent. Example geo sharding configuration:</p>
<div class="language-plaintext highlighter-rouge highlight"><pre>geoshards:
  - index_name: west_americas
    cluster_name: search_west
    bounds:
      min_latitude: -90.0
      max_latitude: 90.0
      min_longitude: -170.0
      max_longitude: -100.0
  - index_name: east_americas
    cluster_name: search_east
    bounds:
      min_latitude: -90.0
      max_latitude: 90.0
      min_longitude: -100.0
      max_longitude: -30.0
</pre></div>
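<p>Routing for geo sharding amounts to a containment check against the configured bounds. A sketch using the example configuration above (ignoring the boundary fudge factor):</p>

```python
# Sketch of geo-sharding routing: pick the shard whose bounding box contains
# the request's geo point (mirrors the example configuration above).

GEOSHARDS = [
    {"index_name": "west_americas", "cluster_name": "search_west",
     "bounds": {"min_latitude": -90.0, "max_latitude": 90.0,
                "min_longitude": -170.0, "max_longitude": -100.0}},
    {"index_name": "east_americas", "cluster_name": "search_east",
     "bounds": {"min_latitude": -90.0, "max_latitude": 90.0,
                "min_longitude": -100.0, "max_longitude": -30.0}},
]

def route_by_point(lat, lon):
    for shard in GEOSHARDS:
        b = shard["bounds"]
        if (b["min_latitude"] <= lat <= b["max_latitude"]
                and b["min_longitude"] <= lon <= b["max_longitude"]):
            return shard
    raise ValueError("point not covered by any geo shard")

# New York (~40.7, -74.0) falls inside the east_americas bounds.
shard = route_by_point(40.7, -74.0)
```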
</li>
<li>
<p><strong class="c1">Default sharding</strong></p>
<p>This implies that we are only using a single Nrtsearch cluster and not sharding the data. The sharding parameter need not contain anything, while the sharding configuration needs the single Nrtsearch cluster and index name. Example default sharding configuration:</p>
<div class="language-plaintext highlighter-rouge highlight"><pre>cluster_name: search
index_name: business_v1
</pre></div>
</li>
</ol><p>We select one of these sharding strategies:</p><ul><li>If the index size is small enough to fit on a single cluster, use default sharding.</li>
<li>If the index is large, can be split by location, and every search query only has a single geo area, use geo sharding.</li>
<li>Use ID sharding for everything else.</li>
</ul><p>When sharding data, databases generally try to split the data evenly across all shards. Queries are fanned out to all shards and then the results are combined. As you can see with ID sharding (unless using document IDs as the sharding parameter) or geo sharding, there is no guarantee that the data will be evenly distributed across Nrtsearch clusters. These sharding strategies can only be used with search queries that access a single shard. Say you have a geo shard for the Eastern U.S. and you have a search request that only needs results within the area of New York. You can direct the search request to the New York shard by setting the sharding parameter to the geo box containing New York. In addition to that you can also add a <a href="https://nrtsearch.readthedocs.io/en/latest/queries/geo_bounding_box.html">geo bounding box</a> to the query to limit the results to New York.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/coordinator-the-gateway-for-nrtsearch/geosharding.png" alt="Geo sharding example" /><p class="subtle-text"><small>Geo sharding example</small></p></div><p>This works with ID sharding too. You can search over all reviews of a single business by ID sharding on business ID instead of review ID. Also since we run Nrtsearch on Kubernetes we can individually set the resources for primaries and replicas in each cluster, and also the number of replicas. For example:</p><ul><li>If a cluster has a small index we can set it to have less memory.</li>
<li>If a cluster has only a few updates we can reduce the CPU on the primary.</li>
<li>If a cluster receives more traffic than other clusters, its replicas can scale up and service the traffic. There is no need to increase the number of replicas for other clusters.</li>
</ul><p>All we need is that the index sizes on each cluster are small enough that the docvalues fit in memory and that Nrtsearch can download the index and startup within a few minutes. But if your search query requires searching over all data across multiple shards, we can ID shard on the document ID to have all data evenly spread across all clusters and use scatter-gather.</p><h2 id="scatter-gather">Scatter-Gather</h2><p>Nrtsearch Coordinator also supports scatter-gather, in other words, it can fan out search requests to all clusters and combine the responses for use-cases where we cannot apply application level sharding logic. This can be used with any type of sharding but is best used with ID sharding using document ID in the sharding parameter to evenly distribute the data and also search load.</p><p>Processing a search request this way enables parallel processing and improves performance for searches over huge datasets contained in a cluster group. Consider an Nrtsearch index that contains reviews and is sharded by review ID. Scatter-Gather can be used to query all reviews containing the word pizza across all clusters. In this case we can send the same query to all the clusters and combine the responses to rank them accordingly.</p><p>We implemented scatter-gather to distribute an incoming search request across multiple clusters using multi-threading to invoke all the search tasks in parallel and with appropriate timeouts to process the request. Nrtsearch Coordinator acts as a collector for these individual search responses. All the logic needed to merge and sort these responses are built into Nrtsearch Coordinator. This requires scatter-gather to be performant to take advantage of Nrtsearch’s high performance searches on each cluster.</p><p>The Nrtsearch Coordinator merges the responses as they are received. The hits are ranked either according to the relevance scores or the query’s sort field type. 
We use a heap data structure to merge the results and to retain the top N document IDs requested by the client. Currently if any request to a cluster errors out we return an error in the response. Support for partial responses is discussed in the future work section.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/coordinator-the-gateway-for-nrtsearch/scater_gather_query_and_fetch.png" alt="Nrtsearch Coordinator Scatter-Gather feature" /><p class="subtle-text"><small>Nrtsearch Coordinator Scatter-Gather feature</small></p></div><p>An Nrtsearch search response contains the hits results, search diagnostics, collector or aggregation results and several other metrics and information about the search query that is processed. All of these fields are merged accordingly to enrich the combined search response with all the useful information.</p><p>When <a href="https://nrtsearch.readthedocs.io/en/latest/additional_collectors.html">aggregations</a> such as Terms aggregation are requested, Nrtsearch uses collectors to get results from individual segments of an index and a reduce logic computes the aggregations per cluster. If topN results are requested, for example, we get the topN from each shard to combine and sort the individual responses. We use a query-and-fetch approach here instead of query-then-fetch since we did not experience any latency concerns for our current use cases. However in the future, we plan to implement a query-then-fetch approach to handle large search requests to clusters with a higher number of shards. For search clients that require higher accuracy when dealing with imbalanced shards, we will be fetching more than the requested number of results from each shard so that the final topN results have the highest accuracy and relevance.</p><p>In Nrtsearch Coordinator, we recursively process the results of these collectors and the nested collectors within them to merge the responses. 
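</p>
<p>The heap-based top-N merge of hits described above can be sketched as follows (the hit shape is hypothetical):</p>

```python
import heapq

# Sketch of the heap-based merge: retain only the top N hits (by score)
# across the per-cluster responses. The hit shape is hypothetical.

def merge_top_n(cluster_responses, n):
    heap = []     # min-heap of (score, tiebreak, hit); root is the worst kept hit
    counter = 0   # unique tiebreak so equal scores never compare the hit dicts
    for hits in cluster_responses:
        for hit in hits:
            heapq.heappush(heap, (hit["score"], counter, hit))
            counter += 1
            if len(heap) > n:
                heapq.heappop(heap)  # evict the current lowest-scoring hit
    return [hit for _, _, hit in sorted(heap, reverse=True)]

responses = [
    [{"id": "a", "score": 0.9}, {"id": "b", "score": 0.2}],
    [{"id": "c", "score": 0.7}],
]
top2 = merge_top_n(responses, 2)  # hits "a" and "c", best first
```

<p>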
These results are then ordered and processed using a priority queue to have top buckets of certain size in the final aggregation result.</p><p>Some search requests can take too long to be processed, which can cause timeouts in the Nrtsearch cluster. The reasons why the query could not be processed within a reasonable time may vary from queries that require ranking a large number of documents, to a lack of resources in the Nrtsearch cluster. We log these slow queries along with the time taken to understand the root cause behind the slow processing time. The slow query is logged in Nrtsearch Coordinator because sharding is not part of Nrtsearch. It would not be possible to investigate a sharding problem if we were logging the slow query through Nrtsearch instead of Nrtsearch Coordinator.</p><p>It is important to note that the information in the slow query log does not contain any sort of sensitive information that could harm users’ privacy. The term “slow” is subjective and configurable in the Nrtsearch Coordinator configuration file. This is an example of a slow query configuration:</p><div class="language-plaintext highlighter-rouge highlight"><pre>queryLogger:
  defaultStreamName: all_slow_queries
  timeTakenMsToLoggingPercentage:
    # 1% of the queries that took more than 150ms but not more than 350ms
    # will be logged into the default all_slow_queries stream
    150: 0.01
    350: 1.0
  timeTakenMsToStreamName:
    # 100% of the queries that took more than 350ms will be logged in the
    # stream name defined below instead of all_slow_queries
    350: slow_queries_over_350_ms
# fields that should be skipped when logging a search response/request
sensitiveFieldsInSearchResponse: [response_sensitive_field]
sensitiveFieldsInSearchRequest: [request_sensitive_field]
</pre></div><h2 id="dark-and-live-launch">Dark and live launch</h2><p>Many changes on Nrtsearch clusters are only infrastructural and not behavioral. For such infrastructural changes, we look for the following:</p><ol><li>Client code should not require any changes.</li>
<li>The new cluster group should return the same response.</li>
<li>The response from the new cluster group should not be slower than the status quo cluster group.</li>
</ol><p>Dark and live launches (also known as blue-green deployment) are a great way for developers to safely test a new Nrtsearch cluster group by slowly shifting incoming traffic to the new cluster group. A comparison between the responses from the status quo and the new cluster groups is very useful to build confidence in the new cluster group behavior before actually serving live traffic to it, avoiding any negative impact on the clients.</p><p>Nrtsearch Coordinator is a good place to add the dark/live launch features because it already routes requests to the proper Nrtsearch cluster based on the sharding parameters. Dark/live launches also route requests to the proper Nrtsearch cluster group, but based on a traffic percentage. Having this logic in Nrtsearch Coordinator instead of client services also means that any client using Nrtsearch Coordinator during a dark/live launch would have the new Nrtsearch cluster changes without the need of any change on the client side.</p><p>All of the traffic percentage and launch type (status quo, dark launched, and live launched) definitions are configurable in the Nrtsearch Coordinator configuration file. Currently, dark/live launches only work for search requests. We can define the different types of launches as follows:</p><ul><li><strong>Status quo</strong> - Status quo is the cluster group that Nrtsearch Coordinator currently sends all search requests to.</li>
<li><strong>Dark launch</strong> - Dark launched cluster groups are the cluster groups that we want to test in a way that does not have any user impact. Dark launching should not affect the status quo response in any way, including the content or timings. To achieve that, Nrtsearch Coordinator sends any search request to the status quo <strong>AND</strong> the dark launched cluster groups. Only the search response from the status quo cluster group is returned to the client. In more detail, the same request is first sent to and processed by the status quo cluster group. Then, the same request is sent to the dark launched cluster group, but in a different thread such that the response from the status quo cluster group is not blocked and it can be returned right away to the client. As a result, we can keep track of both the status quo and the dark launched cluster group responses for the same request. These responses and the search request are logged so that we can later compare if both cluster groups behave the same (more in Comparison Report section).</li>
<li><strong>Live launch</strong> - Live launched cluster groups are cluster groups that usually went through a dark launch first and can now be gradually exposed to users. When live launching, Nrtsearch Coordinator sends any search request to the status quo <strong>OR</strong> one of the live launched cluster groups. The response from the selected (status quo or live launched) cluster group is returned to the user. Since the same request is not sent to both the status quo and the live launched cluster groups, we do not have a comparison log similar to what we have during dark launch.</li>
</ul><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/coordinator-the-gateway-for-nrtsearch/launch_router_example.png" alt="How dark/live launch works in Nrtsearch Coordinator" /><p class="subtle-text"><small>How dark/live launch works in Nrtsearch Coordinator</small></p></div><p>Besides defining the status quo as well as the dark/live launched cluster groups, Nrtsearch Coordinator also needs to know by how much it should route the search traffic to these cluster groups, which can happen from 0% to 100%. A common dark/live launch flow looks like the following:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/coordinator-the-gateway-for-nrtsearch/dark_live_launch_flowchart.png" alt="Dark/live launch flow" /><p class="subtle-text"><small>Dark/live launch flow</small></p></div><h3 id="comparison-report">Comparison report</h3><p>We developed a comparison report tool with the purpose of facilitating the comparison of Nrtsearch search responses between the status quo and dark launched cluster groups. Since we log the status quo and dark launched responses for the same request, we can use these logs to check the behavior of the dark launched cluster group against the status quo. Each line in this log contains the search request, the search response of the status quo, and the search response of the dark launched cluster groups. The comparison report tool uses this log to compare the responses and generates a summary of the comparison, by checking the response equality in the following order: total hits → hit fields that are ids → remaining hit fields → hit scores. The complete Nrtsearch response structure can be found <a href="https://github.com/Yelp/nrtsearch/blob/0aec087cc083d07ea39802d1574b3ae2e19732d1/clientlib/src/main/proto/yelp/nrtsearch/search.proto#L551-L630">here</a>. 
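</p>
<p>A minimal sketch of that comparison order, using a hypothetical, simplified response shape:</p>

```python
# Sketch of the equality check, in the order described above:
# total hits -> id fields -> remaining hit fields -> hit scores.
# The response shape here is a hypothetical simplification.

def compare(sq, dark, id_fields):
    if sq["total_hits"] != dark["total_hits"]:
        return "total_hits_mismatch"
    for sq_hit, dark_hit in zip(sq["hits"], dark["hits"]):
        for field in id_fields:
            if sq_hit["fields"].get(field) != dark_hit["fields"].get(field):
                return "id_mismatch"
        if sq_hit["fields"] != dark_hit["fields"]:
            return "field_mismatch"
        if sq_hit["score"] != dark_hit["score"]:
            return "score_mismatch"
    return "match"

status_quo = {"total_hits": 1,
              "hits": [{"fields": {"id": "a", "name": "x"}, "score": 1.0}]}
dark = {"total_hits": 1,
        "hits": [{"fields": {"id": "a", "name": "x"}, "score": 0.9}]}
result = compare(status_quo, dark, id_fields={"id"})  # "score_mismatch"
```

<p>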
This is what the comparison report summary looks like:</p><div class="language-plaintext highlighter-rouge highlight"><pre>----- COMPARISON REPORT SUMMARY -----
Dark launch cluster group: test-cluster-group
Total log lines compared: 293
Number of error messages: 15 (5.12% of total log lines)
Number of matching responses: 178 (60.75% of total log lines)
Number of mismatching responses: 100 (34.13% of total log lines)
-- Total hits mismatch stats --
Number of mismatching total hits: 70 (23.89% of total log lines)
Total hits average difference: 60
-- Top hits mismatch stats --
Number of mismatching ids: 7 (2.39% of total log lines)
Number of mismatching fields: 23 (7.85% of total log lines)
Number of mismatching scores: 0 (0.00% of total log lines)
Comparison report saved at nrtsearch_coordinator/generated/comparison_reports/comparison_report_20221109-155500.txt
</pre></div><p>The comparison report is a command-line tool that is part of the Nrtsearch Coordinator repository. While this tool could have been released separately from Nrtsearch Coordinator, we deploy them together to avoid installing and deploying the tool in different environments. It also makes sense to deploy the comparison report tool and Nrtsearch Coordinator together because the comparison tool is tightly coupled with the dark launch log formatting, which is defined in Nrtsearch Coordinator.</p><h2 id="future-work">Future work</h2><ul><li>Support pagination, partial responses, and combining facet results in scatter-gather</li>
<li>Translating coordinator requests to work with API changes in Nrtsearch to avoid changes in clients</li>
<li>Add more sharding strategies which work better for a variety of use-cases</li>
</ul><p>We would like to thank all current and past members of Ranking Infrastructure team at Yelp who have contributed to building Nrtsearch Coordinator including Andrew Prudhomme, Erik Yang, Karthik Alle, Mohammad Mohtasham, Tao Yu, Ziqi Wang, Umesh Dangat, Jedrzej Blaszyk and Samir Desai.</p><div class="island job-posting"><h3>Become a Data Backend Engineer at Yelp</h3><p>Do you love building elegant and scalable systems? Interested in working on projects like Nrtsearch? Apply to become a Data Backend Engineer at Yelp.</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2023/10/coordinator-the-gateway-for-nrtsearch.html</link>
      <guid>https://engineeringblog.yelp.com/2023/10/coordinator-the-gateway-for-nrtsearch.html</guid>
      <pubDate>Fri, 06 Oct 2023 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Overview of JupyterHub Ecosystem]]></title>
      <description><![CDATA[<p>At Yelp, <a href="https://spark.apache.org/">Apache Spark</a> and <a href="https://jupyter.org/">JupyterHub</a> are heavily used for batch processing and interactive use-cases, such as building feature models, conducting ad-hoc data analysis, sharing templates, making onboarding materials, creating visualizations, and producing sales reports.</p><p>Our initial deployments of Jupyter at Yelp were IPython notebooks managed at an individual level. Later, when JupyterLab was released (2018), our notebook ecosystem was extended to Jupyter Servers running on dev boxes, managed by individual engineering teams. Over time, with growing use-cases and data flow, this introduced unnecessary version variability, became error-prone due to the number of manual steps, caused config duplication, lacked comprehensive resource usage and cost monitoring, created security issues, and added maintenance overhead at an organizational level.</p><p>In this blog post, we will discuss the evolution of our JupyterHub ecosystem, which is now managed by a single team and presents an easy-to-use, scalable, robust, and monitored system for all engineers at Yelp. This blog will focus on each major component of the ecosystem and describe its purpose and evolution over time. Finally, we will illustrate the evolution of all the components in a unified chronological order in a diagram.</p><p>The Yelp JupyterHub ecosystem encompasses JupyterHub, our internal notebook archiving service <a href="https://engineeringblog.yelp.com/2020/10/introducing-folium-enabling-reproducible-notebooks-at-yelp.html">Folium</a>, <a href="https://papermill.readthedocs.io/en/latest/">Papermill</a>, <a href="https://engineeringblog.yelp.com/2020/03/spark-on-paasta.html">Spark on PaaSTA</a>, and a Spark Job Scheduling Service (e.g. Mesos or Kubernetes). 
We solved many of our problems through a combination of novel feature development, extension integrations, and migrations of infrastructure components, all while minimizing the impact on existing Jupyter workflows. The diagram below shows the most common workflow for a user to launch a notebook and upload it to Folium.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/overview-of-jupyterhub-ecosystem/jupyterhub-system-overview.png" alt="High-level Architecture of JupyterHub Ecosystem at Yelp" /><p class="subtle-text"><small>High-level Architecture of JupyterHub Ecosystem at Yelp</small></p></div><p><strong>Scale</strong>: Over the years, we have scaled our usage of the JupyterHub ecosystem to several teams owning thousands of batches. To put this into perspective, Spark batch runs have doubled every year. As of today, over 100 service owners own more than 1200 batches, and hundreds of Jupyter and Folium notebooks are executed daily. These run across different underlying hardware (EMR, GPU, spot, on-demand), processing billions of messages and terabytes of data daily.</p><p>Jupyter notebook usage started at Yelp with users launching notebook servers from within a service virtual environment on individual dev boxes. As mentioned earlier in this post, as the scale of usage of our ecosystem increased, it brought a number of challenges, making it harder to manage use-cases at an organizational level.</p><p>As a result, our Spark and JupyterHub infrastructure went through a series of migrations to adapt to newer technologies. The chronological stages of the migrations were as follows:</p><ul><li>The JupyterHub setup on individual dev boxes was later extended to team-based JupyterHub instances running on team dev boxes, which used the open-source Docker spawner to launch user servers. 
This let teams share common infrastructure without each user having to set up and maintain their own server.</li>
<li>A centralized JupyterHub ecosystem was built on top of <a href="https://engineeringblog.yelp.com/2015/11/introducing-paasta-an-open-platform-as-a-service.html">PaaSTA</a>. We started by launching and managing our Jupyter notebooks using a Marathon spawner, while the Spark cluster used a Mesos scheduler to launch its executors. This meant a single team was able to manage the JupyterHub ecosystem, while also providing a single point of entry for launching Spark sessions integrated with PaaSTA infrastructure.</li>
<li>We then adopted Kubernetes, a widely used and well-maintained open-source orchestration platform.
<ul><li>The initial phase involved moving notebook launching away from the Marathon spawner in favor of <a href="https://jupyterhub-kubespawner.readthedocs.io/en/latest/">Kubespawner</a>. At this stage, Spark jobs launched from Jupyter notebooks ran Spark drivers on Kubernetes while executors were still running on Mesos. Moving to Kubespawner opened the door to many <a href="https://jupyterhub-kubespawner.readthedocs.io/en/latest/#features">features</a>: it provided smarter bin packing, centralized management, and improved monitoring of Jupyter nodes inside a Kubernetes cluster.</li>
<li>The next phase involved migrating the Spark executor scheduler from Mesos to Kubernetes. This took us one step further towards Mesos deprecation and enabled auto-scaling of executor instances with Dynamic Resource Allocation. It also opened the door to security-related improvements, such as adding IAM roles for containers through Pod Identity for Spark drivers.</li>
</ul></li>
</ul><p>All this was done under the hood, without impacting the user experience or requiring any service-based migrations.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/overview-of-jupyterhub-ecosystem/launching-spark-session.png" alt="Launching Jupyter notebook and writing Spark job without having to deal with underneath components" /><p class="subtle-text"><small>Launching a Jupyter notebook and writing a Spark job without having to deal with the underlying components</small></p></div><p>One of the goals of the ML Compute team – a team focused on batch and machine learning infrastructure – is to continuously work towards a ‘one-click-set-up-everything’ philosophy. This helps Jupyter and Spark users shift their focus to notebook development instead of infrastructure management. It starts with providing a single web URL entry point for any internal user, as shown in the diagram below. The entry point lets the user launch a Jupyter Server after logging in with their LDAP credentials and using two-factor authentication (2FA).</p><p>The Jupyter Server is run from a Docker image, which users can use directly or customize based on their requirements. 
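For illustration only (this is not Yelp's actual configuration), a Kubespawner-based JupyterHub deployment can offer a choice of images and resource pools at launch time via <code>profile_list</code> in <code>jupyterhub_config.py</code>; the image names and limits below are made up:

```python
# Illustrative jupyterhub_config.py fragment (not Yelp's actual config).
# Kubespawner's profile_list lets users pick an image and resource pool
# at launch time; image names and limits here are hypothetical.
c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

c.KubeSpawner.profile_list = [
    {
        "display_name": "CPU pool (default)",
        "default": True,
        "kubespawner_override": {
            "image": "example.registry/jupyter-cpu:latest",
            "cpu_limit": 4,
            "mem_limit": "8G",
        },
    },
    {
        "display_name": "GPU pool",
        "kubespawner_override": {
            "image": "example.registry/jupyter-gpu:latest",
            "extra_resource_limits": {"nvidia.com/gpu": "1"},
        },
    },
]
```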
These images have all the permissions, environment, packaging, and recommended configurations required to install and run Spark, an otherwise onerous task.</p><p>Customizations to our Jupyter launcher set up user credentials based on assigned AWS roles to access various internal data resources (S3, Redshift), and allow users to select between GPU and CPU pools with custom resource configurations at launch time.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/overview-of-jupyterhub-ecosystem/entrypoint-jupyter-server.png" alt="Single entry point to launch secured and customized notebook server" /><p class="subtle-text"><small>Single entry point to launch secured and customized notebook server</small></p></div><h2 id="customized-jupyter-kernels">Customized Jupyter Kernels</h2><p>The single entry point leads to spawning a JupyterHub server. Most users have to select the right coding environment (Python, SQL, etc.) with the relevant dependencies installed, often referred to as a kernel. Jupyter notebook comes with a default ipykernel built on top of IPython. We built our own internal custom kernels for IPython and SQL, tailored to data-science and other Yelp Jupyter users. Our SQL kernel lets users connect to multiple Datalake or Redshift clusters and execute SQL queries interactively.</p><h2 id="creating-spark-session">Creating Spark Session</h2><p>Now that we have a notebook server ready to use, one can create a Spark Session with a single API call, <em>create_spark_session</em>. Besides returning an active Spark Session, this API internally takes care of the following:</p><ul><li>Deduces the final set of relevant Spark parameters based on different input sources</li>
<li>Deduces the optimal default AWS resource and Docker container configurations</li>
<li>Sets up the required environment variables (e.g., AWS credentials)</li>
<li>Emits a resource-usage monitoring link, a Spark history link, and an estimated cost</li>
<li>Sends a request to Clusterman, another internal system, to spin up a Spark cluster in our shared Spark pool</li>
</ul><p>Once the Spark session is created, a notebook user can focus on developing and iterating on Spark batches and building data-science models on the live Spark cluster. The diagram below shows an example of launching a Spark cluster through a single API call.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/overview-of-jupyterhub-ecosystem/creating-spark-session.png" alt="Creating a Spark Session" /><p class="subtle-text"><small>Creating a Spark Session</small></p></div><h2 id="managing-access-controls">Managing Access Controls</h2><p>Notebook users often want to connect to various AWS resources like Yelp’s Datalake, S3 paths, and Redshift. To keep Yelp’s infrastructure secure, we want to make sure that each notebook developer can only access a designated set of clusters and resources based on their team roles or privileges. Each user at Yelp has a designated set of roles giving them the required access controls to AWS resources and databases, with session-based credentials accessible only after 2FA. To keep the development experience free of manual, multi-step, and error-prone setup for managing access controls, we provide simple UI-based prompts and reminders for initializing and refreshing session-based credentials.</p><p>During the early years of our JupyterHub usage, we relied on syncing each user’s static AWS credentials from a secured S3 location at the time a Jupyter Server launched. Later we moved to federated credentials for batches run by human users. These federated credentials have a lifespan of less than 12 hours and need to be refreshed once they expire. Notebook extensions were added for users to refresh dev or prod credentials with a few button clicks, as shown in the diagram below. 
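To make the credential lifespan concrete, here is a minimal sketch of the kind of check that decides when a refresh prompt is needed. The helper name and the 30-minute safety margin are illustrative assumptions, not Yelp's implementation; only the under-12-hour lifespan comes from the text above.

```python
# Minimal sketch of a credential-expiry check (names and margin assumed).
from datetime import datetime, timedelta, timezone

MAX_LIFESPAN = timedelta(hours=12)  # federated creds live under 12 hours

def needs_refresh(issued_at: datetime, now: datetime,
                  margin: timedelta = timedelta(minutes=30)) -> bool:
    """True once the credentials are expired or within `margin` of expiry."""
    return now >= issued_at + MAX_LIFESPAN - margin
```

A notebook extension could poll such a check and surface the 2FA refresh pop-up only when it returns true.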
The refresh mechanism generates federated credentials, also referred to as temporary credentials, using two-factor authentication linked to one of the designated roles associated with the triggering user. Later, this multi-step process was improved so that users generate credentials as part of a single sign-on process for their designated role. The future plan is to auto-refresh credentials on expiry, so that ongoing jobs, or jobs requiring more than 12 hours of runtime, are not impacted.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/overview-of-jupyterhub-ecosystem/refresh-creds.png" alt="Pop-up option to refresh the credentials using 2FA authentication" /><p class="subtle-text"><small>Pop-up option to refresh the credentials using 2FA</small></p></div><p>Many use-cases of the JupyterHub ecosystem involve multiple users re-running notebooks with different inputs over time. For example, data scientists receive multiple requests to recreate a past report with different sets of inputs. Relying solely on Jupyter notebooks involved a lot of manual steps: starting a Jupyter server, finding notebooks locally or in S3 buckets, updating the code, running it manually, and emailing the outputs to stakeholders. These steps consumed a lot of development time and coordination, were error-prone, and reduced developer velocity.</p><p>To solve this challenge, we built a notebook archiving and sharing service called Folium. Folium integrates with JupyterHub to enable notebook reproducibility and improve developer velocity. A notebook developer can upload their notebook to Folium, then share or re-run it with a single click to get the desired results (e.g., business data, machine learning model outputs, graphs). 
Later versions of Folium introduced tagging, grouping, and versioning of notebooks, followed by integrated generation of temporary AWS role-based credentials for the user re-running a notebook. For more details, refer to our previous engineering blog on Folium: Introducing Folium: Enabling Reproducible Notebooks at Yelp.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/overview-of-jupyterhub-ecosystem/workflow-upload-folium.png" alt="Typical workflow for uploading notebook to Folium" /><p class="subtle-text"><small>Typical workflow for uploading a notebook to Folium</small></p></div><h2 id="parameterizing-notebook-reruns">Parameterizing Notebook Reruns</h2><p>We used the open-source <a href="https://papermill.readthedocs.io/en/latest/">Papermill</a> library for parameterizing and executing Jupyter notebooks. Papermill’s built-in support only allows input/output to/from the local filesystem, and only supports running notebooks on the local machine. Our integration allowed users to directly rerun a templated notebook with different parameters in Folium, without needing to start a Jupyter server, update notebook code with different inputs, or monitor the running status manually. To do this, we adapted Papermill to use an I/O handler, letting Papermill read input notebooks from Folium and write output notebooks with computed results back to Folium, and provided a UI that launches new k8s pods for running individual notebooks.</p><p>Providing a smooth user experience is one of the key goals of our JupyterHub ecosystem’s evolution. As our ecosystem scaled in terms of usage and teams, and with the integration of more systems like Folium, Papermill, and federated credentials, it became necessary to add new features and extensions.</p><p>Here is a summary of some of the JupyterLab extensions and features we added as part of the JupyterHub ecosystem:</p><ul><li><strong>Monitoring</strong>
<ul><li>Slack Notifications for long-running and expensive Jupyter notebooks.</li>
<li>The open-source JupyterLab extension <a href="https://pypi.org/project/jupyterlab-sparkmonitor/">Spark Monitor</a> shows the live status of Spark job execution within the notebook’s cells, which helps users focus on the current job execution status without having to switch between the Spark UI and the Jupyter notebook.</li>
<li>Cluster-level Monitoring on SignalFx/Prometheus: Active notebook run count, percentage of pool (CPU, GPU, on-demand) usage, individual notebook resource usage, data for all the customizations (like kernel, container, user) being used.</li>
</ul></li>
<li><strong>Usability</strong>
<ul><li>Menu buttons to upload and download notebooks from Folium.</li>
<li>Menu buttons to refresh or generate temporary AWS credentials for both development and production access.</li>
<li>Menu button to list a user’s assigned AWS roles and identify their privileges.</li>
</ul></li>
<li><strong>Features</strong>
<ul><li>Side-tabs with a list of available Redshift and Datalake tables. Selecting a particular table auto-generates a code template to connect to and query the respective database.</li>
<li>Integration of <a href="https://pypi.org/project/black/">black</a> and <a href="https://pycqa.github.io/isort/">isort</a> code formatter menu buttons inside JupyterLab.</li>
</ul></li>
<li><strong>Cost Savings</strong>
<ul><li>Extension of the <a href="https://jupyterhub.readthedocs.io/en/stable/tutorial/getting-started/services-basics.html">cull idle notebook</a> server script to identify, report, and kill long-running Spark clusters to save cost. This is in addition to our regular cron job that kills idle notebook servers.</li>
<li>Dynamic Resource Allocation integration to scale down the Spark Cluster when no spark action is in progress.</li>
<li>Shutdown Server menu button to let users manually shut down or restart their server.</li>
</ul></li>
</ul><p>The diagram below summarizes the evolution of the different components of the JupyterHub ecosystem at Yelp in a timeline view.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/overview-of-jupyterhub-ecosystem/timeline-jupyterhub-evolution.png" alt="Timeline flow graph of JupyterHub ecosystem evolution" /><p class="subtle-text"><small>Timeline flow graph of JupyterHub ecosystem evolution</small></p></div><p>At Yelp, our team is committed to the continuous evolution of the JupyterHub ecosystem. We have scaled its usage from individual engineers, to team-based deployments, to our current organization-wide deployments. In the process, we learned a lot about reducing complexity and increasing reliability, allowing our current setup to be maintained and evolved by a single machine learning compute infrastructure team.</p><p>Our vision of increasing development velocity and ease-of-use of our systems, reducing onboarding time, and ensuring security is at the forefront of our team’s continuous efforts and roadmap. We have accomplished this through a combination of adapting open-source projects and current best practices to Yelp infrastructure, while focusing our internal development on developer pain points specific to Yelp’s internal ecosystem.</p><p>Some of our future initiatives include enabling code navigation, expanding support for different types in parametrized notebooks, making Folium notebooks schedulable, increasing adoption of GPU servers for model processing, and auto-refreshing federated credentials.</p><p>Special thanks to everyone on the Core ML, Compute Infrastructure, Security, and other dependent teams for their tireless contributions to building and continuously evolving the JupyterHub ecosystem and keeping it up to date. 
Thanks to Zeke Koziol, Blake Larkin, Jason Sleight, Ryan Irwin, and Jonathan Budning for providing insightful input, sharing historical context, and reviewing this post.</p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2023/07/overview-of-jupyterhub-ecosystem.html</link>
      <guid>https://engineeringblog.yelp.com/2023/07/overview-of-jupyterhub-ecosystem.html</guid>
      <pubDate>Tue, 25 Jul 2023 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Speeding Up Delivery With Merge Queues]]></title>
      <description><![CDATA[<p>Merging code safely can be quite time-consuming for busy repositories. A common method is to test and merge branches serially, one at a time, to ensure the safety of the main branch. However, this method does not scale well when many developers want to merge code at the same time. In this blog post, you’ll see how we’ve sped up code merging at Yelp by creating a batched merge queue system!</p><p>In our <a href="https://engineeringblog.yelp.com/2023/03/gondola-an-internal-paas-architecture-for-frontend-app-deployment.html">blog post about Gondola</a>, our frontend Platform as a Service (PaaS), we talked about the benefits of moving to a monorepo. As we onboarded more teams and developers into our monorepo, we experienced a bottleneck when integrating code changes (merge requests) during peak hours. Ensuring quick code delivery is important to us at Yelp, as it enables us to iterate quickly and ship fast. Whether it’s bundling JavaScript or running mobile builds, we wanted a system that could speed up all our repositories without changing the developer experience (DX).</p><p>We’ve traditionally run pipelines in serial to keep a clean main branch and prevent merge conflicts. However, this does not scale well when many developers want to push code at the same time (we’ve observed our repo is busiest in the morning). As such, we’ve explored different ways to merge code while guaranteeing the same branch safety.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-07-11-speeding-up-delivery-with-merge-queues/status-quo.gif" alt="An illustration showing how branches were integrated traditionally." /><p class="subtle-text"><small>An illustration showing how branches were integrated traditionally.</small></p></div><p>During our exploratory phase, a common method we saw to expedite code delivery was to run pipelines in parallel when merge requests overlap. 
The idea behind this approach is to merge any in-progress merge requests along with the new request in our pipeline. This decreases the time spent waiting between pipelines compared to merging in serial, while still guaranteeing merge safety. If a pipeline fails, however, the failing merge request is removed and new pipelines are started for every merge request that came after it.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-07-11-speeding-up-delivery-with-merge-queues/merge-parallel.gif" alt="An illustration showing how branches could be integrated in parallel." /><p class="subtle-text"><small>An illustration showing how branches could be integrated in parallel.</small></p></div><p>This approach is quite resource-intensive, since for N branches/merge requests, there would be N pipelines running at the same time. On systems with shared/limited resources, this is a heavy burden to carry, as resource constraints may also negatively affect pipelines currently running.</p><p>The merge queue strategy involves batching up merge requests into merge groups and integrating these merge groups sequentially. This approach keeps our resource usage low, as we still run at most one pipeline at a time. However, instead of merging one merge request at a time, we can merge in as many good, non-conflicting merge requests as possible.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-07-11-speeding-up-delivery-with-merge-queues/merge-queue.gif" alt="An illustration showing how branches are integrated with merge queues." /><p class="subtle-text"><small>An illustration showing how branches are integrated with merge queues.</small></p></div><p>When a merge group fails, we perform a binary search to find the bad merge request(s) by splitting the merge group into two child merge groups. 
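The splitting logic can be sketched as a simple recursive bisection. The following is a toy model of the idea, with <code>pipeline_passes</code> standing in for running the real CI pipeline on a candidate merge group:

```python
# Toy model of merge-group bisection. `pipeline_passes` stands in for a
# real CI run: any callable that returns True when the group is safe to merge.

def process_merge_group(group, pipeline_passes, merged=None, failed=None):
    """Merge every passing subset; bisect failing groups down to single MRs."""
    if merged is None:
        merged, failed = [], []
    if not group:
        return merged, failed
    if pipeline_passes(group):
        merged.extend(group)         # the whole group merges at once
    elif len(group) == 1:
        failed.extend(group)         # cannot split further: a bad merge request
    else:
        mid = (len(group) + 1) // 2  # split into two child merge groups
        process_merge_group(group[:mid], pipeline_passes, merged, failed)
        process_merge_group(group[mid:], pipeline_passes, merged, failed)
    return merged, failed
```

Run on six merge requests A to F with C bad, this merges A, B, D, E, and F and isolates C as the only failure.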
With this strategy, a child merge group that no longer contains any bad merge requests can merge its subset of merge requests all at once. However, if a child merge group continues to fail, we continue the binary search until we get a merge group of one merge request, which either passes or fails and does not split further.</p><p>To illustrate this we created the diagram below. We start with a merge group with six merge requests (labeled A to F), with one of them being unable to merge (labeled C). The first merge group A, B, C, D, E, and F on the left gets split because we are unable to merge C. The next merge group being evaluated contains merge requests A, B, C which gets split again. We are able to evaluate and merge in merge groups containing D, E, F and A, B afterwards. Eventually, we reach a point where there is a merge group containing merge request C, which fails to merge and does not get split into any child merge groups.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-07-11-speeding-up-delivery-with-merge-queues/merge-group-example.png" alt="Example showing how a merge group of 6 merge requests are merged and split over time." /><p class="subtle-text"><small>Example showing how a merge group of 6 merge requests are merged and split over time.</small></p></div><p>For a bit of background, code delivery pipelines at Yelp start with a magic pull request comment, <code class="language-plaintext highlighter-rouge">!integrate</code>. This triggers a pipeline to perform common actions like merging code, running tests, and pushing upstream. With this in mind, we wanted the new system to preserve the developer UX, while still being flexible enough to rollout to any repo.</p><p>To create the merge queue, we began by building a new service that would execute the run loop. This service manages state/logic (such as merge group creation, splitting, etc.) 
and periodically checks if a pipeline should run for a subsequent merge group. In addition, we extended the <code class="language-plaintext highlighter-rouge">!integrate</code> comment logic to seamlessly replace the old workflow with this new approach. Repos can choose to use the merge queue by adding a config file that specifies which pipeline to run. The existence of this config file also indicates that the new magic comment logic should be used. As a result, the magic comment for such a repo will direct a pull request to join the merge queue (after checking mergeability) instead of running a pipeline immediately.</p><p>In our delivery pipelines, most repositories follow the three steps mentioned earlier: merge, test, and push. To account for more complex repos, we allowed developers to perform additional actions before/after these standardized steps, or to replace them entirely. This structure also helps standardize and simplify pipeline code for our repository owners as they onboard to merge queues. With these changes, pipelines can continue performing the necessary actions, while being managed by the merge queue to speed up previously sequential builds.</p><p>Implementing merge queues was a huge improvement over our serial integration pipeline. On the extreme end, we’ve even seen merge groups with over 10 merge requests merge successfully! Using this system over the past year, our frontend monorepo averaged about 1.2 merge requests per merge group. In a hypothetical world where a pipeline takes one hour to run, this translates to saving 12 minutes of developer time per pipeline run compared to running pipelines in serial! For busy repos, which can easily have thousands of merge requests a year, those time savings add up.</p><p>The merge queue project was a collaboration between Webcore and our talented Continuous Integration and Delivery team. 
Many thanks to the developers who’ve contributed to this project from idea generation to further optimizations.</p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2023/07/speeding-up-delivery-with-merge-queues.html</link>
      <guid>https://engineeringblog.yelp.com/2023/07/speeding-up-delivery-with-merge-queues.html</guid>
      <pubDate>Tue, 11 Jul 2023 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Dependency Management at Scale]]></title>
      <description><![CDATA[<p>Keeping project dependencies up to date is an ever-growing concern. An increasing number of dependencies is used for even the simplest applications. It’s easy for teams to deprioritize maintaining them, resulting in numerous security vulnerabilities. As dependencies become increasingly out of date, the level of effort to get a project into a good state increases significantly. Teams may even get blocked by outdated dependencies when doing critical development work.</p><p>Being proactive about applying upgrades goes a long way. Tools like Dependabot can really help with this. But what if you’re trying to enforce these practices across hundreds of teams and thousands of projects? And what if you have complex requirements that need to be enforced? At Yelp, this is where the Yokyo Drift service comes in.</p><p>Yokyo Drift actively scans all repositories in use at Yelp. It submits pull requests that upgrade any outdated dependencies, and tracks and monitors the progress of these upgrades.</p><p>Building a generic solution that works for the majority of projects is challenging. Projects should be relatively standard. This is encouraged by providing a variety of tooling and quality of life upgrades to repositories that adhere to the Yelp standard. The more a project deviates from the standard, the more difficult it becomes to keep it automatically up to date.</p><p>In addition, projects must have a robust testing pipeline and good test coverage. Thorough automated testing should run as part of the CI pipeline before any change is accepted. Upgrading dependencies is likely to introduce bugs, and inadequate testing means that teams may not feel confident merging upgrades, thus encouraging them to stick to outdated dependencies.</p><h2 id="tracking-project-state">Tracking Project State</h2><p>Batch jobs regularly collect and index a variety of information about Yelp repositories. Yokyo Drift monitors the specific dependencies that are used throughout the organization. 
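Conceptually, such an index behaves like a mapping from package to the repositories pinning it. The sketch below is a toy illustration only; the data model and names are assumptions, not Yokyo Drift's actual schema:

```python
# Toy dependency index (names and shapes assumed, not Yokyo Drift's schema).
from collections import defaultdict

index = defaultdict(dict)  # package name -> {repo: pinned version tuple}

def record(repo, package, version):
    """Index a repo's pinned version of a package."""
    index[package][repo] = tuple(int(part) for part in version.split("."))

def affected_repos(package, first_fixed_version):
    """Repos pinning `package` below the first fixed (non-vulnerable) version."""
    fixed = tuple(int(part) for part in first_fixed_version.split("."))
    return sorted(repo for repo, ver in index[package].items() if ver < fixed)
```

With an index like this, a newly disclosed vulnerability becomes a single lookup: every repo below the fixed version is a target for an upgrade pull request.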
When a vulnerability is discovered in a dependency, we can immediately identify all affected repositories and dispatch a fix to rapidly eliminate the vulnerability. All indexed information is available in a simple UI.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-05-17-dependency-management-at-scale/1_status.png" alt="Figure 1: Project status screen" /><p class="subtle-text"><small>Figure 1: Project status screen</small></p></div><h2 id="scheduled-upgrades">Scheduled Upgrades</h2><p>We encourage teams to always keep their project dependencies up to date. Small, frequent updates are much easier to manage. Repository owners can configure how frequently they’d like to receive updates. Yokyo Drift performs both major and minor version upgrades, typically on a monthly or quarterly basis.</p><p>Yelp projects rely on curated package repositories, and we are only able to upgrade to these pre-vetted versions, thereby ensuring we don’t introduce any unwanted security issues.</p><p>Scheduled upgrades are randomly distributed throughout the month. This ensures a consistent use of resources with few spikes. More importantly, it allows our teams to provide support to repository owners and not overwhelm them with too many pull requests at the same time. Performing upgrades for all repositories in one day would result in an overwhelming number of questions in a short amount of time.</p><h2 id="targeted-upgrades">Targeted Upgrades</h2><p>Targeted upgrades allow us to upgrade specific libraries to specific minimal versions across the entire organization. 
These can be invoked dynamically by other teams using the Yokyo Drift API, or manually using the UI shown below.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-05-17-dependency-management-at-scale/2_target.png" alt="Figure 2: Performing a targeted upgrade" /><p class="subtle-text"><small>Figure 2: Performing a targeted upgrade</small></p></div><p>This functionality is frequently used by security teams. Once a vulnerability is discovered in a specific version of a library, we can immediately see the impact and deploy a mass upgrade across all of Yelp’s projects. We then actively monitor the progress and ensure the vulnerability is eliminated in all of Yelp’s systems.</p><p>Library owners are also frequent users of targeted upgrades. They can rapidly deploy bug fixes and other improvements to all relevant projects.</p><h2 id="pull-requests">Pull Requests</h2><p>All changes are submitted as pull requests in GitHub. Since changes go through the existing CI pipeline, a variety of security and automated tests are executed. We rely on the Ownership service to determine the relevant team responsible for each repository. Pull requests are assigned for review to one of the repository’s owners, who is responsible for manually fixing small changes that may be required by library upgrades. The change automatically gets merged once all checks pass and the repository owner approves the pull request.</p><p>Occasionally, teams will be unable to review these pull requests in a timely manner, so automated reminders are sent to the reviewer at a set interval. In addition, Yokyo Drift attempts to always keep the pull request up to date. Merge conflicts are avoided by regularly pulling the latest changes from the master branch and performing the upgrade again if needed.</p><p>Updating dependencies on one repository can be time-consuming. 
It may involve building the project, performing dependency resolution, and even running some automated checks. This is manageable when upgrading a single repository, but quickly becomes untenable when upgrading hundreds or even thousands of repositories. To address this, we need to be able to automatically scale up and down as needed.</p><p>Creating a new upgrade job enqueues a payload for each repository that needs upgrading. Workers are then responsible for taking items off the queue, performing the necessary changes, and submitting the pull requests. Workers are configured to automatically scale up as queue size increases and scale back down when the queue clears. Because of this, thousands of complex upgrades can be executed quickly.</p><p>The Yokyo Drift UI tracks the progress of each task. A typical successful task will move through the following stages: pending, in_progress (the upgrade is in progress), open (pull request is open), and merged.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-05-17-dependency-management-at-scale/3_progress.png" alt="Figure 3: Upgrade progress tracking" /><p class="subtle-text"><small>Figure 3: Upgrade progress tracking</small></p></div><p>The job progress page keeps track of how these updates affect repositories. A status of “checks_failed” indicates that the repository is failing automated tests. This status is not uncommon; however, a large number of repositories failing tests may indicate a fundamental problem with the upgrade. Migration authors such as package owners can investigate this and determine if any changes should be made, the end goal being to reduce friction with teams and make these upgrades as easy as possible to integrate.</p><p>This progress screen can also be used to directly control job progress. 
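The task lifecycle can be modeled as a small state machine. In the hedged sketch below, the stage names (including checks_failed) come from the description above, but the set of allowed transitions is an illustrative assumption:

```python
# Sketch of the upgrade-task lifecycle. Stage names are from the post;
# the allowed transitions are an illustrative assumption.
ALLOWED_TRANSITIONS = {
    "pending": {"in_progress"},
    "in_progress": {"open", "checks_failed"},
    "open": {"merged", "checks_failed"},
    "checks_failed": {"in_progress"},  # e.g., rerun after a fix lands
}

def advance(state, new_state):
    """Move a task to `new_state`, rejecting transitions not in the table."""
    if new_state not in ALLOWED_TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```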
Upgrades can be rerun on individual repositories, the entire change can be canceled or reverted, and teams can be nudged to review and approve the changes if necessary.</p><p>Dependencies can easily become outdated and cause significant problems for development teams. Updating them regularly makes the process more manageable and reduces the number of security vulnerabilities. Enabling teams to upgrade a single dependency across thousands of projects is valuable, both for security teams and dependency developers.</p><p>Thanks to Luis Perez, Kyle Deal, James Flinn, Jason Tran, Rebecca Fan, Mitali Parthasarathy, Hanna Farah, and many others who have contributed to Yokyo Drift over the years.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp. If you're interested, apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2023/05/dependency-management-at-scale.html</link>
      <guid>https://engineeringblog.yelp.com/2023/05/dependency-management-at-scale.html</guid>
      <pubDate>Wed, 17 May 2023 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Performance for Free on Android with our MVI Library]]></title>
      <description><![CDATA[<p>In 2018, Yelp switched from using the MVP architecture to the MVI architecture for Android development. Since then, adoption of our new MVI architecture library has risen and we’ve seen some great performance and scalability wins. In this blog post, we’ll cover why we switched to MVI in the first place, how we managed to get performant screens by default, and our take on unit testing MVI.</p><h2 id="what-is-mvi">What is MVI?</h2><p>One of the main reasons to use an architecture is to make things easier to test by separating concerns. For Android, this means keeping the Android SDK out of our presenters and abstracting away all the code that will cause issues for unit tests.</p><p>The general idea of Model View Intent (MVI) is that when the user interacts with the UI, a view event is sent to be processed in the model. The model can make network requests, manipulate some view state and send the state back to the view. They’re connected by an event bus or stream so no direct references to Android are required (thus concerns are separated for testing).</p><h2 id="why-we-switched-away-from-mvp">Why we switched away from MVP</h2><h3 id="our-mvp-implementation-did-not-scale-well">Our MVP implementation did not scale well</h3><p>Although <a href="https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93presenter">Model-View-Presenter</a> (MVP) is a great architecture with a lot of benefits, we found that it didn’t scale well for our larger, more complicated pages. Our presenters grew to have far too many lines of code and became unwieldy and awkward to maintain as we needed to add more state-management and create more complex presenter logic for MVP pages. It was possible to scale an MVP page using multiple presenters, but there was no one approach documented. 
Our MVP contracts also contained many duplicated interface methods.</p><h3 id="we-wanted-free-performance-by-default">We wanted free performance by default</h3><p>When Google introduced the <a href="https://developer.android.com/topic/performance/vitals">Android Vitals</a> dashboard and announced that performance can affect our listing and promotability in the Play Store, Yelp’s Core Android team invested effort in improving our cold start timings, frame rendering timings, and frozen frames percentages. Although we made significant improvements in those areas, we found that performance regressions were easy to come by and our performance degraded again over time.</p><p>There are a few ways to prevent performance regressions: we could set up performance alerts, we could try to catch regressions before they’re merged, or we could also try to make our apps run smoothly by default. While we did try all of these in the end, our performance came to us for free through auto-mvi, our new MVI library.</p><h2 id="why-we-chose-mvi-and-not-mvvm">Why we chose MVI and not MVVM</h2><p>We evaluated both the MVI and the <a href="https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93viewmodel">Model-View-ViewModel</a> (MVVM) architectures before ultimately deciding on MVI. First, we looked at the basic requirements in our apps. Both of Yelp’s apps require a lot of scrolling and clicking in comparison to, for example, video streaming applications. Next, we looked at what other technologies we were using and determined which architecture would be most compatible with them.</p><p>We rely heavily on our in-house <a href="https://github.com/Yelp/bento">Bento</a> library which is a wrapper around RecyclerView. In Bento, a Component is a part of the UI which can be slotted into any RecyclerView. 
We set up each Component to be its own mini MVP-universe that has its own view, model, and presenter.</p><p>In our prototypes, we found that combining Bento with the MVVM pattern was confusing and led to difficult-to-read code. However, MVI complemented Bento and allowed click events to be fired from within view holders without the need for direct references to the encompassing Fragment or Activity. Additionally, since some of our screens have a lot of UI elements, MVVM would require some data classes with many (greater than 30) fields, which would not scale well.</p><h2 id="how-does-auto-mvi-work">How does auto-mvi work?</h2><p>When the user interacts with the app, view events are emitted from the view (Fragment or Activity). A view event might be a click or scroll event. A presenter (note: to avoid confusion, at Yelp, we refer to the Model in MVI as the “presenter”) receives the events and sends back view states. The view then responds to these states and decides what to show accordingly. These view events and states are represented as sealed classes in Kotlin. They are emitted over an event bus, which both the view and presenter can listen to for new events and states.</p><h3 id="scaling-and-readability-with-annotations">Scaling and readability with annotations</h3><p>Both the presenter and view must handle all of these incoming states. Most Android MVI implementations accomplish this with a <code class="language-plaintext highlighter-rouge">when</code> statement in Kotlin. However, the <code class="language-plaintext highlighter-rouge">when</code> statement wouldn’t scale very well for Yelp. It would be difficult to read. Imagine the following but with fifty other <code class="language-plaintext highlighter-rouge">is</code> clauses:</p><div class="language-kotlin highlighter-rouge highlight"><pre>private fun onViewEvent(viewEvent : MyFeatureEvents) {
  when (viewEvent) {
      is HeaderClicked -&gt; onHeaderClick()
      is FooterClicked -&gt; onFooterClick()
  }
}
</pre></div><p>To get around the <code class="language-plaintext highlighter-rouge">when</code> condition problem, the general idea was to route states and events to function references using a map. That meant going from the above code to:</p><div class="language-kotlin highlighter-rouge highlight"><pre>private val functionMap = mapOf(
    HeaderClicked::class to ::onHeaderClick,
    FooterClicked::class to ::onFooterClick
)
private fun onViewEvent(viewEvent: MyFeatureEvents) {
   // Look up the handler registered for this event's class and invoke it.
   // The unchecked cast assumes a zero-argument handler (KFunction0).
   ((functionMap[viewEvent::class]) as KFunction0&lt;Unit&gt;).invoke()
}
</pre></div><p>Then all onViewEvent() needs to do is look up the map.</p><p>So we could avoid the big <code class="language-plaintext highlighter-rouge">when</code> statement. Writing the function map is gross though and still defeats our scalability goal. We’d just be trading a large <code class="language-plaintext highlighter-rouge">when</code> statement for a large map. We would also need to handle the number of parameters the functions can have. The above code only covers the easiest, zero-parameter case.</p><p>This is how we arrived at the idea to annotate the functions instead. When the presenter and view are created, we use reflection (on a background thread) to create the map of states to functions. Our interface AutoFunction (which is where “auto” comes from) provides the mechanism for this and also routes incoming states and events to relevant functions, and then executes the function with reflection. Again, taking the following example:</p><div class="language-kotlin highlighter-rouge highlight"><pre>private fun onViewEvent(viewEvent : MyFeatureEvents) {
  when (viewEvent) {
      is HeaderClicked -&gt; onHeaderClick()
      is FooterClicked -&gt; onFooterClick()
  }
}
</pre></div><p>Instead we have:</p><div class="language-kotlin highlighter-rouge highlight"><pre>@Event(HeaderClicked::class)
fun onHeaderClick() {
  // do something
}
@Event(FooterClicked::class)
fun onFooterClick() {
  // make network request etc
}
</pre></div><p>With this approach, the scaling issue is solved. There is no <code class="language-plaintext highlighter-rouge">when</code> statement at all, no function map, and not even a specific function responsible for handling incoming events or states. It also has the advantage that it’s incredibly easy to read.</p><h3 id="scaling-with-sub-presenters">Scaling with sub presenters</h3><p>One of the issues we found while using MVP was that for the most complex screens in Yelp’s consumer app, the presenters quickly grew difficult to maintain and understand. With this in mind, the auto-mvi library has a strategy for scaling presenters for such complex screens. A page will define one main presenter, and within it there can be multiple sub presenters. A sub presenter can handle the logic for a particular feature or part of the UI. For example, for a page with these click events defined in the contract:</p><div class="language-kotlin highlighter-rouge highlight"><pre>sealed class MyFeatureEvents : AutoMviViewEvent {
   object MyButton1Clicked : MyFeatureEvents()
   object MyButton2Clicked : MyFeatureEvents()
   object MyButton3Clicked : MyFeatureEvents()
}
</pre></div><p>We could respond to them all in one presenter like this:</p><div class="language-kotlin highlighter-rouge highlight"><pre>class MyFeaturePresenter(
   eventBus: EventBusRx
) : AutoMviPresenter&lt;MyFeatureEvents, MyFeatureStates&gt;(eventBus) {
   @Event(MyButton1Clicked::class)
   fun onMyButton1Clicked() {
       // do something
   }
   @Event(MyButton2Clicked::class)
   fun onMyButton2Clicked() {
       // do something
   }
   @Event(MyButton3Clicked::class)
   fun onMyButton3Clicked() {
       // do something
   }
}
</pre></div><p>But with a sub presenter, we can handle a subset of events elsewhere:</p><div class="language-kotlin highlighter-rouge highlight"><pre>class MyFeaturePresenter(
  eventBus: EventBusRx
) : AutoMviPresenter&lt;MyFeatureEvents, MyFeatureStates&gt;(eventBus) {
    // The rest of click events are handled in here
   @SubPresenter private val subPresenter = MyFeatureSubPresenter(eventBus)
   @Event(MyButton1Clicked::class)
   fun onMyButton1Clicked() {
        // do something
   }
}
</pre></div><p>Since everything is connected via an event bus, it’s simple for one sub presenter to handle a portion of the incoming view events and respond to the view. A bonus win of this pattern is that the organization of unit tests is much improved as each sub presenter can have its own separate unit test. This sub presenter pattern also helps put scaling code at the forefront of one’s mind during planning. If there is a clear division of logic, e.g. header logic vs footer logic, you can easily plan this from the beginning instead of waiting until the presenter is over a thousand lines long at some future point.</p><h3 id="performance-for-free">Performance for free</h3><p>With auto-mvi using reflection to execute functions, an opportunity presented itself. The reflection call is straightforward:</p><div class="language-kotlin highlighter-rouge highlight"><pre>myFunctionReference.invoke()
</pre></div><p>The function, like all the functions in our previous MVP presenters, executes on the main thread. However, by moving the execution of this one line to a background thread instead, we shifted a large portion of the total code that executes in the Yelp apps off the main thread, leading to increased performance overall. This change only affects the presenters. The view code still runs on the main thread as it is required to.</p><p>The code executes on a single background thread to ensure that each unit of work is carried out sequentially. This means that all presenter code, performant or not, now runs on a background thread in the model.</p><h3 id="testing">Testing</h3><p>Writing unit tests for MVP presenters and views is easy and one of the greatest advantages MVP has over other architectures. We used Mockito to verify that functions were called on the interfaces that made up the MVP contract, which is a seamless and straightforward way to test. For example:</p><div class="language-kotlin highlighter-rouge highlight"><pre>fun whenButtonClicked_loadingProgressShown() {
       presenter.buttonClicked() // Simulated UI interaction
       Mockito.verify(view).showLoadingProgress()
}
</pre></div><p>In MVI, we wanted to make sure that the code was still easily testable. The approach we decided on was to record the events and states that are emitted over the event bus and make assertions on them.</p><p>To simplify testing, we created a JUnit test rule called PresenterRule. In addition to abstracting away most of the setup required for the presenter and event bus, the presenter rule also acts as an event bus recorder and provides a set of functions for asserting what happened.</p><p>Taking the example above, this looks like:</p><div class="language-kotlin highlighter-rouge highlight"><pre>fun whenButtonClicked_loadingProgressShown() {
     presenterRule.sendEvent(ButtonClicked)
     presenterRule.assertEquals { listOf(ShowLoadingProgress) }
}
</pre></div><p>Along with verifying that functions are executed, this approach also provides a high-level look at what events and states were triggered and in what order. Lastly, developers can also assert that certain states were <em>not</em> triggered.</p><h2 id="reflecting-4-years-later">Reflecting 4 years later</h2><h3 id="does-it-actually-help-scalability">Does it actually help scalability?</h3><p>Many teams have made use of the sub presenter pattern with great results. In 2020, the Biz Mobile Foundation team rewrote Yelp’s Business Owner App’s home screen using auto-mvi, making extensive use of the sub presenter pattern. By utilizing sub presenters, this complicated page’s presenter size remained small and manageable: fewer than 200 lines, with 8 sub presenters. There are also separate unit test classes for the sub presenters, which are a lot more manageable than if all the tests were in one file.</p><h3 id="does-it-actually-help-performance">Does it actually help performance?</h3><p>From a high level, we can use Android Vitals to gauge our apps’ performance. However, auto-mvi is just one tool in Yelp’s performance arsenal. In combination with the Core Android team’s other performance efforts, Yelp’s consumer app’s frozen frame and rendering statistics on Google Play’s Android Vitals dashboard are significantly better than our competitors’.</p><p>Looking at a more specific use case, in 2020, Yelp’s Growth team migrated the onboarding pages to auto-mvi, analyzed the frame rendering timings of the old flow vs the new MVI one, and found a &gt; 50% improvement in the MVI version. This is precisely the kind of improvement we should expect, as the presenter code isn’t clogging up the main thread anymore. The table below outlines the speed gains we saw on this page with auto-mvi vs MVP.</p><table><thead><tr><th>Avg Frame Render Time Improvement (Relative)</th>
<th>P90 Frame Render Time Improvement (Relative)</th>
<th>Frozen Frame % Improvement (Absolute)</th>
</tr></thead><tbody><tr><td>-51%</td>
<td>-67%</td>
<td>-3.99%</td>
</tr></tbody></table><p>The performance boost resulted in an improvement in product metrics too, with a 6.32% relative lift for the Onboarding Flow Completion rate and an 8.26% relative lift for Signup Rate Completion.</p><p>Without any special, targeted performance effort here, the page’s performance improved. You might even say the performance was free.</p><h3 id="is-unit-testing-still-easy">Is unit testing still easy?</h3><p>Most, if not all, of Yelp’s MVI presenters are accompanied by unit tests, and the provided testing rule has proven to speed up developer workflows. To date, we have thousands of unit tests making sure Yelp’s apps are doing what they’re supposed to do.</p><h2 id="conclusion">Conclusion</h2><p>In summary, every architecture has its advantages and disadvantages, but the most important thing is to choose the one that’s most suitable for your business needs. Auto-mvi has allowed Yelp to tackle the development of everything from simple screens to complex ones in a scalable and testable way while keeping runtime performance a feature and <em>not</em> an afterthought.</p><h2 id="acknowledgments">Acknowledgments</h2><p>Thanks to Diego Waxemberg, Jason Liu, and all the feature teams at Yelp who provided invaluable feedback on our early prototypes and, more importantly, adopted auto-mvi on their screens. On Core Android, shoutout to Kurt Bonatz, Matthew Page, and Ying Chen for their contributions and help maintaining auto-mvi over the years. 
Many thanks to all the past members of Yelp who contributed ideas and feedback too.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp. If you're interested, apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2023/04/performance-for-free-on-android-with-our-mvi-library.html</link>
      <guid>https://engineeringblog.yelp.com/2023/04/performance-for-free-on-android-with-our-mvi-library.html</guid>
      <pubDate>Mon, 24 Apr 2023 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Yelp Content As Embeddings]]></title>
      <description><![CDATA[<p>Yelp aims to offer easily accessible high-quality content. We need to tag, organize and rank online content to attain this goal. For this purpose, Yelp engineers have started using general embeddings on different data. It improves usability and efficiency for all kinds of model development. Having embeddings that encapsulate semantic information readily available for the massive amounts of data Yelp owns makes implementing new deep learning models easier, since it can serve as an excellent baseline for any model input.</p><p>This blog post discusses how the Content and Contributor Intelligence team generates low-dimensional representations of review text, business information and photos for any unspecified machine learning task.</p><h2 id="text-embeddings">Text Embeddings</h2><p>Text embedding has been researched in depth in the scientific community. First, embeddings were generated with sparse vectors representing words. Embeddings developed further with context-aware embeddings since the same word can have different meanings depending on how it is used in a sentence. With the use of transformers in recent years, we now have text snippet embeddings that capture more semantic meaning.</p><p>Semantic comprehension of the text is essential for Yelp. Yelp reviews are our most valuable asset since they contain a lot of business context and sentiment. We want to capture the essence of each review text to serve their information to our users better. We looked for versatility in our embedding as we try to use the same embedding in various tasks: tagging, information extraction, sentiment analysis and ranking.</p><p>Embeddings based on reviews are currently generated by the Universal Sentence Encoder off-the-shelf model offered by Tensorflow. 
This section presents the USE model, the modifications we tested to improve it, and its advantages for the Yelp dataset.</p><h3 id="universal-sentence-encoder">Universal Sentence Encoder</h3><p>The Universal Sentence Encoder (USE) offers many advantages for Yelp data. It transforms sentences of varying lengths into a fixed-length vector representation. The generated representation aims to encode the meaning and context of the text snippet, instead of simply averaging the word vectors together or locating the text in a learned latent space as Latent Dirichlet Allocation (LDA) does.</p><p>The <a href="https://arxiv.org/abs/1803.11175">paper presenting the Universal Sentence Encoder</a> trained a model on various data sources and tasks like text classification, semantic similarity, and clustering. Training a model on varied tasks makes it more general and captures more of the possible expressiveness of a text snippet. The model demonstrates promising results on eight transfer tasks, suggesting that training on diversified data sources and sufficiently varied tasks makes it universal, as the name suggests. Universal embeddings are what we were looking for to exploit our most diverse and deep content, Yelp reviews. With the generated review embeddings, we want to extract the business information and context given in a review, perform sentiment analysis and even rank reviews by their relevance and information diversity.</p><p>The deep averaging network (DAN) version of USE takes word and bigram embeddings and averages them together. 
This resulting embedding serves as input to a feedforward deep neural network that produces the universal sentence embedding we aim to obtain.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-04-20-yelp-content-as-embeddings/DAN-architecture.png" alt="An architecture overview of DAN, taken from https://amitness.com/2020/06/universal-sentence-encoder/" /><p class="subtle-text"><small>An architecture overview of DAN, taken from https://amitness.com/2020/06/universal-sentence-encoder/</small></p></div><h3 id="yelp-exploration">Yelp Exploration</h3><p>By nature, most NLP models will perform better when trained on domain-specific text. With this hypothesis, we developed and compared a Yelp fine-tuned encoder with the pre-trained USE model available on TensorFlow Hub. We aimed to create a model better adapted to the Yelp domain than the pre-trained one. After fine-tuning the model, we wanted to use it to generate embeddings for reviews specifically.</p><p>Yelp data contains different text formats like reviews, captions, searches, and survey responses that can all be used to fine-tune the USE encoder. Since these models are not generative, we needed to create generic supervised learning tasks to fine-tune the model on Yelp domain text.</p><p>Some examples of learning tasks we used:</p><ul><li>Review Category Prediction</li>
<li>Review Rating Prediction</li>
<li>Search Category Prediction</li>
<li>Sentence Order Prediction</li>
<li>Same Business Prediction</li>
</ul><p>For the evaluation task, we chose:</p><ul><li>Photo Caption Classification</li>
<li>Menu Item Classification</li>
<li>Business Property Classification</li>
<li>Synonym Generation for a phrase input</li>
</ul><p>The model evaluation on the Yelp domain showed that the ready-to-use model performed as well as or better than the Yelp fine-tuned encoder on all tasks. This is likely because the Yelp domain touches many generic subjects already covered by the USE model, or because our experiments lacked the task diversity needed to gain an edge. Based on these results, we decided to keep the off-the-shelf USE pre-trained model.</p><h3 id="use-on-yelp-domain">USE on Yelp Domain</h3><p>We can measure two embeddings’ relatedness when they are projected together in the same vector space. This is helpful for semantic search, cluster analysis, and other applications.</p><p>Below is a graph representation of a USE embedding space applied to the Yelp dataset. We wanted to verify that semantically related texts sit close together in the vector space, which is expected of a semantic embedding that captures the general subject of the text snippet it encodes.</p><p>We computed the cosine similarity between embedding representations of reviews from different categories and grouped them into the following heatmap. We verified that reviews from the same category domain were closer in the vector space than reviews from a different domain, as shown by the lighter boxes in the graph below.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-04-20-yelp-content-as-embeddings/USE-heatmap.png" alt="Numbers on the axis reference 44 different review IDs. Those reviews’ business categories are shown in the table below. We can see a clear correlation between reviews from a similar business type." /><p class="subtle-text"><small>Numbers on the axis reference 44 different review IDs. Those reviews’ business categories are shown in the table below. We can see a clear correlation between reviews from a similar business type.</small></p></div><table><thead><tr><th>Labels association table</th>
<th> </th>
</tr><tr><th>Reviews ID</th>
<th>Yelp Business Type</th>
</tr></thead><tbody><tr><td>0 to 10</td>
<td>Restaurants</td>
</tr><tr><td>11 to 21</td>
<td>Dry Cleaning</td>
</tr><tr><td>22 to 32</td>
<td>Groomer</td>
</tr><tr><td>33 to 43</td>
<td>Plastic Surgeon</td>
</tr></tbody></table><h2 id="business-embeddings">Business Embeddings</h2><p>After Yelp created embedding representations of reviews, which showed great potential across several projects, we explored different ways to grow our vector representations. We started by developing a business vector representation using all of a business’s metadata.</p><p>We chose to base our business embedding on user content. We select the 50 most recent reviews and average their vector embeddings to create our first business embedding representation. It’s a great way to start since reviews contain quality content describing the businesses. The next step will be to add the photo embeddings as well.</p><p>Business embeddings help generate a top-k similarity list to relate businesses to other businesses, users to businesses and users to users based on their matching business interaction history. This correlation matrix of similarities helps surface meaningful recommendations like “Users like you also liked…” or “Since you like business A, you might like business B”. You can learn more about this use case in <a href="https://engineeringblog.yelp.com/2022/04/beyond-matrix-factorization-using-hybrid-features-for-user-business-recommendations.html">this blog post</a>.</p><h2 id="photo-embeddings">Photo Embeddings</h2><p>Review and business vector representations have existed at Yelp for some time already. Last year, the publication of the <a href="https://arxiv.org/abs/2103.00020">paper</a> presenting the Contrastive Language-Image Pre-training (CLIP) model inspired us at Yelp to generate more semantic data representations, this time based on photos.</p><p>Research on the semantic representation of photos improved significantly with the use of transformers applied to images. 
This section will present OpenAI’s CLIP model, its known capabilities, the pre-trained model’s effectiveness on the Yelp domain and some vulnerabilities that are good to be aware of before using it.</p><h3 id="clip-model">CLIP model</h3><p>We based our photo encoder on the CLIP model because of its performance and abilities. This model has learned to associate an image with the most relevant of the texts it is given. It is a pre-trained zero-shot model that associates natural language with high-level visual concepts.</p><p>CLIP takes two inputs: an image and a set of candidate texts. The feature embeddings of the image and of each candidate text are generated by their respective encoders. CLIP then pulls similar image-text pairs together in the embedding space and pushes dissimilar ones apart, using contrastive representation learning based on cosine similarity. Our first goal here is to generate photo embeddings. To that end, we experimented with the pre-trained CLIP model, applied the generated embeddings to the Yelp dataset in the next section, and compared the results with our models in production.</p><p>The CLIP model is a zero-shot model, meaning it can infer successfully from an unseen dataset. A zero-shot model is an opportunity for Yelp to better identify and tag unseen photo categories to improve photo search. Our classifier won’t need a thousand examples for each new tag or label added.</p><p><a href="https://openai.com/blog/multimodal-neurons/">Research done on CLIP</a> showed that its neurons are multimodal and respond to abstract concepts. Instead of reacting to a specific image feature like a Convolutional Neural Network model’s neuron, a CLIP neuron responds to a cluster of ideas with a high-level theme.</p><p>In the table below are some examples of high-level themes. You can see tombstones in the image associated with the word Halloween. 
Those images, generated using different tools referenced in the <a href="https://openai.com/blog/multimodal-neurons/">OpenAI blog post</a>, try to maximize a single neuron’s activation with gradient-based optimization for the given input (i.e. “Halloween”) and the distribution of images.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-04-20-yelp-content-as-embeddings/CLIP-visual-neurons.png" alt="Image taken from OpenAI blog post: https://openai.com/blog/multimodal-neurons/" /><p class="subtle-text"><small>Image taken from OpenAI blog post: https://openai.com/blog/multimodal-neurons/</small></p></div><h3 id="evaluation-made-on-yelp-photo-dataset">Evaluation made on Yelp Photo Dataset</h3><p>We compared the CLIP model with three existing ResNet50 classification models to evaluate CLIP’s capability on the Yelp domain. Our 5-way Restaurant, Food and Nightlife classifier identifies <em>Food, Drinks, Menu, Interior or Exterior</em> categories for photos. The food classifier covers 27 food dish categories, and the Home Services Contractor Classifier identifies five categories of repairs. We tested the CLIP model without any fine-tuning applied to the pre-trained model found on <a href="https://huggingface.co/docs/transformers/model_doc/clip">HuggingFace</a>. We manually engineered the classes’ labels to optimize the CLIP model’s performance but didn’t optimize the categories themselves, since we wanted a direct comparison with the existing models.</p><h4 id="5-way-restaurant-food-and-nightlife-classifier">5-Way Restaurant, Food and Nightlife Classifier</h4><p>While experimenting, we quickly concluded that we could not simply reuse our existing class names as input. The paper suggested adding ‘<em>A photo of</em>’ in front of each label, but it didn’t prove effective for all the categories. 
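</p><p>At inference time, zero-shot classification with these engineered labels amounts to embedding the photo and each label prompt, then taking a softmax over the image-text similarity scores. The sketch below illustrates only that scoring step; the similarity logits are made up for illustration and stand in for the output of a real CLIP forward pass:</p>

```python
import numpy as np

# Engineered label prompts following the "A photo of ..." pattern.
prompts = [
    "A photo of a drink",
    "A photo of food",
    "A photo of a menu",
    "A photo of inside a restaurant",
    "A photo of a restaurant exterior",
]

# Made-up image-text similarity scores for one photo (a real pipeline would
# compute cosine similarities between CLIP's image and text embeddings,
# scaled by the model's learned logit scale).
logits = np.array([21.0, 3.0, -5.0, 8.0, 1.0])

# Softmax over the prompts turns similarities into per-label probabilities.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

predicted_label = prompts[int(np.argmax(probs))]
print(predicted_label)  # → "A photo of a drink"
```

<p>Because the labels are plain text, adding a new category is as cheap as writing a new prompt, which is what makes the zero-shot setup attractive compared to retraining a ResNet50 for every new class.</p><p>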
The table below contains the label engineering applied to the 5-Way Restaurant, Food and Nightlife classification problem.</p><table><thead><tr><th>Original Labels</th>
<th>Engineered Labels</th>
</tr></thead><tbody><tr><td>Drink</td>
<td>A photo of a drink</td>
</tr><tr><td>Food</td>
<td>A photo of food</td>
</tr><tr><td>Menu</td>
<td>A photo of a menu</td>
</tr><tr><td>Interior</td>
<td>A photo of inside a restaurant</td>
</tr><tr><td>Outside</td>
<td>A photo of a restaurant exterior</td>
</tr><tr><td> </td>
<td>A photo of other</td>
</tr></tbody></table><p>The following table compares the ResNet50 model currently in production and the zero-shot CLIP model. Results for the 5-way restaurant, food and nightlife classifier show that CLIP has potential and that label engineering could beat a domain-trained deep learning model. These results also encourage us to explore further the potential of a fine-tuned CLIP model on Yelp domain.</p><table><thead><tr><th>Comparison Table of the 5-Way Classifier</th>
<th> </th>
<th> </th>
<th> </th>
<th> </th>
</tr><tr><th> </th>
<th>ResNet50</th>
<th> </th>
<th>CLIP</th>
<th> </th>
</tr><tr><th> </th>
<th>Precision</th>
<th>Recall</th>
<th>Precision</th>
<th>Recall</th>
</tr></thead><tbody><tr><td>Drink</td>
<td>96.8 %</td>
<td>87.1 %</td>
<td>96 %</td>
<td>91 %</td>
</tr><tr><td>Food</td>
<td>96.0 %</td>
<td>92.7 %</td>
<td>88 %</td>
<td>91 %</td>
</tr><tr><td>Menu</td>
<td>95.0 %</td>
<td>80.3 %</td>
<td>51 %</td>
<td>94 %</td>
</tr><tr><td>Interior</td>
<td>89.4 %</td>
<td>92.2 %</td>
<td>92 %</td>
<td>77 %</td>
</tr><tr><td>Outside</td>
<td>84.3 %</td>
<td>94.6 %</td>
<td>96 %</td>
<td>80 %</td>
</tr><tr><td>Other</td>
<td> </td>
<td> </td>
<td>29 %</td>
<td>38 %</td>
</tr></tbody></table><p>Let’s dive deeper into these results with the table below. It shows how precisely the CLIP model predicted each class on the hand-labeled Yelp dataset and, more importantly, which categories get confused with each other. Most notable are the photos the CLIP model classified as Other despite having a real label in the dataset.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-04-20-yelp-content-as-embeddings/5-way-precision.jpg" alt="" /></div><p>On closer inspection, we observe that many <strong>Interior</strong> and <strong>Exterior</strong> photos get classified as <strong>Other</strong> by the CLIP model. Here are some examples for <strong>Interior</strong>.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-04-20-yelp-content-as-embeddings/5-way-imgs.png" alt="Images taken from yelp.com" /><p class="subtle-text"><small>Images taken from yelp.com</small></p></div><p>These misclassifications come down to photo composition: people are often in the foreground of interior and exterior photos. The CLIP model is built to emphasize the embedding representation of the concepts shown in an image, and its attention mechanism favors foreground elements at the cost of background elements.</p><h4 id="food-classifier">Food Classifier</h4><p>The Food Classifier aims to identify the dish showcased in a photo. The production model is a ResNet50 trained on 27 food classes (comparison table in Appendix 1). Overall, CLIP performed well compared to the production model, but it still needs improvement in multiple categories.</p><p>CLIP is a peculiar model, and using it like a ResNet50 can introduce errors. First, we must remember that the category labels were engineered but not the categories themselves.
Having too many labels hindered models like ResNet, since each category is trained from scratch and requires many examples.</p><p>By contrast, using as many dish names as possible would better describe the photos for the CLIP model: CLIP was trained by pairing each image against 32,768 randomly sampled text snippets, so it can work with a wide range of possible outputs. For our comparison tests, however, we didn’t do any category engineering.</p><p>Second, we found that some of the original dish categories confused our results. Images labeled <strong>Waffles</strong> in our dataset were counted as misclassified as <strong>Chicken Wings &amp; Fried Chicken</strong> by the CLIP model, but hand verification showed the classification accurately represents the images, which showcase the Texan dish of <em>Fried Chicken and Waffles</em>.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-04-20-yelp-content-as-embeddings/food-waffle-fc-3.png" alt="Image taken from yelp.com" /><p class="subtle-text"><small>Image taken from yelp.com</small></p></div><table><thead><tr><th>Label</th>
<th>Probability</th>
</tr></thead><tbody><tr><td>Chicken Wings &amp; Fried Chicken</td>
<td>44 %</td>
</tr><tr><td>Waffles</td>
<td>11 %</td>
</tr><tr><td>Ribs</td>
<td>9 %</td>
</tr><tr><td>Dessert</td>
<td>9 %</td>
</tr><tr><td>Tacos</td>
<td>5 %</td>
</tr><tr><td>Steak</td>
<td>5 %</td>
</tr><tr><td>Sandwiches</td>
<td>5 %</td>
</tr></tbody></table><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-04-20-yelp-content-as-embeddings/food-waffle-fc-4.png" alt="Image taken from yelp.com" /><p class="subtle-text"><small>Image taken from yelp.com</small></p></div><table><thead><tr><th>Label</th>
<th>Probability</th>
</tr></thead><tbody><tr><td>Chicken Wings &amp; Fried Chicken</td>
<td>51 %</td>
</tr><tr><td>Waffles</td>
<td>42 %</td>
</tr><tr><td>Ribs</td>
<td>3 %</td>
</tr><tr><td>Pancakes</td>
<td>1 %</td>
</tr><tr><td>Dessert</td>
<td>1 %</td>
</tr></tbody></table><p>Lastly, some dish names describe the protein in the meal even when the dish isn’t plated to showcase it. For example, the CLIP model misclassified some images labeled <strong>Grilled Fish</strong> into the <strong>Salad</strong> category.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-04-20-yelp-content-as-embeddings/food-imgs.png" alt="Images taken from yelp.com" /><p class="subtle-text"><small>Images taken from yelp.com</small></p></div><h4 id="home-services-contractor-classifier">Home Services Contractor Classifier</h4><p>The Home Services Contractor Classifier achieved great results with CLIP for most categories. As with the 27-class food classifier seen previously, the categories had been highly curated in the past to optimize the production ResNet50 model. CLIP removes the constraint of needing a large number of examples for each category the model infers, and reviewing CLIP’s possible output classes should lead to more diversified content tags on Yelp.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-04-20-yelp-content-as-embeddings/plah-precision.png" alt="" /></div><p>In the confusion matrix above, we can see that CLIP doesn’t identify enough of the photos labeled “Other” in our dataset. To remedy that, we tried using a 70% confidence threshold to accept a label, defaulting to Other below it. The table below shows the results: a threshold trades increased precision (fewer false positives) for decreased recall (a smaller percentage of positives identified).</p><table><thead><tr><th>Comparison Table of the Home Services Contractor Classifier</th>
<th> </th>
<th> </th>
<th> </th>
<th> </th>
</tr><tr><th> </th>
<th>CLIP</th>
<th> </th>
<th>CLIP - 70% threshold</th>
<th> </th>
</tr><tr><th> </th>
<th>Precision</th>
<th>Recall</th>
<th>Precision</th>
<th>Recall</th>
</tr></thead><tbody><tr><td>Bathroom, Bathtub and Shower</td>
<td>88 %</td>
<td>87 %</td>
<td>91 %</td>
<td>82 %</td>
</tr><tr><td>Decks and Railing</td>
<td>20 %</td>
<td>84 %</td>
<td>35 %</td>
<td>76 %</td>
</tr><tr><td>Door, Door Repair &amp; Installation</td>
<td>24 %</td>
<td>81 %</td>
<td>38 %</td>
<td>74 %</td>
</tr><tr><td>Kitchen</td>
<td>92 %</td>
<td>85 %</td>
<td>94 %</td>
<td>79 %</td>
</tr><tr><td>Solar Panel</td>
<td>83 %</td>
<td>77 %</td>
<td>89 %</td>
<td>69 %</td>
</tr><tr><td>Other Contractors</td>
<td>80 %</td>
<td>57 %</td>
<td>69 %</td>
<td>77 %</td>
</tr></tbody></table><h3 id="clips-vulnerability">CLIP’s vulnerability</h3><p>Before using CLIP and publishing its results, it’s worth knowing its vulnerabilities and how to optimize its performance. We’ve already covered label and category engineering and thresholding; here we describe another likely pitfall of the model at Yelp.</p><p>As seen previously, some neurons correspond to high-level themes. Let’s focus on more abstract concepts like typographic neurons, which respond to images of word snippets and syllables.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-04-20-yelp-content-as-embeddings/CLIP-visual-neurons-vulnerability.png" alt="Image taken from the OpenAI blog post: https://openai.com/blog/multimodal-neurons/" /><p class="subtle-text"><small>Image taken from the OpenAI blog post: https://openai.com/blog/multimodal-neurons/</small></p></div><p>This demonstrates the model’s capability to “read”, as shown in the image above. The caveat is that the algorithm is easily fooled by typographic attacks.
A prominent word like “iPod” handwritten on a sticker can cause a photo to be classified as an iPod, even if the picture clearly shows an apple.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-04-20-yelp-content-as-embeddings/CLIP-vulnerability-apple-ipod.png" alt="Image taken from the OpenAI blog post: https://openai.com/blog/multimodal-neurons/" /><p class="subtle-text"><small>Image taken from the OpenAI blog post: https://openai.com/blog/multimodal-neurons/</small></p></div><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-04-20-yelp-content-as-embeddings/CLIP-vulnerability-apple-pizza.png" alt="Image taken from the OpenAI blog post: https://openai.com/blog/multimodal-neurons/" /><p class="subtle-text"><small>Image taken from the OpenAI blog post: https://openai.com/blog/multimodal-neurons/</small></p></div><p>For Yelp’s dataset, this means restaurant merchandise lying around in a photo could cause additional misclassifications.</p><h2 id="conclusion">Conclusion</h2><p>While working on this project, we took the opportunity to review and upgrade our storage system for the vector representations we are responsible for. We aimed to make this data as accessible and easy to use as possible for any internal project.</p><p>To complete the project, we generated new embeddings for all of our collected Yelp data using the models and techniques chosen to create our content embeddings.</p><p>Yelp aims to constantly grow the breadth, depth, and accuracy of the data we show to our consumers, and review and text embeddings show great promise in helping us improve along all three dimensions.</p><p>Many teams are working with the extensive datasets Yelp offers, and there are still plenty of unexploited opportunities, especially in deep learning. CLIP-based embeddings are our first version of photo embedding generation, and only the beginning.
Fine-tuning the CLIP model on the Yelp domain should further improve the photo embeddings, and our team is presently exploring this. The business embedding currently incorporates only review embeddings; it could also take photos or other metadata as inputs.</p><p>Thanks to this project, Yelp now owns a database with hundreds of millions of embeddings, and many Yelp teams are already using them to improve their products.</p><h2 id="acknowledgements">Acknowledgements</h2><p>Many people were involved in these projects, but special thanks to Parthasarathy Gopavarapu, Satya Deo, John Roy, Blake Larkin, Shilpa Gopi, and Jason Sleight, who helped with the design and implementation of these projects or with the content of this post.</p><h2 id="appendix">Appendix</h2><p>Comparison of the production ResNet50 model with the zero-shot CLIP model after some label engineering.</p><table><thead><tr><th>Comparison Table of the Food Classifier</th>
<th> </th>
<th> </th>
<th> </th>
<th> </th>
</tr><tr><th> </th>
<th>ResNet50</th>
<th> </th>
<th>CLIP</th>
<th> </th>
</tr><tr><th> </th>
<th>Recall</th>
<th>Precision</th>
<th>Recall</th>
<th>Precision</th>
</tr></thead><tbody><tr><td>Pizza</td>
<td>0.96</td>
<td>0.92</td>
<td>0.90</td>
<td>0.83</td>
</tr><tr><td>Sushi &amp; Sashimi</td>
<td>0.87</td>
<td>0.78</td>
<td>0.79</td>
<td>0.69</td>
</tr><tr><td>Ramen &amp; Noodles</td>
<td>0.82</td>
<td>0.95</td>
<td>0.70</td>
<td>0.55</td>
</tr><tr><td>Sandwiches</td>
<td>0.93</td>
<td>0.97</td>
<td>0.57</td>
<td>0.44</td>
</tr><tr><td>Tacos</td>
<td>0.78</td>
<td>0.75</td>
<td>0.83</td>
<td>0.59</td>
</tr><tr><td>Salads</td>
<td>0.67</td>
<td>0.92</td>
<td>0.65</td>
<td>0.50</td>
</tr><tr><td>Donuts</td>
<td>0.80</td>
<td>0.77</td>
<td>0.55</td>
<td>0.87</td>
</tr><tr><td>Steak</td>
<td>0.84</td>
<td>0.84</td>
<td>0.39</td>
<td>0.46</td>
</tr><tr><td>Burgers</td>
<td>0.84</td>
<td>0.87</td>
<td>0.77</td>
<td>0.59</td>
</tr><tr><td>Bagels</td>
<td>0.91</td>
<td>0.90</td>
<td>0.55</td>
<td>0.85</td>
</tr><tr><td>Cupcakes</td>
<td>0.75</td>
<td>0.81</td>
<td>0.74</td>
<td>0.93</td>
</tr><tr><td>Fish &amp; Chips</td>
<td>0.87</td>
<td>0.77</td>
<td>0.89</td>
<td>0.74</td>
</tr><tr><td>Burritos &amp; Wraps</td>
<td>0.79</td>
<td>0.67</td>
<td>0.47</td>
<td>0.66</td>
</tr><tr><td>Hot Dogs</td>
<td>0.76</td>
<td>0.73</td>
<td>0.54</td>
<td>0.90</td>
</tr><tr><td>Crepes</td>
<td>0.94</td>
<td>0.94</td>
<td>0.69</td>
<td>0.55</td>
</tr><tr><td>Waffles</td>
<td>0.89</td>
<td>0.89</td>
<td>0.49</td>
<td>0.88</td>
</tr><tr><td>Pancakes</td>
<td>0.69</td>
<td>0.79</td>
<td>0.38</td>
<td>0.83</td>
</tr><tr><td>Nachos</td>
<td>0.81</td>
<td>0.86</td>
<td>0.77</td>
<td>0.74</td>
</tr><tr><td>Soups &amp; Chowder</td>
<td>0.70</td>
<td>0.71</td>
<td>0.47</td>
<td>0.69</td>
</tr><tr><td>Ribs</td>
<td>0.67</td>
<td>0.60</td>
<td>0.60</td>
<td>0.69</td>
</tr><tr><td>Curry</td>
<td>0.64</td>
<td>0.61</td>
<td>0.57</td>
<td>0.62</td>
</tr><tr><td>Paella</td>
<td>0.79</td>
<td>0.82</td>
<td>0.90</td>
<td>0.79</td>
</tr><tr><td>Oysters &amp; Mussels</td>
<td>0.69</td>
<td>0.79</td>
<td>0.69</td>
<td>0.87</td>
</tr><tr><td>Grilled Fish</td>
<td>0.86</td>
<td>0.77</td>
<td>0.51</td>
<td>0.59</td>
</tr><tr><td>Pasta</td>
<td>0.65</td>
<td>0.53</td>
<td>0.55</td>
<td>0.85</td>
</tr><tr><td>Chicken Wings &amp; Fried Chicken</td>
<td>0.81</td>
<td>0.83</td>
<td>0.57</td>
<td>0.56</td>
</tr><tr><td>Dessert</td>
<td>0.85</td>
<td>0.86</td>
<td>0.58</td>
<td>0.41</td>
</tr></tbody></table><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2023/04/yelp-content-as-embeddings.html</link>
      <guid>https://engineeringblog.yelp.com/2023/04/yelp-content-as-embeddings.html</guid>
      <pubDate>Thu, 20 Apr 2023 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Gondola: an internal PaaS architecture for frontend app deployment]]></title>
      <description><![CDATA[<p>The Yelp website serves millions of users and business owners each day, and engineers in our product teams are constantly adding and improving features across hundreds of pages. Webcore, Yelp’s frontend infrastructure team, is always looking to ensure that web developers can ship their changes quickly and safely, without the burden of maintaining complex team-specific infrastructure.</p><p>To achieve this, we made some significant changes to our internal deployment model for <a href="https://reactjs.org/">React</a> pages in late 2019. This blog post will explain why we made these changes, describe the new architecture we implemented, and share some of the lessons we learned along the way.</p><p>We ended up with an architectural model based on an immutable <a href="https://en.wikipedia.org/wiki/Key%E2%80%93value_database">key-value (KV) store</a> with clearly defined page boundaries: frontend asset manifests that can be hot-swapped quickly and safely in production. Alongside that platform layer, “Gondola”, we rolled out a new <a href="https://en.wikipedia.org/wiki/Monorepo">monorepo</a>, solving many of the challenges we had begun facing as we scaled the number of feature teams and webpages across the site.</p><p>Yelp’s website was originally served by a large Python <a href="https://en.wikipedia.org/wiki/Monolithic_application">monolith</a>, and over time this has shifted towards a <a href="https://en.wikipedia.org/wiki/Microservices">microservice architecture</a> for backend services, allowing teams to maintain their own Docker images, deployment pipelines, and runbooks. This concept was then expanded to the frontend, which brought over frontend asset build configs (<a href="https://webpack.js.org/">webpack</a>, <a href="https://babeljs.io/">Babel</a>, <a href="https://eslint.org/">ESLint</a>…) for teams to maintain. 
Webcore set up shared configs and CLI tooling to encode recommended best practices in order to ensure a consistent frontend build experience.</p><p>In this environment, each individual feature team at Yelp ended up owning one small “website slice”, from top to bottom. Full-stack developers on these teams would be responsible for their entire stack, encompassing both the frontend and backend as well as the linting, testing, and on-call responsibilities that came along with it. Even with the help of the shared Webcore-provided frontend infra tooling, relying on teams to keep the shared configs up to date wasn’t ideal - especially if certain frontend microservices had minor deviations.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-02-24-gondola-an-internal-paas-architecture-for-frontend-app-deployment/status_quo.png" alt="Our status quo model, where each team owns a potentially-fragmented piece of the website stack" /><p class="subtle-text"><small>Our status quo model, where each team owns a potentially-fragmented piece of the website stack</small></p></div><p>As a result, we often saw a lag between releasing a new version of our shared build infrastructure and seeing its effects on the wider set of web pages. We’d sometimes even have cases where pages would be stuck on an old version of our tooling for months, and so it was difficult for Webcore to have confidence in infrastructure changes we released. Manually testing every frontend microservice wasn’t feasible because they often drifted from Webcore standards, resulting in custom deployment models and unique setups.</p><p>As we started moving to React and away from our Python-powered templating, it was clear that we were becoming less reliant on server-side logic. Much of our UI was starting to be described via React (rendered through Server Side Rendering), and our data fetching was moving to GraphQL on a per-component basis. 
Despite not needing anything other than simple data fetching and stitching on the server, developers would have to deploy a full Python service to make even a simple copy change or style update. This could sometimes take an hour or more for larger deployments when many instances were required, and rolling back or reverting changes could take a similar amount of time even for frontend-only updates!</p><h2 id="a-better-model">A better model</h2><p>When comparing our largest frontend microservices at Yelp, we could see that much of our existing infrastructure concerning the deployment of pages could be simplified. Large amounts of boilerplate code existed in order to fetch data, manipulate it into an appropriate form, and then send it off to be server-side rendered using a specified React component representing the whole page.</p><p>We also saw room for improvement given the fact that our services were now generally “thin”, since they delegate <a href="https://www.youtube.com/watch?v=G8P9njqLwHo">React SSR to an external service powered by Hypernova</a> (something we published <a href="https://engineeringblog.yelp.com/2022/02/server-side-rendering-at-scale.html">an updated blog post talking about</a> recently). We imagined a new, centralized service containing generalized logic built to serve all web pages at Yelp. Essentially an internal <a href="https://en.wikipedia.org/wiki/Platform_as_a_service">Platform-as-a-service</a> for React pages!</p><p>Our service, “<strong>Gondola</strong>”, had the following requirements:</p><ol><li>Deploying and rolling back frontend code should be near-instant</li>
<li>Deployment of assets should be decoupled from the Python code powering Gondola</li>
<li>The service should contain minimal page-specific logic: all page behavior should be described by the rendered React components</li>
<li>Teams should only be required to own product code, not infrastructure, and ownership should be clearly defined</li>
</ol><p>Our first step was to reduce the scope of team ownership from a microservice (the “full website slice”) to a “page”. A Gondola page can be defined as an asset manifest describing all JS and CSS entrypoint files that we need to include in order to fully describe a desired UI, along with appropriate chunk names (including async chunks) mapped to public <a href="https://en.wikipedia.org/wiki/Content_delivery_network">CDN</a> urls for each asset. It gives us a way to fully describe each page’s frontend needs and can be generated at build time by webpack:</p><div class="language-json highlighter-rouge highlight"><pre>{
  "entrypoints": {
    "gondola-biz-details": {
      "js": ["gondola-biz-details.js", "common.js"],
      "css": ["gondola-biz-details.css"]
    },
    "gondola-search": {
      "js": ["gondola-search.js", "common.js"],
      "css": ["gondola-search.css"]
    }
  },
  "common.js": "commons-yf-81b79eb1bc6d156.js",
  "gondola-biz-details.js": "gondola-biz-details_a775bc492d91960a.js",
  "gondola-biz-details.css": "gondola-biz-details_eabd4c9f434f9468.css",
  "gondola-search.js": "gondola-search_69082d627b823fd5.js",
  "gondola-search.css": "gondola-search_d0ef76f21dcbf11d.css"
}</pre></div><p>This choice was very deliberate, as it allows us to embrace the web platform (with URLs at its core) as the primary building block for routing, bundling, and deploying code to yelp.com. This simplified many decisions in the rest of our design once we had settled on this level of granularity as our main abstraction.</p><p>We then took our existing Pyramid React renderer (a <a href="https://docs.pylonsproject.org/projects/pyramid/en/latest/narr/renderers.html">Pyramid renderer</a> designed to take props from Pyramid and produce a rendered SSR page via React), which was built for individual teams to use in their services, and tweaked it to work alongside a fast KV store powered by DynamoDB.
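The deployment model this enables is easy to reason about: manifests are immutable once written, and “deploying” just repoints the active version for a route. A toy sketch of the two-table idea with plain dictionaries standing in for DynamoDB (function, key, and file names here are illustrative, not Yelp’s actual schema):

```python
# Toy in-memory stand-ins for the two tables: immutable manifests keyed by
# (page, version), and the active version currently serving each route.
manifests = {}
active = {}

def publish(page, version, manifest):
    """Build step: write a new immutable manifest. Does not affect traffic."""
    key = (page, version)
    assert key not in manifests, "a (page, version) pair is never rewritten"
    manifests[key] = manifest

def deploy(path, page, version):
    """'Deploying' is flipping a single version row; returns the previous
    value so a rollback is the same near-instant operation."""
    assert (page, version) in manifests, "assets must be published beforehand"
    previous = active.get(path)
    active[path] = (page, version)
    return previous

def render(path):
    """Serve a request: resolve the active manifest for the matched route."""
    page, version = active[path]
    return manifests[(page, version)]

publish("gondola-search", "v41", {"js": ["gondola-search_aaa.js"]})
publish("gondola-search", "v42", {"js": ["gondola-search_bbb.js"]})
deploy("/search", "gondola-search", "v41")
prev = deploy("/search", "gondola-search", "v42")  # near-instant deploy
deploy("/search", *prev)                           # ...and instant rollback
```

Because a (page, version) pair can never change once written, the serving side can cache manifests aggressively with no invalidation logic.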
In our database, we store our page manifest data keyed by Gondola page version, and in a separate table track the active Gondola page for a given path (we use the commonly-adopted <a href="https://github.com/pillarjs/path-to-regexp">path-to-regexp</a> format for matches here).</p><p>All interaction with our KV store is performed via a small CLI tool we distribute across our development environments (including Jenkins) which talks to DynamoDB in a consistent schematised way. The Gondola service itself only requires read-only access to the database so that it can serve the appropriate pages as requests come in.</p><p>This means that the flow for an incoming request to Yelp looks as follows:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-02-24-gondola-an-internal-paas-architecture-for-frontend-app-deployment/request_flow.png" alt="An incoming request hitting the Gondola service to render the Search page" /><p class="subtle-text"><small>An incoming request hitting the Gondola service to render the Search page</small></p></div><ol><li>A user requests a Gondola-powered page such as /search - this goes directly to Gondola, and matches the route via <a href="https://github.com/pillarjs/path-to-regexp">path-to-regexp</a></li>
<li>Gondola queries DynamoDB to determine the active version for /search, and the accompanying asset manifest for that version</li>
<li>A query is made to our dedicated Server Side Rendering (SSR) service which returns rendered html</li>
<li>The appropriate asset tags from the manifest are included in the page shell to hydrate the page</li>
</ol><p>By basing the rendering of the page entirely on the contents of the page manifest, the Gondola service has a lot of flexibility: this model supports our first requirement of near-instant deployments, since “deploying” a Gondola page now consists of updating a single version row in our DB. This assumes you’ve built and uploaded your assets and manifest, but this can happen at any time beforehand: creating a new Gondola page version isn’t tied to deployment.</p><p>This means that our merge pipeline becomes a lot safer. The only thing that can affect production is the DB being updated to flip active versions, and the version can be instantly reverted in the same way if we spot errors during rollout.</p><p>The nature of the KV store model also lends itself to cacheability: a given page and version pair is <strong>immutable</strong>, and we can serve manifests very efficiently from an in-memory cache layer without needing complex cache invalidation.</p><p>One of the most important benefits of this model is that Webcore now has the ability to make changes to all Gondola pages at once, and introduce significant UX and DX improvements across all pages with ease. For example, we can add new metrics to our performance logging infrastructure centrally, or optimise our first-byte times for all pages with a single Pull Request.</p><p>In a world where teams maintain their own frontend microservices, we don’t have the ability to make sweeping changes. This would require either a large amount of onboarding and education or Webcore-led migrations to get everyone onto the latest and greatest libraries containing any improvements we ship out.
This comes with its own set of dependency versioning challenges and is generally no fun for anyone.</p><h3 id="deployment-previews">Deployment Previews</h3><p>Another win for this model is the ability to layer additional logic around the hot-swapping of frontend versions: as one important example, it allows us to implement Deployment Previews internally, where we can tag specific versions as pre-release and view them against the production website instantly via a query param.</p><p>A deployment preview model naturally fits with our routing behavior above. Deployment Preview IDs (using memorable and fun names like cool-purple-hippo-24!) can slot in anywhere that versions are used, and the logic remains almost identical.</p><p>While not a novel feature (most modern static site hosts offer something similar), having Deployment Previews internally allows for:</p><ul><li>Realistic demos against prod data rather than relying on persistent sandboxes or screenshots</li>
<li>The ability to quickly compare two versions against the same environment, including unreleased versions</li>
<li>Audits and automatic smoke tests run on every PR, against the Deployment Preview url</li>
</ul><p>The last point is something that has a great deal of potential in the future, too: we already have several “Page Checks” which run Lighthouse performance audits, checks for console errors or JS exceptions, A11y audits, and automatic screenshots. All of these checks can be run at PR-time without the developer having to do anything, with results conveniently reported back via GitHub status checks:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-02-24-gondola-an-internal-paas-architecture-for-frontend-app-deployment/page_checks.png" alt="Example Page Checks running against a branch’s Deployment Preview at PR time" /><p class="subtle-text"><small>Example Page Checks running against a branch’s Deployment Preview at PR time</small></p></div><p>This all hinges on our ability to switch out the running version of a page near-instantly in any environment, made possible by Gondola. There are likely many other opportunities unlocked by this newfound freedom that we’ve yet to explore!</p><h2 id="the-monorepo">The Monorepo</h2><p>In addition to our work to build out the Gondola service, we needed a pipeline to ferry changes between a Pull Request, asset manifest, and our DB, so that they can subsequently be deployed.</p><p>Our status quo was a loose collection of team-owned Jenkins pipelines spread across many different individual git repositories. This was never ideal for the reasons outlined earlier, but the rethinking of our deployment model gave us a great opportunity to do something about our package dependency model.
The result was a new monorepo for frontend code.</p><p>By moving to a monorepo, we sought to solve some of the largest problems that had been frustrating developers previously:</p><ul><li>No more “<a href="https://en.wikipedia.org/wiki/Dependency_hell">dependency hell</a>” - updating a monorepo package version automatically releases a new page if the package is directly or transitively depended upon, and packages are enforced to only depend upon the latest version on disk
<ul><li>It’s easier to reason about the dependencies that will be bundled in the final page</li>
<li>We can also globally enforce a <a href="https://yarnpkg.com/cli/dedupe#details">deduplicated lockfile</a> to minimise our install and build times</li>
</ul></li>
<li>No backend infrastructure to maintain: we’ve moved all of that to the Gondola service, so the monorepo can be 100% frontend code</li>
<li>Any improvements to the build immediately benefit all developers, with no need for migrations - all build infrastructure and tooling is shared and maintained by Webcore, and any changes can be easily confirmed to work in the monorepo</li>
<li>Faster, more efficient bundling: since all pages and packages live together, we’re able to run a single Webpack build with multiple entry points, and utilize <a href="https://web.dev/granular-chunking-nextjs/">granular chunking strategies</a> that can take advantage of cross-page shared chunks</li>
</ul><p>To avoid the growth of the monorepo slowing down developers, we built and continue to maintain tooling to run tests only against packages that have been affected by the PR in question (we use <a href="https://github.com/lerna/lerna">lerna</a> with additional custom scripts). This was and is one of the biggest concerns that tends to appear when discussing monorepos, and it’s important that we stay on top of build performance to ensure that we’re not frustrating developers.</p><p>We also enforce strict package boundaries and require that each package in the monorepo has a <a href="https://engineeringblog.yelp.com/2021/01/whose-code-is-it-anyway.html">defined owner</a>. We provide a helpful scaffold to make this process simple when first adding code to the monorepo, which has helped significantly with onboarding.</p><h2 id="developers-developers-developers">Developers, developers, developers</h2><p>A major part of the work involved with Gondola was to ensure that developers could be onboarded with minimal disruption. A lot of this work was non-technical: we felt it was important to involve our customers (in this case, internal front-end developers) as early as possible in the design process and make sure that what we were building was actually useful for them! Writing docs as we went and pairing with early adopters directly helped mitigate a vast swathe of potential problems which we may not otherwise have discovered.</p><p>In our case, since we were asking developers to change some of their pre-established patterns of working on frontend code, we sought to maintain as much familiarity as possible with our tooling decisions. As one example, at Yelp we use Make as standard in all repos, so it was important to ensure that a developer opening up the monorepo for the first time would feel at home. 
We set up symlinked Makefiles per-package to ensure that running commands from within a package would feel close-to-identical to the old flow.</p><p>We also set up a dedicated docsite with an in-depth migration guide, and provided clear iterative steps: in particular, we emphasized that step one involved moving frontend code to the monorepo <em>without</em> requiring a move to Gondola. This made it easier for teams to tackle the migration at their own pace without the need for any “big bang” rewrites.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-02-24-gondola-an-internal-paas-architecture-for-frontend-app-deployment/gondola_docs.png" alt="Part of our dedicated Gondola migration guide written internally for developers" /><p class="subtle-text"><small>Part of our dedicated Gondola migration guide written internally for developers</small></p></div><h3 id="supporting-legacy-data-fetching">Supporting legacy data fetching</h3><p>While GraphQL is our primary supported data fetching method at Yelp, there are still some services which continue to fetch their data via Python. Since we don’t expose the Python backend to Gondola users, this poses a problem: how can we allow developers to onboard onto Gondola without requiring them to take on an <em>additional</em> GraphQL migration?</p><p>We solved this by building a custom <a href="https://docs.pylonsproject.org/projects/pyramid/en/latest/narr/renderers.html">Pyramid renderer</a> we call the “Gondola Legacy Renderer”: it’s designed to be plugged into any existing service, firing off a request to Gondola with an additional set of “legacy props” passed via GET request body internally.
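In other words, the legacy renderer turns an existing Pyramid service into a thin shim that bolts its server-fetched data onto an otherwise normal Gondola request. A hypothetical sketch of that glue (function and field names are ours for illustration, not the actual renderer’s API):

```python
def build_gondola_request(path, query_params, legacy_props):
    """Package a service's Python-fetched data as 'legacy props' on an
    internal request to Gondola, which then renders the page as usual."""
    return {
        "path": path,                        # route Gondola matches via path-to-regexp
        "params": dict(query_params),        # original request params, passed through
        "legacy_props": dict(legacy_props),  # server-side data, no GraphQL needed yet
    }

# An existing service proxying one of its pages through Gondola.
req = build_gondola_request(
    "/biz/some-business",
    {"utm_source": "email"},
    {"locale": "en_US", "reviews": [{"rating": 5}]},
)
print(req["legacy_props"]["locale"])  # prints "en_US"
```

The key point is that the existing service keeps owning its data fetching while Gondola owns rendering and deployment, so teams can finish the GraphQL migration later.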
This means that we unlock the ability for any existing service to <em>become a proxy for Gondola itself</em>, gaining the majority of benefits of a “real” Gondola page while teams complete their migration to GraphQL.</p><p>Several teams have adopted the Legacy Renderer and we’re pleased with its ability to bridge the gap for developers who otherwise may not have had the bandwidth to start migrating away from dedicated team-owned services.</p><h2 id="the-future">The future</h2><p>With Gondola, we aimed to build a platform for all of Web at Yelp: we wanted to introduce a large shift in our mental model of deployment and question some of the existing assumptions we had about what was feasible to design.</p><p>So far, we’ve seen positive signs from our customers that our approach was successful. The majority of Yelp’s web traffic is now served by Gondola, but there’s lots more to do: the Gondola platform can never really be “finished”, so we continue to roll out and improve core features and take into account feedback from web developers across the company.</p><p>As teams continue to onboard, we’ve introduced optimistic build queues, started incrementally adopting fast rust-based tooling like <a href="https://swc.rs/">swc</a> in critical areas, and continue to implement Page Checks to provide assurance that PRs created against Gondola meet the company’s web performance goals. 
There’s also room for exciting new Deployment Preview integrations and ways to improve our DX for all developers.</p><p>With releases like <a href="https://reactjs.org/blog/2021/06/08/the-plan-for-react-18.html">React 18</a> and its support for streamed SSR responses, our ability to make sweeping changes across the monorepo (and by extension all Gondola pages at Yelp) gives us confidence that we can perform this and other large migrations in ways that stay out of feature developers’ way: something that’s critical to ensure we’re not negatively affecting deployment velocity while embracing industry best practices.</p><h2 id="conclusion">Conclusion</h2><p>The creation of Gondola itself was years in the making: the journey from our legacy Python/jQuery templates, to React, to GraphQL, and finally to the monorepo model did not happen overnight. It was important to iterate gradually with immediate benefits gained at each stage - <a href="https://www.joelonsoftware.com/2000/04/06/things-you-should-never-do-part-i/">rewrites should always be avoided</a>!</p><p>By simplifying and slimming down our deployment model, we’ve been able to introduce features that were impossible before, removing a large amount of cognitive overhead from feature devs who shouldn’t be required to maintain their own website stacks top-to-bottom.</p><p>It’s been exciting and encouraging to see the positive response from developers, as well as the amount of support we’ve had from all our internal customers! There’s a lot more we want to get done, but Gondola serves as a great platform for us to do it, and the future of web development is looking exciting at Yelp.</p><h2 id="acknowledgements">Acknowledgements</h2><p>Gondola wouldn’t have been possible without the input from many teams and individuals across the company. 
Thanks go out to current and past members of the Webcore team, the many contributors to Gondola’s codebase and docs, as well as the initial spec reviewers from our product teams that helped turn the idea into reality.</p><p>Additional thanks goes out to all the developers in our web tech community that work every day with the platform and offer us honest and direct feedback that helps us shape Gondola’s roadmap!</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2023/03/gondola-an-internal-paas-architecture-for-frontend-app-deployment.html</link>
      <guid>https://engineeringblog.yelp.com/2023/03/gondola-an-internal-paas-architecture-for-frontend-app-deployment.html</guid>
      <pubDate>Fri, 03 Mar 2023 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[How Yelp's Security Team Does Threat Hunting]]></title>
      <description><![CDATA[<p>Here at Yelp, we have multiple security teams specialized in various areas. One thing we all have in common is the fact that we all enjoy a bit of threat hunting occasionally. We opted to take advantage of everyone’s diverse knowledge and began our journey of creating our own threat hunting methodology. This blog post includes the less glamorous details, such as our early beginnings and initiative, our “success in progress” and the multitude of approaches that we considered. In the end, we will present the stable process that we are now using and continuing to improve at every iteration.</p><p><em>Imaginary engineer (working outside Yelp): ‘Wait, so does Yelp conduct threat hunts?’</em></p><p><em>Yelp: ‘Of course, we do! Do you not?’</em></p><p><em>Imaginary engineer: ‘Well… it looks so complex that we don’t know where to start yet’</em></p><p><em>Yelp: ‘Oh, it’s actually only as complex as you allow it to be. We tried a complex process but we also tried a working one. Here, let me tell you what we tried and how it suited us.’</em></p><p>Rather than having a dedicated threat-hunting team, we made participation in threat-hunting exercises voluntary and available to all our great security engineers. Our success story stems from having captured plenty of interest: soon enough, we had more and more people participating. Our threat hunters are your typical security engineers who put their blue hat on and start building security tooling, processes and everything else they see fit to make sure nothing keeps Yelp awake at night. 
They are curious and tenacious and they like putting on other hats for a change, so they are happy to put away their blue hat and try on a red one every once in a while.</p><h2 id="the-lets-start-threat-hunting-phase-aka-phase-0">The “Let’s Start Threat Hunting!” Phase (aka Phase 0)</h2><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-03-01-our-threat-hunting-journey/phase0.png" alt="" /></div><p>Luckily, before our time at Yelp, there were 2 great engineers who started it all. They would get together once in a while. They would think about the security gaps and shortcomings their organization shared but had never recorded anywhere. They’d cherry-pick one and exploit it to the limit. Who doesn’t like breaking stuff? Then, quietly, they would throw away their red hat as if nothing had happened, put back their more comfortable hat, the blue one, and fix those security shortcomings. As quiet as they were, they still caught attention. But the good kind of attention, as that’s how we got buy-in from stakeholders to do more of these! Yay!</p><p>So… what went wrong? Well, the team got bigger and bigger and everyone wanted to be part of their success story. But we couldn’t contribute - we weren’t part of the circle when the knowledge was shared. It didn’t scale. It didn’t fit new employees.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-03-01-our-threat-hunting-journey/before_phase1.png" alt="" /></div><h2 id="the-lets-ramp-up-everyone-phase-aka-phase-1">The “Let’s Ramp up Everyone” Phase (aka Phase 1)</h2><p>What does your instinct tell you to do when you have 10 people instead of 2? Group them in teams! So we split the group into three teams. The red team would plan a threat hunt, emulate their attack and map their recordings to MITRE ATT&amp;CK and Cyber Kill Chain. The blue team would investigate it as per our Incident Response procedure. 
The purple team would analyze what was caught and what was missed, and they would dive deep into anomalous logs to make sure there was no real threat present in our environment, putting aside our emulation. Then we’d all brainstorm security controls to improve our posture for every security gap, and implement them. Then we’d try the attack again and send an executive report to stakeholders with the TL;DR. Then finally we would do a retrospective.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-03-01-our-threat-hunting-journey/phase1.png" alt="" /></div><p>That’s a lot of steps, isn’t it? Can you guess how many were missed during the threat hunts? Plenty. Why? This process was so laborious and intensive that it never fully caught on. And there was always so much work for everyone! Many lost interest and excitement quickly because, as you may recall, we are not a threat hunting team. Threat hunting was seriously competing with our other roadmapped projects and initiatives. We did achieve some success, the greatest being that by working in teams, everyone got ramped up and was happy to contribute. That had been our greatest struggle during Phase 0, so we’re confident that this step was necessary before we were able to move forward.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-03-01-our-threat-hunting-journey/before_phase2.png" alt="" /></div><h2 id="the-lets-get-back-to-basics-phase-aka-phase-2">The “Let’s Get Back to Basics” Phase (aka Phase 2)</h2><p>By this point, our process had become too complex a machine, so we decided to scale it back down. It wasn’t in vain, though - we got plenty of good ideas, some that are now in use and others that are dormant, waiting for us to be ready. 
THIS IS OUR CURRENT PROCESS.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-03-01-our-threat-hunting-journey/phase2.png" alt="" /></div><p>We went back to 2 people working on a threat hunt, while the rest are free to work on their other responsibilities. We’ve gotten better at planning: we only choose granular exploits, similar in size and difficulty to a MITRE TTP. TTP stands for “Tactics, Techniques and Procedures” within the MITRE ATT&amp;CK framework, which is a knowledge base for past and current malicious activity all over the world. A TTP is like a zoom tool with 3 levels. You zoom once and you see why an attacker might perform an action. You zoom twice and you see how they do it at a conceptual level. You zoom thrice and you see the actual tools that they use and other hands-on details.</p><p>A granular exploit that we choose to hunt is often actually a TTP. And if we need to threat hunt a complex scenario, like a real attack from the news, then we break it down into TTPs and conduct one threat hunt per TTP. Depending on the exploit complexity and the engineers’ time availability, the team is free to choose how many iterations a threat hunt needs.</p><p>Also, we carefully choose the scope rather than trying to tackle all of our environments and edge cases at once. As we’re working at such granularity, the 2 people can take the threat hunt from beginning to end and deliver measurable value fast: they plan the threat hunt, exploit it, investigate whether we had any security controls in place to catch it and fix them if needed. They also look at our real logs that match the same criteria and investigate any anomalies, for potential real threats. 
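</p><p>For illustration only (this is not our actual tooling), breaking a complex scenario down into per-TTP hunts can be sketched in a few lines of Python, using tactic and technique names from the public ATT&amp;CK taxonomy:</p>

```python
from dataclasses import dataclass

@dataclass
class TTP:
    """One MITRE ATT&CK triple, matching the three "zoom levels"."""
    tactic: str      # zoom 1: why the attacker performs the action
    technique: str   # zoom 2: how they do it, conceptually
    procedure: str   # zoom 3: hands-on details and tooling

def plan_hunts(scenario, ttps):
    """Break a complex scenario into one threat hunt per TTP."""
    return [f"{scenario}: hunt {t.tactic} / {t.technique}" for t in ttps]

# Illustrative TTPs, using real ATT&CK tactic/technique names.
ttps = [
    TTP("Initial Access", "Phishing", "malicious attachment"),
    TTP("Credential Access", "Brute Force", "password spraying"),
]
hunts = plan_hunts("attack-from-the-news", ttps)
```

<p>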
Finally, they prepare a short presentation (a handful of slides of content) and present it to the rest of the teams and interested stakeholders.</p><p>This process was well embraced by the team, and everyone gets to be part of the success story while still having the time to deliver their other projects. And we get measurable and consistent results that are easy to grasp by anyone interested in our work.</p><h2 id="the-hopes-for-the-future-phase-aka-phase-3">The “Hopes for the Future” Phase (aka Phase 3)</h2><p>So this is our current process. Are we done? No, we are not done. Remember that some ideas that came up during Phase 1 are still dormant? We’d like to wake them up at some point. Not today, but when we’re ready. By then, who knows, maybe we will even automate them. In this way, we would keep the process as light as today, but we would inform it with logic on prioritization, risk assessment, metrics, stakeholder bulletins, etc.</p><p>But until then, <a href="https://i.pinimg.com/originals/46/1e/a2/461ea2bdb2dfd17c166bbd2f7379384e.jpg">we’ll do as Saitama</a> and leave tomorrow’s problems to tomorrow’s us.</p><p><em>Imaginary Engineer: ‘Thanks! That’s quite a journey! I’ll give it a try… But what if we fail?’</em></p><p><em>Yelp: ‘Well, remember what Albert Einstein said: “Failure is success in progress”. We’re confident that we’re progressing towards success with each and every phase. And so are you.’</em></p><p>First and foremost, many thanks to Andrea Dante Bozzola for supporting our threat hunting interest even when it competed with the roadmap we would have liked to see delivered. It goes without saying that the CorpSec team has been of tremendous help, even though it is we, Security Effectiveness, who have institutionalized threat hunting here at Yelp. Finally, we’d like to thank Matteo Piano for reviewing this blog post and for his anime expertise. 
Here is a list of everyone who has been actively contributing to threat hunting since its inception, in no particular order: Matt Carroll, Matteo Piano, Ioana Iliescu, Ramona Tame, Daniel Popa Cristobal, Joey Weate, Andrea Dante Bozzola, Tommy Stallings, Ignacio Rodriguez Paez, Florian Stein.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/bd07a618-9b6f-4920-91c6-99280f1b268d?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2023/02/how-yelps-security-team-does-threat-hunting.html</link>
      <guid>https://engineeringblog.yelp.com/2023/02/how-yelps-security-team-does-threat-hunting.html</guid>
      <pubDate>Mon, 20 Feb 2023 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Rebuilding a Cassandra cluster using Yelp’s Data Pipeline]]></title>
      <description><![CDATA[<p>Robots are frequently used in the manufacturing industry for numerous use cases. One such case is automatically preventing defective products from reaching the finished goods inventory. The same principles can be adopted to filter out malformed data from datastores. This blog post deep dives into how we rebuilt one of our Cassandra (C*) clusters by removing malformed data using Yelp’s Data Pipeline.</p><p><a href="https://cassandra.apache.org/">Apache Cassandra</a> is a distributed wide-column NoSQL datastore and is used at Yelp for storing both primary and derived data. Many different features on Yelp are powered by Cassandra. Yelp orchestrates Cassandra clusters on Kubernetes with the help of operators (explained in <a href="https://engineeringblog.yelp.com/2020/11/orchestrating-cassandra-on-kubernetes-with-operators.html">our Operator Overview post</a>). At Yelp, we tend to use multiple smaller clusters based on the data, traffic and business requirements. This strategy assists in containing the blast radius in case of failure events.</p><p>For us at Yelp, the primary driver for this effort was the discovery of data corruption across multiple nodes inside one of our Cassandra clusters. This corruption was widespread, affecting different tables including those in the <a href="https://docs.datastax.com/en/cql-oss/3.x/cql/cql_using/useQuerySystem.html">system</a> keyspace. The following are some of the events that unfolded as we discovered the issue.</p><ol><li>
<p>Numerous exceptions indicating the corruption started appearing in the Cassandra logs, across multiple nodes in the cluster.</p>
</li>
<li>
<p>Repairs began failing on the Cassandra cluster, which can lead to inconsistencies and data resurrection.</p>
</li>
<li>
<p>The compaction process was seen failing on the Cassandra cluster. Compaction allows <a href="https://cassandra.apache.org/doc/latest/cassandra/architecture/storage_engine.html#sstables">SSTables</a> (Sorted String Tables) to be merged together, leaving fewer SSTables to maintain and hence improving read performance.</p>
</li>
</ol><p>Since the corruption was widespread, removing SSTables and running repairs wasn’t an option as it would have led to data loss. Also, based on corruption size estimates and the value of the recent data, we opted not to restore the cluster to the last corruption-free backup.</p><p>More technical details about the corruption and the initial remediation steps like repairs and data scrubbing are covered in the Appendix. Though those steps didn’t help us fix the issue, they provide vital information about the nature of the corruption.</p><p>In order to mitigate the issue and stop more data from getting corrupted, we decided to rebuild a new cluster by migrating data from the existing cluster.</p><h2 id="overall-strategy">Overall Strategy</h2><p>The overall high-level strategy for rebuilding a new Cassandra cluster to mitigate the issue is quite similar to the sortation systems used for quality checking in the manufacturing industry. Within the industry, automatic sorters installed on conveyors inspect each product and keep defective ones from reaching the finished goods inventory.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-01-30-rebuilding-a-cassandra-cluster-using-yelps-data-pipeline/sortation-system.png" alt="Conceptual Model of Sortation System" /><p class="subtle-text"><small>Conceptual Model of Sortation System</small></p></div><p>Using the same principle, a Data Pipeline was created to rebuild a new Cassandra cluster after eliminating the malformed data, as depicted in the figure below.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-01-30-rebuilding-a-cassandra-cluster-using-yelps-data-pipeline/mitigation-strategy.png" alt="Corruption Mitigation Strategy at a High Level" /><p class="subtle-text"><small>Corruption Mitigation Strategy at a High Level</small></p></div><p>The process extensively relies on the different connectors and pipeline 
tools developed by Yelp’s Data Infrastructure teams. Here’s a quick explanation of the overall dataflow.</p><ul><li>
<p>A new Cassandra cluster, the “Sanitized Cassandra Cluster”, was spun up on Yelp’s modern Kubernetes infrastructure. This allowed the new cluster to benefit from many hardware and software upgrades.</p>
</li>
<li>
<p>The data from the original Cassandra cluster was published into Yelp’s Data Pipeline to create an “Original Data Stream” through Yelp’s Cassandra Source Connector. The Cassandra Source Connector relies on the <a href="https://cassandra.apache.org/doc/latest/cassandra/operating/cdc.html">Change Data Capture (CDC)</a> feature, which was introduced in Cassandra 3.8. More details about the Cassandra Source Connector can be found in the blog post <a href="https://engineeringblog.yelp.com/2019/12/cassandra-source-connector-part-1.html">Streaming Cassandra into Kafka in (near) Real Time</a>.</p>
</li>
<li>
<p>Stream Processors allow transformation of the Data Pipeline streams. This stream processor acts as an “automatic sorter” responsible for keeping the malformed data from reaching the destination. Of the various <a href="https://engineeringblog.yelp.com/2016/08/paastorm-a-streaming-processor.html">stream processors supported</a> by Yelp’s Data Pipeline, Stream SQL was adopted in this case as it allowed writing stream processing applications in a language similar to SQL. While writing the stream processor, a few considerations were required.</p>
<ul><li>
<p><strong>Source and Destination Data Stream identifiers</strong>: The identifiers allow selection of the input &amp; output Data Pipeline topics.</p>
</li>
<li><strong>Sanitization Criteria</strong>: This specifies the valid lists/ranges of values for fields inside the Data Pipeline. Inspecting the data, we figured out that a criterion based on the id &amp; time values could filter out the malformed data. A simple Stream SQL statement sanitizing on a non-negative id and a valid time_created range would look as follows.
<pre>
SELECT
  id, time_created
FROM 
WHERE
id IS NOT NULL
AND id &gt;= 0
AND time_created IS NOT NULL
AND TIMESTAMPDIFF(DAY, CURRENT_TIMESTAMP, time_created) &lt;= 1
AND TIMESTAMPDIFF(YEAR, CAST('2000-01-01 00:00:00' AS TIMESTAMP),
                 time_created) &gt;= 0;</pre></li>
<li><strong>Malformed Stream Criteria</strong>: This allows creation of a data stream containing all the malformed data. That can simply be created by inverting the sanitization stream SQL statement.
<pre>
SELECT
 id, time_created
FROM 
WHERE <strong>NOT(</strong>
id IS NOT NULL
AND id &gt;= 0
AND time_created IS NOT NULL
AND TIMESTAMPDIFF(DAY, CURRENT_TIMESTAMP, time_created) &lt;= 1
AND TIMESTAMPDIFF(YEAR, CAST('2000-01-01 00:00:00' AS TIMESTAMP),
                time_created) &gt;= 0
<strong>)</strong>;</pre></li>
</ul></li>
<li>
<p>The data from the sanitized data stream was ingested into the Sanitized Cassandra Cluster through Yelp’s Cassandra Sink Connector.</p>
</li>
<li>
<p>The data from the malformed data stream was further analyzed to discover</p>
<ul><li>whether the corruption is legit</li>
<li>what percentage of data got corrupted</li>
<li>whether there is a possibility of extracting useful information from it</li>
</ul></li>
</ul><h2 id="data-validation">Data Validation</h2><p>Like any other data migration project, validation of data was of utmost importance. A couple of steps were used for data validation, which ultimately verified the above strategy.</p><h3 id="validation-using-random-sampling">Validation using Random Sampling</h3><p>This is perhaps the most common strategy for validating a data migration, analogous to Quality Control inspections of finished products in manufacturing industries. A random subset of the migrated data was selected, and a value comparison for all the columns was done between the <em>Original Cassandra Cluster</em> and the <em>Sanitized Cassandra Cluster</em>.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-01-30-rebuilding-a-cassandra-cluster-using-yelps-data-pipeline/validation-random-sampling.png" alt="Data Validation using Random Sampling" /><p class="subtle-text"><small>Data Validation using Random Sampling</small></p></div><p>Since this is a statistical sampling technique, the confidence level greatly depends upon the sample size. Cochran’s equation helped us estimate a sample size, since the data residing inside the Cassandra tables was sufficiently large.</p>\[n = Z^2 p (1-p) / e^2\]<p>where n is the sample size; Z is the z-score for the desired confidence interval, chosen as 1.96 for a 95% confidence interval; p(1-p) determines the degree of variability, with p chosen as 0.5 for maximum variability; and e is the sampling error, set at 5%.</p><p>The total number of partitions randomly sampled was 400 per table (&gt;385 from Cochran’s equation). One of our tables holds around 162 GB of data spread across approximately 7.2 million partitions.</p><h3 id="validation-using-comparison-tee">Validation using Comparison Tee</h3><p>The Database Reliability Engineering team at Yelp uses a proxy for our Cassandra datastores in order to isolate the infrastructure complexity from the developers. 
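</p><p>As a quick sanity check on the sampling section above, Cochran’s equation with the stated values (Z = 1.96, p = 0.5, e = 0.05) can be verified in a couple of lines of Python:</p>

```python
import math

def cochran_sample_size(z, p, e):
    """Cochran's equation: n = Z^2 * p * (1 - p) / e^2."""
    return z ** 2 * p * (1 - p) / e ** 2

n = cochran_sample_size(z=1.96, p=0.5, e=0.05)
print(round(n, 2))   # 384.16
print(math.ceil(n))  # 385 -> sampling 400 partitions is comfortably above this
```

<p>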
The proxy supports a few different wrappers, with Tee being particularly relevant here.</p><p>Until this point, the traffic was still being served by the Original Cassandra Cluster. The Teeing feature allowed us to do further verification from the client-request perspective. The conceptual model of Teeing is depicted in the figure below.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-01-30-rebuilding-a-cassandra-cluster-using-yelps-data-pipeline/validation-comparison-tee.png" alt="Data Validation using Comparison Tee" /><p class="subtle-text"><small>Data Validation using Comparison Tee</small></p></div><p>Here is a brief explanation of the model.</p><ul><li>A fraction of read requests were sent to both the Original Cassandra Cluster and the Sanitized Cassandra Cluster before switching the traffic to the sanitized cluster.</li>
<li>Comparison was done on the responses observed from both the clusters, and the comparison results were logged.</li>
<li>Response from the Original Cassandra Cluster was sent back to the requesting client.</li>
<li>Offline Analysis of logged comparison results led to Data validation between the two clusters.</li>
</ul><p>An example client performing Comparison Tee for keyspace <strong>kspace</strong> would look like:</p><pre>
original_client = DataClient(cluster="original_cassandra_cluster")
sanitized_client = DataClient(cluster="sanitized_cassandra_cluster")

def compare_results(main_result, tee_result):
    if main_result != tee_result:
        return {"original": main_result, "sanitized": tee_result}
    return {}

teed_client = ComparisonTee(
    client=original_client,
    tee_client=sanitized_client,
    comparison_fn=compare_results,
)
</pre><h2 id="switching-traffic">Switching Traffic</h2><p>The total amount of corruption observed in the cluster was estimated at roughly 0.009% of the total data. Once the data was completely validated, the traffic was switched from the faulty <em>Original Cassandra Cluster</em> to the <em>Sanitized Cassandra Cluster</em>. The <em>Original Cassandra Cluster</em> was torn down after all traffic had been moved. This allowed a seamless transition with zero downtime and without any visible effect on the user experience.</p><p>The project not only allowed us to rebuild the cluster with sanitized data, but also enabled us to move our cluster to an improved infrastructure with zero downtime. There were quite a few learnings from this project.</p><ul><li>
<p>It is important to have validation plans at each stage (and if possible multiple validation criteria) when carrying out a complex data movement.</p>
</li>
<li>
<p>Cassandra logs provide great insight into the database operations being performed. This includes information about any uncaught exceptions, garbage collector, cluster topology, compaction, repairs etc. Any anomaly observed inside the logs can be pretty useful for debugging errors or performance issues. From an operational perspective, it’s better to create alerts for any new uncaught exceptions and analyze them as they happen.</p>
</li>
<li>
<p>Repairs are essential for a guaranteed data consistency on a Cassandra cluster in case one of the data nodes goes down for an extended duration (greater than <em><a href="https://cassandra.apache.org/doc/3.11/cassandra/operating/hints.html">max_hint_window_in_ms</a></em>). Absence of periodic repairs on a Cassandra cluster can lead to data integrity issues. However, running repairs on an unhealthy, broken or corrupted cluster is <a href="https://www.datastax.com/blog/interpreting-cassandra-repair-logs-and-leveraging-opscenter-repair-service">not recommended</a> and is likely going to make things worse.</p>
</li>
</ul><p>There is so much more to write here with respect to the learnings - Data Pipeline infrastructure tools, datastore connectors, Scribe Log Streams, CI/CD pipelines for Cassandra deployments, and more. If you are interested in learning more about these, what better way is there than to come and work with us?</p><ul><li>Thanks to Adel Atallah, Michael Persinger, Toby Cole and Sirisha Vanteru, who assisted at various stages of the design and implementation of the project.</li>
<li>The authors would like to thank the Database Reliability Engineering team at Yelp for various contributions in handling the issue.</li>
</ul><h2 id="data-corruption-overview">Data Corruption Overview</h2><p>The corruption was detected when engineers observed exceptions of the following form in the Cassandra <a href="https://cassandra.apache.org/doc/latest/cassandra/troubleshooting/reading_logs.html#system-log">system.log</a> file in one of the clusters.</p><pre>
Last written key DecoratedKey(X) &gt;= current key DecoratedKey(Y)
</pre><p>This Cassandra cluster was still on our old <a href="https://aws.amazon.com/ec2/">AWS EC2</a>-based infrastructure as described in our <a href="https://engineeringblog.yelp.com/2020/11/orchestrating-cassandra-on-kubernetes-with-operators.html">Operator overview post</a>. Along with the above exception, the engineers also observed the Cassandra process crashing on a few nodes in the same cluster while trying to deserialize <a href="https://cassandra.apache.org/doc/latest/cassandra/architecture/storage_engine.html#commit-log">CommitLog Mutations</a>. A mutation is synonymous with a database write, since it changes the data inside the database. Exceptions of the following form were observed in the Cassandra logs.</p><pre>
org.apache.cassandra.serializers.MarshalException: String didn't validate
</pre><p><a href="https://cassandra.apache.org/doc/latest/cassandra/operating/repair.html">Repairs</a> are required to guarantee data consistency on a Cassandra cluster in case one of the data nodes goes down. At Yelp, we run periodic repairs on Cassandra clusters to fix any data inconsistencies. However, following this issue, engineers observed that the repairs started to fail on the above cluster, and actually caused the “Last written key” exception to spread to all the nodes inside that cluster. The cluster contained two data centers, each with a replication factor of 3. Even though there wasn’t any observable impact, thanks to the necessary replication and validation safeguards, the exceptions still required further analysis from an operational perspective. An immediate action was taken to stop the repairs from running for this cluster.</p><p>The investigation around the exception revealed that at least one of the <a href="https://cassandra.apache.org/doc/latest/cassandra/architecture/storage_engine.html#sstables">SSTable</a> (Sorted String Table) rows was unordered, which caused the compaction operation to fail. SSTables are immutable files that are always sorted by the primary key. This indicated a corruption event inside the Cassandra SSTables. These SSTable corruptions were observed for different tables, including the tables in the <a href="https://docs.datastax.com/en/cql-oss/3.3/cql/cql_using/useQuerySystem.html">system</a> keyspace, across multiple nodes in that cluster, indicating distributed corruption. 
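</p><p>The invariant behind the “Last written key” exception is simply that rows within an SSTable must appear in strictly increasing key order. A toy check (illustrative only, not Cassandra’s actual code) makes the failure mode concrete:</p>

```python
def find_ordering_violation(keys):
    """Return the index of the first key that breaks the sorted-order
    invariant (mimicking the "last written key >= current key" check),
    or None if the sequence is properly ordered."""
    last = None
    for i, key in enumerate(keys):
        if last is not None and last >= key:
            return i  # corruption: keys must be strictly increasing
        last = key
    return None

print(find_ordering_violation(["a", "b", "d"]))  # None: properly sorted
print(find_ordering_violation(["a", "d", "b"]))  # 2: "d" >= "b"
```

<p>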
This means that using full table scans on user keyspaces via a batch processing framework like <a href="https://www.datastax.com/blog/kindling-introduction-spark-cassandra-part-1">Spark</a> wouldn’t completely solve the problem, as the corruptions would still persist in the system keyspaces.</p><p>Since the SSTable corruption was widespread across all the nodes inside the cluster, removing the SSTables and running the repairs wasn’t an option, as this would lead to data loss.</p><p>Restoring the cluster from the periodic backups was another open option for us. However, this involved a trade-off: losing recent data inserted after the last corruption-free backup. A quick impact analysis revealed that it was more valuable to retain the recent data than the old, corrupted data.</p><h2 id="scrubbing-sstables">Scrubbing SSTables</h2><p>The data scrubbing process is used as a data cleansing step and aims to remove invalid data from the database. With Cassandra, we had 2 options for running the scrubbing process.</p><ol><li>Online Scrubbing</li>
<li>Offline Scrubbing</li>
</ol><h3 id="online-scrubbing">Online Scrubbing</h3><p>Online scrubbing can be invoked using either the <a href="https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/tools/toolsScrub.html">nodetool scrub</a> or <a href="https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/tools/toolsUpgradeSstables.html">nodetool upgradesstables</a> command, with the latter being recommended. Since the online scrubbing process is much slower than the offline one, we opted for offline scrubbing.</p><h3 id="offline-scrubbing">Offline Scrubbing</h3><p>Offline scrubbing can be performed with the open-source tool <a href="https://cassandra.apache.org/doc/latest/cassandra/tools/sstable/sstablescrub.html">sstablescrub</a>, which ships with Cassandra. We stopped the Cassandra node gracefully after running <em>nodetool drain</em>, as this is a prerequisite for running <em>sstablescrub</em>. The data for keyspace <strong>kspace</strong> &amp; table <strong>table</strong> can be scrubbed as follows.</p><p><code class="language-plaintext highlighter-rouge">sstablescrub kspace table</code></p><p>However, the offline scrubbing process failed, and the following logs were observed in the output.</p><div class="language-plaintext highlighter-rouge highlight"><pre>WARNING: Out of order rows found in partition:
</pre></div><div class="language-plaintext highlighter-rouge highlight"><pre>WARNING: Error reading row (stacktrace follows):
WARNING: Row starting at position 491772 is unreadable; skipping to next
........
WARNING: Unable to recover 7 rows that were skipped. You can attempt manual recovery from the pre-scrub snapshot. You can also run nodetool repair to transfer the data from a healthy replica, if any
</pre></div><div class="language-plaintext highlighter-rouge highlight"><pre>WARNING: Row starting at position 22560156 is unreadable; skipping to next
null
Exception in thread "main" java.lang.AssertionError
        at org.apache.cassandra.io.compress.CompressionMetadata$Chunk.&lt;init&gt;(CompressionMetadata.java:474)
        at org.apache.cassandra.io.compress.CompressionMetadata.chunkFor(CompressionMetadata.java:239)
        at org.apache.cassandra.io.util.MmappedRegions.updateState(MmappedRegions.java:163)
        at org.apache.cassandra.io.util.MmappedRegions.&lt;init&gt;(MmappedRegions.java:73)
        at org.apache.cassandra.io.util.MmappedRegions.&lt;init&gt;(MmappedRegions.java:61)
        at org.apache.cassandra.io.util.MmappedRegions.map(MmappedRegions.java:104)
        at org.apache.cassandra.io.util.FileHandle$Builder.complete(FileHandle.java:362)
        at org.apache.cassandra.io.util.FileHandle$Builder.complete(FileHandle.java:331)
        at org.apache.cassandra.io.sstable.format.big.BigTableWriter.openFinal(BigTableWriter.java:336)
        at org.apache.cassandra.io.sstable.format.big.BigTableWriter.openFinalEarly(BigTableWriter.java:318)
        at org.apache.cassandra.io.sstable.SSTableRewriter.switchWriter(SSTableRewriter.java:322)
        at org.apache.cassandra.io.sstable.SSTableRewriter.doPrepare(SSTableRewriter.java:370)
        at org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.prepareToCommit(Transactional.java:173)
        at org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.finish(Transactional.java:184)
        at org.apache.cassandra.io.sstable.SSTableRewriter.finish(SSTableRewriter.java:357)
        at org.apache.cassandra.db.compaction.Scrubber.scrub(Scrubber.java:291)
        at org.apache.cassandra.tools.StandaloneScrubber.main(StandaloneScrubber.java:134)
</pre></div><p>As a result, the corrupted rows inside the SSTables could not be completely removed.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/bd07a618-9b6f-4920-91c6-99280f1b268d?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2023/01/rebuilding-a-cassandra-cluster-using-yelps-data-pipeline.html</link>
      <guid>https://engineeringblog.yelp.com/2023/01/rebuilding-a-cassandra-cluster-using-yelps-data-pipeline.html</guid>
      <pubDate>Mon, 30 Jan 2023 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Recycling Kubernetes Nodes]]></title>
<description><![CDATA[<p><em>Manually managing the lifecycle of Kubernetes nodes can become difficult as the cluster scales, especially if your clusters are multi-tenant and self-managed. You may need to replace nodes for various reasons, such as OS upgrades and security patches. One of the biggest challenges is how to terminate nodes without disturbing tenants. In this post, I’ll describe the problems we encountered administering Yelp’s clusters and the solutions we implemented.</em></p><p>At Yelp we use <a href="https://github.com/Yelp/paasta">PaaSTA</a> for building, deploying and running services. Initially, PaaSTA just supported stateless services. This meant it was relatively easy to replace nodes, since we only needed to gracefully remove the pods from our service mesh on shutdown. However, this could still result in services with fewer replicas than expected. We now run many diverse workloads in our clusters including stateful services, batch jobs and pipeline tasks. Some workloads run on private pools (groups of nodes) but many workloads run in shared pools. At Yelp, we use <a href="https://github.com/Yelp/clusterman">Clusterman</a> to manage our Kubernetes pools. Clusterman is an open source autoscaling engine that we initially wrote to scale our Mesos clusters and subsequently adapted to support Kubernetes.</p><p>There are many challenges in multi-tenant clusters since tenants and cluster administrators often work on different teams (and maybe in different time zones). Cluster administrators often need to perform maintenance on their clusters, including the replacement of nodes for security fixes, OS upgrades, or other tasks. Given the diversity of workloads running on the clusters, it’s very difficult for administrators to do so without working closely with the workload owners to ensure that pods are terminated and replaced safely. This can also be difficult in Yelp’s distributed, asynchronous work environment. 
Maintenance can take a long time given the diverse set of workloads and the large size of the clusters. Additionally, manual work is error-prone: a human might mistakenly delete the wrong node or pod! We decided to tackle the problem in two parts:</p><ol><li>Protecting workloads from disruptions.</li>
<li>Node replacement automation.</li>
</ol><h2 id="1-protecting-workloads-from-disruptions">1. Protecting workloads from disruptions</h2><p>A good place to start is the Kubernetes documentation on <a href="https://kubernetes.io/docs/concepts/workloads/pods/disruptions/">disruptions</a>. There are two types of disruptions:</p><ul><li>Voluntary (by cluster admin): draining a node for an upgrade or a scaling-down</li>
<li>Involuntary: hardware failures, kernel panic, network partition, etc.</li>
</ul><p>We will focus on voluntary disruptions in our case. <a href="https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-budgets">Pod Disruption Budget</a> (PDB) is the industry standard to protect Kubernetes workloads from <a href="https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#voluntary-and-involuntary-disruptions">voluntary disruptions</a>. “As an application owner, you can create a PDB for each application. A PDB limits the number of pods of a replicated application that are down simultaneously from voluntary disruptions. For example, a quorum-based application would like to ensure that the number of replicas running is never brought below the number needed for a quorum. A web front-end might want to ensure that the number of replicas serving load never falls below a certain percentage of the total.” (Kubernetes, <a href="https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-budgets">Pod Disruption Budget</a>). At Yelp we have some sensitive workloads like <a href="https://engineeringblog.yelp.com/2021/09/nrtsearch-yelps-fast-scalable-and-cost-effective-search-engine.html">Nrtsearch</a> and <a href="https://engineeringblog.yelp.com/2020/11/orchestrating-cassandra-on-kubernetes-with-operators.html">Cassandra</a> where we don’t want to disrupt more than one pod at a time in each cluster.</p><p>If you have bare pods (without a controller) in your cluster, you should be aware of some <a href="https://kubernetes.io/docs/tasks/run-application/configure-pdb/#arbitrary-controllers-and-selectors">limitations</a> to using PDBs. Specifically, you cannot use the maxUnavailable and percentage fields.</p><p>Besides PDBs, we also evaluated some alternative ways to prevent voluntary disruptions. 
For example, we considered using a <a href="https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#validatingadmissionwebhook">Validating Admission Webhook</a> and <a href="https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#container-hooks">PreStop Hooks</a> to protect workloads, but we decided to continue with PDBs since they were designed for exactly this use case.</p><h2 id="2-node-replacement-automation">2. Node replacement automation</h2><p>Once we defined PDBs for all the applications running on our clusters, we moved on to thinking about the automation needed to replace nodes. We chose to add features to Clusterman to manage node replacement. Before getting into the solution, it is helpful to know a little about Clusterman’s internal components.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-01-05-recycling-kubernetes-nodes/clusterman-components.png" alt="Clusterman components" /><p class="subtle-text"><small>Clusterman components</small></p></div><ul><li>Metrics Data Store: All relevant data used by scaling signals is written to a single data store for a single source of truth about historical cluster state. At Yelp, we use AWS DynamoDB for this datastore. Metrics are written to the datastore via a separate metrics library.</li>
<li>Pluggable Signals: Metrics (from the data store) are consumed by signals (small bits of code that are used to produce resource requests). Signals run in separate processes configured by <a href="http://supervisord.org/">supervisord</a>, and use Unix sockets to communicate.</li>
<li>Core Autoscaler: The autoscaler logic consumes resource requests from the signals and combines them to determine how much to scale up or down via the cloud provider.</li>
</ul><p>We added two more components to solve the node replacement problem: Drainer and Node Migration Batch</p><p><strong>Drainer</strong></p><p>The <a href="https://clusterman.readthedocs.io/en/latest/drainer.html">Drainer</a> is the component which drains pods from the node before terminating. It may drain and terminate nodes for three reasons:</p><ul><li>Spot instance interruptions</li>
<li>Node migrations</li>
<li>The autoscaler scaling down</li>
</ul><p>The Drainer uses <a href="https://kubernetes.io/docs/concepts/scheduling-eviction/api-eviction/">API-initiated eviction</a> for node migrations and scaling down. API-initiated eviction is the process by which you use the <a href="https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.25/#create-eviction-pod-v1-core">Eviction API</a> to create an Eviction object that triggers graceful pod termination. Crucially, API-initiated evictions respect your configured <a href="https://kubernetes.io/docs/tasks/run-application/configure-pdb/">PodDisruptionBudgets</a> and <a href="https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#pod-termination">terminationGracePeriodSeconds</a>.</p><p>The Drainer <a href="https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration">taints</a> nodes as a first step to prevent the Kubernetes Scheduler from scheduling new pods onto the draining node. Then it tries to evict the pods periodically until the node is empty. After evicting all pods, it deletes the node and terminates the instance. In cases where we have defined PDBs that are very strict or there is not much spare capacity in the pool, this can take a long time. We’ve added a user-configurable threshold to prevent nodes from draining for too long (or indefinitely). Once that threshold is reached, the Drainer will forcibly delete or un-taint the node, depending on the uptime requirements of the workloads running in that pool.</p><p><strong>Node Migration</strong></p><p>The <a href="https://clusterman.readthedocs.io/en/latest/node_migration.html">Node Migration</a> batch allows Clusterman to replace nodes in a pool according to various criteria. This automates the process of replacing nodes running software with security vulnerabilities, upgrading the kernel we run, or upgrading the whole operating system to newer versions. 
It chooses which nodes to replace and sends them to the Drainer to terminate gracefully, continuously monitoring the pool capacity to ensure we don’t impact the availability of workloads running on the cluster.</p><p>We’ve created a <a href="https://clusterman.readthedocs.io/en/latest/node_migration.html#migration-event-trigger">NodeMigration</a> <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/#customresourcedefinitions">Custom Resource</a> to specify migration requirements. We can request a migration based on kernel version, OS version, instance type and uptime. For instance, the target of the following manifest is to keep node uptime below 30 days:</p><div class="language-plaintext highlighter-rouge highlight"><pre>apiVersion: "clusterman.yelp.com/v1"
kind: NodeMigration
metadata:
 name: my-test-migration-220912
 labels:
   clusterman.yelp.com/migration_status: pending
spec:
 cluster: mycluster
 pool: default
 condition:
   trait: uptime
   operator: lt
   target: 30d
</pre></div><h2 id="conclusion">Conclusion</h2><p>Finally, the high-level design of our new system can be described as follows.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-01-05-recycling-kubernetes-nodes/high-level-design.png" alt="High level design of the system" /><p class="subtle-text"><small>High level design of the system</small></p></div><p>Now that we have this system running, we can more easily deploy new versions of Ubuntu and keep nodes fresh. We can create migration manifests using a <a href="https://clusterman.readthedocs.io/en/latest/node_migration.html#migration-event-trigger">CLI</a> tool, and Clusterman will gradually replace all the instances whilst ensuring that the workloads are not disrupted and that new nodes are running correctly.</p><h2 id="acknowledgements">Acknowledgements</h2><p>This was a cross-team project between Yelp’s Infrastructure and Security teams. Many thanks to Matteo Piano for leading the project, and to the many teams at Yelp that contributed to making the new system a success. We want to thank Compute Infra, Security Effectiveness and all the other teams that contributed by creating PDBs. Additionally, thanks to Matthew Mead-Briggs and Andrea Dante Bozzola for their managerial support.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/bd07a618-9b6f-4920-91c6-99280f1b268d?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2023/01/recycling-kubernetes-nodes.html</link>
      <guid>https://engineeringblog.yelp.com/2023/01/recycling-kubernetes-nodes.html</guid>
      <pubDate>Thu, 05 Jan 2023 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Lessons from A/B Testing on Bandit Subjects]]></title>
<description><![CDATA[<p><strong>Abstract</strong>   <em>Compared to full-scale ML, multi-armed bandits are a lighter-weight solution that can help teams quickly optimize their product features without major commitments. However, bandits need a candidate selection step when they have too many items to choose from. Using A/B testing to optimize the candidate selection step causes new bandit bias and convergence selection bias. New bandit bias occurs when we try to compare new bandits with established ones in an experiment; convergence selection bias creeps in when we try to solve the new bandit bias by defining and selecting established bandits. We discuss our strategies to mitigate the impacts of these two biases.</em></p><p>We have many multi-armed bandits running at Yelp. They help us select the best content to show on our webpage, choose the optimal ad rendering format on our app, and pick the right channel and timing to reach our users and business owners.</p><p>We typically use the <a href="https://en.wikipedia.org/wiki/Thompson_sampling">Thompson Sampling</a> method. Thompson Sampling is a Bayesian method that combines the domain knowledge we have via prior distributions and the real-world observations we collected for each arm. It is easy to understand for broader audiences and simple to implement. It also introduces noise throughout the day even though our bandits are typically updated nightly. Research has shown that it performs better in the real world than its alternatives (Chapelle and Li 2011).</p><p>Compared to machine learning (ML) models or ML-based contextual bandits, simple multi-armed bandits<sup><a href="https://engineeringblog.yelp.com/2022/12/lessons-from-ab-testing-on-bandit-subjects.html#footnote1">1</a></sup> (bandits henceforth) have several important infrastructural and logistical advantages:</p><ol><li>Code light: our bandit implementation is a Python function with only a couple of lines. 
At serving time, user teams only need to pass in the prior distribution and the real observations of each arm as a dictionary.</li>
<li>Setup light: compared to <a href="https://engineeringblog.yelp.com/2020/07/ML-platform-overview.html">serving a model</a>, bandits do not require a separate service call to make predictions. Typically user teams only need to set up a <a href="https://engineeringblog.yelp.com/2020/11/orchestrating-cassandra-on-kubernetes-with-operators.html">Cassandra table</a> that stores past observations. Past observations can be computed via a nightly batch and piped into the aforementioned Cassandra table.</li>
<li>Resource light: unlike models, bandits do not require features to learn. This means the product owner does not need to staff a sizable team building a feature engineering pipeline and researching the model architecture.</li>
<li>Maintenance light: bandits do not need heavy monitoring and alerting because they have no complex dependencies. By design, bandits balance exploration and exploitation gracefully. With an appropriate data retention window, bandits can also handle data drift without human intervention. From our experience, the on-call person only needs to ensure the bandits are updated correctly, which is typically a light task.</li>
</ol><p>Because of these advantages, bandits are a sweet spot for many teams to try out before they fully commit to ML. For some applications, the bandit performance may be good enough that teams choose to stay in the bandit world.</p><h2 id="a-seemingly-minor-drawback">A seemingly minor drawback</h2><p>As with all the good things in life, bandits do not come without drawbacks. One drawback we face is the difficulty of handling too many items (the curse of dimensionality). With too many arms, exploration requires too much data and takes too long to be practical.</p><p>A common practice to mitigate this issue is performing a candidate selection step and sending only top results to the bandit. The candidate selection step can be anything from a simple heuristic or rule-based formula to a simple model, or a hybrid of these. We only require it to be mostly stable day to day so that the bandit’s historical learnings are still useful today. Because of such freedom, a lot of work can be done to optimize the candidate selection step.</p><p>This seemingly innocuous candidate selection step causes many challenges when it comes to A/B testing different candidate selection models. To show this point, let’s use advertising photo selection as a concrete example.</p><p>When advertisers choose “<a href="https://blog.yelp.com/businesses/getting-started-with-yelp-ads/">Let Yelp optimize</a>” for their advertising photos, we test different photos and learn which one gets the most clicks. Under the hood, this is achieved by a bandit system. In particular, each pull is an impression while each success is a click. We use the standard Beta-Bernoulli Bandit with K arms (K is a small fixed number). 
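<p>To make this concrete, here is a minimal sketch of one Thompson Sampling decision for a Beta-Bernoulli bandit. The arm counts and the uniform Beta(1, 1) priors below are hypothetical, and this is an illustration of the technique rather than Yelp’s actual implementation.</p>

```python
import random

def choose_arm(arms):
    """Pick an arm via Thompson Sampling for a Beta-Bernoulli bandit.

    `arms` maps arm id -> (prior_alpha, prior_beta, clicks, impressions).
    Draw one sample from each arm's Beta posterior and play the arm
    with the largest sample.
    """
    best_arm, best_sample = None, -1.0
    for arm_id, (alpha, beta, clicks, impressions) in arms.items():
        failures = impressions - clicks
        sample = random.betavariate(alpha + clicks, beta + failures)
        if sample > best_sample:
            best_arm, best_sample = arm_id, sample
    return best_arm

# Hypothetical counts: photo 2 has a much higher observed CTR, so it is
# sampled most often, while the other photos still get occasional traffic.
arms = {1: (1, 1, 1, 400), 2: (1, 1, 50, 400), 3: (1, 1, 4, 400)}
chosen_photo = choose_arm(arms)
```

<p>Serving stays cheap because the only per-arm state needed is its click and impression counts, which is why a nightly batch feeding a Cassandra table is sufficient.</p>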
Because many advertisers simply have too many high-quality images for the bandits to learn within a reasonable time window, we have a candidate selection step before the bandit.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-12-21-lessons-from-ab-testing-on-bandit-subjects/fig1-pipeline.png" alt="A high level summary of Yelp’s advertising photo selection pipeline" /><p class="subtle-text"><small>A high level summary of Yelp’s advertising photo selection pipeline</small></p></div><p>For illustration purposes, let’s assume the status quo candidate selection method is a rule-based formula while the challenger is a lightweight model trained on some pre-computed image embeddings. Because these two approaches are quite different, they typically produce distinct top K images.</p><p>To verify the new model selects better performing candidates, we set up an A/B experiment diverted by <code class="language-plaintext highlighter-rouge">advertiser_id</code>. If we stop here and naively run this experiment as is, we may reach a false conclusion caused by the new bandit bias.</p><h2 id="new-bandit-bias">New bandit bias</h2><p>Let’s examine the following mocked-up example. In this example, the top 3 photos produced by the status quo rule are (1, 2, 3) while the top 3 photos produced by the new model are (4, 5, 6). Let’s assume that the true click-through rates (CTR) of 1, 2, 3 are 0.2%, 1.5%, and 1.0% while the true CTRs of 4, 5, 6 are 0.3%, 2.0%, 1.0% respectively. So the new model is superior by construction.</p><p>However, because the bandit has no data about (4, 5, 6), it has to start from scratch. In particular, at the beginning of the experiment, the bandit will evenly allocate impressions to all three. On the contrary, the bandit in the status quo cohort has figured out photo 2 is the best among (1, 2, 3) and most traffic is allocated to photo 2 already. The following table shows a possible scenario on day 1 of the experiment. 
Notice on day 1 the CTR of the status quo group is 1.3% but the treatment group is only 0.9%. The bandit will eventually figure out photo 5 is a better performing one and allocate more traffic to it. But until then, the treatment will continue to underperform.</p><table><thead><tr><th>photo_id</th>
<th>True CTR</th>
<th>Cohort</th>
<th>Day 1 impressions</th>
<th>Day 1 clicks</th>
<th>Observed CTR</th>
</tr></thead><tbody><tr><td>1</td>
<td>0.2%</td>
<td>Status Quo</td>
<td>40</td>
<td>0</td>
<td>1.3%</td>
</tr><tr><td>2</td>
<td>1.5%</td>
<td> </td>
<td>356</td>
<td>5</td>
<td> </td>
</tr><tr><td>3</td>
<td>1.0%</td>
<td> </td>
<td>61</td>
<td>1</td>
<td> </td>
</tr><tr><td>4</td>
<td>0.3%</td>
<td>Treatment</td>
<td>161</td>
<td>0</td>
<td>0.9%</td>
</tr><tr><td>5</td>
<td>2.0%</td>
<td> </td>
<td>149</td>
<td>3</td>
<td> </td>
</tr><tr><td>6</td>
<td>1.0%</td>
<td> </td>
<td>147</td>
<td>1</td>
<td> </td>
</tr></tbody></table><p>What if we wipe out the bandit history in the status quo group as well before the experiment? This indeed is the cleanest way to compare the two groups, but we will nuke the performance of the whole system, which typically is not acceptable from a business perspective.</p><p>What if we remove bandits from the equation during experimentation since we’re comparing the candidate selection methods? This idea does not work. The treatment is the middle step of the system but the success metric is defined only after the bandit does its magic. Because we don’t know which photo will give us the highest CTR a priori, we cannot remove the step that is designed to find the highest CTR. In other words, our experimentation subject has to be the whole system, bandit included.</p><p>Some bandits will have more data and hence learn faster than others. In practice, we typically observe a big performance plunge from the treatment group at the beginning but it will be gradually improving throughout the experiment. The real difficulty is to tell when the new bandit bias is small enough such that we can attribute the difference between treatment and control groups to our new model.</p><p>In summary, the first lesson we learned is that bandits need to be converged to be comparable. So we came up with a definition of convergence such that when a bandit is declared converged, it won’t cause major new bandit bias.</p><h2 id="the-80-80-rule-of-convergence">The 80-80 rule of convergence</h2><p>Intuitively, if a bandit is considered converged, it must be done with exploration and be mainly working on exploitation. We believe this intuition can be further broken down into two subdimensions:</p><ol><li>If there’s a clear best performing arm, then the bandit has found it.</li>
<li>If the bandit can’t distinguish multiple arms, then the bandit must have enough evidence to show they have similar enough performance.</li>
</ol><p>Notice for the bandit to move on to exploitation, it does not need to exactly pinpoint the performance of each arm. For worse performers, knowing “they are worse” is enough.</p><p>Inspired by the Upper Confidence Bound algorithm, we use confidence intervals<sup><a href="https://engineeringblog.yelp.com/2022/12/lessons-from-ab-testing-on-bandit-subjects.html#footnote2">2</a></sup> (CI) of posterior distributions to define convergence. Our definition of convergence for advertising photo selection is as follows. Note that this definition is not necessarily appropriate for your case. But you can use it as an inspiration.</p><ol><li>Compute the 80% CI of the posterior distribution for each arm.</li>
<li>Apply the merge interval algorithm (see, e.g., <a href="https://leetcode.com/problems/merge-intervals/">LeetCode 56</a>) on 80% CIs. That is, put all arms into one group if their CIs have some overlap. If there is no overlap, then the arm is its own group.</li>
<li>Rank the groups by their posterior means. This ranking is well defined because all groups are separated after the previous step.</li>
<li>[80% CI no overlap] Examine the group with the highest CTR (top group henceforth). If the top group has only one arm for the past 7 days, then we call the bandit converged.</li>
<li>[80% CI width drop] Otherwise, if all CIs in the top group are less than 20% width of the prior distribution’s CI for the past 7 days, then we call the bandit converged.</li>
<li>Once the bandit is considered converged, its data may be used for analysis purposes starting from the next day.</li>
</ol><p>The 80% CI no overlap rule captures the case when there is a clear winner. Based on our experience, once any arm’s 80% CI is separated from others, the underperforming ones stop receiving much traffic even if their performance estimates still contain much uncertainty (a.k.a., knowing “they are worse” is enough).</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-12-21-lessons-from-ab-testing-on-bandit-subjects/fig2-no-overlap.png" alt="The posterior and traffic plot of a newly created bandit that converged under the 80% CI no overlap rule. The solid lines are posterior CTRs of photos and the shaded areas are their corresponding 80% CIs. They are plotted on the linear scale. The dotted lines are impressions the bandit allocated to each arm, in log scale. In the initial phase, the bandit is mostly working on exploration so each arm gets a decent amount of traffic. On day 7, the orange arm’s CI is separated from other arms’ and the other arms only receive about 1-5% of the traffic." /><p class="subtle-text"><small>The posterior and traffic plot of a newly created bandit that converged under the 80% CI no overlap rule. The solid lines are posterior CTRs of photos and the shaded areas are their corresponding 80% CIs. They are plotted on the linear scale. The dotted lines are impressions the bandit allocated to each arm, in log scale. In the initial phase, the bandit is mostly working on exploration so each arm gets a decent amount of traffic. On day 7, the orange arm’s CI is separated from other arms’ and the other arms only receive about 1-5% of the traffic.</small></p></div><p>The 80% CI width drop rule captures the case where the differences between multiple arms are not practically significant. 
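<p>The two rules above can be sketched in a few lines of Python. This is a simplified, single-day illustration rather than the production check: it approximates each arm’s Beta-posterior 80% CI with a normal approximation, and it omits the requirement that either rule hold for 7 consecutive days.</p>

```python
from statistics import NormalDist

def beta_ci(alpha, beta, level=0.80):
    """Approximate central CI of a Beta(alpha, beta) posterior
    via a normal approximation (illustrative only)."""
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    half = NormalDist().inv_cdf(0.5 + level / 2) * var ** 0.5
    return (mean - half, mean + half)

def is_converged(posteriors, prior=(1, 1)):
    """`posteriors` maps arm id -> (alpha, beta) of its Beta posterior."""
    cis = sorted(((arm, beta_ci(a, b)) for arm, (a, b) in posteriors.items()),
                 key=lambda kv: kv[1][0])
    # Merge-interval step: arms whose CIs overlap fall into one group.
    groups = [[cis[0]]]
    for arm, ci in cis[1:]:
        if ci[0] <= max(hi for _, (_, hi) in groups[-1]):
            groups[-1].append((arm, ci))
        else:
            groups.append([(arm, ci)])
    top = groups[-1]  # groups are separated, so the last one has the highest CTRs
    if len(top) == 1:
        return True  # 80% CI no overlap rule
    # 80% CI width drop rule: every CI in the top group must be narrow
    # relative to the prior distribution's CI.
    prior_lo, prior_hi = beta_ci(*prior)
    return all(hi - lo < 0.2 * (prior_hi - prior_lo) for _, (lo, hi) in top)
```

<p>Under this sketch, a bandit with one clearly separated arm is declared converged by the first rule, while a fresh bandit whose arms all have wide, overlapping CIs is not.</p>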
In the width-drop case, bandits will continue to allocate traffic to all arms in the top group, so the CI widths in the top group typically drop quickly.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-12-21-lessons-from-ab-testing-on-bandit-subjects/fig3-width-drop.png" alt="The posterior and traffic plot of a newly created bandit that converged under the 80% CI width drop rule. At day 5, the green &amp; orange arms’ CIs are separated from the blue arm’s. While the bandit stops allocating traffic to the blue arm, the bandit cannot significantly differentiate the green and orange arms so both continue to receive significant traffic." /><p class="subtle-text"><small>The posterior and traffic plot of a newly created bandit that converged under the 80% CI width drop rule. At day 5, the green &amp; orange arms’ CIs are separated from the blue arm’s. While the bandit stops allocating traffic to the blue arm, the bandit cannot significantly differentiate the green and orange arms so both continue to receive significant traffic.</small></p></div><p>Under our definition, the new bandit bias is usually of a smaller magnitude than our usual effect size. Moreover, a usual t-test cannot distinguish between newly converged bandits and bandits that have been fully converged for as long as the data retention period permits.</p><h2 id="convergence-selection-bias">Convergence selection bias</h2><p>Unfortunately, even though it may help reduce the new bandit bias, just applying the definition of convergence introduces another bias: convergence selection bias. To explain this bias, let’s consider the following example.</p><table><thead><tr><th>advertiser_id</th>
<th>Cohort</th>
<th>Observed CTR</th>
<th>Is Converged</th>
<th>Average CTR of Converged Bandits</th>
</tr></thead><tbody><tr><td>1</td>
<td>Status Quo</td>
<td>0.7%</td>
<td>Yes</td>
<td>1.0%</td>
</tr><tr><td>2</td>
<td> </td>
<td>1.4%</td>
<td>Yes</td>
<td> </td>
</tr><tr><td>3</td>
<td> </td>
<td>0.9%</td>
<td>Yes</td>
<td> </td>
</tr><tr><td>4</td>
<td>Treatment</td>
<td>0.8%</td>
<td>Yes</td>
<td>0.9%</td>
</tr><tr><td>5</td>
<td> </td>
<td>1.5%</td>
<td>No</td>
<td> </td>
</tr><tr><td>6</td>
<td> </td>
<td>1.0%</td>
<td>Yes</td>
<td> </td>
</tr></tbody></table><p>This example is constructed so that the treatment has superior performance. Notice all bandits in the status quo cohort are converged because they have been collecting data for a longer period, while only some bandits are converged in the treatment cohort as they are shorter-lived. If we compare the average CTR of converged bandits in the two groups, then we would falsely conclude that the treatment is doing worse.</p><p>You might dismiss this example since we conveniently mark the best-performing bandit as unconverged and remove it from comparison. This is not the case; it is a real concern. If we apply the bandit convergence algorithm, then the converged bandits will typically NOT be representative of the whole population. Converged bandits are associated with more traffic in general, and more traffic is associated with more advertising budget, certain advertiser types, more densely populated geolocations and probably some other unknown factors. Because of these factors, the treatment and control balance no longer holds and we re-introduce confounding into a randomized experiment.</p><p>Formally, we are running into the <a href="https://cpb-us-e1.wpmucdn.com/sites.dartmouth.edu/dist/5/2293/files/2021/03/post-treatment-bias.pdf">selection on post-treatment variables</a> issue. That is, in the analysis, we pick samples based on variables that may be affected by the treatment in some unknown way. Such variables may be correlated with the outcome variable in some unknown way. Therefore, the selected sample for analysis is, in some sense, cherry-picked, which is absolutely not okay in experiment analysis. Moreover, because we have to define convergence based on post-treatment variables, in general we cannot get around the post-treatment selection with <em>any</em> definition of convergence.</p><p>We may frame the convergence selection bias as a form of missing data bias: the data from the unconverged bandits are missing. 
Therefore, we can draw some insights from the missing data literature.</p><p>After a literature review, we concluded that a carefully implemented matched pair design with pairwise deletion can help minimize the bias in this situation. In particular, <a href="https://gking.harvard.edu/files/gking/files/spd.pdf">King et al’s (2007)</a> matched pair design insulates their policy experiment from certain selection biases caused by missing data. <a href="https://www.sas.rochester.edu/psc/polmeth/papers/Fukumoto_2015_Polmeth.pdf">Fukumoto (2015)</a> examined the missing data bias in detail for the matched pair design and found that pairwise deletion has a smaller bias than all the other methods considered. <a href="https://imai.fas.harvard.edu/research/files/mismatch.pdf">Imai and Jiang (2018)</a> developed a sensitivity analysis that provides a bias bound for the matched pair design.</p><p>We combined the recommendations from these papers. Our matched pair design with pairwise deletion and sensitivity analysis goes as follows:</p><ol><li>Compute the feature set with respect to the population of interest. In the advertising photo selection case, we can actually get this step for free because Yelp already has <a href="https://engineeringblog.yelp.com/2020/01/modernizing-ads-targeting-machine-learning-pipeline.html">other ML feature pipelines</a> running for advertisers.</li>
<li>Match subjects as closely as possible. Two advertisers in the same pair should have the same values for categorical variables. For numerical variables, we may match using <a href="https://en.wikipedia.org/wiki/Mahalanobis_distance">Mahalanobis distance</a>. After this step, all observed confounders are accounted for.</li>
<li>Within each pair, randomly apply treatment to one subject and control to the other. This step randomizes over potential unobserved confounders.</li>
<li>Apply pairwise deletion. That is, if both bandits in a pair are judged converged, add them to the analysis pool; otherwise, drop both from the analysis.</li>
<li>In the event of advertiser churn, perform pairwise deletion as well.</li>
<li>Perform sensitivity analysis as in <a href="https://imai.fas.harvard.edu/research/files/mismatch.pdf">Imai and Jiang (2018)</a> Section 2.4, Theorem 3.</li>
</ol><p>Unfortunately, this design is no panacea. First, as stated in Fukumoto (2015), the matched pair design can help reduce the bias, but it cannot guarantee the result is bias free.</p><p>Second, the result we get after pairwise deletion is not an estimate of the average treatment effect. The pairs that have little chance to converge within the experiment window are underrepresented in the final analysis pool, and hence their treatment effects count less. Formally, the pairwise deletion estimand can be interpreted as a weighted average treatment effect, where the weights are the relative ex ante probabilities of convergence. In practice, this means it is difficult to communicate how much effect we can expect if we ship the experiment to the whole population.</p><p>Third, this design is complicated and time consuming to perform, so we do not perform it unless we have to. Because of the new bandit bias, a negative is not necessarily a true negative, but a positive is a true positive. So if a vanilla A/B experiment readout (dropping the early period) already provides a positive finding, we conclude and ship the experiment.</p><h2 id="conclusion">Conclusion</h2><p>Compared to full-scale ML, multi-armed bandits are a lighter-weight solution that can help teams quickly optimize their product features without major commitments. However, because of their inability to handle high cardinality, we have to couple bandits with a candidate selection step. This practice creates two biases whenever we want to improve the candidate selection step: new bandit bias and convergence selection bias.</p><p>Our current recommendation for experimental design has two components: an 80-80 definition of bandit convergence and a matched pair design with pairwise deletion. The former reduces the new bandit bias and the latter minimizes the selection bias. 
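</p><p>To make the pairwise deletion step concrete, here is a minimal sketch (not Yelp’s production code; the field names are hypothetical) of reading out a matched pair experiment:</p>

```python
# Minimal sketch of pairwise deletion in a matched pair design.
# A pair enters the analysis pool only if BOTH of its bandits converged.
from dataclasses import dataclass

@dataclass
class Pair:
    treat_converged: bool
    control_converged: bool
    treat_ctr: float    # outcome for the treatment bandit
    control_ctr: float  # outcome for the control bandit

def pairwise_deletion_effect(pairs):
    """Average within-pair difference over fully converged pairs."""
    kept = [p for p in pairs if p.treat_converged and p.control_converged]
    if not kept:
        return None  # nothing to analyze yet
    return sum(p.treat_ctr - p.control_ctr for p in kept) / len(kept)

pairs = [
    Pair(True, True, 0.012, 0.010),
    Pair(True, False, 0.050, 0.011),  # dropped: control bandit unconverged
    Pair(True, True, 0.014, 0.012),
]
effect = pairwise_deletion_effect(pairs)  # average of the two kept pairs
```

<p>As discussed above, this estimand is a convergence-weighted average treatment effect over the retained pairs, not the population average treatment effect.</p><p>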
Working together, they deliver a successful A/B experiment on bandit subjects.</p><h2 id="references">References</h2><ol><li>Chapelle, Olivier, and Lihong Li. “An empirical evaluation of Thompson sampling.” Advances in neural information processing systems 24 (2011).</li>
<li>Fukumoto, Kentaro. “Missing data under the matched-pair design: a practical guide.” Technical Report, Presented at the 32nd Annual Summer Meeting of Society for Political Methodology, Rochester, 2015.</li>
<li>Imai, Kosuke, and Zhichao Jiang. “A sensitivity analysis for missing outcomes due to truncation by death under the matched‐pairs design.” Statistics in medicine 37, no. 20 (2018): 2907-2922.</li>
<li>King, Gary, Emmanuela Gakidou, Nirmala Ravishankar, Ryan T. Moore, Jason Lakin, Manett Vargas, Martha María Téllez‐Rojo, Juan Eugenio Hernández Ávila, Mauricio Hernández Ávila, and Héctor Hernández Llamas. “A “politically robust” experimental design for public policy evaluation, with application to the Mexican universal health insurance program.” Journal of Policy Analysis and Management 26, no. 3 (2007): 479-506.</li>
</ol><h2 id="acknowledgements">Acknowledgements</h2><p>The content of this blog is a multi-year effort and we have lost track of all our talented colleagues who have contributed to this problem space. An incomplete list of contributors (with a lot of recency bias) is: Wesley Baugh, Sam Edds, Vincent Kubala, Kevin Liu, Christine Luu, Alexandra Miltsin, Alec Mori, Sonny Peng, Yang Song, Vishnu Sreenivasan Purushothaman, Jenny Yu. I also thank Marcio Cantarino O’Dwyer for reviewing and helpful suggestions.</p><h3 id="notes">Notes</h3><p><a name="footnote1" id="footnote1">1</a>: Multiple simple bandits and a lookup table based finite state contextual bandit are equivalent. For example, if we set up a simple bandit for each advertiser, it is equivalent to setting up a contextual bandit with the context vector being onehot(advertiser_id). On the contrary, having a lookup table based contextual bandit is equivalent to setting up one multi-armed bandit per state. Therefore, we do not distinguish them in this blog post.</p><p><a name="footnote2" id="footnote2">2</a>: Technically, we should use the term <a href="https://en.wikipedia.org/wiki/Credible_interval">credible interval</a>. But we maintain the terminology confidence interval because, in this context, the difference between the two only introduces unnecessary complications to people who are less familiar with Bayesian statistics.</p><div class="island job-posting"><h3>Become an Applied Scientist at Yelp!</h3><p>Are you intrigued by data? Uncover insights and carry out ideas through statistical and predictive models.</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/cc5ce7e2-26e9-4290-8847-c082632df9e8/Applied-Scientist-Inference-and-Metrics-Remote?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2022/12/lessons-from-ab-testing-on-bandit-subjects.html</link>
      <guid>https://engineeringblog.yelp.com/2022/12/lessons-from-ab-testing-on-bandit-subjects.html</guid>
      <pubDate>Wed, 21 Dec 2022 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Spark Data Lineage]]></title>
      <description><![CDATA[<p>In this blog post, we introduce Spark-Lineage, an in-house product to track and visualize how data at Yelp is processed, stored, and transferred among our services.</p><p><strong>Spark and Spark-ETL:</strong> At Yelp, <a href="https://spark.apache.org/">Spark</a> is considered a <a href="https://engineeringblog.yelp.com/2020/03/spark-on-paasta.html">first-class citizen</a>, handling batch jobs in all corners, from crunching reviews to identify similar restaurants in the same area to running reporting analytics that optimize local business search. Spark-ETL is our in-house wrapper around Spark, providing high-level APIs to run Spark batch jobs and abstracting away the complexity of Spark. Spark-ETL is used extensively at Yelp, saving our engineers the time they would otherwise spend writing, debugging, and maintaining Spark jobs.</p><p><strong>Problem:</strong> Our data is processed and transferred among hundreds of microservices and stored in different formats in multiple data stores including Redshift, S3, Kafka, and Cassandra. We currently have thousands of batch jobs running daily, and it is increasingly difficult to understand the dependencies among them. Imagine yourself in the role of a software engineer responsible for a microservice which publishes data consumed by a few critical Yelp services; you are about to make structural changes to the batch job and want to know who and what downstream of your service will be impacted. Or imagine yourself in the role of a machine learning engineer who would like to add an ML feature to their model and asks: “Can I run a check myself to understand how this feature is generated?”</p><p><strong>Spark-Lineage:</strong> Spark-Lineage is built to solve these problems. 
It provides a visual representation of the data’s journey, including all steps from origin to destination, with detailed information about where the data goes, who owns the data, and how the data is processed and stored at each step. Spark-Lineage extracts all necessary metadata from every Spark-ETL job, constructs graph representations of data movements, and lets users explore them interactively via a third-party data governance platform.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-08-04-spark-lineage/i1.png" alt="Figure 1. Example of Spark-Lineage view of a Spark-ETL job" /><p class="subtle-text"><small>Figure 1. Example of Spark-Lineage view of a Spark-ETL job</small></p></div><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-08-04-spark-lineage/i2.png" alt="Figure 2. Overview of Spark-Lineage" /><p class="subtle-text"><small>Figure 2. Overview of Spark-Lineage</small></p></div><p>Running a Spark job with Spark-ETL is simple; the user only needs to provide (1) the source and target information via a yaml config file, and (2) the logic of the data transformation from the sources to the targets via Python code.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-08-04-spark-lineage/i3.png" alt="Figure 3. An example diagram of a Spark-ETL job" /><p class="subtle-text"><small>Figure 3. An example diagram of a Spark-ETL job</small></p></div><p>On the backend side, we implement Spark-Lineage directly inside Spark-ETL to extract, from every batch job, all pairs of source and target tables that have a dependency relationship. More precisely, we use the <a href="https://networkx.org/">NetworkX</a> library to construct a workflow graph of the job and find all pairs of source and target tables that have a path between them in the corresponding Directed Acyclic Graph (DAG) workflow of that job. 
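</p><p>The path check can be sketched without any dependencies (in production the post uses NetworkX’s graph queries for this; the table and job names below are invented):</p>

```python
# Hypothetical sketch: emit a (source, target) pair whenever a path
# exists from a source table to a target table in the job's workflow DAG.
from collections import deque

def reachable(dag, start):
    """All nodes reachable from `start` in a DAG given as an adjacency dict."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in dag.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def lineage_pairs(dag, sources, targets):
    return [(s, t) for s in sources for t in targets if t in reachable(dag, s)]

# Toy workflow echoing Figure 3: input_2 feeds only output_1, so
# (input_2, output_2) is not a lineage pair.
dag = {
    "input_1": ["transform_a"],
    "transform_a": ["output_1", "output_2"],
    "input_2": ["output_1"],
}
pairs = lineage_pairs(dag, ["input_1", "input_2"], ["output_1", "output_2"])
# [('input_1', 'output_1'), ('input_1', 'output_2'), ('input_2', 'output_1')]
```

<p>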
Intermediate tables in the transformation are not recorded in Lineage because they are temporary. For example, (Input Table 1, Output Table 2) is a pair in Figure 3 since there is a path between them, while (Input Table 2, Output Table 2) is not. For every such pair, we emit a message to Kafka including the identifiers of the source and target, together with other necessary metadata. These messages are then transferred from Kafka to a dedicated table in Redshift.</p><p>The reason we go with a two-step process instead of sending messages directly to one place is that Redshift has maintenance downtime, while Kafka is highly available to receive newly emitted messages at all times. Storing the data in Redshift, on the other hand, is highly durable and easy to query for analytics purposes. At Yelp, we run on the order of thousands of batch jobs per day, and on average each job emits around 10 messages. In total, the Lineage table grows by a couple of million rows per year, which Redshift can handle with ease. Spark-Lineage then reads from the Redshift table and serves users via an ETL tool plug-in.</p><h2 id="building-spark-lineages-ui">Building Spark-Lineages UI</h2><p>First, we parse the metadata made available from the above steps in Redshift and identify the source and target information. This metadata is first read into a staging table in the Redshift database. The reason we stage this data is to identify any new jobs introduced in the daily load and to capture any updates to the existing scheduled jobs.</p><p>We then create a link (a canonical term for tables, files, etc.) for each Spark-ETL table, together with additional information extracted from the metadata. We also add relationships between these jobs and their respective schemas. 
Finally, we establish the connections among source and target tables according to the DAG extracted from Spark-ETL.</p><p>A mock-UI of Spark-Lineages is shown in Figure 1, where the user can browse or search for all Spark tables and batch jobs, read the details of each table and job, and track the dependencies among them from origin to destination.</p><h2 id="understanding-a-machine-learning-feature">Understanding a Machine Learning feature</h2><p>Data scientists working on Machine Learning models often look for existing data when building new features. In some cases the data they find might be based on different assumptions about what data should be included. For example, one team may include background events in a count of all recent events that a given user has performed, even though the model should not include such events. In such a case, Spark-Lineage allows a team to track down what data was used, surface these differing assumptions, and resolve the discrepancies.</p><h2 id="understanding-the-impacts">Understanding the impacts</h2><p>One of the major advantages of having data lineage identified and documented is that it enables Yelpers to understand the downstream and upstream dependencies of any change that will be incorporated into a feature. It also enables easy coordination across relevant teams to proactively measure the impact of a change and make decisions accordingly.</p><h2 id="fixing-data-incidents">Fixing data incidents</h2><p>In a distributed environment, many things can derail a batch job, leading to incomplete, duplicated, and/or partially corrupt data. Such errors may go unnoticed for a while, and by the time they are discovered they have already affected downstream jobs. In such cases, the response includes freezing all downstream jobs to prevent the corrupt data from spreading further, tracing all upstream jobs to find the source of the error, then backfilling from there and all downstream inaccurate data. 
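</p><p>The “freeze everything downstream” step maps directly onto the lineage table described earlier: a recursive query over the (source, target) rows yields the full downstream closure of a corrupted table. As an illustration (table and column names invented), here is the idea demonstrated on SQLite so the sketch is self-contained; Redshift also supports recursive CTEs:</p>

```python
# Hypothetical sketch: find everything downstream of a corrupted table by
# walking the lineage edges with a recursive CTE (demonstrated on SQLite).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineage (source TEXT, target TEXT)")
conn.executemany(
    "INSERT INTO lineage VALUES (?, ?)",
    [
        ("raw_reviews", "review_features"),
        ("review_features", "ranking_model_input"),
        ("raw_photos", "photo_features"),  # unrelated branch, untouched
    ],
)

downstream = conn.execute(
    """
    WITH RECURSIVE affected(name) AS (
        SELECT ?
        UNION
        SELECT l.target FROM lineage l JOIN affected a ON l.source = a.name
    )
    SELECT name FROM affected
    """,
    ("raw_reviews",),
).fetchall()
affected = sorted(row[0] for row in downstream)
# ['ranking_model_input', 'raw_reviews', 'review_features']
```

<p>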
Finally, we restore the jobs once the backfilling is done. All of these steps need to happen as fast as possible, and Spark-Lineage could be the perfect place to quickly identify the suspect jobs.</p><p>Moreover, recording the responsible team in Spark-Lineage establishes accountability for each job, so maintenance or on-point teams can approach the right owners at the right time. This avoids multiple conversations with multiple teams to identify the owners of a job and reduces delays that could adversely affect business reporting.</p><h2 id="feature-store">Feature Store</h2><p>Yelp’s ML Feature Store collects and stores features and serves them to consumers to build Machine Learning models or run Spark jobs, and to data analysts to get insights for decision-making. The Feature Store offers many benefits, among them:</p><ol><li>Avoiding duplicated work, e.g. from different teams trying to build the same features;</li>
<li>Ensuring consistency between training and serving models; and</li>
<li>Helping engineers to easily discover useful features.</li>
</ol><p>Data Lineage can help improve the Feature Store in various ways. We use Lineage to track the usage of features, such as how frequently a feature is used and by which teams, to determine the popularity of a feature or how much performance gain it can bring. From that, we can perform analytics to promote or recommend good features, or to guide the creation of similar features that we think can benefit our ML engineers.</p><h2 id="compliance-and-auditability">Compliance and auditability</h2><p>The metadata collected in Lineage can be used by legal and engineering teams to ensure that all data is processed and stored following regulations and policies. It also makes it easier to adapt the data processing pipeline should new regulations be introduced in the future.</p><p>This post introduces Yelp’s Spark-Lineage and demonstrates how it helps track and visualize the life cycle of data among our services, together with applications of Spark-Lineage in different areas at Yelp. For readers interested in the specific implementation of Spark-Lineage, we have included a server- and client-side breakdown below (Appendix).</p><h2 id="implementation-on-the-server-side">Implementation on the server side</h2><h3 id="data-identifiers">Data identifiers</h3><p>The most basic metadata that Spark-Lineage needs to track are the identifiers of the data. We provide two ways to identify an input/output table: the <em>schema_id</em> and the <em>location</em> of the data.</p><ul><li>
<p><strong>Schema_id:</strong> All modern data at Yelp is schematized and assigned a schema_id, regardless of whether it is stored in Redshift, S3, the Data Lake, or Kafka.</p>
</li>
<li>
<p><strong>Location:</strong> Table location, on the other hand, is not standardized across data stores; generally it is a triplet of (collection_name, table_name, schema_version), although the components are usually named differently in each data store, in line with that store’s terminology.</p>
</li>
</ul><p>Either way, given one identifier, we can look up the other. Schema information can be looked up via a CLI, via PipelineStudio – a simple UI to explore the schemas interactively – or directly in the Spark-Lineage UI, which offers more advanced features than PipelineStudio. By providing one of the two identifiers, we can see the description of every column in the table, how the schema of the table has evolved over time, and so on.</p><p>The two identifiers each have their own pros and cons and complement each other. For example:</p><ul><li>The schema_id provides a more canonical way to access the data information, but the location is easier to remember and more user-friendly.</li>
<li>If the schema is updated, the schema_id will no longer be the latest, whereas looking up the pair (collection_name, table_name) will always return the latest schema. We can also discover the latest schema from a schema_id, but it takes one more step.</li>
</ul><h3 id="tracking-other-information">Tracking other information</h3><p>Spark-Lineage also provides the following information:</p><ul><li>
<p><strong>Run date:</strong> We collect the date of every run of the job. From this we can infer the job’s running frequency, which is more reliable than the description in the yaml file because the frequency can change over time. If we don’t see any run for a month, we keep the job’s output tables available but mark them as deprecated so that users are aware.</p>
</li>
<li><strong>Outcome:</strong> We also track the outcome (success/failure) of every run of the job. We do not notify the owner of the job in case of a failure, because at Yelp we have dedicated monitoring and alerting tools. We use this data for the same purpose as above; if a service fails many times, we mark the output tables to let users know.</li>
<li><strong>Job name and yaml config file:</strong> These help the user quickly locate the information needed to understand the logic of the job, together with the owner of the job in case the user would like to reach out with follow-up questions.</li>
<li><strong>Spark-ETL version, service version, and Docker tag:</strong> This information is also tracked for every run and used for more technical purposes such as debugging. One use case: if an ML engineer notices a recent statistical shift in a feature, they can look up and compare the exact code of a run today versus that of a run last month.</li>
</ul><h2 id="implementation-on-the-client-side">Implementation on the client side</h2><p><strong>Representation of Spark ETL jobs:</strong> As a first step to represent a Spark ETL job, a new domain named “Spark ETL” is created. This enables easy catalog searching and results in a dedicated area for storing the details of Spark-ETL jobs from the Redshift staging table. Once the domain is available, unique links (for the Spark ETL jobs) are created in the data governance platform with the job name as the identifier.</p><p><strong>Adding metadata information:</strong> The details of the Spark ETL job (e.g., repository, source yaml, etc.) are attached to the respective links created above. Each piece of metadata is given a unique id and value, with a relation to the associated job. The current mechanism implemented for the Spark ETL jobs can be extended to represent additional information in the future.</p><p><strong>Assign accountability:</strong> As the information about the owners is fetched from Kafka into Redshift, the responsibility section of the job link in the data governance platform can be modified to include the “Technical Steward” – the engineering team accountable for the Spark ETL job, including producing and maintaining the actual source data, and responsible for the technical documentation of the data and for troubleshooting data issues.</p><p><strong>Establishing the lineage:</strong> Once the Spark-ETL jobs and the required metadata information are available in the data governance platform, we establish the two-way relations depicting source to Spark ETL job and Spark ETL job to target. The relations are established using a REST POST API call. Once the relations are created, the lineage is generated automatically and made available for use. 
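</p><p>A minimal sketch of what such a REST POST might look like (the endpoint, link identifiers, and field names below are invented for illustration; only the source-to-job and job-to-target shape comes from the text):</p>

```python
# Hypothetical sketch of creating one lineage relation via a REST POST.
# Nothing is sent over the network; we only build the request object.
import json
import urllib.request

def lineage_relation_request(base_url, source_link, target_link):
    payload = {"source": source_link, "target": target_link}
    return urllib.request.Request(
        url=f"{base_url}/relations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Each Spark-ETL job yields two relations: source table -> job, job -> target table.
req_in = lineage_relation_request(
    "https://governance.example.com/api", "schema:input_table_1", "spark_etl:my_job")
req_out = lineage_relation_request(
    "https://governance.example.com/api", "spark_etl:my_job", "schema:output_table_1")
# urllib.request.urlopen(req_in) would perform the actual call.
```

<p>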
Multiple views can be used to depict the relations, but the “Lineage View” captures the dependencies all the way to Tableau dashboards (see Figure 1).</p><p>Thanks to Cindy Gao, Talal Riaz, and Stefanie Thiem for designing and continuously improving Spark-Lineage, and thanks to Blake Larkin, Joachim Hereth, Rahul Bhardwaj, and Damon Chiarenza for technical review and for editing the blog post.</p><div class="island job-posting"><h3>Become an ML Platform Engineer at Yelp</h3><p>Want to build state of the art machine learning systems at Yelp? Apply to become an ML Platform Engineer today.</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/b5699bbf-77ac-47ad-abf1-53638d8a5dec?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2022/08/spark-data-lineage.html</link>
      <guid>https://engineeringblog.yelp.com/2022/08/spark-data-lineage.html</guid>
      <pubDate>Thu, 04 Aug 2022 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Android in Analytics Infra]]></title>
      <description><![CDATA[At Yelp, we have a reasonably large Android community for a company of our size. These talented and skilled Android engineers work on Yelp’s client and business applications. We would like to share some of the unique challenges that we’ve experienced along with our various efforts to overcome those challenges. Analytics Infra is a team at Yelp that works on experimentation and logging platforms and supports them across the entire Yelp ecosystem. Within the Analytics Infra team, we have an Android working group. You may consider our team as an infrastructure team - a team that implements end-user functionality -...]]></description>
      <link>https://engineeringblog.yelp.com/2022/08/android-in-analytics-infra.html</link>
      <guid>https://engineeringblog.yelp.com/2022/08/android-in-analytics-infra.html</guid>
      <pubDate>Wed, 03 Aug 2022 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Writing Emails Using React]]></title>
      <description><![CDATA[<p>As part of our effort to connect users with great local businesses, Yelp sends out tens of millions of emails every month. In order to support the scale of those sends, we rely on third-party Email Service Providers (ESPs) as well as our internal email system, Mercury.</p><p>Delivering the emails is just part of the challenge—we also need to give email developers a way to craft sophisticated templates that conform to our <a href="https://www.yelp.com/styleguide">Yelp design guidelines</a>. In the past, Yelp web and full stack engineers would rely on our legacy template language, Cheetah, to write emails. However, as the Yelp design language continued to evolve, this approach began to show its age: the code wasn’t maintained and visuals were no longer consistent with those of our apps and website. Additionally, Cheetah is a little-known language that represents an entirely different development workflow from what Yelp engineers are most accustomed to writing in their day-to-day work. Essentially all new web development is done in React.</p><p>In 2021, we set out to solve these problems by <strong>creating an email development system based on React components</strong>. Since its general release, this system has been used to develop more than a dozen new email types to send at scale, with <strong>millions of emails sent to date</strong>.</p><p>In this blog post, we’ll detail how we’ve repurposed elements of our website’s infrastructure to support email development, and how these systems address common problems encountered by email developers.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2022-07-20-writing-emails-using-react/react_emails.png" alt="React emails" /></p><p>Yelp web developers write React code in our frontend monorepo, where they can create new packages, write components, and deploy pages. 
By using React for email development as well, developers are able to write emails in familiar ways:</p><ul><li>New emails are scaffolded using <a href="https://yeoman.io/">Yeoman</a>.</li>
<li>Each email template is its own React component. The component’s children are made up of shared email components that are imported from elsewhere in the monorepo.</li>
<li>Our React email code is type checked and linted, following the same rules as the rest of the monorepo.</li>
<li>Developers can write <a href="https://storybook.js.org/">Storybook</a> examples for emails and see them displayed in the same way as our web components.</li>
<li>Emails are tightly integrated with our core web infrastructure; building, image imports, CDN asset upload, and i18n are all supported for free.</li>
</ul><p>We’ve also created tooling for email developers that want to send tests to real email clients. <strong>yelp-js-email</strong> extracts markup from Storybook examples and uses it as a basis for generating a preview email. With a little bit of AST modification, we can generate a test application which renders a preview email that closely conforms to the process used in a production send. The outcome is that developers are able to create and test new emails seamlessly without any backend work or campaign configuration.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2022-07-20-writing-emails-using-react/storybook_example.png" alt="A storybook example of an email" /><img src="https://engineeringblog.yelp.com/images/posts/2022-07-20-writing-emails-using-react/yelp_js_email.png" alt="A terminal command to send a preview email" /></p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-07-20-writing-emails-using-react/preview_email.png" alt="An example email rendered in Storybook and sent to a developer’s inbox with a simple command." /><p class="subtle-text"><small>An example email rendered in Storybook and sent to a developer’s inbox with a simple command.</small></p></div><p>When it comes time to send an email, developers release a new version of their package containing the email component. Next, they <strong>add the released version to our email-rendering microservice</strong>. After making a <strong>backend config change</strong> describing the new email campaign, it’s ready to be sent. Our backend developers can then submit email payloads to Mercury through Yelp’s data pipeline which triggers a server-side render of the email component. After some post-processing transformation to prepare the rendered HTML for sending, we then pass the email through the rest of our email pipeline on to our ESP. 
In a matter of seconds, a user receives an email from Yelp that was written with React!</p><p>For more details on the implementation of React emails and challenges we’ve encountered, read on.</p><p>Email clients have a (not undeserved) reputation for being difficult to develop for. Maintaining good compatibility while accounting for the various quirks of dozens of individual clients and platforms is a task that requires expert knowledge, thorough testing, and a lot of patience. Unlike modern evergreen web browsers, which have largely coalesced around a common standard, email clients have gotten away with custom and broken behaviors for years.</p><p>The critical takeaway from this insight is that <strong>email developers ought to explicitly define email clients that they intend to support</strong>. In the same way that developers across the industry have factored declining market share and maintenance costs into their decisions to drop compatibility for Internet Explorer 11, email developers can make choices when it comes to the email clients that they support. Throughout our testing we’ve found that <strong>popular email clients have better HTML and CSS compatibility than most developers realize</strong>. Outside of a few notable “problem clients” (Desktop Outlook, Windows 10 Mail, etc.), the big players (Gmail, Apple Mail, Yahoo, outlook.com, etc.) largely render spec compliant markup and styles more or less correctly. Common email wisdom, such as the recommendation to never use &lt;div&gt; tags, does not apply to these clients in 2022.</p><p>We looked at engagement numbers for our various emails and found that a small percentage of email recipients were opening their mail in legacy email clients. After consulting with our Product team, we determined that we would drop support for Desktop Outlook and related clients. 
Dropping support means that the text content of the email will still render, but we don’t consider it blocking if the email is otherwise visually broken. By explicitly defining the email clients we intend to support, we:</p><ol><li>Give developers confidence when developing emails that they will display correctly for users within our support standards.</li>
<li>Get a better sense of the absolute market share of our emails.</li>
<li>Are able to craft more compelling, visually appealing, and responsive emails for the majority of Yelp users.</li>
</ol><p>Providing developers with drop-in email components that have already been audited against our support standards is critical. In the same way that web engineers compose pages on the Yelp website by arranging a set of common, consistently designed components, we want to allow the possibility of building emails with minimal custom code required.</p><p>We determined early on that our Design Systems team did not have the resources to build and maintain an entirely separate set of React components exclusively for emails. Our design language is constantly evolving across our three major platforms and requires continual upkeep. Adding email into the mix would incur a significant cost. Our approach was then to reuse as much of the implementations of our existing web React components as we could.</p><p>This might seem impossible at first glance, as web React components are built for a browser context that involves lots of functionality not supported in email clients (interactivity, statefulness, and even animation). However, at its most basic, <strong>React functions just like a classic templating language</strong>. Blocks of template code can be conditionally rendered, composed, and injected with data, all in the context of JSX. When server-side rendered, a React component is transformed into HTML. That HTML is what we can assemble into our email body and send as a static email.</p><p>We employ two strategies to ensure that our existing web components are able to render properly in an email context: component wrappers, used to refine the prop APIs of our web components, and CSS in JS transforms, used to make individual style tweaks. 
Where neither approach is suitable, we’ll create custom components built just for emails.</p><h2 id="component-wrappers">Component Wrappers</h2><p>Each of our repurposed web React components is wrapped in a corresponding component prefixed with “Email” (e.g., <code class="language-plaintext highlighter-rouge">Text</code> -&gt; <code class="language-plaintext highlighter-rouge">EmailText</code>, <code class="language-plaintext highlighter-rouge">Container</code> -&gt; <code class="language-plaintext highlighter-rouge">EmailContainer</code>, etc.). This wrapper component allows us to modify the available props for the component and provide one-off tweaks and overrides.</p><p>For example, our standard <code class="language-plaintext highlighter-rouge">Button</code> component supports an <code class="language-plaintext highlighter-rouge">onClick</code> handler, but we don’t want email developers to use it (no JavaScript means it won’t work!). We simply define a new <code class="language-plaintext highlighter-rouge">Props</code> type (at Yelp we use Flow) with the restricted component API. We also use it as an opportunity to document relevant compatibility notes we’ve encountered in our testing, and to make other tweaks to the template before forwarding along props to the underlying <code class="language-plaintext highlighter-rouge">Button</code> component:</p><p><img src="https://engineeringblog.yelp.com/images/posts/2022-07-20-writing-emails-using-react/email_button.png" alt="Sample code for an EmailButton React component" /></p><h2 id="css-in-js-transforms">CSS in JS Transforms</h2><p>We’re big fans of CSS in JS at Yelp, and are in the process of migrating most of our React components from SASS stylesheets to <a href="https://emotion.sh/docs/introduction">Emotion</a>. One of the often overlooked features of writing CSS in JS is the level of control it provides over styles at runtime. 
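</p><p>For example, a transform applied at render time can rewrite a declaration before it ever reaches an email client. Here is a minimal, framework-free sketch of the idea (the function name and regex are illustrative, not Emotion’s plugin API):</p>

```javascript
// Replace `var(--custom-prop, fallback)` usages with their fallback value,
// since many email clients don't support CSS custom properties.
// Illustrative sketch only; a real plugin operates on parsed style nodes.
function inlineCssVarFallbacks(value) {
  return value.replace(/var\(\s*--[\w-]+\s*,\s*([^)]+)\)/g, '$1');
}

// 'color: var(--button-color, #BE0E02);' becomes 'color: #BE0E02;'
```

<p>A real transform works on parsed style nodes rather than raw strings, but the effect is the same: email-incompatible CSS is rewritten while styles are generated.</p><p>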
Emotion actually <a href="https://emotion.sh/docs/cache-provider">facilitates this</a> through custom <a href="https://github.com/thysultan/stylis">Stylis plugins</a>, which allow developers to systematically transform their styles while a component is rendering!</p><p>We put this to good use for our React email components, tweaking CSS to maximize email compatibility and minimize cognitive load for developers. Check out this example Stylis plugin that helps address a compatibility issue with usage of “var” in property values:</p><p><img src="https://engineeringblog.yelp.com/images/posts/2022-07-20-writing-emails-using-react/stylis_plugin.png" alt="Sample code for a custom CSS in JS plugin" /></p><p>Without this plugin, we’d be hard put to reuse our React Button component in emails. Since the problematic “var” style is embedded in Button’s render method, we’d likely be forced to add an email-specific prop with some branching logic to conditionally remove the CSS. This adds a maintenance burden and introduces a concern with email rendering that our web components would not otherwise have.</p><p>Using this custom CSS in JS plugin, we get to <strong>maintain the encapsulation of our web components</strong> while still <strong>making the tweaks we need</strong> to use them effectively in emails. This is particularly important for emails that we send using <a href="https://amp.dev/about/email/">AMP</a>, which validates styles against a very strict subset of web CSS. Using these plugins, we can modify our styles to ensure they conform to the specification.</p><p>One other advantage of using CSS in JS to style our emails is that we know we won’t be wasting bytes with styles that we don’t need. When we’re building, our styles will be tree shaken alongside the rest of our components’ JS.</p><p>We aren’t always able to repurpose an existing web component using these two approaches; sometimes there are fundamental incompatibilities. 
In these cases, we’re able to write an <code class="language-plaintext highlighter-rouge">Email*</code> component from scratch, knowing that we aren’t unnecessarily introducing duplication. We probably want to <strong>revisit the design of these components in an email context anyway</strong>. For example, our <code class="language-plaintext highlighter-rouge">RatingSelector</code> component allows users to start a review on the Yelp website:</p><p><img src="https://engineeringblog.yelp.com/images/posts/2022-07-20-writing-emails-using-react/rating_selector.png" alt="Screenshot of a RatingSelector on the Yelp website" /></p><p>In an email context, we still want to allow users to tap a star rating to begin a review, but we’re not able to replicate the highlighting behavior on hover in email clients. There’s no way to address this difference using a wrapping component or CSS in JS plugin, so we created a custom <code class="language-plaintext highlighter-rouge">EmailRatingSelector</code> inspired by the original design but better suited for rendering statically.</p><p>As another example, at Yelp we use pre-built React components to standardize most of our layout needs. Most notably we use one called <code class="language-plaintext highlighter-rouge">Arrange</code>. As <code class="language-plaintext highlighter-rouge">Arrange</code> relies on some pretty tricky styles and media queries that don’t play nice with most email clients, we decided to create a custom <code class="language-plaintext highlighter-rouge">EmailArrange</code> component in its place. For <code class="language-plaintext highlighter-rouge">EmailArrange</code> we greatly simplified the available options, opting for a fixed layout <code class="language-plaintext highlighter-rouge">&lt;table&gt;</code>, but maintained a similar props API to the web <code class="language-plaintext highlighter-rouge">Arrange</code> component. 
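</p><p>To make this concrete, here is a hypothetical sketch of the kind of static markup an <code class="language-plaintext highlighter-rouge">EmailArrange</code>-style component might render; the names and attributes are illustrative, not Yelp’s actual implementation:</p>

```javascript
// Render a row of children as a fixed-layout table, the most reliable way
// to build columns across email clients. Hypothetical sketch; the real
// EmailArrange is a React component with a richer props API.
function emailArrange(children) {
  const cells = children
    .map((child) => `<td style="vertical-align:top">${child}</td>`)
    .join('');
  return `<table style="table-layout:fixed;width:100%" role="presentation"><tr>${cells}</tr></table>`;
}
```

<p>With a fixed table layout, every cell gets an equal share of the width without relying on media queries.</p><p>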
Developers consuming <code class="language-plaintext highlighter-rouge">EmailArrange</code> will see it work in much the same way that they’re accustomed to, and the <strong>implementation details of the email-specific differences are abstracted away from them</strong>.</p><p>At Yelp, we’ve found SSR (Server-Side Rendering) of React pages to be a critical component of our web application, positively impacting SEO, page performance, and user experience. Since we need to turn our React email components into HTML for sending, rendering them server-side (ideally using our existing infrastructure) was a critical piece of the puzzle.</p><p>The performance characteristics of rendering emails fundamentally differ from those of serving web traffic. Web traffic tends to scale gently, following predictable curves as users frequent Yelp more often at certain times of the day. Email rendering happens entirely differently. Except in cases where they’re sent immediately in response to a particular user action, emails are typically sent in massive, scheduled campaigns that consist of thousands or even millions of emails in one batch.</p><p>We performed some early tests on our legacy SSR system and found that it was a bottleneck—suddenly queuing thousands of requests to render emails at once quickly overwhelmed it. We were forced to throttle the requests to just tens of emails per second, which was unacceptable for the production campaigns we knew we needed to run.</p><p><a href="https://engineeringblog.yelp.com/2022/02/server-side-rendering-at-scale.html">As outlined in a previous blog post</a>, we’d encountered a myriad of similar challenges scaling Server-Side Rendering in other places, so our awesome web infrastructure team invested in a new system that SSRs pages with far greater performance and reliability. 
Since the widespread rollout of this modern system, we’ve been able to easily scale to <strong>thousands of emails sent per second</strong>, such that the rendering step is no longer a bottleneck in our pipeline.</p><p>When a backend developer writes a batch for a massive email campaign (typically using <a href="https://spark.apache.org/">Spark</a>), they queue email-send requests to Mercury, our internal notifications system (powered by our data pipeline and <a href="https://kafka.apache.org/">Kafka</a>). Those sends contain basic information such as the user to send to and the campaign it belongs to, as well as a payload containing data that’s required to render the template. These send requests are ingested by workers in our email-rendering microservice, which in turn triggers a request to our SSR service shard (the payload gets turned into React props). The shard returns our rendered email in HTML, which is then forwarded along to the rest of the email-sending pipeline.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-07-20-writing-emails-using-react/mercury_diagram.png" alt="Services involved in an end to end send of React emails" /><p class="subtle-text"><small>Services involved in an end to end send of React emails</small></p></div><p>There’s another benefit to rendering emails using our existing SSR infrastructure: since our React web pages are already configured to support making GraphQL queries during SSR using <a href="https://www.apollographql.com/">Apollo</a>, we can make online queries to include data in our email templates using our GraphQL API with no additional work needed.</p><p>Through the previous steps we’ve outlined, we’ve prepared our server-side rendered HTML to be sent to email clients. 
Even after the initial render, there’s still a little post-processing work to be done.</p><p>First, we make an effort to clean up the SSR HTML—in its raw state it’s tailored to be rendered and hydrated in a web browser. Using pyquery we can <strong>clean up</strong> extraneous script tags and attributes that we won’t use. Next, we want to <strong>establish the metadata</strong> for the email (subject line, from address, etc.). Inside each React email component, we use a <code class="language-plaintext highlighter-rouge">&lt;title&gt;</code> tag rendered via <a href="https://github.com/staylor/react-helmet-async#readme">react-helmet-async</a> to set our subject line. Data attributes on that tag provide the rest of the metadata we need.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2022-07-20-writing-emails-using-react/post_processing.png" alt="Sample code for post-processing meta data for an email" /></p><p>Using this approach lets us keep the source of truth for email metadata alongside the component itself.</p><p>Finally, <strong>we need to inline our styles</strong>. Even modern email clients like Gmail suffer from <a href="https://github.com/hteumeuleu/email-bugs/issues/90">limitations like <code class="language-plaintext highlighter-rouge">&lt;style&gt;</code> tag byte limits</a> that make it necessary to move <code class="language-plaintext highlighter-rouge">&lt;head&gt;</code> tag CSS into HTML element <code class="language-plaintext highlighter-rouge">style=""</code> attributes. There are a bunch of open source options to accomplish this task (e.g., <a href="https://github.com/Automattic/juice">juice</a> and <a href="https://github.com/premailer/premailer">premailer</a>). In our testing, we found that existing Python implementations were far too slow for our needs, sometimes taking upwards of a second. 
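</p><p>The task itself is simple to state: move each matching rule from the stylesheet into the <code class="language-plaintext highlighter-rouge">style=""</code> attribute of the elements it selects. Here is a toy version handling only single class selectors (sketched in JavaScript for illustration; our production inliner lives in the Python rendering pipeline):</p>

```javascript
// Toy style inliner: copies `.class { ... }` rules from a stylesheet into
// style="" attributes on elements with a matching class attribute.
// Illustrative only; real inliners parse both the CSS and the HTML properly.
function inlineClassStyles(html, css) {
  const rules = {};
  for (const match of css.matchAll(/\.([\w-]+)\s*\{([^}]*)\}/g)) {
    rules[match[1]] = match[2].trim();
  }
  return html.replace(/class="([\w-]+)"/g, (full, cls) =>
    rules[cls] ? `${full} style="${rules[cls]}"` : full
  );
}
```

<p>Handling real-world CSS (specificity, compound selectors, media queries) is what makes production inliners genuinely hard.</p><p>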
Instead, we opted to write a custom style inliner built on <a href="https://pypi.org/project/tinycss2/">tinycss2</a> and <a href="https://pypi.org/project/selectolax/">selectolax</a>. Even with some AST traversal, the implementation is quite straightforward, and we were able to minimize the time we spent inlining styles down to just a few milliseconds.</p><p>After performing all these tasks, we construct a basic email HTML structure and forward it on its way. The email is ready to be sent!</p><p>By relying on React components and existing Yelp web infrastructure, we were able to architect an email template system that’s easy to use for developers, has up-to-date designs, lowers maintenance costs, and surpasses our performance requirements. In aligning product and engineering needs and clearly defining our email compatibility standards, we spend less time concerned with outlier email clients and more time creating compelling campaigns for Yelp users.</p><p>While the approach to sending emails outlined above is from the frame of reference of Yelp’s infrastructure, the overall system can be replicated using some fundamental building blocks like CSS in JS, a mature SSR platform, and an extensible email-rendering pipeline.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>Want to help us make even better tools for our full stack engineers?</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/b970ccef-75bf-45ce-bda5-e6f3f3988e38/Senior-Software-Engineer-Full-Stack-Remote?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2022/07/writing-emails-using-react.html</link>
      <guid>https://engineeringblog.yelp.com/2022/07/writing-emails-using-react.html</guid>
      <pubDate>Wed, 20 Jul 2022 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Migrating from Styleguidist to Storybook]]></title>
      <description><![CDATA[<p>One of the core tenets for our infrastructure and engineering effectiveness teams at Yelp is ensuring we have a best-in-class developer experience. Our React monorepo codebase has steadily grown as developers create new React components, but our existing <a href="https://react-styleguidist.js.org/">React Styleguidist</a> (Styleguidist, for short) development environment has failed to scale in parallel. By transitioning from Styleguidist to <a href="https://storybook.js.org/">Storybook</a>, we were able to offer a faster and more user-friendly development environment for React components along with better alignment to developer and designer workflows. In this post we’ll take a deep dive into how and why we migrated to Storybook.</p><p>Styleguidist is an interactive React component development environment that developers use to develop and view their user interfaces. Styleguidist can also be used to produce static documentation pages (style guides) that can be hosted and shared with stakeholders.</p><p>Documentation is created using Markdown with code blocks that render a React component in an isolated interactive playground. A simple example looks like the following:</p><div class="language-markdown highlighter-rouge highlight"><pre>The `&lt;ButtonGroup /&gt;` component is used to arrange multiple `&lt;Button /&gt;`
components side-by-side.
```jsx
const Button = require('../Button').default;
&lt;ButtonGroup&gt;
    &lt;Button text="Foo" /&gt;
    &lt;Button text="Bar" /&gt;
    &lt;Button text="Baz" /&gt;
&lt;/ButtonGroup&gt;
```
</pre></div><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-07-06-migrating-from-styleguidist-to-storybook/styleguidist_example.png" alt="An example Styleguidist playground" /><p class="subtle-text"><small>An example Styleguidist playground</small></p></div><p>At Yelp, we’ve encountered various drawbacks from using Styleguidist that have led to a subpar React development experience:</p><ul><li>Styleguidist lacks an add-ons ecosystem due to limited support from the wider Web community, so additional functionality would have to be written from scratch.</li>
<li>Styleguidist does not scale well with large packages because it renders an isolated playground for every example in that package, resulting in slow initial load times and slow hot reloads.</li>
<li>Developers have to create many permutations of each of their components to show every possible state a component supports.</li>
<li>Editing Styleguidist markdown to change component state in the UI is not intuitive for developers and non-technical users.</li>
</ul><p><a href="https://storybook.js.org/">Storybook</a> is an open source UI development and documentation tool that has gained popularity in the Web community in the past few years. It has strong community support and a rich add-ons ecosystem, making it easy to extend for accessibility testing, cross-browser testing, and other functionality.</p><p>Storybook allows users to browse and develop component examples one by one via <a href="https://storybook.js.org/docs/react/get-started/whats-a-story">Stories</a>. Stories capture the rendered state of a React Component, just like a Styleguidist Markdown example. This contrasts with the significantly slower Styleguidist, which always renders every example of every component in a package.</p><p>In Styleguidist, developers often create one example per visual permutation of their component, resulting in added maintenance burden (e.g. updating every example after changing a component API). In Storybook, developers can utilize auto generated <a href="https://storybook.js.org/docs/react/essentials/controls">Controls</a> via <a href="https://github.com/reactjs/react-docgen">react-docgen</a> that allow users to mutate and preview components directly in the documentation UI. This further streamlines the experience compared to Styleguidist, because documentation users no longer need to edit Markdown to change a component’s state.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-07-06-migrating-from-styleguidist-to-storybook/storybook_example.png" alt="An example Storybook playground" /><p class="subtle-text"><small>An example Storybook playground</small></p></div><p>Our React monorepo contained thousands of Styleguidist files, each with many examples of component usage within it. It was not feasible to migrate these by hand, and it would be unreasonable to force developers to manually rewrite their examples in the new Storybook format. 
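</p><p>Rewrites like this are mechanical enough to automate. As a flavor of the transformations involved, here is a hedged, regex-based sketch of converting one ES5 require line to an ES6 default import (real codemods operate on ASTs, not regexes):</p>

```javascript
// Rewrite `const X = require('path').default;` as `import X from 'path';`.
// Illustrative sketch; AST-based tools handle the many variants safely.
function requireToImport(line) {
  return line.replace(
    /const\s+(\w+)\s*=\s*require\('([^']+)'\)\.default;/,
    "import $1 from '$2';"
  );
}

// requireToImport("const Button = require('../Button').default;")
// returns "import Button from '../Button';"
```

<p>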
To maintain our existing React component examples and reduce developer overhead in our migration, we developed the following requirements:</p><ul><li>Our existing Styleguidist files used ES5 style imports and syntax. We want to keep our new Storybook syntax consistent with component source code by using ES6 everywhere.</li>
<li>Documentation in Storybook should be familiar to developers who have used Styleguidist.
<ul><li>Storybook supports <a href="https://mdxjs.com/">MDX</a> which is a file format that combines Markdown with JSX to render React components in Markdown for documentation pages, and we can translate existing Styleguidist Markdown to MDX.</li>
</ul></li>
<li>Each example code block in Styleguidist should be translated into a <a href="https://storybook.js.org/docs/react/get-started/whats-a-story">Story</a>, and the component’s stories.js file should contain all examples.</li>
</ul><p>With these goals in mind, we decided to use codemods to refactor our style guide files into the Storybook format. Codemods are a series of scripted actions that transform a codebase programmatically, and allow for large automated changes to be made without manual work.</p><p>First, we extracted the Styleguidist code blocks; the rest of the contents of the Markdown file (e.g. plaintext descriptions) could be directly copied verbatim to the new MDX file. To achieve a one-to-one migration, we consider each code block as its own Story. We were able to leverage existing tools like <a href="https://www.npmjs.com/package/remark-code-blocks">remark-code-blocks</a> to extract JavaScript code blocks, and <a href="https://github.com/5to6/5to6-codemod">5to6-codemod</a> to convert ES5 syntax within these code blocks to ES6 syntax.</p><div class="language-js highlighter-rouge highlight"><pre>// before:
// const Button = require('../Button').default;
import Button from '../Button';
</pre></div><p>To reduce developer friction during this transition, we decided to contain all Stories for a component in the same <code class="language-plaintext highlighter-rouge">component.stories.js</code> file, which is then displayed in the <code class="language-plaintext highlighter-rouge">component.stories.mdx</code> Docs Page. However, we discovered that MDX code blocks are run in the same context, and our assumption of maintained playground isolation from Styleguidist is no longer true. This issue is particularly problematic when dealing with transforming multiple Styleguidist examples in the same file, because joining the code blocks together results in duplicate imports:</p><div class="language-markdown highlighter-rouge highlight"><pre>```jsx
import Button from '../Button';
Full width `ButtonGroup` example:
&lt;ButtonGroup fill&gt;
(omitted for brevity)
```
```jsx
import Button from '../Button'; // &lt;-- this import is duplicated from above!
Disabled `ButtonGroup` example:
&lt;ButtonGroup disabled&gt;
(omitted for brevity)
```
</pre></div><p>After combining the above stories into a single JS file, the Button import is duplicated. Our codemod needs to parse and dedupe these imports to prevent runtime errors. Additionally, we need to include the components that <a href="https://react-styleguidist.js.org/docs/documenting/#writing-code-examples">Styleguidist implicitly imports</a> for us:</p><div class="language-jsx highlighter-rouge highlight"><pre>// ButtonGroup.stories.js
import Button from '../Button'; // deduped
import { ButtonGroup } from './'; // added implicit import explicitly
&lt;ButtonGroup&gt;
    &lt;Button text="Foo" /&gt;
    &lt;Button text="Bar" /&gt;
    &lt;Button text="Baz" /&gt;
&lt;/ButtonGroup&gt;
</pre></div><p>Next, we write the extracted Markdown code blocks with deduped imports and ES6 syntax in <code class="language-plaintext highlighter-rouge">component.stories.js</code>, and a <code class="language-plaintext highlighter-rouge">component.stories.mdx</code> file with standard Storybook boilerplate:</p><div class="language-jsx highlighter-rouge highlight"><pre>// ButtonGroup.stories.mdx
import { ArgsTable, Canvas, Description, Meta, Story } from '@storybook/addon-docs';
import * as stories from './ButtonGroup.stories.js';
import { ButtonGroup } from './';
&lt;Meta
    title="yelp-react-component-button/ButtonGroup"
    component={ButtonGroup}
/&gt;
The `&lt;ButtonGroup /&gt;` component is used to arrange multiple `&lt;Button /&gt;`
components side-by-side.
&lt;Canvas&gt;
  &lt;Story name="Example0" story={stories.Example0} /&gt;
&lt;/Canvas&gt;
</pre></div><p>Lastly, we needed Storybook to understand how to build our components. We were able to extend the <a href="https://storybook.js.org/docs/react/builders/webpack#extending-storybooks-webpack-config">Storybook build configuration</a> with our existing production webpack configuration. This allowed us to preserve Storybook’s automatic docgen functionality, and miscellaneous features like code preview blocks. Using our existing webpack configuration also meant that components would appear and behave exactly as they do in real production pages.</p><p>Migrating our React component examples from Styleguidist to Storybook has massively improved developer experience and component playground performance. We were able to utilize Storybook features like <a href="https://storybook.js.org/docs/react/configure/overview#on-demand-story-loading">on-demand loading</a> to improve performance by generating a smaller bundle at compile time, resulting in faster playground boot times. Using our codemod migration strategy, we were able to transform nearly all of the examples in our monorepo without runtime errors, without blocking developers during the migration process.</p><p>Switching to Storybook opens up new possibilities for Yelp, and we’re excited to onboard add-ons to accelerate frontend developer productivity further.</p><p>We hope that this breakdown in our migration process helps teams facing similar migrations!</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>Want to help us make even better tools for our full stack engineers?</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/a6cfee89-2dd0-4451-bf52-746b9547dfb7/Software-Engineer-Full-Stack-Remote?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2022/07/migrating-from-styleguidist-to-storybook.html</link>
      <guid>https://engineeringblog.yelp.com/2022/07/migrating-from-styleguidist-to-storybook.html</guid>
      <pubDate>Wed, 06 Jul 2022 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Spark Data Lineage in Collibra]]></title>
<description><![CDATA[<p>In this blog post, we introduce Spark-Lineage, an in-house product to track and visualize how data at Yelp is processed, stored, and transferred among our services.</p><p><strong>Spark and Spark-ETL:</strong> At Yelp, <a href="https://spark.apache.org/">Spark</a> is considered a <a href="https://engineeringblog.yelp.com/2020/03/spark-on-paasta.html">first-class citizen</a>, handling batch jobs in all corners of the business, from crunching reviews to identify similar restaurants in the same area to performing reporting analytics for optimizing local business search. Spark-ETL is our in-house wrapper around Spark, providing high-level APIs to run Spark batch jobs and abstracting away the complexity of Spark. Spark-ETL is used extensively at Yelp, helping save time that our engineers would otherwise need for writing, debugging, and maintaining Spark jobs.</p><p><strong>Problem:</strong> Our data is processed and transferred among hundreds of microservices and stored in different formats in multiple data stores including Redshift, S3, Kafka, Cassandra, etc. Currently, we have thousands of batch jobs running daily, and it is increasingly difficult to understand the dependencies among them. Imagine yourself in the role of a software engineer responsible for a microservice which publishes data consumed by a few critical Yelp services; you are about to make structural changes to the batch job and want to know which services and jobs downstream of yours will be impacted. Or imagine yourself in the role of a machine learning engineer who would like to add an ML feature to their model and ask — “Can I run a check myself to understand how this feature is generated?”</p><p><strong>Spark-Lineage:</strong> Spark-Lineage is built to solve these problems. 
It provides a visual representation of the data’s journey, including all steps from origin to destination, with detailed information about where the data goes, who owns the data, and how the data is processed and stored at each step. Spark-Lineage extracts all necessary metadata from every Spark-ETL job, constructs graph representations of data movements, and lets users explore them interactively via <a href="https://www.collibra.com/us/en">Collibra</a>, a third-party data governance platform.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-29-spark-lineage/i1.png" alt="Figure 1. Example of Spark-Lineage view of a Spark-ETL job" /><p class="subtle-text"><small>Figure 1. Example of Spark-Lineage view of a Spark-ETL job</small></p></div><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-29-spark-lineage/i2.png" alt="Figure 2. Overview of Spark-Lineage" /><p class="subtle-text"><small>Figure 2. Overview of Spark-Lineage</small></p></div><p>To run a Spark job with Spark-ETL is simple; the user only needs to provide (1) the source and target information via a yaml config file, and (2) the logic of the data transformation from the sources to the targets via python code.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-29-spark-lineage/i3.png" alt="Figure 3. An example diagram of a Spark-ETL job" /><p class="subtle-text"><small>Figure 3. An example diagram of a Spark-ETL job</small></p></div><p>On the backend side, we implement Spark-Lineage directly inside Spark-ETL to extract all pairs of source and target tables having dependency relationships from every batch job. More precisely, we use the <a href="https://networkx.org/">NetworkX</a> library to construct a workflow graph of the job, and find all pairs of source and target tables that have a path between them in the corresponding Directed Acyclic graph (DAG) workflow of that job. 
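</p><p>The pairing step amounts to a reachability check over that DAG (sketched here in JavaScript for illustration; the actual implementation uses Python’s NetworkX):</p>

```javascript
// Keep a (source, target) pair only if target is reachable from source
// in the job's DAG. Illustrative sketch of the idea, not Yelp's code.
function lineagePairs(edges, sources, targets) {
  const adj = {};
  for (const [from, to] of edges) {
    (adj[from] = adj[from] || []).push(to);
  }
  const reachable = (start, goal) => {
    const stack = [start];
    const seen = new Set();
    while (stack.length) {
      const node = stack.pop();
      if (node === goal) return true;
      if (seen.has(node)) continue;
      seen.add(node);
      for (const next of adj[node] || []) stack.push(next);
    }
    return false;
  };
  const pairs = [];
  for (const s of sources) {
    for (const t of targets) {
      if (reachable(s, t)) pairs.push([s, t]);
    }
  }
  return pairs;
}
```

<p>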
Intermediate tables in the transformation are not recorded in Lineage because they are temporary. For example, (Input Table 1, Output Table 2) is a pair in Figure 3 since there is a path between them, while (Input Table 2, Output Table 2) is not. For every such pair, we emit a message to Kafka including the identifiers of the source and target, together with other necessary metadata. These messages are then transferred from Kafka to a dedicated table in Redshift.</p><p>The reason we go with a two-step process instead of sending messages directly to one place is that Redshift has maintenance downtime while Kafka is highly available to receive newly emitted messages at all times. On the other hand, storing data in Redshift is highly durable and easy to query for analytics purposes. At Yelp, we have on the order of thousands of batch jobs per day, and on average each job emits around 10 messages. In total, the Lineage table grows by a couple of million rows per year, which can be handled with ease by Redshift. Collibra then reads from the Redshift table and serves users, using a Snaplogic plug-in.</p><h2 id="building-spark-lineages-ui-on-collibra">Building the Spark-Lineage UI on Collibra</h2><p><a href="https://www.collibra.com/us/en">Collibra</a> is a platform to collaborate and establish effective governance for data management and stewardship, enabling their customers/users to find meaning in their data and improve business decisions. Collibra is used at Yelp to provide a platform for data cataloging, discovery, and governance. The tool is being used by Engineering, Product, and several business teams across Yelp.</p><p>First, we parse the metadata made available from the above steps in Redshift and identify the source and target information. This metadata is first read into a staging table in the Redshift database using a <a href="https://www.snaplogic.com/">Snaplogic</a> ETL tool. 
The reason we stage this data is to identify any new jobs that have been introduced in the daily load or to capture any updates to the existing scheduled jobs.</p><p>We then create an asset (a canonical term for tables, files, etc., in Collibra) for each Spark-ETL table together with additional information extracted from the metadata. We also add relationships between these assets and existing assets (e.g., schemas). Finally, we establish the connections among source and target tables according to the DAG extracted from Spark-ETL.</p><p>The UI of Spark-Lineage is shown in Figure 1, where the user can browse or search for all Spark tables and batch jobs, read the details of each table and job, and track the dependencies among them from their origin to their end.</p><h2 id="understanding-a-machine-learning-feature">Understanding a Machine Learning feature</h2><p>Data scientists working on Machine Learning models often look for existing data when building new features. In some cases, the data they find might be based on different assumptions about what data should be included. For example, one team may include background events in a count of all recent events that a given user has performed, when the model does not wish to include such events. In such a case, Spark-Lineage allows a team to track down what data was used in these differing decisions and what data can alleviate the discrepancies.</p><h2 id="understanding-the-impacts">Understanding the impacts</h2><p>One of the major advantages of having data lineage identified and documented is that it enables Yelpers to understand any downstream/upstream dependencies for any changes that will be incorporated into a feature. 
It also enables easy coordination across relevant teams to proactively measure the impact of a change and make decisions accordingly.</p><h2 id="fixing-data-incidents">Fixing data incidents</h2><p>In a distributed environment, there are many reasons that can derail a batch job, leading to incomplete, duplicated, and/or partially corrupt data. Such errors may go unnoticed for a while, and by the time they are discovered, they have already affected downstream jobs. In such cases, the response includes freezing all downstream jobs to prevent the corrupt data from spreading further, tracing all upstream jobs to find the source of the error, then backfilling from there through all of the downstream inaccurate data. Finally, we restore the jobs when the backfilling is done. All of these steps need to be done as fast as possible, and Spark-Lineage could be the perfect place to quickly identify the corrupted suspects.</p><p>In addition, recording the responsible team in Spark-Lineage establishes accountability for each job, so maintenance teams or on-point teams can approach the right owners at the right time. This avoids having multiple conversations with multiple teams to identify the owners of a job and reduces delays that could adversely affect business reporting.</p><h2 id="feature-store">Feature Store</h2><p>Yelp’s ML Feature Store collects and stores features and serves them to consumers to build Machine Learning models or run Spark jobs, and to data analysts to get insights for decision-making. Feature Store offers many benefits, among them:</p><ol><li>Avoiding duplicated work, e.g., from different teams trying to build the same features;</li>
<li>Ensuring consistency between training and serving models; and</li>
<li>Helping engineers to easily discover useful features.</li>
</ol><p>Data Lineage can help improve the Feature Store in various ways. We use Lineage to track feature usage, such as how frequently a feature is used and by which teams, to determine a feature’s popularity or how much performance gain it can bring. From that, we can perform data analytics to promote or recommend good features or guide us to produce similar features that we think can be beneficial to our ML engineers.</p><h2 id="compliance-and-auditability">Compliance and auditability</h2><p>The metadata collected in Lineage can be used by legal and engineering teams to ensure that all data is processed and stored following regulations and policies. It also makes it easier to change the data processing pipeline to comply with new regulations introduced in the future.</p><p>This post introduces the Yelp Spark-Lineage and demonstrates how it helps track and visualize the life cycle of data across our services, together with applications of Spark-Lineage in different areas at Yelp. For readers interested in the specific implementation of Spark-Lineage, we have included a server- and client-side breakdown below (Appendix).</p><h2 id="implementation-on-the-server-side">Implementation on the server side</h2><h3 id="data-identifiers">Data identifiers</h3><p>The most basic metadata that Spark-Lineage needs to track are the identifiers of the data. We provide two ways to identify an input/output table: the <em>schema_id</em> and the <em>location</em> of the data.</p><ul><li>
<p><strong>Schema_id:</strong> All modern data at Yelp is schematized and assigned a schema_id, no matter whether it is stored in Redshift, S3, Data Lake, or Kafka.</p>
</li>
<li>
<p><strong>Location:</strong> Table location, on the other hand, is not standardized across data stores, but it is generally a triplet of (collection_name, table_name, schema_version), although the components are usually named differently in each data store, in line with that store’s terminology.</p>
</li>
</ul><p>Either way, if we are given one identifier, we can get the other. Looking up schema information can be done via a CLI, via PipelineStudio – a simple UI for exploring schemas interactively – or directly on Collibra, which offers more advanced features than PipelineStudio. By providing one of the two identifiers, we can see the description of every column in the table, how the schema of the table has evolved over time, etc.</p><p>Each of the two identifiers has its own pros and cons, and they complement each other. For example:</p><ul><li>The schema_id provides a more canonical way to access the data information, but the location is easier to remember and more user-friendly.</li>
<li>If the schema is updated, the schema_id will no longer point to the latest version, while looking up the pair (collection_name, table_name) will always return the latest schema. We can also discover the latest schema from a schema_id, but it takes one more step.</li>
</ul><h3 id="tracking-other-information">Tracking other information</h3><p>Spark-Lineage also provides the following information:</p><ul><li><strong>Run date:</strong> We collect the date of every run of the job. From this we can infer its running frequency, which is more reliable than relying on the description in the yaml file, because the configured frequency can change over time. If we don’t receive any runs for a month, we still keep the output tables of the job available in Collibra but mark them as deprecated so that the users of Collibra are aware of this.</li>
<li><strong>Outcome:</strong> We also track the outcome (success/failure) of every run of the job. We do not notify the owner of the job in case of a failure, because at Yelp we have dedicated tools for monitoring and alerts. We use this data for the same purpose as above; if a service fails many times, we will mark the output tables to let the users know about that.</li>
<li><strong>Job name and yaml config file:</strong> This helps the user quickly locate the necessary information to understand the logic of the job, together with the owner of the job in case the user would like to contact for follow-up questions.</li>
<li><strong>Spark-ETL version, service version, and Docker tag:</strong> This information is also tracked for every run and used for more technical purposes such as debugging. One use case: if an ML engineer notices a recent statistical shift in a feature, they can look up and compare the specific code of a run today versus that of last month.</li>
</ul><h2 id="implementation-on-the-client-side">Implementation on the client side</h2><p><strong>Creating assets in Collibra:</strong> As a first step to create the Spark ETL assets in Collibra, a data domain named “Spark ETL” is created for easy catalog searching and to have a dedicated area for storing the details of these jobs within Collibra. Once the domain is made available, the Spark ETL job details that are staged in the staging Redshift table are loaded as assets using the Collibra API with the job name as the unique identifier.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-29-spark-lineage/i4.png" alt="" /></div><p><strong>Assigning attributes to the assets:</strong> The details of the Spark ETL job (e.g., Repository, source yaml, etc.) are attached to the respective assets created above as attributes. Each of the attributes has a unique id and value with a relation to the associated asset using asset_attribute_key. The current asset attributes for the Spark ETL jobs can be extended to represent additional information in the future.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-29-spark-lineage/i5.png" alt="" /></div><p><strong>Accountability of the asset:</strong> As the information about the owners is fetched from Kafka into Redshift, the responsibility of the asset can be modified to include the “Technical Steward” – an engineering team that is accountable for the Spark ETL job, including producing and maintaining the actual source data, and responsible for technical documentation of data and troubleshooting data issues.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-29-spark-lineage/i6.png" alt="" /></div><p><strong>Establishing the lineage:</strong> Once the assets and the required attributes are made available in Collibra, we establish the 2-way relation to depict source to Spark ETL job and Spark ETL job to target
relation. The relations are established using a Collibra REST API POST call. After the relations are created, the lineage is auto-created and is available under the diagram section of the asset. There are multiple views that can be used for depicting the relations among the established Collibra assets, but “Lineage View” captures the dependencies all the way to Tableau dashboards (see Figure 1).</p><p>Thanks to Cindy Gao, Talal Riaz, and Stefanie Thiem for designing and continuously improving Spark-Lineage, and thanks to Blake Larkin, Joachim Hereth, Rahul Bhardwaj, and Damon Chiarenza for technical review and editing the blog post.</p><div class="island job-posting"><h3>Become an ML Engineer at Yelp</h3><p>Want to build state-of-the-art machine learning systems at Yelp? Apply to become a Machine Learning Engineer today.</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/855b8be8-29b3-40c6-be1f-dd1f22663cc8?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2022/06/spark-data-lineage-in-collibra.html</link>
      <guid>https://engineeringblog.yelp.com/2022/06/spark-data-lineage-in-collibra.html</guid>
      <pubDate>Wed, 29 Jun 2022 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[A Simply, Ordinary Reduction]]></title>
<description><![CDATA[<p>Experimentation has become standard practice for companies, and one of the most important aspects is how to evaluate the results to make ship/no-ship decisions. Have you run into experiments where you don’t have enough data for statistically significant results, or where the performance of your primary metric seemingly disagrees with that of your secondary metrics? If so, leveraging existing features to perform variance reduction may help with coming to a conclusion. At Yelp, we have found that using features typically used in ML modeling, in particular, can help with measuring treatment effects better than solely using t-tests!</p><p>Before deciding to fully launch a new feature, you will typically want to have some confidence that the feature will actually lead to some form of a win (e.g. engagement, revenue, etc.). To test the feature change, one of the most common approaches is an A/B experiment. At its simplest level, start by randomly assigning half of your users to see the new feature and the other half not to. Once the experiment has run for a sufficiently long time, you can compare the results.</p><p>For this comparison of the control and treatment cohorts, standard practice is to use a <a href="https://www.investopedia.com/terms/t/t-test.asp">t-test</a> to determine if the two cohorts have statistically significant differences. First, you need to choose some metric to represent the performance of each cohort. Once you have calculated the metric for each user in the control and treatment cohorts, the treatment lift can simply be the average of treatment metrics minus the average of control metrics.
To determine if that lift is statistically significant, use a t-test to compare the two sets of metrics for the control and treatment cohorts.</p><p>While this all sounds great in theory, one of the key downsides of only using a t-test is that when there is a significant amount of unexplained variation in the comparison metric, you may have to run the experiment longer than you would like to reach a statistically significant difference. This is where variance reduction techniques come into play. To start this blog post, let’s go through a demo of how we would use an Ordinary Least Squares regression to help in our experiment analysis! Ordinary Least Squares regression is reminiscent of a <a href="https://www.youtube.com/watch?v=CUjrySBwi5Q&amp;ab_channel=FunnyTikTok">certain popular TikTok video</a> that will serve as a great guide as we learn more about how it works.</p><h2 id="a-fresh-pie">A Fresh Pie!</h2><p>For our demo, let’s use Yelp, a company you are hopefully very familiar with. One way Yelp helps connect users to local businesses is through ads on various parts of the Yelp website/app. Let’s say we identify a specific segment of advertisers who could really benefit from spending slightly more on their advertising with Yelp, and we build a new feature on the Yelp dashboard to encourage this. We believe that if a business owner sees this on their Yelp dashboard, they will be more likely to increase their advertising budget with Yelp.</p><p>As a side note, in practice, we are actually working on product features that give advertisers the best spending recommendations (see the Budget Design and Infrastructure Updates section of this <a href="https://blog.yelp.com/news/yelp-releases-new-yelp-for-business-features-enabling-more-effective-advertising-and-adding-control-and-value-for-business-owners/">blog post</a>) for every local business here at Yelp!</p><p>Now back to the demo!
We can set this up in Python with the following code snippet:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig1.png" alt="" /></div><p>In this code snippet, we have 50 new Yelp advertisers that visit the Yelp dashboard per day, and we run this experiment for a month. Let’s assume that the current budget of each Yelp advertiser is normally distributed, with a minimum value of $5 since budgets cannot be negative. We also assume that the proposed treatment, on average, results in a $2 increase in the business owner’s advertising budget, which we assume to be normally distributed and independent of the business’s existing advertising budget.</p><p>Thus, the metric we want to compare is the post-treatment advertising budget between the control and treatment cohorts, to see if there is a statistically significant difference. Here is how we would do this with a t-test:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig2.png" alt="" /></div><p>What we observe is that, on average, post-treatment advertising budgets are ~$1.80 higher in the treatment cohort than in the control cohort (well within the variation we set for the expected $2 budget increase). More importantly, we see from the t-test that the <a href="https://www.investopedia.com/terms/p/p-value.asp">p-value</a> for the difference in resulting advertising budgets between the two cohorts is 0.0473, which means this difference is indeed statistically significant. Perfect, we are now more confident that our treatment has the desired effect of increasing advertisers’ budgets!</p><h2 id="save-me-a-slice">Save Me a Slice!</h2><p>Now I know what you’re thinking. That was quite a lot of assumptions we made to simplify our A/B experiment, so let’s complicate things quite a bit. Different advertisers at Yelp have different budgeting needs.
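Since the post's snippets are rendered as images, here is a rough sketch of the kind of simple setup and t-test described above; the exact distributions, parameters, and variable names are assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 50 * 30 // 2  # 50 new advertisers/day for a month, split evenly into two cohorts

# Pre-treatment budgets: normally distributed, floored at $5
control_pre = np.maximum(rng.normal(50, 10, n), 5)
treatment_pre = np.maximum(rng.normal(50, 10, n), 5)

# The treatment adds ~$2 on average, independent of the existing budget
control_post = control_pre
treatment_post = treatment_pre + rng.normal(2, 1, n)

lift = treatment_post.mean() - control_post.mean()
t_stat, p_value = stats.ttest_ind(treatment_post, control_post)
```

With enough samples relative to the budget spread, the t-test detects the ~$2 lift; shrink `n` or widen the budget distribution and the p-value climbs, which is exactly the problem variance reduction addresses.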
Let’s incorporate this difference into our code and try running the same t-test.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig3.png" alt="" /></div><p>This defines three possible types of advertiser budgets: <code class="language-plaintext highlighter-rouge">low</code>, <code class="language-plaintext highlighter-rouge">mid</code>, and <code class="language-plaintext highlighter-rouge">high</code>, each with a different distribution of advertising budgets, making up 25%, 50%, and 25% of Yelp’s advertisers respectively.</p><p>From the results of the t-test, we can see that the difference between the advertising budgets of the treatment and control cohorts is negative and, more importantly, we do not observe a statistically significant difference. The problem with running a t-test here is that the added variance from the three different types of advertiser budgets is treated as noise when, in reality, it can be explained.</p><p>As an exercise, let’s see what would happen if we ran the experiment longer to reduce this “noise.” If we were to run this experiment for 3 months, we would actually still see the same results, and it is not until the 4th month that we see a statistically significant lift in advertising budget in the treatment cohort.</p><p>We probably don’t want to be running an experiment for 4 months for a variety of reasons (e.g. this subset of advertisers might not even benefit from the increased advertising budget after that long). Let’s see if we can come to a different conclusion using an <a href="https://en.wikipedia.org/wiki/Ordinary_least_squares">Ordinary Least Squares</a> (OLS) regression where we define the dependent variable as the post-treatment advertising budget.
We can define it as the following:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig4.png" alt="" /></div><p>As an example, if we choose <code class="language-plaintext highlighter-rouge">cohort</code> as the only feature, which is equivalent to having an empty <code class="language-plaintext highlighter-rouge">X</code>, we will actually get the same results as our t-test.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig5.png" alt="" /></div><p>We can ignore most of the numbers in the summary of our OLS regression and focus on the <code class="language-plaintext highlighter-rouge">coef</code> and <code class="language-plaintext highlighter-rouge">P&gt;|t|</code> values for our treatment indicator feature (<code class="language-plaintext highlighter-rouge">cohort[T.Treatment]</code>). This treatment indicator feature is simply 1 if the advertiser belonged to the treatment cohort and 0 otherwise. One thing to note is that there is no feature for the Status Quo cohort since our OLS regression has selected it to be the reference group for the <code class="language-plaintext highlighter-rouge">cohort</code> feature. We can manually select the reference group as an input to the OLS regression if necessary; otherwise, one is chosen automatically.</p><p>The coefficient of this feature, <code class="language-plaintext highlighter-rouge">coef</code>, represents the average effect that being in the treatment group has on the advertiser’s post-treatment budget. Thus, like the t-test, we are seeing that the treatment leads to a $1.85 lower post-treatment advertising budget. <code class="language-plaintext highlighter-rouge">P&gt;|t|</code> represents the statistical significance of this feature’s coefficient, which for this OLS regression has the same p-value we calculated in our t-test.
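This equivalence between a treatment-indicator-only OLS and a two-sample t-test is easy to verify with statsmodels' formula API; the data and column names below are made up for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cohort": ["Status Quo"] * 400 + ["Treatment"] * 400,
    "post_budget": np.concatenate([rng.normal(50, 10, 400),
                                   rng.normal(52, 10, 400)]),
})

# OLS with the treatment indicator as the only feature...
fit = smf.ols("post_budget ~ cohort", data=df).fit()
ols_coef = fit.params["cohort[T.Treatment]"]
ols_p = fit.pvalues["cohort[T.Treatment]"]

# ...reproduces the two-sample t-test (equal variances assumed)
t_stat, t_p = stats.ttest_ind(df.loc[df.cohort == "Treatment", "post_budget"],
                              df.loc[df.cohort == "Status Quo", "post_budget"])
```

The fitted coefficient equals the raw mean difference between cohorts, and its p-value matches the t-test's, up to floating-point noise.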
Again, in this example, we see that the coefficient is not statistically significant, with a value of 0.386.</p><p>Rather than just use <code class="language-plaintext highlighter-rouge">cohort</code> as a feature, however, let’s see what happens when we add pre-treatment advertising budgets as part of <code class="language-plaintext highlighter-rouge">X</code>. Since cohort assignments are randomly picked, there should be no violation of independence between the treatment label and this second feature.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig6.png" alt="" /></div><p>What we observe now is that the coefficients of both features in our OLS are significant. The values themselves are also informative! The coefficient for the treatment cohort matches our expectations of a $2 increase in advertising budget, and the coefficient for pre-treatment advertising budget is 1. Thus, our model essentially believes that the post-treatment budget can be represented by the pre-treatment budget plus $2 if the advertiser was in the treatment cohort.</p><p>To truly understand how much time using an OLS regression with informative predictors can save in an experiment, we can create an A/A test and run a power analysis. For the A/A test, we will run the same two OLS regressions as above, but set the <code class="language-plaintext highlighter-rouge">treatment_effect</code> equal to the <code class="language-plaintext highlighter-rouge">sq_effect</code>. Once we have these two regressions, we can calculate an estimate of the population standard deviation of our treatment indicator feature from the std err output and use that in our power analysis.</p><p>Let’s assume a relatively standard alpha of 0.05 and beta of 0.2. If we wanted to detect a minimum effect size of $0.10 without pre-treatment budget as a feature, we would need over 10,000,000 samples.
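A power calculation along these lines can be sketched with statsmodels; the residual standard deviations below are illustrative assumptions, not the post's actual numbers:

```python
from statsmodels.stats.power import TTestIndPower

# Minimum detectable effect in dollars, converted to Cohen's d by dividing by
# the residual standard deviation of the outcome.
mde = 0.10
analysis = TTestIndPower()

# Without the pre-treatment covariate, most of the budget spread is unexplained
# (assume a residual sd of $15), so the standardized effect is tiny.
n_without = analysis.solve_power(effect_size=mde / 15.0, alpha=0.05, power=0.8)

# With a highly predictive covariate, the residual sd shrinks (assume $1).
n_with = analysis.solve_power(effect_size=mde / 1.0, alpha=0.05, power=0.8)
```

Because required sample size scales with 1/d², cutting the residual standard deviation by 15x cuts the required samples by roughly 225x.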
Note, this is equivalent to our initial methodology of just running a t-test.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig7.png" alt="" /></div><p>Instead, if we add pre-treatment budgets as a feature to detect the same minimum effect size, we’ll need 800 samples. This illustrates the immense impact that informative predictors can have on making the correct ship/no-ship decision with significantly shortened experiment lengths.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig8.png" alt="" /></div><h2 id="thats-enough-slices">That’s Enough Slices!</h2><p>If you’re still not convinced that using an OLS regression is necessary, I would absolutely agree. In the previous example, we could have run a t-test to look at the differences between post and pre-treatment budgets as our primary metric.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig9.png" alt="" /></div><p>Let’s make things even more complicated then! Our other assumption was that the treatment would have the same effect on all advertisers, which is very rarely the case in practice. Let’s replicate this behavior in our demo by varying the treatment effect for the category that the advertiser is a part of. For example, let’s say that <code class="language-plaintext highlighter-rouge">restaurant</code>, <code class="language-plaintext highlighter-rouge">plumber</code>, <code class="language-plaintext highlighter-rouge">electrician</code> categories make up 25%, 25%, and 50% of all Yelp advertisers. 
Although not true in practice, let’s also assume that the category of advertiser and their advertising budget are independent of one another.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig10.png" alt="" /></div><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig11.png" alt="" /></div><p>Let’s now run the same OLS and add category as a feature, since we know that the treatment effect depends on what category the business is a part of.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig12.png" alt="" /></div><p>As before, our reference group for <code class="language-plaintext highlighter-rouge">category_type</code> is the <code class="language-plaintext highlighter-rouge">electrician</code> value, so we do not see an indicator feature for that specific category of advertiser.</p><p>Unfortunately, the results aren’t exactly what we would have expected. For example, if a business owner is an <code class="language-plaintext highlighter-rouge">electrician</code>, our model would predict that the treatment would increase their advertising budget by ~$0.75, whereas, in reality, it should have decreased their budget by $0.50. This $0.75 represents the average treatment effect on advertising budget, since over all advertisers, 50% (electricians) will see a budget decrease of $0.50, 25% (plumbers) will see a budget increase of $4, and 25% (restaurants) will see no treatment effect (<code class="language-plaintext highlighter-rouge">-$0.50*50% + $4*25% + $0*25% = $0.75</code>). Sometimes this is actually all you need, especially when you just want to understand what will happen if Yelp decides to treat a randomly selected advertiser.</p><p>Say we want to dive deeper and understand the treatment effect for each category.
The problem with our current OLS model is that we are unable to capture the interaction effect between categories and what the conditional treatment effect will be. To remedy this, let’s leverage interaction variables in our OLS by multiplying the treatment label with each categorical feature in the following manner:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig13.png" alt="" /></div><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig14.png" alt="" /></div><p>Now we can see the OLS regression has two additional interaction features. For example, <code class="language-plaintext highlighter-rouge">cohort[T.TREATMENT]:category_type[T.plumber]</code> will be 1 if an advertiser is a plumber and in the treatment group and 0 otherwise. Essentially, this feature, combined with our treatment indicator feature, will give us the average treatment effect on advertising budget for plumbers. It is also worth noting that the <code class="language-plaintext highlighter-rouge">category_type</code> features alone are not statistically significant, which makes sense since category alone should not affect a business’s advertising budget in our example.</p><p>This is both more interpretable and more consistent with the data we generated in our demonstration. For each category of advertiser, we can see that:</p><ol><li><code class="language-plaintext highlighter-rouge">electrician</code>: There is no interaction term because this is our reference group for <code class="language-plaintext highlighter-rouge">category_type</code>. Thus, the treatment effect is simply the coefficient of the treatment label, -0.4450, so the advertising budget is roughly $0.50 less, as expected.</li>
<li><code class="language-plaintext highlighter-rouge">plumber</code>: The coefficient of the interaction term is 4.4844, so if we add the (negative) coefficient of the treatment label, the advertising budget is roughly $4 more, also as expected.</li>
<li><code class="language-plaintext highlighter-rouge">restaurant</code>: The coefficient of the interaction term is 0.5002, so if we add the (negative) coefficient of the treatment label, the advertising budget is roughly neutral, also as expected.</li>
<li>Also, note that the coefficients for all the features mentioned in this section are statistically significant.</li>
</ol><p>Thus, we have been able to show that using an OLS for variance reduction can significantly help with two parts of experiment analysis: decreasing the amount of time we need to run the experiment as well as giving insight into the varying effects that the proposed treatment will have on different populations in the experiment.</p><h2 id="requirements-for-ols">Requirements for OLS</h2><p>Now that we have gone through a demonstration of how an OLS regression may help with experiment analysis, let’s talk about some of the caveats of performing this type of analysis.</p><ol><li>The first is fairly straightforward; using an OLS regression is a conditional expectation and will give the average effect of a feature if we do not include any interaction terms.
<ul><li>In practice, the treatment effect will likely not be uniform across all subjects. For example, if our treatment has a larger effect on businesses with more reviews on Yelp, an OLS regression without interaction terms would not be able to distinguish the treatment effect between two identical businesses with 0 and 10 reviews from that between two identical businesses with 1000 and 1010 reviews.</li>
<li>Despite this, starting with an OLS regression can still help with identifying predictive features and can sometimes be all you may want or need in your experiment analysis.</li>
</ul></li>
<li>The second caveat is that the selection of businesses for treatment must be independent of other features used in the OLS regression.
<ul><li>When running a randomized experiment, this criterion will usually be met, as whether or not a business receives the treatment is random.</li>
</ul></li>
<li>Variance reduction will only be noticeable when the features are highly predictive of the dependent variable.
<ul><li>The theory behind variance reduction is that we want to attribute what a t-test would consider unexplainable noise to other features that can explain it. If these features are unable to explain much, we would not significantly reduce variance and would be no better off than running a t-test for analyzing the experiment.</li>
</ul></li>
<li>Be careful with regularization!
<ul><li>Adding a regularization term to your OLS regression can give you a biased read on the coefficients, because they will likely be smaller than their original, unbiased values.</li>
</ul></li>
</ol><p>Arguably, the most important requirement when we perform variance reduction is that the features we use must be pre-treatment values. If we do use post-treatment features, there are two possible scenarios:</p><ol><li>The treatment has no effect on the post-treatment feature.
<ul><li>If the post-treatment feature has no effect on what we are trying to predict, we actually don’t accomplish anything by including the post-treatment feature. In fact, if we add too many useless features, we may incorrectly inflate the standard error of the treatment indicator due to a decrease in degrees of freedom from the extra features.</li>
<li>If the post-treatment feature does have an effect on what we are trying to predict, our coefficient for the treatment indicator feature will remain the same, but the statistical significance of that feature may change. Since we are reducing the amount of total variance in the predictor with a post-treatment feature, the standard error of the treatment indicator feature will decrease.</li>
<li>Ultimately, if we are absolutely sure that the treatment will have no influence on some feature, it should be safe to add the post-treatment feature, but in practice, it is hard to make and prove such a statement.</li>
</ul></li>
<li>The treatment does have an effect on the post-treatment feature.
<ul><li>First, this violates one of our previous requirements since the treatment indicator is no longer independent of all other features.</li>
<li>Let’s also take an example from the literature to illustrate the problems with doing this in more detail. Let’s suppose our treatment is a Yelp advertising tutorial, and we are trying to measure the effect the tutorial has on businesses purchasing advertisements. Our post-treatment feature will be a sentiment score for each business towards Yelp, and for this scenario, assume that the Yelp advertising tutorial does lead to higher sentiment scores. This example is adapted from <a href="https://doi.org/10.7910/DVN/EZSJ1S">Montgomery et al</a>:</li>
</ul></li>
</ol><p><strong>Scenario 1</strong>: Sentiment scores have a relationship with purchasing ads (let’s assume a positive one). If this is the case and we include it as a feature, the higher ads purchase rate will be attributed to higher sentiment scores rather than to the treatment, causing the coefficient of the treatment indicator of the OLS model to be biased. Simply removing the feature will correctly attribute the higher levels of ads purchases to the treatment label feature.</p><p><strong>Scenario 2</strong>: Sentiment score does not have any effect on purchasing ads. While it may seem harmless to include post-treatment sentiment scores as a feature, it actually is not. Let’s say there is a confounding feature such as business age, where older businesses tend to have higher rates of purchasing Yelp ads and higher sentiment scores towards Yelp.</p><ul><li>For businesses with higher sentiment scores, there are now two possibilities: they belong to an older business demographic, or they received the treatment. All else being equal, businesses that are older will have higher purchase rates than those that received the treatment. This will cause our OLS model to falsely associate the treatment with negatively impacting ads purchase rates when we hold the post-treatment sentiment score feature equal.</li>
<li>Note that if we include business age as a feature in the OLS, this would no longer be an issue. However, because we cannot identify every such unknown confounder, we will always face the possibility of a biased coefficient on our treatment indicator if we decide to include post-treatment features.</li>
</ul><p>We also want to note that stale features can be problematic as well: features that are too old are no longer informative predictors of our dependent variable, which in turn weakens the variance reduction in our experiment analysis (see Caveat 3 above).</p><p>This highlights the importance of having time-travelable features, or more specifically, an ETL with the ability to generate event-based features. There exist numerous online resources (e.g. from <a href="https://netflixtechblog.com/distributed-time-travel-for-feature-generation-389cccdd3907">Netflix</a>) discussing the benefits of time travelability and feature logging, but those primarily focus on how this infrastructure supports robust training processes for machine learning (ML) models. Time travelability allows ML practitioners to generate features as of prediction time, since features generated any later would result in label leakage.</p><p>What these articles don’t cover, and what we have done at Yelp, is leverage the same ETL to generate pre-treatment feature sets, since we know exactly when treatment occurs for our population (essentially, replace prediction time with treatment time in the ETL). This allows a proper setup of pre-treatment features if we decide to use an OLS regression in an experiment analysis.</p><h2 id="conclusion">Conclusion</h2><p>TL;DR: Using an OLS regression may be superior to a t-test for interpreting experiment results!</p><ul><li>The simplest form of an OLS regression is equivalent to a t-test, where the only feature is the treatment indicator label.</li>
<li>The more variance introduced into an experiment, which happens naturally in the real world, the more likely it is that a t-test will not be sufficient.</li>
<li>With an OLS regression, we can also leverage interaction terms if there is reason to believe that treatment will affect separate populations differently.</li>
<li>Of all the criteria of using an OLS regression, we would like to emphasize the importance of not using post-treatment features as this can significantly distort the interpretations of treatment effects.</li>
<li>Overall, an OLS regression can more accurately capture treatment effects on specific segments of your population, and with significantly less experiment run time, when we have highly predictive pre-treatment features.</li>
</ul><p>As a side note, we would also like to call out that this is not the first time this technique has been used for experiment analysis. Please see a prior <a href="https://engineeringblog.yelp.com/2021/07/analyzing-experiments-with-changing-cohort-allocations.html">Blog Post</a> by Alexander Levin about how we can use the same technique to account for mixshift changes over the course of an experiment.</p><h2 id="acknowledgments">Acknowledgments</h2><ul><li>Shichao Ma for the idea to try this when analyzing an experiment we designed and ran</li>
<li>Yang Song for reviewing and adding helpful comments</li>
</ul><div class="island job-posting"><h3>Become an Applied Scientist at Yelp!</h3><p>Are you intrigued by data? Uncover insights and carry out ideas through statistical and predictive models.</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/5b9e5f45-b501-447f-857b-72ee24699765/Applied-Scientist-Remote?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2022/06/a-simply-ordinary-reduction.html</link>
      <guid>https://engineeringblog.yelp.com/2022/06/a-simply-ordinary-reduction.html</guid>
      <pubDate>Mon, 27 Jun 2022 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Data Sanitization with Vitess]]></title>
<description><![CDATA[<p>Our community of users will always come first, which is why Yelp takes significant measures to protect sensitive user information. In this spirit, the Database Reliability Engineering team implemented a data sanitization process long ago to prevent any sensitive information from leaving the production environment. The data sanitization process still enables developers to test new features and asynchronous jobs against a complete, real-time dataset without complicated data imports. MySQL and other open source project innovations over the last decade have led us on a journey to Vitess, which is now responsible for over 1500 workflows across more than 100 database schemas that serve the sanitized data needs of all of our developers at the click of a button.</p><h2 id="vitess-concepts">Vitess Concepts</h2><p>The following are excerpts or paraphrases from the vitess.io site and will be helpful to know when seeing these terms used later on:</p><ul><li><a href="https://vitess.io/">Vitess</a> is a database clustering system for horizontal scaling of MySQL</li>
<li><a href="https://vitess.io/docs/14.0/concepts/vstream/">VReplication</a> is a system where a subscriber can indirectly receive events from the binary logs of one or more MySQL instance shards, and then apply them to a target instance</li>
<li><a href="https://vitess.io/docs/14.0/concepts/tablet/">vt-tablet</a> processes connect to a MySQL database, local or remote</li>
<li><a href="https://vitess.io/docs/14.0/concepts/vtctld/">vtctld</a> is an HTTP server useful for troubleshooting or getting a high-level overview of the state of Vitess</li>
</ul><h2 id="why-did-yelp-choose-vitess">Why did Yelp choose Vitess?</h2><p>Yelp began exploring Vitess in late 2019, when the need for new capabilities within our MySQL infrastructure was growing. Data sanitization was the most pressing need at the time, and the newly developed VReplication features would help improve the reliability and scalability of our existing sanitization system. The potential for Vitess to also serve as a data migration tool and multi-version replication medium in the future further tipped the scales in its favor.</p><h2 id="basics-of-our-mysql-setup">Basics of our MySQL Setup</h2><p>MySQL is the primary datastore for all transactional workloads at Yelp. The production environment contains more than 20 distinct replication clusters across cloud datacenters in multiple regions of the United States. Nearly every action a user takes on Yelp will be handled on the backend by MySQL. Our largest three MySQL clusters are responsible for serving over 300,000 queries per second over data measured in the tens of terabytes, not even counting the queries satisfied by the caching in front of them.</p><p>Each MySQL cluster is organized with a single source of row-based replication, depicted in the diagram below as “Primary”. Replication then continues on to an intermediary, which serves as the replication source to all leaves below it. Our leaves can have different roles, and may be consumer-facing or internal-facing. 
The “Replica” role is restricted to the leaf level and serves as the data sanitization source for our development environment.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/2022-05-25-data-sanitization-with-vitess/mysql-cluster-replication-hierarchy.png" class="c1" alt="image" /></div><h2 id="legacy-data-sanitization">Legacy Data Sanitization</h2><p>The ability to query data, test batches, and run developer playgrounds outside of the production environment against sanitized data was first provided using trigger-based standard MySQL 5.5.x Replication (statement-based replication).</p><div class="c4"><img src="https://engineeringblog.yelp.com/images/posts/2022-05-25-data-sanitization-with-vitess/firewall.png" class="c3" alt="image" /></div><p>Statement-based sanitization was inherently flawed, but usable as a rough approximation of production for many years. When rows are written, triggers on the sanitized database replica match patterns for fields such as addresses, emails, or names and obfuscate the values in a variety of ways.</p><p>Trigger-based sanitization came in various forms, the simplest of which was to clear the column, and then clear the column continuously going forward:</p><figure class="code"><figure class="highlight"><pre class="language-sql" data-lang="sql">UPDATE user SET last_name = '' ;
DROP TRIGGER IF EXISTS user_insert ;
DELIMITER ;;
CREATE TRIGGER user_insert BEFORE INSERT ON user FOR EACH ROW BEGIN SET NEW.last_name = '' ; END ;;
DELIMITER ;
DROP TRIGGER IF EXISTS user_update ;
DELIMITER ;;
CREATE TRIGGER user_update BEFORE UPDATE ON user FOR EACH ROW BEGIN SET NEW.last_name = '' ; END ;;
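-- Illustrative addition, not from the original system: other trigger rules
-- obfuscated values instead of clearing them, e.g. masking a hypothetical
-- email column in place (shown commented out):
-- CREATE TRIGGER user_email_update BEFORE UPDATE ON user FOR EACH ROW
--   BEGIN SET NEW.email = CONCAT('user', NEW.id, '@example.com') ; END ;;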
DELIMITER ;</pre></figure></figure><h2 id="flaws-in-the-trigger-based-system">Flaws in the Trigger-based System</h2><p>Among the trigger-based system’s worst flaws was that data correctness, even for the unsanitized columns, was never really achievable. Once data is obfuscated, it cannot always be updated or deleted through statement-based replication in the future.</p><figure class="code"><figure class="highlight"><pre class="language-sql" data-lang="sql">CREATE TABLE user (
  id int NOT NULL AUTO_INCREMENT,
  first_name varchar(32) COLLATE utf8_unicode_ci DEFAULT NULL,
  last_name varchar(32) COLLATE utf8_unicode_ci DEFAULT NULL,
  PRIMARY KEY (id),
  KEY last_name_idx (last_name)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci</pre></figure></figure><p>To illustrate, take this simple example, with the user_insert trigger in place and the simplified table structure above:</p><figure class="code"><figure class="highlight"><pre class="language-sql" data-lang="sql">INSERT INTO user (first_name,last_name) VALUES ('john','smith');
TRIGGER ACTION: BEFORE INSERT ON user FOR EACH ROW BEGIN SET NEW.last_name = '' ;
UPDATE user SET first_name = 'james' WHERE first_name = 'john' and last_name = 'smith';
No Rows Affected</pre></figure></figure><p>The result of the trigger sanitization is that the statement no longer matches as it would on an unsanitized host, effectively leaving rows impossible to reference in this manner.</p><p>Multiple terabytes of data had to be intermittently rebuilt from scratch due to infrastructure failures, migrations to the cloud, functionally sharding the source cluster, and version upgrades. When executing this process of copying, then sanitizing, and applying triggers, the engineering time required rose dramatically from hours to days and, towards the end of this implementation’s life, to a full week. To reduce manual intervention, backups were enabled so that these hosts rarely needed to be rebuilt from scratch. Testing was implemented to ensure the backups worked, but even so they became increasingly unwieldy as they were less and less able to use our standard tooling. Innovations in MySQL enabled by upgrading to newer versions eventually led to the failure of the trigger system in standard MySQL, as triggers do not execute on the replicas in row-based replication (RBR).</p><h2 id="mariadb-workaround-for-trigger-based-system">MariaDB Workaround for Trigger-based System</h2><p>Having overlooked the inability of triggers to execute on replicas, we quickly pivoted to find an alternative upon the rollout of RBR across our fleet. MariaDB proved to be a serviceable option, providing the ability to execute triggers on row-based events.</p><p>The downside to running with MariaDB, which we did for just over a year, was the necessity of maintaining two versions of every tool. 
While largely compatible with MySQL, the MariaDB tools subtly renamed a lot of the commands, implemented backups a little bit differently, and required maintaining two versions of packages.</p><h2 id="vitess-setup">Vitess Setup</h2><p>Our Vitess deployment consists of more than 2000 vt-tablets deployed across dozens of machines residing in our dev, staging, and production environments. These vt-tablets are responsible for VReplication of over 6000 distinct workflows that materialize data from one database instance to another that share no traditional MySQL replication. Several hundred of these vt-tablets are responsible for over 1500 workflows involved in the data sanitization process.</p><p>Much of our core setup is off the shelf, and the best resource for deploying Vitess can be found <a href="https://vitess.io/docs/get-started/operator/">here</a>. The implementation we went with for our initial deployment of Vitess was couched in the knowledge that we had no consumer-facing use cases, little local knowledge of Vitess, and a likely need to implement and materialize data in the future. Knowing that, and that no sharding was needed for this use-case, we created a slimmed-down deployment to only include vtctld and vt-tablet containers.</p><p>Our tablets all connect to external MySQL databases, and were deployed on dedicated servers for vt-tablets and vtctlds. The tablets were launched in pairs, with a source tablet and a target tablet for each MySQL schema living on the same physical machine to minimize network transit. 
Tablet state is stored in Zookeeper, and actual tablet deployments are coordinated with a scheduled job and static configuration file managed by humans based on resource consumption of different tablets.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-05-25-data-sanitization-with-vitess/tablet-diagram.png" alt="Data flow through tablets" /><p class="subtle-text"><small>Data flow through tablets</small></p></div><p>Another role of MySQL hosts was created for each physical MySQL cluster Vitess would materialize data from, which we denoted as ‘migration’ hosts explicitly to serve the needs of Vitess VReplication. Like the ‘replica’ role, this role is exclusive to the leaf level of the replication hierarchy. The migration role is advertised via Envoy/Smartstack, <a href="https://engineeringblog.yelp.com/2020/11/minimizing-read-write-mysql-downtime.html">the discovery system used at Yelp</a>, and discovered by the appropriate vt-tablets. With its own role, the target (the writable, sanitized server) is discovered the same way by the vt-tablets, and sits in a full-fledged replication hierarchy with automatic failover targets available to maintain uptime and ease maintenance.</p><table><thead><tr><th>Ecosystem</th>
<th>Non-Prod</th>
<th>Prod</th>
</tr></thead><tbody><tr><td>Zookeeper</td>
<td>m5d.xlarge</td>
<td>m5d.xlarge</td>
</tr><tr><td>Tablet Hosts</td>
<td>c6i.4xlarge</td>
<td>c6i.12xlarge</td>
</tr></tbody></table><h2 id="vitess-materialization-logic">Vitess Materialization Logic</h2><p>The logic for data sanitization was previously captured as what to change a column value to after it was already seen in replication, and was not directly compatible with Vitess. Another way of thinking about this is that the unsanitized data was actually replicated to the sanitized server in the relay log, and then modified on write based on the trigger rules. With materialization rules, the unsanitized data is never replicated to the sanitized server; instead, data is retrieved directly from the source in a modified, or custom, fashion. In the process of creating this setup, we iterated over every table and created a purpose-built rule for sanitizing (or not) the data for use by our developers. All of our workflows are stored in a simple git repository for later re-use, such as for re-materialization or schema changes necessitating modification of the custom rules.</p><h3 id="example-of-a-simple-custom-materialization-rule">Example of a simple custom materialization rule:</h3><figure class="code"><figure class="highlight"><pre class="language-jql" data-lang="jql">{
  "workflow": "user_notes_mview",
  "sourceKeyspace": "yelp_source",
  "targetKeyspace": "yelp_target",
  "stop_after_copy": false,
  "tableSettings": [
    {
      "targetTable": "user_notes",
      "sourceExpression": "SELECT id, user_id, 'REDACTED' AS note, note_type, time_created FROM user_notes",
      "create_ddl": "copy"
    }
  ]
}</pre></figure></figure><h3 id="example-of-a-normal-materialization-rule">Example of a normal materialization rule:</h3><figure class="code"><figure class="highlight"><pre class="language-jql" data-lang="jql">{
  "workflow": "user_mview",
  "sourceKeyspace": "yelp_source",
  "targetKeyspace": "yelp_target",
  "stop_after_copy": false,
  "tableSettings": [
    {
      "targetTable": "user",
      "sourceExpression": "SELECT * FROM user",
      "create_ddl": "copy"
    }
  ]
}</pre></figure></figure><p>There are over 1500 materialization rules in place to vreplicate some or all of the tables from over 100 database schemas into one monolithic database from multiple physical source clusters. At any given time there is near real-time VReplication happening between the originating write and the downstream sanitized write for each of the workflows. Co-locating all of the sanitized data was a conscious choice: it provides a single target for the playgrounds our developers connect to, eases management, and in the case of data corruption is simple to re-seed.</p><h2 id="vitess-performance-considerations">Vitess Performance Considerations</h2><p>We learned early on that workflows are not created equal: the more workflows that run on a schema, the more resources the source and target tablets use to manage the binary logs and data streaming. As a result of these heavyweight tablets, we had to scale up our instances and further coordinate which tablets run on which hosts in order to spread the load as evenly as possible. Load wasn’t the only limiting factor either, as running too many containers on a single server can become unstable and will eventually result in dockerd issues. In the final deployment, we are running over 250 tablets and attempt to keep the number of tablets per node to no more than 50 to limit the dockerd issues we encounter. These tablets are always paired, source and target, as seen below.</p><div class="c4"><img src="https://engineeringblog.yelp.com/images/posts/2022-05-25-data-sanitization-with-vitess/tablet-series.png" class="c5" alt="image" /></div><p>For deployments like this, it’s important to understand the impact large numbers of workflows will have on recovery in the event of failure. When enabling workflows you could also encounter throughput issues that are easier to intuit because the data is being actively copied by Vitess. 
Doing these materializations in chunks is an obvious optimization, and largely fixes the issues encountered during the course of standing up a sanitized database as we have done. If instead, though, an existing system fails (the host a tablet runs on dies, the target writable host dies, the service mesh goes down, etc.), recovering is not trivial.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-05-25-data-sanitization-with-vitess/workflow-pie-chart.png" alt="Workflows per database" /><p class="subtle-text"><small>Workflows per database</small></p></div><p>This chart shows the relative number of Vitess workflows per database schema. The bigger the slice, the more workflows.</p><p>We have more than 100 database schemas, and many have few workflows as visualized in the above chart. Upon failure, these smaller sets of workflows are able to rapidly re-read binary logs and pick up where their local state indicates they should. There are also three schemas with upwards of 100 tables, one with nearly 600, and these workflows each must re-establish their positions independently of each other (our workflows are all created 1:1 to tables). When a failure involves hundreds of workflows on one tablet, we found that stopping and starting them in a staggered way (for example, 25 every 3 minutes) can help the system recover to working order where it might never have recovered otherwise.</p><h2 id="vitess-to-the-future">Vitess to the Future</h2><p>With Vitess, Yelp was able to eliminate mountains of technical debt, bring in a tool with boundless potential, and improve the security and speed of our sanitization process. Our old system was no longer scaling, and we started to have lengthy manual maintenance cycles whenever a problem came up. 
Problems with Vitess are easy to fix, and best of all can be automated in most situations.</p><p>We have plans in motion for using k8s <a href="https://github.com/Yelp/paasta">paasta</a> instead of managing the infrastructure directly. Using the standard k8s operator and a more broadly understood deployment will help as we begin to utilize more Vitess components.</p><p>Other projects include one dubbed internally “Dependency Isolation”, where an existing binlog-based data-pipeline system is being moved away from the source clusters to one driven by Vitess. This allows us to decouple our consumer-facing cluster upgrades from the data pipeline databases, and to perform the upgrades consciously and independently. A third project in flight is designed to harness the ability to materialize read-only view tables into different database schemas, a common enough use of Vitess. Providing local read-only views of tables can allow for faster development cycles, and easier extraction of data from our monolith.</p><div class="island job-posting"><h3>Become a Database Reliability Engineer at Yelp</h3><p>Do you want to be a Database Reliability Engineer who builds and manages scalable, self-healing, globally distributed systems?</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/b3e09e7e-736a-4ca0-9d45-6fc6368b2796/Database-Reliability-Engineer-Remote?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2022/06/data-sanitization-with-vitess.html</link>
      <guid>https://engineeringblog.yelp.com/2022/06/data-sanitization-with-vitess.html</guid>
      <pubDate>Wed, 22 Jun 2022 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Beyond Matrix Factorization: Using hybrid features for user-business recommendations]]></title>
<description><![CDATA[<p>Yelp’s mission is to connect people with great local businesses. On the Recommendations &amp; Discovery team, we sift through billions of user-business interactions to learn user preferences. Our solutions power several products across Yelp such as personalized push notifications, email engagement campaigns, the home feed, Collections and more. Here we discuss the generalized user-to-business recommendation model which is crucial to many of these applications.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-04-25-beyond-matrix-factorization-blog/high-level-overview.png" alt="High level overview of our recommendation system." /><p class="subtle-text"><small>High level overview of our recommendation system.</small></p></div><p>Our previous approach for user-to-business recommendation was based on <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.recommendation.ALS.html">Spark’s Alternating Least Squares (ALS)</a> algorithm, which factorized the user-business interaction matrix into user-vectors and business-vectors. By performing a dot-product on top of these vectors we are able to come up with top-k recommendations for each user. We explained the approach in detail in a prior blog - <a href="https://engineeringblog.yelp.com/2018/05/scaling-collaborative-filtering-with-pyspark.html">“Scaling Collaborative Filtering with PySpark”</a>.</p><p>In this blog, we discuss how we switched from a collaborative filtering approach to a <strong>hybrid approach</strong> - which can handle multiple features and be trained on different objectives. The new approach doubled the number of users we could recommend for while also vastly improving performance for all users. 
The main takeaway here is that we were able to achieve these results quickly by having a <strong>clearly defined objective</strong> and following a <strong>cost-efficient design</strong>, which saved huge development costs for our initial Proof of Concept.</p><p>We start by discussing the drawbacks of matrix factorization, followed by some guidelines that shaped our approach. We then present the solution along with the challenges and the improvements gained.</p><p>Matrix factorization learns ID-level vectors for each user and business and requires a good number of user/business level interactions. This leads to a couple of major drawbacks:</p><ol><li>Worse performance on tail users (users who have very few interactions).</li>
<li>An inability to add content-based features such as business reviews, ratings, user segment, etc.</li>
</ol><p>Because of drawback #1, we identify two segments of users - head and tail.</p><ul><li><strong>Head users</strong> have enough interactions with businesses to learn vector representations using the matrix factorization approach.</li>
<li><strong>Tail users</strong> have very few interactions and suffer from the cold-start problem. They were excluded from matrix factorization which resulted in better performance on head users and also made the approach more scalable.</li>
</ul><p>The solution for drawbacks 1 and 2 is to use a hybrid approach which uses content-based features in addition to interaction features. In the evaluation section, we show how content and collaborative features could play different roles for these user types in a hybrid model which results in a better model performance.</p><p>In our initial exploration phase we considered approaches like <a href="http://staff.ustc.edu.cn/~hexn/papers/www17-ncf.pdf">Neural Collaborative Filtering</a>, a <a href="https://www.tensorflow.org/recommenders">Two tower model</a>, a <a href="https://www.tensorflow.org/api_docs/python/tf/keras/experimental/WideDeepModel">WideDeep model</a> and <a href="https://snap.stanford.edu/graphsage/">GraphSage</a>. Even though implementations for these approaches were readily available, we found them to be either hard to scale for our problem size or poorly performant when used off the shelf.</p><p>To be cost-efficient and gather early feedback, we took an iterative approach towards building a custom solution. We set the following guidelines to adhere to the <strong>iterative design</strong>:</p><ul><li><strong><em>Model infrastructure first:</em></strong> Build a training, evaluation and prediction pipeline with a few clearly defined objectives.</li>
<li><strong><em>Reduce dev-effort when you can:</em></strong> Use a supervised technique like <a href="https://xgboost.readthedocs.io/en/stable/">XGBoost</a> (or <a href="https://en.wikipedia.org/wiki/Logistic_regression">Logistic Regression</a>) which were better supported by our ML infrastructure team.</li>
<li><strong><em>Know your friend:</em></strong> Replacing matrix factorization seemed like a farther out goal as it is known to work pretty well. So instead of replacing it, we planned to build our hybrid model on top of it by taking its scores as one of the key features.</li>
<li><strong><em>Gain more friends:</em></strong> Enrich signals used by the recommender by deriving a good set of content-based features.</li>
</ul><p>We used a <strong>supervised <a href="https://en.wikipedia.org/wiki/Learning_to_rank">learning to rank technique</a></strong> to combine both the content and collaborative approaches. The entire approach is summarized in the diagram below:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-04-25-beyond-matrix-factorization-blog/diagram-of-hybrid-ranking-approach.png" alt="A diagram of our hybrid recommendation approach." /><p class="subtle-text"><small>A diagram of our hybrid recommendation approach.</small></p></div><p>Similar to many other machine learning projects, we anticipated feature engineering to play a key role. Hence, most of our effort went into building a good set of features for the model to learn from. Features were extracted at the user-business level for a specific date which marks the end of the feature period.</p><p>The set of features can be categorized into two major buckets:</p><ul><li><strong>Interaction features:</strong> Include output affinity scores from matrix factorization and aggregates for different interaction types at user, business and user ✕ business level.</li>
<li><strong>Content-based features:</strong> Include features like categories of a business, review rating, review count, user type, user metadata, etc. Apart from general content features, we also added a <strong>text-based similarity</strong> feature computed between a user and a business.</li>
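As a toy illustration of such a text-based similarity feature (a hypothetical sketch in Python with numpy; the production feature is built from Universal Sentence Encoder review embeddings, as described below): review vectors are pooled into a business vector, the vectors of a user’s interacted businesses are pooled into a user vector, and the feature is their cosine similarity.

```python
# Hypothetical sketch (not Yelp's code) of a pooled-embedding text similarity
# feature between a user and a business.
import numpy as np

def mean_pool(vectors):
    """Average-pool a list of embedding vectors into one vector."""
    return np.mean(np.asarray(vectors, dtype=float), axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-d "review embeddings" standing in for sentence-encoder output
biz_a = mean_pool([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])  # business-level vector
biz_b = mean_pool([[0.0, 1.0, 0.0]])
user = mean_pool([biz_a])  # this user has interacted with business A only

# The user should look more similar to business A than to business B
assert cosine(user, biz_a) > cosine(user, biz_b)
```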
</ul><p>We derived the text-based similarity feature from Yelp’s business reviews. Reviews are encoded with a <strong><a href="https://tfhub.dev/google/universal-sentence-encoder-large/3">Universal Sentence encoder</a></strong> and later aggregated at the business level by either <strong>max or average <a href="https://d2l.ai/chapter_convolutional-neural-networks/pooling.html">pooling</a></strong>. The business-level embeddings were then aggregated at the user level by associating a user with all the businesses they have interacted with. The text-based similarity is computed as a cosine similarity between the user-level and the business-level aggregate embeddings. This feature turned out to be the most important content-based feature, as discussed in the evaluation section.</p><p>With all these features, we need an objective to optimize for, which we discuss below.</p><p>As we aimed to come up with a <strong>personalized ranked order of businesses</strong>, we used <a href="https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG">Normalized Discounted Cumulative Gain (NDCG)</a> as our primary metric. The relevance level for NDCG was defined based on the <strong>strength of interaction</strong> between a user and a business. For example, the gain from business views could be 1.0 as it’s a low-intent interaction, whereas bookmarks could be 2.0 as they are a stronger-intent interaction. To ensure there isn’t any label leakage, we made sure there is a clear separation between the time periods that features and labels were generated from.</p><p>To optimize for NDCG, we relied on <a href="https://xgboost.readthedocs.io/en/stable/">XGBoost’s</a> <strong><code class="language-plaintext highlighter-rouge">rank:ndcg</code></strong> objective, which internally uses the <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/MSR-TR-2010-82.pdf">LambdaMART</a> approach. 
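The graded-relevance NDCG described above can be sketched as follows (a hypothetical Python snippet, not Yelp's implementation; the relevance grades, such as 1.0 for a view and 2.0 for a bookmark, follow the post's example):

```python
# Hypothetical sketch (not Yelp's code) of NDCG with graded relevance labels.
import numpy as np

def dcg(relevances):
    relevances = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, len(relevances) + 2))  # log2(rank + 1)
    return float(np.sum((2.0 ** relevances - 1.0) / discounts))

def ndcg(relevances_in_ranked_order):
    ideal = sorted(relevances_in_ranked_order, reverse=True)
    best = dcg(ideal)
    return dcg(relevances_in_ranked_order) / best if best > 0 else 0.0

# A perfect ranking (bookmark, view, nothing) scores 1.0; reversing it scores less
assert ndcg([2.0, 1.0, 0.0]) == 1.0
assert ndcg([0.0, 1.0, 2.0]) < 1.0
```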
One thing that’s worth mentioning here is how we defined “groups” for the ranking task. XGBoost uses the group information to construct a pairwise loss where two training rows from the same group are compared against each other. Since our objective was to get personalized recommendations based on where the user is located, we defined our <strong>group based on both user and location</strong>. We use the same definition of groups when evaluating our models.</p><h2 id="negative-sampling">Negative sampling</h2><p>Since we are using supervised learning, our model will be most effective if it has both positive and negative interaction examples to learn from. For the hybrid user-business model, we don’t have negatives as most of the implicit user-business interactions (e.g. get directions, visit the website, order food, etc.) are positive. Deriving an implicit negative interaction like a user viewing but not interacting with a business is tricky and can be heavily biased, as it depends on what businesses were shown to them (sample bias), how they were shown to them (presentation bias), and so forth. Handling these biases is usually product-specific (e.g. the presentation biases for search vs. recommendation could be very different), which makes it harder to build a generalized user-to-business model. A more generalized approach for negative sampling would be to consider all non-interacted businesses as negatives. In fact, some of the common techniques to fetch negatives involve subsampling the non-positive candidates either randomly or based on popularity (see <a href="http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/">word2vec negative sampling</a>).</p><p>We take a generalized approach to negative sampling but we also introduce a <strong>recall-step</strong>. Candidate businesses are recalled using specific selection criteria like user preferences, location radius, etc., and only the recalled candidates are used to train the model. 
Candidates can be labeled as either positive or negative based on whether they had future interactions. This approach worked well for our use case for a couple of reasons:</p><ul><li>A recall step means we only evaluate a few candidates per user, which is more efficient and allows us to scale predictions up to millions of users and businesses.</li>
<li>A recall step ensures the relationship with the label is learned without bias, as long as a similar recall strategy is used at both training and prediction time. Common negative sampling techniques rely on resampling negatives multiple times during training (e.g. at each training iteration) to reduce bias, but this approach can be difficult to implement with supervised training like XGBoost.</li>
</ul><p>During training, we used a special type of recall that allowed the model to learn generic preferences instead of being very application specific. Users were associated with a sampled set of locations from their past history and a user’s top-k businesses for the locations were recalled using matrix factorization scores or business popularity. We downsampled the user, location pairs and businesses to make the training data size manageable.</p><h2 id="scaling-predictions">Scaling predictions</h2><p>Prediction is an intensive job where we need to identify top-k recommendations for tens of millions of users and businesses. When using the matrix factorization based approach we <a href="https://engineeringblog.yelp.com/2018/05/scaling-collaborative-filtering-with-pyspark.html">scaled the naive approach of evaluating all pairs of dot products</a> using numpy BLAS optimizations and a file-based broadcast on Pyspark. We couldn’t use the same approach here as both feature computation and XGBoost model evaluation are more expensive than just doing a dot-product.</p><p>To speed up prediction we added a <strong>recall step</strong> based on the downstream product application. These applications restrict recommendation candidates based on the following criteria:</p><ul><li><strong><em>User location:</em></strong> For localized recommendations we need to consider only businesses near the city or neighborhood where the user is located.</li>
<li><strong><em>Product level constraints:</em></strong> Candidate businesses are further restricted by category or attribute constraints based on the product application (e.g. new restaurants for the Hot &amp; New business push-notification campaign, businesses with Popular Dishes for the Popular Dish push-notification campaign, etc.) These criteria let us narrow down the set of user-business pairs for which the model needs to be evaluated, thereby making the predictions more scalable.</li>
</ul><h2 id="evaluation">Evaluation</h2><p>To prove to ourselves that the hybrid approach works better, we evaluated the models offline based on historical data. We also evaluated the model subjectively by running a survey among a few Yelp employees who were tasked with rating recommendation rankings from different approaches. Both these evaluations suggested that the new hybrid approach performs much better than the baseline approaches. Here, we share the metrics-based results.</p><p>Since this is a ranking task, we chose <strong><a href="https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG">Normalized Discounted Cumulative Gain</a> (NDCG)</strong> and <strong><a href="https://queirozf.com/entries/evaluation-metrics-for-ranking-problems-introduction-and-examples#map-mean-average-precision">Mean average precision</a> (MAP)</strong> as metrics. The hybrid approach was compared against a couple of baselines:</p><ol><li>Popular businesses in the user’s location - available for both head and tail users</li>
<li>Matrix factorization - available only for head users</li>
</ol><p>In order to mimic production settings using historical data, we created test sets which are in the future of the model’s training period (i.e. both feature generation period and label period were shifted into the future for the test set).</p><p>At first, we look at the relative improvement from the business popularity baseline at different values of rank (i.e. rank k=1, 3, 5, 10, 20, 30, .., 100). We find that the model <strong>more than doubles</strong> the NDCG and MAP metrics compared to a “locally popular” baseline at k=1!</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-04-25-beyond-matrix-factorization-blog/performance-against-business-popularity.png" alt="Relative percentage improvement of hybrid approach vs. the popularity baseline. We see positive improvements overall. At k=100 and user_type=head we see an improvement of 30% in the NDCG metric and 81% improvement in the MAP metric. At k=100 and user_type=tail we see a 20% improvement in NDCG metric and 52% improvement in the MAP metric." /><p class="subtle-text"><small>Relative percentage improvement of hybrid approach vs. the popularity baseline. We see positive improvements overall. At k=100 and user_type=head we see an improvement of 30% in the NDCG metric and 81% improvement in the MAP metric. At k=100 and user_type=tail we see a 20% improvement in NDCG metric and 52% improvement in the MAP metric.</small></p></div><p>When comparing with the matrix factorization baseline, the improvement at different ranks (k) roughly ranges between <em>5-14</em>% for NDCG and <em>10-13</em>% for MAP.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-04-25-beyond-matrix-factorization-blog/performance-against-matrix-factorization.png" alt="Relative percentage improvement of hybrid approach vs. the matrix factorization baseline. We see positive improvements overall. 
At k=100 and user_type=head we see an improvement of 5% in the NDCG metric and 9.8% improvement in the MAP metric." /><p class="subtle-text"><small>Relative percentage improvement of hybrid approach vs. the matrix factorization baseline. We see positive improvements overall. At k=100 and user_type=head we see an improvement of 5% in the NDCG metric and 9.8% improvement in the MAP metric.</small></p></div><h2 id="content-vs-collaborative-signals-for-head--tail-users">Content vs Collaborative signals for head &amp; tail users</h2><p>For a hybrid model to work effectively, it should use both <strong>content and collaborative signals to achieve the best of both worlds</strong>. For head users with a good number of collaborative signals it should rely more on these signals whereas for tail users it should rely more on content-based features. We wanted to validate whether this was indeed happening in our hybrid model.</p><p>To perform this analysis, we picked <strong>representative features for content and collaborative signals</strong>. Review text based similarity and matrix factorization score were the top features in the model and it made sense to pick these as representative features. We use <a href="https://christophm.github.io/interpretable-ml-book/pdp.html">Partial Dependence plots</a> (PDPs) against these features which shows the average prediction on the entire dataset when a feature is set to a particular value.</p><p>First, we plot PDP for head vs. tail users against feature percentiles of review text similarity. The plot below shows percentiles of the review text similarity feature (content similarity) on the x-axis and the average of prediction along with spread on y-axis. 
We see that tail users have a stronger relationship against this feature which indicates that the model relies more on the content-based feature for tail users.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-04-25-beyond-matrix-factorization-blog/pdp-text-similarity-head-vs-tail.png" alt="Partial dependence plot (PDPs) with content similarity percentile on x-axis and average prediction on the y-axis. We plot the PDPs for head vs. tail users separately. The plot shows a stronger relation for tail users." /><p class="subtle-text"><small>Partial dependence plot (PDPs) with content similarity percentile on x-axis and average prediction on the y-axis. We plot the PDPs for head vs. tail users separately. The plot shows a stronger relation for tail users.</small></p></div><p>Since matrix factorization scores are available only for head users, we plot PDP against collaborative vs. content for only head users. The plot below shows percentiles of content or collaborative features on the x-axis and average and spread of predictions on the y-axis.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-04-25-beyond-matrix-factorization-blog/pdp-content-vs-collaborative-for-head.png" alt="Partial dependence plot (PDPs) against feature percentiles. We overload the x-axis with percentiles from two features and plot two PDPs in the same plot - text similarity feature (content) and matrix factorization score (collaborative). The confidence spread shows a stronger relation for the collaborative feature as the spread narrows at higher percentiles." /><p class="subtle-text"><small>Partial dependence plot (PDPs) against feature percentiles. We overload the x-axis with percentiles from two features and plot two PDPs in the same plot - text similarity feature (content) and matrix factorization score (collaborative). 
The confidence spread shows a stronger relation for the collaborative feature as the spread narrows at higher percentiles.</small></p></div><p>We see that both the content and collaborative features are strongly related to the user business relevance prediction, which means that both features are used effectively in the hybrid model. The collaborative feature has a stronger relationship as the prediction spread narrows at higher percentiles. This suggests that head users have a detailed enough browsing history that lets us learn user specific preferences, for example that a user is vegetarian but doesn’t particularly like Thai food.</p><p>The above plots confirm our initial thoughts of how <strong>content and collaborative features can play different roles for different user types</strong> in the hybrid model.</p><ul><li><strong><em>Write down your objective:</em></strong> Recommendation is a vast space and there are a lot of approaches one could take to improve it. Our initial exploration phase had a lot of uncertainty. However, writing down our specific goals to “Provide model-based recommendations for tail users” and “Enable support for more content-based features” gave us the focus we needed to improve our models and made it easier to get buy-in from the product team.</li>
<li><strong><em>Set up model training infrastructure early:</em></strong> In the beginning, it was hard to debug and iterate with several copies of code scattered across ad hoc notebooks. Once we built out the first version of the training pipeline to include feature ETLs, sampling and label strategies, it was easy to iterate on each of these components separately.</li>
<li><strong><em>Think about evaluation early:</em></strong> We set up the baselines based on matrix factorization and business popularity very early in our model development. This made it easy to compare results against these baselines and iterate on the modeling and training phases until we beat them.</li>
<li><strong><em>Use subjective evaluation in conjunction:</em></strong> In addition to objective metrics, it is important to look at individual recommendations, feature importances and PDPs to make a better judgment. At one point, we had an issue with negative sampling where all negative samples came from matrix factorization top-k which made the model learn a negative relationship with respect to this feature. It’s hard to debug these issues without the help of model debugging tools.</li>
</ul><p>Switching to a hybrid approach was a major change in our user to business recommendation system. In this blog, we documented our journey in developing this new approach and are glad to see big improvements. We plan to run several additional A/B experiments for push notification and email notification campaigns to confirm that these improvements translate to better user experiences. Given the current infrastructure, we feel more confident to try more complex models based on neural networks.</p><p>If you are inspired by recommender systems, please check out the <a href="https://www.yelp.careers/us/en">careers page</a>!</p><p>This blog was a team effort. I would like to thank Blake Larkin, Megan Li, Kayla Lee, Ting Yang, Thavidu Ranatunga, Eric Hernandez, Jonathan Budning, Kyle Chua, Steven Chu and Sanket Sharma for their review and suggestions. Special thanks to Parthasarathy Gopavarapu for working on generating text-based embeddings.</p><div class="island job-posting"><h3>Become an ML Engineer at Yelp</h3><p>Want to build state of the art machine learning systems at Yelp? Apply to become a Machine Learning Engineer today.</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/e9a3e447-7271-431d-b8d3-29168c9c01ef/Software-Engineer-Machine-Learning-Remote?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2022/04/beyond-matrix-factorization-using-hybrid-features-for-user-business-recommendations.html</link>
      <guid>https://engineeringblog.yelp.com/2022/04/beyond-matrix-factorization-using-hybrid-features-for-user-business-recommendations.html</guid>
      <pubDate>Mon, 25 Apr 2022 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Kafka on PaaSTA: Running Kafka on Kubernetes at Yelp (Part 2 - Migration)]]></title>
      <description><![CDATA[<p>In a <a href="https://engineeringblog.yelp.com/2021/12/kafka-on-paasta-part-one.html">previous post</a> we detailed the architecture and motivation for developing our new <a href="https://engineeringblog.yelp.com/2015/11/introducing-paasta-an-open-platform-as-a-service.html">PaaSTA</a>-based deployment model. We’d now like to share our strategy for seamlessly migrating our existing Kafka clusters from <a href="https://aws.amazon.com/pm/ec2/">EC2</a> to our <a href="https://kubernetes.io/">Kubernetes</a>-based internal compute platform. To help facilitate the migration, we built tooling which interfaced with various components of our cluster architecture to ensure that the process was automated and did not impair clients’ ability to read or write Kafka records.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-03-03-kafka-on-paasta-part-two/ec2_to_paasta.png" alt="Migrating Kafka on EC2 to Kafka on PaaSTA" /><p class="subtle-text"><small>Migrating Kafka on EC2 to Kafka on PaaSTA</small></p></div><h2 id="background">Background</h2><p>In the status quo implementation, EC2-backed Kafka brokers within a cluster were associated with an <a href="https://docs.aws.amazon.com/autoscaling/ec2/userguide/AutoScalingGroup.html">auto scaling group</a> (ASG). Attached to each ASG was an <a href="https://aws.amazon.com/elasticloadbalancing/">Elastic Load Balancer</a> (ELB) which facilitated all connections to the cluster and acted as an entrypoint. Several auxiliary services and jobs also accompanied each cluster, but most of these were already deployed on PaaSTA. However, some important management systems ran directly on Kafka servers as cron jobs. Of particular importance for this redesign were the <a href="https://github.com/Yelp/kafka-utils/blob/master/kafka_utils/kafka_cluster_manager/cmds/rebalance.py">cluster rebalance algorithm</a> and the topic auto partitioning algorithm. 
The rebalance algorithm attempts to evenly distribute partitions and leaders across the brokers of the cluster, while the auto partitioning algorithm automatically sets topic partition counts based on throughput metrics. Since we were already planning on incorporating Cruise Control in our architecture, now was a good time to migrate to a new rebalancing algorithm.</p><p>Thus, the three critical components we focused on replacing during this migration were the cluster entrypoint, the cluster balancing algorithm, and the topic auto partitioning algorithm. We didn’t need to look far for a replacement to the ELB since PaaSTA natively provides load balancing capabilities through Yelp’s service mesh, which makes it simple to advertise the Kafka on Kubernetes containers which compose a cluster. In the status quo EC2 scenario we also ran a custom rebalance algorithm on Kafka hosts, but this was ultimately replaced by Cruise Control (see <a href="https://engineeringblog.yelp.com/2021/12/kafka-on-paasta-part-one.html">part 1</a> for more details on this service) which exposed comparable functionality. Finally, our <a href="https://puppet.com/">Puppet</a>-based cron job running a topic auto partitioning script was replaced with a similar <a href="https://github.com/Yelp/Tron">Tron</a> job running on PaaSTA. Below is a table providing an overview of the different components across the deployment approaches.</p><table><thead><tr><th>Component</th>
<th>EC2</th>
<th>PaaSTA</th>
</tr></thead><tbody><tr><td>Cluster Entrypoint</td>
<td><a href="https://aws.com">ELB</a></td>
<td>Yelp’s service mesh</td>
</tr><tr><td>Cluster Balance</td>
<td><a href="https://github.com/Yelp/kafka-utils/blob/master/kafka_utils/kafka_cluster_manager/cmds/rebalance.py">rebalance algorithm in kafka-utils</a></td>
<td><a href="https://github.com/linkedin/cruise-control">Cruise Control</a></td>
</tr><tr><td>Topic Auto Partitioning</td>
<td>cron job (Puppet-based)</td>
<td><a href="https://github.com/Yelp/Tron">Tron</a> job</td>
</tr></tbody></table><figure class="code"><figcaption class="c1">Table of Components Used by Each Deployment Approach</figcaption></figure><p>Since we would not be migrating all of our clusters simultaneously, we wanted to avoid the need to make significant changes to our Kafka cluster discovery configuration files. For additional context, at Yelp we use a set of <code class="language-plaintext highlighter-rouge">kafka_discovery</code> files (generated by Puppet) which contain information about each cluster’s bootstrap servers, <a href="https://zookeeper.apache.org/">ZooKeeper</a> chroot, and other metadata. Many of our internal systems (such as <a href="https://github.com/Yelp/schematizer">Schematizer</a> and <a href="https://engineeringblog.yelp.com/2020/01/streams-and-monk-how-yelp-approaches-kafka-in-2020.html">Monk</a>) rely on the information in these files. This migration strategy entailed updating only the broker_list to point to the service mesh entrypoint, thereby retaining compatibility with our existing tooling. We did take this migration as an opportunity to improve the propagation method by removing Puppet as the source of truth and instead opted to use srv-configs (the canonical place for configurations used by services). An example discovery file is shown below:</p><div class="language-plaintext highlighter-rouge highlight"><pre>&gt;&gt; cat /kafka_discovery/example-cluster.yaml
---
clusters:
  uswest1-devc:
        broker_list:
        - kafka-example-cluster-elb-uswest1devc.&lt;omitted&gt;.&lt;omitted&gt;.com:9092
        - kafka-example-cluster-elb-uswest1devc.&lt;omitted&gt;.&lt;omitted&gt;.com:9092
        zookeeper: xx.xx.xx.xxx:2181,xx.xx.xx.xxx:2181,xx.xx.xx.xxx:2181/kafka-example-cluster-uswest1-devc
local_config:
  cluster: uswest1-devc
  ...
</pre></div><h2 id="migration-strategy-overview">Migration Strategy Overview</h2><p>At a high level the goal of the migration was to seamlessly switch from using EC2-compatible components to using PaaSTA-compatible components without incurring any downtime for existing producer and consumer clients. As such, we needed to ensure that all the new components were in place <em>before</em> migrating any data from EC2-based brokers to PaaSTA based-brokers. We also wanted to minimize the amount of engineering time required for the migrations, so we implemented some tools to help automate the process. Finally, we needed to ensure that this process was thoroughly tested and rollback-safe.</p><p>The first step of the migration process was to set up a PaaSTA-based load balancer for each of our Kafka clusters, which could also be used to advertise EC2-based brokers. This exposed two distinct methods of connecting to the Kafka cluster: the existing ELB and the new service mesh proxy which would be used for the PaaSTA-based brokers during and after the migration. This entailed updating the aforementioned <code class="language-plaintext highlighter-rouge">kafka_discovery</code> files to include the alternate connection method, and we also devised a new way to propagate these files with a cron job rather than rely on Puppet. As alluded to in the prior post, reducing our reliance on Puppet helped us halve the time to deploy a new Kafka cluster since we could alter and distribute these configuration files much more quickly. After this was done we also invalidated any related caches to ensure that no clients were using the outdated cluster discovery information. 
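A client of these discovery files might resolve its bootstrap servers along these lines. This is an illustrative sketch assuming PyYAML; the hostnames are placeholders, not real Yelp endpoints:

```python
import yaml  # PyYAML

# Trimmed-down stand-in for a kafka_discovery file like the one shown above.
DISCOVERY = """\
clusters:
  uswest1-devc:
    broker_list:
    - kafka-example-cluster-elb-uswest1devc.example.com:9092
    zookeeper: zk1:2181,zk2:2181/kafka-example-cluster-uswest1-devc
local_config:
  cluster: uswest1-devc
"""

config = yaml.safe_load(DISCOVERY)
cluster = config["local_config"]["cluster"]
brokers = config["clusters"][cluster]["broker_list"]
bootstrap = ",".join(brokers)  # bootstrap-servers string for a Kafka client
```

Because clients only read `broker_list`, repointing that field at the service mesh entrypoint migrates them without any code changes on their side.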
Below is a set of figures illustrating this process during the migration:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-03-03-kafka-on-paasta-part-two/migration_example.gif" alt="Cluster Connection Migration" /><p class="subtle-text"><small>Cluster Connection Migration</small></p></div><p>Next, we deployed a dedicated instance of Cruise Control for the cluster, with <a href="https://docs.cloudera.com/cdp-private-cloud-base/7.1.6/cctrl-overview/topics/cctrl-self-healing.html">self-healing</a> <strong>disabled</strong>. We didn’t want multiple rebalance algorithms to run simultaneously, and since the self-healing algorithm is able to rebalance the cluster, we prevented Cruise Control from automatically moving topic partitions. After this we created a PaaSTA instance for the cluster, except we explicitly disabled the Kafka Kubernetes <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/operator/">operator’s</a> use of Cruise Control. For an EC2 cluster with <em>N</em> brokers we then added an additional <em>N</em> PaaSTA-based brokers, effectively doubling the cluster size during the migration.</p><p>After the new PaaSTA brokers were online and healthy, the cluster had an equal number of EC2 brokers and PaaSTA brokers. We also enabled metrics reporting by creating the <a href="https://github.com/linkedin/cruise-control/blob/fb13240bc5759b30720339c27fdc3a04b8544c23/config/cruisecontrol.properties#L49-L50">__CruiseControlMetrics</a> topic and setting up the appropriate configs prior to each migration. To retain control over when partitions would be moved, we disabled our status quo automated rebalance algorithm. At this point we were ready to start moving data away from the EC2 brokers and leveraged Cruise Control’s API to remove them. Note that this API only moves partitions away from the specified brokers and does not actually decommission the hosts. 
We continued to <a href="https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API_RecordLifecycleActionHeartbeat.html">send heartbeats for EC2 lifecycle actions</a> throughout the migration procedure since the autoscaling group associated with the EC2 brokers would persist until the end of the migration process. Below is a figure illustrating the state of each component throughout the migration:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-03-03-kafka-on-paasta-part-two/migrate_rebalance.gif" alt="Migrating from Conditional Rebalance Script to Cruise Control" /><p class="subtle-text"><small>Migrating from Conditional Rebalance Script to Cruise Control</small></p></div><p>Rather than manually issue broker removal requests, we built a rudimentary migration helper service to check the cluster state, repeatedly issue requests to the Cruise Control REST API, and remove EC2 brokers one by one. After Cruise Control finished moving all partition data away from the EC2 brokers and onto the PaaSTA brokers, we were ready to terminate the EC2 brokers. This was accomplished by shrinking the size of the ASG from <em>N</em> to 0 and by removing references to the old EC2 ELBs in our configuration files. Since we use <a href="https://www.terraform.io/">Terraform</a> to manage AWS resources, the rollback procedure was as simple as a <code class="language-plaintext highlighter-rouge">git revert</code> to recreate the resources. After the EC2 brokers had been decommissioned, we removed the instance of our decommission helper service and enabled self-healing in the cluster’s Cruise Control instance. This was now safe to do since the cluster was composed entirely of PaaSTA-based brokers. At this point the cluster migration was complete, and the remaining work entailed cleaning up any miscellaneous AWS resources (autoscaling SQS queues, ASGs, ELBs, etc.) 
after deeming it safe to do so.</p><h2 id="risks-rollbacks-and-darklaunches">Risks, Rollbacks, and Darklaunches</h2><p>While we strove to optimize safety over migration speed, there were naturally still some risks and drawbacks associated with our approach. One consideration was the temporary cost increase due to doubling the size of each cluster. The alternative to this was to iteratively add one PaaSTA broker, perform data migration away from one EC2 broker, decommission one EC2 broker, and repeat. Since this approach confines the data movement to one broker’s replica set at a time, this approach would have extended the total duration of the migration procedure. Ultimately we decided that we favored migration speed, so the up-front cost of having twice as many brokers was a cost that we were willing to pay. Additionally, we estimated that the benefits associated with having the cluster on PaaSTA would outweigh these initial costs in the long run. Another tradeoff was that doubling the size of the cluster would also result in very large cluster sizes for some of our high traffic clusters. Those clusters required additional attention during the migration process, and this engineering time-cost was also an initial investment that we were willing to make for the sake of shorter migrations.</p><p>In case of a catastrophic issue during the migration, we also needed to devise a rollback procedure. Sequentially reversing the order of the migration procedure at any stage was sufficient to roll back the changes (this time using Cruise Control’s <code class="language-plaintext highlighter-rouge">add_broker</code> API rather than the <code class="language-plaintext highlighter-rouge">remove_broker</code> API after removing any pending reassignment plans). The primary risk associated with this is that both the migration and the rollback procedure are heavily reliant on Cruise Control being in a healthy state. 
To mitigate this risk we assessed the resource requirements of these instances on test clusters and then overprovisioned the hardware resources for the non-test Cruise Control instances. We also ensured that there was adequate monitoring and alerting on the health of these instances. Finally, we provisioned backup instances which would serve as a replacement if the primary instance became unhealthy.</p><p>While the plan seemed sound in theory, we needed to test it on real clusters and thoroughly document any anomalies. To do this we first used <a href="https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=27846330">Kafka MirrorMaker</a> to clone an existing cluster and then performed a darklaunch migration in its entirety in a non-production environment before repeating the darklaunch migration in a production environment. Once we had established sufficient confidence and documentation, we performed real migrations of all of our Kafka clusters in development and staging environments before performing any production migrations.</p><h2 id="challenges-and-learnings">Challenges and Learnings</h2><p>As previously alluded to, the major risk with the plan was that Cruise Control needed to be healthy in order to proceed with a migration or rollback. We did encounter some instability in some of our non-prod migrations wherein a Cruise Control instance became unhealthy due to offline partitions in a Kafka cluster which temporarily experienced broker instability. Since Cruise Control’s algorithms and internal cluster model rely on being able to read from (and write to) a set of metrics topics, communication between Cruise Control and each Kafka cluster must be maintained. Offline partitions can thus prevent Cruise Control from operating properly, so in those cases the priority is to first triage and fix the issue in Kafka. 
Additionally, Cruise Control exposes configuration values for tuning various aspects of its internal metrics algorithm, and we found that it was sometimes helpful to reduce the lookback window and number of required data points. Doing so helped Cruise Control regenerate its internal model more quickly in cases where Kafka brokers encountered offline partitions.</p><p>Since we were migrating individual clusters, beginning with clusters in our development environment, we were able to gain insights into the performance characteristics of a Kafka cluster when it was running on PaaSTA/Kubernetes compared to when it was running on EC2. Much like with our instance selection criteria when running on bare EC2 instances, we were able to set up Kafka pools with differing instance types according to resource requirements (e.g. a standard pool and a large pool, each containing different instance types).</p><p>Another approach we initially considered for our migration procedure was to set up a fresh PaaSTA-based cluster with <em>N</em> brokers and then use Kafka MirrorMaker to “clone” an existing EC2 cluster’s data onto that new cluster. We also considered adjusting the strategy such that we would add one PaaSTA broker, remove one EC2 broker, and repeat <em>N</em> times. However, this would have entailed updating our operator’s reconcile logic for the purpose of the migration, and we would have needed to manually ensure that each broker pair was in the same availability zone. It would have also introduced a lengthy data copying step which we did not feel was acceptable for large clusters. After some further testing of procedures in our development environment, we ultimately settled on the procedure described here.</p><h2 id="acknowledgements">Acknowledgements</h2><p>Many thanks to Mohammad Haseeb, Brian Sang, and Flavien Raynaud for contributing to the design and implementation of this work. 
I would also like to thank Blake Larkin, Catlyn Kong, Eric Hernandez, Landon Sterk, Mohammad Haseeb, Riya Charaya, and Ryan Irwin for their valuable comments and suggestions. Finally, this work would not have been realized without the help of everyone who performed cluster migrations, so I am grateful to Mohammad Haseeb, Jamie Hewland, Zhaoyang Huang, Georgios Kousouris, Halil Cetiner, Oliver Bennett, Amr Ibrahim, and Alina Radu for all of their contributions.</p><div class="island job-posting"><h3>Principal Platform Software Engineer (Data Streams) at Yelp</h3><p>Want to build next-generation streaming data infrastructure?</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/a04be5e0-7421-48c7-8a4a-9c02b9c758cd/Principal-Platform-Software-Engineer-Data-Streams-Remote?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2022/03/kafka-on-paasta-part-two.html</link>
      <guid>https://engineeringblog.yelp.com/2022/03/kafka-on-paasta-part-two.html</guid>
      <pubDate>Thu, 03 Mar 2022 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Server Side Rendering at Scale]]></title>
      <description><![CDATA[<p>At Yelp, we use Server Side Rendering (SSR) to improve the performance of our React-based frontend pages. After a string of production incidents in early 2021, we realized our existing SSR system was failing to scale as we migrated more pages from Python-based templates to React. Throughout the rest of the year, we worked to re-architect our SSR system in a way that increased stability, reduced costs, and improved observability for feature teams.</p><h2 id="what-is-ssr">What Is SSR?</h2><p>Server Side Rendering is a technique used to improve the performance of JavaScript templating systems (such as React). Rather than waiting for the client to download a JavaScript bundle and render the page based on its contents, we render the page’s HTML on the server side and attach dynamic hooks on the client side once it’s been downloaded. This approach trades increased transfer size for increased rendering speeds, as our servers are typically faster than a client machine. In practice, we find that it significantly improves our <a href="https://web.dev/lcp/" target="_blank">LCP</a> timings.</p><p>We prepare components for SSR by bundling them with an entrypoint function and any other dependencies into a self-contained .js file. The entrypoint then uses <a href="https://reactjs.org/docs/react-dom-server.html" target="_blank">ReactDOMServer</a>, which accepts component props and produces rendered HTML. These SSR bundles are uploaded to S3 as part of our continuous integration process.</p><p>Our old SSR system would download and initialize the latest version of every SSR bundle at startup so that it’d be ready to render any page without waiting on S3 in the critical path. Then, depending on the incoming request, an appropriate entrypoint function would be selected and called. 
This preloading approach posed a number of issues for us:</p><ul><li>Downloading and initializing every bundle significantly increased service startup time, which made it difficult to quickly react to scaling events.</li>
<li>Having the service manage all bundles created a massive memory requirement. Every time we scaled horizontally and spun up a new service instance, we’d have to allocate memory equal to the sum of every bundle’s source code and runtime usage. Serving all bundles from the same instance also made it difficult to measure the performance characteristics of a single bundle.</li>
<li>If a new version of a bundle was uploaded in between service restarts, the service wouldn’t have a copy of it. We solved this by dynamically downloading missing bundles as needed, and used an LRU cache to ensure we weren’t holding too many dynamic bundles in memory at the same time.</li>
</ul><p>The old system was based on Airbnb’s <a href="https://github.com/airbnb/hypernova" target="_blank">Hypernova</a>. Airbnb has written their own <a href="https://medium.com/airbnb-engineering/operationalizing-node-js-for-server-side-rendering-c5ba718acfc9" target="_blank">blog post</a> about the issues with Hypernova, but the core issue is that rendering components blocks the event loop and can cause several Node APIs to break in unexpected ways. One key issue we encountered is that blocking the event loop will break Node’s HTTP request timeout functionality, which significantly exacerbated request latencies when the system was already overloaded. Any SSR system must be designed to minimize the impact of blocking the event loop due to rendering.</p><p>These issues came to a head in early 2021 as the number of SSR bundles at Yelp continued to increase:</p><ul><li>Startup times became so slow that Kubernetes began marking instances as unhealthy and automatically restarting them, preventing them from ever becoming healthy.</li>
<li>The service’s massive heap size led to significant garbage collection issues. By the end of the old system’s lifetime, we were allocating nearly 12GB of old heap space for it. In one experiment, we determined that we were unable to serve more than 50 requests per second due to time lost to garbage collection.</li>
</ul><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-02-22-server-side-rendering-at-scale/latency.png" alt="Request Latency" /><p class="subtle-text"><small>Request Latency</small></p></div><ul><li>Thrashing the dynamic bundle cache due to frequent bundle eviction and re-initialization created a large CPU burden that began affecting other services running on the same host.</li>
</ul><p>All of these issues degraded Yelp’s frontend performance and led to several incidents.</p><p>After dealing with these incidents, we set out to re-architect our SSR system. We chose stability, observability, and simplicity as our design goals. The new system should function and scale without much manual intervention. It should be easy to observe not only for infra teams, but for bundle-owning feature teams as well. The design of the new system should be easy for future developers to understand.</p><p>We also chose a few specific, functional goals:</p><ul><li>Minimize the impact of blocking the event loop so that features like request timeouts work correctly.</li>
<li>Shard service instances by bundle, so that each bundle has its own unique resource allocation. This reduces our overall resource footprint and makes bundle-specific performance easier to observe.</li>
<li>Be able to fast-fail requests we don’t anticipate being able to serve quickly. If we know it’ll take a long time to render a request, the system should immediately fall back to client-side rendering rather than waiting for SSR to time out first. This provides the fastest possible UX to our end users.</li>
</ul><h2 id="language-choice">Language Choice</h2><p>We evaluated several languages when it came time to implement the SSR Service (SSRS), including Python and Rust. It would have been ideal from an internal ecosystem perspective to use Python; however, we found that the state of V8 bindings for Python was not production ready and would require a significant investment to use for SSR.</p><p>Next, we evaluated Rust, which has <a href="https://github.com/denoland/rusty_v8" target="_blank">high quality V8 bindings</a> that are already used in popular production-ready projects like <a href="https://github.com/denoland/deno" target="_blank">Deno</a>. However, all of our SSR bundles rely on the Node runtime API, which is not part of bare V8; thus, we’d have to reimplement significant portions of it to support SSR. This, in addition to a general lack of support for Rust in Yelp’s developer ecosystem, prevented us from using it.</p><p>In the end, we decided to rewrite SSRS in Node because Node provides a <a href="https://nodejs.org/api/vm.html" target="_blank">V8 VM API</a> that allows developers to run JS in sandboxed V8 contexts, has high quality support in the Yelp developer ecosystem, and would allow us to reuse code from other internal Node services to reduce implementation work.</p><p>SSRS consists of a main thread and many worker threads. Node worker threads are different from OS threads in that each thread has its own event loop and memory cannot be trivially shared between threads.</p><p>When the main thread receives an HTTP request, it executes the following steps:</p><ol><li>Check if the request should be fast-failed based on a “timeout factor.” Currently, this factor includes the average rendering run time and current queue size, but could be expanded upon to incorporate more metrics, like CPU load and throughput.</li>
<li>Push the request to the rendering worker pool queue.</li>
</ol><p>When a worker thread receives a request, it executes the following steps:</p><ol><li>Performs server side rendering. This blocks the event loop, but is still allowable since the worker only handles one request at a time. Nothing else should be using the event loop while this CPU-bound work happens.</li>
<li>Returns the rendered HTML to the main thread.</li>
</ol><p>When the main thread receives a response from a worker thread, it returns the rendered HTML back to the client.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-02-22-server-side-rendering-at-scale/architecture.png" alt="SSRS Architecture" /><p class="subtle-text"><small>SSRS Architecture</small></p></div><p>This approach provides us with two important guarantees that help us meet our requirements:</p><ul><li>The event loop is never blocked in the main web server thread.</li>
<li>The event loop is never needed while it’s blocked in a worker thread.</li>
</ul><p>We used <a href="https://github.com/piscinajs/piscina" target="_blank">Piscina</a>, a third-party library that provides the functionality described above. It manages thread pools with support for task queueing, task cancellation, and many other useful features. <a href="https://www.fastify.io/" target="_blank">Fastify</a> was chosen to power the main thread web server because it’s both highly performant and developer-friendly.</p><p>Fastify Server:</p><div class="language-javascript highlighter-rouge highlight"><pre>const workerPool = new Piscina({...});
app.post('/batch', opts, async (request, reply) =&gt; {
       if (
           Math.min(avgRunTime.movingAverage(), RENDER_TIMEOUT_MSECS) * (workerPool.queueSize + 1) &gt;
           RENDER_TIMEOUT_MSECS
       ) {
           // Request is not expected to complete in time.
           throw app.httpErrors.tooManyRequests();
       }
       try {
           const start = performance.now();
           currentPendingTasks += 1;
           const resp = await workerPool.run(...);
           const stop = performance.now();
           const runTime = resp.duration;
           const waitTime = stop - start - runTime;
           avgRunTime.push(Date.now(), runTime);
           reply.send({
               results: resp,
           });
       } catch (e) {
           // Error handling code
       } finally {
           currentPendingTasks -= 1;
       }
   });
</pre></div><h2 id="autoscaling-for-horizontal-scaling">Autoscaling for Horizontal Scaling</h2><p>SSRS is built on PaaSTA, which provides <a href="https://paasta.readthedocs.io/en/latest/autoscaling.html" target="_blank">autoscaling mechanisms</a> out of the box. We decided to build a custom autoscaling signal that ingests the utilization of the worker pool:</p><p><code class="language-plaintext highlighter-rouge">Math.min(currentPendingTasks, WORKER_COUNT) / WORKER_COUNT;</code></p><p>This value is compared against our target utilization (setpoint) over a moving time window to make horizontal scaling adjustments. We found that this signal helps us keep per-worker load in a healthier, more accurately provisioned state than basic container CPU usage scaling does, ensuring that all requests are served in a reasonable amount of time without overloading workers or overscaling the service.</p><h2 id="autotuning-for-vertical-scaling">Autotuning for Vertical Scaling</h2><p>Yelp is composed of many pages with different traffic loads; as such, the SSRS shards that support these pages have vastly different resource requirements. Rather than statically defining resources for each SSRS shard, we took advantage of dynamic resource autotuning to automatically adjust container resources like CPUs and memory of shards over time.</p><p>These two scaling mechanisms ensure each shard has the instances and resources it needs, regardless of how little or how much traffic it receives. The biggest benefit is running SSRS efficiently across a diverse set of pages while remaining cost effective.</p><p>Rewriting SSRS with Piscina and Fastify allowed us to avoid the blocking event loop issue that our previous implementation suffered from. Combined with a sharded approach and better scaling signals, this allowed us to squeeze out more performance while reducing cloud compute costs. Some of the highlights include:</p><ul><li>An average reduction of 125ms p99 when server side rendering a bundle.</li>
<li>Improved service startup times from minutes in the old system to seconds by reducing the number of bundles initialized on boot.</li>
<li>Reduced cloud compute costs to one-third of the previous system by using a custom scaling factor and tuning resources more efficiently per-shard.</li>
<li>Increased observability since each shard is now responsible for rendering one bundle only, allowing teams to more quickly understand where things are going wrong.</li>
<li>Created a more extensible system allowing for future improvements like CPU profiling and bundle source map support.</li>
</ul><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="http://www.yelp.com/careers?job_id=3358a10e-b1af-4a5a-bd0e-4aa6bab35c93?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2022/02/server-side-rendering-at-scale.html</link>
      <guid>https://engineeringblog.yelp.com/2022/02/server-side-rendering-at-scale.html</guid>
      <pubDate>Tue, 22 Feb 2022 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Developing a New Native Ads Dashboard Using Server-Driven UI]]></title>
      <description><![CDATA[<p>Updating the ads experience for Yelp Advertisers by creating a new Native Ads Dashboard using Server-Driven UI.</p><p>The Yelp Ads Dashboard is a tool that advertisers can use to update their ad settings and keep track of how their ad is performing. In 2020, we revamped the Ads Dashboard web experience to provide greater visibility into an ad’s performance and better access to control and customize options from a single page.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-10-15-developing-a-new-native-ads-dashboard-using-server-driven-ui/ads-dashboard-screenshot.png" alt="Ads Dashboard on Desktop" /><p class="subtle-text"><small>Ads Dashboard on Desktop</small></p></div><p>In order to ensure consistency across platforms from both a visual and feature standpoint, we decided to update our Ads Dashboard experience on mobile to continue to provide advertisers with an exceptional experience.</p><p>When we started planning this project, we agreed on four specific objectives:</p><ul><li>Feature consistency with the web: we wanted our customers (business owners) to be able to do everything on mobile that they could do on the web, with no additional steps.</li><li>Visual consistency with the web: we wanted to ensure that the visual experience was similar to the web so that it felt familiar and Yelpy.</li><li>A fast and easy way to add new components or edit existing ones.</li><li>No duplicated logic, especially anything already written for the web.</li></ul><p>The Ads Dashboard would live on the Yelp for Business App, an iOS and Android app for business owners on Yelp. This app is developed natively, which means there are separate codebases for iOS and Android. So whenever we make a change to the app, we need to update both the iOS and Android versions of it and push them out separately.</p><p>We knew we still wanted to develop the Ads Dashboard natively, but noticed that our current processes would not meet our requirements. 
Having separate codebases meant:</p><ul><li>Adding new components or updating existing ones would require a lot of time and coordination between iOS and Android engineers.</li><li>Logic, both for achieving visual consistency and for getting data to power components, would have to be duplicated in two different places.</li></ul><p>Luckily, we weren’t the first ones to encounter these problems, and a solution was already in the works: the <a href="https://engineeringblog.yelp.com/2021/11/building-a-server-driven-foundation-for-mobile-app-development.html">Biz Native Foundation</a>.</p><p>Biz Native Foundation (BNF) is a Server-Driven UI framework currently being developed at Yelp for our Biz App. The goal of BNF is to accelerate app development for the Biz App by consolidating the business logic and screen configurations in the backend, instead of having them exist separately within the iOS and Android apps.</p><p>With BNF, our backend service for mobile apps is able to send a screen configuration to the apps, informing them of which components to render, as well as which data and properties to prepopulate those components with. The mobile apps are configured to parse the screen configuration sent to them and understand what components they need to render.</p><p>However, there were a few obstacles here that we needed to tackle.</p><p>Firstly, BNF was an entirely new framework. We were going to be one of the first teams at Yelp to build out an entire page using BNF. This meant that we had no common components built out for us. <strong>Not only did we have to build these from scratch, we also had to set the standard for how to do so for future projects, and make sure that everything we built was reusable and extensible.</strong></p><p>We compiled a list of common and specialized components we’d need, including things like Buttons, Headers, and Charts. 
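A server-driven screen configuration of this kind might look something like the following sketch; the component types, property keys, and renderer below are all invented for illustration and do not reflect BNF's actual schema.

```javascript
// Hypothetical server-driven screen configuration: an ordered list of
// components plus the properties each should render with. Every field
// name here is invented for illustration.
const screenConfig = {
  screen: 'ads_dashboard',
  components: [
    { type: 'Header', props: { title: 'Your Yelp Ad' } },
    { type: 'Chart', props: { metric: 'ad_clicks', source: 'graphql' } },
    { type: 'Button', props: { label: 'Edit Ad' } },
  ],
};

// A client-side renderer walks the list and maps each type to a native view.
const rendered = screenConfig.components.map((c) => c.type);
console.log(rendered.join(' > ')); // → Header > Chart > Button
```

Because the configuration is data, changing the screen layout only requires a backend change, with no app release on either platform.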
We started to anticipate which parts of these components we’d want to have control over when building new features and ensure they were customizable.</p><p>Soon enough, we had a library of components we could use. Once the pieces were built, putting them together was as easy as making a single code change to the backend, and the magic of Server-Driven UI became clear.</p><p>Making changes to the screen layout, as well as components and what they looked like, was fast. We were able to test different configurations and easily play around with the screen. On top of that, a lot of the complex logic that drives the component structure and layout lived in a single place and was easy to update without redeploying the iOS and Android apps.</p><p>The second problem was a little more subtle. We knew that some of the components, like Charts, would need to display some data. However, retrieving that data from the backend on page load was expensive and could slow down the load time. Additionally, since BNF is essentially a way for us to configure the screen, we didn’t want to burden it with the responsibility of providing data that was agnostic to the UI.</p><p>Instead, we wanted a way for each component to fetch additional data as needed after it was loaded on the screen. Enter GraphQL.</p><p>GraphQL is a query language that makes it super easy and straightforward to fetch data. At Yelp, we’ve been using GraphQL to power a lot of data-driven components like Charts and Graphs. In fact, our components on the Ads Dashboard on the web are currently making GraphQL queries to fetch data for Ad Performance.</p><p>We realized that GraphQL was the way to go for our components on mobile as well. Not only was it fast, it also parallelized a lot of data fetching. Beyond that, since we’d already written GraphQL queries to fetch data on the web, it was easy to use the same queries for mobile!</p><p>We had our solutions, and things were coming together. 
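The shared GraphQL fetch described above can be sketched as a standard GraphQL-over-HTTP request; the query fields, variable, and endpoint here are hypothetical, not Yelp's actual schema.

```javascript
// Hypothetical GraphQL query a Chart component might issue after it is
// mounted. Field names, the variable, and the endpoint are invented.
const query = `
  query AdPerformance($businessId: ID!) {
    adPerformance(businessId: $businessId) {
      impressions
      clicks
    }
  }`;

// GraphQL over HTTP is a POST whose JSON body carries the query and its
// variables, so the same query string can be shared by web and mobile.
const body = JSON.stringify({ query, variables: { businessId: 'abc123' } });
// e.g. fetch('https://api.example.com/graphql', { method: 'POST', body, ... })
const parsed = JSON.parse(body);
console.log(parsed.variables.businessId); // → abc123
```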
The flow was simple: our backend service sent a list of components and their properties to mobile clients, and some data-driven components used GraphQL to fetch any additional data. The mobile client did not need to perform any additional tasks. We performed a number of tests to ensure the Ads Dashboard was robust and easy to update with new features.</p><p>We launched the first version of the Ads Dashboard for mobile in Q1 2021. Since then, we’ve been adding new features and components to provide even more valuable information to advertisers. Looking back, the Biz Native Foundation was the right choice as it exceeded our expectations and allowed us to iterate faster than ever on a mobile app.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="http://www.yelp.com/careers?job_id=3358a10e-b1af-4a5a-bd0e-4aa6bab35c93?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2022/02/developing-a-new-native-ads-dashboard-using-server-driven-ui.html</link>
      <guid>https://engineeringblog.yelp.com/2022/02/developing-a-new-native-ads-dashboard-using-server-driven-ui.html</guid>
      <pubDate>Tue, 15 Feb 2022 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Kafka on PaaSTA: Running Kafka on Kubernetes at Yelp (Part 1 - Architecture)]]></title>
      <description><![CDATA[<p>Yelp’s <a href="https://kafka.apache.org/">Kafka</a> infrastructure ingests tens of billions of messages each day to facilitate data driven decisions and power business-critical pipelines and services. We have recently made some improvements to our Kafka deployment architecture by running some of our clusters on <a href="https://engineeringblog.yelp.com/2015/11/introducing-paasta-an-open-platform-as-a-service.html">PaaSTA</a>, Yelp’s own Platform as a Service. Our <a href="https://kubernetes.io/">Kubernetes</a> (k8s) based deployment leverages a custom Kubernetes <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/operator/">operator</a> for Kafka, as well as <a href="https://github.com/linkedin/cruise-control">Cruise Control</a> for lifecycle management.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-12-15-kafka-on-paasta-part-one/stack.png" alt="Kafka on PaaSTA on Kubernetes" /><p class="subtle-text"><small>Kafka on PaaSTA on Kubernetes</small></p></div><h2 id="architectural-motivations-and-improvements">Architectural Motivations and Improvements</h2><p>In the past, all of our Kafka clusters ran on dedicated <a href="https://aws.amazon.com/pm/ec2/">EC2</a> instances on AWS. Kafka was deployed directly on these hosts and configuration management was highly reliant on our centralized <a href="https://puppet.com/">Puppet</a> repository. The deployment model was somewhat cumbersome and creating a new cluster took over two hours on average. We set out to develop a new deployment model with the following goals in mind:</p><ul><li>Reduce the dependency on slow Puppet runs.</li>
<li>Promote adoption of PaaSTA internally and leverage its CLI tools to improve productivity.</li>
<li>Improve maintainability of our lifecycle management system.</li>
<li>Simplify the process of performing OS host level upgrades and Kafka version upgrades.</li>
<li>Streamline the creation of new Kafka clusters (aligned with how we deploy services).</li>
<li>Expedite broker decommissions and simplify the recovery process when hosts fail. Having the ability to re-attach EBS volumes also allows us to avoid unnecessarily consuming network resources, which helps save money.</li>
</ul><p>Yelp had previously developed practices for running <a href="https://kubernetes.io/docs/tutorials/stateful-application/">stateful applications</a> on Kubernetes (e.g. <a href="https://engineeringblog.yelp.com/2020/11/orchestrating-cassandra-on-kubernetes-with-operators.html">Cassandra on PaaSTA</a> and <a href="https://engineeringblog.yelp.com/2020/10/flink-on-paasta.html">Flink on PaaSTA</a>), so PaaSTA was a natural choice for this use case.</p><p>The new deployment architecture leverages PaaSTA pools (groups of hosts) for its underlying infrastructure. Kafka broker <a href="https://kubernetes.io/docs/concepts/workloads/pods/">pods</a> are scheduled on Kubernetes <a href="https://kubernetes.io/docs/concepts/architecture/nodes/">nodes</a> in these pools, and the broker pods have detachable <a href="https://aws.amazon.com/ebs/">EBS</a> volumes. Two key components of the new architecture are the Kafka operator and Cruise Control, both of which we will describe in more detail later. We deploy instances of our in-house Kafka Kubernetes operator and various sidecar services on PaaSTA, and one instance of Cruise Control is also deployed on PaaSTA for each Kafka cluster.</p><p>Two crucial distinctions between the new architecture and the old architecture are that Kafka now runs within a <a href="https://www.docker.com/">Docker</a> container, and our configuration management approach no longer relies on Puppet. Configuration management now follows the standard PaaSTA-based solution, in which <a href="https://www.jenkins.io/">Jenkins</a> propagates YAML file changes whenever they are committed to our service config repository. As a result of this architectural overhaul, we’re now able to leverage existing PaaSTA CLI tooling to see the status of clusters, read logs, and restart clusters. 
Another major benefit is that we’re now able to provision new Kafka clusters by providing the requisite configuration (see below), and this approach has allowed us to <em>halve the time taken</em> to deploy a new Kafka cluster from scratch.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-12-15-kafka-on-paasta-part-one/paasta-tooling.png" alt="PaaSTA Tooling Example" /><p class="subtle-text"><small>PaaSTA Tooling Example</small></p></div><figure class="code"><figure class="highlight"><pre class="language-yaml" data-lang="yaml">example-test-prod:
  deploy_group: prod.everything
  pool: kafka
  brokers: 15
  cpus: 5.7  # CPU unit reservation breakdown: (5.7 (kafka) + 0.1 (hacheck) + 0.1 (sensu)) + 0.1 (kiam) = 6.0 (as an example, consider that our pool is comprised of m5.2xlarge instances)
  mem: 26Gi
  data: 910Gi
  storage_class: gp2
  cluster_type: example
  cluster_name: test-prod
  use_cruise_control: true
  cruise_control_port: 12345
  service_name: kafka-2-4-1
  zookeeper:
    cluster_name: test-prod
    chroot: kafka-example-test-prod
    cluster_type: kafka_example_test
  config:
    unclean.leader.election.enable: "false"
    reserved.broker.max.id: "2113929216"
    request.timeout.ms: "300001"
    replica.fetch.max.bytes: "10485760"
    offsets.topic.segment.bytes: "104857600"
    offsets.retention.minutes: "10080"
    offsets.load.buffer.size: "15728640"
    num.replica.fetchers: "3"
    num.network.threads: "5"
    num.io.threads: "5"
    min.insync.replicas: "2"
    message.max.bytes: "1000000"
    log.segment.bytes: "268435456"
    log.roll.jitter.hours: "1"
    log.roll.hours: "22"
    log.retention.hours: "24"
    log.message.timestamp.type: "LogAppendTime"
    log.message.format.version: "2.4-IV1"
    log.cleaner.enable: "true"
    log.cleaner.threads: "3"
    log.cleaner.dedupe.buffer.size: "536870912"
    inter.broker.protocol.version: "2.4-IV1"
    group.max.session.timeout.ms: "300000"
    delete.topic.enable: "true"
    default.replication.factor: "3"
    connections.max.idle.ms: "3600000"
    confluent.support.metrics.enable: "false"
    auto.create.topics.enable: "false"
    transactional.id.expiration.ms: "86400000"</pre></figure><figcaption class="c1">Example configuration file for a cluster with 15 brokers running Kafka version 2.4.1</figcaption></figure><h2 id="the-new-architecture-in-detail">The New Architecture in Detail</h2><p>One primary component of the new architecture is the Kafka Kubernetes operator which helps us manage the state of the Kafka cluster. While we still rely on external <a href="https://zookeeper.apache.org/">ZooKeeper</a> clusters to maintain cluster metadata, message data is still persisted to the disks of Kafka brokers. Since Kafka consumers rely on persistent storage to be able to retrieve this data, Kafka is considered a stateful application in the context of Kubernetes. Kubernetes natively exposes abstractions for managing stateful applications (e.g. <a href="https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/">StatefulSets</a>), but Kubernetes has no notion of Kafka-specific constructs by default. As such, we needed additional functionality beyond that of the standard Kubernetes API to maintain our instances. In the parlance of Kubernetes, an <em>operator</em> is a custom controller which allows us to expose this application-specific functionality.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-12-15-kafka-on-paasta-part-one/operator-overview.png" alt="Kafka Operator Overview" /><p class="subtle-text"><small>Kafka Operator Overview</small></p></div><p>The operator is in charge of establishing when Kubernetes needs to perform an action on the cluster. It has a reconcile loop in which it observes the state of custom cluster resources and reconciles any discrepancies by interacting with the Kubernetes API and by calling APIs exposed by another key architectural component: Cruise Control.</p><p><a href="https://github.com/linkedin/cruise-control">Cruise Control</a> is an open-source Kafka cluster management system developed by LinkedIn. 
Its goal is to reduce the overhead associated with maintaining large Kafka clusters. Each Kafka cluster has its own dedicated instance of Cruise Control, and each cluster’s operator interacts with its Cruise Control instance to perform lifecycle management operations such as checking the health of the cluster, rebalancing topic partitions and adding/removing brokers.</p><p>The paradigm used by Cruise Control is in many ways similar to the one used by the operator. Cruise Control monitors the state of the Kafka cluster, generates an internal model, scans for anomalous goal violations, and attempts to resolve any observed anomalies. It exposes APIs for various administrative tasks and the aforementioned lifecycle management operations. These APIs serve as a replacement for our prior ad hoc lifecycle management implementations which we used for EC2-backed brokers to perform conditional rebalance operations or interact with AWS resources like <a href="https://aws.amazon.com/sns">SNS</a> and <a href="https://aws.amazon.com/sqs/">SQS</a>. Consolidating these into one service has helped to simplify our lifecycle management stack.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-12-15-kafka-on-paasta-part-one/cluster-architecture.png" alt="Cluster Architecture" /><p class="subtle-text"><small>Cluster Architecture</small></p></div><p>Putting these components together, we arrive at a cluster architecture in which we define a Custom Resource Definition (CRD) through our internal config management system and couple it with a custom Kafka Docker image. The Kafka Kubernetes operator uses the config, CRD, and the Docker image in its interaction with the Kubernetes API to generate a KafkaCluster Custom Resource on a Kubernetes master. This allows us to schedule Kafka pods on Kubernetes nodes, and the operator oversees and maintains the health of the cluster through both the Kubernetes API and the APIs exposed by the Cruise Control service. 
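One pass of the operator's reconcile loop can be sketched roughly as follows for a broker scale-down; the Cruise Control client object and its method name are stand-ins invented for illustration, not the real operator or Cruise Control APIs.

```javascript
// Simplified sketch of one reconcile pass for a broker scale-down.
// `cruiseControl` stands in for Cruise Control's REST API; the method
// name is invented for illustration.
async function reconcile(desiredBrokers, cluster, cruiseControl) {
  if (cluster.brokers > desiredBrokers) {
    // Ask Cruise Control to drain the highest-numbered broker.
    const task = await cruiseControl.removeBroker(cluster.brokers - 1);
    if (task.status === 'Completed') {
      // Partitions have been moved away, so the pod can be deleted.
      cluster.brokers -= 1;
    }
    // If the task is still running, the next pass re-checks its status.
  }
  return cluster;
}

// Stub that reports the removal as complete on the first check.
const cc = { removeBroker: async () => ({ status: 'Completed' }) };
reconcile(14, { brokers: 15 }, cc).then((c) => console.log(c.brokers)); // → 14
```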
Humans can observe the cluster and interact with it through the Cruise Control UI or PaaSTA CLI tools.</p><p>Finally, we’d like to illustrate the overall flow of operations with an example scenario. Consider the case of scaling down the size of the cluster by removing a broker. A developer updates the cluster’s config and decrements the broker count, which in turn updates the Kafka cluster’s CRD. As part of the reconcile loop the operator recognizes that the desired cluster state differs from the actual state represented in the StatefulSet, so it asks Cruise Control to remove a broker. Information about the removal task is returned by the Cruise Control API, and the operator annotates the decommissioning pod with metadata about this task. While Cruise Control performs the process of moving partitions away from the broker to be decommissioned, the operator routinely checks the status of the decommission by issuing requests to Cruise Control. Once the task is marked as completed, the operator removes the pod, and the cluster’s actual state is reconciled with its spec.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-12-15-kafka-on-paasta-part-one/scaling-example.png" alt="Scale Down Scenario" /><p class="subtle-text"><small>Scale Down Scenario</small></p></div><h2 id="what-comes-next">What comes next?</h2><p>After designing this architecture we built tooling and constructed a process for seamlessly migrating Kafka clusters from EC2 to PaaSTA. As of this post we have migrated many of our clusters to PaaSTA, and we’ve deployed new clusters using the architecture detailed here. We’re also continuing to tune our hardware selection to accommodate different attributes of our clusters. 
Stay tuned for another installment in this series where we will share our migration process!</p><h2 id="acknowledgements">Acknowledgements</h2><p>Many thanks to Mohammad Haseeb for contributing to the architecture and implementation of this work, as well as for providing the architecture figures. I would also like to thank Brian Sang and Flavien Raynaud for their many contributions to this project. Finally, I’d like to thank Blake Larkin, Catlyn Kong, Eric Hernandez, Landon Sterk, Mohammad Haseeb, Riya Charaya, and Ryan Irwin for their insightful review comments and guidance in writing this post.</p><div class="island job-posting"><h3>Principal Platform Software Engineer (Data Streams) at Yelp</h3><p>Want to build next-generation streaming data infrastructure?</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/a04be5e0-7421-48c7-8a4a-9c02b9c758cd/Principal-Platform-Software-Engineer-Data-Streams-Remote?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/12/kafka-on-paasta-part-one.html</link>
      <guid>https://engineeringblog.yelp.com/2021/12/kafka-on-paasta-part-one.html</guid>
      <pubDate>Wed, 15 Dec 2021 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Building a unified setup flow to better onboard business users]]></title>
      <description><![CDATA[<p>At Yelp we are always striving to optimize our user experience so we can help guide our customers to success. We aim to streamline the onboarding process for business owners by centralizing customer products into a single page.</p><h2 id="the-challenge">The Challenge</h2><p>Yelp offers an array of <a href="https://business.yelp.com/products/business-page/">free</a> and <a href="https://business.yelp.com/products/business-page/upgrades/">paid</a> products that help local businesses connect with consumers. To set up these products on their Yelp page, business owners previously had to navigate through multiple tabs, which negatively impacted product setup rates (roughly 55%) and lowered overall user engagement.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-12-08-building-a-unified-setup-flow-to-better-onboard-business-users/boa-home-page.png" alt="Previously, the only way businesses could set up Yelp advertising products was through navigating multiple tabs." /><p class="subtle-text"><small>Previously, the only way businesses could set up Yelp advertising products was through navigating multiple tabs.</small></p></div><h2 id="the-setup-flow">The Setup Flow</h2><p>To make it easier for business owners to set up their business page and run their advertising campaigns on Yelp, <strong><a href="https://blog.yelp.com/news/yelp-releases-new-yelp-for-business-features-enabling-more-effective-advertising-and-adding-control-and-value-for-business-owners/">we built a new unified setup flow</a> dedicated to ushering them through the setup process.</strong></p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-12-08-building-a-unified-setup-flow-to-better-onboard-business-users/setup-flow.png" alt="What a business owner sees under the new centralized flow." 
/><p class="subtle-text"><small>What a business owner sees under the new centralized flow.</small></p></div><h2 id="so-how-was-it-built">So, how was it built?</h2><p>Ideally, we could’ve just imported all of the setup components into a new single page application. However, these components had all been built differently, with no consistent architecture. So, rather than rebuild all the components the same way, we designed a new system that could accept these different components despite their varying structures.</p><h3 id="mvp-component-architecture">MVP Component Architecture</h3><p>When designing the setup flow we wanted to focus on scalability while also maintaining a reasonable project scope. To balance these two priorities we created a plug-and-play schema that each setup component was required to follow in order to be imported into our page. The component for each step must:</p><ol><li>Read and write all its own data.</li>
<li>Require only basic properties such as business ID, CSRF tokens, or the locale of the request.</li>
<li>Accept a couple of callback functions that would communicate with our single page application to denote when to save or skip over the current step in the flow.</li>
</ol><p>Once the setup step abides by our requirements we can plug it into our page skeleton. By conforming to this layout, we can easily add new steps to our setup flow as other product teams build new features or update existing ones.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-12-08-building-a-unified-setup-flow-to-better-onboard-business-users/setup-flow-skeleton.png" alt="Example of the Page Skeleton with a Setup Step Imported" /><p class="subtle-text"><small>Example of the Page Skeleton with a Setup Step Imported</small></p></div><h3 id="data-fetching">Data Fetching</h3><p>At Yelp, we have historically used <a href="https://developer.mozilla.org/en-US/docs/Web/Guide/AJAX/Getting_Started">AJAX</a> to fetch data in our frontend components. However, for this project we relied heavily on <a href="https://graphql.org/">GraphQL</a> to fetch all the data we needed. GraphQL is a query language that gives clients the power to ask for the exact data they need and nothing more. It also provides a high level of data stewardship that helps developers build robust data models and avoid having to write manual parsing code on the frontend. The smooth developer experience of building with GraphQL, combined with the scope creep that comes from maintaining many AJAX endpoints, made this an easy decision when designing the data fetching for this new system.</p><p>Not only did this save us from hooking up our lightweight single page application to clunky frontend services, it also resulted in substantial performance gains. Upon rendering, GraphQL is able to batch together multiple data fetches to make only one request to the server.</p><p>Additionally, we cache all the GraphQL calls and the data they return.
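To make this concrete, a single query in this style might look like the following sketch; the type and field names are invented for illustration, not Yelp’s actual schema:

```graphql
# Each setup step declares exactly the fields it needs; the runtime batches
# these selections into one request, and both the query and its response
# are cached on the client.
query SetupFlowData($businessId: ID!) {
  business(id: $businessId) {
    name
    hours {
      day
      open
      close
    }
  }
}
```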
This increases performance because any re-requested data can be found in the cache and doesn’t have to hit the backend server.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-12-08-building-a-unified-setup-flow-to-better-onboard-business-users/data-fetching.png" alt="The flow of data in our MVP component architecture." /><p class="subtle-text"><small>The flow of data in our MVP component architecture.</small></p></div><h3 id="v2-component-architecture">V2 Component Architecture</h3><p>In order for a setup step component to communicate with the page skeleton, our MVP component architecture relied on callback functions.</p><p>For example, when a user saved their newly updated business hours, the setup step component used a callback function called <strong>onSuccessfulSave()</strong> to inform the page skeleton. When called, the page skeleton marked the current step as complete and moved on to the next step. However, using callbacks was limiting because we had to add a new function for every additional piece of information the page wanted to know about the plugged-in component. We quickly realized that this system was not scalable.</p><p>To solve this problem, we have begun working on a V2 of the setup flow that shares a <a href="https://reactjs.org/docs/context.html#contextprovider">context provider</a> between the plugged-in component and the page skeleton. This provides efficient &amp; clean communication between the setup flow and the state of each step, e.g. whether it’s saving, loading, or has run into a network error. This new version allows the flow to communicate more information to the user about each plugged-in component, which will greatly improve the user experience.</p><h2 id="results">Results</h2><p>After launching our MVP and getting early feedback from A/B testing, we rolled out this new flow to 100% of the businesses that go through our claim process.
The setup flow has increased product setup rates by an average of 8% across all the steps with some products seeing a significant boost. For example, our <a href="https://blog.yelp.com/news/yelp-connect-a-new-voice-for-restaurants-to-reach-locals/">Yelp Connect</a> product saw a 35% increase in its set up rate!</p><p>As we continue to improve this system, our focus is on making the setup process efficient in order to help businesses grow and thrive on our platform.</p><h2 id="acknowledgements">Acknowledgements</h2><p>This project was a group effort so shoutout to everyone on the Biz Guidance Team: Zoher Zoomkawalla, Arun Bharadwaj, Taras Anatsko, Brenda Kaing, Abdul Lateef Haamid, Heidi Makein, Sophia Chen, Dorothy Cruz Perdomo, and Leon Rudyak. We also had a lot of cross team support so big thank you to everyone else who helped build this new flow!</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="http://www.yelp.com/careers?job_id=3358a10e-b1af-4a5a-bd0e-4aa6bab35c93?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/12/building-a-unified-setup-flow-to-better-onboard-business-users.html</link>
      <guid>https://engineeringblog.yelp.com/2021/12/building-a-unified-setup-flow-to-better-onboard-business-users.html</guid>
      <pubDate>Wed, 08 Dec 2021 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Building a server-driven foundation for mobile app development]]></title>
      <description><![CDATA[<p>Yelp has many teams of mobile developers who collectively maintain two different mobile apps on iOS and Android: Yelp (hereinafter “Consumer App”) and Yelp for Business (hereinafter “Biz App”). We’re always looking for ways to ship features more quickly and consistently on all these platforms! We adopt vendor and open-source libraries when possible, and we develop our own shared libraries when necessary. While many teams were already independently adopting server-driven UI (SDUI) to build their features faster and cheaper, we felt something was missing – a foundation that tied all our libraries together into a shared, server-driven, end-to-end solution for mobile features.</p><p>In this blog post, we’ll cover the Biz Native Foundation (BNF), which provides a foundation for building, testing, deploying, and monitoring server-driven features in our Biz App. At the end, we’ll share future plans for extending this foundation to our Consumer App.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/building-a-server-driven-foundation-for-mobile-app-development/biz_native_foundation.png" alt="Biz Native Foundation Diagram" class="c1" /></div><h2 id="what-is-the-yelp-for-business-mobile-app">What is the Yelp for Business mobile app?</h2><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/building-a-server-driven-foundation-for-mobile-app-development/yelp_for_business_1024x1229.png" alt="Yelp for Business mobile app screenshots" class="c1" /></div><p>Launched in <a href="https://blog.yelp.com/2014/12/our-gift-to-business-owners-a-yelp-app-just-for-you">December 2014</a> for both iOS and Android, our <a href="https://business.yelp.com/tools/business-mobile-app/">Yelp for Business</a> mobile app enables businesses to manage their presence on Yelp and connect with customers from their phone.</p><p>The app has several core screens, or tabs, with highly personalized content for each business and app user. 
For example, a restaurant that opened during COVID-19 will have different needs than a firmly established plumber, and a business owner will have different needs than a manager or employee. The core screens link to secondary screens for updating business information, adding photos, responding to reviews, and finding new customers.</p><p>Developing both iOS and Android versions of a complex, personalized app has been a major challenge. The level of effort to ship a new feature can be high, and the time-to-market can range from a minimum of one week (we release our apps weekly) to several months or quarters. Once released, a feature must be supported in a range of app versions while undergoing continuous maintenance and improvements with each new release.</p><p>Server-driven UI (SDUI) was an obvious way to address these challenges, and many product teams had already adopted SDUI or were planning to adopt SDUI in 2019 when we began developing the BNF to standardize and simplify mobile development. The COVID-19 pandemic accelerated our efforts as we <a href="https://engineeringblog.yelp.com/2020/06/how-businesses-have-reacted-to-covid-19-using-yelp-features.html">added features</a> to help businesses navigate an extremely dynamic, challenging time. We realized we needed to make a significant investment in order to adopt SDUI across our entire app.</p><h2 id="business-and-technical-requirements">Business and Technical Requirements</h2><p>We defined a handful of important business and technical goals for our foundation:</p><ol><li>Ship Biz App features more quickly and consistently on iOS and Android</li>
<li>Reduce the level-of-effort required to build, test, deploy, and monitor new Biz App features</li>
<li>Support dynamic, highly-personalized content</li>
<li>Give our marketing and product teams more direct control over the content in the Biz App</li>
</ol><h2 id="alternatives-considered">Alternatives Considered</h2><p>Before we began building our own foundation, we reviewed a couple alternatives:</p><h3 id="webviews">Webviews</h3><p>The Biz App was already using webviews to share content with our Yelp for Business web app (<a href="https://biz.yelp.com">biz.yelp.com</a>). That said, we’d been slowly migrating away from webviews for the past two years for several reasons:</p><ul><li>Webviews require careful handshaking between native and web apps</li>
<li>Most mobile app engineers don’t have experience debugging web apps, and most front-end engineers don’t have experience debugging mobile apps</li>
<li>Native screens offer superior user experience (UX) over webviews, e.g. they are faster and more tightly integrated with the platform</li>
</ul><h3 id="react-native">React Native</h3><p><a href="https://reactnative.dev/">React Native</a> would allow us to ship mobile app features more quickly and consistently, and our front-end developers could contribute to our mobile app more easily. React Native would be faster than webviews and more tightly integrated with the platform. However, React Native had some significant downsides for our existing Biz App and developer community:</p><ul><li>We didn’t already use React Native at Yelp, and most of our mobile developers didn’t have professional React Native experience</li>
<li>We couldn’t reuse our existing code or native libraries without extensive bridging, which feels counter to building a foundation</li>
</ul><p>Once we decided to build our own foundation, we established some design principles to guide our efforts.</p><h3 id="adopt-best-practices">Adopt Best Practices</h3><p>We would adopt Yelp-specific or industry-standard best practices when possible. Yelp already has consistent vendor, open-source, and internal libraries for mobile development. We use the latest features in <a href="https://developer.apple.com/documentation/uikit">UIKit</a> (iOS) and <a href="https://material.io/develop/android">Material Design</a> / <a href="https://developer.android.com/jetpack/androidx">Jetpack</a> (Android). On Android, we use our open-sourced <a href="https://engineeringblog.yelp.com/2019/05/introducing-bento.html">Bento</a> framework to build modularized UIs. On iOS, we have a similar internal framework. We wanted our foundation to build on these existing solutions rather than replace them.</p><h3 id="support-server-driven-ui-sdui">Support Server-Driven UI (SDUI)</h3><p>We would give our backend more control over screen content through server-driven UI. This would enable us to make changes more quickly and consistently on all clients. It would also enable dynamic, personalized content and give our marketing and product teams more direct control. Fortunately, SDUI wasn’t new to Yelp or the Biz App, where several product teams had already adopted SDUI for their features. We would learn from these efforts to create a shared SDUI framework for our foundation.</p><h3 id="enable-customization">Enable Customization</h3><p>We would enable customization. Though we wanted to encourage reuse and consistency, we didn’t want to restrict product teams from writing custom code where it makes sense. 
Otherwise, they would simply build their features without the foundation.</p><h3 id="create-a-supporting-toolchain">Create a Supporting Toolchain</h3><p>We would create tools that simplify or automate common tasks, such as debugging, testing, logging, monitoring, and documentation.</p><h2 id="core-concepts">Core Concepts</h2><p>The BNF has only four core concepts to keep the system simple and intuitive. Since mobile screens are the heart of every application, the BNF provides a <strong>generic screen</strong> that hosts <strong>generic components</strong>. When the user interacts with generic components, we trigger <strong>generic actions</strong> to update the UI or application state, using <strong>generic properties</strong> to provide a way for generic components to observe application state without strong coupling.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/building-a-server-driven-foundation-for-mobile-app-development/generic_architecture.png" alt="Generic Architecture Diagram" class="c1" /></div><p>We’ll go through these concepts and show how they all work together to support mobile feature development.</p><h3 id="generic-screen">Generic Screen</h3><p>A generic screen is a flexible template that can support any screen in the Biz App. Before the BNF, adding a new screen required boilerplate code, such as a custom view controller (iOS) or activity/fragment (Android). Fortunately, mobile screens are constrained by the geometry of mobile devices, so we created one highly configurable screen.</p><p>A generic screen consists of a number of sections containing one or more generic components.
Each section represents a part of the screen, such as the top/bottom navigation bar or scroll view.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/building-a-server-driven-foundation-for-mobile-app-development/generic_screen.png" alt="Generic Architecture Diagram" class="c3" /></div><p>A generic screen must be configured to display content. The BNF supports both remotely and locally configured screens.</p><h4 id="remotely-configuring-a-generic-screen-sdui">Remotely configuring a generic screen (SDUI)</h4><p>A remotely configured generic screen uses an endpoint on our REST API to load a JSON screen configuration resource:</p><div class="language-plaintext highlighter-rouge highlight"><pre>/ui/{business_id}/screens/{name}/configuration/v1
</pre></div><p>The endpoint has path arguments that specify the target business ID and the logical name of the screen, e.g. home.</p><p>We use <a href="https://swagger.io/specification/v2/">Swagger 2.0</a> to document our REST APIs and auto-generate client networking libraries. Let’s look at some definitions for our screen configuration.</p><p>The screen configuration object (<code class="language-plaintext highlighter-rouge">ScreenConfigurationV1</code>) has properties for each section on the screen, e.g. <code class="language-plaintext highlighter-rouge">components</code> is the main scroll view. We version the screen configuration object and the screen configuration endpoint whenever we add new properties to this object, such as a new section.</p><div class="language-yaml highlighter-rouge highlight"><pre>ScreenConfigurationV1:
  properties:
    header:
      $ref: '#/definitions/GenericComponent'
    components:
      type: array
      items:
        $ref: '#/definitions/GenericComponent'
    sticky_bottom_components:
      type: array
      items:
        $ref: '#/definitions/GenericComponent'
    data:
      $ref: '#/definitions/GenericScreenData'
  required:
  - components
  - data
  type: object
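# For illustration only, a response instance matching this schema might look
# like the following (IDs, types, and values are invented):
#   components:
#   - id: learn-more-button
#     type: generic_button_v1
#   data:
#     id_to_component_data:
#       learn-more-button:
#         generic_button_v1:
#           text: Learn more on our blog
#     id_to_action_data: {}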
</pre></div><p>Each section can be configured with one or more generic components. The <code class="language-plaintext highlighter-rouge">GenericComponent</code> object only contains an ID (<code class="language-plaintext highlighter-rouge">learn-more-button</code>) and a type (<code class="language-plaintext highlighter-rouge">generic_button_v1</code>).</p><div class="language-yaml highlighter-rouge highlight"><pre>GenericComponent:
  properties:
    id:
      type: string
    type:
      type: string
  required:
  - id
  - type
  type: object
</pre></div><p>The data for each component is stored in a separate <code class="language-plaintext highlighter-rouge">GenericScreenData</code> object that maps a component ID to a <code class="language-plaintext highlighter-rouge">GenericComponentData</code> object, our best approximation of a Swagger union type that works with our Swagger codegen pipeline, which automatically generates networking code for iOS and Android clients. <code class="language-plaintext highlighter-rouge">GenericActionData</code> plays a similar role for generic actions.</p><div class="language-yaml highlighter-rouge highlight"><pre>GenericScreenData:
  properties:
    id_to_component_data:
      description: Map component ID to its GenericComponentData
      $ref: '#/definitions/IdToGenericComponentData'
    id_to_action_data:
      description: Map action ID to its GenericActionData
      $ref: '#/definitions/IdToGenericActionData'
  required:
  - id_to_component_data
  - id_to_action_data
  type: object
GenericComponentData:
  properties:
    generic_button_v1:
      $ref: '#/definitions/GenericButtonDataV1'
    generic_text_v1:
      $ref: '#/definitions/GenericTextDataV1'
    ...
  type: object
GenericActionData:
  properties:
    generic_open_url_v1:
      $ref: '#/definitions/GenericOpenUrlDataV1'
    generic_close_screen_v1:
      $ref: '#/definitions/GenericCloseScreenV1'
    ...
  type: object
</pre></div><p>There are some benefits to storing configuration data separately from references:</p><ul><li>We can reuse the same data across multiple references, reducing the size of the screen configuration</li>
<li>We can debug the screen configuration more easily with all configuration data in a flat map</li>
</ul><h4 id="locally-configuring-a-generic-screen">Locally configuring a generic screen</h4><p>A locally configured generic screen uses either a Kotlin or Swift domain-specific language (DSL).</p><div class="language-kotlin highlighter-rouge highlight"><pre>screenConfiguration {
    header {
        navBar(title = "Welcome!")
    }
    components {
        text("Yelp is working on some cool things!", style = HEADER1)
        button(
            "Learn more on our blog",
            tappedActions = actions {
               openUrl("https://engineeringblog.yelp.com")
            }
        )
    }
}
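// One possible shape for the DSL entry point, assuming a builder-based
// implementation (these names are illustrative, not Yelp's actual API):
//   fun screenConfiguration(block: ScreenConfigurationBuilder.() -&gt; Unit) =
//       ScreenConfigurationBuilder().apply(block).build()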
</pre></div><p>Though locally configured screens can’t be updated without a client release, they still satisfy many of our requirements, such as shipping features more quickly and reducing the level-of-effort. Not every screen has dynamic, personalized content that benefits from being server-driven, and some screens are simply hard to make server-driven.</p><h3 id="generic-components">Generic Components</h3><p>A generic component is a basic building block for a generic screen. The BNF supports a rich, extensible library of components.</p><p>Every component has a unique ID and type. We use a naming convention to distinguish generic, reusable component types (<code class="language-plaintext highlighter-rouge">generic_button_v1</code>) from components that are customized for one feature (<code class="language-plaintext highlighter-rouge">feature_ad_preview_v1</code>). However, the BNF doesn’t handle generic or feature-specific component types differently, so we refer to all components as generic components.</p><h4 id="configuring-components">Configuring components</h4><p>Generic components must be configured with data. In remote screen configurations, each component type has an associated data object. When adding new features to the component, we always version the component type and data object.</p><div class="language-yaml highlighter-rouge highlight"><pre>definitions:
  GenericButtonDataV1:
    properties:
      text:
        type: string
      style:
        type: string
        enum:
        - primary
        - secondary
        - tertiary
      size:
        type: string
        enum:
        - standard
        - large
        - small
      viewed_actions:
        description: Actions to fire when the app user views the button
        type: array
        items:
          $ref: '#/definitions/GenericAction'
      tapped_actions:
        description: Actions to fire when the app user taps the button
        type: array
        items:
          $ref: '#/definitions/GenericAction'
    required:
    - text
    - style
    - size
    - tapped_actions
    type: object
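# For illustration, a data instance for this component might look like the
# following (values invented; each tapped_actions entry references an action
# by its ID and type):
#   text: Learn more on our blog
#   style: primary
#   size: standard
#   tapped_actions:
#   - id: open-blog-url
#     type: generic_open_url_v1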
</pre></div><p>In a local screen configuration, the generic component can be configured with our DSL:</p><div class="language-kotlin highlighter-rouge highlight"><pre>button(
   text = "Learn more on our blog",
   style = PRIMARY,
   tappedActions = actions {
      openUrl("https://engineeringblog.yelp.com")
  }
)
</pre></div><p>Both produce the same result on the client:</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/building-a-server-driven-foundation-for-mobile-app-development/generic_button.png" alt="Generic button screenshot" class="c1" /></div><h4 id="composing-components">Composing components</h4><p>The BNF has several ways to build larger components from smaller pieces. First, many mobile features can be broken into a vertical stack of simpler components, such as buttons, text, icons, and images. Second, many features can be built by composing components with a container component.</p><p>For example, the Biz App has cards to promote the products and services Yelp offers to businesses. The promotional cards are built from a stack of simpler components and a bordered container (<code class="language-plaintext highlighter-rouge">generic_bordered_container_v1</code>), which contains a feature-specific component for each product (<code class="language-plaintext highlighter-rouge">feature_call_to_action_v1</code>).</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/building-a-server-driven-foundation-for-mobile-app-development/composing_components.png" alt="Composing generic components into a promotional card" class="c4" /></div><p>iOS and Android provide mechanisms to recycle views when scrolling, so using vertical stacks of simpler, reusable components improves scroll performance.</p><p>Initially, we were worried that containers would impact scroll performance and memory consumption, especially with high levels of nesting. But we kept finding designs that benefited from containers. In practice, we don’t nest more than one or two levels. 
On Android, containers are nested <a href="https://developer.android.com/reference/androidx/recyclerview/widget/RecyclerView">RecyclerViews</a> that share a common <a href="https://developer.android.com/reference/androidx/recyclerview/widget/RecyclerView.RecycledViewPool">RecycledViewPool</a>, allowing re-use of simpler components such as text, buttons, and images.</p><h4 id="rendering-components-on-clients">Rendering components on clients</h4><p>On the client, components are rendered with a factory associated with the component type. The same factory handles multiple versions of the same component. We typically have one implementation of each component (<code class="language-plaintext highlighter-rouge">GenericButtonComponent</code>) on the client, and the factory maps the server-driven component data to an internal configuration.</p><div class="language-kotlin highlighter-rouge highlight"><pre>class GenericButtonComponentFactory: GenericComponentFactory {
    // Used by the BNF infrastructure to build a catalog of
    // available &amp; deprecated types
    override val availableTypes = listOf(V1, V2)
    override val deprecatedTypes = listOf(V1)
    override fun create(
        component: GenericComponent,
        data: GenericComponentData
    ) = when (component.type) {
        V1 -&gt; createV1(component.id, data.generic_button_v1)
        V2 -&gt; createV2(component.id, data.generic_button_v2)
        else -&gt; throw IllegalStateException("Unexpected component ${component.type}")
    }
    fun createV1(id: String, data: GenericButtonDataV1): GenericButtonComponent {
        // Convert GenericButtonDataV1 to an internal state
        // Construct &amp; return the GenericButtonComponent
    }
    fun createV2(id: String, data: GenericButtonDataV2): GenericButtonComponent {
        // Convert GenericButtonDataV2 to an internal state
        // Construct &amp; return the GenericButtonComponent
    }
    companion object {
        const val V1 = "generic_button_v1"
        const val V2 = "generic_button_v2"
    }
}
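// The screen renderer can then resolve a component through a registry of
// factories (illustrative only, not Yelp's actual API):
//   val factory = factories.factoryFor(component.type)  // e.g. "generic_button_v1"
//   val rendered = factory.create(component, data)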
</pre></div><h3 id="generic-actions">Generic Actions</h3><p>A generic action is a side effect that occurs when the user interacts with a screen or component. A generic screen or component can trigger actions under any number of conditions, such as when the user views the screen or taps the component.</p><p>Like generic components, every generic action has a unique ID (<code class="language-plaintext highlighter-rouge">open-blog-url</code>) and type (<code class="language-plaintext highlighter-rouge">generic_open_url_v1</code>), and we use naming conventions to distinguish between generic and feature-specific actions (<code class="language-plaintext highlighter-rouge">feature_close_business_v1</code>).</p><p>As with generic components, the BNF was designed to support a rich, extensible library of actions. Here’s a sampling of actions:</p><table><thead><tr><th>Generic Action</th>
<th>Description</th>
</tr></thead><tbody><tr><td>generic_open_url_v1</td>
<td>Opens a deep link, which supports “https”, “tel”, “yelp”, and “yelp-biz” schemes</td>
</tr><tr><td>generic_close_screen_v1</td>
<td>Closes the current screen and opens an optional URL to navigate to the next screen</td>
</tr><tr><td>generic_show_screen_v1</td>
<td>Opens another screen using a nested screen configuration</td>
</tr><tr><td>generic_reconfigure_screen_v1</td>
<td>Reconfigures the current screen with a new screen configuration</td>
</tr><tr><td>generic_update_property_v1</td>
<td>Updates the value of a generic property, which represents a piece of application state</td>
</tr><tr><td>generic_scroll_to_component_v1</td>
<td>Scrolls the screen to a specified component</td>
</tr><tr><td>feature_close_business_v1</td>
<td>Marks the current business as closed, which has a lot of feature-specific side-effects</td>
</tr></tbody></table><h4 id="configuring-actions">Configuring actions</h4><p>In a remote screen configuration, each action type has a corresponding data model:</p><div class="language-yaml highlighter-rouge highlight"><pre>GenericOpenUrlDataV1:
  properties:
    url:
      description: Link to be opened when the action is triggered
      type: string
  required:
  - url
  type: object
</pre></div><p>In a local screen configuration, actions can be configured with the DSL in an actions block:</p><div class="language-kotlin highlighter-rouge highlight"><pre>button(
    text = "Learn more on our blog",
    style = PRIMARY,
    tappedActions = actions {
        openUrl("https://engineeringblog.yelp.com")
    }
)
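
// For illustration only: the actions block is assumed to accept several
// actions in sequence, and closeScreen() is a hypothetical DSL helper
// mirroring generic_close_screen_v1.
button(
    text = "Done",
    style = PRIMARY,
    tappedActions = actions {
        openUrl("https://engineeringblog.yelp.com")
        closeScreen()
    }
)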
</pre></div><h4 id="handling-actions">Handling actions</h4><p>We use an event-based architecture on both iOS and Android to handle user interactions. Generic actions are events, which are either Swift structs or Kotlin data classes. For example, on Android, we have an <code class="language-plaintext highlighter-rouge">OpenUrlEvent</code> to model <code class="language-plaintext highlighter-rouge">generic_open_url_v1</code> in remote screen configurations.</p><div class="language-kotlin highlighter-rouge highlight"><pre>data class OpenUrlEvent(val url: String): GenericScreenEvent()
</pre></div><p>Android uses a Model-View-Intent (MVI) architecture where components publish events (intents) to a shared event bus. When the user taps a component, the component will publish its <code class="language-plaintext highlighter-rouge">tappedActions</code>.</p><div class="language-kotlin highlighter-rouge highlight"><pre>class GenericButtonComponentViewHolder :
    GenericComponentViewHolder&lt;GenericButtonComponentState&gt;(
        R.layout.view_generic_button_component
    )
{
    lateinit var tappedActions: List&lt;GenericScreenEvent&gt;
    private val button by clickView&lt;GenericButton&gt;(R.id.button) {
        eventBus.sendEvents(tappedActions)
    }
    override fun bind(state: GenericButtonComponentState) {
        button.configure(state)
        tappedActions = state.tappedActions
    }
}
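
// A minimal sketch of the event bus contract the view holder relies on —
// not Yelp's actual implementation, just the shape implied above:
class SimpleEventBus {
    private val handlers = mutableListOf&lt;(GenericScreenEvent) -&gt; Unit&gt;()

    // Fan each published event out to every registered intent handler
    fun sendEvents(events: List&lt;GenericScreenEvent&gt;) {
        events.forEach { event -&gt; handlers.forEach { it(event) } }
    }

    fun register(handler: (GenericScreenEvent) -&gt; Unit) {
        handlers += handler
    }
}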
</pre></div><p>The event will be delivered to a matching intent handler that knows how to process the user’s intent and update the UI state.</p><div class="language-kotlin highlighter-rouge highlight"><pre>class NavigationIntentHandler: GenericScreenIntentHandler() {
    @Event(OpenUrlEvent::class)
    fun handleOpenUrl(event: OpenUrlEvent) {
        with(event.url) {
            when {
                startsWith("tel:") -&gt; openTelLink(this)
                startsWith("https:") -&gt; openSecureHttpLink(this)
                startsWith("http:") -&gt; openUnsecureHttpLink(this)
                startsWith("yelp-biz:") -&gt; openCustomLink(this)
                else -&gt; reportUnsupportedLinkError(this)
            }
        }
    }
}
</pre></div><h3 id="generic-properties">Generic Properties</h3><p>Most UIs are dynamic; they need to respond to user interactions and changes in application state. For example, businesses can exchange messages with their customers, and we want to show the number of unread messages as a badge component.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/building-a-server-driven-foundation-for-mobile-app-development/generic_property.png" alt="A navigation component with a badge showing the number of unread messages" class="c1" /></div><h4 id="modeling-generic-properties">Modeling generic properties</h4><p>We represent a generic property using a dot-separated hierarchical path and an associated data type:</p><div class="language-plaintext highlighter-rouge highlight"><pre>businesses.{business_id}.inbox.messages.unread.count&lt;integer&gt;
</pre></div><p>A generic property can have path parameters that provide additional context. For example, each business has a separate inbox, so the <code class="language-plaintext highlighter-rouge">{business_id}</code> parameter corresponds to the unique business ID.</p><p>To a generic component, a generic property is just a strongly-typed variable that it can read or write. The generic component doesn’t know the meaning of the data (the number of unread messages) or how the data is stored or updated.</p><p>In a remote screen configuration, we use the <code class="language-plaintext highlighter-rouge">GenericProperty</code> object to model properties.</p><div class="language-yaml highlighter-rouge highlight"><pre>GenericProperty:
    properties:
      path:
        description: A dot-separated hierarchical path for the property
        type: string
      type:
        description: Represents the property type
        type: string
    required:
    - path
    - type
    type: object
</pre></div><h4 id="supporting-generic-properties">Supporting generic properties</h4><p>Each generic property has a generic property manager that handles reads and writes.</p><p>On Android, we resolve a generic property into an <a href="http://reactivex.io/RxJava/3.x/javadoc/io/reactivex/rxjava3/core/Observable.html">RxJava Observable</a> backed by a <a href="http://reactivex.io/RxJava/3.x/javadoc/io/reactivex/rxjava3/subjects/BehaviorSubject.html">BehaviorSubject</a>, which remembers the latest value. A generic component subscribes to the <code class="language-plaintext highlighter-rouge">Observable</code> to receive new values and update its view.</p><div class="language-kotlin highlighter-rouge highlight"><pre>class BusinessInboxPropertyManager: GenericPropertyManager&lt;Int&gt; {
    val inboxPropertyDefinition =
        GenericPropertyDefinition(
            "businesses.{business_id}.inbox.messages.unread.count"
        )
    override val properties = listOf(inboxPropertyDefinition)
    private val subjectMap = mutableMapOf&lt;String, BehaviorSubject&lt;Int&gt;&gt;()
    override fun get(path: String): Observable&lt;Int&gt; {
        return getOrCreateSubject(path).hide()
    }
    override fun set(path: String, value: Int) {
        getOrCreateSubject(path).onNext(value)
    }
    private fun getOrCreateSubject(path: String): BehaviorSubject&lt;Int&gt; {
        return subjectMap[path] ?: BehaviorSubject.create&lt;Int&gt;().also {
            subjectMap[path] = it
        }
    }
}
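
// Usage sketch (business ID 42 is illustrative): because each path is
// backed by a BehaviorSubject, a value written before subscription is
// replayed to the subscriber immediately.
val manager = BusinessInboxPropertyManager()
manager.set("businesses.42.inbox.messages.unread.count", 3)
manager.get("businesses.42.inbox.messages.unread.count")
    .subscribe { count -&gt; println("unread = $count") } // prints unread = 3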
</pre></div><h4 id="building-dynamic-components-with-generic-properties">Building dynamic components with generic properties</h4><p>The BNF supports a <code class="language-plaintext highlighter-rouge">generic_badge_v1</code> component that represents a basic badge with a dynamic count using a generic property.</p><div class="language-yaml highlighter-rouge highlight"><pre>GenericBadgeDataV1:
    properties:
      dynamic_count:
        $ref: '#/definitions/GenericProperty'
    required:
    - dynamic_count
    type: object
</pre></div><p>On Android, we map the generic property to an <code class="language-plaintext highlighter-rouge">Observable&lt;Int&gt;</code> in the component’s MVI state.</p><div class="language-kotlin highlighter-rouge highlight"><pre>// The BadgeComponent’s MVI state stores an Observable
data class BadgeComponentState(
   val dynamicCount: Observable&lt;Int&gt;,
   @ColorRes val color: Int = R.color.red
)
// The BadgeComponentFactory resolves a generic property into
// the Observable required by the MVI state using the
// GenericProperties registry.
fun createBadgeComponentState(data: GenericBadgeDataV1)
   = BadgeComponentState(
         dynamicCount = GenericProperties.get(data.dynamicCount.path)
     )
</pre></div><p>The component subscribes to the <code class="language-plaintext highlighter-rouge">Observable&lt;Int&gt;</code> and updates the badge to reflect the current count.</p><div class="language-kotlin highlighter-rouge highlight"><pre>// The BadgeComponent subscribes to the Observable
state.dynamicCount
  .doOnSubscribe {
      // Keep the badge invisible until we have the first count
      badgeView.isVisible = false
  }
  .doOnNext {
      // Update the value of the badge!
      badgeView.value = it
      // Don’t show the badge unless there’s a non-zero count
      badgeView.isVisible = (it &gt; 0)
  }
  .doOnError {
      // If there’s an error, hide the badge
      badgeView.isVisible = false
  }
  .subscribe()
  .autodispose()
</pre></div><p>We’re still experimenting with generic properties and refining the use cases. We believe they are a necessary concept to unlock dynamic, server-driven UIs.</p><h2 id="current-use-cases">Current Use Cases</h2><p>We’re using the BNF to power the Home, Yelp Ads, Business Info, and More tabs. These tabs are remotely configured screens because they host dynamic, personalized content. The Yelp Ads tab hosts the <a href="https://blog.yelp.com/2021/05/yelp-releases-new-yelp-for-business-features-enabling-more-effective-advertising-and-adding-control-and-value-for-business-owners">Ads Dashboard</a> screen, which was the first screen built entirely from scratch using the BNF. We’ll share more about this in a future blog post; stay tuned!</p><div class="c2"><img src="https://blog.yelp.com/wp-content/uploads/2021/05/mobile-ads-dash-PR.gif" alt="Ads Dashboard screenshot" class="c3" /></div><p>We’re also using the BNF to power several in-product marketing screens. These screens are usually remotely configured to give our marketing and product teams more direct control, but sometimes we build them locally first using our screen configuration DSL.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/building-a-server-driven-foundation-for-mobile-app-development/in_product_marketing.png" alt="Dynamic in-product marketing screens" class="c4" /></div><p>Finally, we’re using the BNF to build debug screens to prototype new designs or test individual generic components, actions, or properties. 
These screens are locally configured with our screen configuration DSL.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/building-a-server-driven-foundation-for-mobile-app-development/debug.png" alt="Debug screens built with our DSL" class="c3" /></div><h2 id="future-directions">Future Directions</h2><h3 id="building-a-better-backend">Building a Better Backend</h3><p>SDUI pushes more of the business and presentation logic into the backend. Most backend engineers aren’t familiar with mobile app development or building mobile UIs. Consequently, they need better infrastructure for making, testing, and deploying their changes. We also need better tooling for our marketing and product teams to make changes.</p><h3 id="adopting-swiftui-and-jetpack-compose">Adopting SwiftUI and Jetpack Compose</h3><p>One of our design principles is “Adopting Best Practices.” We’ve therefore watched the evolution of <a href="https://developer.apple.com/xcode/swiftui/">SwiftUI</a> and <a href="https://developer.android.com/jetpack/compose">Jetpack Compose</a> with great interest. Both frameworks support building composable, dynamic UIs with a simple declarative syntax. We hope to adopt these new frameworks in the near future.</p><h3 id="adopting-graphql">Adopting GraphQL</h3><p>Yelp is currently migrating our web and mobile apps from individual REST APIs to a unified <a href="https://graphql.org/">GraphQL</a> schema. We’re planning to migrate the BNF to GraphQL, which offers better support than REST for making changes without breaking backwards compatibility. Mobile clients must write explicit GraphQL queries that describe the types and fields they support. With our REST API, we are frequently creating new versions of entire objects (<code class="language-plaintext highlighter-rouge">GenericButtonDataV7</code>) or APIs just to add one field safely. 
With GraphQL, we can evolve our schema incrementally.</p><h3 id="building-a-yelp-native-foundation">Building a Yelp Native Foundation</h3><p>Our Consumer App and Biz App handle separate sides of the same transaction – connecting consumers to great local businesses. In many cases, building a new feature requires changes in both apps. For example, when the Biz App added features for businesses to provide <a href="https://blog.yelp.com/2020/12/covid-related-updates-for-your-yelp-page#Edit-your-COVID-19-Advisory-Alert">COVID-related updates</a> to consumers, the Consumer App added corresponding features for consumers to see those updates.</p><p>When we started the BNF in 2019, product teams working on the Consumer App were also starting a shared server-driven foundation for similar reasons. Unfortunately, the Biz App and Consumer App had different REST APIs and separate Git repositories. We made the practical decision to share ideas and techniques but not code. Now we’re slowly moving towards a common Yelp Native Foundation by migrating to a unified GraphQL schema and adopting monorepos.</p><p>We’re very excited about the future of SDUI at Yelp and in the industry as a whole. Many companies, such as <a href="https://medium.com/airbnb-engineering/a-deep-dive-into-airbnbs-server-driven-ui-system-842244c5f5">Airbnb</a> and <a href="https://doordash.engineering/2021/08/24/improving-development-velocity-with-generic-server-driven-ui-components/">Doordash</a>, have recently published the details of their own shared, server-driven foundations, and there are open-source efforts, such as <a href="https://github.com/ZupIT/beagle">Beagle</a>. We’ve noticed many similarities between our work and these projects, which suggests there are some natural design patterns for implementing SDUI. We hope this blog post contributes to the growing SDUI community. 
Keep an eye on this blog for updates on our progress!</p><div class="island job-posting"><h3>Become a Mobile Software Engineer at Yelp</h3><p>Want to help us grow our mobile foundation on iOS?</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/50448189-a770-4214-8f7c-407798d7707f?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/11/building-a-server-driven-foundation-for-mobile-app-development.html</link>
      <guid>https://engineeringblog.yelp.com/2021/11/building-a-server-driven-foundation-for-mobile-app-development.html</guid>
      <pubDate>Tue, 30 Nov 2021 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Awesome Women in Engineering Hosts its First Virtual Summit]]></title>
      <description><![CDATA[<div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-10-25-awesome-women-in-engineering-hosts-its-first-virtual-summit/summit_poster.png" alt="" /></div><p>Yelp’s employee resource group for women in engineering, <a href="https://www.yelp.com/engineering/awe">Awesome Women in Engineering (AWE)</a>, recently held its first virtual summit! The summit was designed for women and allies at Yelp to learn, network, and have fun. AWE started in 2013 with a mission to build a strong community for women and allies at Yelp by facilitating professional career-building activities, networking, leadership, and mentorship opportunities. As a resource group, we provide support and organize activities targeted towards professional growth for women engineers, helping them to maximize their potential at Yelp and beyond. We are excited to share the different activities that helped make this a successful event.</p><h2 id="everything-was-perfect-working-at-a-company-which-supports-events-hosted-by-women-and-with-many-women-as-speakers-is-amazing---thais-a-software-engineer">“Everything was perfect. Working at a company which supports events hosted by women, and with many women as speakers is amazing!” - Thais A., Software Engineer</h2><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-10-25-awesome-women-in-engineering-hosts-its-first-virtual-summit/MeetOurSpeakers.png" alt="Our speakers for the summit" /><p class="subtle-text"><small>Our speakers for the summit</small></p></div><p>We’d previously hosted a <a href="https://engineeringblog.yelp.com/2019/10/first-awe-summit-sf.html">similar summit</a> in our San Francisco office, but this summit was 100% virtual as we’ve since moved to a more distributed work environment. This enabled us to have events at times accessible to our distributed teams either in Europe or North America. 
We hosted several events ranging from technical talks to networking sessions to workshops, giving women and allies the opportunity to share their experiences and learn from the experiences of others.</p><h2 id="i-got-to-meet-awesome-women-that-i-dont-interact-with-often---maoreen-m-technical-sourcer">“I got to meet awesome women that I don’t interact with often” - Maoreen M., Technical Sourcer</h2><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-10-25-awesome-women-in-engineering-hosts-its-first-virtual-summit/keynote_screenshot.png" alt="Miriam leading the keynote speech" /><p class="subtle-text"><small>Miriam leading the keynote speech</small></p></div><p>A highlight of this summit was the keynote speech given by Miriam Warren, Yelp’s Chief Diversity Officer. Miriam spoke about her journey at Yelp, building and empowering communities, demystifying networking, and knowing your story. It was also fascinating to hear about her journey joining nonprofit boards and the ways these experiences helped her grow her career and learn from people in other industries.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-10-25-awesome-women-in-engineering-hosts-its-first-virtual-summit/career_panel.png" alt="A panel discussion about career growth" /><p class="subtle-text"><small>A panel discussion about career growth</small></p></div><p>Many other members of AWE also gave talks. Some of those talks were focused on technical learning. For example, we heard about statistical thinking, the math used in our ads algorithms, and measuring product success. Some talks were centered more on the role of diversity in our work, such as creating an accessible product, reducing biases in algorithms, and diversity in recruiting and hiring. 
Other talks were geared towards career growth where we heard from women in various roles about their journeys.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-10-25-awesome-women-in-engineering-hosts-its-first-virtual-summit/gather_town.png" alt="Our virtual networking event was hosted on gather.town" /><p class="subtle-text"><small>Our virtual networking event was hosted on gather.town</small></p></div><p>The summit also incorporated interactive events. We hosted two ally skills workshops, which took participants through real-world scenarios and consisted of group discussions about how to act as an ally in each situation. There was also a technical workshop that covered the basics of machine learning followed by an interactive session where everyone built a basic model. Lastly, we had a virtual networking session where participants were able to meet new people and get to know each other through icebreaker questions.</p><p>The summit was an amazing opportunity for women and allies to build deeper connections, learn from each others’ experiences, and feel empowered to always be our most authentic selves. We’re proud to have done this event in a distributed environment and plan to look back at what worked and what didn’t for participants so we can do it again in the future, while continuing to inspire women through AWE’s many other initiatives.</p><p>Acknowledgements: Dorothy Jung, Chie Shu, Trisha Walsh, and Grace Yuan</p><div class="island job-posting"><h3>Interested in joining the awesome women in engineering and product at Yelp?</h3><p>We're hiring! Check out our Careers page for more open positions.</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/10/awesome-women-in-engineering-hosts-its-first-virtual-summit.html</link>
      <guid>https://engineeringblog.yelp.com/2021/10/awesome-women-in-engineering-hosts-its-first-virtual-summit.html</guid>
      <pubDate>Mon, 25 Oct 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Nrtsearch: Yelp’s Fast, Scalable and Cost Effective Search Engine]]></title>
      <description><![CDATA[<p><a href="https://engineeringblog.yelp.com/2021/09/nrtsearch-yelps-fast-scalable-and-cost-effective-search-engine.html">Nrtsearch: Yelp’s Fast, Scalable and Cost Effective Search Engine</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/09/nrtsearch-yelps-fast-scalable-and-cost-effective-search-engine.html</link>
      <guid>https://engineeringblog.yelp.com/2021/09/nrtsearch-yelps-fast-scalable-and-cost-effective-search-engine.html</guid>
      <pubDate>Tue, 21 Sep 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Engineering Career Series: Building a thriving engineering team]]></title>
      <description><![CDATA[<p>This post brings our Engineering Career Series to an end. I hope you’ve enjoyed reading it as much as we’ve enjoyed sharing Yelp’s philosophy on building engineering careers in a thoughtful, equitable, and enjoyable way.</p><p>As the series has shown, building a thriving engineering team requires ongoing investment in people and in processes. It requires you to recognize and acknowledge your successes and failures, and continue to iterate and improve. There are no quick fixes and the job is never truly done, but the rewards of improving are huge, for the individuals and for the success of your company as a whole.</p><p>What we’ve tried to share with you during this series is not that we’re perfect and that we have all the answers. Instead we wanted to give you some idea of what the journey has been like to get where we are now, and to be open about some of the challenges along the way that you may also encounter in your engineering career – whether you’re an engineer, a technical leader, or a manager.</p><p>If there’s one thing I’d like you to take away from the series, it’s that this is <em>worth the effort</em>. There are concrete steps you can take as leaders that will change your engineering culture for the better, and there are contributions that anyone involved in engineering can make that will make people’s careers (and lives) happier, fairer, and more successful.</p><p>At Yelp, we’re committed to giving the resources to everyone involved to keep making these efforts, to continuously improve our engineering culture and the experience of everyone who works here. 
We’d love to welcome anyone else who is as passionate about creating a diverse and inclusive engineering team to <a href="http://www.yelp.com/careers">join us</a>, or simply to get in touch and share your experiences.</p><p>If you’ve not read the rest of the series, here’s a quick recap:</p><h3 id="hiring-a-diverse-team-reducing-bias-in-engineering-interviews"><a href="https://engineeringblog.yelp.com/2021/04/engineering-career-series-hiring-a-diverse-team-by-reducing-bias.html">Hiring a diverse team: reducing bias in engineering interviews</a></h3><p>How Yelp has approached hiring over the years, and the major lessons we learned in how to reduce bias.</p><h3 id="using-structured-interviews-to-improve-equity"><a href="https://engineeringblog.yelp.com/2021/05/engineering-career-series-using-structured-interviews-to-improve-equity.html">Using structured interviews to improve equity</a></h3><p>A key change to our interview process improved equity of outcomes considerably.</p><h3 id="how-we-onboard-engineers-across-the-world-at-yelp"><a href="https://engineeringblog.yelp.com/2021/05/engineering-career-series-how-we-onboard-engineers-across-the-world-at-yelp.html">How we onboard engineers across the world at Yelp</a></h3><p>Once you’ve hired someone amazing, you need to set them up for success on day one.</p><h3 id="career-paths-for-engineers-at-yelp"><a href="https://engineeringblog.yelp.com/2021/06/engineering-career-series-career-paths-for-engineers-at-yelp.html">Career paths for engineers at Yelp</a></h3><p>How we designed and redesigned our framework for career growth and levelling, and how that shift increased fairness and equity.</p><h3 id="technical-leadership-at-yelp"><a href="https://engineeringblog.yelp.com/2021/06/engineering-career-series-technical-leadership-at-yelp.html">Technical leadership at Yelp</a></h3><p>Why we approach technical leadership as a role you can choose to take on at Yelp, rather than just a level within our career path framework.</p><h3 
id="how-yelp-approaches-engineering-management"><a href="https://engineeringblog.yelp.com/2021/07/engineering-career-series-how-we-think-about-engineering-management.html">How Yelp approaches engineering management</a></h3><p>What “success” looks like for managers at Yelp, how we hire them, what we ask them to do and to value, and how we’ve built this into the career path for managers.</p><h3 id="ensuring-pay-equity--career-progression-in-yelp-engineering"><a href="https://engineeringblog.yelp.com/2021/07/engineering-career-series-ensuring-pay-equity-and-career-progression-in-yelp-engineering.html">Ensuring pay equity &amp; career progression in Yelp Engineering</a></h3><p>What we learnt from committing to publishing our analysis of pay equity and career progression to all of engineering annually, no matter what the results.</p><h3 id="fostering-inclusion--belonging-within-yelp-engineering"><a href="https://engineeringblog.yelp.com/2021/07/engineering-career-series-fostering-inclusion-and-belonging-within-yelp-engineering.html">Fostering inclusion &amp; belonging within Yelp Engineering</a></h3><p>Improving inclusion and belonging requires you to provide for teams and groups in many different ways. We designed systems and processes that give people the support they need in the time, place and manner they need it.</p><div class="post-gray-box">This post is part of a series covering how we're building a happy, diverse, and inclusive engineering team at Yelp, including details on how we approached the various challenges along the way, what we've tried, and what worked and didn't.<p>Read the other posts in the series:</p><ul><li><a title="Engineering Career Series: Building a happy, diverse, and inclusive engineering team" href="https://engineeringblog.yelp.com/2021/04/engineering-career-series-building-a-happy-diverse-and-inclusive-engineering-team.html">Building a happy, diverse, and inclusive engineering team</a></li>
<li><a title="Engineering Career Series: Hiring a diverse team by reducing bias" href="https://engineeringblog.yelp.com/2021/04/engineering-career-series-hiring-a-diverse-team-by-reducing-bias.html">Hiring a diverse team by reducing bias</a></li>
<li><a title="Engineering Career Series: Using structured interviews to improve equity" href="https://engineeringblog.yelp.com/2021/05/engineering-career-series-using-structured-interviews-to-improve-equity.html">Using structured interviews to improve equity</a></li>
<li><a title="Engineering Career Series: How we onboard engineers across the world at Yelp" href="https://engineeringblog.yelp.com/2021/05/engineering-career-series-how-we-onboard-engineers-across-the-world-at-yelp.html">How we onboard engineers across the world at Yelp</a></li>
<li><a title="Engineering Career Series: Career paths for engineers at Yelp" href="https://engineeringblog.yelp.com/2021/06/engineering-career-series-career-paths-for-engineers-at-yelp.html">Career paths for engineers at Yelp</a></li>
<li><a title="Engineering Career Series: Technical Leadership at Yelp" href="https://engineeringblog.yelp.com/2021/06/engineering-career-series-technical-leadership-at-yelp.html">Technical Leadership at Yelp</a></li>
<li><a title="Engineering Career Series: How we think about engineering management" href="https://engineeringblog.yelp.com/2021/07/engineering-career-series-how-we-think-about-engineering-management.html">How we think about engineering management</a></li>
<li><a title="Engineering Career Series: Ensuring Pay Equity &amp; Career Progression in Yelp Engineering" href="https://engineeringblog.yelp.com/2021/07/engineering-career-series-ensuring-pay-equity-and-career-progression-in-yelp-engineering.html">Ensuring Pay Equity &amp; Career Progression in Yelp Engineering</a></li>
<li><a title="Engineering Career Series: Fostering inclusion &amp; belonging within Yelp Engineering" href="https://engineeringblog.yelp.com/2021/07/engineering-career-series-fostering-inclusion-and-belonging-within-yelp-engineering.html">Fostering inclusion &amp; belonging within Yelp Engineering</a></li>
<li><a title="Engineering Career Series: Building a thriving engineering team" href="https://engineeringblog.yelp.com/2021/08/engineering-career-series-building-a-thriving-engineering-team.html">Building a thriving engineering team</a></li>
</ul></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/08/engineering-career-series-building-a-thriving-engineering-team.html</link>
      <guid>https://engineeringblog.yelp.com/2021/08/engineering-career-series-building-a-thriving-engineering-team.html</guid>
      <pubDate>Thu, 12 Aug 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Engineering Career Series: Fostering inclusion & belonging within Yelp Engineering]]></title>
      <description><![CDATA[<p><a href="https://engineeringblog.yelp.com/2021/04/engineering-career-series-hiring-a-diverse-team-by-reducing-bias.html">Recruiting</a>, <a href="https://engineeringblog.yelp.com/2021/05/engineering-career-series-using-structured-interviews-to-improve-equity.html">hiring</a>, and <a href="https://engineeringblog.yelp.com/2021/05/engineering-career-series-how-we-onboard-engineers-across-the-world-at-yelp.html">onboarding</a> new employees in Engineering at Yelp is a multi-team, cross-functional effort as we have laid out in our Career Series blog posts. But once people are here, how do we retain them? While <a href="https://engineeringblog.yelp.com/2021/06/engineering-career-series-career-paths-for-engineers-at-yelp.html">career advancement</a>, <a href="https://engineeringblog.yelp.com/2021/06/engineering-career-series-technical-leadership-at-yelp.html">technical leadership</a>, and <a href="https://engineeringblog.yelp.com/2021/07/engineering-career-series-ensuring-pay-equity-and-career-progression-in-yelp-engineering.html">pay equity</a> are all important components to building a happy engineering team, we believe fostering inclusion and belonging is also a fundamental component in supporting, and thus retaining, people. While this is an area that’s received a lot of recent attention in the tech industry, we’ve prioritized inclusion and belonging for many years because we want all of our colleagues to feel like an integral part of our team and share their unique perspectives.</p><p>In this post, we’ll discuss some of the building blocks that make up our inclusion and belonging programs, many of which were developed in partnership with Yelp’s Culture team.</p><p><strong>Employee Resource Groups</strong></p><p>One of the ways we support belonging is through Yelp Employee Resource Groups (YERGs), which are groups of employees that come together to support each other and other employees by way of community, programming, and events. 
The groups can be formed around shared social identities, characteristics, or life experiences. Yelp has <a href="https://www.yelp.careers/us/en/culture-at-yelp">many YERGs</a> including YelpCares (community, non-profit volunteering), YelpParents, Women at Yelp (WAY), VetConnect, and Yelp Asian Pacific Islanders (YAPI). Three of our YERGs were started by members of our Engineering team: Awesome Women in Engineering (AWE), ColorCoded, and Neurodiversity &amp; Mental Health.</p><p>Each YERG is led by several employees who facilitate programming and support the group. We also use an executive sponsorship model for all of our YERGs, where a senior leader provides mentorship, guidance, and connections across departments; removes any blockers the group may face as they run their programming; and works with the leads to champion and promote the group company-wide.</p><p><strong>Awesome Women in Engineering (AWE)</strong></p><p>AWE started as a social group in April 2013 before employee resource groups came into existence at Yelp. The founding leaders of AWE organized several activities like networking lunches, book clubs, and public speaking workshops, and coordinated with Yelp’s Recruiting team to send AWE members to represent Yelp at external events (e.g., the <a href="https://ghc.anitab.org/about/">Grace Hopper Conference</a>). The next phase was to build a stronger community of women engineers at Yelp.</p><p>As a resource group, AWE provides support for and organizes activities targeted towards professional growth for women engineers and allies, helping maximize their potential at Yelp and beyond. 
AWE has grown considerably these last eight years and offers programs focused on being champions for women in Engineering, public speaking, internal and external networking, allyship, mentorship, and hosting internal events.</p><p>AWE and our other YERGs provide avenues for engineers to take on leadership opportunities by coordinating an event, facilitating a discussion about a book, or becoming a program lead. YERGs allow engineers to work on these skills in a safe and supportive environment with a focus on growth instead of perfection.</p><p>As a result of our remote work environment over the last year, AWE has transitioned to hosting its events virtually. This has allowed employees across time zones and countries to join the group and participate in events they could not have attended previously. As we continue supporting engineers working in multiple time zones, we intend to continue making programming available virtually.</p><p><strong>ColorCoded</strong></p><p>Back in 2016, a few Yelp engineers in San Francisco started ColorCoded as a social group with the goal of supporting engineers of color at Yelp. Over the last five years, ColorCoded has grown to become one of Yelp’s employee resource groups, cultivating a community of engineers of color and their allies at Yelp. The group’s executive sponsor, employee leadership team, and members work in partnership to provide professional development and leadership activities, networking events, and community engagement opportunities.</p><p>Before the COVID-19 pandemic, ColorCoded organized various in-person activities in San Francisco, such as résumé workshops with Bay Area nonprofits, employee panel discussions, lunch book discussions, and more. With the onset of the pandemic, transition to remote work, and the Black Lives Matter movement in 2020, ColorCoded shifted programming to better meet the needs of our community members and expanded our reach to include more members from other Yelp offices. 
Five programs were established: Community Check-Ins, Race Matters, Virtual Happy Hours, Ally Skills Workshops, and Community Voices. Race Matters is a monthly discussion series where Yelp employees learn and discuss the historical context of racism and how racism affects Black, Indigenous, and People of Color (BIPOC) communities in the United States, and we’re hoping to expand this programming to cover the historical context of other countries where we have employees in the future. Community Check-Ins are another monthly discussion series where members gather together and discuss current events.</p><p>At times, ColorCoded also partners with other employee resource groups, such as Awesome Women in Engineering (AWE) and Yelp Asian Pacific Islanders (YAPI), to put on events together.</p><p><strong>Neurodiversity and Mental Health</strong></p><p>Neurodiversity is a movement championing the premise that autism and other conditions like attention-deficit/hyperactivity disorder, dyslexia, anxiety, post-traumatic stress disorder, dyscalculia, and apraxia are normal variations of the human brain and thought process. As natural variations, these differences should be celebrated and supported.</p><p>This recently created YERG is made up of employees who are neurodiverse, have diagnosed or undiagnosed mental health conditions, care about their mental health, and/or are allies to these individuals. The group works to create a more inclusive environment for neurodiverse individuals and individuals with mental health conditions. Though starting within Engineering, the group now has representation from departments across Yelp.</p><p>Our most successful event to date was an open roundtable discussion towards the beginning of the pandemic. The adjustment to regional lockdowns brought an additional focus on mental health and how best to support each other. In the roundtable event, we welcomed employees to discuss how they were dealing with the transition. 
We are currently planning a panel with a few speakers to share their experiences at Yelp, incorporating neurodiversity in our existing diversity training, working on new training for managers, and raising awareness about existing tools Yelp provides to employees to foster wellness.</p><p><strong>Work-life balance</strong></p><p>Historically, Yelp Engineering leaders have championed work-life balance and have long valued the well-being of their teams. This is reflected in our <a href="https://engineeringblog.yelp.com/2021/06/engineering-career-series-career-paths-for-engineers-at-yelp.html">career leveling rubric</a>, with a dimension dedicated to sustaining and improving the well-being of our colleagues, as well as an expectation of our <a href="https://engineeringblog.yelp.com/2021/07/engineering-career-series-how-we-think-about-engineering-management.html">engineering managers</a>. In response to the pandemic, we implemented new policies to best support work-life balance for our employees in Engineering.</p><p>The first focuses on offering flexibility around when you work. Employees living in different time zones with different schedules shouldn’t need to fully align all of their working hours. Within Engineering, we’ve implemented a flexible working policy that introduces the concept of “core hours,” observed from 11am to 3pm in one’s local time, with the balance of the day’s hours worked earlier or later. However, even these core hours are flexible and can be adjusted to accommodate unique needs of individuals and teams, such as a parent needing to pick up their child from daycare over lunch. 
This practice offers some form of predictability for collaborating teammates and other teams to know when they can expect colleagues to be available while still giving employees the autonomy to set a schedule that works best for them.</p><p>Another new policy we implemented is the option for most full-time employees in Engineering to work 80% of a full-time workload for 80% of their full-time pay, providing engineers another opportunity to adapt their work schedule to suit their current life priorities and preferences.</p><p><strong>Distributed workforce</strong></p><p>The COVID-19 pandemic showed us that we can function as a company with nearly all of our employees working remotely. In some cases, people have reported being more productive without the usual in-office distractions and noise. We also know that for some, especially parents or other caregivers, being home and removing commutes has allowed them to continue to provide care and work full-time.</p><p>Even when offices reopen, <a href="https://blog.yelp.com/2021/05/returning-to-yelp-offices-in-2021">Yelp is giving employees a choice to continue working as a distributed remote workforce</a>, unless their role specifically requires otherwise. A new relocation policy offers clear guidance around relocating to new locations within one’s country or between the countries in which Yelp operates (Canada, Germany, UK, and USA).</p><p><strong>Putting it all together</strong></p><p>Through YERGs and the policies mentioned above, we are making the space and providing the opportunities for folks to bring their full authentic selves to Yelp and have the flexibility to work in a way that works best for them. We are proud of our investments in hiring great people in Engineering and supporting their sense of inclusion and belonging once they have joined us. That said, our work isn’t done. We will continue to evolve and incorporate a multi-faceted approach to inclusion and belonging. 
We will continue to offer training in diversity, equity &amp; inclusion, promote and support YERGs, and find ways, like flexible working arrangements, to support engineers in doing their best work.</p><p><strong>Next up</strong></p><p>Yelp CTO, Sam Eaton, will wrap up our Engineering Career Series. If you’d like to join an organization passionate about inclusion and belonging (or any of the other topics we’ve covered), <a href="https://www.yelp.careers/us/en/c/engineering-jobs">we’re hiring</a>!</p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/07/engineering-career-series-fostering-inclusion-and-belonging-within-yelp-engineering.html</link>
      <guid>https://engineeringblog.yelp.com/2021/07/engineering-career-series-fostering-inclusion-and-belonging-within-yelp-engineering.html</guid>
      <pubDate>Thu, 29 Jul 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Engineering Career Series: Ensuring Pay Equity & Career Progression in Yelp Engineering]]></title>
      <description><![CDATA[<p>At Yelp, we care deeply about ensuring all employees are compensated fairly for their contributions, regardless of their gender, race, and ethnicity. Within Yelp Engineering, we work hard to achieve equal pay for equal work through a combination of tactics:</p><ul><li>Well-defined career levels and corresponding pay bands</li>
<li>A systematic levels calibration process across teams</li>
<li>Transparency of our outcomes with the entire Engineering team</li>
</ul><p>In a <a href="https://engineeringblog.yelp.com/2021/06/engineering-career-series-career-paths-for-engineers-at-yelp.html">previous blog post</a>, we described how we think about career progression and leveling. Each level within Engineering and Engineering Management has an associated merit band, equity band, and cash bonus target based on location. We use our leveling framework to help guide managers to place their employees at a position within those bands. For example, an engineer recently promoted to IC3 would likely fall towards the lower end of the IC3 level framework and pay band.</p><p>In order to ensure managers interpret and apply our career leveling framework consistently, we run calibration conversations on a quarterly basis. Calibration conversations are discussions among management peer groups about performance expectations of individuals on their team. These calibration conversations contribute towards more equitable pay by making sure expectations are consistent across teams.</p><p>Our frameworks and processes would be meaningless if we didn’t closely analyze our compensation data to ensure they are actually working. Within our Engineering org, we have committed to conducting a pay equity analysis annually and sharing the results internally with the entire Engineering team. We’re pleased to share some highlights from our latest analysis below.</p><p>As we look into our data, a few things immediately come to mind. First, the data is a snapshot of a point in time and is not entirely complete as we don’t have demographic data for all employees. Our analysis includes our full-time, individual contributor Engineering employees who have voluntarily provided their race, ethnicity and/or gender information, which is about three-quarters of this population. Second, we don’t expect pay to be identical for all people within a level, as we mentioned above when we talked about how we think about pay. 
Small pay gaps are to be expected due to a number of factors that are unrelated to race, ethnicity, or gender – for example, performance, impact, and growth within level.</p><p>Third, we show gender in terms of women and men because we do not yet have enough data to represent a more nuanced view of gender identity, but we are working on improving our data to represent this more completely in the future. With respect to race and ethnicity, we have combined Black, Hispanic, Native Hawaiian/Other Pacific Islander, and American Indian or Alaska Native employees into an under-represented minority (URM) group in the data due to small sample size.</p><p>So how do we do the analysis? It’s important to understand <em>compa-ratio</em>. Our salary bands were developed by our compensation team and leaders, guided by our pay philosophy and the competitive market landscape. Compa-ratio is computed by dividing each individual’s salary by the middle of the salary band for that role and level (the 50th percentile). We use the median compa-ratio: in the charts below, the red line represents the median compa-ratio of the population at that level, and the gray bars represent quartiles.</p><p>Without further ado, let’s show you some data! First, we’ll start with gender. As you can see from the chart below, the median compa-ratio for men is 100% while women’s is 101%, which means, on average, men and women are paid nearly identically. Men do outnumber women in Engineering, and women’s distribution sits slightly higher.</p><div class="c1"><img src="https://engineeringblog.yelp.com/images/posts/2021-07-15-engineering-career-series-ensuring-pay-equity-and-career-progression-in-yelp-engineering/pay-equity-1.jpg" alt="image" /></div><p>Next, let’s talk about race and ethnicity. 
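The compa-ratio calculation described above can be sketched in a few lines. Note that the salaries and band midpoints below are invented for illustration only; they are not Yelp data.

```python
import numpy as np

# Hypothetical salaries for four employees (invented numbers, not Yelp data).
salaries = np.array([118_000, 125_000, 131_000, 140_000])

# The 50th percentile of the salary band for each person's role and level
# (also invented for this sketch).
band_midpoints = np.array([120_000, 120_000, 135_000, 135_000])

# Compa-ratio: each individual's salary divided by their band's midpoint.
compa_ratios = salaries / band_midpoints

# The "red line" statistic: the median compa-ratio of the population.
median_compa_ratio = np.median(compa_ratios)

# The quartiles that the "gray bars" summarize.
quartiles = np.percentile(compa_ratios, [25, 50, 75])
```

With these invented numbers the median compa-ratio lands just above 1.0, i.e. slightly above the middle of the bands.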
As you’ll see in the chart below, all groups have compa-ratio medians within 2% of each other, and the distributions of employees around the compa-ratio medians appear relatively equal.</p><div class="c1"><img src="https://engineeringblog.yelp.com/images/posts/2021-07-15-engineering-career-series-ensuring-pay-equity-and-career-progression-in-yelp-engineering/pay-equity-2.jpg" alt="image" /></div><p>This is just a high-level snapshot of the analysis we do at Yelp. We also dig deep into level progression to ensure employees progress across our levels at similar rates regardless of gender, race, or ethnicity. As an example, the chart below represents gender by level and tenure. It shows that progression through levels occurs at similar rates for men and women. At 2 years of tenure, the majority of our employees sit at IC1, IC2, or IC3. By 5 years, the majority of employees are at IC3, IC4, IC5, or IC6.</p><div class="c1"><img src="https://engineeringblog.yelp.com/images/posts/2021-07-15-engineering-career-series-ensuring-pay-equity-and-career-progression-in-yelp-engineering/pay-equity-3.jpg" alt="image" /></div><p>As we cut the data, we run into smaller sample sizes that can result in disproportionate differentials. Whenever we find outliers, our leadership team looks at pay and level information on a case-by-case basis to ensure the outlier is due to legitimate, nondiscriminatory reasons like scope of impact, and takes action where needed through out-of-cycle level or pay adjustments to achieve equity and fairness. We’ve learned through this analysis that our framework and methodology have resulted in equitable pay.</p><p>We are always trying to improve our pay equity analysis. We continue to iterate on how we look at total compensation to ensure equitable pay that attracts, motivates, and retains Engineering talent. We also have an opportunity to better report on gender identity in a non-binary way. 
We share our pay equity data with our employees annually and typically review the data twice a year (although with the challenges we faced during the pandemic, we prioritized investments in resources to assist our employees in 2020 and skipped the analysis that year). This ongoing analysis coupled with transparency and communication not only builds trust in our leaders and processes, but also keeps us accountable for our pay practices.</p><p><strong>Up next: Fostering Belonging and Inclusion at Yelp</strong></p><p>Continuing the conversation of gender identity, race, and ethnicity at Yelp, Trisha Walsh, Tenzin Kunsal, and Ian Fijolek will talk about our Employee Resource Groups and efforts made to promote a healthy work/life balance as well as mental health. If you’d like to join a team passionate about pay equity and inclusion, <a href="https://www.yelp.careers/us/en/c/engineering-jobs">we’re hiring</a>!</p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/07/engineering-career-series-ensuring-pay-equity-and-career-progression-in-yelp-engineering.html</link>
      <guid>https://engineeringblog.yelp.com/2021/07/engineering-career-series-ensuring-pay-equity-and-career-progression-in-yelp-engineering.html</guid>
      <pubDate>Thu, 15 Jul 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Why Yelp's hiring strategy in Canada no longer includes offices]]></title>
      <description><![CDATA[<p>When Yelp first started building engineering and product teams in Canada in 2019, our plan was to create a workforce based out of our Toronto office. Over the past year as we adapted to being an entirely remote workforce we realized, like many companies, that people don’t need to work in offices to be collaborative and successful. In fact, through remote work surveys sent to our employees, we found that most people are happier and more productive when they have the option to work remotely.</p><p>We’re now hiring engineering and product roles as fully remote in Canada, as well as in all of our locations across North America and Europe. We plan to <a href="https://blog.yelp.com/2021/05/returning-to-yelp-offices-in-2021">open our offices</a> worldwide this year, allowing employees to decide how many days per week, if any, they’d like to work from an office. As we continue to grow while working remotely, we’ve remained focused on how to best support employees. In addition to our <a href="https://www.yelp.careers/us/en/benefits-at-yelp-in-canada">standard benefits</a>, we’re offering a $100 monthly reimbursement and a one-time payment of $450 to support the costs of working from home.</p><p><strong>Growing as a distributed workforce</strong></p><p>The freedom to work from anywhere within the <a href="https://www.yelp.careers/us/en">locations we hire in</a> — including Ontario, British Columbia, Quebec, and Alberta — has allowed us to reach a wider pool of individuals from a broader variety of backgrounds. 
Since about half of our global technical hires will be based in Canada this year, we’re excited to bring our engineering and product opportunities to local communities and welcome more employees with diverse experiences.</p><p>As Yelp’s technical teams become increasingly distributed, we’re being intentional about creating a culture where everyone can maintain a healthy work-life balance and have equal opportunities for impact, growth, and success. We’re taking a close look at our communication styles and creating best practices for collaborating across time zones. We’re also enabling people to make connections both inside and outside of their own organizations, as well as continuing to provide valuable mentorship opportunities. For example, we host social events for our new hires, provide a dedicated mentor matching program, and encourage participation in <a href="https://engineeringblog.yelp.com/2017/02/open-sourcing-yelp-beans.html">Yelp Beans</a> — an internal tool we use to help employees meet colleagues within the company.</p><p>Employees in all Yelp locations have the support of our many Employee Resource Groups (ERGs) to help make meaningful connections. These include ERGs focused on our employees in the engineering and product space, such as <a href="https://www.yelp.com/engineering/awe">Awesome Women in Engineering</a>, Women in Product, and ColorCoded, just to name a few. Our goal is to enable all employees to bring their authentic selves to work and to be successful, regardless of their location or background.</p><p><strong>Bringing together diverse cultures to build something greater</strong></p><p>Since Yelp began seeking technical talent in Canada, our goal has been to create a workforce that reflects the demographics of the Canadian population. By increasing the locations people can choose to work from, we’re able to create an even more diverse organization that brings new expertise to help us solve increasingly complex challenges. 
We’re focused on proactively growing and cultivating an employee community based on a variety of backgrounds, talents, and perspectives. To achieve our goals, our technical talent team partners closely with our engineering and product teams to ensure we’re <a href="https://engineeringblog.yelp.com/2021/04/engineering-career-series-building-a-happy-diverse-and-inclusive-engineering-team.html">building happy, diverse, and inclusive teams</a>, <a href="https://engineeringblog.yelp.com/2021/04/engineering-career-series-hiring-a-diverse-team-by-reducing-bias.html">hiring a diverse team by reducing bias</a>, and using structured <a href="https://engineeringblog.yelp.com/2021/05/engineering-career-series-using-structured-interviews-to-improve-equity.html">interviews</a> and <a href="https://engineeringblog.yelp.com/2021/06/engineering-career-series-career-paths-for-engineers-at-yelp.html">promotion practices</a> to improve equity.</p><p>In our technical hiring, we’ve set a goal to exceed Canada’s national average with regards to the representation of women and underrepresented minorities in the tech community. Between Q4 2020 and Q2 2021, we’ve consistently met these goals in our technical recruiting efforts, including hiring a higher percentage of underrepresented minorities than is represented across the entire country of Canada, <a href="https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/page.cfm?Lang=E&amp;Geo1=PR&amp;Code1=01&amp;Geo2=PR&amp;Code2=01&amp;Data=Count&amp;SearchText=canada&amp;SearchType=Begins&amp;SearchPR=01&amp;B1=All&amp;TABID=1">according to their 2016 census</a>. Our technical recruiting team, itself a group of individuals representing a variety of backgrounds, is passionate about increasing the representation of underrepresented groups in tech — not only because it’s a proven smart business move, but also because it’s morally the right thing to do.</p><p><strong>Sound like a fit? 
We’d like to get to know you.</strong></p><p>Yelp is looking for Product Managers, Software Engineers, Engineering Managers, Data Scientists, Business System Analysts, Product Designers, and more to join our growing team in Canada. If you’re looking to work at a company that values <a href="https://blog.yelp.com/category/diversity-and-inclusion">diversity, inclusion, belonging</a> and work-life balance, we’d love to hear from you!</p><p>Check out our <a href="https://www.yelp.careers/us/en/search-results">careers site</a> to see our current opportunities.</p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/07/why-yelps-hiring-strategy-in-canada-no-longer-includes-offices.html</link>
      <guid>https://engineeringblog.yelp.com/2021/07/why-yelps-hiring-strategy-in-canada-no-longer-includes-offices.html</guid>
      <pubDate>Mon, 12 Jul 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Analyzing Experiments with Changing Cohort Allocations]]></title>
      <description><![CDATA[<p>Have you ever run an A/B test and needed to change cohort allocations in the middle of the experiment? If so, you might have observed some surprising results when analyzing your metrics. Changing cohort allocation can make experiment analysis tricky and even lead to false conclusions if one is not careful. In this blog post, we show what can go wrong and offer solutions.</p><p>At Yelp, we are constantly iterating on our products to make them more useful and engaging for our customers. In order to ensure that the Yelp experience is constantly improving, we run A/B tests prior to launching a new version of a product. We analyze metrics for the new version versus the previous version, and ship the new version if we see a substantial improvement.</p><p>In our A/B tests, we randomly choose what product version a given user will see — the new one (for users in the test cohort) or the current one (for those in the status quo cohort). In order to make the new version’s release as safe as possible, we often gradually ramp up the amount of traffic allocated to the test cohort. For example, we might start with the test cohort at 10%. During this period, we would look for bugs and monitor metrics to make sure there are no precipitous drops. If things look good, we would ramp our test cohort allocation up, perhaps going to 20% first before ultimately increasing to 50%.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-07-06-analyzing-experiments-with-changing-cohort-allocations/cohort_ramp_up.png" alt="Example cohort allocation changes throughout an experiment" /><p class="subtle-text"><small>Example cohort allocation changes throughout an experiment</small></p></div><p>In this situation, we have multiple runs of the experiment with different cohort allocations in each run. This blog post will show how to properly analyze data from all runs of the experiment. 
We will discuss a common pitfall and show a way to avoid it. We will then frame this problem in the language of causal inference. This opens up numerous causal inference-based approaches (we survey a couple) that can yield further insight into our experiments.</p><p>Comparing metrics between cohorts can get tricky if cohort allocations change over time. In this section, we show an example where failing to account for the changing cohort allocation can cause one to get misleading results.</p><p>For concreteness, suppose that we are trying to improve our home and local services experience, with a view towards getting more users to <a href="https://blog.yelp.com/2016/04/yelp-request-a-quote">request a quote</a> for their home projects on Yelp. The metric we are trying to optimize in this example is the conversion rate — what fraction of users visiting home and local services pages decides to actually request a quote.</p><p>We run an A/B test to ensure that the new experience improves conversion versus the status quo. We have two runs, one each in the winter and the spring; in the second run, we increase the fraction of traffic allocated to the test cohort from 10% to 50%. The cohort allocations and true per-cohort conversion rates in each experiment run are as in the table below.</p><table><thead><tr><th>Time period</th>
<th>Experiment Run</th>
<th>Cohort</th>
<th>% of traffic assigned to cohort</th>
<th>Conversion Rate</th>
</tr></thead><tbody><tr><td>Winter</td>
<td>1</td>
<td>Status Quo</td>
<td>90%</td>
<td>0.15</td>
</tr><tr><td>Winter</td>
<td>1</td>
<td>Test</td>
<td>10%</td>
<td>0.15</td>
</tr><tr><td>Spring</td>
<td>2</td>
<td>Status Quo</td>
<td>50%</td>
<td>0.30</td>
</tr><tr><td>Spring</td>
<td>2</td>
<td>Test</td>
<td>50%</td>
<td>0.30</td>
</tr></tbody></table><p>Notice that in this example, the conversion rate is higher in the spring. This can happen, for example, if home improvement projects are more popular in the spring than the winter, causing a higher fraction of visitors to use the Request a Quote feature. Importantly, there is no conversion rate difference between the two cohorts.</p><p>We will now simulate a dataset that one might obtain when running this experiment and show that if we fail to account for the changing cohort allocation, we will be misled to believe that the test cohort has a higher conversion rate.</p><p>In our simulated dataset, we will have ten thousand samples for each experiment run. A given sample will include information about the experiment run, cohort, and whether a conversion occurred. The cohort is randomly assigned according to the experiment run’s cohort allocation. The conversion event is sampled according to the true conversion rate in the given experiment run and cohort.</p><div class="highlighter-rouge highlight"><pre>import numpy as np
import pandas as pd
def simulate_data_for_experiment_run(
    total_num_samples: int,
    experiment_run: int,
    p_test: float,
    status_quo_conversion_rate: float,
    test_conversion_rate: float
):
    experiment_data = []
    for _ in range(total_num_samples):
        cohort = np.random.choice(
            ["status_quo", "test"],
            p=[1 - p_test, p_test]
        )
        if cohort == "status_quo":
            conversion_rate = status_quo_conversion_rate
        else:
            conversion_rate = test_conversion_rate
        # 1 if there is a conversion; 0 if there isn't
        conversion = np.random.binomial(n=1, p=conversion_rate)
        experiment_data.append(
            {
                'experiment_run': experiment_run,
                'cohort': cohort,
                'conversion': conversion
            }
        )
    return pd.DataFrame.from_records(experiment_data)
</pre></div><div class="highlighter-rouge highlight"><pre>experiment_data = pd.concat(
    [
        simulate_data_for_experiment_run(
            total_num_samples=10000,
            experiment_run=1,
            p_test=0.1,
            status_quo_conversion_rate=0.15,
            test_conversion_rate=0.15,
        ),
        simulate_data_for_experiment_run(
            total_num_samples=10000,
            experiment_run=2,
            p_test=0.5,
            status_quo_conversion_rate=0.30,
            test_conversion_rate=0.30,
        ),
    ],
    axis=0,
)
</pre></div><p>The most straightforward way one might try to estimate the per-cohort conversion rate is to take the mean of the conversion column for all samples in each cohort. Effectively, this gives the number of conversions per cohort divided by the total number of samples in the cohort.</p><div class="highlighter-rouge highlight"><pre>def get_conversion_rate_for_cohort(
    experiment_data: pd.DataFrame,
    cohort: str
):
    experiment_data_for_cohort = experiment_data[experiment_data.cohort == cohort]
    return experiment_data_for_cohort.conversion.mean()
</pre></div><div class="highlighter-rouge highlight"><pre>get_conversion_rate_for_cohort(experiment_data, "test")
0.2768492470627172
get_conversion_rate_for_cohort(experiment_data, "status_quo")
0.20140431324783262
</pre></div><p>We observe a substantial difference between the conversion rate estimates for the status quo and test cohorts.</p><p>To get some intuition about whether this difference is statistically significant, let us create five thousand simulated datasets with the same parameters (cohort allocations and conversion rates). For each dataset, we will estimate conversion for the two cohorts, and look at the estimates’ distribution. The table below reports the mean and quantiles of the five thousand estimates of the status quo and test conversion rates. The table shows that the distributions of the estimated conversion rates for the two cohorts are very different, suggesting that the difference we observed is indeed statistically significant.</p><table><thead><tr><th> </th>
<th>Mean</th>
<th>2.5th percentile</th>
<th>50th percentile</th>
<th>97.5th percentile</th>
</tr></thead><tbody><tr><td>Status Quo</td>
<td>0.204</td>
<td>0.197</td>
<td>0.204</td>
<td>0.210</td>
</tr><tr><td>Test</td>
<td>0.275</td>
<td>0.246</td>
<td>0.275</td>
<td>0.287</td>
</tr></tbody></table><p>Recall, however, that there is no conversion rate difference between the cohorts: for both experiment runs, the test and status quo cohorts have equal conversion rates. Thus, if we use the simple approach described above to analyze the experiment, we would be misled and think that the test cohort outperforms the status quo.</p><p>What is going on? What we are seeing is a consequence of the fact that the average conversion rates and cohort allocations both change between experiment runs. For the test cohort, the majority of samples come from the higher-conversion period of the second experiment run. The opposite is true for the status quo cohort. So, the calculated conversion rate is higher for the test cohort than for the status quo cohort.</p><p>(In fact, it is not hard to adapt this example so that the test cohort has a lower conversion rate than the status quo cohort in each experiment run, but a higher calculated conversion rate overall. In this case, we might be misled and release the test experience, despite the fact that it harms conversion. This phenomenon is an example of <a href="https://en.wikipedia.org/wiki/Simpson%27s_paradox">Simpson’s Paradox</a>.)</p><p>How do we correctly compare the conversion for the experiment cohorts? We first observe that the average conversion rate over the entire dataset is 0.225, the average of the conversion rates in each experimental run. (There are ten thousand samples in each run, so we can take a simple average. If the number of samples per run were different, we would instead calculate the overall conversion rate using a weighted average; the weights would be the number of samples in each experiment run.) 
Since the two cohorts have the same conversion rate, the method used for estimating it should arrive at this number for both (up to some statistical noise).</p><p>The previous method reported a higher conversion rate for the test cohort because it had disproportionately many samples from the second experiment run. To correct for this imbalance, let us instead try to calculate per-cohort conversion rates separately for each experiment run, and then combine them with a weighted average. This approach is implemented below:</p><div class="highlighter-rouge highlight"><pre>def per_experiment_run_conversion_rate_estimator_for_cohort(
    data: pd.DataFrame,
    cohort: str,
    experiment_runs: List[int],
):
    data_for_cohort = data[data.cohort == cohort]
    conversion_rates = []
    total_num_samples = []
    for experiment_run in experiment_runs:
        conversion_rates.append(
            data_for_cohort[
                data_for_cohort.experiment_run == experiment_run
            ].conversion.mean()
        )
        total_num_samples.append(
            (data.experiment_run == experiment_run).sum()
        )
    return np.average(conversion_rates, weights=total_num_samples)
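</pre></div><p>As a sanity check, the same per-run correction can also be written as a pandas groupby. On a noise-free toy dataset, constructed here for illustration with the running example’s allocations and rates (the construction is ours, not from the original analysis code), both cohorts recover the true overall conversion rate of 0.225:</p><div class="highlighter-rouge highlight"><pre>import numpy as np
import pandas as pd

# Noise-free toy data: each run/cohort gets exactly size * rate conversions.
frames = []
for run, p_test, rate in [(1, 0.1, 0.15), (2, 0.5, 0.30)]:
    for cohort, share in [("status_quo", 1 - p_test), ("test", p_test)]:
        size = int(10000 * share)
        k = int(size * rate)
        frames.append(pd.DataFrame({
            "experiment_run": run,
            "cohort": cohort,
            "conversion": [1] * k + [0] * (size - k),
        }))
toy_data = pd.concat(frames, ignore_index=True)

# Per-run, per-cohort means, then a weighted average over runs,
# weighted by each run's total sample count.
rates = toy_data.groupby(["experiment_run", "cohort"]).conversion.mean().unstack()
weights = toy_data.groupby("experiment_run").size()
round(np.average(rates["test"], weights=weights), 4)
0.225
round(np.average(rates["status_quo"], weights=weights), 4)
0.225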
</pre></div><p>In the table below, we report statistics about conversion rate estimates on five thousand simulated datasets:</p><table><thead><tr><th> </th>
<th>Mean</th>
<th>2.5th percentile</th>
<th>50th percentile</th>
<th>97.5th percentile</th>
</tr></thead><tbody><tr><td>Status Quo</td>
<td>0.225</td>
<td>0.218</td>
<td>0.225</td>
<td>0.232</td>
</tr><tr><td>Test</td>
<td>0.225</td>
<td>0.213</td>
<td>0.225</td>
<td>0.238</td>
</tr></tbody></table><p>We see that the estimates for the test and status quo conversion rates are close to the true value on average, and are close to each other.</p><p>In the rest of this blog post, we will provide a more theoretical justification for why this method, and another one based on regression, are appropriate for analyzing experiments where cohort allocations change over time. This will involve interpreting our problem in the language of causal inference.</p><h2 id="connection-with-causal-inference">Connection with causal inference</h2><p>The issues we faced when analyzing experiments with changing cohort sizes have a connection with causal inference. In this section, we will explore this connection, which will help us gain a better understanding of methods used to correctly calculate conversion rate (including the per-experiment run computation in the previous section).</p><h2 id="what-are-we-trying-to-measure">What are we trying to measure?</h2><p>We are trying to measure the causal effect on conversion from being in the test (versus the status quo) cohort (also known as the treatment effect). To do this, we imagine taking all the samples in our dataset. What fraction would convert if all of them were in the test cohort (call this Y<sub>T</sub>)? What fraction would convert if all were in the status quo cohort (call this Y<sub>SQ</sub>)? The difference between the two is the average treatment effect for the dataset.</p><p>Unfortunately, it is impossible to directly measure the average treatment effect as described above. Any given sample is in one cohort but not both, so it is impossible to know that sample’s outcome if it were in the other cohort. The calculation relies on some counterfactual data, e.g., for a sample in the status quo cohort, would it have converted had it been in the test cohort?
This is known as the <a href="https://en.wikipedia.org/wiki/Rubin_causal_model#The_fundamental_problem_of_causal_inference">fundamental problem of causal inference</a>.</p><p>However, we can use our samples to estimate the average treatment effect.</p><h2 id="estimating-average-treatment-effect">Estimating average treatment effect</h2><p>The first attempt to estimate the average treatment effect was computing the average conversion rate per cohort. We computed the probability of conversion given that the cohort was test or status quo, and subtracted the two. We found that being in the test cohort was correlated with higher conversion. This correlation does not imply causation, however. The reason that being in the test cohort is correlated with conversion is that, given our cohort allocations, a user being in the test cohort means that they are more likely to be in the higher-conversion second experiment run.</p><p>Said another way, the experiment run is a confounding variable that produces a non-causal association between cohort and conversion. This is known as <a href="https://catalogofbias.org/biases/confounding/">confounder bias</a>. To properly estimate the causal effect of being in the test cohort, we have to control for the confounder. There are a number of standard ways of doing this in the causal inference literature (e.g. Section 3.2 of [1]).</p><h3 id="separate-conversion-rate-calculations-per-confounder-value">Separate conversion rate calculations per confounder value</h3><p>This approach tries to correct for confounder bias by computing per-cohort conversion rates separately for each value of the confounder (experiment run). To get the overall conversion rate for each cohort, we take a weighted average of the conversion rates per experiment run, with weights being the relative prevalence of each confounder value in the dataset. (See, for example, <a href="http://bayes.cs.ucla.edu/BOOK-2K/ch3-3.pdf">Equation 3.21</a> in [2].) 
This is equivalent to weighting by the total number of samples (test and status quo) in each experiment run. This gives estimates for Y<sub>T</sub> and Y<sub>SQ</sub>, and we can subtract them to get an estimate for the average treatment effect.</p><p>We did precisely this when we tried to properly calculate conversion rate per cohort (using the <code class="highlighter-rouge">per_experiment_run_conversion_rate_estimator_for_cohort</code> function). This approach makes sense because, for a given experiment run, the per-cohort calculation gives us an estimate of the conversion rate for that experiment run if all samples were in the given cohort (this relies on the fact that users are assigned at random to the status quo or test cohort). Then, the weighted average step gives us an estimate of what the conversion rate would be over the entire dataset (all experiment runs).</p><h3 id="including-confounding-variable-in-a-regression">Including the confounding variable in a regression</h3><p>Another approach for controlling for the confounder is to build a regression model for the outcome variable (conversion) as a function of the treatment variable (cohort, specifically a dummy variable encoding whether the sample is in the test cohort). If we simply regress conversion on the test cohort dummy variable, we will see a positive regression coefficient, which may lead us to conclude there is a positive treatment effect. However, in our running example (where the treatment effect is zero), there will be a positive coefficient just because being in the test cohort is correlated with conversion, which happens due to the presence of the confounder.</p><p>To fix this, we include the confounder as a predictor variable alongside the cohort. This will separate the conversion effects due to the confounder from those due to being in the test cohort.
The coefficient of the cohort variable will give us the average treatment effect.</p><p>Both the cohort and the experiment run are categorical variables, and we will encode them using dummy variables. For each categorical variable, we need one fewer dummy variable than the number of different values the variable can take. For our data with two cohorts and two experiment runs, the code below will create dummies for whether the user is in the test cohort and for whether they are in the second experiment run.</p><div class="highlighter-rouge highlight"><pre>import statsmodels.formula.api as smf
smf.ols(
    formula="conversion ~ C(cohort, Treatment('status_quo')) + C(experiment_run)",
    data=experiment_data
).fit().summary()
</pre></div><p>This code uses the <a href="https://www.statsmodels.org/v0.12.0/example_formulas.html">formula</a> API in <a href="https://www.statsmodels.org/v0.12.0/index.html">statsmodels</a>. It stipulates that conversion is a linear function of cohort and experiment run. The <a href="https://patsy.readthedocs.io/en/v0.5.1/categorical-coding.html">C(·) notation</a> encodes these variables as dummy variables.</p><p>The results are:</p><table><thead><tr><th>Variable</th>
<th>Coefficient</th>
<th>Standard Error</th>
</tr></thead><tbody><tr><td>Intercept</td>
<td>0.1507</td>
<td>0.004</td>
</tr><tr><td>User is in test cohort</td>
<td>0.0073</td>
<td>0.007</td>
</tr><tr><td>Second experiment run</td>
<td>0.1427</td>
<td>0.006</td>
</tr></tbody></table><p>The intercept term is approximately equal to the baseline conversion rate (in the first experiment run and status quo cohort), namely 0.15.</p><p>We see a close to zero effect from being in the test cohort; the coefficient is almost equal to its standard error. On the other hand, we see an approximately 0.15 effect from being in the second experiment run. Indeed, samples in that experiment run have a conversion of 0.3, which is 0.15 higher than the conversion rate in the first experiment run.</p><h2 id="example-with-non-zero-treatment-effect">Example with non-zero treatment effect</h2><p>We modified our running example such that the test cohort conversion rate was 0.05 higher in each experiment run than the status quo conversion rate, and tested out our two methods for computing average treatment effect.</p><table><thead><tr><th>Time period</th>
<th>Experiment Run</th>
<th>Cohort</th>
<th>% of traffic assigned to cohort</th>
<th>Conversion Rate</th>
</tr></thead><tbody><tr><td>Winter</td>
<td>1</td>
<td>Status Quo</td>
<td>90%</td>
<td>0.15</td>
</tr><tr><td>Winter</td>
<td>1</td>
<td>Test</td>
<td>10%</td>
<td>0.20</td>
</tr><tr><td>Spring</td>
<td>2</td>
<td>Status Quo</td>
<td>50%</td>
<td>0.30</td>
</tr><tr><td>Spring</td>
<td>2</td>
<td>Test</td>
<td>50%</td>
<td>0.35</td>
</tr></tbody></table><p>The overall conversion rates are 0.225 for the status quo cohort and 0.275 for the test cohort.</p><h3 id="separate-conversion-rate-per-confounder-value">Separate conversion rate per confounder value</h3><p>Running <code class="highlighter-rouge">per_experiment_run_conversion_rate_estimator_for_cohort</code> gives conversion rate estimates that are close to the actual values (0.225 and 0.275 for the status quo and test cohorts respectively).</p><h3 id="regression">Regression</h3><p>The regression gives the following coefficients:</p><table><thead><tr><th>Variable</th>
<th>Coefficient</th>
<th>Standard Error</th>
</tr></thead><tbody><tr><td>Intercept</td>
<td>0.1506</td>
<td>0.004</td>
</tr><tr><td>User is in test cohort</td>
<td>0.0567</td>
<td>0.007</td>
</tr><tr><td>Second experiment run</td>
<td>0.1431</td>
<td>0.007</td>
</tr></tbody></table><p>As before, the coefficient for the variable encoding whether the user is in the test cohort approximates the true average treatment effect (0.05). The coefficient for the variable encoding the second experiment run is approximately 0.15, once again as expected — in the second experiment run, conversions are that amount higher.</p><h2 id="simulation-study">Simulation study</h2><p>To better understand the two methods for estimating average treatment effects and the advantages of each, we ran a simulation study. In this study, we produced a large number of datasets with the same parameters and looked at the distribution of average treatment effects estimated by the two methods.</p><p>We will take a look at a particular example:</p><ul><li>Two experiment runs with 1000 samples per run (ten times lower than in the previous datasets; this helps better illustrate the statistical noise in our estimates)</li>
<li>Test cohort allocation is 10% and 50% in the two runs respectively</li>
<li>The status quo conversion rates are 0.15 and 0.30 in the two experiment runs respectively</li>
<li>The test cohort conversion rates are 0.20 and 0.35 (0.05 higher than the status quo conversion rates)</li>
</ul><p>We produced a total of 5000 datasets, and hence estimated the treatment effect 5000 times for each method.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2021-07-06-analyzing-experiments-with-changing-cohort-allocations/power_analysis_histograms.png" alt="Histograms of Estimators" /></p><p>The orange graphs in the figure above are histograms of the estimated average treatment effect for the separate conversion rate estimation method (top) and the regression method (bottom). Both distributions have means close to 0.05, the true average treatment effect, and have very similar shapes. The graphs in blue are the estimated average treatment effects for datasets that are the same as above, but where the status quo and test cohorts have the same conversion rate in each experiment run. These distributions have means of close to 0 as expected, since the true treatment effect is 0.</p><p>We have run a number of simulation studies, and have found that the two methods for estimating average treatment effect perform similarly. Overall, we believe that the most important thing is not the precise method one uses, but that one is aware of confounder bias, and takes steps to correct for it.</p><p>Nevertheless, it is good to keep the regression method in one’s tool chest because it can be easier to use in many instances. For one, software packages such as <code class="highlighter-rouge">statsmodels</code> automatically compute standard errors for regression estimates. Additionally, with regression, it is fairly straightforward to analyze more complicated experiments, such as when there are multiple confounders. (One example is if cohort allocations within experiment runs were different for different geographical regions; in this case, geographical region would be an additional confounding variable.)</p><p>Analyzing experiments where cohort allocations change over time can get a little complicated. 
Simply looking at the outcome variable for samples in the status quo and test cohorts can produce misleading results; instead, techniques that control for the confounding variable are needed. We hope that this blog post has raised awareness of this issue and provided some solutions.</p><h2 id="acknowledgements">Acknowledgements</h2><ul><li>Billy Barbaro for originally making me aware of the issue discussed in this post.</li>
<li>Alex Hsu and Shichao Ma for useful discussions and suggestions, which ultimately helped frame this causal inference interpretation of the problem.</li>
<li>Blake Larkin and Eric Liu for carefully reading over this post and giving editorial suggestions.</li>
</ul><h2 id="references">References</h2><ol><li>Joshua D. Angrist and Jörn-Steffen Pischke. <em>Mostly Harmless Econometrics</em>. Princeton University Press, 2008.</li>
<li>Judea Pearl. “Controlling Confounding Bias.” In <em>Causality</em>. Cambridge University Press, 2009 <a href="http://bayes.cs.ucla.edu/BOOK-2K/ch3-3.pdf">http://bayes.cs.ucla.edu/BOOK-2K/ch3-3.pdf</a></li>
<li>Adam Kelleher. “A Technical Primer on Causality.” <a href="https://medium.com/@akelleh/a-technical-primer-on-causality-181db2575e41#.o1ztizosj">https://medium.com/@akelleh/a-technical-primer-on-causality-181db2575e41#.o1ztizosj</a></li>
</ol><div class="island job-posting"><h3>Become an Applied Scientist at Yelp</h3><p>Want to impact our product with statistical modeling and experimentation improvements?</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/cc5ce7e2-26e9-4290-8847-c082632df9e8/Applied-Scientist-Remote?lever-source=engineering_blog" target="_blank">View Job</a></div>]]></description>
      <link>https://engineeringblog.yelp.com/2021/07/analyzing-experiments-with-changing-cohort-allocations.html</link>
      <guid>https://engineeringblog.yelp.com/2021/07/analyzing-experiments-with-changing-cohort-allocations.html</guid>
      <pubDate>Tue, 06 Jul 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Engineering Career Series: How we think about engineering management]]></title>
      <description><![CDATA[<p>In our last post we talked about technical leadership, one of the growth paths available to software engineers at Yelp. In this post we’d like to share more about engineering management, which is another path that some software engineers choose after some time in the industry. We’ll start with an explanation of what engineering management is (and isn’t), discuss our approach to management, and talk about what makes it different from engineering. We’ll also discuss how people get started on a management path at Yelp, and what we do to help our management team grow in their roles.</p><h2 id="whats-an-engineering-manager">What’s an engineering manager?</h2><p>At Yelp, every engineering manager is accountable for the overall health, execution, and vision of their team. Managers safeguard Yelp’s culture and <a href="https://www.yelp.careers/us/en">values</a>, ensuring that it’s a great place to work. We expect managers to make good decisions that are best for the company and to put the interests of the team ahead of themselves. Sometimes this means handing off an interesting project to another team that is better equipped, or finding a role on another team at Yelp for a senior engineer who’s ready for a new challenge.</p><p>Each of us on the management team is a technologist with a background in doing the work, whether that be software engineering, information technology, machine learning, or something else. This background knowledge gives us the ability to understand what our teams do on a day-to-day basis, to empathize with the challenges they face, and to entrust our team with most of the decision-making necessary to build and operate our product and its infrastructure.</p><h3 id="what-does-it-mean-to-be-accountable-for-the-health-of-a-team">What does it mean to be accountable for the health of a team?</h3><p>Managers are responsible for the motivation, well-being, and career growth of their teammates. 
A manager’s first job is to build a trusting relationship with each person on the team, usually through weekly one-on-one meetings and quarterly career planning discussions. If someone is feeling excited or proud, the manager is there for a high five. If someone is stressed or upset, or not taking care of themselves, the manager is there to listen and support, first helping them to pinpoint what they are feeling.</p><p>We rely on managers to connect the right opportunities with the right people, and you can’t do that unless you know someone’s interests, aspirations, and concerns. Someone might want to get better at public speaking, so you find a project for them in a role that involves lots of presentations to other teams. Or they might have deep social anxiety, so you make sure they <em>don’t</em> have to present, or you can find alternate opportunities for them to build confidence and communication skills. As a teammate advances in their career, it’s important to work with them to find opportunities that will engage their desire for personal growth and learning. Managers need to establish a foundation of trust and open communication if they’re going to understand what each team member loves about their work.</p><p>Managers spend a lot of time listening and paying attention, especially in our one-on-ones with our team. We want to be there for people when they are worried, frustrated, or stuck. We don’t aim to solve all of their problems, but we will offer our perspective and feedback, ask questions, and connect with others who can help them. We also want to be there to celebrate wins alongside them, and to make sure their growth and achievements are recognized.</p><h3 id="what-is-a-managers-role-in-a-teams-execution">What is a manager’s role in a team’s execution?</h3><p>Managers are responsible for ensuring the team has processes, norms, and guardrails that allow the team to operate effectively and everyone to do their best work.
Each team will have its own personalities and preferences, but teams always need to be inclusive to be effective. For example, the manager of a Scrum team might run sprint planning themselves, or they might delegate that responsibility to engineers, but in either case they need to ensure that every team member feels involved and is an active contributor. Managers are constantly on the lookout for ways to help their team improve execution, and the best managers enable their teams to do this effectively.</p><p>People do their best work when they feel a sense of agency and autonomy over <em>what</em> the work is and <em>how</em> it gets done. This often requires managers to delegate much of the day-to-day technical work, like actually building and shipping software. Although all managers need to be prepared to roll up our sleeves and help the team during an emergency, it’s important that we avoid trying to do the same day-to-day work as the team. Writing code is fun – some of us really miss it – but if a manager is regularly doing pull requests, it raises uncomfortable questions: Is it safe to leave critical feedback on their work? Does the manager not trust the team? Is it a sign the team is understaffed? Is there nothing else the manager could do that would help the team perform better?</p><p>It’s no different with higher-level technical decision-making: it’s almost always better if the engineering team can handle challenging decisions on its own, instead of relying on their manager to make the call. Should we move from one database platform to another? Is it time for us to rewrite that ancient module that nobody understands anymore? Is our time best spent trying to make the app faster, or in rewriting our app framework so more teams can work in parallel? 
When these questions come up, the manager’s job is not to <em>decide</em> but to put structure around the team’s decision-making, using our own experience to guide the discussion and keep things moving forward.</p><p>As former engineers, we might love wrestling with technical problems, but as managers we often need to set aside our own interests in writing and pushing code to better support everyone else on the team. We do, however, rely heavily on our technical backgrounds to guide conversations, validate the team’s direction and investments, support technical growth of our teammates, and ask the right questions along the way.</p><h3 id="is-vision-just-a-fancy-word-for-fancy-presentations">Is “vision” just a fancy word for fancy presentations?</h3><p>Managers connect the dots between business value and engineering projects. Many teams have product managers to identify business opportunities and study what delivers the most value to our users. Meanwhile, engineers are most familiar with the product’s current technical capabilities and weaknesses, as well as which systems are incurring technical debt and cannot be easily extended or reused. Engineering managers facilitate an ongoing conversation that aligns the next set of technical investments with business value, whether it’s iterating on a feature or system in its current state, or taking on a bigger effort to refactor or rearchitect.</p><p>One of the key ways a manager supports their team is by planning ahead. Engineering managers need to see at least a few months into the future, beyond the current backlog, and ensure the team is prepared for what’s coming up. In addition to working closely with other Engineering teams, we collaborate with teams in Product Operations, Sales, and Customer Success to understand the business’s priorities and to help make sensible trade-offs. 
We try to strike a healthy balance between incremental improvements (where returns on investment might be clear, but limited) and big bets (where the uncertainty is higher, but for a bigger potential payoff). If we are going to make a big bet, we help the team to break it down into milestones to reduce risk.</p><p>In parallel, each Engineering team keeps track of its own prioritized backlog of technical investments and engineering opportunities; the team’s manager needs to make sure there is enough time and budget for the team to make meaningful progress on that backlog. Many teams will allocate time in each planning cycle to address maintenance issues and small refactors. Larger technical investments are motivated by patterns in bugs and failures, as well as developer velocity. Often, iterating on a system over years will lead to difficulty in supporting it due to the accumulation of complexity, drift of business goals, and increases in both the volume of traffic to the system and the number of engineers who interact with the system. In recent years, we’ve tracked our largest technical investments at an engineering-wide level to ensure that teams know they are priorities.</p><p>It’s important for every engineering manager to understand the bigger picture so that they can share context with the team on where things are going and how they fit together. Every Engineering team has more possible work than it can ever complete, so it’s critical for the manager to facilitate the conversation about investment levels in various areas.</p><h3 id="bringing-it-all-together">Bringing it all together</h3><p>Health, execution, and vision are interconnected. A healthy team with a good work-life balance requires clear, consistent processes for triaging issues and making commitments.
A team that consistently makes high-quality decisions in a collaborative fashion is only possible if the team believes their manager when they say, “This decision is yours to make.” A team where every engineer feels motivated and challenged requires a manager who is thinking ahead and anticipating what comes next.</p><p>One example that touches on all three of these areas is incident response. Over the years, we’ve become much better at dealing with emergency incidents, putting in place protocols that ensure that engineers communicate and support each other through the incident, then discuss and write a (blameless) postmortem with follow-up action items (which can include longer-term engineering investments). Following an incident, managers check in with engineers who put in personal time to handle it and offer them paid time off to recover. As a management team we’ve prioritized introspection by our teams and time set aside for continuous improvement.</p><h2 id="how-does-someone-become-a-manager">How does someone become a manager?</h2><p>Sometimes an engineer will express interest in management and explore it with their leadership team; other times, we’ll see potential in someone and encourage them to consider it. We know that there is a lot of variation in effective leadership styles, and in some cases it has taken years of coaching and encouraging an engineer for them to give management a try. To ensure that we’re being open-minded about who can be a manager, we continue to develop leadership training for all engineers, not just ones who have self-selected into the management path, and we are asking all managers to have deeper conversations with their level IC3 reports on career development options, since this is where we typically see branching between the management and technical leadership career paths.</p><p>In any case, we only consider engineers who’ve already demonstrated some of the skills required to be a good manager.
That could be mentoring other engineers on the team and helping train new hires. It could be leading projects and keeping track of team initiatives. We want to be sure that anyone under consideration for management understands what the role means at Yelp and has shown themselves to be a role model for Yelp’s values.</p><p>Management roles are not unlimited; they become available through team growth, reorganizations, and (sometimes) departures. This means that stepping into a manager role often involves changing teams. We think this is healthy; it helps new managers avoid the trap of trying to manage the team <em>and</em> remain one of the key technical contributors. If you’re learning to manage people for the first time, you have to be able to focus on that new set of skills and challenges. That is easier if it’s a new set of systems than what you worked on as an engineer.</p><p>Through some trial and error, we developed a training program we call “proto-manager” to help individuals try out the management role without making it seem irreversible. We wanted to give them exposure to the role and a vote of confidence from leadership, but still allow for the option to say, “Not for me. I’d rather keep writing code.”</p><p>As a proto-manager, the engineer takes over one-on-ones with the engineers on the team and accountability for the team’s execution; compensation planning and performance review are still handled by the team’s manager. Over the next few months, the proto-manager will get regular feedback from their team and their manager, and will track their progress against the expectations laid out in the first level of our Engineering Manager leveling system.</p><p>Proto-managers enroll in a training program we’ve developed that details our philosophy, approach, and toolset for managers. They also gain access to all of the resources, documents, and meetings that their EMs have, so they can learn as quickly and effectively as possible. 
After a few months, the proto-manager and their manager decide whether to move forward and make it official.</p><h2 id="how-does-someone-grow-as-a-manager">How does someone grow as a manager?</h2><p>Mountains of books about management get published every year, but most of the growth of a manager comes from doing the job. It’s unlikely you’ll suddenly become adept at having crucial conversations with someone who is on the path to burning out just by reading a book or attending a training. It takes experience and practice to unwind the reasons that a complex project with a strong team still isn’t running according to plan. However, we have several habits and programs to support new managers as they learn on the job.</p><p>New managers gain a lot by getting advice from their peers. Directors hold weekly meetings with small groups of peer managers, with a focus on knowledge sharing and support within the group. Every attendee contributes to the agenda with books, articles, or videos that they found useful or insightful, but the majority of discussion often centers around what we call “people stories.” Managers can bring interesting or challenging situations to the group and the other attendees will listen, ask questions, and offer suggestions. Many of these stories are about coaching, giving feedback to, or finding opportunities for a particular individual. There is absolute confidentiality within these meetings; the goal is for a manager to get advice on how to help their team and grow professionally. 
We’ve found that these meetings are most effective when they are kept small because we want them to create a sense of safety and trust, and for each manager to feel like they can seek advice from a trusted circle of peers.</p><p>We also run a monthly manager meeting with all engineering managers at Yelp, which is an information sharing meeting that covers topics like updates to our leveling system and compensation planning, updates from around the Engineering team, and talks given by managers (for example, about allyship). This also provides a regular forum to ask questions of senior managers.</p><p>Over the years we’ve done several versions of management training, creating cohorts of new managers (those new to management or new to Yelp) and scheduling a series of sessions focused on topics like decision-making, coaching, and career development for their reports. As we’ve progressed, we’ve worked to adapt this to align more with Yelp’s overall management and leadership training, refresh the content, and ensure that it scales to work across worldwide engineering offices and in a distributed team environment.</p><p>Finally, we implemented a manager mentorship program early on, finding that managers derived a lot of benefits from meeting with other managers. Many new managers find themselves with mentors both inside and outside of their current group.</p><h2 id="how-do-we-track-manager-career-development">How do we track manager career development?</h2><p>Our manager leveling system mirrors the <a href="https://engineeringblog.yelp.com/2021/06/engineering-career-series-career-paths-for-engineers-at-yelp.html">engineering leveling system</a>, sharing a common set of five dimensions: Technical Skill, Ownership, Business Insight, Continuous Improvement, and Leadership. We expect most managers to be well-rounded across these different areas. Advancement in these areas is generally tied to having accountability for increased scope. 
Senior managers tend to deal with more ambiguity, think more about how technology can deliver business value, mentor and manage more senior reports, and manage budgets and hiring plans. They solve problems by putting processes and structures in place, pursuing opportunities to improve in a changing business landscape, and steering the growth and restructuring of our teams.</p><p>While some of these areas are self-explanatory, we have several subdimensions in Continuous Improvement and Leadership that we’d like to highlight: Mentorship, Well-Being, and Community. These reflect our values as a team, ensuring that managers are looking for opportunities to grow others, support their teams, and build strong relationships. At higher levels, we expect managers to build and scale programs that sustain Yelp Engineering.</p><h2 id="do-i-need-to-become-a-manager-to-keep-growing">Do I need to become a manager to keep growing?</h2><p>Absolutely not! We understand that a management career is a separate path from becoming an excellent software engineer, and we strive for both of these career paths to be full of growth opportunities. Nobody should ever feel forced or compelled to step into management to advance their career, because they’ll wind up in a role they don’t enjoy while their team won’t get the support they need.</p><p>While we’ve supported engineers who have been interested in transitioning into other functions like Product Management and Data Science, many engineers choose to stay on the technical career path at Yelp. 
For more information, check out two other blog posts in this series: <a href="https://engineeringblog.yelp.com/2021/06/engineering-career-series-career-paths-for-engineers-at-yelp.html">Career Paths for Engineers</a> and <a href="https://engineeringblog.yelp.com/2021/06/engineering-career-series-technical-leadership-at-yelp.html">Technical Leadership</a>.</p><h2 id="building-yelps-management-team">Building Yelp’s Management Team</h2><p>Between the two of us, we have 19 years of management experience at Yelp. We’ve hired, mentored, and managed a lot of managers at Yelp, many of whom are in their first management roles. Over the years we’ve helped to define and articulate our management culture. It’s been incredibly rewarding to build the team and support it. While we know the management path is not for everyone, it brings together a lot of challenges in helping a team work effectively to define and achieve success together. Many things can go wrong when people come together to build software, and managers can help a team to overcome all sorts of challenges and celebrate both individual and team success along the way.</p><p>We’ve scaled our management culture to a team of more than 150 managers. In an earlier blog post we talked about our career growth framework, which has been a useful tool to standardize career conversations and compensation scales. In our next post we’ll discuss how we measure and ensure pay equity and fairness in career progression across Yelp Engineering.</p><p>If this all sounds good to you, and you’re excited to continue developing your management career, <a href="https://www.yelp.careers/us/en/c/engineering-jobs">we’re hiring</a>!</p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/07/engineering-career-series-how-we-think-about-engineering-management.html</link>
      <guid>https://engineeringblog.yelp.com/2021/07/engineering-career-series-how-we-think-about-engineering-management.html</guid>
      <pubDate>Thu, 01 Jul 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Engineering Career Series: Technical Leadership at Yelp]]></title>
      <description><![CDATA[<p>Hi there!</p><p>In this post we’re discussing technical leadership, a topic that is paramount to any engineering organization, but is also hard to define. Even observing whether your team, organization, or company has good technical leadership can be a challenge. You might be thinking right now, “Am I a good technical leader?”</p><p>To help describe how Yelp thinks about technical leadership, we have two of our Group Tech Leads (a.k.a. GTL, more on what this is later) writing this post. They are both seasoned Yelpers who have held a number of technical leadership roles — they were even willing test subjects for Yelp’s early experiments in defining such roles.</p><p>Jason Sleight has been with Yelp for six years in various machine learning (ML) oriented roles, and is currently the GTL for Yelp’s <a href="https://engineeringblog.yelp.com/2020/07/ML-platform-overview.html">ML platform</a>, which consists of a collection of centralized systems for ad hoc computing, data ETL, and training/serving ML models.</p><p>Josh Walstrom has been with Yelp for seven years in various backend, iOS, and Android roles and is currently the GTL for our <a href="https://business.yelp.com/tools/business-mobile-app/">“Yelp for Business”</a> mobile apps, which enable business owners to manage their presence on Yelp.</p><h2 id="what-is-technical-leadership">What is technical leadership?</h2><p>First things first, technical leadership is not a single concept, but rather an encapsulation of several distinct functions. For the sake of simplicity, we’re going to bucket them into a few categories. 
We call someone a Tech Lead (TL) when they focus on these functions.</p><h3 id="own-technical-direction-within-an-area-of-responsibility-aor">Own technical direction within an Area of Responsibility (AoR)</h3><p>Yelp is a highly collaborative environment where product, design, management, and technical leaders share goals and direction, but each focuses on different aspects based on their expertise. For example, a product manager’s (PM) expertise is market fit, and a TL’s expertise is technical execution. A PM might focus on whether a team should increase weekly active users via increasing app downloads versus improving retention of existing users, while a TL might focus on whether we should prioritize componentizing UI elements to enable app-pitch interstitials versus improving caching to reduce page load times.</p><p>We like our TLs to focus their attention within an AoR. By having a defined AoR, we can ensure TLs are involved in all the necessary planning, decisions, etc. for that system. In some cases, an AoR is a direct mirror of the team’s mandate (e.g., managing MySQL deployments); in others, an AoR is a sub- or cross-section of an important initiative (e.g., migrating to a new service discovery tech stack). In any event, AoRs are long-lived concepts with multi-quarter or even multi-year roadmaps, and it is the TL’s responsibility to champion that process.</p><h3 id="ensure-the-technical-success-of-their-aor">Ensure the technical success of their AoR</h3><p>Once you have a technical direction established, you need to execute a sequence of projects towards those goals. Ensuring success can take many forms and often varies depending on the lifecycle phase of an AoR. In a new AoR, TLs are often very hands-on, creating proofs of concept and prototypes. As systems grow, the TL might lead a project with a few other engineers to bring the system to production quality and release it for early adopters to try out. 
And finally, as the system matures with broader adoption, the TL needs to step back and facilitate team processes for triaging issues, implementing new features, and other maintenance tasks.</p><h3 id="provide-technical-mentorship-to-engineers-working-in-their-aor">Provide technical mentorship to engineers working in their AoR</h3><p>Engineers love to make progress, and that includes in their personal skills. While engineering managers (EMs) are ultimately accountable for engineers’ continued growth, TLs are closer to the day-to-day technical contributions of engineers in their AoRs and best positioned to give them feedback on how to improve and grow. On a micro level, this includes giving feedback on code robustness, efficiency, maintainability, etc. On a macro level, this includes exposing engineers to new technologies, creating training materials for the AoR’s systems, and helping engineers connect with relevant stakeholders.</p><p>At Yelp, we associate career progression with increased impact (see our recent blog on <a href="https://engineeringblog.yelp.com/2021/06/engineering-career-series-career-paths-for-engineers-at-yelp.html">career leveling</a>). TLs act as force multipliers and explicitly budget time to spend on coordination of efforts, championing their AoR, and mentorship. Clearly these are characteristics that lend themselves towards high impact, and consequently our TLs tend to be relatively advanced in Yelp’s career levels.</p><p>However, there are additional ways to be impactful. Tasks like debugging a logging system, refactoring a complex data model, and optimizing page load times require in-depth technical expertise. A TL might not be the best positioned for these tasks; instead an engineer that spends more time deep in the code has the right expertise. You can view this as a depth versus breadth distinction, and a healthy organization needs both types of skill sets. 
Having TL as a career progression step would funnel engineers into a breadth-first mindset to the detriment of deep technical experts.</p><p>In the past, this distinction was somewhat ambiguous at Yelp, and being a TL was occasionally viewed as a career progression step by engineers. To combat this incorrect perception, we’ve recently refreshed our TL program to make it explicit that TL is a role, as well as to re-establish dedicated support networks for our TLs like training programs and senior-level mentorship.</p><p>The TL role involves more than just guiding technical work and mentoring engineers in their AoR.</p><h3 id="tls-are-engineer-advocates">TLs are engineer advocates</h3><p>TLs identify and remove roadblocks for engineers in their AoR. By nature, engineers excel at finding workarounds or tolerating solvable problems because they want to ship features. Some roadblocks are fairly obvious, such as an unstable deployment pipeline. Other roadblocks can be more subtle, such as not having access to the best tools or training.</p><p>When the roadblocks are within their AoR, TLs work with their EMs and PMs to schedule time for proper solutions even if that means deferring some work on the product roadmap. When the roadblocks are outside their AoR, TLs ask their fellow TLs for help.</p><h3 id="tls-are-allies-for-their-stakeholders">TLs are allies for their stakeholders</h3><p>TLs work closely with PMs and EMs in their AoR. TLs participate actively in the early planning process for new features, advising on technical feasibility, level of effort, and potential risks. TLs ensure PMs have the data they need to optimize their product strategy, and TLs ensure EMs can complete new features on schedule with high quality.</p><p>Many AoRs have stakeholders outside Yelp Engineering, such as Sales &amp; Marketing, Business Operations, and Finance. 
TLs work hard to develop empathy for the needs of these stakeholders.</p><h3 id="tls-are-communication-hubs">TLs are communication hubs</h3><p>TLs simplify communication by coordinating the flow of information around their AoR. This aspect of the role means TLs spend less time building and more time reading, writing, listening, or talking.</p><p>TLs cultivate a sufficiently broad context about their AoR through a generous reading list of emails, Slack messages, product &amp; technical specifications, JIRA tickets, and GitHub PRs. TLs also use external sources to develop context. They read blog posts, attend technical conferences, and participate in open-source communities.</p><p>TLs meet regularly with other TLs to exchange context, creating a collaborative community of technical leaders focused on solving “big picture” problems. They also meet with EMs, PMs, and external vendors. In short, being a TL means you’ll have more meetings. Not as many as an EM or PM but more than a typical engineer.</p><p>GTLs own the technical work of a group of overlapping or related AoRs, each with its own TL. Effectively, GTLs create second level AoRs that cut across clean organizational and technical boundaries to address critical business needs and drive company-wide technical initiatives.</p><p>We introduced the GTL role because some hard, cross-cutting problems seemed impossible to solve without a dedicated owner. In the beginning, the role GTL didn’t have a formal application process, and the expectations weren’t clear, except that a GTL was a TL with a much larger and more ambiguous scope. Though we’re still figuring some things out, the role of the GTL has matured considerably over the past few years. 
There’s now a formal application process and a clearer set of expectations for GTLs beyond those we’ve covered for TLs in general.</p><h3 id="gtls-are-experts-in-the-their-fields">GTLs are experts in their fields</h3><p>GTLs stay informed on industry best practices, future trends, and potential risks. They understand and influence Yelp’s business strategy. Using their expertise and knowledge, GTLs guide their groups in making good technical decisions that best support Yelp’s business strategy.</p><h3 id="gtls-keep-a-holistic-view-on-engineering-health-and-success">GTLs keep a holistic view of engineering health and success</h3><p>As a community, GTLs have awareness across most of Yelp Engineering, and they work together to support the overall health and success of the engineering organization.</p><p>GTLs monitor recent incidents/retrospectives for trends and consistent issues that need more focused attention (particularly help from other GTLs). GTLs look for valuable cross-group projects that may otherwise be missed, evaluate potential solutions, and make recommendations for next steps. In many cases, these projects require careful planning and long-term investments spanning years, not just months or quarters.</p><h3 id="gtls-foster-technical-leadership">GTLs foster technical leadership</h3><p>Holding the most widely-scoped technical role at Yelp, GTLs are positioned to develop technical leadership across engineering, not just within their AoR. This involves both training new TLs and ensuring that existing TLs are set up for success.</p><p>Each week, GTLs hold office hours, alternating between North American- and European-friendly time slots. Anyone, not just TLs, can attend these office hours to ask questions, solicit feedback on technical proposals, or just listen to what’s being discussed. 
GTLs also participate in asynchronous discussions on Slack.</p><h3 id="gtls-are-exemplars-of-yelps-culture">GTLs are exemplars of Yelp’s culture</h3><p>Finally, as highly visible leaders, GTLs set positive examples of our <a href="https://www.yelp.careers/us/en">Yelp values</a>. They “play well with others” by handling disagreements constructively and demonstrating how to solve problems through consensus rather than authority. They are “tenacious” and “unboring” by finding creative solutions to Yelp’s most difficult, far-reaching problems. They are “authentic” by communicating openly and honestly, and they “protect the source” by making reliable products and services that help connect Yelp’s consumers with great local businesses.</p><h2 id="up-next-how-yelp-approaches-engineering-management">Up next: How Yelp approaches engineering management</h2><p>We hope you’re enjoying this blog series and find the peek into Yelp’s engineering culture meaningful! Next up we’ll discuss Yelp’s approach to engineering management, how we measure managers’ success, and provide a glimpse at their responsibilities and values.</p><p>Finally, if you’ve been reading these posts and think Yelp sounds like a great place to work (it is!), then head over to our Careers site – <a href="https://www.yelp.careers/us/en/c/engineering-jobs">we’re hiring</a>!</p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/06/engineering-career-series-technical-leadership-at-yelp.html</link>
      <guid>https://engineeringblog.yelp.com/2021/06/engineering-career-series-technical-leadership-at-yelp.html</guid>
      <pubDate>Thu, 17 Jun 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Modernizing Business Data Indexing]]></title>
      <description><![CDATA[<p>On the Yelp app and website, there are many occasions where we need to show detailed business information. We refer to this process as Data Hydration: filling out a “dry” business with compelling, rich data. Whether on the home screen, search results page, or business details page, there is a large set of properties we may show about any given business, everything from name and address to photos, <a href="https://blog.yelp.com/2019/03/yelp-announces-verified-licenses-bringing-peace-of-mind-to-booking-a-professional">Verified Licenses</a>, insights, and more. These properties are stored in a variety of different databases, and their display is subject to a significant amount of filtering and transformation logic. All of this creates challenges for scaling and performance.</p><p>One technique we rely on heavily is the use of materialized views. Using this technique, we gather the data from the various sources and apply the transformation logic offline, storing the result in a single key-value store for rapid fetching. For many years, the indexing process for this system was our home-grown ElasticIndexer, which has become outdated and doesn’t take advantage of recent advances in Yelp’s backend data processing infrastructure. This post tells the story of our migration from the legacy system to an improved ElasticIndexer 2, meeting several challenges in the process and ultimately delivering a host of advantages.</p><p>Let’s take a closer look at our materialized view and the role it plays in our Data Hydration system. As a motivating example, consider the delivery property. 
This shows up in the UI when a restaurant offers delivery through the Yelp platform.</p><div class="c1"><img src="https://engineeringblog.yelp.com/images/posts/2021-06-07-modernizing-business-data-indexing/oil-vinegar.png" alt="A restaurant with delivery" /></div><p>For various reasons, the form of a business’s delivery-related data stored in our database is not the same as that served to clients such as the website or app. For one, the database schema is relatively static to accommodate data from years ago, while the client applications are constantly changing. Also, the database form is optimized for data modeling, while the form sent to clients is optimized for speedy processing. Thus, transformation logic needs to be applied to the data fetched from the database before being sent to the clients.</p><p>The central challenge of maintaining a materialized view of this property is to react to changes in the underlying data store to update the view with the transformed property. This all must happen in real-time to avoid serving stale data. This becomes especially complicated when a property depends on multiple database tables, which is true for many properties including delivery availability.</p><p>For many years, we used ElasticIndexer to index the materialized view for our Data Hydration platform. ElasticIndexer listens to table change logs (implemented as a separate MySQL table) in the underlying databases, and, in response to changes, will issue database queries and run the transformation logic, ultimately writing the result to the Cassandra materialized view. As a performance and scaling measure, the change logs only contain the primary key of the row being changed, so re-fetching the row from the database is required for any non-trivial transformation. In cases where the business ID is not the primary key of the database table, a domain-specific language (DSL) was used to establish a mapping between a given row and the relevant business ID. 
This process is illustrated below.</p><div class="c1"><img src="https://engineeringblog.yelp.com/images/posts/2021-06-07-modernizing-business-data-indexing/ei1.jpg" alt="Elastic Indexer 1" /></div><p>While this system has generally served us well, there are several downsides to this approach. First, the need to re-issue queries to the database unnecessarily increases the load on the database and introduces race conditions. Database deletes are not supported, as the row would be gone when the indexer would query it. Rewinding the materialized view to an arbitrary point is not possible. Specifying relationships between the different tables was awkward in the special-purpose DSL. Having properties based on the current time was hacky to implement. And parallelizing the process was difficult given the implementation of the change log.</p><p>There must be a better way…</p><p>As stream processing tools and systems such as <a href="https://flink.apache.org/">Flink</a> become more mature and popular, we have decided to create our next generation Data Hydration indexing system based on these new technologies. MySQL is the authoritative source of truth for most applications at Yelp. We stream real-time changes in MySQL to Kafka topics using <a href="https://github.com/Yelp/mysql_streamer">MySQLStreamer</a>, which is a database change data capture (CDC) and publishing system. Once this data is available in Kafka data pipelines, we have a variety of handy customized stream processing tools based on <a href="https://flink.apache.org/">Flink</a> to do most of the necessary data transformation on business properties before storing them in materialized views in Cassandra:</p><ul><li>StreamSQL: A Flink Application for performing queries on one or more Kafka data streams using syntax supported by Apache Calcite.</li>
<li><a href="https://engineeringblog.yelp.com/2018/12/joinery-a-tale-of-unwindowed-joins.html">Joinery</a>: A Flink service built at Yelp, performing un-windowed joins across keyed data streams. Each join output is in the form of a data stream.</li>
<li>Aggregator: A Flink-based service that aggregates Data Pipeline messages. Think of it as the GROUP BY SQL statement over streams.</li>
<li>Apache Beam: An open-source unified programming model that allows users to write pipelines in a set of different languages (Java, Python, Go, etc.) and to execute those pipelines on a set of different backends (Flink, Spark, etc.).</li>
<li><a href="https://engineeringblog.yelp.com/2019/12/cassandra-source-connector-part-1.html">Cassandra Sink</a>: A Flink-based data pipeline data connector for Apache Cassandra. It is responsible for reliably loading data pipeline messages into Cassandra in real time.</li>
<li>Timespan Updater: A Flink-based tool to schedule data transformation tasks based on date and time conditions.</li>
</ul><p>There are several cases where data transformation requires complex logic that the above tools alone cannot implement. To support such cases, we define the logic in a stand-alone service that Beam jobs can call to retrieve transformed data. The following figure illustrates a high-level overview of our new indexing topology:</p><div class="c1"><img src="https://engineeringblog.yelp.com/images/posts/2021-06-07-modernizing-business-data-indexing/ei2.jpg" alt="Elastic Indexer 2" /></div><p>The new mechanism reduces database load dramatically, as most of the transformation is done inside Flink applications. With this system, the source of data changes can be any data stream, and we are no longer limited to getting changes only from MySQL. Backfilling data, whether to add new properties or to recover from failures, is relatively easy: we change data pipeline schemas and reset or rewind the input streams’ offsets. All of the data pipeline tools at Yelp support delete operations, which makes it very easy to delete business properties from materialized views. This ensures that we don’t store stale data in Cassandra. Since both Kafka and Flink are built for distributed environments, they provide first-class parallelization, which can be used to increase indexing throughput, especially when backfilling data for all businesses.</p><p>One of the main challenges we faced during the migration was porting complex business logic from stand-alone services (Python or Java) into Flink applications, even though we had Flink applications covering many specific use cases. Some of these logic migrations required complex streaming topologies that were hard to maintain and monitor.</p><p>The legacy indexer’s logic was in multiple microservices. Not only was this logic used by the legacy indexer, but also by other applications and clients. That’s why we couldn’t simply move the logic to the data pipeline. 
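</p><p>To make the pattern concrete, here is a minimal, hypothetical sketch in plain Python (not real Flink or Beam code; every name is invented for illustration) of the core indexing step: consume a CDC change event that already carries the full row, apply the transformation logic, and upsert the result into a materialized view keyed by business ID.</p>

```python
from dataclasses import dataclass

@dataclass
class ChangeEvent:
    """A CDC message from the data pipeline (hypothetical shape).

    Unlike the legacy change log, it carries the full row, so no
    database re-query is needed and deletes can be represented too.
    """
    table: str
    business_id: int
    row: dict

def transform_delivery(row: dict) -> dict:
    # Toy stand-in for the transformation logic; in complex cases the
    # real system calls out to a stand-alone service instead.
    return {"delivery_available": bool(row.get("offers_delivery"))}

def index_event(view: dict, event: ChangeEvent) -> None:
    # Upsert the transformed property into the materialized view,
    # keyed by business ID (a stand-in for the Cassandra sink).
    view.setdefault(event.business_id, {}).update(transform_delivery(event.row))

# A tiny simulated stream of change events.
view: dict = {}
for ev in [
    ChangeEvent("delivery", 1, {"offers_delivery": 1}),
    ChangeEvent("delivery", 2, {"offers_delivery": 0}),
]:
    index_event(view, ev)

print(view)  # {1: {'delivery_available': True}, 2: {'delivery_available': False}}
```

<p>In the real pipeline the event source is Kafka, the transformation runs inside Flink or Beam applications, and the sink is Cassandra rather than an in-memory dict.</p><p>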
We would have had to create duplicate logic in our Flink applications to keep other parts of Yelp’s microservice ecosystem working smoothly. This could easily lead to discrepancies in application logic between microservices and Flink applications, especially when new complex logic that is hard to build in our generic Flink applications is added to a microservice. This is why we kept some of the logic in microservices and called them from Beam jobs whenever needed.</p><p>One of the biggest requirements for this project was to switch to the new system without causing any downtime for downstream services. We achieved this goal with a multi-step launch process, rolling out one property at a time. We ran the legacy and new indexers in parallel so that both Cassandra clusters had the same data. The next step was to verify that the data in the new cluster matched that of the old one. Because of the large amount of data and the real-time indexing aspect of both indexers, we couldn’t simply do a direct one-to-one comparison between records in each cluster by querying them directly from Cassandra. Instead, we modified the consumer service of this data to pull data for a small percentage of requests from the new Cassandra cluster in the background, while it was serving users with the data from the old cluster. Then we logged both old and new data. After collecting enough data samples, we ran a sanity-check script to verify that the new data was correct. Only after this step did we have enough confidence to switch the consumer service to read data from the new cluster.</p><p>Fantastic! We now have a proper monitoring setup for our data ingestion system, which gives us granular information about and control over each component. Maintenance has become a lot easier. 
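</p><p>As a rough illustration of the dark-read verification described above (all names hypothetical, plain Python rather than our actual services), the consumer-side comparison might look like:</p>

```python
import random

def serve_with_dark_read(key, old_store, new_store, mismatch_log,
                         sample_rate=0.01, rng=random.random):
    # Always serve users from the old cluster during verification.
    old_value = old_store.get(key)
    # For a small sample of requests, also read the new cluster in the
    # background and log any discrepancy for the sanity-check script.
    if rng() < sample_rate:
        new_value = new_store.get(key)
        if new_value != old_value:
            mismatch_log.append((key, old_value, new_value))
    return old_value

# Identical stores produce no mismatches, even at 100% sampling.
old = {"biz:1": {"delivery": True}}
new = {"biz:1": {"delivery": True}}
log = []
assert serve_with_dark_read("biz:1", old, new, log, sample_rate=1.0) == {"delivery": True}
assert log == []
```

<p>Only once the mismatch log stays empty across enough samples does the consumer switch to reading from the new cluster.</p><p>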
We can now scale the indexer for each property up or down according to its load without affecting indexing jobs for other properties.</p><p>We also have a proper dead-letter queue that can be used to backfill properties for businesses that fail for various reasons. With this tool, we know the exact count of failing records whenever failures occur.</p><p>Many people were involved in this project, but special thanks to Yujin Zhang, Weiheng Qiu, Catlyn Kong, Julian Kudszus, Charles Tan, Toby Cole, and Fatima Zahra Elfilali, who helped with the design and implementation.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>Are you interested in using streaming infrastructure to help solve tough engineering problems?</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/b5d226cd-6ea1-4d12-b875-725b331202b7?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/06/modernizing-business-data-indexing.html</link>
      <guid>https://engineeringblog.yelp.com/2021/06/modernizing-business-data-indexing.html</guid>
      <pubDate>Mon, 07 Jun 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Engineering Career Series: Career paths for engineers at Yelp]]></title>
      <description><![CDATA[<p>About 5 years after joining Yelp, I was managing several teams in our <a href="https://www.yelp.careers/us/en/yelp-jobs-in-germany">Hamburg, Germany office</a> and asked my manager, a director at the time, what were the expectations for an engineering manager versus a director. While the conversation was helpful to me at that moment, the gist was basically “we haven’t written that down.” As you can imagine, it’s hard to know both where you stand and how to grow if that’s <a href="https://engineeringblog.yelp.com/2021/04/engineering-career-series-building-a-happy-diverse-and-inclusive-engineering-team.html">not captured anywhere for you to read</a>.</p><h2 id="where-we-started">Where we started</h2><p>For many years, the career track for engineers at Yelp was not documented. People still advanced in their careers; we just didn’t have written, consistent guidelines on how. For example, some engineers took on Tech Lead responsibilities, but it wasn’t always clear whether that was a temporary role or a level. In early 2016, we introduced our first directors within engineering management, but there was no engineering-wide documentation on what led to one title vs. another.</p><h2 id="how-did-we-get-here">How did we get here?</h2><p>With over 500 engineers by this time in early 2016, you’re probably wondering what led us to this point. There are even 10-person startups with leveling systems in place. A couple of concerns made us hesitant to roll out anything:</p><ul><li>Many of our engineering leaders came from organizations with toxic leveling systems, characterized by contentious career conversations between managers and engineers that, instead of focusing on growth, involved stressful political games and quid pro quo schmoozing around annual performance reviews and promotion periods.</li>
<li>We associated leveling with titles in our minds, and we wanted to avoid the latter. We saw at other organizations how titles led to folks earlier in their careers having their input disregarded purely due to their title. Many of Yelp’s greatest accomplishments come from interns having an equal seat at the table, and that’s an aspect of our culture we were keen to retain.</li>
</ul><p>That said, we all agreed the status quo was no longer working. Folks who joined, and even long-timers, weren’t sure what was expected of them since expectations were not explicitly captured anywhere. Similarly, it wasn’t clear why their compensation changed or didn’t over time. And, as we continued to grow, interviewers weren’t consistent in what they expected of candidates, so we needed a small group of calibrators to vet candidates. That process also wasn’t scaling and was <a href="https://engineeringblog.yelp.com/2021/04/engineering-career-series-hiring-a-diverse-team-by-reducing-bias.html">susceptible to implicit bias</a>.</p><h2 id="first-attempt">First attempt</h2><p>With that in mind, we set out on our first attempt to capture expectations of software engineers and tech leads. We produced two documents along with self-assessment sheets that engineers could use to assess themselves on a scale of 1 to 5 from “no experience” to “very confident.”</p><h3 id="what-worked">What worked</h3><p>This was a good exercise in turning the implicit into the explicit, and it prompted all of engineering leadership to think about and write down what we’re looking for in engineers and what we value as an organization. It was also a good starting point for interviewers to calibrate on what to look for in candidates.</p><h3 id="what-didnt-work">What didn’t work</h3><p>For starters, it was a single set of expectations for all engineers. For example, one of our expectations, “be a stabilizing force within your team, technically, emotionally, and culturally,” means something very different for someone who just joined Yelp in their first job than for someone with 15+ years of experience. Since these expectations were one-size-fits-all, they weren’t linked to compensation, which left that part still ambiguous for folks.
Finally, the self-assessment was entirely voluntary, and not everyone completed it.</p><h2 id="second-attempt">Second attempt</h2><p>We felt it was important to address the one-size-fits-all aspect of the expectations. We reviewed a number of blog posts and articles from our peers that helped us break out our expectations into a two-dimensional grid (shout-outs to <a href="https://labs.spotify.com/2016/02/08/technical-career-path/">Spotify</a>, <a href="http://dresscode.renttherunway.com/blog/ladder">Rent the Runway</a>, <a href="http://engineering.chartbeat.com/2015/06/05/engineering-ladders/">Chartbeat</a>, and <a href="http://joelonsoftware.com/articles/ladder.html">Fog Creek</a> for sharing their journeys and results).</p><p>We established 5 milestones to indicate the scale of impact an engineer was having:</p><ul><li><strong>Self:</strong> You’re focused on what you can personally deliver.</li>
<li><strong>Team:</strong> You have a significant impact on your whole team.</li>
<li><strong>Group:</strong> Your contributions are recognized and sought out by engineers across several teams or your tech community.</li>
<li><strong>Company:</strong> Your work is impactful across the entire company.</li>
<li><strong>Industry:</strong> You drive changes that advance Yelp’s interests across the industry.</li>
</ul><p>And we assessed those against 5 dimensions that captured what we valued as an engineering organization:</p><ul><li><strong>Technical Skill:</strong> Your depth of knowledge and expertise in your specific domain or current position.</li>
<li><strong>Ownership:</strong> You take responsibility for your actions as well as those of your team, and you hold others to the same standards. You deliver projects with tangible results consistently and in a timely fashion.</li>
<li><strong>Business Insight:</strong> You understand how projects and decisions benefit Yelp as a company. You design &amp; build solutions to deliver long-term value while also being flexible to accommodate rapid change.</li>
<li><strong>Continuous Improvement:</strong> You continuously learn and grow, and you invest time in mentoring others. You never accept the status quo for yourself, your peers, or the org as a whole.</li>
<li><strong>Leadership:</strong> You communicate clearly and effectively. You optimize for the group’s success and advocate Yelp’s values of inclusivity and support. You strengthen your team by championing Yelp internally &amp; externally.</li>
</ul><p>This time, we also asked that all engineers have an assessment conversation with their manager to see where they stood in the milestones instead of making it optional.</p><h3 id="what-worked-1">What worked</h3><p>This addressed the main pain point of our first attempt by providing a graduated scale that outlined different responsibilities for someone just starting out in their career versus for someone who had been at Yelp for several years already. We also established a clear expectation around progression by requiring that all engineers move from Self to Team in every dimension within two years. And we made this exercise mandatory at rollout, which helped us collect data across the organization.</p><h3 id="what-didnt-work-1">What didn’t work</h3><p>Post-rollout, many engineers weren’t motivated to keep having the conversation with their manager. Since the framework still wasn’t tied to their compensation, it wasn’t clear how this benefited them. It also wasn’t granular enough for the makeup of our teams at the time. Most of engineering sat at either Self or Team, with more senior engineers approaching Group. The Company and Industry milestones, while aspirational, felt out of reach. Lastly, our framework didn’t roll up to a single level, which made it hard for recruiters to explain to candidates who were accustomed to level terminology from other companies.</p><h2 id="third-attempt">Third attempt</h2><p>So with all that, in 2018, we embarked on a third attempt. We still didn’t want to introduce titles for the same reasons we avoided them initially, so our levels would be kept private and not be associated with titles.</p><p>Our third framework addressed some of the key lessons from our first two attempts:</p><ul><li>We dropped the top Industry milestone as being too aspirational, and we added more steps between the remaining milestones. We now had six levels, IC1 through IC6, for the four steps in our second attempt. 
The IC1-IC6 terminology also matches what other companies use.</li>
<li>Most importantly, we did what the previous frameworks didn’t with this third attempt: we tied it to compensation. Each level has an associated merit band, equity band, and cash bonus target based on role and location.</li>
<li>Finally, as <a href="https://engineeringblog.yelp.com/2021/05/engineering-career-series-using-structured-interviews-to-improve-equity.html">Kent and Grace wrote about previously</a>, we revamped our interview process to be based on our dimensions and began using levels and associated compensation bands when hiring.</li>
</ul><p>This framework also came with a web app, based on <a href="https://github.com/Medium/snowflake">Medium’s Snowflake</a>, to navigate the dimensions and levels.</p><figure><p style="text-align: middle;"><img src="https://engineeringblog.yelp.com/images/posts/2021-06-03-engineering-career-series-career-paths-for-engineers-at-yelp-snowflake.jpg" alt="Yelp Product &amp; Engineering Levels Web App" /></p>
<figcaption><small>Yelp Product &amp; Engineering Levels Web App</small>
</figcaption></figure><h3 id="what-worked-2">What worked</h3><p>Everyone in engineering uses the leveling system; it’s not optional. Job offers include a level, levels are recorded as part of each employee’s profile, and compensation adjustments are based on progress within or across levels. Seeing our success, other departments throughout Yelp adopted our model, and we now have levels using the same format for our Product, Business Systems Analyst, IT, and Engineering Management roles, with other teams adopting it every quarter. All of these various leveling frameworks are visible to everyone at Yelp, so people know what to expect if they’re considering a role change.</p><h3 id="what-we-missed">What we missed</h3><p>I didn’t call this section “third time’s the charm” because, let’s be honest, we didn’t get everything right even with this third attempt.</p><p>After the initial rollout, managers were mostly left on their own to calibrate. Senior levels, IC5+, required calibration discussions with the Chief Technology Officer, but the process was ad hoc for levels below that. Transparency around leveling was also inconsistent, with some managers having a shared document with their engineers, and others keeping the whole process opaque.</p><p>In recent quarters, we’ve been addressing these pain points by rolling out an organization-wide calibration process among manager groups that happens every quarter before promotions are submitted. In addition, managers now maintain a historical record of an engineer’s progression that is shared with the engineer and allows them to contribute to the leveling process with their own data points.</p><h2 id="there-is-no-done">There is no “done”</h2><p>As an organization grows, its leveling framework and processes will start to break down. What was once widely understood will become foreign to newer members of the team. 
For that reason, we’re constantly reviewing and iterating on what we have.</p><p>While employees now have a tool at their disposal to understand the expectations once they join, hiring managers still have to verbally walk through all this with potential new hires, making it challenging for them to clearly understand what’s expected of a certain level. To address this, we’ll be publishing our engineer and engineering manager leveling frameworks in the coming months. We believe everyone should know what they’re signing up for and what career growth looks like at the company they’re joining.</p><h2 id="up-next-technical-leadership-at-yelp">Up next: Technical leadership at Yelp</h2><p>One area that overlaps our career framework is technical leadership. Unlike some of our peers, we view technical leadership as a role, not a level. In our next post, Jason Sleight and Josh Walstrom, two of our Group Technical Leads, will walk us through why we approach it that way and how we’ve worked to build a collaborative, cross-pollinating community of technical leaders who work together regularly to solve some of Yelp’s biggest problems.</p><p>If you also dream of leveling frameworks and are passionate about fostering an environment for career growth, <a href="https://www.yelp.careers/us/en/c/engineering-jobs">we’re hiring</a>!</p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/06/engineering-career-series-career-paths-for-engineers-at-yelp.html</link>
      <guid>https://engineeringblog.yelp.com/2021/06/engineering-career-series-career-paths-for-engineers-at-yelp.html</guid>
      <pubDate>Thu, 03 Jun 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Engineering Career Series: How we onboard engineers across the world at Yelp]]></title>
      <description><![CDATA[<p>Like most companies, Yelp has had to make substantial changes to the way we onboard new team members over the past year. Yelpers have always been naturally good at fostering a welcoming and supportive atmosphere for new employees. Translating this into a welcoming and supportive <em>virtual</em> atmosphere hasn’t happened organically. As we grow distributed teams across the United States, Canada, and Europe, the new ways in which we prepare, welcome, train, and support our employees have become, and will continue to be, important for the advancement of Yelp’s Engineering &amp; Product organizations.</p><p>Going into 2020, we knew we were already outgrowing a number of our programs, and we were presented with even more challenges as we made the shift to remote work, and then permanently distributed teams. This post aims to share some of the ways we’ve adapted to these changes, and the lessons we’ve learned along the way.</p><h2 id="welcoming-new-hires">Welcoming New Hires</h2><p>Showing up to an office on your first day at a new job can be nerve-racking. Pre-COVID, we did our best to put new hires at ease. They were connected with their mentor as soon as they arrived, their desk was set up with the right tech and a welcome kit, and they were invited to a team lunch. When we made the shift to remote work and distributed hiring, we knew we had to figure out how to replicate this welcoming environment virtually.</p><p>Through partnerships with our IT and Workplaces departments, we now ensure that the all-important first-day welcome kit is shipped with new hires’ equipment. Once folks have had the chance to log in, they begin their day with a check-in with their new manager and mentor. These chats are casual and help give every new hire the lay of the land for their first week. Managers also arrange for new hires to connect with other members of their team and anyone they might work with on a regular basis. 
Previously, these introductions would’ve happened informally around the office and in meetings, but we now intentionally schedule these as one-on-one conversations so that new hires don’t miss out on making important connections.</p><p>We also make sure a new hire’s entire team can gather for a virtual lunch, coffee, or watercooler chat on their first day. This gives everyone the chance to welcome their new teammate in a fun atmosphere, and helps new hires start putting faces to names.</p><h2 id="tackling-remote-onboarding-challenges">Tackling Remote Onboarding Challenges</h2><p>While we were able to replicate some of the social aspects of being in an office, we were also hit with a whole new set of logistical challenges. “What time do I show up?” turned into “When can I expect a call from my manager?” Managers had to figure out how to connect with new hires in different time zones, or even in different countries. And simply having someone log in to a new computer at the office became an adventure in international shipping logistics.</p><p>After hitting a few speed bumps, we worked with teams across Yelp to set up new systems for onboarding new hires remotely. We send a regular cadence of reminders to managers, mentors, and new hires beginning two weeks before someone is expected to start. For managers and mentors, our onboarding team shares checklists to help them prepare, and most have developed complementary team-specific lists in an effort to maintain consistency across onboarding experiences. For the new hire, we share materials ahead of time, like an up-to-date technical primer that outlines the tools they’ll be using when they start. This provides folks with the foundational information they need to hit the ground running on their first day.</p><h2 id="revamping-new-hire-orientation">Revamping New Hire Orientation</h2><p>Historically, Yelp held a one-hour orientation session for new hires on their first day. 
This session, called “Space Camp,” was in-person and typically hosted by a leader from within Engineering. Space Camp was something we knew we were outgrowing. We tried to pack a lot of information into just 60 minutes, which overwhelmed a lot of new hires. And because we invited all Engineering and Product roles to this session, we had to keep the content broad, which meant the session wasn’t actually helpful for many of the folks who attended.</p><p>We started looking at an overhaul of Space Camp in January 2020. But when COVID hit, we had to pivot. We knew we needed programming that would scale across time zones, be relevant for folks in a wide variety of roles, and provide everyone with a sense of belonging in their first few weeks. In close partnership with our People Operations team, we eventually landed on a blended approach that includes:</p><ul><li><strong>Virtual instructor-led orientation.</strong> Led by our People Operations team, this happens on a new hire’s first day, and covers everything from our company values to benefits.</li>
<li><strong>On-demand e-courses.</strong> People Operations maintains a collection of resources for all Yelp employees, while our Technical Talent team maintains resources more specific to Engineering &amp; Product. This includes a collection that dives into our Engineering team structure, culture, and goals for the year, as well as a collection of ramp-up materials for software engineers.</li>
<li><strong>Virtual meet &amp; greets.</strong> All new hires are invited to a virtual meet and greet with our Internal Community Manager within their first 30 days. These chats are fun and informal, and give new hires the opportunity to learn about unique aspects of Yelp, such as Yelp Employee Resource Groups.</li>
</ul><p>In 2021, we’re looking to expand our virtual instructor-led and culture offerings. Currently in the works are virtual whiteboarding sessions called Build Me a Yelp, which are interactive introductions to Yelp’s infrastructure that give new hires the chance to ask questions. We’re also aiming to ensure all Engineering &amp; Product new hires have the chance to connect with the Vice President of their organization in their first 90 days.</p><h2 id="strengthening-mentorship--local-buddy-programs">Strengthening Mentorship &amp; Local Buddy Programs</h2><p>Strong mentorship is a crucial part of setting new hires up for success. We assign all new hires a mentor weeks before they start. This way mentors have plenty of time to prepare for their new teammate. Our mentors review and update team documentation, arrange for regular one-on-ones with their mentee, facilitate connections with other teams, set expectations, and provide feedback.</p><p>We combined the lessons we’ve learned over time with a few new guardrails once we started making the shift towards distributed teams. We now:</p><ul><li>Make sure all mentors complete a “Mentorship 101” course. More on this below!</li>
<li>Match mentors and mentees that share the same core working hours.</li>
<li>Recommend that all mentors have at least one year of experience on the team so they’ll have a good understanding of both Yelp and team-specific norms.</li>
<li>No longer require that mentors be more technically experienced than their mentee. We view the job of a mentor to be showing the new hire what life at Yelp is like and how to be successful on the team, and we provide other technical resources to help fill in any potential gaps in knowledge.</li>
</ul><p>We also recognize that remote mentorship is just…different. There may not even be a mentor on the team whose working hours overlap with the mentee’s! In these instances, we encourage managers to find a “local buddy.” Buddies are coworkers in the same time zone who can be a resource for questions and support if the new hire’s mentor and other teammates are offline.</p><p>We’ve also worked to find the positives when a mentor typically starts their day later than their mentee due to a time zone difference. Providing the new hire with items to complete on their own, such as onboarding courses or small code fixes, gives them a chance to feel productive and collect questions before they connect with their mentor each day. But once their mentor is online, we encourage everyone to default to over-communicating. This includes sending messages to indicate availability to field questions, utilizing icons and status updates, and making sure working hours are publicly displayed on calendars. Some mentors have even experimented with keeping a call open on Slack or Google Meet throughout the day to simulate the ability to simply turn <a href="https://www.youtube.com/watch?v=2EwViQxSJJQ">to the left</a> and ask a quick question.</p><h2 id="training-for-mentors">Training for Mentors</h2><p>We know that “<a href="https://engineeringblog.yelp.com/2021/04/engineering-career-series-building-a-happy-diverse-and-inclusive-engineering-team.html">no process” is another name for “bias</a>,” so as we ramped up hiring going into 2021, we established a formal training program for all mentors. Training helps to ensure all mentors are providing a consistent experience for their mentees.</p><p>The first step for any new mentor is to complete an on-demand e-course focused on the fundamentals, like “why be a mentor?” and expectations for mentorship at Yelp. 
We regularly refresh this content to ensure the information is up-to-date and remains useful as the company (and the world around us!) changes.</p><p>We also create spaces for mentors to connect and discuss any questions, issues, or lessons learned. All mentors are invited to participate in live quarterly discussion sessions facilitated by experienced mentors and managers where they can bring questions to discuss with the group and compare approaches to common scenarios. If mentors have questions they’d like to discuss on the spot, they can turn to an internal channel to pose them to everyone who’s ever been a mentor.</p><p>Beyond providing fundamental training and spaces to connect, we’re exploring additional workshops to level up specific skills and help to build out a strong and lasting mentorship culture. In the previous section, we mentioned that providing feedback to their mentee is a part of a mentor’s responsibilities. We know this is a significant determining factor in how quickly the new hire can ramp up. So, we recently started offering a live training session on giving and receiving feedback to all mentors. We have plans to continue supplementing this workshop over time.</p><h2 id="ongoing-learning--mentorship-programs">Ongoing Learning &amp; Mentorship Programs</h2><p>In addition to our new-hire and mentorship programs, we’re working to establish ongoing learning opportunities for everyone on our team, regardless of their tenure or role.</p><p>Like onboarding, this is an area where we’ve worked closely with our People Operations team. In 2020, we launched the Leadership Essentials and Development (LEAD) program for new managers. This program covers management basics like effective one-on-ones, coaching, and feedback, providing managers with the foundation they need to help their teams grow. 
In 2021, our People Operations team is expanding on this program by offering learning opportunities to support senior leaders.</p><p>We’re also working with teams across Engineering &amp; Product to develop learning content that’s tailored to a wide range of audiences. Below is a little taste of what we have going on in this space:</p><ul><li>We’ve launched an internal podcast series. Hosted by leaders on our Product team, the series covers best practices and ways to upskill.</li>
<li>We’re working with a group of Engineering Managers to develop resources on understanding, supporting, and working effectively with neurodiverse individuals.</li>
<li>We’re hosting virtual workshops focused on agile skills development.</li>
</ul><p>Finally, now that we’ve made a permanent shift towards distributed teams, we’ve established working groups to help ease this transition. We know this can be a tricky thing to nail, and we’re doing our best to get it right. These groups are focused on improving distributed mentorship and belonging programs, as well as establishing organization-wide policies for things like asynchronous standups, sprint planning, and roadmapping. It’s our hope that we’ll be able to provide everyone with the skills and resources they need to do their best work, no matter where they’re based.</p><h2 id="up-next-career-paths-for-engineers-at-yelp">Up next: Career Paths for Engineers at Yelp</h2><p>While it’s crucial to provide an informative and inclusive onboarding experience, and follow that up with continued opportunities for learning, we know creating the space for learning isn’t enough on its own. We also need to provide everyone with the same structured framework for growing a career at Yelp. In our next post, we’ll dive into our career paths framework, including the history behind our current leveling system, and how we view career growth as an ongoing conversation.</p><p>Lastly, if you’re finding these posts interesting and Yelp sounds like the kind of company culture that you’d like to be a part of… <a href="https://www.yelp.careers/us/en/c/engineering-jobs">we’re hiring!</a></p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/05/engineering-career-series-how-we-onboard-engineers-across-the-world-at-yelp.html</link>
      <guid>https://engineeringblog.yelp.com/2021/05/engineering-career-series-how-we-onboard-engineers-across-the-world-at-yelp.html</guid>
      <pubDate>Thu, 20 May 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Moderating Promotional Spam and Inappropriate Content in Photos at Scale at Yelp]]></title>
      <description><![CDATA[<p>The <a href="https://trust.yelp.com/">trust</a> of our community of consumers and business owners is Yelp’s top priority. We take significant measures to maintain this trust, preserving the integrity and quality of the content on our site through our state-of-the-art <a href="https://trust.yelp.com/recommendation-software/">review recommendation algorithms</a>. Though popular, review text is only one of many types of user-generated content at Yelp. Photos are also a key piece of content, and they are increasingly becoming an attack vector for spam, inappropriate material, and other unwanted behavior. In this blog post we show how we built a scalable photo moderation workflow that leverages <a href="https://engineeringblog.yelp.com/2016/07/billions-of-messages-a-day-yelps-real-time-data-pipeline.html">Yelp’s in-house real-time data streaming and processing pipeline</a>, simple heuristics, and deep learning models to deal with hundreds of thousands of photo uploads per day.</p><p>Yelp’s mission is to connect people with great local businesses. Local businesses are often small and might not have the resources to quickly identify and flag the content generated on their pages, especially if it is disruptive or deceptive, which can erode trust for both the business and its customers. Trust is deeply embedded in two Yelp values:</p><ul><li>Protect the source: community and consumers come first.</li>
<li>Authenticity: tell the truth. Content found on Yelp should be reliable and accurate.</li>
</ul><p>Yelp takes pride in its mission and values, and we constantly strive to develop and improve our systems to protect business owners and users.</p><p>With that in mind, we addressed two types of photo content spam: promotional and inappropriate.</p><p><strong>Promotional spam</strong> is an unwanted commercial message of extremely low value that tries to disguise itself as business owner content and often leads to the user being scammed (e.g. by showing a fake customer support number). We consider this a type of <em>deceptive spam</em> because it erodes the trust users place in our platform.</p><figure><p style="text-align: middle;"><img src="https://engineeringblog.yelp.com/images/posts/2021-05-12-moderating-promotional-spam-and-inappropriate-content-in-photos-at-scale-at-yelp/promotional-spam-1.png" alt="Promotional Spam Example 1" class="c1" /><img src="https://engineeringblog.yelp.com/images/posts/2021-05-12-moderating-promotional-spam-and-inappropriate-content-in-photos-at-scale-at-yelp/promotional-spam-2.jpg" alt="Promotional Spam Example 2" class="c1" /></p>
<figcaption><small>Examples of promotional spam.</small>
</figcaption></figure><p><strong>Inappropriate spam</strong> is content that can be interpreted as offensive or unsuitable in the specific context where it appears. Context is especially relevant for this type of spam, as inappropriate content covers a broad range of situations where the classification can be fairly ambiguous depending on where the content appears or which content policy applies (<a href="https://www.yelp.com/guidelines">Yelp Content Guidelines</a>). We consider this a type of <em>disruptive spam</em> because it can be abusive and offensive if not outright disturbing. Examples of this type of spam are suggestive or explicit nudity (e.g., revealing clothes, sexual activity), violence (e.g., weapons, offensive gestures, hate symbols), drugs/tobacco/alcohol, etc.</p><p>Users and business owners upload hundreds of thousands of photos every single day.</p><p>At this scale, the infrastructure and the resources required for real-time classification are a considerable challenge due to the tight response-time constraints required to maintain a good user experience. Additionally, processing photos using neural networks requires expensive GPU instances. Real-time classification is also not an ideal choice in an adversarial space because it provides immediate feedback to an attacker trying to circumvent or reverse engineer our systems. Having an indeterminate delay between content upload and moderation significantly increases the time cost for an attacker to reverse engineer the system. Conversely, unwanted content should be moderated as quickly as possible to protect our users; and since spam tends to be generated in waves, failing to remove it swiftly would likely leave large swathes of unsafe content on the platform.</p><p>There are also challenges specifically related to the machine learning (ML) algorithms used to process image data. 
Promotional and inappropriate spam is fairly rare on Yelp, which creates extremely unbalanced data and makes training and evaluating ML algorithms a lot more challenging. While we can use smart sampling techniques to produce balanced datasets for training purposes, evaluation in production is heavily skewed toward minimizing false positives, which in turn limits the recall of spammy content. Another concern we need to address is the context of a photo, especially for inappropriate content (e.g. a photo of a lingerie model is perfectly fine in a lingerie shop but not on a restaurant business page). Finally, the adversarial space requires the ability to react quickly to evolving threats and to constantly keep our models up to date.</p><figure><p style="text-align: middle;"><img src="https://engineeringblog.yelp.com/images/posts/2021-05-12-moderating-promotional-spam-and-inappropriate-content-in-photos-at-scale-at-yelp/content-label-distribution.png" alt="Content Label Distribution" class="c2" /></p>
<figcaption><small>Distribution of content types, the far right bar is "good" content.</small>
</figcaption></figure><p>As we mentioned above, any moderation of user-generated content has to work in an adversarial space. Hence, we decided not to use any out-of-the-box or third-party solutions, which we considered vulnerable to reverse engineering: because they are publicly available, attackers can experiment with them and learn to bypass them before attacking Yelp. In this case, rolling out our own custom system turns security through obscurity to our advantage by buying us time against attackers, which in turn allows us to remain ahead of the game.</p><p>We also mentioned the ML challenge of dealing with class imbalance. In our solution we focused on precision while maintaining good recall. Precision and recall trade off against each other, but we prefer a “do no harm” approach where we minimize false positives, which would lead to removing valid content. This is incredibly important for businesses that have little content on their pages, for which removing a valid photo would have a non-negligible effect. A high-precision solution also minimizes manual work for our content moderation team. This helps us deal with Yelp’s continuous growth, since manual moderation does not scale well, and it reduces exposure to inappropriate content, which can be psychologically taxing and potentially a liability.</p><p>Finally, while designing the system, we tried to leverage existing Yelp technologies and systems as much as possible to minimize engineering development cost and maintenance burden.</p><p>After considering the challenges in infrastructure, ML, and the adversarial space, we settled on a <strong>multi-stage, multi-model approach</strong> with two stages and different models for each stage and type of spam. 
The first stage identifies the subset of photos most likely to contain spam; the models in this stage are tuned to maximize spam <a href="https://en.wikipedia.org/wiki/Precision_and_recall">recall</a> while filtering out most of the safe photos. Essentially, this step changes the label distribution of the data fed into the second stage: it significantly reduces the <a href="https://en.wiktionary.org/wiki/ham_e-mail">ham</a>/spam class imbalance and removes many potential false positives (since a large subset of photos never reaches the second stage, the final set of false positives is limited to those generated by the second stage, which may or may not intersect with those from the first stage). The second stage is where the actual classification of the content happens; the models in this stage are tuned for <a href="https://en.wikipedia.org/wiki/Precision_and_recall">precision</a>, because we aimed to send only a small amount of content to the manual moderation queue and wanted to keep false positives to a minimum. We also run a set of heuristics alongside the ML models; these speed up the whole pipeline and are quickly tunable, so we can react promptly to a new threat our models cannot yet handle, which buys us time to update the models while keeping users protected. Finally, we created a Review Then Publish (RTP) moderation workflow UI where images identified as spam are hidden from users and sent to our <a href="https://trust.yelp.com/content-moderation/">content moderation team</a> for manual review. 
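</p><p>The two-stage flow described above can be sketched in a few lines; the model scores, thresholds, and action names below are hypothetical stand-ins for illustration, not Yelp’s actual code:</p>

```python
# Sketch of a two-stage spam cascade: a cheap high-recall filter followed
# by a high-precision classifier. All models and thresholds are invented.

def stage_one_score(photo):
    # Tuned for recall: almost all spam passes, most safe photos drop out.
    return 0.9 if "text_overlay" in photo["features"] else 0.05

def stage_two_score(photo):
    # Tuned for precision: only runs on the subset surviving stage one.
    return 0.95 if "promo_url" in photo["features"] else 0.2

def moderate(photo, recall_threshold=0.1, precision_threshold=0.9):
    if stage_one_score(photo) < recall_threshold:
        return "publish"  # the bulk of photos exits here cheaply
    if stage_two_score(photo) >= precision_threshold:
        # Review Then Publish: hide and send to the moderation queue.
        return "hide_and_queue_for_review"
    return "publish"

safe = {"features": ["food"]}
spam = {"features": ["text_overlay", "promo_url"]}
print(moderate(safe), moderate(spam))  # publish hide_and_queue_for_review
```

<p>Because most uploads exit at the first stage, the expensive second-stage models only ever see a small, spam-enriched slice of traffic. 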
Yelp’s content moderation team can then decide either to restore a photo if it is a false positive or to keep it hidden if it is malicious.</p><p>In the next sections we will dive into the details of what this solution looks like for each type of spam.</p><div class="c4"><img src="https://engineeringblog.yelp.com/images/posts/2021-05-12-moderating-promotional-spam-and-inappropriate-content-in-photos-at-scale-at-yelp/multi-stage-multi-model.png" alt="Multi-stage multi-model ML pipeline" class="c3" /></div><p>Most promotional spam is characterized by fairly simple graphics containing blocks of text that deliver the spam message. Therefore, the image-spam identification models used in the first stage try to identify photos containing text or logos; these models are mostly heuristic-based and very resource-efficient. In the second stage, we extract the text from the photos using a deep learning neural network. The spam classification is then performed on the text content, leveraging a <a href="https://en.wikipedia.org/wiki/Regular_expression">regular expression</a> and <a href="https://en.wikipedia.org/wiki/Natural_language_processing">NLP</a> service. The fast path provided by the regular expressions efficiently catches the most egregious cases and lets us react quickly to content that is not yet being captured by the NLP models.</p><div class="c4"><img src="https://engineeringblog.yelp.com/images/posts/2021-05-12-moderating-promotional-spam-and-inappropriate-content-in-photos-at-scale-at-yelp/promotional-spam-pipeline.png" alt="Promotional Spam ML Pipeline" class="c3" /></div><h2 id="inappropriate-spam">Inappropriate Spam</h2><p>Inappropriate spam is much more complex than promotional spam because it covers a broad range of content. Its classification is also heavily dependent on the context where it appears. 
In order to maximize recall, the first stage comprises two models: a thin <a href="https://en.wikipedia.org/wiki/Residual_neural_network">ResNet</a> trained on a binary classification task to identify inappropriate content in photos based on Yelp’s policies, and a deep <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network">CNN</a> model trained on a binary classification task to identify photos containing people. This second model was added specifically to maximize recall, since many instances of inappropriate content involve people. The second stage uses a deep learning model trained on a multi-label classification task, where the output is a set of labels with associated confidence scores. The model is then calibrated for precision based on those confidence scores and a set of context heuristics (e.g. the business category) that take into account where the content is being displayed.</p><div class="c4"><img src="https://engineeringblog.yelp.com/images/posts/2021-05-12-moderating-promotional-spam-and-inappropriate-content-in-photos-at-scale-at-yelp/inappropriate-spam-pipeline.png" alt="Inappropriate Spam ML Pipeline" class="c5" /></div><h2 id="dealing-with-spam-waves-and-adversarial-actors">Dealing with Spam Waves and Adversarial Actors</h2><p>So far we have covered mostly the ML aspects of the system and only briefly mentioned how heuristics let us quickly adapt to the changing threats coming from adversarial actors. Spam often hits websites in waves of very similar content generated by fake accounts piloted by bots. Hence, we have a workflow and a couple of infrastructure improvements specifically designed to address this. Photos flagged as spam are tracked by a fuzzy matching service. If a user tries to upload an image that matches a previous spam sample, it is automatically discarded. 
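</p><p>To make the fuzzy matching idea concrete, here is a toy difference-hash (dHash) comparison in plain Python; the hashing scheme, tiny image grids, and distance threshold are ours for illustration, not the actual service:</p>

```python
# Near-duplicate detection sketch: hash each image so that visually
# similar uploads produce similar bit patterns, then compare hashes by
# Hamming distance.

def dhash(pixels):
    # pixels: rows of grayscale values; each bit records whether a pixel
    # is brighter than its right-hand neighbor.
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

def hamming(a, b):
    return bin(a ^ b).count("1")

def is_known_spam(candidate_hash, spam_hashes, max_distance=1):
    # Small distance => the upload closely resembles flagged spam.
    # (Real dHashes are 64-bit; our toy hash is tiny, hence the tight bound.)
    return any(hamming(candidate_hash, h) <= max_distance for h in spam_hashes)

original = [[10, 20, 15], [30, 5, 40]]   # toy grayscale "image"
rehosted = [[11, 21, 15], [29, 6, 41]]   # slightly altered re-upload
spam_hashes = {dhash(original)}
print(is_known_spam(dhash(rehosted), spam_hashes))  # True
```

<p>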
On the other hand, if there is no similar spam match, the image goes through the pipelines mentioned above and may end up in the content moderation team’s queue. While awaiting moderation the images are hidden from users so that no one is exposed to potentially unsafe content. The content moderation team can also act on entire user profiles instead of just a single piece of user content. For example, if a user is found to be generating spam, their profile is closed and all associated content is removed. This noticeably improves spam recall, because we only need to catch one image from a spam bot profile to remove all of the unwanted content it generated. Finally, the traditional user-reporting channel remains in place, providing feedback that lets us monitor the effectiveness of our systems.</p><div class="c4"><img src="https://engineeringblog.yelp.com/images/posts/2021-05-12-moderating-promotional-spam-and-inappropriate-content-in-photos-at-scale-at-yelp/fuzzy-matching.png" alt="Fuzzy Matching Service" class="c6" /></div><p>In this blog post we covered some of the solutions Yelp developed to process hundreds of thousands of photos per day using a two-stage processing pipeline powered by state-of-the-art ML models. We also implemented an RTP moderation workflow so that problematic content is hidden from users until moderation happens. Finally, the system provides us with the flexibility to quickly respond to adversarial actors, fake accounts, and spam waves.</p><p><a href="http://trust.yelp.com/">Trust &amp; Safety</a> is taken very seriously at Yelp and we are proud of the work we do to protect our users and business owners. 
As a result, <a href="https://blog.yelp.com/2019/10/study-shows-97-of-people-buy-from-local-businesses-they-discover-on-yelp">Yelp is one of the most trusted review platforms on the web</a>.</p><div class="c4"><img src="https://engineeringblog.yelp.com/images/posts/2021-05-12-moderating-promotional-spam-and-inappropriate-content-in-photos-at-scale-at-yelp/whole-workflow.png" alt="The full moderation workflow" class="c5" /></div><ul><li>Thanks to Jeraz Cooper for mentoring, countless code reviews, and enabling the photo support in the moderation UI.</li>
<li>Thanks to Jonathan Wang for the insights on the inappropriate spam model.</li>
<li>Thanks to Pravinth Vethanayagam and Nadia Birouty for consulting on system design and people and logo classifiers.</li>
</ul><div class="island job-posting"><h3>Join Yelp</h3><p>Want to help us make Yelp a safer place?</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/e9a3e447-7271-431d-b8d3-29168c9c01ef?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/05/moderating-promotional-spam-and-inappropriate-content-in-photos-at-scale-at-yelp.html</link>
      <guid>https://engineeringblog.yelp.com/2021/05/moderating-promotional-spam-and-inappropriate-content-in-photos-at-scale-at-yelp.html</guid>
      <pubDate>Wed, 12 May 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Engineering Career Series: Using structured interviews to improve equity]]></title>
      <description><![CDATA[<p>For years, Yelp continued to use an interview process that was created when we were a 50-200 person Engineering organization, with only a handful of interviewers:</p><ul><li>Each interviewer wrote their own interview questions</li>
<li>A few senior leaders gave overall hire/no hire decisions for every panel</li>
<li>Interviewers received ad hoc feedback from senior leaders when it seemed like they were too tough or too easy in their interviews</li>
</ul><p>A few things went well:</p><ul><li>there was a strong sense of personal responsibility for both leaders and interviewers</li>
<li>turnaround time for offer approvals was quick</li>
<li>and <a href="https://blog.yelp.com/2021/04/yelp-values-employee-panel-theres-always-room-for-a-smile">Yelp values</a> could be preserved by senior leaders.</li>
</ul><p>As the Engineering organization grew to more than 500 employees, the interviewer pool also grew, from tens to hundreds. This created a few challenges. It became harder to enforce similar standards across interviewers. It became increasingly difficult to tell whether a candidate was strong or if an interview question was correctly calibrated. This lack of structure made it difficult to confidently and consistently identify strong candidates. It also made it difficult to identify whether there were patterns of bias in our interview process. Faced with these challenges, we asked, <strong>“How do we continue to hire diverse, amazing talent as we scale our Engineering organization?”</strong></p><h2 id="creating-structured-interviews">Creating Structured Interviews</h2><p>A group of folks across Technical Talent and Engineering banded together to answer this question. Since others had gone down this path before us, we began with a review of prior work by <a href="https://medium.engineering/mediums-engineering-interview-process-b8d6b67927c4">Medium</a>, <a href="https://www.quora.com/What-is-the-engineering-interview-process-like-at-Stripe">Quora</a>, and <a href="https://sensu.io/blog/interviewing-engineers-at-sensu-e4fc35cd601f">Sensu</a>. Those references, along with our own internal review, led to the creation of a structured interview process that reflected what we felt it took to succeed at Yelp. As a first step, we focused on standardizing questions across all of our open roles to four key question types:</p><ul><li>Problem Solving</li>
<li>System Design</li>
<li>Ownership, Tenacity, and Curiosity</li>
<li>Playing Well with Others</li>
</ul><p>The first two interview types focus on the candidate’s technical skills, and the latter two focus on non-technical skills and how aligned the candidate is with Yelp’s values. For the technical portions, we wanted to evaluate the candidate’s skill with technical tasks that would be common in the role they’re applying for, rather than their ability to memorize algorithms or easy-to-search-for trivia. To create these questions, we asked engineers across the organization to take a problem their team recently solved and create an example on a smaller scale. We strongly believe that using real-life problems to evaluate skills captures what is needed to actually succeed at Yelp and helps us give more opportunities to people with different backgrounds to be successful in our hiring pipeline.</p><p>To evaluate answers to these questions, we standardized criteria tied to the dimensions (Technical Skill, Ownership, Business Insight, Continuous Improvement, and Leadership) that we use internally for leveling engineers. This further aligned internal and external expectations of candidates and employees.</p><p>Moving to structured interviews allowed us to take the first step to both collect and analyze interview data in a meaningful way. We went from having no comparable feedback to thousands of technical and behavioral data points in a consistent format. This not only gives us the opportunity to monitor the health and size of our pipelines, but it also enables us to identify potential problems or biases at every stage of the interview process. When observing a gap or difference in dropoff rates, we are better able to drill down to specific question sets or interviewers and determine what solutions to implement to directly mitigate bias.</p><h3 id="first-try-what-we-learned">First try: what we learned</h3><p>After introducing structured interviews, we soon identified a difference in pass rates across genders in the initial round of technical interviews. 
Upon closer inspection, we found instances where candidate performance was identical when measuring how many components of a coding question were completed. However, men were progressing to the next stage of the interview process at a higher rate than women. We were able to quickly reduce this gap by replacing individual interviewers’ judgment on a candidate’s performance with standardized pass/fail criteria, which ensured that all qualified candidates moved forward. This was the first of several successful modifications, which have collectively reduced the pass-rate gap between genders. Making corrections to the early steps of the interview process has made a huge impact on gender diversity at every subsequent stage. This ultimately increased the likelihood of more women making it to the final offer stage. With better pipeline observability, we’ve been able to more effectively hire diverse talent by mitigating these biases and reducing false negatives.</p><h3 id="second-try-defining-evaluation-criteria">Second try: defining evaluation criteria</h3><p>While we were now able to both pinpoint and remedy where drop-offs were occurring in our interview process, our approach to reducing bias was still reactive. Interpretation of candidate performance varied amongst interviewers, even with the measures we had in place. We recognized that having structured interview questions wasn’t enough, and we needed explicit evaluation criteria for all of our interviews.</p><p>To address this, we introduced points-based evaluation criteria into our structured interviews. In this initiative, we further clarified what signals we wanted interviewers to look for and capture. Points are awarded for expected candidate behaviors based on a rubric. Interviewers are required to provide an explanation for when and why points are deducted. 
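</p><p>As a toy sketch of how such a rubric might be scored and aggregated; the behaviors, weights, and pass bar below are invented for illustration and are not Yelp’s actual criteria:</p>

```python
# Points-based evaluation sketch: award points for expected behaviors,
# then aggregate across the panel instead of relying on per-interviewer
# hire/no-hire judgment.

RUBRIC = {
    "decomposed_the_problem": 2,
    "wrote_working_solution": 3,
    "handled_edge_cases": 2,
    "communicated_tradeoffs": 1,
}
PASS_BAR = 6  # hypothetical average score required to advance

def score_interview(observed_behaviors):
    # Deductions (behaviors not observed) require a written explanation.
    return sum(points for behavior, points in RUBRIC.items()
               if behavior in observed_behaviors)

def outcome(panel_scores):
    return "advance" if sum(panel_scores) / len(panel_scores) >= PASS_BAR else "hold"

panel = [
    score_interview({"decomposed_the_problem", "wrote_working_solution",
                     "handled_edge_cases"}),   # 7 points
    score_interview(set(RUBRIC)),              # 8 points, full marks
]
print(outcome(panel))  # advance
```

<p>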
This scoring framework can then be aggregated and converted to hiring and initial leveling decisions to maintain consistency across the larger organization. A key benefit of this framework is that interviewers can systematically measure candidate performance during the interview, while the onus on the interviewer to decide the final interview outcome, and therefore the opportunity for unconscious (or even conscious) bias, is reduced.</p><h2 id="how-were-evolving">How we’re evolving</h2><p>If there’s anything we’ve learned from this journey, it’s that improving interviews is an ongoing process of review and adaptation. At Yelp, we’ve made this a shared priority between Technical Talent and Engineering. Our teams work closely with one another and have a dedicated task force with several subgroups composed of folks from both groups that meet on a weekly cadence to put this commitment into action. While we still have a lot on our roadmap, here are some key lessons that we have learned so far:</p><ul><li>Making interview improvements requires a real partnership. It may seem obvious to say this, but if you’re going to improve engineering interviews, you’re going to need subject matter experts from both engineering and recruiting to capture all the nuances that are often overlooked.</li>
<li>Interviewer bias still exists in your hiring process even with a standardized process and structure. A good practice to combat this is to make sure that the group working on interview processes is reflective of the demographics of your organization, or what you’d like your organization to be. Make sure women and underrepresented minorities are involved.</li>
<li>A distributed workforce means different geographies with different cultural considerations and different employment norms, so include engineers representative of all your geographies when standardizing. Our initial task force failed to include folks from our European teams and, thus, some of our interview questions were geared towards Bay Area tech culture.</li>
<li>Collecting feedback is imperative towards making progress, so make sure you create feedback loops from all stakeholders: recruiters, recruiting coordinators, interviewers, and hiring managers. Candidates are stakeholders, too, so make sure to have a process to get feedback on their interview experience.</li>
<li>Standardization allows for easier review and change, whether that is the pipeline, the interview questions, interview evaluation, or training: the list goes on. We’re still in the midst of rolling out our points-based evaluation criteria for structured interviews, and we’re able to move a lot faster because we don’t need to reinvent the wheel!</li>
</ul><h2 id="up-next-how-we-onboard-engineers-across-the-world-at-yelp">Up next: How we onboard engineers across the world at Yelp</h2><p>Equally important to bringing in diverse talent is everything that happens onwards from the moment a candidate becomes an official Yelper. In the next post in this series, we’ll take a closer look at the thought and logistics that we go through to set folks up for success, how we’ve streamlined our onboarding process for distributed teams in the virtual world, and the opportunities for continuous learning we provide to our employees through training and mentorship programs.</p><p>If you’re finding these posts interesting and Yelp sounds like the kind of company culture that you’d like to be a part of… <a href="https://www.yelp.careers/us/en/c/engineering-jobs">we’re hiring</a>!</p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/05/engineering-career-series-using-structured-interviews-to-improve-equity.html</link>
      <guid>https://engineeringblog.yelp.com/2021/05/engineering-career-series-using-structured-interviews-to-improve-equity.html</guid>
      <pubDate>Thu, 06 May 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[One year later: building Trust Levels during COVID]]></title>
      <description><![CDATA[<p>From its devastating toll on local economies to its impact on the little things like handshakes and hugs, the COVID-19 pandemic seemed to leave nothing unchanged. Local businesses were especially impacted and forced to make big changes, many overhauling their operations overnight in order to adapt to the new normal.</p><p>Businesses turned to Yelp to communicate operational changes brought on by the pandemic. They kept their communities in the know by updating the <a href="https://www.protocol.com/manuals/small-business-recovery/yelp-pivot-small-businesses-coronavirus">COVID-19 section</a> on their business pages, which was launched at the beginning of the pandemic. They indicated new health and safety precautions, such as wearing masks and enforcing social distancing. They updated their hours, scaled back sit-down dining, pivoted to support takeout and delivery, and even introduced <a href="https://blog.yelp.com/2020/06/helping-local-businesses-reopen-during-covid-19-with-new-products-and-features">virtual service offerings</a> to remain accessible to their communities.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/2021-04-29-one-year-later-building-trust-levels-during-covid/covid-19-section.png" alt="COVID-19 section on Yelp" class="c1" /></div><p>Given the surge in businesses updating their Yelp pages with COVID-19 related changes, we knew it would be important to measure how confident we are that a given piece of business information is still accurate. To address this, Yelp’s Semantic Business Information team built a new internal system called Trust Levels. 
In this blog post, we will define Trust Levels, take a look at each part of the new system, and end with an example that ties all the pieces together.</p><h2 id="defining-trust-levels">Defining Trust Levels</h2><p>At Yelp, we call our business information “business properties.” Business properties include anything that we can describe about a business, such as the business address, whether the business is women-owned, whether it can repair appliances, etc. The numerous business properties found on each Yelp page can share special insights about businesses from retail and restaurants to home and local services.</p><p>Business owners can usually set the business properties on their business page. Consumers are also able to contribute information about a business; this can be collected through our survey questions answered by people who have checked in or visited that business. Our User Operations team reviews changes to ensure quality and accuracy, and can modify the information as well. However, determining how confident we can be about any given piece of information became especially important as businesses repeatedly had to update how they operate due to changing local government policies.</p><p>In order to define our confidence levels, we first created a unified vocabulary that could be used across engineering and Product teams, to avoid each team creating its own definition of trust. We created Trust Level labels from Level 1 (L1), which means we are highly confident that the data is both accurate and current, to Level 4 (L4), which means we do not have strong or recent signals to determine accuracy. These levels, which we designed to be simple and easy to refer to, can then be used by various teams without needing to do their own calculations. 
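</p><p>A minimal sketch of how this shared vocabulary might look to internal consumers (the names and helper below are hypothetical, not Yelp’s actual API):</p>

```python
# Trust Levels as a shared enum: L1 = highly confident the value is
# accurate and current, L4 = no strong or recent signals of accuracy.
from enum import IntEnum

class TrustLevel(IntEnum):
    L1 = 1
    L2 = 2
    L3 = 3
    L4 = 4

def display_worthy(properties, max_level=TrustLevel.L1):
    # e.g. a front-end team rendering only the most trusted values.
    return {name: value for name, (value, level) in properties.items()
            if level <= max_level}

props = {
    "hours": ("9am-5pm", TrustLevel.L1),
    "offers_delivery": (True, TrustLevel.L3),
}
print(display_worthy(props))  # {'hours': '9am-5pm'}
```

<p>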
For example, if a front-end team wants to only display information of the highest confidence level, they can do so by fetching the information and Trust Level from the backend and only displaying it if the Trust Level is L1.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/2021-04-29-one-year-later-building-trust-levels-during-covid/trust-levels-table.png" alt="The different Trust Levels" class="c1" /></div><h2 id="calculation">Calculation</h2><p>Once we defined this shared vocabulary, we set out to calculate a Trust Level for each of the tens of millions of business property values on our platform. To start, we utilized one of our existing systems that tracks historical business data. The system logs all business changes to a dedicated Kafka stream for offline use cases. Each record contains a source type (business owner, external partner, etc.), source ID (which particular source provided the data), source flow (which feature or callsite the update came through), and timestamp. All of these fields are essential indicators when it comes to determining how confident we are that a given business property is correct.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/2021-04-29-one-year-later-building-trust-levels-during-covid/offers-delivery-bar-chart.png" alt="Changes to the 'Offers Delivery' property during 2020" class="c3" /></div><p>We also realized our property ingestion APIs could be improved to capture another important signal around data freshness. A lot of incoming updates we receive are “non-updating updates” — those with values that match what we already have on file. Previously, most of our ingestion flows discarded these as redundant, so we modified them to instead emit logs to a new, dedicated stream for verification events. 
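</p><p>Routing these redundant updates into a verification stream might look roughly like this; the stream names and producer stub are hypothetical stand-ins for the real Kafka setup:</p>

```python
# Sketch: instead of discarding "non-updating updates", log them as
# verification events that signal data freshness.
import time

def publish(stream, event):
    # Stand-in for a Kafka producer.
    print(stream, event["kind"])

def ingest_property(current_value, incoming_value, source):
    event = {"source": source, "timestamp": time.time()}
    if incoming_value == current_value:
        event["kind"] = "verification"   # previously thrown away as redundant
        publish("property_verifications", event)
    else:
        event["kind"] = "update"
        event["new_value"] = incoming_value
        publish("property_updates", event)
    return event

ingest_property("9am-5pm", "9am-5pm", source="external_partner")  # verification
ingest_property("9am-5pm", "10am-6pm", source="business_owner")   # update
```

<p>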
Not all verifications are equivalent, so we made sure to include the same source-related fields described above in the Kafka stream with each event, preserving context about the verification that might be useful to us later.</p><p>Equipped with historical updates and verifications, we wrote a Spark ETL job to periodically pull these logs from S3, join them on business_id and business property, and then execute a series of rules to decide which Trust Level to assign to that pair. While we won’t detail the actual algorithm here, signals of recency and source type ended up being the biggest determiners of a given business property’s Trust Level.</p><h2 id="storage">Storage</h2><p>After calculating each Trust Level value, we needed a place to store them. Trust Levels are data describing properties, so it made sense to store these values alongside other business property metadata. A metadata table was considered multiple times in the past as we constantly fielded questions about when a property value was created, what time it was last updated, what source type updated the value, or from what flow the value was updated. 
Instead of running ad hoc queries and pulling together information from multiple datastores, we centralized the metadata in a new table to make it easier to access and eventually expose Trust Levels.</p><p>We called this table business_property_metadata and gave it the following schema:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-04-29-one-year-later-building-trust-levels-during-covid/metadata-schema.png" alt="" /></div><p>Here’s an example row of the business_property_metadata table:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-04-29-one-year-later-building-trust-levels-during-covid/metadata-row-example.png" alt="" /></div><p>We chose Cassandra over MySQL as the underlying datastore for this new table, and while our rationale for this could be a blog post of its own, here are the main reasons.</p><p>We knew the table would hold tens of millions of rows, and we could safely assume clients of the data would be accessing it using the primary key (business_id, business_property_name). Cassandra provides good read and excellent write performance for data at this scale when rows are always queried on this key, which Cassandra uses in part as a partition key to distribute a table’s data to different nodes.</p><p>MySQL, which is used extensively throughout Yelp, offers different benefits that were less important for this particular use case. We don’t anticipate needing efficient joins of this metadata with other data entities, nor do we foresee the need for strict transaction mechanisms or strong consistency guarantees around these fields. Cassandra’s eventual consistency semantics are enough for this type of business information.</p><p>As a final note on storage, our metadata table is easily extendable. We have already included a column, provenance, that captures different fields around the data’s source in case our downstream consumers need access to that information. 
In the future, we will be able to add more types of metadata to the table as use cases arise.</p><h2 id="serving-trust-levels-online-and-offline">Serving Trust Levels (Online and Offline)</h2><p>The final step in enabling our Trust Levels was ensuring that the data was accessible for various teams to use. To do this, we created new dedicated REST API endpoints for querying and writing to our metadata table. We also backfilled our metadata table with historical data that we already had and calculated Trust Levels for those properties. We then migrated existing calls around business properties to our new API endpoints in order to write live updates to our metadata table. Now, with our metadata table filled with values, internal clients can access Trust Levels and other metadata through our online APIs.</p><p>For offline access, we already had existing data streams of our business property data published to <a href="https://engineeringblog.yelp.com/2019/12/cassandra-source-connector-part-1.html">Yelp’s Data Pipeline</a> and used by teams such as Search and Ads. We needed to make sure that the new metadata information was included alongside the property data in our data pipeline, while also ensuring that the data was easy to consume for our downstream clients.</p><p>In order to accomplish this, we first aggregated our data from the new metadata table along with other currently consumed tables, using a Yelp stream processing service called Flink Aggregator. The aggregator transforms the data stream to be similarly keyed by business_id, since metadata uses a different primary key (business_id, business_property_name). We then combine these streams using <a href="https://engineeringblog.yelp.com/2018/12/joinery-a-tale-of-unwindowed-joins.html">Joinery</a> to produce one data stream that shows the entire current value of that business including metadata. 
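</p><p>The re-keying step can be illustrated in plain Python; this is only a sketch of the idea, not actual Flink or Joinery code:</p>

```python
# Collapse (business_id, business_property_name)-keyed metadata records
# into one record per business_id, so the stream joins cleanly with
# other per-business streams.
from collections import defaultdict

metadata_records = [
    {"business_id": 1, "property": "hours", "trust_level": "L1"},
    {"business_id": 1, "property": "offers_delivery", "trust_level": "L3"},
    {"business_id": 2, "property": "hours", "trust_level": "L2"},
]

def rekey_by_business(records):
    per_business = defaultdict(dict)
    for rec in records:
        per_business[rec["business_id"]][rec["property"]] = rec["trust_level"]
    return dict(per_business)

print(rekey_by_business(metadata_records))
# {1: {'hours': 'L1', 'offers_delivery': 'L3'}, 2: {'hours': 'L2'}}
```

<p>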
This allows our downstream clients to utilize the same data stream with only slight modifications on their side to read the metadata — including Trust Levels — as well.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/2021-04-29-one-year-later-building-trust-levels-during-covid/online-offline-diagram.png" alt="The pipeline for serving Trust Levels" class="c1" /></div><h2 id="example-business-hours">Example: Business Hours</h2><p>To conclude, let’s connect all the steps described above by walking through an example for a business property that was updated a lot during COVID: Business Hours. Assume there is a business where the business owner last updated their hours two weeks ago, followed by a verification event from a data partner submitting the same hours one week ago. The following diagram illustrates the entire Trust Levels flow for this particular business property.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-04-29-one-year-later-building-trust-levels-during-covid/example-business-hours-diagram.png" alt="" /></div><p>Anyone at Yelp can now use this authoritative confidence label however they need it. A front-end team could use it to power new UI components indicating recently updated hours. A Search engineer could experiment with incorporating it as a feature in a ranking model. A data scientist could analyze if accurate business hours data is correlated with higher user engagement. 
Whatever it is, the Trust Levels data is ready for them, and becomes another tool we use to build helpful features for consumers and business owners during these unprecedented times.</p><h2 id="acknowledgements">Acknowledgements</h2><ul><li>We would like to thank Devaj Mitra, Surashree Kulkarni, Abhishek Agarwal, Pravinth Vethanayagam, Jeffrey Butterfield (author), Maria Christoforaki, Parthasarathy Gopavarapu, our Semantic Business Information team, our Database Reliability Engineering Team, and our Data Streaming teams who all helped make Trust Levels a reality.</li>
<li>Thanks to Venkatesan Padmanabhan and Joshua Flank for technical reviewing and editing of this post.</li>
</ul><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/04/one-year-later-building-trust-levels-during-covid.html</link>
      <guid>https://engineeringblog.yelp.com/2021/04/one-year-later-building-trust-levels-during-covid.html</guid>
      <pubDate>Thu, 29 Apr 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Engineering Career Series: Hiring a diverse team by reducing bias]]></title>
      <description><![CDATA[<p>Compared to where we started, Yelp’s technical organization has made a lot of headway over the years when it comes to diverse hiring. While our approach to this work continues to evolve, we’ve made significant progress in improving the diversity of our organizations by, among other things, reducing gender and ethnicity bias in our interview process. We’re here to share some of what we’ve learned to help others in their own efforts.</p><p>If you’ve come looking for the secret formula to emulate our success, I can’t help you there, unfortunately. Anyone offering otherwise is probably selling you something. And, to be sure, you’re going to need to buy some things along the way. But, if that newest iteration of Bias Blaster 9000 sounds too good to be true, that’s because it is. There are no easy fixes here.</p><p>In the 9 years I’ve been with Yelp, we have taken several major strides to evolve our hiring processes and strategy that got us to where we are today. What I’m covering in this blog is arguably the most critical of the changes we’ve made: tracking every bit of data possible and running regular analyses to better diagnose the bias in our engineering interviews.</p><h2 id="a-data-oriented-approach">A Data Oriented Approach</h2><p>Today, we’re monitoring every stage of every candidate’s interview process, as well as several data points about the candidates themselves. We track how many candidates apply organically versus how many are sourced and how each group performs on the first round interview. We monitor the offer rates by gender identity and how each group is converting to offer acceptance. We’re able to determine down to the level of individual questions in our interview process whether they are being passed at equal rates by all people being interviewed.</p><p>To know these things, we’ve become deliberate about tracking data and analyzing it. 
We automatically publish daily updates to a host of dashboards that monitor the health of our pipelines. We report weekly on the state of our hiring pipelines so that we can make adjustments as needed. We don’t make changes to our interview process without first knowing that we can measure the effects. With these procedures in place, we are truly able to systematically identify and address problems.</p><p>We’ve come a long way from where we started. It sounds somewhat absurd in hindsight, but early on during my tenure at Yelp, we didn’t even know how many people we needed to hire. We just knew that we needed more engineers, and that we needed them last month. We monitored how many people we were hiring per month alongside our offer-to-accept conversion rate. Sort of. As long as we remembered to track them, but it wasn’t a big deal if we forgot either. There was a lot of room for improvement.</p><h2 id="an-opportunity-to-start-fresh">An Opportunity To Start Fresh</h2><p>Just prior to the onset of the pandemic, our recruiting team was presented with a new opportunity as Yelp Engineering decided to expand its footprint to Toronto. As the pandemic unfolded, our plans pivoted from focusing on Toronto to remote hiring in Canada at large. This was our first opportunity to enter a new talent market properly with the knowledge we’d gained over the previous years.</p><p>And it seems to be working: In Q1 2021, 19% of our engineering hires in Canada identified as Black or Latinx (together, underrepresented minorities or URM), and we saw even more impressive gains in leadership positions, too.</p><h2 id="start-now">Start Now</h2><p>I’ve often regretted our inability to make quicker decisions for lack of data. It takes time to build up a sizable enough data set to understand your processes and detect the bias in them. 
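As a purely illustrative sketch (not Yelp’s actual tooling), a standard two-proportion z-test can tell you whether an observed pass-rate gap between two groups is larger than chance alone would explain:</p>

```python
import math

# Illustrative two-proportion z-test: is the gap between two groups'
# interview pass rates larger than random variation would explain?
def two_proportion_z(passes_a, total_a, passes_b, total_b):
    p_a = passes_a / total_a
    p_b = passes_b / total_b
    pooled = (passes_a + passes_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    return (p_a - p_b) / se

# A 12-point gap on 100 candidates per group: |z| < 1.96, so it is not
# yet significant at the usual 5% level -- more data is needed.
z = two_proportion_z(40, 100, 52, 100)
```

<p>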
In Yelp’s case, depending on the current rate of hiring, we’re typically able to understand the state of affairs with statistical significance after a month or two of data collection. There are of course variables that impact this. For example, top of funnel stages, such as first round interviews, produce more data.</p><p>Nine years ago, it would have taken us significantly longer to produce useful data. This is especially true at the later stages of the pipeline, when the number of candidates are reduced, and for demographics that are typically underrepresented in tech, because there weren’t enough in the pipeline to make statistically significant conclusions about. If you’re just getting started, the sooner you’re tracking recruiting data, the sooner you’ll be making meaningful changes to your processes.</p><h2 id="essential-data-points">Essential Data Points</h2><p>If we were starting fresh today, there are three data points I’d want to make sure we started collecting immediately.</p><ol><li><strong>Proceed/Did not proceed rates at every stage of the interview process</strong> - This one might go without saying, but it’s the foundation everything else is built on. Start from the point of contact all the way through to offer acceptance. Everything else is useless without an understanding of the proceed/did not proceed rates at each interview stage.</li>
<li><strong>Candidate source</strong> - Knowing where your candidates are coming from generates a number of insights. Do applicants from career fairs or job boards get more offers? Most people jump to wanting to find the most successful sources, but it’s equally valuable to know your least successful sources. Candidates from certain sources falling out of your pipeline at a disproportionate rate can be very telling. We’ve seen this manifest as non-traditional CS educations, such as bootcamps, being rejected at disproportionate rates. This indicated that we needed to make it explicit in our interview evaluation criteria that we’re unconcerned with an applicant’s educational background, and changing these criteria has been highly effective at making sure candidates with a wide range of education backgrounds proceed equally through the pipeline.</li>
<li><strong>Candidate demographics</strong> - Being able to analyze your pipeline by the demographics of the candidates is extremely helpful. For instance, it’s well known that there is a gender disparity in the tech space. Tying gender or ethnicity back to the previous two data points allows for powerful insights into which interview stages are problematic. As an example, we were able to identify early on in our data that women were less likely than men to attempt the code test, which is the first step to our interview process. A surprisingly effective intervention here was to ask all candidates a second time to participate, which is a good reminder that you don’t always need to reinvent the wheel to make change.</li>
</ol><p>Point 3 comes with two <strong>very</strong> important caveats.</p><ol><li>Collecting this data is subject to different legal requirements depending on your location. Consult legal experts before moving forward.</li>
<li><strong>No one responsible for making hiring decisions can have access to this data.</strong> The trackers that hold this data are managed by our operations people and access is granted only to the sourcers and recruiters tracking the data.</li>
</ol><h2 id="assess-your-systems">Assess Your Systems</h2><p>Don’t let perfect be the enemy of good when tracking your data. Teams can be overwhelmed with the possibilities of what to track and how to go about it. A good place to start is by getting a grasp on what your existing systems can provide. You likely have some sort of applicant tracking system (ATS) that can provide some types of pipeline metrics. Learn what your system can do for you and how it does it.</p><p>It’s likely you’ll have to supplement your ATS with custom-made solutions, as there’s going to be data that your ATS is unable to provide. Don’t be too good for spreadsheets. I know, I know, there has to be a better way. There’s always a better way. Getting the data matters more than how you’re getting it. If spreadsheets allow you to track your data while you find a more permanent off-the-shelf solution or your teammates in engineering build you something, do it. We’ve relied on spreadsheets for years. Even though we’ve incorporated tools such as Tableau, spreadsheets have remained an important part of our system.</p><h2 id="proceduralize">Proceduralize</h2><p>Good, reliable data depends on maintaining consistent data collection practices. Depending on your systems, some of this will be automatic. At Yelp, we track a sizable amount of data manually and use our ATS for everything it can accurately, automatically track. For everything else, we rely on our recruiters and sourcers to manually track data in spreadsheets.</p><p>Each recruiter and sourcer has a centrally managed tracker that they use to track their candidates from start to finish. There is no room for interpretation about what data to collect and how to collect it. Every tracker is exactly the same and every team member tracks the same data.</p><p>Maintaining and analysing your data should also be the explicit responsibility of someone on your team. 
For us, things really took off when we created a full-time operations role within the recruiting organization. Taking this work seriously requires constant maintenance that can’t be done in “spare” time.</p><h2 id="pitfalls">Pitfalls</h2><p>This approach is not foolproof. There are mistakes to be made, and we’ve made our fair share of them. Chief among them is drawing conclusions from data that is not statistically significant. We’re often dealing with fairly small data sets, and it can be very tempting to make changes based on perceived patterns. In these cases, patience is key.</p><p>If something looks off, definitely pay attention, but try not to jump to conclusions. Rolling a change back hurts and also messes up your painstakingly collected data. It ends up being more damaging in the long run to make changes based on data that hasn’t reached significance.</p><h2 id="structured-interviews">Structured Interviews</h2><p>Part 2 of this post will go into detail on how we’ve built a structured interview process and acted on the data that we’ve collected. Layering structured interviews on top of our data collection and analysis practices has allowed us to make fine-grained tweaks to the interview process that would be otherwise impossible. Our insights have led us to a points-based system in our latest iteration of structured interviews that will further our goal of more equitably scoring interview performance.</p><p>Lastly, if you’re finding these posts interesting and Yelp sounds like the kind of company culture that you’d like to be a part of… <a href="http://www.yelp.com/careers">we’re hiring!</a></p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/04/engineering-career-series-hiring-a-diverse-team-by-reducing-bias.html</link>
      <guid>https://engineeringblog.yelp.com/2021/04/engineering-career-series-hiring-a-diverse-team-by-reducing-bias.html</guid>
      <pubDate>Thu, 22 Apr 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Engineering Career Series: Building a happy, diverse, and inclusive engineering team]]></title>
      <description><![CDATA[<p>I considered writing this as a clickbaity listicle: “7 secrets of engineering team management - you won’t believe number three!” Unfortunately that’s impossible, because it’s a much harder topic, and anyway, number three is: “many years of ongoing investment in building the right team culture, making a lot of mistakes, and learning from them.” Less catchy, but much more what this series is going to try and cover…</p><p>I’ve been at Yelp for eight years now, and I’ve been leading engineering teams for almost 25 years in both the UK and the US, at a wide variety of companies, at different scales and stages of their development, and in very different parts of the technology industry.</p><p>Over that time, with the assistance of many of my colleagues and mentors, I’ve developed a set of principles that guide my approach to management and building engineering team cultures. When I joined Yelp, I found a company with <a href="https://www.yelp.careers/us/en/about-us">values that aligned very well</a> with these principles, at all levels of the company, so I’ve been lucky to be able to try and apply them thoroughly in practice here.</p><h2 id="teamwork-matters-more-than-individual-brilliance">Teamwork matters more than individual brilliance</h2><p>People are indeed individually brilliant, and everyone has unique life experience and talents to contribute to their work. However, it takes teams to build something at the scale of Yelp. Building a culture that values empathy and teamwork pays dividends. 
A corollary of this is that a strong <a href="https://www.gsb.stanford.edu/faculty-research/books/no-asshole-rule-building-civilized-workplace-surviving-one-isnt">“no assholes” rule</a> is vital.</p><h2 id="diversity-leads-to-success-but-only-if-theres-equity-inclusion-and-belonging">Diversity leads to success, but only if there’s equity, inclusion, and belonging</h2><p>There’s plenty of evidence that <a href="https://hbr.org/2016/11/why-diverse-teams-are-smarter">diversity makes teams more effective</a>, but that doesn’t mean that just hiring a diverse team automatically leads to success. To really succeed, you have to build a company culture where you genuinely deliver an inclusive and equitable experience for everyone. Building and cultivating a culture where everyone can thrive and feel like they belong requires you to constantly examine what you’re doing as a company and what the real impact of it is on your teams. That includes listening to people’s lived experiences and constantly trying to improve.</p><h2 id="distributed-teams-help-diversity">Distributed teams help diversity</h2><p>It’s a lot easier to build truly diverse teams if you’re not limited to having to hire people near the places you have offices. We’d already been hiring in multiple countries for some years at Yelp. Re-examining remote work and distributed teams during the pandemic has highlighted both the scale of the opportunity to really build teams that “meet people where they live,” but also the challenges in building successful distributed teams, abandoning the idea of a “head office” and creating a culture where everyone has an equal opportunity to succeed.</p><h2 id="no-process-is-another-name-for-bias">“No process” is another name for “bias”</h2><p>It’s really easy to have no process in small organizations - which is where you start by default, and the flexibility of not having a process offers lots of advantages at first. 
The thing is, you never really have no process; you just have a process that you’ve never written down and examined critically. And processes that you’ve never examined critically generally hide a world of unexamined unfairness, even with the best of intentions. You need to articulate and examine what these implicit processes are and make them more explicit, to eliminate that unfairness and the biased outcomes it produces. That means looking in depth at how you hire, how career advancement works, how you compensate people, how you think about technical leadership, and many other more innocuous-seeming things where you encode unintentional biases into the system and culture of your company, influencing the likelihood that different people thrive or fail.</p><h2 id="you-have-to-walk-the-walk">You have to walk the walk</h2><p>It’s no use just <em>saying</em> you want to be better at building diverse, inclusive, happy teams. You need to actually change things, measure the results of your changes, look at that data, and then try and improve things again. This continuous iteration driven by data is vital: you must be really transparent and accountable about what you’re doing, and its successes and failures. This directly relates to the previous principle about process, but fundamentally underpins every effort to improve here. And yes, it’s hard. And you will fall on your face sometimes, publicly. And it will hurt. And you need to get up again and keep trying, because that’s the only way things will improve.</p><p>Rather than just hearing from me on how we’ve approached trying to live up to some of these principles at Yelp, we have a series of blog posts over the coming months to further explain. These blog posts will go into detail on the how as well as the why, and share some of what we’ve tried, what worked and what didn’t, in an attempt to give back to the many people whose ideas and learning we’ve built on over the years. 
Over the next few months we’ll cover:</p><h2 id="hiring-a-diverse-team-reducing-bias-in-engineering-interviews">Hiring a diverse team: reducing bias in engineering interviews</h2><p>How Yelp has approached hiring over the years, and the major lessons we learned. Once we started to standardise our approach to interviewing, we were able to analyse the data to find out if we were actually living up to our good intentions.</p><h2 id="how-we-onboard-engineers-across-the-world-at-yelp">How we onboard engineers across the world at Yelp</h2><p>Once you’ve hired someone amazing, you need to set them up for success on day one. The initial onboarding is vital, but is only part of the process. We’ve found that it’s critical to have a strong mentorship program for new hires, and that means choosing the right people to mentor and train them well. Mentorship doesn’t just stop at onboarding either, so we run an ongoing training and career development program to make sure people from diverse backgrounds can all succeed at Yelp.</p><h2 id="career-paths-for-engineers-at-yelp">Career paths for engineers at Yelp</h2><p>Yelp previously had a completely flat “no levels” individual contributor career framework for Engineering. 
We’ll cover how we designed and redesigned our framework for career growth and levelling to move away from that, and discuss how that shift increased fairness and equity.</p><h2 id="technical-leadership-at-yelp">Technical leadership at Yelp</h2><p>Why we approach technical leadership as a role you can choose to take on at Yelp, rather than just a level within our career levelling framework, and how we’ve tried to build a collaborative, cross-pollinating community of technical leaders who work together regularly to solve “big picture” problems, rather than just being experts in their own fields.</p><h2 id="how-yelp-approaches-engineering-management">How Yelp approaches engineering management</h2><p>What “success” looks like for managers at Yelp, what we ask managers to do and to value, how we’ve built this into the career path for managers, and how we hire and onboard them.</p><h2 id="ensuring-pay-equity--career-progression-in-yelp-engineering">Ensuring pay equity &amp; career progression in Yelp Engineering</h2><p>“Walking the walk” meant actually examining in detail how we compensated people and how they progressed in their career, and whether that was actually fair and equitable across all demographics at Yelp. 
And then publishing the outcomes to the whole Engineering team and committing to do so annually, whatever the results were.</p><h2 id="fostering-inclusion--belonging-within-yelp-engineering">Fostering inclusion &amp; belonging within Yelp Engineering</h2><p>Improving inclusion and belonging requires you to provide for teams and groups in many different ways, like supporting Employee Resource Groups to encourage communities to socialise, collaborate, and empower themselves, providing flexible working practices to suit people with different needs, abilities, and lifestyles, as well as designing systems and processes that give people the support they need in the time and place and manner they need it.</p><p>I hope you’ll find this series informative and helpful. I welcome the opportunity to share our triumphs and setbacks with you, and look forward to the feedback on what we’re doing well, and what we still need to learn to do better.</p><p>And last but not least, if this sounds like the kind of company culture that you’d like to be a part of, and you’d like to help make it better… <a href="http://www.yelp.com/careers">we’re hiring</a>!</p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/04/engineering-career-series-building-a-happy-diverse-and-inclusive-engineering-team.html</link>
      <guid>https://engineeringblog.yelp.com/2021/04/engineering-career-series-building-a-happy-diverse-and-inclusive-engineering-team.html</guid>
      <pubDate>Thu, 08 Apr 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Powering Messaging Enabledness with Yelp's Data Infrastructure]]></title>
      <description><![CDATA[<p>In addition to helping people find great places to eat, Yelp connects people with great local professionals to help them accomplish tasks like making their next big move, fixing that leaky faucet, or repairing a broken phone screen. Instead of spending time calling several businesses, users can utilize Yelp’s <a href="https://blog.yelp.com/2020/08/yelp-reinvents-the-hiring-experience-for-home-and-local-services">Request a Quote</a> feature to reach out to several businesses at once, receive cost estimates from those businesses, and ultimately hire the right local professional for the job. This post focuses on how <a href="https://engineeringblog.yelp.com/2016/07/billions-of-messages-a-day-yelps-real-time-data-pipeline.html">Yelp’s Data Pipeline</a> is used to efficiently compute which businesses are eligible for the feature, and also introduces Yelp’s Data Lake, which we use to track historical values of the feature for offline metrics and analytics.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/2021-04-05-powering-messaging-enabledness-with-yelps-data-infrastructure/raq-image.png" alt="Request a Quote Widget" class="c1" /></div><p>While most businesses can be reached via the phone number listed on their Yelp business page, only a subset of businesses are eligible for Yelp’s messaging feature (at least for now!). We refer to the ability for a business to receive messages from users as “messaging enabledness” (or sometimes, just enabledness). It is determined by checking several different conditions about the business. For instance, the business owner must opt-in to the feature and they must have a valid email address so they can be notified about new messages, among other things.</p><p>Computing messaging enabledness is tricky since checking all the criteria requires joining and fetching values from several different SQL tables. 
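Conceptually, the final check reduces to requiring every criterion to hold at once; a toy sketch (the helper and field names here are illustrative, not Yelp’s actual data model):</p>

```python
# Toy sketch: a business is messaging enabled only if every criterion
# holds. The criteria and field names are illustrative, not Yelp's
# actual data model.
def compute_enabledness(business):
    return all([
        business.get("owner_opted_in", False),  # owner turned the feature on
        bool(business.get("owner_email")),      # valid email for notifications
        not business.get("is_closed", False),   # business is still operating
    ])

biz = {"owner_opted_in": True, "owner_email": "owner@example.com"}
```

<p>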
For some features, like deciding whether or not to display the “Request a Quote” button on a business’s page, it’s essential to correctly identify a business’s enabledness, even if it takes extra time to perform all those joins. For other applications of the data, such as analysis or indexing, we can tolerate the risk of a stale value in order to speed things up, so a cached mapping of an identifier for the business (business_id) to its messaging enabledness is stored in its own SQL table. This is kept up to date by a batch which runs periodically to recompute the value for all businesses.</p><p>In addition to storing the current state of enabledness, Yelp is also interested in persisting a historical record of messaging enabledness for businesses. This allows the company to measure the health of the Request a Quote feature in addition to being an invaluable source of information when investigating any pesky bugs that might pop up.</p><p>There are millions of businesses listed on Yelp, so storing this history in a SQL table is not efficient in terms of storage cost or query time. Another option was to store a nightly snapshot of the table, but that would have resulted in duplicated information day to day, would have been more difficult to query, and wouldn’t have captured multiple changes to the same business in a single day. What we really want to store is a change log of the table.</p><p>Remember that this data is stored in a SQL table. If you’ve been following along with the Data Pipeline posts on the Yelp engineering blog, you’ll know that Yelp has developed a tool called the <a href="https://engineeringblog.yelp.com/2016/08/streaming-mysql-tables-in-real-time-to-kafka.html">Replication Handler</a> which publishes a message to our Kafka infrastructure for every update to a SQL database. By connecting this tool to the table caching businesses’ messaging enabledness, a full history of changes can be written to a Kafka stream. 
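Conceptually, that change log is just an append-only record of mutations; a toy illustration (the field names are ours, not the actual replication message schema):</p>

```python
import time

# Toy illustration of a change log: one append-only record per mutation,
# instead of a full nightly copy of the table. Field names are ours, not
# the actual replication message schema.
def log_change(log, business_id, old_value, new_value):
    log.append({
        "business_id": business_id,
        "old": old_value,
        "new": new_value,
        "ts": time.time(),
    })

history = []
log_change(history, 1234, False, True)
log_change(history, 1234, True, False)  # both same-day flips are preserved
```

<p>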
Now if only we had a way to store this stream…</p><h2 id="yelps-data-lake">Yelp’s Data Lake</h2><p>Yelp’s Data Lake is our solution for storing schematized data at scale. Our Data Lake is built on top of the Apache Parquet format, Amazon S3, and AWS Glue. This S3-based architecture allows us to cheaply store data, making it possible for us to keep records over a long period of time. Our Data Lake implementation also integrates with our in-house schema management system, <a href="https://engineeringblog.yelp.com/2016/08/more-than-just-a-schema-store.html">Schematizer</a>.</p><p>Data from Kafka can easily flow into our Data Lake through our Data Lake Sink Connector. The connector provides a fully-managed way of moving data to the Data Lake, without engineers having to worry about any underlying infrastructure. All engineers need to do is specify which data they want in the Data Lake, either through our datapipe CLI tool or through our Pipeline Studio web UI.</p><div class="highlighter-rouge highlight"><pre>$ datapipe datalake add-connection --namespace main --source message_enabledness
Data connection created successfully
Connection #9876
  Namespace:         main
  Source:            message_enabledness
  Destination:       datalake
</pre></div><p>Once in the Data Lake, data can power a wide variety of analytic systems. Data can be read with Amazon Athena or from Spark jobs. Using Redshift Spectrum, we also allow analytics from Redshift, where Data Lake data can be joined with data we put in Redshift using our <a href="https://engineeringblog.yelp.com/2016/10/redshift-connector.html">Redshift Sink Connector</a>. Redshift Spectrum can also be used to power Tableau dashboards based on Data Lake data.</p><p>We previously mentioned that the messaging enabledness table was updated periodically. Even though changes to the table are being persisted to the Data Lake, with this approach we’re not able to identify the time the change happened and have no way to tell why the value changed (i.e. which of the criteria triggered this change).</p><p>In order to catch these changes in real time, each time an update happens that might affect a business’s enabledness, an asynchronous task can be submitted to recompute the value and store it in the table along with the reason for the change. The code looks something like this:</p><div class="highlighter-rouge highlight"><pre>def update_value(business_id, new_value, reason):
    update_value_for_business(business_id, new_value)
    update_enabledness_async(business_id, reason)

def update_enabledness_async(business_id, reason):
    current_enabledness = get_enabledness_from_cache(business_id)
    updated_enabledness = compute_enabledness(business_id)
    if current_enabledness != updated_enabledness:
        set_enabledness(business_id, updated_enabledness, reason)
</pre></div><p>While this works, you might be able to spot a shortcoming: anytime an engineer adds a new way to update a value that might change a business’s enabledness, they also need to remember to call the <code class="highlighter-rouge">update_enabledness_async</code> method. Even though this approach might capture all the changes when it is first written, as the code evolves over time a single mistake can cause the data stored in the table to be inaccurate.</p><div class="highlighter-rouge highlight"><pre>def update_value_v2(business_id, new_value):
    update_value_for_business(business_id, new_value)
    # Something is missing here...
</pre></div><h2 id="reacting-to-the-changes">Reacting to the Changes</h2><p>Looking more closely at the system above, there is something peculiar about the update call. Recomputing enabledness isn’t really an operation that should happen after one of the criteria was updated. Instead it happens as a result of the value being changed. Rather than depending on engineers and code reviewers to remember that the <code class="highlighter-rouge">update_enabledness_async</code> function must be triggered manually, what if we could build a system that triggered this update as a result of the change?</p><p>Each of the criteria for enabledness is stored in a SQL table, and as we discussed earlier in the post, the Replication Handler can be used to publish changes to those tables to Yelp’s Data Pipeline! Consumers of those topics (specifically, <a href="https://engineeringblog.yelp.com/2016/08/paastorm-a-streaming-processor.html">Paastorm spolts</a>) can be set up to call the <code class="highlighter-rouge">update_enabledness_async</code> task on any relevant change!</p><p><img src="https://engineeringblog.yelp.com/images/posts/2021-04-05-powering-messaging-enabledness-with-yelps-data-infrastructure/system-diagram.png" alt="System Diagram" /></p><p>This post introduces the Yelp Data Lake and demonstrates how the Data Lake Sink Connector makes it easy to track the historical value of a business’s messaging enabledness. It also shows how the Streaming Infrastructure you’ve read about in previous blog posts (or at least the ones you’re about to read right after you finish this one!) is used to solve real engineering problems at Yelp, allowing systems to react to data changes without the need to write custom components or complex logic.</p><ul><li>Thanks to Mohammad Mohtasham, Vipul Singh, Francesco Di Chiara, and Stuart Elston who assisted at various stages of design and implementation of this project.</li>
<li>Thanks to Blake Larkin, William O’Connor, and Ryan Irwin for technical review and editing.</li>
</ul><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>Are you interested in using streaming infrastructure to help solve tough engineering problems?</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/b5d226cd-6ea1-4d12-b875-725b331202b7?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/04/powering-messaging-enabledness-with-yelps-data-infrastructure.html</link>
      <guid>https://engineeringblog.yelp.com/2021/04/powering-messaging-enabledness-with-yelps-data-infrastructure.html</guid>
      <pubDate>Mon, 05 Apr 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Passwordless Login: Reengaging Business Owners with Less Friction]]></title>
      <description><![CDATA[<p>As various teams at Yelp were focused on developing features to help businesses adapt to COVID-19, some teams were looking ahead and developing features that would help businesses in the later stages of the pandemic and after it.<br /></p><p>Early on in the pandemic, we saw some businesses pause advertising on Yelp as government regulations required many businesses to temporarily close or limit their operations. However, businesses quickly adjusted to the local regulations, while implementing health and safety precautions to keep their staff and customers safe. Through this adjustment, we wanted to ensure it was easy for them to restart advertising right where they left off.</p><p>Our data revealed that as of April 2017, most business owners used password-based logins to sign into their business owner account. However, if they forgot their password, it could be a frustrating experience for them to continue into the app.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/2021-02-23-passwordless-login/passwordless-login-image-1.png" class="c1" alt="image" /></div><p>A typical Reset Password flow looked like:</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/2021-02-23-passwordless-login/passwordless-login-image-2.png" class="c3" alt="image" /></div><p>After receiving their Reset Password link, they were presented with:</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/2021-02-23-passwordless-login/passwordless-login-image-3.png" class="c4" alt="image" /></div><p>Depending on what the users entered into the two input boxes, they could receive the following error messages:</p><ul><li>Please enter a password</li>
<li>Oops, the passwords you entered don’t match!</li>
<li>Please choose a password of at least 6 characters</li>
<li>This password is insecure, please try a different one.</li>
<li>This password has been used in the past year. Please enter a different password.</li>
<li>… and more…</li>
</ul><p>Our data showed that we sent one of these errors about 7,500 times a day.<br /></p><p>To resolve this, our solution was to remove the need to authenticate a business owner with a password by creating a passwordless login. Yelp sends a unique link (called a Magic Link): a short-lived link (valid from one hour up to three days) that provides automatic login functionality. A Magic Link will automatically open the Yelp for Business app, verify a business owner’s credentials, log them in, and then optionally redirect them to anywhere in the app of our choosing. Magic Links are one-time use and time sensitive, so the links will eventually expire if they aren’t used.</p><p>To unlock the full capabilities of this feature, we also appended each Magic Link with a redirect link that takes the user to a specific page after a successful automatic login. The redirect link can be any deeplink that we already support. Particularly for this initiative, we redirected our users to our One Click Restart screen, which allowed business owners to restart their ads with Yelp.</p><p>We have implemented this logic on Android, iOS, and the web. Even if business owners do not have the Yelp for Business app installed on their device, they can still take advantage of this feature.<br /></p><p>With this solution, we are able to provide a seamless user journey for business owners to the end goal, securely and with only one click. Technically, it was a creative and innovative solution.</p><p>Figure 1 shows the original status quo before we implemented Magic Links, and Figure 2 shows the sequence of steps after implementing Magic Links.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/2021-02-23-passwordless-login/passwordless-login-image-4.png" class="c1" alt="image" /></div><p>At Yelp, UrlCatcherActivities are Activities that are responsible for handling deeplinks. We have deeplinks that are prefixed with https://biz.yelp.com or yelp-biz://. 
Given two deeplinks that are exactly the same with the exception of the host, the app could reroute them differently. Since Magic Links could be sent via device notifications or emails, we needed to support both URI hosts.</p><p>The MagicLinkUrlCatcherActivity was responsible for intercepting all Magic Links via the Android Manifest and acting on them. It validated the Magic Link and provided feedback for both successful and unsuccessful validations.</p><p>Our Magic Link schemas looked like this: https://biz.yelp.com/login/passwordless/?return_url=https://biz.yelp.com/ads/i2kK8NtpmtuKf84NYm0d3A/</p><p>An invalid Magic Link could consist of an expired, malformed, or missing MAGICLINKTOKEN.</p><p>On successful validation, we logged the user into the app. The return_url is an optional parameter. If it was present and also a valid deeplink that we supported, we forwarded the redirect url embedded in the Magic Link to the downstream UrlCatcherActivities. From that point on, the app behaved as status quo. If the return_url was not specified, we redirected to the home screen.</p><p>On unsuccessful validation, the activity was responsible for redirecting the user to the Log In screen so that users could enter their credentials manually. If successful, we redirected the user to the embedded link within the Magic Link.</p><p>When the project first began, we had (naively) thought that this project would be simple (refer to Figure 1). Our initial strategy was to write the Magic Link logic in both of the UrlCatcherActivities, but like most projects, the more we worked on it, the more we realized that there were a lot of edge cases we had to handle. Accounting for each edge case for each UrlCatcherActivity would duplicate code and double our blast radius. On top of that, each requirement change, no matter the size, would have to be duplicated. 
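<p>As a rough illustration, the validation flow described above can be sketched in Python. This is a hypothetical, simplified server-side model (the token store and the <code class="highlighter-rouge">issue_magic_link</code> / <code class="highlighter-rouge">validate_magic_link</code> names are ours, not Yelp’s actual implementation); it captures the expired/malformed/missing-token checks, one-time use, and forwarding of the <code class="highlighter-rouge">return_url</code> with a fallback to the home screen:</p>

```python
import secrets
import time
from urllib.parse import urlparse

# Hypothetical in-memory token store; a real system would persist
# tokens server-side with their expiry and used flags.
ISSUED_TOKENS = {}

# Only forward redirects to hosts we control.
ALLOWED_HOSTS = {"biz.yelp.com"}

def issue_magic_link(return_url=None, ttl_seconds=3600):
    """Create a one-time, time-limited login token and return a link."""
    token = secrets.token_urlsafe(32)
    ISSUED_TOKENS[token] = {
        "expires_at": time.time() + ttl_seconds,
        "used": False,
        "return_url": return_url,
    }
    return f"https://biz.yelp.com/login/passwordless/{token}"

def validate_magic_link(token):
    """Return the post-login redirect URL, or None if the token is
    missing, malformed, expired, or already used."""
    record = ISSUED_TOKENS.get(token)
    if record is None or record["used"] or time.time() > record["expires_at"]:
        return None
    record["used"] = True  # one-time use
    url = record["return_url"]
    if url and urlparse(url).hostname in ALLOWED_HOSTS:
        return url  # forward the embedded deeplink downstream
    return "https://biz.yelp.com/"  # no (or untrusted) return_url: home screen
```

<p>Restricting forwarded redirects to hosts we control keeps the login link from doubling as an open redirect.</p>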
We quickly realized that we should refactor all the code into one place sooner rather than later.</p><p>The Magic Link high-level logic (illustrated by the diagram below) was refactored into the MagicLinkUrlCatcherActivity.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/2021-02-23-passwordless-login/passwordless-login-image-5.png" class="c4" alt="image" /></div><p>Each UrlCatcherActivity already contained business logic that required some time to understand. By moving all Magic Link-related logic into the MagicLinkUrlCatcherActivity, we only passed the optional redirect url into the downstream logic.</p><p>We didn’t need to test this end-to-end. We only had to concentrate on four main areas:</p><h3 id="input-validation">Input validation</h3><p>We validated that all deeplinks into the app were successfully triaged by the AndroidManifest to go to either one of the downstream UrlCatcherActivities or the MagicLinkUrlCatcherActivity.</p><h3 id="magic-link-validation">Magic Link validation</h3><p>We tested that the MagicLinkUrlCatcherActivity was able to handle successful and unsuccessful validation of the Magic Link.</p><h3 id="redirect-links-are-passed-to-the-correct-catcher">Redirect links are passed to the correct Catcher</h3><p>We wrote tests to ensure that the embedded redirect links within the Magic Link would get passed to the correct downstream UrlCatcherActivities, and we also verified the behavior when no redirect link was passed.</p><h3 id="analytics">Analytics</h3><p>We verified that the correct analytics were fired at specific points in the code so that Yelp could track usage and other metrics of interest.</p><p>Using Magic Links together with deeplinks reduced friction for our business owners to log into their accounts and resume advertising. By making passwords obsolete on login, we reduced user churn resulting from abandoned password resets. 
We hope this feature will help our business owners get the word out about their business and better communicate and engage with their customers.</p><p>Shoutout goes to Karlo Pagtakhan and Khushboo Puneet for working on this with me! Also, thank you to Blake Larkin, Eric Hernandez, Rajan Roy, Joshua Walstrom, Patrick Fitzgerald, and Mark Brady for technical review and editing.</p><div class="island job-posting"><h3>Become an Android Engineer at Yelp!</h3><p>We're working on cool interesting problems everyday! Come join our Android team!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/d13b9fe9-c523-4407-9432-7783d2848fca/Software-Engineer-Android-Remote?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/03/passwordless-login.html</link>
      <guid>https://engineeringblog.yelp.com/2021/03/passwordless-login.html</guid>
      <pubDate>Mon, 01 Mar 2021 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Boosting user conversion with UX performance wins]]></title>
<description><![CDATA[<p>Everyone loves graphs going up and to the right, unless they reflect your page load timings. This blog post is about curtailing rising page load times. <a href="https://biz.yelp.com">Yelp for Business Owners</a> allows business owners to manage their listing, respond to reviews, edit their business information, upload business photos, and more. Business owners can also purchase Yelp Ads and profile products to target local audiences and enhance their business’s presence on Yelp. In this blog post, you’ll learn about the ways we improved the UX performance of our ads purchase flow by dramatically reducing the load times. You’ll be able to apply the same tactics to your own flow and hopefully achieve results similar to ours:</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/boosting-user-conversion-with-ux-performance-wins/fcp-reduction.png" class="c1" alt="image" /></div><p>Our core ads purchase flow is a single-page React application powered by Python-based backend services and GraphQL. Over the past couple of years, it has grown from a four-step process to a <a href="https://blog.yelp.com/2020/09/getting-started-with-yelp-ads">seven-step process</a> with new features to provide better ad campaign controls. However, as we added more features, performance suffered. Our page-load P75 timings increased from 3 seconds to 6 seconds for desktop users. This slowdown was even more pronounced for our mobile users due to tighter constraints on network speed and reliability.</p><p>It's a <a href="https://web.dev/why-speed-matters/">known fact</a> that faster-loading pages directly benefit user conversion. We wanted to measure how faster performance would affect the bottom line, so we ran a lightweight experiment to measure the relationship between performance and user conversion. 
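<p>For concreteness, “P75” is the 75th percentile of the page load samples: the time under which 75% of page loads complete. A minimal nearest-rank sketch (our own helper, not Yelp’s actual metrics pipeline):</p>

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample >= p% of the data."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * p // 100))  # ceil(n * p / 100), 1-indexed
    return ordered[rank - 1]

# Page load times in seconds for one day of (made-up) traffic:
load_times = [2.7, 2.9, 3.1, 3.4, 3.8, 4.2, 5.0, 6.1]
p75 = percentile(load_times, 75)  # -> 4.2
```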
We made some backend optimizations to reduce page load timings by one second, and immediately observed a 12% relative increase in conversion rate. This early win gave us confidence in our future investments along with full buy-in and support from our product team.</p><p>The first step in our performance effort was to set up a framework that would standardize the metrics and logging across all our flows. We decided to target two specific metrics:</p><h3 id="first-contentful-paint-fcp">First Contentful Paint (FCP)</h3><p>FCP measures the time from sending the page load request until the browser renders any image or text. It is widely accepted as a <a href="https://web.dev/first-contentful-paint/">key metric</a> in the industry to measure your web page’s performance. Targeting FCP was critical because it is the first hint to the user that their page is starting to load. During our experimentation, we found that a user was much more likely to leave our site during a page load than after they saw any content, even if they only saw a loading spinner. Since a page load event depends on multiple systems (such as web browser, routing layers, authentication proxies, etc.), we further broke down our FCP into the following units to help categorize our efforts:</p><ol><li>
<p>Redirect time: How long the browser spent following redirects (HTTP 3xx responses).</p>
</li>
<li>
<p>Request time: How long the request-response cycle took for the main request inside Yelp servers.</p>
</li>
<li>
<p>Rendering time: How long it took for the browser to render the first contentful paint after receiving the initial response.</p>
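<p>These three units can be derived from the browser’s Navigation Timing and Paint Timing marks. A rough sketch of the decomposition (the exact attribution here is our assumption, not the precise logging Yelp uses):</p>

```python
def fcp_breakdown(marks):
    """Split an FCP measurement into the three units above.

    `marks` maps timing mark names to milliseconds since navigation
    start: redirectStart, redirectEnd, requestStart, responseStart,
    and firstContentfulPaint.
    """
    return {
        # time spent following redirects before reaching the final URL
        "redirect": marks["redirectEnd"] - marks["redirectStart"],
        # main request leaving the browser until the first response byte
        # (network plus server time)
        "request": marks["responseStart"] - marks["requestStart"],
        # downloading, parsing, and rendering until the first paint
        "rendering": marks["firstContentfulPaint"] - marks["responseStart"],
    }
```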
</li>
</ol><p>The image below shows the breakdown of our timings in the units discussed above.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/boosting-user-conversion-with-ux-performance-wins/fcp-breakdown.png" class="c1" alt="image" /></div><h3 id="time-to-interactive-tti">Time to Interactive (TTI)</h3><p>This metric measures the time spent by the browser to fully load a page. It captures any client-side rendering logic and async data fetching required to render the complete user experience. At Yelp, we call this metric <em>Yelp Page Complete</em> (YPC). It is critical to capture TTI since many of our applications render a shimmer or a page shell after the initial page load, and then the respective components fetch their data. TTI helps capture the entire user experience timings.</p><p>We have several other similar flows on the biz site with their own data fetching strategy. To make the integration convenient across all of them, we created a shared JavaScript package to consolidate all of the logic related to logging, polyfills, batching/throttling of logging-related AJAX calls, etc. In the end, the integration only required adding a couple of lines to start logging all the performance metrics.</p><p>We relied on several tools that were critical to our effort and are worth mentioning here:</p><h3 id="zipkin">Zipkin</h3><p><a href="https://zipkin.io/">OpenZipkin</a> is an open-source distributed tracing system set up at Yelp. It helped identify bottlenecks during the request lifecycle inside Yelp servers. Our request travels through multiple services, and this tool was indispensable in identifying potential optimizations on the backend. 
Here is a sample Zipkin trace:</p><div class="image-caption"><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/boosting-user-conversion-with-ux-performance-wins/zipkin-trace.png" class="c1" alt="image" /></div><p class="subtle-text"><small>Source: https://zipkin.io/</small></p></div><h3 id="webpack-bundle-analyzer">Webpack Bundle Analyzer</h3><p><a href="https://github.com/webpack-contrib/webpack-bundle-analyzer">Webpack Bundle Analyzer</a> helped us visualize our JavaScript bundles’ content with an interactive zoomable treemap. This tool was crucial to identify optimizations in our frontend assets that we discuss later. Below is a sample treemap interaction from the plugin’s Github repository:</p><div class="image-caption"><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/boosting-user-conversion-with-ux-performance-wins/webpack-treemap.gif" class="c1" alt="image" /></div><p class="subtle-text"><small>Source: https://github.com/webpack-contrib/webpack-bundle-analyzer</small></p></div><h3 id="splunk">Splunk</h3><p>We ingested all of our performance metrics in Redshift database tables and visualized them as <a href="https://www.splunk.com/en_us">Splunk</a> dashboards. These helped us track our progress in real-time while deploying changes. Below is an example dashboard:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/boosting-user-conversion-with-ux-performance-wins/splunk-dash.png" alt="" /></div><h3 id="chrome-devtools">Chrome DevTools</h3><p>Chrome’s tooling provided terrific insights into our frontend performance issues. We specifically relied on the <a href="https://developers.google.com/web/tools/chrome-devtools/evaluate-performance/timeline-tool">flame charts</a> under the Performance tab to identify where the browser’s main thread was blocking and how much time was being spent in our assets loading, parsing, and evaluation. 
Google <a href="https://developers.google.com/web/tools/lighthouse">Lighthouse</a> also provided actionable opportunities and diagnostic information.</p><p>After learning from the gathered metrics, we planned on tackling the performance issue on all fronts: backend, frontend, and infrastructure. Below are a few things that are worth sharing:</p><h2 id="frontend-optimizations">Frontend Optimizations</h2><p>Our JavaScript bundles had been growing slowly due to continuous feature additions over the past couple of years. Yelp’s in-house tooling already enforced general best practices such as gzip compression, bundle minification, dead code elimination, etc. So most of the issues were part of our application setup. After analyzing our bundle using the tooling above, we employed the various techniques listed below to reduce our gzipped bundle size from 576 KB to 312 KB, an almost 50% reduction!</p><ol><li>
<p><strong>Code Splitting</strong>: Serving code for all the seven pages of our purchase flow during the initial page load was undoubtedly wasteful. We opted to use <a href="https://loadable-components.com/">loadable components</a> to create separate chunks for different steps that would load on demand. This chunking helped reduce our bundle size by 15%. However, loading these assets on demand added a small delay on every page load, so we wrote a helper function to preload all the chunks using the useful <a href="https://developer.mozilla.org/en-US/docs/Web/API/Window/requestIdleCallback">requestIdleCallback</a> function to avoid any UX behavior changes.</p>
</li>
<li>
<p><strong>Tree Shaking</strong>: Yelp’s recent default Webpack settings enable dead code elimination. Looking into our bundle treemaps, we realized that tree shaking wasn’t working for some of our older packages because they were still using older build settings. So, a hunt began to figure out all such packages, and we ended up further reducing our bundle size by 30% by just upgrading their build.</p>
</li>
<li>
<p><strong>Replacing Packages with a Heavy Footprint</strong>: We identified a few packages being used infrequently in our code that occupied an unreasonable portion of our bundle. The primary example was <a href="https://momentjs.com/">moment.js</a>, which was used only twice but occupied 5% of the bundle. We were able to replace it with <a href="https://date-fns.org/">date-fns</a>, which is tree-shakeable. Fun fact: the moment.js project status itself now recommends using alternatives.</p>
</li>
<li>
<p><strong>Deduplicating Packages</strong>: We use Yarn for our dependency management, and (before Yarn V2) it didn’t deduplicate the packages with overlapping ranges. For our large apps, deduplication had a noticeable impact on our bundle sizes. Yarn V2 helped solve this problem for us.</p>
</li>
<li>
<p><strong>Reducing Component Re-rendering:</strong> <a href="https://reactjs.org/blog/2018/09/10/introducing-the-react-profiler.html">React profiler</a> identified that specific core page components such as the navigation bar were re-rendering wastefully during the page load. This re-rendering blocked the main thread and delayed FCP. We resolved this by adding <a href="https://reactjs.org/docs/react-api.html#reactmemo">memoization</a> on top of these components.</p>
</li>
</ol><h2 id="server-side-optimizations">Server-side Optimizations</h2><p>Yelp’s growing service architecture presented some interesting roadblocks. As the request traveled through multiple services (including a monolith), its lifecycle was complicated. For example, the page-load request went through 3 services and depended upon up to 5 downstream services for fetching data. The efforts listed below helped us bring down our request timings:</p><ol><li>
<p><strong>Removing Proxy Layers</strong>: All biz site requests were proxied through Yelp’s monolith because it handled authentication and authorization for logged-in business owners. This proxy was expensive. Earlier this year, we packaged up the authentication and authorization business logic into a reusable Python package. This optimization entailed integrating with that package, setting our service up to accept traffic directly from our routing layer, and rolling it out carefully via <a href="https://martinfowler.com/bliki/DarkLaunching.html">dark-launching</a>. It helped us save 250ms from our request time while also getting rid of legacy code.</p>
</li>
<li>
<p><strong>Parallelizing Network Calls:</strong> We rely on several downstream services for fetching data during page load. Zipkin helped us uncover that we had laid out some of our network calls in a blocking manner that slowed down the entire request. At Yelp, we use Futures built with <a href="https://github.com/Yelp/bravado">Bravado</a>, which allows us to send network requests concurrently. We rewrote the request code to fire off all the network requests at the top of the business logic and avoided starting any new network request later in the code. It helped us shave 300ms off our request timings. Since this issue can regress, we documented best practices for this behavior to help prevent regressions in the future.</p>
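<p>The pattern generalizes beyond Bravado: fire every request up front, then resolve the futures only when the results are needed. A sketch using the standard library’s <code class="highlighter-rouge">concurrent.futures</code> as a stand-in for Bravado’s futures (the <code class="highlighter-rouge">fetch</code> service call is hypothetical):</p>

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(service):
    """Stand-in for a downstream network call (hypothetical service)."""
    time.sleep(0.05)  # simulated network latency
    return f"{service}-data"

def handle_request_blocking(services):
    # Anti-pattern: each call blocks on the previous one finishing.
    return [fetch(s) for s in services]

def handle_request_parallel(services):
    # Fire off all requests at the top of the business logic...
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(fetch, s) for s in services]
        # ...and only block when the results are actually needed.
        return [f.result() for f in futures]
```

<p>With three downstream calls, the blocking version pays three round trips in sequence, while the parallel version pays roughly one.</p>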
</li>
<li>
<p><strong>Eliminating Redirects</strong>: Legacy pages, old flows, third-party blog posts, etc., contributed to redirects before the user landed on the final URL/page. These redirects added a few seconds in some cases for our mobile traffic. We documented all of the redirects using the <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Referer">HTTP Referer</a> header and tackled them accordingly.</p>
<div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/boosting-user-conversion-with-ux-performance-wins/redirects-table.png" class="c1" alt="image" /></div>
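<p>One simple way to prioritize which redirects to eliminate is to tally the <code class="highlighter-rouge">Referer</code> values that precede 3xx responses in access logs. A sketch over a hypothetical log shape (pairs of referer and status code; not Yelp’s actual log format):</p>

```python
from collections import Counter

def top_redirect_sources(log_records):
    """Rank referring URLs by how often they precede a redirect.

    `log_records` is an iterable of (referer, status_code) pairs
    pulled from access logs (hypothetical shape).
    """
    return Counter(
        referer
        for referer, status in log_records
        if 300 <= status < 400 and referer
    ).most_common()

records = [
    ("https://biz.yelp.com/old-flow", 302),
    ("https://biz.yelp.com/old-flow", 302),
    ("https://example-blog.com/post", 301),
    ("https://biz.yelp.com/ads", 200),  # not a redirect
]
```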
</li>
<li>
<p><strong>Server-side Rendering</strong>: Before this effort, our flow was rendered entirely client-side, i.e., we didn’t send any HTML in the request’s response. We only sent the JavaScript bundle and relied entirely on the browser and React app to generate HTML for serving the page’s content. We identified that this adversely affected our FCP, especially on mobile clients with limited CPU and memory. We already had a (React) component rendering service based on <a href="https://github.com/airbnb/hypernova">Hypernova</a> set up at Yelp. We integrated with that service and started rendering the first page’s markup from the server. We immediately saw significant benefits for all the clients. We transferred the rendering load to the server, as evident in the graphs below, but the rendering time took a steep drop and the net impact was lower FCP time. Also, long gone was our loading shimmer!</p>
<div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/boosting-user-conversion-with-ux-performance-wins/server-side-rendering.png" class="c1" alt="image" /></div>
</li>
<li>
<p><strong>Pre-warming Cache:</strong> We have a few computationally expensive tasks in our requests, such as building a category tree object by reading configurations from disk. We cached these objects in memory, but we identified that our higher latency P90 requests still suffered because they would always get a cache miss. We created an internal endpoint whose sole responsibility was to warm up all the caches and create expensive cacheable objects. We used a <a href="https://uwsgi-docs.readthedocs.io/en/latest/PythonDecorators.html#uwsgidecorators.postfork">uWSGI hook</a> that would be called every time a worker was created to make a call to this internal endpoint. It helped bring our P95s down by almost 2 seconds across all clients.</p>
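<p>The caching half of this idea can be sketched in plain Python (names are ours; the real setup caches far more than a toy category tree). Under uWSGI, <code class="highlighter-rouge">warm_caches</code> would be triggered from the post-fork hook so that every new worker pays the build cost before serving traffic:</p>

```python
import functools

@functools.lru_cache(maxsize=1)
def build_category_tree():
    """Expensive object normally built by reading configuration from
    disk; simulated here with a literal."""
    return {"restaurants": ["pizza", "sushi"], "services": ["plumbing"]}

# Every expensive cacheable computation registers itself here.
WARMUP_TASKS = [build_category_tree]

def warm_caches():
    """Run each expensive computation once so the first real request
    never pays the cache-miss cost. In production this would run in a
    uWSGI post-fork hook via an internal warmup endpoint."""
    for task in WARMUP_TASKS:
        task()

warm_caches()  # called once when a worker starts
```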
</li>
<li>
<p><strong>Vertical Scaling:</strong> Last but not least, we also tried deploying our service and its dependent services on highly performant <a href="https://aws.amazon.com/ec2/instance-types/z1d/">z1d.6xlarge EC2 instances</a>. We saw marginal improvements (up to 100ms) on page load timings, but some of the other computationally expensive AJAX APIs saw more significant gains. For example, our POST endpoint responsible for purchasing the products got 20% faster, leading to fewer timeouts.</p>
</li>
</ol><p>After four months of focused effort with a dedicated engineering team, we achieved results that made this investment worthwhile. It was not just a win for our conversion metrics, but also for our customers, who now experienced substantially faster loading pages.</p><p>The key results we achieved for our ads purchase flow:</p><ul><li>We reduced our P75 FCPs from 3.25s to 1.80s, a 45% improvement.</li>
<li>We reduced our P75 YPCs from 4.31s to 3.21s, a 25% improvement.</li>
<li>We saw up to 15% lift in our conversion rate.</li>
</ul><p>Below are a couple of graphs that show our progress over time:</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/boosting-user-conversion-with-ux-performance-wins/p75-fcp.png" class="c1" alt="image" /></div><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/boosting-user-conversion-with-ux-performance-wins/ypc-improvements.png" class="c1" alt="image" /></div><h2 id="acknowledgements">Acknowledgements</h2><ul><li>Shoutout to my teammates on this project: Thibault Ravera, Bobby Roeder, Frank She, Austin Tai, Yang Wang and Matt Wen.</li>
<li>Shoutout to Dennis Coldwell, Aaron Gurin, Blake Larkin and Alex Levy for technical review and editing.</li>
</ul><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/a6cfee89-2dd0-4451-bf52-746b9547dfb7/Software-Engineer-Full-Stack-Engineer-Remote?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/01/boosting-user-conversion-with-ux-performance-wins.html</link>
      <guid>https://engineeringblog.yelp.com/2021/01/boosting-user-conversion-with-ux-performance-wins.html</guid>
      <pubDate>Wed, 27 Jan 2021 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Whose Code is it Anyway?]]></title>
      <description><![CDATA[<p><a href="https://engineeringblog.yelp.com/2021/01/whose-code-is-it-anyway.html">Read “Whose Code is it Anyway?” on the Yelp Engineering Blog.</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/01/whose-code-is-it-anyway.html</link>
      <guid>https://engineeringblog.yelp.com/2021/01/whose-code-is-it-anyway.html</guid>
      <pubDate>Wed, 13 Jan 2021 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Now You See Me: How NICE and PDQ plots Uncover Model Behaviors Hidden by Partial Dependence Plots]]></title>
      <description><![CDATA[<p>Many machine learning (ML) practitioners use <a href="https://scikit-learn.org/stable/modules/partial_dependence.html">partial dependence plots</a> (PDP) to gain insights into model behaviors. But have you run into situations where PDPs average two groups with different behaviors and produce curves applicable to none? Are you longing for tools that help you understand detailed model behavior in a visually manageable way? Look no further! We are thrilled to share with you our newest model interpretation tools: the Nearby Individual Conditional Expectation plot and its companion, the Partial Dependence at Quantiles plot. They highlight local behaviors and hint at how much we may trust such readings.</p><h2 id="a-not-nice-world">A not NICE world</h2><p>At Yelp, we have ML models for personalized user and business owner <a href="https://engineeringblog.yelp.com/2018/05/scaling-collaborative-filtering-with-pyspark.html">recommendations</a>, the <a href="https://engineeringblog.yelp.com/2020/02/accelerating-retention-experiments-with-partially-observed-data.html">retention</a> of advertisers, <a href="https://engineeringblog.yelp.com/2019/12/architecting-wait-time-estimations.html">wait time prediction</a> in <a href="https://restaurants.yelp.com/products/waitlist-table-management-software/">Waitlist</a>, <a href="https://engineeringblog.yelp.com/2020/01/modernizing-ads-targeting-machine-learning-pipeline.html">ads targeting</a>, <a href="https://engineeringblog.yelp.com/2014/12/learning-to-rank-for-business-matching.html">business matching</a>, etc. Although the prediction quality is always one of the key priorities for any ML model, we also care deeply about the interpretability of the model. As ML practitioners, we often use model interpretation tools to do sanity checks on how a model is generalizing from the features. 
More importantly, exposing the “why” behind a model’s behavior to its consumers, who often are not ML practitioners, can give them confidence in its accuracy and generalizability, or lead them to deeper applications and better business decisions.</p><p>Since most of our models are complex in order to achieve better prediction quality, they can also be harder to decipher. One common question in model understanding is, “How do changes in a feature’s values relate to changes in the prediction?” Previously, we used the popular <a href="https://scikit-learn.org/stable/modules/partial_dependence.html">PDP</a> and the <a href="https://ieeexplore.ieee.org/abstract/document/5949423">sensitivity plot</a> to take snapshots from the model that are easy for a human to understand. A PDP shows how predictions change, on average, when varying a single feature<sup><a href="https://engineeringblog.yelp.com#footnote1">1</a></sup> over its values (e.g., min to max) and holding the other features constant. PDPs can answer questions like, “What would users’ wait time be respectively if the local temperature were 30°F, 50°F, and 70°F?” In contrast, a sensitivity plot varies a single feature relatively (e.g., -15% to +15%) while holding the other features constant. Sensitivity plots can answer questions like, “If the weather had been 10% warmer for these days (temperatures on these days in general are different), how would wait time estimates have changed?”</p><p>However, these tools are not without some limitations. First of all, both PDP and sensitivity plots operate at an aggregated level, meaning that we average all the data points to achieve one single curve. This aggregation may hide differences in various subpopulations. For example, when creating either a sensitivity plot or a PDP, we could imagine the prediction goes up in half of the population and goes down in the other half when we increase a feature. 
When we average the two halves together, we may falsely conclude that the feature has no marginal contribution to the prediction. Secondly, when drawing PDPs or sensitivity plots over sparse data regions, both plots become untrustworthy. For example, if we only have a few restaurants open at 0°F, then it’s usually unwise to generalize from a PDP for wait times around such low temperatures.</p><p>To address these concerns, we came up with two new tools: the Nearby Individual Conditional Expectation (NICE) plot and its companion the Partial Dependence at Quantiles (PDQ) plot. Instead of the aggregate effect, the NICE plot individually draws changes in predictions due to local perturbations on top of the scatter plot between feature values and corresponding predictions. The PDQ plot helps to summarize the heterogeneity in the NICE plot by aggregating partial dependence at different quantiles of predictions. In practice, we often need to review the PDQ plot when we have difficulties in figuring out the general patterns in a NICE plot.</p><h2 id="what-is-the-nice-plot">What is the NICE plot?</h2><p>NICE plots examine the <a href="https://www.tandfonline.com/doi/abs/10.1080/10618600.2014.907095">Individual Conditional Expectation</a> in the neighborhood of the original feature values. Below is an example plot from one feature in one of our retention models.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-12-17-now-you-see-me-how-nice-and-pdq-plots-uncover-model-behaviors-hidden-by-partial-dependence-plots/fig1-nice.png" alt="Note: This graph contains 1000 data points and each blue line consists of 7 points." /><p class="subtle-text"><small>Note: This graph contains 1000 data points and each blue line consists of 7 points.</small></p></div><p>We made this NICE plot using the following algorithm:</p><ol><li>Select a random sample of data points (if your dataset is large).</li>
<li>Make a scatter plot of feature values and model predictions (the black dots).</li>
<li>Make nearby perturbations about each feature value (e.g., <code class="highlighter-rouge">lower_bound</code> = 0.9 * <code class="highlighter-rouge">feature_value</code> and <code class="highlighter-rouge">upper_bound</code> = 1.1 * <code class="highlighter-rouge">feature_value</code>) and evenly sample N points within the bounds (we recommend N to be odd so the original feature value is included).</li>
<li>Record their corresponding perturbed predictions.</li>
<li>Draw lines between the N points and corresponding predictions on the scatter plot (the blue lines).</li>
</ol><p>A NICE plot foremost shows the bivariate distribution between feature values and their corresponding predictions. Therefore, it is straightforward to observe the sparsity between the two. In the above example, the model rarely gives a low prediction when the feature is smaller than 4 (exhibited by the white space on the bottom left corner) and it rarely gives a high prediction when the feature value is roughly smaller than 1 (illustrated by the white triangle-like shape on the top left corner).</p><p>More importantly, this plot only examines marginal effects in the neighborhood of each observed data point, which helps to show heterogeneous effects and may hint at interaction effects. In the above graph, the marginal effect goes up and then goes down when the feature value is in the range of 0 to 1. Starting from 1, the effect is positive and large in magnitude until the feature value reaches 2. In the range 2 to 4, we observe some heterogeneous effects: some lines are downward sloping while others are flat, and the flat ones are observed more often when the prediction gets larger. When the feature value is greater than 6, all the NICE lines are flat throughout the region.</p><p>On the other hand, the information in the PDP and the sensitivity plot of the same feature lacks many details.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-12-17-now-you-see-me-how-nice-and-pdq-plots-uncover-model-behaviors-hidden-by-partial-dependence-plots/fig2-pdp-sens.png" alt="Note: the y-axes in these figures have narrower ranges than the previous NICE plot because of aggregation." 
/><p class="subtle-text"><small>Note: the y-axes in these figures have narrower ranges than the previous NICE plot because of aggregation.</small></p></div><p>The PDP (left) correctly captures the most significant inverted V-shape structure when the feature is smaller than 4 and the flat shape afterwards, but it loses some subtleties contained in the V-shape. The sensitivity plot (right) is misleading. From it, you may conclude that tweaking the feature would yield a single-peaked relationship with the peak at -20% of the feature value, which is true only in aggregate. From the NICE plot, we can see this relationship fails to hold for most, if not all, individual data points: the marginal effects are flat when the feature values are greater than 4, and have the “wrong” shape for samples in the valley near one.</p><p>When trying to apply the above algorithm to binary or categorical features, we cannot make nearby perturbations and have to examine the change from one value to another. Below is one example from our system.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-12-17-now-you-see-me-how-nice-and-pdq-plots-uncover-model-behaviors-hidden-by-partial-dependence-plots/fig3-nice.png" alt="Note: we apply jitter to the feature values to make the density easier to see." /><p class="subtle-text"><small>Note: we apply jitter to the feature values to make the density easier to see.</small></p></div><p>As one can see, the NICE plot is still useful to demonstrate heterogeneous effects. In the above figure, the lines at the top of the figure (corresponding to high prediction values) are flatter. But when the predictions get smaller, the lines get steeper. 
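The NICE construction described above (scatter observed points, perturb each one's feature value by a small relative amount, and re-predict) can be sketched as a small helper. This is a minimal illustration under assumed interfaces (`predict_fn`, a plain NumPy array for `X`), not Yelp's actual tooling:

```python
import numpy as np

def nice_curves(predict_fn, X, feature_idx, n_perturb=7, rel=0.1):
    """Compute NICE-plot line segments for each row of X.

    predict_fn maps an (n, d) array to n predictions; X is an (n, d) array.
    Returns a list of (grid, predictions) pairs, one per data point, ready
    to be drawn as lines on top of the feature-vs-prediction scatter.
    """
    curves = []
    for row in X:
        v = row[feature_idx]
        # an odd n_perturb keeps the original feature value on the grid
        grid = np.linspace((1 - rel) * v, (1 + rel) * v, n_perturb)
        perturbed = np.tile(row, (n_perturb, 1))  # copies, so mutation is safe
        perturbed[:, feature_idx] = grid
        curves.append((grid, predict_fn(perturbed)))
    return curves
```

Plotting each returned pair as a short line segment over the scatter of observed (feature value, prediction) points reproduces the structure of the figures above.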
For comparison, below is the PDP of the same feature:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-12-17-now-you-see-me-how-nice-and-pdq-plots-uncover-model-behaviors-hidden-by-partial-dependence-plots/fig4-pdp.png" alt="Note: the y-axis in this figure has a narrower range than the previous NICE plot because of aggregation." /><p class="subtle-text"><small>Note: the y-axis in this figure has a narrower range than the previous NICE plot because of aggregation.</small></p></div><p>Clearly, the PDP manages to capture the aggregated trend, but misses the differences in marginal effects when the predicted values are different. Using this plot one cannot see the heterogeneity in these marginal effects.</p><p>The structures contained in a NICE plot, however, may be both a blessing and a curse. When the model has complex interaction effects, it is hard for a human to decipher all the subtleties from the numerous dots and lines in a NICE plot. To mitigate this issue, we developed a companion tool: the PDQ plot.</p><h2 id="what-is-the-pdq-plot">What is the PDQ plot?</h2><p>A PDQ plot is a variation of the conventional PDP. It stands on the middle ground between the fully local NICE plot and fully global PDP. It plots the partial dependence conditional on some pre-specified quantiles of the predicted values, which helps to simplify the heterogeneity and emphasizes the major structures in a NICE plot. Here is the PDQ of the first NICE plot in this article:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-12-17-now-you-see-me-how-nice-and-pdq-plots-uncover-model-behaviors-hidden-by-partial-dependence-plots/fig5-pdq.png" alt="" /></div><p>We made this PDQ plot using the following algorithm:</p><ol><li>Select the quantile values to be drawn. Our default values are 0.05, 0.25, 0.5, 0.75, and 0.95.</li>
<li>For each quantile, find data points whose predictions are close to the desired quantile (e.g., the desired quantile ± 0.001).<sup><a href="https://engineeringblog.yelp.com#footnote2">2</a></sup></li>
<li>Again for each quantile, generate and plot partial dependencies using only those samples.<sup><a href="https://engineeringblog.yelp.com#footnote3">3</a></sup></li>
</ol><p>From the plot, we can easily identify a sharp inverted V-shape structure when the predicted value is small, but this non-monotonic effect gradually flattens out as the prediction increases. When the prediction is sufficiently large (starting from the 0.75 quantile), we do not see a significant drop after the initial rise.</p><p>In practice, one can use the corresponding PDQ plot to help make sense of the NICE plot. For example, by just inspecting the NICE plot, it may be unclear to some readers that the non-monotonic effect gradually flattens as the prediction increases. Indeed, a lot is going on in a small region. But after observing the PDQ plot, one can go back and re-examine the NICE plot.</p><p>If PDQ plots can represent the information contained in NICE plots in a concise fashion, why don’t we solely rely on them? Firstly, PDQ plots still need to aggregate some data. Therefore, it is difficult, if not impossible, to differentiate a mix shift from an inherent behavior change of the model by just examining a PDQ plot. For example, you might think that the gradually flattened V-shape structure is because the negative marginal effects are less steep when the predictions are higher, which can be ruled out with the help of the corresponding NICE plot.</p><p>Moreover, PDQ plots have a data sparsity issue. We cannot observe the bivariate distribution in PDQ plots. Therefore, in some regions we do not have many, if any, data points. The following two figures constitute a good example.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-12-17-now-you-see-me-how-nice-and-pdq-plots-uncover-model-behaviors-hidden-by-partial-dependence-plots/fig6-nice-pdq.png" alt="" /></div><p>From the PDQ plot, it is very tempting to conclude that for q=0.95 the effect gradually flattens out once the feature value is greater than 50. 
However, we can see that there are almost no data points with a feature value greater than 50 and a very high prediction. Therefore, it is probably unjustified to assume such a relationship exists in that region.</p><p>Finally, why do PDQ plots work in practice? We have repeatedly observed that the patterns in NICE plots can be roughly grouped by the predicted values. This is probably because samples that produce similar predictions are similar for the purpose of a specific prediction task. Therefore, these samples are more likely to share a common marginal effect.</p><h2 id="conclusion">Conclusion</h2><p>A NICE plot is an individual conditional expectation plot restricted to feature values near the observed ones. It shows how the model would behave if we perturb a feature near its observed values while keeping all other features fixed. Reading a NICE plot can also tell us how much we can trust such behaviors because the plot contains information about data sparsity.</p><p>The PDQ plot helps to summarize the heterogeneity in the NICE plot by grouping partial dependence at different quantiles. We typically consult the corresponding PDQ plot when we have difficulties in figuring out the general patterns in a NICE plot. PDQ works because, in a given prediction task, data points with similar predictions behave more similarly to one another than points with dissimilar predictions do.</p><h2 id="acknowledgements">Acknowledgements</h2><p>The original idea of the NICE plot belongs to Jeffrey Seifried. 
Blake Larkin, Nelson Lee, Eric Liu, Jeffrey Seifried, Vishnu Purushothaman Sreenivasan, and Ning Xu (ordered alphabetically) helped read through earlier versions and made helpful comments.</p><h3 id="notes">Notes</h3><p><a name="footnote1" id="footnote1">1</a>: We can vary multiple features and draw a multivariate PDP, but the interpretation gets very difficult past two features!</p><p><a name="footnote2" id="footnote2">2</a>: To reduce noise, we do not just select a handful of data points exactly at the pre-defined quantile. In general, samples with different predictions may behave differently in terms of their marginal effects, and you don’t want to be fooled by a tiny sample.</p><p><a name="footnote3" id="footnote3">3</a>: You can check <a href="https://github.com/scikit-learn/scikit-learn/blob/master/examples/inspection/plot_partial_dependence.py">scikit-learn’s implementation</a> if you need help computing partial dependencies.</p><div class="island job-posting"><h3>Become an Applied Scientist at Yelp!</h3><p>Are you intrigued by data? Uncover insights and carry out ideas through statistical and predictive models.</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/d0c0d643-2e39-4eb5-81a6-7e56b517f777/Applied-Scientist?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/12/now-you-see-me-how-nice-and-pdq-plots-uncover-model-behaviors-hidden-by-partial-dependence-plots.html</link>
      <guid>https://engineeringblog.yelp.com/2020/12/now-you-see-me-how-nice-and-pdq-plots-uncover-model-behaviors-hidden-by-partial-dependence-plots.html</guid>
      <pubDate>Thu, 17 Dec 2020 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Orchestrating Cassandra on Kubernetes with Operators]]></title>
      <description><![CDATA[<p><a href="https://engineeringblog.yelp.com/2020/11/orchestrating-cassandra-on-kubernetes-with-operators.html">Orchestrating Cassandra on Kubernetes with Operators</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/11/orchestrating-cassandra-on-kubernetes-with-operators.html</link>
      <guid>https://engineeringblog.yelp.com/2020/11/orchestrating-cassandra-on-kubernetes-with-operators.html</guid>
      <pubDate>Mon, 16 Nov 2020 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Tales of a Mobile Developer on Consumer Growth]]></title>
      <description><![CDATA[<p><a href="https://engineeringblog.yelp.com/2020/11/tales-of-a-mobile-developer-on-consumer-growth.html">Tales of a Mobile Developer on Consumer Growth</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/11/tales-of-a-mobile-developer-on-consumer-growth.html</link>
      <guid>https://engineeringblog.yelp.com/2020/11/tales-of-a-mobile-developer-on-consumer-growth.html</guid>
      <pubDate>Fri, 13 Nov 2020 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Minimizing read-write MySQL downtime]]></title>
      <description><![CDATA[<p>The relational database of choice at Yelp is MySQL and it powers much of the Yelp app and yelp.com. MySQL does not include a native high-availability solution for the replacement of a primary server, which is a single point of failure. This is a tradeoff of its dedication to ensuring consistency. Replacing a primary server is sometimes necessary due to planned or unplanned events, like an operating system upgrade, a database crash, or hardware failure. This requires pausing data modifications to the database while the server is restarted or replaced and can mean minutes of downtime. Pausing data modifications means that our users can’t perform actions like writing reviews or messaging a home service professional, so this downtime must be kept as short as possible. This post details how Yelp has integrated open-source tools to provide advanced MySQL failure detection and execute automated recoveries to minimize the downtime of our read-write MySQL traffic.</p><h2 id="characteristics-of-mysql-infrastructure-at-yelp">Characteristics of MySQL infrastructure at Yelp</h2><p>Our MySQL infrastructure is made up of:</p><ul><li>Hundreds of thousands of queries per second from HTTP services and batch workloads (lots of low latency, user facing web traffic!)</li>
<li>Applications connect to MySQL servers through a layer 7 proxy, open-source ProxySQL</li>
<li>MySQL clusters have a single primary and use asynchronous replication. Most deployments span geographically sparse data centers (we love scaling with MySQL replicas!)</li>
<li>ZooKeeper based service discovery system, used for applications to discover proxies and proxies to discover MySQL databases</li>
<li>Open-source Orchestrator deployed to multiple datacenters in raft consensus mode for high availability and failure detection of MySQL servers</li>
</ul><p>MySQL primary replacements are performed due to MySQL crashes, hardware failure and maintenance (hardware, operating system, MySQL upgrades). For unplanned failures, Orchestrator detects the failure and initiates the recovery procedure. For planned server upgrades, an on-call engineer can invoke Orchestrator’s primary replacement procedure.</p><p>We are able to minimize MySQL downtime when replacing a MySQL primary because:</p><ul><li>MySQL clients (applications) remain connected to a proxy tier</li>
<li>Orchestrator detects failure within seconds, then initiates MySQL specific recoveries and elects a new primary server</li>
<li>the new primary server indicates to the service discovery system that it is the primary for a set of databases</li>
<li>the proxy tier watches for the update to the service discovery system and adds the identity of the new primary server to its configuration</li>
</ul><p>When the proxy tier has discovered the new primary server, the replacement is complete and applications are again able to write data to the database.</p><p>This procedure is completed in seconds!</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-11-09-minimizing-read-write-mysql-downtime/2020-11-09-minimizing-mysql-downtime-diagram.png" alt="" /></div><p>A closer look at how everything fits together:</p><ul><li>Individual components store and consume data in ZooKeeper, storing their own identities (IP addresses) and reading the identities of other components</li>
<li>Applications establish connections to ProxySQL and issue queries</li>
<li>ProxySQL maintains a connection pool to each MySQL server, and proxies client connections to connections in its pool</li>
<li>Orchestrator maintains a connection pool to each MySQL server, constantly performing health checks and is ready to initiate a failure recovery when necessary</li>
</ul><h2 id="proxysql-as-a-highly-available-proxy-layer">ProxySQL as a highly available proxy layer</h2><p>ProxySQL is a high performance, high availability, protocol aware proxy for MySQL. We love ProxySQL because it limits the number of MySQL connections to our MySQL servers and it permits us to replace MySQL servers without requiring applications to re-establish their database connections.</p><h3 id="deployment">Deployment</h3><p>We deploy ProxySQL using AWS Auto-scaling groups and AWS EC2. We configure these servers to run ProxySQL after powering on, using Puppet, and since they are relatively stateless we are able to add or replace ProxySQL capacity very quickly, in less than 10 minutes.</p><h3 id="configuring-proxysql-to-route-to-mysql-backends">Configuring ProxySQL to route to MySQL backends</h3><p>We use ProxySQL’s hostgroup functionality to group MySQL servers into tuples of (MySQL schema, MySQL role), where MySQL schema is one of our vertical shards to isolate workloads and MySQL role is one of {primary, replica, reporting replica} to isolate read/write, read only, and non-user facing read traffic respectively. A single MySQL user maps uniquely to a hostgroup, which means that an application only needs to present a username and password to ProxySQL to be routed and load balanced to the proper database and database role.</p><p>Each ProxySQL server must be configured with the set of available MySQL servers and continue to stay up to date as MySQL capacity is added, replaced, or when hosts transition between MySQL roles and therefore hostgroups. On a several minute interval, a script runs on each ProxySQL server to read the available MySQL servers and their roles from our ZooKeeper based service discovery system and load them into ProxySQL’s configuration as hostgroups. 
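A minimal sketch of that periodic ZooKeeper-to-ProxySQL sync, with hypothetical names for the interfaces (`set_hostgroup`, the shape of the discovered mapping) — the real script at Yelp does considerably more:

```python
def sync_hostgroups(discovered, proxysql):
    """Load MySQL servers from service discovery into ProxySQL hostgroups.

    discovered: mapping of (schema, role) -> list of MySQL host addresses,
    as read from ZooKeeper. Refuses to act on suspicious data.
    """
    if not discovered:
        # a service discovery outage must not trigger a mass-removal of backends
        raise RuntimeError("service discovery returned no servers; refusing to sync")
    for (schema, role), hosts in discovered.items():
        if role == "primary" and len(hosts) != 1:
            # exactly one primary per cluster guards against split-brain
            raise RuntimeError(
                f"{schema}: expected exactly one primary, saw {len(hosts)}"
            )
    for (schema, role), hosts in discovered.items():
        # idempotent: re-applying an unchanged mapping is a no-op
        proxysql.set_hostgroup(schema, role, hosts)
```

The two guard clauses mirror the verifications the post describes: refusing a mass-removal when discovery looks empty, and refusing to act when more than one server claims the primary role.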
This script is idempotent and also contains important verification functionality, such as preventing a mass-removal of MySQL servers if an outage of the service discovery system is detected or ensuring that only one server exists in the “primary” hostgroup for each cluster. The latter verification method is a key component of ensuring that our primary failover system is safe in the face of network partitions.</p><h3 id="applications-connecting-to-proxysql">Applications connecting to ProxySQL</h3><p>Just as MySQL servers register into service discovery so that they can be discovered by ProxySQL servers, ProxySQL servers register into the same system so that applications are able to discover and connect to them. Applications read the identity of ProxySQL servers from service discovery and supply a username and password deployed with the application to initiate their MySQL connections.</p><h2 id="service-discovery">Service Discovery</h2><p>At Yelp, the data plane of our service discovery system consists of a daemon on each server that performs HTTP or TCP healthchecks on a service, and if the service is healthy, stores information including the IP address and port of the service in ZooKeeper. If a service fails to respond successfully to its healthcheck, this daemon will remove the state of the failing service instance. A separate daemon is responsible for reading the state in ZooKeeper and proxying requests through the service mesh.</p><h3 id="mysql-registration-and-healthcheck">MySQL registration and healthcheck</h3><p>MySQL servers are grouped by (MySQL schema, MySQL role) where MySQL role is a value in {primary, replica, reporting replica}. Both the MySQL schema and MySQL role values are represented as files on disk of each MySQL server. 
These files are understood by the process that performs health checks and are used to represent the (MySQL schema, MySQL role) groupings in ZooKeeper.</p><p>Our health check for the MySQL replica services is more thorough than only verifying that the MySQL port is open since these servers are running stateful workloads that require significant configuration. Before a MySQL replica is deemed to be healthy, it must pass all of the monitoring checks defined using our monitoring framework. To accommodate this, an HTTP service is deployed on each MySQL server to provide an HTTP health check endpoint which verifies that the server has passed all of its monitoring checks before the MySQL process is considered healthy. Some examples of these monitoring checks are:</p><ul><li>The server restored from backup successfully</li>
<li>The server is replicating and is caught up to real time</li>
<li>The server is “warmed” by streaming a MySQL buffer pool from another server in the cluster and loading it into its own buffer pool</li>
</ul><h3 id="proxysql-healthcheck">ProxySQL healthcheck</h3><p>Because ProxySQL servers are lightweight and almost completely stateless, a ProxySQL server is considered healthy as long as it is listening for TCP connections on the defined ProxySQL port. After the ProxySQL process is launched and begins listening for TCP connections, it passes its health check and is discoverable by applications.</p><h2 id="orchestrator-driven-failure-recovery">Orchestrator driven Failure Recovery</h2><p>Orchestrator is an open source MySQL high availability and replication management tool that provides failure detection and automated recovery of MySQL servers. We deploy Orchestrator using its distributed raft mode in order to have the service be highly available and to provide improved failure detection of MySQL servers. Orchestrator’s failure recovery features solve the single point of failure presented with a single primary MySQL configuration mentioned earlier in this post.</p><p>Upon detecting a failure of a MySQL server, the multiple orchestrator instances running in raft mode will seek consensus of the identified failure, and if a quorum of instances agree, a recovery will proceed.</p><p>If the failed server is a replica and is a replication source for other replicas, Orchestrator will ensure that these replicas are re-configured to replicate from a healthy replication source. If the failed server is a primary, Orchestrator will proceed to set the failed primary to read-only mode (MySQL variable @@read_only=1), identify a candidate to be promoted to primary, re-configure replicas of the former primary to replicate from the candidate primary, and set the candidate primary to read-write mode (@@read_only=0). 
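The recovery sequence just described can be sketched as follows. The orchestration interface here (`set_read_only`, `pick_candidate`, `repoint_replication`, `set_read_write`) is hypothetical, not Orchestrator's actual API, and the real recovery handles many more edge cases:

```python
def recover_failed_primary(orchestrator, cluster):
    """Sketch of the primary-replacement flow for a single-primary cluster."""
    failed = cluster.primary
    # fence off the old primary first so no new writes can land on it
    orchestrator.set_read_only(failed)          # i.e., @@read_only = 1
    candidate = orchestrator.pick_candidate(cluster.replicas)
    for replica in cluster.replicas:
        if replica != candidate:
            # repoint the remaining replicas at the promoted server
            orchestrator.repoint_replication(replica, source=candidate)
    # only once replication is re-established does the candidate accept writes
    orchestrator.set_read_write(candidate)      # i.e., @@read_only = 0
    return candidate
```

The ordering matters: setting the old primary read-only before promoting the candidate is what keeps a partially failed recovery from leaving two writable primaries.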
Orchestrator handles the MySQL specific changes for replacing a primary server and allows definitions of “failover hooks” to run custom-defined commands during different phases of the recovery process.</p><h3 id="primary-failover-hooks">Primary Failover Hooks</h3><p>Orchestrator performs the MySQL specific part of the failover, but there are still other changes required, such as updating the on-disk file that represents a server’s MySQL role for the service discovery system. An HTTP service exists on each MySQL server in order to support this, and failover hooks are configured to send an HTTP request to both the former and newly promoted primaries to update their MySQL role. After this hook executes, the service discovery daemon will notice that the MySQL role of the promoted primary has changed and will update the identity of the primary server in ZooKeeper.</p><p>As mentioned earlier, each ProxySQL server runs a script on a several-minute interval which reads the MySQL service discovery state in ZooKeeper and ingests this data into ProxySQL’s configuration. In order to reduce the recovery time after a primary failover, a separate process runs on ProxySQL servers to watch the identities of MySQL primaries in ZooKeeper and to initiate the previous process immediately when a change is noticed.</p><h2 id="perspective-of-a-mysql-client-during-a-primary-failover">Perspective of a MySQL client during a primary failover</h2><p>After Orchestrator issues <code class="highlighter-rouge">set @@read_only=1</code> on the former primary, clients will see INSERT/UPDATE/DELETE queries fail. These failures will persist until ProxySQL has updated its hostgroup configuration to replace the failed primary with the promoted one. Neither applications nor ProxySQL need to create new TCP connections – clients remain connected to the same ProxySQL server, and each ProxySQL server already has an existing pool of connections to the promoted primary because it previously existed as a replica. 
After modifying its hostgroup configuration, a ProxySQL server is able to route MySQL traffic to the new primary.</p><h2 id="special-cases--network-partitioning-and-avoiding-split-brain">Special cases: network partitioning and avoiding split-brain</h2><p>This failure recovery system is carefully designed to make the right decision in failure scenarios caused by a network partition. A partial or incorrect failure recovery due to a network partition has the potential to leave the system with multiple primary hosts, each believing it is the primary, resulting in a divergence of the dataset known as “split-brain”. It is very difficult to repair a split-brain scenario, so we have several components in this system to help prevent it.</p><p>One mechanism to prevent the possibility of split-brain is validation in the logic which transforms the service discovery data in ZooKeeper into ProxySQL’s hostgroup configurations. If there is more than one primary registered in ZooKeeper, the script will refuse to make changes to the hostgroup configurations and will emit an alert to page an on-call responder who can inspect and appropriately remediate the situation.</p><p>We also set Orchestrator’s PreventCrossDataCenterMasterFailover value to true so that Orchestrator will never elect a new MySQL primary in a separate datacenter. 
We use this setting because we would not want to change the datacenter of a MySQL cluster’s primary without considerable planning and because it reduces the surface area of potential network partition scenarios that could result in split-brain.</p><h2 id="conclusions">Conclusions</h2><p>Thanks to these systems, we are able to quickly recover from MySQL failures and maximize the availability of Yelp for our users, ensuring a smooth user experience.</p><div class="island job-posting"><h3>Become a Database Reliability Engineer at Yelp</h3><p>Want to help make our databases even more reliable?</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/d88f19a8-f38a-4ceb-917d-d9d5a8ba0cc6/Senior-Software-Engineer-Database-Reliability-Engineering-NoSQL?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/11/minimizing-read-write-mysql-downtime.html</link>
      <guid>https://engineeringblog.yelp.com/2020/11/minimizing-read-write-mysql-downtime.html</guid>
      <pubDate>Mon, 09 Nov 2020 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Introducing Folium: Enabling Reproducible Notebooks at Yelp]]></title>
      <description><![CDATA[<p>Jupyter notebooks are a key tool that powers data work at Yelp. They allow us to do ad hoc development interactively and analyze data with visualization support. As a result, we rely on Jupyter to build models, create features, run Spark jobs for big data analysis, etc. Since notebooks play a crucial role in our business processes, it is really important for us to ensure the notebook output is reproducible. In this blog post, we’ll introduce our notebook archive and sharing service called Folium and its key integrations with our Jupyterhub that enable notebook reproducibility and improve ML engineering developer velocity.</p><h2 id="folium-for-notebook-archiving--sharing">Folium for Notebook Archiving &amp; Sharing</h2><p>There are a few ways to archive and share notebooks (e.g., exporting to HTML, saving .ipynb files in GitHub, using shared network drives). There are also some other higher-level frameworks for notebook archiving, but these frameworks lacked integration with Jupyterhub, searchability, and the additional customizations presented in this post.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-10-21-introducing-folium-enabling-reproducible-notebooks-at-yelp/fig-1-folium-and-jupyterhub.png" alt="Figure 1. Folium and Jupyterhub" /><p class="subtle-text"><small>Figure 1. Folium and Jupyterhub</small></p></div><p>Folium is a basic front-end service that also has APIs that interact with our Jupyterhub. These APIs enable uploading a notebook after developing it. While uploading a notebook, the user is prompted for tags (e.g., project name, ticket) and a description fetched from the notebook automatically. The front-end service provides the ability to search for notebooks by user, tag, or the documentation within the notebooks. It also renders the notebooks in the webpage, including the different notebook versions (more on this later!) 
and extracts a table of contents by extracting markdown in the notebook.</p><p>The functionality described above laid the basic foundation of notebook archiving and sharing, but we built several additional features that we want to share on helping with reproducibility of notebooks:</p><ul><li>The notebook running environment is logged so that we can easily reproduce the output.</li>
<li>Versions of the same notebooks are grouped together to easily compare their differences.</li>
<li>The shared notebooks can be directly imported into a Jupyter server so that people can easily reproduce or improve on the existing notebooks.</li>
<li>Variables can be adjusted and existing notebooks rerun directly from Folium, without going to Jupyterhub.</li>
<li>A tagging system allows searching for and grouping related notebooks.</li>
</ul><p>We will talk about each function in more detail in the following sections.</p><h2 id="logged-notebook-running-environment">Logged Notebook Running Environment</h2><p>We have a Jupyterlab extension installed on our Jupyterhub that takes care of import/export functionality to Folium. When exporting to Folium, the extension gathers the running environment from the current notebook server, so that the key information is also logged into the notebook’s metadata. Currently we log which Docker image and kernel are being used, so that when re-running the notebook we are able to choose the correct working environment. We also log the memory and CPU/GPU resources used so that users can pick the correct amount of resources to re-run the notebook. Different tasks have different needs: some require more computational power, while others require more memory. Without knowing how resources are used by existing notebooks, we would likely hit out-of-memory issues when rerunning them.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-10-21-introducing-folium-enabling-reproducible-notebooks-at-yelp/fig-2-basic-notebook-information.png" alt="Figure 2. Basic Notebook Information" /><p class="subtle-text"><small>Figure 2. Basic Notebook Information</small></p></div><h2 id="import-notebook-from-folium-to-jupyterhub">Import Notebook from Folium to Jupyterhub</h2><p>The same Jupyterlab extension mentioned above also allows us to import notebooks directly from Folium via its APIs. People can search and preview all the available Folium notebooks, and directly import them into Jupyterhub. We regularly use this function for collaboration and for improving on old models.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-10-21-introducing-folium-enabling-reproducible-notebooks-at-yelp/fig-3-search-folium-notebook-archive-and-import-within-jupyterhub.png" alt="Figure 3. 
Search Folium’s notebook archive and import within Jupyterhub" /><p class="subtle-text"><small>Figure 3. Search Folium’s notebook archive and import within Jupyterhub</small></p></div><h2 id="grouping-of-different-versions-of-notebooks">Grouping of Different Versions of Notebooks</h2><p>Often an analysis is valuable enough that it needs to be repeated, which means a user will upload multiple similar notebooks. When this happens, we group the similar notebooks together on the same page, so we can directly compare the results of different versions. We also use this feature to provide tutorials, where the question and the answer sit on the same page for people to learn by themselves. In addition, we link the related code review to the notebook in Folium, so that people can easily refer to the feedback on it.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-10-21-introducing-folium-enabling-reproducible-notebooks-at-yelp/fig-4-different-versions-of-notebooks-linked-together-and-related-links-are-also-highlighted.png" alt="Figure 4. Different versions of notebooks linked together and related links are also highlighted." /><p class="subtle-text"><small>Figure 4. Different versions of notebooks linked together and related links are also highlighted.</small></p></div><h2 id="parametrized-notebooks">Parametrized Notebooks</h2><p>Besides importing a notebook into Jupyterhub, users can also rerun a notebook with different parameters directly in Folium. This helps us reuse notebooks and quickly get results for similar analyses.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-10-21-introducing-folium-enabling-reproducible-notebooks-at-yelp/fig-5-substitute-variables-and-rerun-notebooks-from-folium.png" alt="Figure 5. 
Substitute variables and rerun notebooks from Folium" /><p class="subtle-text"><small>Figure 5. Substitute variables and rerun notebooks from Folium</small></p></div><p>Search is also key to reusing notebooks. Without search integration, we would end up with lots of similar notebooks being recreated. This is exactly the issue we saw before improving the tagging and search system on Folium: people constantly recreated the same notebooks because they were unaware that similar notebooks could easily be imported and reused. We fixed the issue by automatically fetching the markdown from notebooks to generate descriptions that help users search for specific notebooks. Free-form tagging is also supported and widely used by teams to tag the notebooks they own or to group notebooks related to specific projects.</p><p>The Folium web service has a simple search results page (SERP) with filtering by tag and user. The search API supporting the SERP is also leveraged for searching in the sidebar from Jupyter, as shown in Figure 3.</p><h2 id="future-work">Future Work</h2><p>Folium not only helps us share code, but also helps us reuse existing notebooks to accelerate our daily work! On the roadmap, we are looking to continuously improve it by providing the ability to review notebooks, including a view of diffs and commenting. We are also adding more ways to get re-run notebooks delivered, including the option of emailed reports.</p><h2 id="acknowledgements">Acknowledgements</h2><p>Thanks to the Core ML team for building and continuously improving our Jupyter and Folium infrastructure, and thanks to Blake Larkin, Ayush Sharma, Shuting Xi, and Jason Sleight for editing the blog post.</p><div class="island job-posting"><h3>Become an ML Platform Engineer at Yelp</h3><p>Interested in designing, building, and deploying ML infrastructure systems? 
Apply to become an ML Platform Engineer today.</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/53b90eff-b187-483b-969c-847cb332fb6d/ML-Platform-Engineer-Remote?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/10/introducing-folium-enabling-reproducible-notebooks-at-yelp.html</link>
      <guid>https://engineeringblog.yelp.com/2020/10/introducing-folium-enabling-reproducible-notebooks-at-yelp.html</guid>
      <pubDate>Wed, 21 Oct 2020 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Flink on PaaSTA: Yelp’s new stream processing platform runs on Kubernetes]]></title>
      <description><![CDATA[<p>At Yelp we process terabytes of streaming data a day using <a href="https://flink.apache.org/">Apache Flink</a> to power a wide range of applications: ETL pipelines, push notifications, bot filtering, sessionization and more. We run hundreds and hundreds of Flink jobs, and without the right degree of automation, routine operations like deployments, restarts, and <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/state/savepoints.html">savepoints</a> would take thousands of hours of developers’ time. The latest addition to our toolshed is a new stream processing platform built on top of <a href="https://engineeringblog.yelp.com/2015/11/introducing-paasta-an-open-platform-as-a-service.html">PaaSTA</a>, Yelp’s Platform As A Service. Sitting at its core, a <a href="https://kubernetes.io/">Kubernetes</a> <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/operator/">operator</a> automatically watches over the deployment and the lifecycle of our fleet of Flink clusters.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-10-14-flink-on-paasta/logos.png" alt="Flink on PaaSTA on Kubernetes" /><p class="subtle-text"><small>Flink on PaaSTA on Kubernetes</small></p></div><h2 id="life-before-kubernetes">Life before Kubernetes</h2><p>Before the introduction of Kubernetes at Yelp, Flink workloads ran on dedicated AWS <a href="https://aws.amazon.com/emr/">ElasticMapReduce</a> (EMR) clusters, which come with both Flink and <a href="https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html">YARN</a> pre-installed. 
In order to make EMR instances work well with the rest of the Yelp ecosystem, our previous stream processing platform Cascade used to run a chunk of Yelp’s <a href="https://puppet.com/docs/pe/2019.8/pe_user_guide.html">Puppet</a> monolith in a <a href="https://www.docker.com/">Docker</a> container to apply configurations and to start the common set of daemons running on almost all Yelp’s hosts.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-10-14-flink-on-paasta/cascade.png" alt="Architecture of Cascade" /><p class="subtle-text"><small>Architecture of Cascade</small></p></div><p>Cascade also introduced a per-cluster controller component, which we call the Flink Supervisor, in charge of the Flink jobs’ life cycle (starting, stopping, savepointing) and monitoring.</p><p>While this system served us well for years, our developers were experiencing a handful of limitations:</p><ul><li>Spinning up a new Flink cluster took around 30 minutes</li>
<li>We needed trained human operators to manually deploy new versions or scale up resources for each cluster</li>
<li>We could not upgrade to newer versions of Flink until AWS supported them</li>
<li>Running Puppet in Docker and maintaining an infrastructure very different from the rest of Yelp’s was complex and time-consuming</li>
</ul><p>When Kubernetes started to gain more and more momentum both outside and inside the company, we decided that it was time for a change.</p><p><a href="https://engineeringblog.yelp.com/2015/11/introducing-paasta-an-open-platform-as-a-service.html">PaaSTA</a> is Yelp’s Platform As A Service and runs all Yelp’s web services and a few other stateless workloads like batch jobs. Originally developed on top of <a href="http://mesos.apache.org/">Apache Mesos</a>, PaaSTA is now being migrated to Kubernetes. This opened up the opportunity to support more complex workloads thanks to Kubernetes’ powerful primitives. Flink was the first in line and <a href="http://cassandra.apache.org/">Cassandra</a> is coming up in the very near future (be on the lookout for a new blog post!), both of them developed in tight collaboration with our Compute Infrastructure team.</p><p>Instead of “just” running Flink on top of Kubernetes using something off-the-shelf, we went down the road of developing a full-fledged platform that would make the experience of running Flink workloads as similar as possible to running any other service at Yelp. We did so to greatly reduce the knowledge necessary for a user to operate Flink clusters and to make our infrastructure very homogeneous with the rest of Yelp’s ecosystem.</p><p>With Flink on PaaSTA, provisioning a cluster is as easy as writing a YAML configuration file. New code deployments happen automatically via <a href="https://www.jenkins.io/">Jenkins</a> as soon as changes are committed to git. 
The commands provided by PaaSTA for starting, stopping, reading logs or monitoring a web service work exactly the same for any Flink cluster.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-10-14-flink-on-paasta/command.png" alt="paasta status command output" /><p class="subtle-text"><small>paasta status command output</small></p></div><p>In addition to UX improvements, we managed to reduce the average time to spin up a Flink cluster from 30 minutes to under 2 minutes, and we are now free to hop on the latest version of Flink on our own schedule.</p><h2 id="peeking-inside-the-hood">Peeking under the hood</h2><p>At the core of Flink on PaaSTA sits our custom Kubernetes <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/operator/">operator</a>, watching over the state of Flink clusters running on Kubernetes and making sure that they always match what is described in the configuration defined by the users.</p><p>Our PaaSTA glue translates this configuration into Kubernetes <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/">Custom Resources</a>, which the operator reads and updates with information taken from the Flink clusters, like the job list and status. 
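</p><p>To make the operator pattern concrete, here is a highly simplified sketch of the reconciliation idea at its core. This is illustrative only - not Yelp’s actual operator - and the resource fields are hypothetical:</p>

```python
# Illustrative sketch of a Kubernetes operator's reconcile step (hypothetical
# fields, not Yelp's actual operator). The operator compares the desired state
# from the custom resource with what is actually running and emits actions.
def reconcile(desired, observed):
    actions = []
    if observed is None:
        # Nothing running yet: create the whole cluster.
        actions.append(("create_jobmanager", desired["jobmanager"]))
        actions.append(("create_taskmanagers", desired["taskmanagers"]))
        return actions
    if observed["taskmanagers"] != desired["taskmanagers"]:
        actions.append(("scale_taskmanagers", desired["taskmanagers"]))
    if observed["image"] != desired["image"]:
        # A new Docker image means a new deployment: savepoint first, then redeploy.
        actions.append(("trigger_savepoint", None))
        actions.append(("redeploy", desired["image"]))
    return actions
```

<p>A real operator runs this comparison in a loop, reacting to every change in the Custom Resources.</p><p>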
These resources are also used by the PaaSTA commands to fetch what to show to the users and to interact with the operator for operations like start and stop.</p><p>The operator knows how to map the high-level definition of a Flink cluster resource into the right Kubernetes primitives like <a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/">Deployment</a> for scheduling the <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.11/concepts/flink-architecture.html#taskmanagers">TaskManagers</a>, <a href="https://kubernetes.io/docs/concepts/services-networking/service/">Service</a> to make the <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.11/concepts/flink-architecture.html#jobmanager">JobManager</a> discoverable by the other components in the cluster or <a href="https://kubernetes.io/docs/concepts/services-networking/ingress/">Ingress</a> to make the Flink web dashboard accessible by our users. The operator together with Jenkins schedules these components in Docker containers which allow us to customize the Flink installation and to select our Flink version of choice for each application.</p><p>You may find it surprising to see in the diagram below that our legacy Supervisor component still has a place in our new platform. At Yelp we like to approach all our projects with a practical spirit, infrastructure migrations included. While everything the Supervisor is doing could be worked into the operator, we decided to keep it around to reduce the development time by re-using existing features. 
Even more importantly, minimizing the scope of changes also helped to make the migration from Cascade to PaaSTA as easy as possible for our existing users.</p><p>For example, we deploy the Supervisor as a Kubernetes <a href="https://kubernetes.io/docs/concepts/workloads/controllers/job/">Job</a> to leverage its logic for triggering <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/state/savepoints.html">savepoints</a> of all the Flink jobs running on a cluster just before the operator shuts it down.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-10-14-flink-on-paasta/cluster.png" alt="Components of a Flink PaaSTA cluster" /><p class="subtle-text"><small>Components of a Flink PaaSTA cluster</small></p></div><p>If you’d love to hear more about the details, we encourage you to check out our <a href="https://youtu.be/hL5nNAMx8Bk">talk at Flink Forward</a>.</p><h2 id="what-now">What now?</h2><p>Freeing us from the need to manage hundreds of Flink clusters, Flink on PaaSTA unlocked a new world of possibilities for our users and our Stream Processing team.</p><p>On the infrastructure side, we are now close to adding <a href="https://beam.apache.org/">Apache Beam</a> support to Flink on PaaSTA in order to make Python stream processing a first-class citizen at Yelp. We are also working on implementing auto scaling and per-job cost reporting for Flink clusters.</p><p>On the UX side, we are developing tools to allow our users to define complex pipelines of streaming components with a single configuration file. 
We are also busy building features to shape our online machine learning platform.</p><p>Stay tuned if you want to hear about all the above and more!</p><div class="island job-posting"><h3>Data Streams Platform Engineer at Yelp</h3><p>Want to build next-generation streaming data infrastructure?</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/b5ccf6d4-d3c2-49cf-9692-9a9497ed4467/Senior-Platform-Engineer-Data-Streams?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/10/flink-on-paasta.html</link>
      <guid>https://engineeringblog.yelp.com/2020/10/flink-on-paasta.html</guid>
      <pubDate>Wed, 14 Oct 2020 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[The Dream Query: How we scope projects with GraphQL]]></title>
      <description><![CDATA[<p>At Yelp, new web pages and app screens are powered by GraphQL for fetching data.</p><p>This blog post describes the <strong>Dream Query</strong> – a pattern our feature teams use when refactoring or creating new pages.</p><p><em>(<a href="https://engineeringblog.yelp.com/2020/04/open-sourcing-dataloader-codegen.html">Check out our previous blog post</a> to see how we dynamically codegen DataLoaders to implement the server layer!)</em></p><h2 id="scoping-a-new-feature-with-graphql">Scoping a new feature with GraphQL</h2><p>Let’s jump in with an example!</p><p>Imagine your team is tasked with creating the new version of the “Header component” for the website (we’ll use the Yelp.com website in our example). You may receive a design mock that looks like this:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-10-07-dream-query/mock.png" alt="Mock Header Component" /><p class="subtle-text"><small>Mock Header Component</small></p></div><p>Your mission (should you choose to accept it, of course): turn this into code.</p><p>Alongside the usual planning activities such as OKR docs and estimated timelines, we’ve found it particularly helpful to write out a theoretical GraphQL query that could power the page or component - aka the “Dream Query”.</p><h2 id="writing-a-dream-query">Writing a Dream Query</h2><p>The first step is to identify what dynamic data we need to display. In the case of the Header component above, we can see the UI showing:</p><ul><li>Number of unread inbox messages</li>
<li>User’s profile photo</li>
</ul><p>Therefore we might write something like this:</p><div class="language-graphql highlighter-rouge highlight"><pre>query {
  loggedInUser {
    profilePhoto(size: "small") {
      src
    }
    inbox {
      unreadMessageCount
    }
  }
}</pre></div><p>The idea of the Dream Query is to let developers “just write” the query they wish they could write to power the page - with as low a barrier to entry as possible. In other words, imagine you’re writing the UI code, and you magically have everything already available to you. What query would you write?</p><p>Here are a few points to keep in mind:</p><ol><li><strong>Try to use real types</strong> that already exist in the schema. (Use GraphiQL’s docs tab to search.)</li>
<li><strong>It’s ok if you don’t get this perfect!</strong> Large schemas can contain hundreds of types, and it’s easy to miss stuff. (This will ideally be caught in review.)</li>
<li><strong>It’s ok to query for types that don’t exist yet.</strong> (That’s kind of the point here!)</li>
<li><strong>It’s ok if your team is totally new to GraphQL.</strong> Don’t worry if you aren’t super confident in the syntax yet - this will at least provide a great starting point for reviewers.</li>
<li><strong>Write the Dream Query before the real application code</strong> is written - ideally as part of the scoping or planning phase. This cuts down on the overall iteration cycle, since we aren’t writing any real code to implement resolver methods yet.</li>
</ol><h2 id="review">Review</h2><p>Once written, share the Dream Query widely. We do this in a Google Doc, so folks can comment line by line.</p><p>The goals here are to:</p><ul><li><strong>Refine the query</strong> such that it meets our schema design guidelines. (At Yelp, we have a community-driven schema review group specifically set up for this.)</li>
<li><strong>Find other teams who may be stakeholders</strong> in the types being created.</li>
<li><strong>Understand the time investment</strong> needed for the backend portion of the project (i.e. for creating new GraphQL resolvers).</li>
</ul><h2 id="graphql-faker">graphql-faker</h2><p>During review, it’s important not to block feature development. We want to be able to parallelize the backend and frontend work.</p><p>We’ve found <a href="https://github.com/APIs-guru/graphql-faker">graphql-faker</a> to be really helpful. It’s a super nifty tool for mocking up a schema, such that you can make “real” queries and iterate on the Dream Query in a live GraphQL playground.</p><p>This also lets client developers hook up the graphql-faker endpoint inside their application - meaning we can use a Dream Query in development to build the view layer while the schema is still in review.</p><h2 id="incrementally-using-the-dream-query">Incrementally using the Dream Query</h2><p>When writing new large pages, or incrementally refactoring a non-GraphQL page to use GraphQL, we may want to roll things out incrementally. Perhaps not all the resolvers can be implemented straight away.</p><p>We’ve found it helpful to paste in the whole Dream Query into the app:</p><div class="language-jsx highlighter-rouge highlight"><pre>const GET_HEADER_DATA = gql`
  query GetHeaderData {
    loggedInUser {
      city
      displayName
      # TODO: Uncomment and use when each field is supported
      # profilePhoto(size: "small") {
      #   src
      # }
      # inbox {
      #   unreadMessageCount
      # }
      # yearsElite
    }
  }
`;
function MyPage() {
  const { data, loading, error } = useQuery(GET_HEADER_DATA);
  if (error) throw error;
  if (loading) return null;
  const { displayName, city } = data.loggedInUser;
  return &lt;Header displayName={displayName} city={city} /&gt;;
}
</pre></div><p>Tickets can be created to uncomment specific fields. This provides a way to break up, parallelize, and track how much work is left to complete the migration. When a type becomes available in the schema, we can uncomment the lines and use it in production.</p><h2 id="why-a-dream-query-and-not-a-dream-schema">Why a dream query and not a dream schema?</h2><p>Schema proposals are great to see too! We recommend starting with the query first, since this maps directly to the interface the product will be using. It also allows those unfamiliar with the schema to quickly get an understanding of the shape of data the client is requesting, without having to go through all the options that the existing/proposed schema allows for.</p><h2 id="takeaways">Takeaways</h2><p>The Dream Query</p><ul><li>is used as a way to communicate what data the component or page needs, and what new backend work needs to be done</li>
<li>forces us as developers to think critically about what types we’re adding and encourages reuse of existing schema</li>
<li>provides an opportunity for schema reviewers to catch issues early, before iteration cycles are spent committing to suboptimal schema design</li>
<li>provides a way to chunk up migrations to GraphQL</li>
</ul><p>Mark Larah, Software Engineer (<a href="https://twitter.com/mark_larah">@mark_larah</a>)</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/b5d226cd-6ea1-4d12-b875-725b331202b7/Software-Engineer-Application-Backend-remote?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/10/dream-query.html</link>
      <guid>https://engineeringblog.yelp.com/2020/10/dream-query.html</guid>
      <pubDate>Wed, 07 Oct 2020 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Improving the performance of the Prometheus JMX Exporter]]></title>
      <description><![CDATA[<p>At Yelp, usage of <a href="https://prometheus.io/">Prometheus</a>, the open-source monitoring system and time series database, is blossoming. Yelp is initially focusing on onboarding infrastructure services to be monitored via Prometheus, one such service being <a href="https://kafka.apache.org/">Apache Kafka</a>. This blog post discusses some of the performance issues we initially encountered while monitoring Kafka with Prometheus, and how we solved them by contributing back to the Prometheus community.</p><h3 id="kafka-at-yelp-primer">Kafka at Yelp primer</h3><p>Kafka is an integral part of Yelp’s infrastructure; clusters vary in size and often contain several thousand topics. Kafka exposes a lot of metrics that can be collected, most of which are crucial for understanding the state of a cluster or broker during incidents, or for gauging its overall health. By default, Kafka reports metrics as JMX (<a href="https://en.wikipedia.org/wiki/Java_Management_Extensions">Java Management Extensions</a>) MBeans.</p><h3 id="prometheus-metrics-primer">Prometheus metrics primer</h3><p>One of the ways to export metrics in Prometheus is via <a href="https://prometheus.io/docs/instrumenting/exporters/">exporters</a>. Exporters expose metrics from services in a <a href="https://prometheus.io/docs/instrumenting/exposition_formats/">format</a> that Prometheus understands. Prometheus shards are then able to collect metrics exposed by these exporters.</p><p>The Prometheus community officially maintains the <a href="https://github.com/prometheus/jmx_exporter">JMX Exporter</a>, an exporter that can be configured to expose JMX MBeans from virtually any JVM-based process as Prometheus metrics. 
As mentioned above, Kafka is one such process.</p><hr /><p>In order to make Kafka metrics available in Prometheus, we decided to deploy the JMX Exporter alongside Kafka.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-10-02-improving-the-performance-of-the-prometheus-jmx-exporter/architecture.png" alt="Figure: Architecture of Prometheus metric collection for a 3-broker Kafka cluster" /><p class="subtle-text"><small>Figure: Architecture of Prometheus metric collection for a 3-broker Kafka cluster</small></p></div><p>When we initially deployed the JMX Exporter to some of the clusters, we noticed collection time could be as high as 70 seconds (from a broker’s perspective). We tried running the exporter as a Java agent and tweaking the configuration to collect only metrics that were interesting to us, but this did not improve the speed.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-10-02-improving-the-performance-of-the-prometheus-jmx-exporter/collection-time.png" alt="Figure: Collection time (in seconds) of a single Kafka broker with no prior code change." /><p class="subtle-text"><small>Figure: Collection time (in seconds) of a single Kafka broker with no prior code change.</small></p></div><p>This meant that metrics usable by automated alerting or engineers would have, at best, one datapoint per time series every 70 seconds. This would have made monitoring an infrastructure supporting real-time use cases difficult: spikes in incoming traffic, garbage collection pauses, and the like would be harder to spot.</p><p>We dug into the JMX Exporter codebase and realised some operations were repeated at every collection, sometimes hundreds of thousands of times per collection. For Kafka, some metrics are available at topic-partition granularity; if a Kafka cluster contains thousands of topic-partitions, thousands of metrics are exposed. 
One of the operations that seemed the most costly was <a href="https://github.com/prometheus/jmx_exporter/blob/ce04b7dca8615d724d8f447fa25c44ae1c29238b/collector/src/main/java/io/prometheus/jmx/JmxCollector.java#L375">matching MBean names against a configured set of regular expressions</a>, which then computes Prometheus sample <a href="https://github.com/prometheus/jmx_exporter/blob/ce04b7dca8615d724d8f447fa25c44ae1c29238b/collector/src/main/java/io/prometheus/jmx/JmxCollector.java#L408">name</a> and <a href="https://github.com/prometheus/jmx_exporter/blob/ce04b7dca8615d724d8f447fa25c44ae1c29238b/collector/src/main/java/io/prometheus/jmx/JmxCollector.java#L421">labels</a>.</p><p>The set of regular expressions is immutable over the lifespan of the exporter and between configuration reloads. This means that if an MBean name matches one of the regular expressions (or does not match any) during the first metric collection, it will match it for all collections until the configuration is changed or reloaded. The result of matching MBean names against the set of regular expressions can hence be cached and the time-consuming task of matching regular expressions (and computing sample name and labels) skipped during further collections.</p><p>After introducing this cache, heavy computations are made only once throughout the lifespan of the exporter. The initial collection does the heavy work of caching and takes a significant amount of time to complete; however, subsequent collections take very little time. Collections that used to take 70 seconds now take around 3 seconds. This allows us to have more fine-grained dashboards and alerting.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-10-02-improving-the-performance-of-the-prometheus-jmx-exporter/collection-time-updated.png" alt="Figure: Collection time (in seconds) before and after enabling rules caching. Red line shows the number of MBeans in the cache." 
/><p class="subtle-text"><small>Figure: Collection time (in seconds) before and after enabling rules caching. Red line shows the number of MBeans in the cache.</small></p></div><p>This change is now available in the upstream <a href="https://github.com/prometheus/jmx_exporter/pull/518">jmx_exporter</a>, and can be toggled on/off depending on the use case.</p><h3 id="looking-further">Looking Further</h3><p>As mentioned in the introduction, the usage of Prometheus at Yelp is growing and many systems and teams rely on it for monitoring, dashboards and automated alerting. The changes to the JMX exporter are only a small part of a large initiative driven by our <a href="https://www.yelp.careers/us/en/job/5fb956a7-4777-48d2-bc5e-ef49b5a2e300/Site-Reliability-Engineer">Production Engineering team</a>; watch this space for more insights into this journey!</p><h3 id="acknowledgements">Acknowledgements</h3><p>Thanks to Brian Brazil for code reviews and best practices.</p><div class="island job-posting"><h3>Site Reliability Engineering at Yelp</h3><p>Want to build and manage scalable, self-healing, globally-distributed systems?</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/5fb956a7-4777-48d2-bc5e-ef49b5a2e300/Site-Reliability-Engineer?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/10/improving-the-performance-of-the-prometheus-jmx-exporter.html</link>
      <guid>https://engineeringblog.yelp.com/2020/10/improving-the-performance-of-the-prometheus-jmx-exporter.html</guid>
      <pubDate>Fri, 02 Oct 2020 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Introducing Yelp's Machine Learning Platform]]></title>
      <description><![CDATA[<p>Understanding data is a vital part of Yelp’s success. To connect our consumers with great local businesses, we make millions of recommendations every day for a variety of tasks like:</p><ul><li>Finding you immediate quotes for a plumber to fix your leaky sink</li>
<li>Helping you discover which restaurants are open for delivery right now</li>
<li>Identifying the most popular dishes for you to try at those restaurants</li>
<li>Inferring possible service offerings so business owners can confidently and accurately represent their business on Yelp</li>
</ul><p>In the early days of Yelp circa 2004, engineers painstakingly designed heuristic rules to power recommendations like these, but turned to machine learning (ML) techniques as the product matured and our consumer base grew. Today there are hundreds of ML models powering Yelp in various forms, and ML adoption continues to accelerate.</p><p>As our ML adoption has grown, our ML infrastructure has grown with it. Today, we’re announcing our ML Platform, a robust, full-featured collection of systems for training and serving ML models built upon open source software. In this initial blog post, we will be focusing on the motivations and high-level design. We have a series of blog posts lined up to discuss the technical details of each component in greater depth, so check back regularly!</p><h2 id="yelps-ml-journey">Yelp’s ML Journey</h2><p>Yelp’s first ML models were concentrated within a few teams, each of which created custom training and serving infrastructure. These systems were tailored towards the challenges of their own domains, and cross-pollination of ideas was infrequent. Owning an ML model was a heavy investment in terms of both modeling and infrastructure maintenance.</p><p>Over several years, each system was gradually extended by its team’s engineers to address increasingly complex scope and tighter service level objectives (SLOs). The operational burden of maintaining these systems took a heavy toll, and drew ML engineers’ focus away from modeling iterations or product applications.</p><p>A few years ago, Yelp created a Core ML team to consolidate our ML infrastructure under centrally supported tooling and best practices. The benefits:</p><ol><li>Centrally managed systems for ML workflows would enable ML developers to focus on the product and ML aspects of their project without getting bogged down by infrastructure.</li>
<li>By staffing our Core ML team with infrastructure engineers, we could provide new cutting edge capabilities that ML engineers might lack expertise to create or maintain.</li>
<li>By consolidating systems we could increase system efficiency to provide a more robust platform, with tighter SLOs and lower costs.</li>
</ol><p>Consolidating systems for a topic as broad as ML is daunting, so we began by deconstructing ML systems into three main themes and developed solutions within each: interactive computing, data ETL, and model training/serving. The approach has worked well, and allowed teams to migrate portions of their workflows on to Core ML tooling while leaving other specialized aspects of their domain on legacy systems as needed.</p><p>In this blogpost, I’ll discuss how we architected our model training and serving systems into a single, unified model platform.</p><h2 id="yelps-ml-platform-goals">Yelp’s ML Platform Goals</h2><p>At a high level, we have a few primary goals for our ML Platform:</p><ul><li>Opinionated APIs with pre-built implementations for the common cases.</li>
<li>Correctness and robustness by default.</li>
<li>Leverage open source software.</li>
</ul><h3 id="opinionated-apis">Opinionated APIs</h3><p>Many of Yelp’s ML challenges fall into a limited set of common cases, and for these we want our ML Platform to enforce Yelp’s collective best practices. Considerations like metadata logging, model versioning, reproducibility, etc. are easy to overlook but invaluable for long-term model maintenance. Instead of requiring developers to slog through all of these details, we want our ML Platform to abstract and apply best practices by default.</p><p>Beyond codifying our ML workflows, opinionated APIs also enable us to streamline model deployment systems. By focusing developers into narrower approaches, we can support automated model serving systems that allow developers to productionize their model via a couple of clicks on a web UI.</p><h3 id="correctness-and-robustness-by-default">Correctness and robustness by default</h3><p>One of the most common pain points of Yelp’s historical ML workflows was system verification. Ideally, the exact same code used to train a model should be used to make predictions with the model. Unfortunately, this is often easier said than done – especially in a diverse, large-scale, distributed production environment like Yelp’s. We usually train our models in Python but might deploy the models via Java, Scala, Python, inside databases, etc.</p><p>Even the tiniest inconsistencies can make huge differences for production models. E.g., we encountered an issue where 64-bit floats were unintentionally used by an XGBoost booster for predictions (XGBoost only uses 32-bit floats). The slight floating-point differences when numerically encoding an important categorical variable resulted in the model giving approximately random predictions for 35% of instances!</p><p>Tolerating sparse vector representations, missing values, nulls, and NaNs also requires special consideration, especially when different libraries and languages have differing expectations for client-side pre-processing on these issues. 
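The 64-bit/32-bit mismatch above is easy to reproduce in a few lines. The sketch below (not Yelp’s actual encoding code) round-trips a value through a 32-bit representation and shows how a tree-split comparison can flip depending on which precision the caller used:

```python
import struct

def to_f32(x):
    """Round-trip a Python float (64-bit) through a 32-bit representation."""
    return struct.unpack("f", struct.pack("f", x))[0]

# A split threshold as a 32-bit engine stores it, and the same feature
# value accidentally left at 64-bit precision by the caller.
threshold = to_f32(0.1)   # 0.10000000149011612 when widened back to 64 bits
value_f64 = 0.1

# The comparison flips depending on the value's precision: the 64-bit 0.1
# sits just below the 32-bit threshold, while the correctly converted
# value is exactly equal to it.
goes_left_f64 = value_f64 < threshold          # True
goes_left_f32 = to_f32(value_f64) < threshold  # False
```

A datapoint that should land exactly on a split boundary thus takes different branches depending on the caller’s precision, which is how a seemingly tiny inconsistency turns into near-random predictions.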
E.g., some libraries treat zero as missing whereas others have a special designation. It is extremely complicated for developers to think through these implementation details, let alone recognize if a mistake has occurred.</p><p>When designing our ML Platform, we’ve adopted a test-driven development mindset. All of our code has a full suite of end-to-end integration tests, and we run actual Yelp production models and datasets through our tests to ensure the models give exactly the same results across our entire ecosystem. Beyond ensuring correctness, this also ensures our ML Platform is robust enough to handle messy production data.</p><h3 id="leverage-open-source-solutions">Leverage Open Source Solutions</h3><p>ML is currently experiencing a renaissance of open source technology. Libraries like Scikit-learn, XGBoost, TensorFlow, and Spark have existed for years and continue to provide the foundational ML capabilities. But newer additions like Kubeflow, MLeap, MLflow, TensorFlow Extended, etc. have reinvented what an ML system should entail and provide ML systems with much-needed software engineering best practices.</p><p>For Yelp’s ML Platform, we recognized that any in-house solution we might construct would be quickly surpassed by the ever-increasing capabilities of these open source projects. Instead, we selected the open source libraries best aligned with our needs and constructed thin wrappers around them to allow easier integrations with our legacy code. In cases where open source tools lack capabilities we need, we’re contributing solutions back upstream.</p><h2 id="ml-platform-technological-overview">ML Platform Technological Overview</h2><p>In future blog posts, we’ll be discussing these systems in greater detail, so check back soon. 
For now, I’ll just give a brief overview of the key tech choices and a model’s life cycle within these systems.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-07-01-ML-Platform-Overview/model_platform_overview.png" alt="" /></div><h3 id="mlflow-and-mleap">MLflow and MLeap</h3><p>After evaluating a variety of options, we decided on <a href="https://mlflow.org/">MLflow</a> and <a href="https://mleap-docs.combust.ml/">MLeap</a> as the skeleton of our platform.</p><p>MLflow’s goal is to make managing ML lifecycles simpler, and contains various subcomponents each aimed at different aspects of ML workflows. For our ML Platform, we especially focused on the MLflow Tracking capabilities. We automatically log parameters and metrics to our tracking server, and then developers use MLflow’s web UI to inspect their models’ performance, compare different model versions, etc.</p><p>MLeap is a serialization format and execution engine, and provides two advantages for our ML Platform. Firstly, MLeap comes out of the box with support for Yelp’s most commonly used ML libraries: Spark, XGBoost, Scikit-learn, and Tensorflow – and additionally can be extended for custom transformers to support edge cases. Secondly, MLeap is fully portable, and can run inside any JVM-based system including Spark, Flink, ElasticSearch, or microservices. 
Taken together, MLeap provides a single solution for our model serving needs like robustness/correctness guarantees and push-button deployment.</p><h3 id="typical-code-flow-in-our-ml-platform">Typical Code Flow in our ML Platform</h3><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-07-01-ML-Platform-Overview/model_training_flow.png" alt="Offline Code Flow for Training a Model in our ML Platform" /><p class="subtle-text"><small>Offline Code Flow for Training a Model in our ML Platform</small></p></div><p>Developers begin by constructing a training dataset, and then define a pipeline for encoding and modeling their data. Since Yelp models typically utilize large datasets, Spark is our preferred computational engine. Developers specify a Spark ML Pipeline for preprocessing, encoding, modeling, and postprocessing their data. Developers then use our provided APIs to fit and serialize their pipeline. Behind the scenes, these functions automatically interact with the appropriate MLflow and MLeap APIs to log and bundle the pipeline and its metadata.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-07-01-ML-Platform-Overview/model_serving_flow.png" alt="Online Code Flow for Serving a Model in our ML Platform" /><p class="subtle-text"><small>Online Code Flow for Serving a Model in our ML Platform</small></p></div><p>To serve models, we constructed a thin wrapper around MLeap that is responsible for fetching bundles from MLflow, loading the bundle into MLeap, and mapping requests into MLeap’s APIs. We created several deployment options for this wrapper, which allows developers to execute their model as a REST microservice, Flink stream processing application, or hosted directly inside Elasticsearch for ranking applications. 
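In spirit, that serving wrapper behaves like the sketch below. All names and the registry lookup are illustrative stand-ins, not MLeap’s or MLflow’s real APIs: resolve a configured model id to a loaded model once, then map each incoming request into the model’s predict call.

```python
class ModelServer:
    """Illustrative sketch of a thin model-serving wrapper. The registry
    dict stands in for fetching a bundle from a tracking server and
    deserializing it into an execution engine."""

    def __init__(self, model_id, registry):
        self.model = registry[model_id]  # stands in for fetch + load

    def predict(self, request):
        # Order the request's named features the way the model expects.
        features = [request[name] for name in self.model["feature_names"]]
        return self.model["predict_fn"](features)

# Toy "bundle": a linear scorer over two features.
registry = {
    "run-42": {
        "feature_names": ["review_count", "rating"],
        "predict_fn": lambda f: 0.3 * f[0] + 0.7 * f[1],
    }
}
server = ModelServer("run-42", registry)
score = server.predict({"rating": 4.0, "review_count": 10.0})  # ≈ 5.8
```

Because the wrapper only needs a model id and a feature dict, the same loading-and-dispatch logic can be reused across very different hosts (a REST service, a Flink job, or an Elasticsearch plugin).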
In each deployment option, developers simply configure the MLflow id for the models they want to host, and then can start sending requests!</p><h2 id="whats-next">What’s Next?</h2><p>We’ve been rolling out our ML Platform incrementally, and observing enthusiastic adoption by our ML practitioners. The ML Platform is full featured, but there are some improvements we have on our roadmap.</p><p>First up is expanding the set of pre-built models and transformers. Both MLflow and MLeap are general purpose and allow full customization, but doing so is sometimes an involved process. Rather than requiring developers to learn the internals of MLflow and MLeap, we’re planning to extend our pre-built implementations to cover more of Yelp’s specialized use cases.</p><p>We’d also like to integrate our model serving systems with Yelp’s A/B experimentation tools. Hosting multiple model versions on a single server is available now, but currently relies on clients to specify which version they want to use in each request. However, we could further abstract this detail and have the serving infrastructure connect directly to the experimentation cohorting logic.</p><p>Building on the above, we would like to have the actual observed events feed back into the system via Yelp’s real-time streaming infrastructure. By joining the observed events with the predicted events, we can monitor ML performance (for different experiment cohorts) in real-time. This enables several exciting properties like automated alerts for model degradation, real-time model selection via reinforcement learning techniques, etc.</p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/07/ML-platform-overview.html</link>
      <guid>https://engineeringblog.yelp.com/2020/07/ML-platform-overview.html</guid>
      <pubDate>Wed, 01 Jul 2020 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[How businesses have reacted to COVID-19 using Yelp features]]></title>
      <description><![CDATA[<p>Yelp periodically releases an open, all-purpose dataset for learning. The dataset is a subset of our businesses, reviews, and user data to inform government policy, academic research, and business strategy, among other uses. It has provided opportunities including teaching students about databases, helping others study natural language processing, sampling production data while learning to create mobile apps, and discovering compelling <a href="https://www.yelp.com/dataset/challenge/winners">research findings</a>. <a href="https://www.yelp.com/dataset">Our most recent dataset</a> was published in March 2020.</p><p>Businesses everywhere are adapting to the effects of the Coronavirus and have been using Yelp <a href="https://blog.yelp.com/2020/05/supporting-local-businesses-and-the-yelp-community-with-new-products-and-features">features</a> to stay connected with their customers. To this end, we’re releasing an addendum dataset including the following components, as of June 10, 2020:</p><ul><li><a href="https://blog.yelp.com/2020/03/new-page-features-to-communicate-covid-19-response">COVID-19-related business highlights</a></li>
<li>Restaurants with delivery/takeout enabled</li>
<li>Restaurants partnered with Grubhub</li>
<li>Businesses with <a href="https://biz.yelp.com/support/call_to_action">Call to Action</a> buttons enabled</li>
<li>Businesses that still have <a href="https://blog.yelp.com/2016/04/yelp-request-a-quote">Request A Quote</a> enabled</li>
<li>Businesses that created a <a href="https://blog.yelp.com/2020/04/coronavirus-alert-banner-examples-for-business-pages">custom page banner</a> during COVID-19</li>
<li>Temporary closures</li>
<li><a href="https://blog.yelp.com/2020/05/supporting-local-businesses-and-the-yelp-community-with-new-products-and-features">Virtual Services offered</a></li>
</ul><p>We hope researchers, academics, and any interested parties will utilize this new data, along with our <a href="https://www.yelpeconomicaverage.com/yelp-coronavirus-economic-impact-report.html">most recent economic impact report</a>, to investigate and further understand the broad-ranging effects of the coronavirus pandemic. Download the new Yelp dataset <a href="https://www.yelp.com/dataset/download">here</a>.</p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/06/how-businesses-have-reacted-to-covid-19-using-yelp-features.html</link>
      <guid>https://engineeringblog.yelp.com/2020/06/how-businesses-have-reacted-to-covid-19-using-yelp-features.html</guid>
      <pubDate>Mon, 15 Jun 2020 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[dataloader-codegen: Autogenerate DataLoaders for your GraphQL Server!]]></title>
      <description><![CDATA[<p><a href="https://engineeringblog.yelp.com/2020/04/open-sourcing-dataloader-codegen.html">Read the full post on the Yelp Engineering Blog</a>.</p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/04/open-sourcing-dataloader-codegen.html</link>
      <guid>https://engineeringblog.yelp.com/2020/04/open-sourcing-dataloader-codegen.html</guid>
      <pubDate>Wed, 08 Apr 2020 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[An Ever Evolving Company Requires an Ever Evolving Communication Plan]]></title>
      <description><![CDATA[<p>It’s 2014 and your teams are divided by platform, something like: Web, Mobile Web, Android, and iOS.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/an-ever-evolving-company-requires-an-ever-evolving-communication-plan/Original_Org.png" alt="" /></div><p>In order to launch features, product managers jump from platform to platform and teams move fast. Really fast. Lines of code in each repository increase to the point where you now name them “monoliths.” A few engineers maintain these monoliths when they need to, but no one is solely dedicated to the task. Engineers are distributed by platform, so communication about when to maintain the monoliths is easy, but this presents another problem.</p><p><strong>Can you continue to ship code efficiently if you depend entirely on these monoliths?</strong> It turns out that as you increase the number of developers and the size of the code base, the number of rollbacks and unscheduled mobile point releases also increases. At first you notice only a few rollbacks, but as your team grows, you start to expect that every push will result in a rollback. This is not the typical “up and to the right” graph that companies look for.</p><p><strong>Since rollbacks sound like a blocker, we’ve come up with an alternative: microservices. Then another: product teams.</strong></p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/an-ever-evolving-company-requires-an-ever-evolving-communication-plan/Updated_Org.png" alt="" /></div><p>Now the company can scale both infrastructure and team organization. Product teams have a common set of infrastructure; the button used on the Growth team is the same button used on the Contributions team. Function-based (core) teams spin up. They work on the parts that individual maintainers worked on in the days of the monolith. They’re dedicated to making sure that, in the long term, we’re coding sustainably. 
Communication becomes harder. In fact, communication complexity continues to increase. Core teams used to do all the changes needed for maintenance/infrastructure upgrades, but the organization has gotten so large they need to rely on product teams to do the bulk of the work. Core teams generate a list of maintenance items that product teams need to work on, but product teams have to concentrate on adding new products.</p><p><strong>How do we prioritize work?</strong> Before we prioritize work, we need to identify who’s responsible for what. To tackle this problem, Core teams create tooling. Ownership becomes more defined with added metadata to “entities,” an abstract term used to describe things like code and alerting. All this ownership becomes shareable via the ownership service, and, we can now track migrations across the engineering organization with a tool called “migration-status.” We start by defining migrations from a “core team” perspective, but also have migrations from other infrastructure teams. Now that product teams are multi-disciplinary, we start to bombard them with an increasing number of messages to upgrade/migrate their infrastructure. Communication complexity increases and efficiency decreases.</p><p>We start thinking of a way to tie together priorities from multiple teams. We need a solution that has a global view and seeks to control communication complexity. Just like how a notification platform for your users needs to figure out the right messages to send, we need a tool to surface the right reminders to the right teams. So, which messages are sent to which users?</p><p><strong>Over the next few blog posts, we’ll walk you through what the Engineering Effectiveness Metrics (EE Metrics) Platform is and how we use it to reduce communication complexity.</strong> The first blog post will dive into our “Ownership” service. We’ll be talking about what it is, how we use it, and the value that it brings to our engineering organization. 
The second post will cover how we use the EE Metrics tool to increase awareness of developer velocity and code quality and to improve prioritization of critical migrations for product teams. We do all of these things while maintaining a safe space for teams and individuals.</p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/03/an-ever-evolving-company-requires-an-ever-evolving-communication-plan.html</link>
      <guid>https://engineeringblog.yelp.com/2020/03/an-ever-evolving-company-requires-an-ever-evolving-communication-plan.html</guid>
      <pubDate>Fri, 06 Mar 2020 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Supporting Spark as a First-Class Citizen in Yelp’s Computing Platform]]></title>
      <description><![CDATA[<p>Yelp extensively utilizes distributed batch processing for a diverse set of problems and workflows. Some examples include:</p><ul><li>Computation over Yelp’s review corpus to identify restaurants that have great views</li>
<li>Training ML models to predict personalized business collections for individual users</li>
<li>Analytics to extract the most in-demand service offerings for Request a Quote projects</li>
<li>On-demand workloads to investigate surges in bot traffic so we can quickly react to keep Yelp safe</li>
</ul><p>Over the past two years, Yelp engineering has undertaken a series of projects to consolidate our batch processing technologies and standardize on Apache Spark. These projects aimed to simultaneously accelerate individual developer workflows by providing easier access to powerful APIs for distributed computing, while also making our systems more robust, performant, and cost efficient on a macro scale.</p><h2 id="background">Background</h2><p>Throughout Yelp’s history, our batch processing framework of choice was MapReduce, executed via Amazon Elastic MapReduce (AWS EMR). We even constructed our own open source framework, <a href="https://github.com/Yelp/mrjob">mrjob</a>, which abstracts the details of the underlying MapReduce execution infrastructure away from developers. This way they could focus on the application-specific portions of their workflow instead, like defining their Map and Reduce steps. This framework has served us well over the years, and every day our production environment executes hundreds of mrjobs.</p><p>Over time though, Yelp developers were increasingly drawn towards Apache Spark. The foremost advantage of Spark is in-memory computing, but additional advantages include a more expressive API and large library of open source extensions for specialized workloads. This API flexibility makes it easier for developers to write their distributed processing workloads and results in higher-quality code that is both more performant and easier to maintain. However, without a well-supported backend, provisioning Spark resources was an intensive process that made deploying Spark jobs to production a challenge and all but eliminated Spark from contention for ad hoc workflows en masse.</p><p>Better support for Spark seemed like a promising direction, so our first step was to add Spark support into Yelp’s mrjob package (seen in mrjob v0.5.7). 
This enabled developers to write Spark code using the familiar mrjob framework and execute their Spark jobs on AWS EMR. Results from early adopters were encouraging, with one Yelp engineering team going so far as to convert over 30 of their legacy MapReduce mrjob batches into Spark mrjob batches, resulting in an aggregated 80% runtime speedup and 50% cost savings! Clearly, Spark was a direction that could add substantial value to Yelp’s distributed computing platform.</p><p>Running Spark mrjob batches on AWS EMR was viable for many production batch use cases, but also demonstrated a few problems. Firstly, it was painful to connect to the rest of Yelp’s infrastructure, and consequently workloads had to operate in isolation (e.g., they couldn’t make requests to other Yelp services). Secondly, it was painful to use for ad hoc workloads since it required launching an AWS EMR cluster on demand, which could take up to 30 minutes between provisioning and bootstrapping.</p><p>Integrating Spark as a first-class citizen in Yelp’s computing platform as a service, <a href="https://github.com/Yelp/paasta">PaaSTA</a>, enabled us to ease these pain points while also inheriting all of PaaSTA’s capabilities.</p><h2 id="spark-on-paasta">Spark on PaaSTA</h2><p>PaaSTA is Yelp’s de facto platform for running services and containerized batches. At its core it’s (currently) built on <a href="http://mesos.apache.org/">Apache Mesos</a>. Spark has native support for Mesos, but was a new framework for PaaSTA, which had previously only executed <a href="https://mesosphere.github.io/marathon/">Marathon</a> for long-running services, and Yelp’s in-house batch scheduling system, <a href="https://github.com/Yelp/Tron">Tron</a>, for containerized batches. 
To set up Spark as a framework in PaaSTA, we needed to select several configuration settings and design the interfaces Spark on PaaSTA would expose to Yelp developers.</p><p>We elected to run the Spark driver as a Mesos framework in a Docker container using the Spark client deploy mode. Since losing the driver is catastrophic for Spark clusters, we constrained Spark drivers to only run on a dedicated Auto Scaling Group of on-demand EC2 instances. On the other hand, Spark’s resilient data model provides automatic recovery from executor loss, allowing us to run Spark executors in Docker containers on a cluster of EC2 spot instances. For simplicity, we configured Spark on PaaSTA such that the Spark driver and executors used the same Docker image pulled from our internal Docker registries.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-03-02-spark-on-paasta/spark_diagram.png" alt="" /></div><p>Yelp has two primary use cases for Spark: offline batches and ad hoc interactive computing. To serve these needs, we created two APIs for Spark on PaaSTA:</p><ul><li>A command line interface that developers use to schedule Spark batches. Behind the scenes, this interface injects the necessary Spark configuration constraints to connect to PaaSTA’s Mesos masters, provision executors on PaaSTA’s Mesos Agents, pull images from the appropriate Docker registry, and create a SparkSession object.</li>
<li>A Python package that developers invoke from arbitrary Python code (e.g., Jupyter notebooks). Much like our command line interface, this package injects the necessary Spark configuration constraints and then returns the resulting SparkSession object.</li>
</ul><p>Both of these APIs allow developers to have full control over how to configure their Spark cluster and can provide overrides for any of Spark’s configuration settings (e.g., executor memory, max cores, driver results size, etc.). PaaSTA will then use those values to provision and configure a Spark cluster as requested.</p><p>Beyond Spark configuration settings, Yelp developers also have the ability to specify the Docker image that Spark uses. This enables developers to easily include custom code (e.g., from their production service) in their Spark workflows. To reduce developer overhead, we’ve constructed an internal debian package that developers can install into their Docker image to automatically include many Spark extensions that are valuable for Yelp workflows, like hadoop-aws, spark-avro, etc.</p><h2 id="isolating-spark-jobs-with-a-dedicated-mesos-pool">Isolating Spark Jobs with a Dedicated Mesos Pool</h2><p>Initially, we provisioned Spark frameworks on the same Mesos pools as Yelp’s other Mesos frameworks. However, we quickly recognized that Spark workloads have drastically different characteristics than the long-running Marathon services our Mesos pools were configured to support. Two differences in particular convinced us to create a dedicated Mesos pool for Spark jobs.</p><p>Firstly, Spark workflows are stateful. Yelp heavily uses AWS EC2 spot instances to drive cost savings for our computing platforms, which means an instance can be reclaimed by AWS at any time. Moreover, the PaaSTA cluster autoscaler dynamically scales the Mesos cluster to maintain a desired utilization, and can kill service instances on underutilized Mesos agents without warning. Since Yelp’s Marathon services are stateless and frequently have multiple concurrent instances, these abrupt disruptions are mostly inconsequential. 
While Spark can recover from losing an executor, losses can cause cached RDDs to be recomputed, thereby increasing load on upstream datastores and degrading developer experience. With a dedicated pool, we can use AWS instance types with lower reclamation rates, and a specially tailored Spark autoscaler (discussed in the next section) can minimize the probability of executor loss caused by PaaSTA.</p><p>Secondly, Spark workflows are more memory-intensive than service workloads. While one of Spark’s primary advantages over MapReduce is in-memory computing, that advantage is only realized if the Spark cluster has sufficient memory to hold the necessary data. At Yelp, it’s common for our developers to request terabytes of memory in aggregate for a single Spark cluster alongside several hundred CPUs. That memory-to-CPU ratio is substantially different from stateless Marathon services, which are typically CPU-bound with low memory footprints. With a dedicated pool, we’re able to populate Spark frameworks’ Mesos agents with AWS instances that have higher memory capacity and SSD drives in order to deliver a more cost-effective system with higher resource utilization.</p><h2 id="autoscaling-spark">Autoscaling Spark</h2><p>Yelp has been autoscaling our PaaSTA clusters for several years, reducing infrastructure costs by only running as many servers as necessary. We generally use a fairly standard reactive autoscaling algorithm that attempts to keep the most utilized resource (e.g., CPUs or memory) at a desired level (around 80%). For example, if 90 out of 100 CPUs in the cluster were in use, it would add another ~12 CPUs to the cluster to bring CPU utilization back to 80%. If, later on, only 80 of the now 112 CPUs are utilized, it will downscale the cluster back to 100 CPUs. 
This approach works well for most workloads we run at Yelp, the majority of which are long-running services whose load varies gradually throughout the day in proportion to web traffic.</p><p>Spark workloads, however, do not have gradually varying needs. Instead, the typical workload makes a large, sudden request for hundreds or thousands of CPUs, abruptly returning these resources a few hours (or even minutes) later when the workload completes. This causes sudden load spikes on the cluster, which is problematic for a reactive autoscaling approach for two reasons.</p><p>Firstly, a reactive autoscaling approach can only trigger scaling actions when the cluster is already over or under utilized. This is not problematic for gradually shifting workloads since the extra load usually fits into the cluster headroom (i.e., the 20% of CPUs the autoscaler keeps unallocated) while additional capacity is added. However, large Spark jobs can easily exceed the cluster headroom, preventing the workload and anything else on the cluster from obtaining additional resources until more machines are provisioned.</p><p>Secondly, relying on resource utilization obscures the true quantity of resources needed. If a cluster with 100 CPUs is at 100% utilization and our desired utilization is 80%, then the aforementioned reactive autoscaling strategy will provision 25 additional CPUs regardless of how many CPUs are needed. While it’s possible that 25 CPUs will be sufficient, if the Spark workload requested a thousand CPUs then it will take 11 autoscaling cycles (!) to reach the desired capacity, impeding workloads and causing developer frustration. By relying only on current utilization, we have no way to distinguish between these two cases. 
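The cycle-count arithmetic above can be sketched as follows (an illustration, not Clusterman or PaaSTA code): while the workload keeps the cluster saturated, each reactive cycle can only grow capacity by a factor of 1/0.8 = 1.25.

```python
def cycles_to_reach(needed_cpus: float, capacity: float = 100.0,
                    desired_util: float = 0.80) -> int:
    """Reactive autoscaling cycles needed while the workload saturates the cluster."""
    cycles = 0
    while capacity < needed_cpus:
        capacity /= desired_util  # each cycle: capacity grows 1.25x
        cycles += 1
    return cycles

# A request for 1,000 CPUs against a 100-CPU cluster takes 11 cycles.
```
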
See below for an example in which it took our reactive algorithm four cycles and almost one and a half hours to scale the cluster from 100 to 500 CPUs for a Spark job.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-03-02-spark-on-paasta/reactive_scaleup.png" alt="" /></div><p>To solve these problems, we turned to Clusterman, our modular cluster autoscaler, which makes it simple to write custom autoscaling code for specific pools of machines. (You can check out our <a href="https://engineeringblog.yelp.com/2019/02/autoscaling-mesos-clusters-with-clusterman.html">blogpost</a> to learn more about it!) First, we extended the APIs that developers use to start Spark on PaaSTA jobs to send the Spark workflow’s resource needs to Clusterman. We then created a custom Clusterman signal for Spark that looks at these reported resource needs and compares them to the list of Spark frameworks currently registered with our Mesos clusters. If the framework associated with a given resource request is still running or we’re within a several minute grace period, that resource request is included in Clusterman’s allocation target. Because Clusterman knows the full resource requirements of each job as soon as it starts, we can make sure that Spark on PaaSTA jobs wait as little as possible for resources, regardless of quantity. The graph below shows our new approach performing the same task as the previous one in 15 minutes instead of one and a half hours!</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-03-02-spark-on-paasta/clusterman_scaleup.png" alt="" /></div><h2 id="spark-on-paasta-results">Spark on PaaSTA Results</h2><p>Over the past two years, we’ve seen accelerating adoption of Spark on PaaSTA among Yelp developers. Roughly 80% (and climbing) of all scheduled batches are now running Spark on PaaSTA instead of legacy mrjob on AWS EMR! 
In addition, Yelp developers create hundreds of Spark clusters every day for their ad hoc workloads.</p><p>Aside from improving job performance and developer experience, moving to Spark on PaaSTA has also resulted in meaningful cost savings. As mentioned earlier, our legacy mrjob package runs on AWS EMR. Since EMR is a managed platform on top of EC2, AWS bills for EMR by taking the underlying EC2 cost and adding a premium. In essence, you can think of EMR as having a usage tax, with the EMR tax rate equal to the EMR premium divided by the EC2 cost. Figure 2 shows the EMR tax rate for different configurations of M5 and R5 instances. In many cases, the EMR tax is a substantial portion of overall EMR costs, and since Yelp uses spot instances heavily, our aggregate savings from moving Spark jobs from EMR to PaaSTA are over 30%.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-03-02-spark-on-paasta/emr_tax_rate.png" alt="" /></div><p>Given the success of early Spark adoption, an organizational goal for Yelp in 2019 was to migrate all batch processing workloads to Spark on PaaSTA. As you might expect, migrating hundreds of legacy batches (many of which have been running without intervention for years) was a daunting endeavor. Rather than going through batches one by one, we instead migrated legacy MapReduce mrjobs en masse via an mrjob extension that wraps MapReduce code and executes it via Spark on PaaSTA. While rewriting MapReduce jobs to fully utilize Spark capabilities results in peak performance, we’ve observed significant wins just from running the existing Map and Reduce steps in Spark instead of MapReduce.</p><h2 id="conclusions">Conclusions</h2><p>Looking back on our journey, adding Spark support to our computing platform has gone fairly smoothly. Nevertheless, there are still a few things we want to improve.</p><p>The first is system efficiency. 
By default, Mesos uses round robin task placement, which spreads Spark executors across many Mesos agents and results in most agents containing executors for many Spark frameworks. This causes problems for cluster downsizing and can yield low cluster utilization. Instead, we would prefer to pack executors onto fewer Mesos agents. We are currently experimenting with a patch to Spark’s Mesos scheduler that greedily packs executors onto hosts. We also plan to investigate Spark’s Dynamic Resource Allocation mode as a further improvement to cluster efficiency.</p><p>The second is stability. We’ve made a deliberate decision to run Spark on spot EC2 instances to benefit from their lower cost, but this choice also means that executor loss is possible. While Spark can recover from this, these events can lead to substantial recomputations that disrupt developer workflows; in the worst cases, a job gets stuck in a perpetual crash-recomputation loop that has to be manually terminated. These issues are magnified as our Spark clusters continue to grow: clusters with thousands of CPUs and terabytes of memory are likely to experience at least one executor loss. Some solutions we plan to explore include aggressive checkpointing and/or selectively utilizing on-demand EC2 instances.</p><p>Finally, we’re in the process of converting our Spark deployment from Mesos to Kubernetes, with the primary advantage being that Kubernetes provides additional control layers for us to tune cluster stability, responsiveness, and efficiency. 
These changes are being made as part of PaaSTA itself, meaning that we can change the backend infrastructure without developers needing to alter their Spark usage!</p><p>We’re continuing to invest in Spark as a premier computing engine at Yelp, so stay tuned for further updates!</p><h2 id="acknowledgments">Acknowledgments</h2><p>Special thanks to everyone on the Core ML and Compute Infrastructure teams for their tireless contributions to bring Spark to all of Yelp!</p><div class="island job-posting"><h3>Become a Distributed Systems Engineer at Yelp</h3><p>Interested in designing, building, and deploying core infrastructure systems? Apply to become a Distributed Systems Engineer today.</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/a368cc58-18d4-4d0a-a58e-44b9da767322?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/03/spark-on-paasta.html</link>
      <guid>https://engineeringblog.yelp.com/2020/03/spark-on-paasta.html</guid>
      <pubDate>Mon, 02 Mar 2020 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Accelerating Retention Experiments with Partially Observed Data]]></title>
      <description><![CDATA[<p>Here at Yelp, we generate business wins and a better platform by running A/B tests to measure the revenue impact of different user and business experience interventions. Accurately estimating key revenue indicators, such as the probability a customer retains at least \(n\)-days (\(n\)-day retention) or the expected dollar amount a customer spends over their first \(n\) days (\(n\)-day spend) is core to this experimentation process.</p><p>Historically at Yelp, \(n\)-day customer or user retention was typically estimated as the proportion of customers/users we observed for more than \(n\) days who retained more than \(n\) days. Similarly, \(n\)-day spend was estimated as the average amount spent over the first \(n\) days since experiment cohorting by businesses we have observed for at least \(n\) days.</p><p>Recently, we transitioned to using two alternative statistical estimators for these metrics: the Kaplan-Meier estimator and the mean cumulative function estimator. These new approaches consider censored data, i.e. partially observed data, like how long a currently subscribed advertiser will retain as a customer. Accordingly, they offer several benefits over the previous approaches, including higher statistical power, lower estimate error, and more robustness against within-experiment seasonality.</p><p>By performing Monte-Carlo simulations [<a href="https://engineeringblog.yelp.com#1">1</a>, Chapter 24], we determined that using these estimators allowed us to read A/B experiment metrics a fixed number of days earlier after cohorting ends without any drop in statistical power. This amounted to a 12% to 16% reduction in overall required cohorting and observation time, via a 25% to 50% reduction in the time used to observe how people respond to the A/B experiences. 
Altogether, this improved our ability to iterate on our product.</p><p>The value of a Yelp customer can be quantified along two primary, informative dimensions: how long a user / business remains active / subscribed in our system (known as retention), and the total dollar amount they generate over their lifetime (known as cumulative spend). When we experiment on different user / business experiences, we make a point of estimating the effect of these changes on retention and spend metrics before we make a final ship decision.</p><p>As a proxy, dollars might sometimes be replaced with less noisy units like ad clicks, page views, etc., but for the purposes of this blog post we will focus on altering a business experience and analyzing \(n\)-day retention and spend. The conclusions carry over equally well to experimentation settings that deal either with users or with similarly defined proxy metrics.</p><p>The diagram below illustrates the typical lifecycle of an A/B test focused on retention and spend.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2020-02-19-accelerating-retention-experiments-with-partially-observed-data/life_of_ab_test_diagram.png" alt="The life of a Yelp A/B test." /></p><p>Depending on the type of experience change and the time window used for measurement, the cohorting and observing phases can in many instances take the most time of the whole experimentation pipeline. As such, the acceleration of the observing phase detailed here can provide improvements in our ability to iterate on our product.</p><p>One possible retention measure is “what percentage of those in cohort \(C\) who subscribed to product \(P\) at any point during the experiment went on to retain for more than \(n\) days,” where \(C\), \(P\), and \(n\) are parameters the experimenter can adjust. 
Yelp previously computed this \(n\)-day retention measure within each experiment cohort as follows: of the customers who started purchasing product \(P\) during the experiment who we have observed for at least \(n\) days since their initial purchase, report the proportion who are still subscribed at day \(n\).</p><p>Our measure for dollars spent is analogous: we measure \(n\)-day spend by considering “on average, how many dollars do those in cohort \(C\) spend on products \(P_1, P_2, \ldots, P_k\) over their first \(n\) days after being cohorted.” Here, \(C\), the product basket \(P_1, P_2, \ldots, P_k\), and \(n\) are freely adjustable as above.</p><p>Standard error estimates, which rely on the Central Limit Theorem [<a href="https://engineeringblog.yelp.com#1">1</a>, Theorem 6.16], are always reported for both of these metric estimators as well. These estimators are unbiased and consistent (as the number of people observed for at least \(n\) days grows), assuming that the retention of customers is independent of the relative time they enter the experiment.</p><p>There are a number of potential concerns that can arise from using the estimators mentioned above.</p><p><strong>Variance:</strong> One of the larger concerns is that these estimators can have large variance if only a small number of customers in a cohort have been fully observed during the experiment period. Let’s suppose we want to cohort individuals into an experiment for 70 days and are interested in estimating the 60-day retention of each cohort in the experiment. At day 75, the uncensored estimators above only access the users who arrived during the first 15 days of experiment cohorting, even though we have been cohorting users for 5 times as long. Therefore, unless the sample size is very large or the underlying retention/spend distribution is very concentrated, our estimator will have large variance at this point. 
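For concreteness, the uncensored (status quo) retention estimator can be sketched as below; the data encoding is illustrative rather than Yelp’s actual schema:

```python
def status_quo_retention(observed_days, churn_day, n):
    """Uncensored n-day retention estimate.

    observed_days[i]: days we have observed customer i since their purchase.
    churn_day[i]: day on which customer i churned, or None if still subscribed.
    Only customers observed for at least n days enter the sample.
    """
    sample = [c for o, c in zip(observed_days, churn_day) if o >= n]
    return sum(1 for c in sample if c is None or c > n) / len(sample)

# Three customers, 60-day retention: only the first two are observed for at
# least 60 days, and one of them churned on day 30, so the estimate is 1/2.
```
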
This forces us to wait more days after the experiment ends to get a genuine metric read or detect a bona-fide difference in retention or spend of the cohorts.</p><p><strong>Seasonality:</strong> In addition to the variance issue, the retention or spend characteristics of individuals cohorted into an experiment may vary with the experiment runtime. In this situation, the uncensored estimators will in general be biased away from the true population mean retention and spend over the experiment window until after the observation phase is fully completed. For example, if we run a revenue experiment that starts right before Christmas, the 60-day retention estimate 75 days after the hypothetical experiment above was started would be heavily biased towards HVAC contractors and retail stores instead of the large fraction of restaurants that were closed over the holidays. It is feasible that we could declare a difference between the two populations at day 75 and make a conclusion about the experiment results, even though the underlying estimates are biased and reflect only a subset of the population.</p><p>Our solution to a more effective retention estimate, which can mitigate the problems mentioned above, is based on the so-called Kaplan-Meier estimator of the “survival curve.” The survival curve \(S(t)\) is a function of time \(t\) which returns the probability that someone would retain for at least \(t\) days after subscribing. Accordingly, if we had access to the population survival curve, \(S(n)\) would return the proportion of businesses in our population who would retain at least \(n\) days, precisely our retention metric. The Kaplan-Meier estimator is a nonparametric estimator of the whole survival curve. 
Evaluating the estimated curve at time \(t = n\) days gives an estimate of the desired retention metric:</p><p><img src="https://engineeringblog.yelp.com/images/posts/2020-02-19-accelerating-retention-experiments-with-partially-observed-data/km_curve_example.png" alt="An example Kaplan-Meier estimate of a survival curve." /></p><p>For a coarser discretization of time than typically used, this Kaplan-Meier estimate of \(n\)-day retention first writes the \(n\)-day retention as \[S(n) = \prod_{t=1}^n \mathrm{Pr}(\text{remains subscribed through day }t |\text{ subscribed for first }t - 1\text{ days}).\] At this point, each multiplicand is estimated as \(h_t\): among the people we have observed for at least \(t\) days who were subscribed at the end of day \(t - 1\), the fraction who stayed subscribed through the end of day \(t\). The full estimate is then \[S(n) = \prod_{t=1}^n h_t.\] This reduces to the status quo version, which does not incorporate censored data, if we instead compute each \(h_t\) using only the individuals observed for at least \(n\) days, rather than the larger sample afforded by using everyone observed for at least \(t \le n\) days. Better utilization of available information can increase the precision of our estimates, increase statistical power to detect differences in cohort retentions, and mitigate sensitivity to time-dependent retention characteristics over the course of the experiment.</p><p>This estimator is a consistent estimator of the whole survival curve (computed by varying \(n\)) as both the number of individuals and the length of time we observe each of them increase [<a href="https://engineeringblog.yelp.com#2">2</a>]. 
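A minimal sketch of this discretized Kaplan-Meier computation (the data encoding is illustrative):

```python
def km_retention(duration, churned, n):
    """Discretized Kaplan-Meier estimate of n-day retention.

    duration[i]: days customer i was under observation; if churned[i] is True
    they churned on day duration[i], otherwise they are censored (still
    subscribed when we last saw them).
    """
    s = 1.0
    for t in range(1, n + 1):
        at_risk = sum(1 for d in duration if d >= t)  # observed through day t
        if at_risk == 0:
            break  # no information past this point
        churns = sum(1 for d, c in zip(duration, churned) if c and d == t)
        s *= (at_risk - churns) / at_risk  # the factor h_t
    return s
```

Unlike the status quo estimator, censored customers still contribute to every \(h_t\) up to the day we lost sight of them.
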
It is not, in general, unbiased [<a href="https://engineeringblog.yelp.com#3">3</a>], and is also affected by seasonality in the same way that the status quo estimator is affected, although in simulations seasonality had less of an effect on the estimates than with the status quo approach.</p><p>In the cumulative spend setting, we employed the mean cumulative function estimator detailed in [<a href="https://engineeringblog.yelp.com#4">4</a>]. This mean cumulative function estimator writes the total spend of a business through day \(n\) after cohorting as the sum of the spend on the first day after cohorting, the spend on the second day after cohorting, etc., all the way through the spend on the \(n\)-th day after cohorting. Each day-\(t\) spend is then estimated as the average day-\(t\) spend of people we have seen for at least \(t\) days. Since the sample size used to estimate day-1 spend is usually much greater than that used to estimate day-60 spend, this estimator can achieve greater power than a status quo estimator that restricts the sample in each day-\(t\) spend estimate to only the people observed for all \(n\) days.</p><p>This estimator is unbiased, consistent as the number of businesses seen through day \(t\) increases, and has mean squared error no greater than that of the status quo estimator. In the presence of seasonality, this estimator will be biased in the same way the status quo estimator will be, but we observe in practice that it typically has lower mean squared error despite this fact.</p><p>The variance of our estimate of expected spend over the \(n\) days following cohorting can be written as follows. 
Mathematically, if \(s_t\) is the random variable giving the distribution over dollars spent by a business throughout their \(t\)-th day after being cohorted, then the variance of this \(n\)-day spend estimate is \[\sum_{t=1}^{n}\frac{\mathrm{Var}(s_t)}{m_t} + \sum_{t\neq t'} \frac{\mathrm{Cov}(s_t, s_{t'})}{\max(m_t, m_{t'})},\] where \(m_t=|\{i:\text{individual }i\text{ observed through day }t\}|\). This can be estimated in practice by plugging in empirical unbiased estimates of \(\mathrm{Var}(s_t)\) and \(\mathrm{Cov}(s_t,s_{t'})\). The covariance terms are summed for all \(t\neq t'\) which are both no more than \(n\). This result is similar to the one presented in [<a href="https://engineeringblog.yelp.com#4">4</a>] but differs in our level of discretization.</p><p>In order to realize any acceleration in experimentation under the new estimators, we had to create a policy where experimenters would compute their A/B test metrics using these new estimators earlier than they would under the old, uncensored approaches, all while maintaining comparable statistical power. Because of a relatively limited number of historical A/B tests with which to evaluate this speed-up empirically, we decided to rely on Monte-Carlo simulation to determine the speed-up to prescribe in practice. Although we ended up going with a simpler policy of reading metrics a fixed number of days earlier, such Monte-Carlo simulation of the speedup could be computed in a bespoke way for each proposed A/B test. This would, in some situations, achieve a much greater speed-up than available under the uniform policy we ended up using, at the expense of complexity.</p><h2 id="the-simulation-framework">The Simulation Framework</h2><p>All of our simulation data are generated according to the following probabilistic model:</p><p>An experiment is defined as a collection of initial subscription times \(t \sim \mathrm{Uniform}(0,T \text{ days})\) which arrive uniformly between 0 days and \(T\) days. 
\(T\) was a pre-set constant that was set to be \(K\) days for spend simulations and \(K+10\) days for retention simulations; these are similar enough (and well within the range of typical experiment fluctuation) that the results should be interpreted identically. Also note that this is the continuous uniform distribution: people can arrive half-way or three-quarters of the way through any given day. Every individual has some underlying mean retention time \(\mu(t)\) which is typically a constant in every scenario except Simulation 3 where \(\mu(t) = T' + b (2t/T - 1)\) to simulate within-experiment seasonality of revenue characteristics. Given the mean retention time \(\mu(t)\), the retention time \(R\) of a subscriber is exponentially distributed with mean \(\mu(t)\), which results in a subscription from time \(t\) to time \(t + R \sim t + \mathrm{Exp}(\text{mean}=\mu(t))\). Moreover, in all but the last spend simulation, the amount that someone spends in a day is precisely a constant times the fraction of a day they were an active subscriber. We don’t include non-subscribers in the spend simulation here; non-subscribers are emulated in the stress test later. For a target sample size \(m\) to collect during the simulated experiment, we independently sample \(m\) such subscriptions to create the experimental data.</p><p>For every simulated experiment, we then wish to estimate the \(n\)-day retention/spend at time \(K + T_r\) where the read time \(T_r = 0 \text{ days}, 1 \text{ day}, \ldots,\) etc. since the experiment finished. When measuring retention and spend at time \(T_r\) since the experiment finished, we do not have access to any events (e.g. 
a subscriber churning) at any time later than \(T_r\).</p><p>All experiment scenarios and results are averaged over 1000 independent trials in the retention simulations and 1500 independent trials in the spend simulations.</p><p>In this simulation, we generated experimental cohorts according to the above data model under various amounts of cohorted subscribing customers and mean retention times. These retention and sample size characteristics were chosen to run the gamut of experimental data we would expect to see in practice. Then, we matched the cohorts with the same sample size pairwise in order to compute the probability that we could detect (with a \(z\)-test) the bona-fide difference in retention / spend between the two hypothetical A/B experiences the different cohorts would receive. We estimated the \(n\)-day retention probability and \(n\)-day spend at day \(0, 1, \ldots , n\) after the experiment cohorting ended using both the status quo estimator and the Kaplan-Meier / mean cumulative function approaches. We stopped estimating spend and retention at day \(n\) after the end of the experiment because all the data are guaranteed to be uncensored at this point, and accordingly the status quo and proposed estimators coincide exactly.</p><p>In all scenarios of interest, the tests based on the uncensored approaches have lower statistical power (a lower probability of detecting the bona-fide retention difference) than the Kaplan-Meier / mean cumulative function based tests where the two are statistically comparable. This is particularly noticeable for moderate sample sizes and moderate differences: in one simulated scenario representative of reality, the status quo based test detects the difference less than half of the time on the day the experiment ends, while the Kaplan-Meier approach succeeds over 80% of the time. 
In two-thirds of the scenarios tested, the Kaplan-Meier approach’s success rate on the day the experiment ends is at least 5 percentage points higher than the status quo approach’s, and in the majority of those cases the difference is over 10 percentage points.</p><p>Looking at the simulation results differently, this can be quantified in terms of accelerating the number of days we need to achieve the same statistical power (within a 1% or similarly small relative tolerance) we would achieve if we computed the status quo estimators at day \(n\) after cohorting ends (the typical time we historically have read retention / spend experiment metrics). The speed-ups we observed for the mean cumulative function (relative to the total time used for cohorting and waiting to read retention and spend) for the various scenarios considered are presented below. The results for retention with the Kaplan-Meier estimator are similar and are not shown here. Note that the intervals of relative speed-ups are not confidence intervals — they are point estimates — but reflect the fact that the total time used to cohort and wait for retention historically has not been fixed and instead varies within a range of \(L\) to \(U\) days. If \(k\) is the number of days earlier we read our metrics, the reported interval is simply \(k / U\) to \(k / L\).</p><table><thead><tr><th class="c1"><strong>Relative Speed-up</strong></th>
<th class="c1"><strong>0.1% Power Tolerance</strong></th>
<th class="c1"><strong>1% Power Tolerance</strong></th>
<th class="c1"><strong>2% Power Tolerance</strong></th>
</tr></thead><tbody><tr><td class="c2"><strong>Mean</strong></td>
<td class="c2">20-27%</td>
<td class="c2">25-33%</td>
<td class="c2">28-38%</td>
</tr><tr><td class="c2"><strong>25th Percentile</strong></td>
<td class="c2">8-11%</td>
<td class="c2">13-18%</td>
<td class="c2">17-22%</td>
</tr><tr><td class="c2"><strong>50th Percentile</strong></td>
<td class="c2">14-19%</td>
<td class="c2">18-24%</td>
<td class="c2">23-30%</td>
</tr><tr><td class="c2"><strong>75th Percentile</strong></td>
<td class="c2">32-42%</td>
<td class="c2">43-58%</td>
<td class="c2">46-61%</td>
</tr></tbody></table><p>To incorporate these estimators across all of Yelp’s experiment analysis, we dictated that experimenters should read their experiment metrics with the speed-up corresponding to the 50th percentile speed-up we observed in these simulations, under a 0.1% power tolerance as compared to the previous status quo approach. In doing so, under the assumption that our simulations were as representative as we believe, about half of experiment settings would see power no more than 0.1% lower than the status quo approach, and almost all would see power no more than 2% lower than the status quo approach. In light of the marked increase in our ability to iterate on Yelp’s products, this felt like a more-than-fair trade to make. Since in many circumstances the speed-up can be much greater than 12-16% over status quo, bespoke recommendations can and will be made in situations when rapid experimentation is extremely important to Yelp’s bottom line.</p><h2 id="stress-test-1-robustness-against-seasonality">Stress Test 1: Robustness against Seasonality</h2><p>In order to check that our simulations don’t break down in real world scenarios, we ran a number of stress tests that injected more extreme versions of reality into our data generating model, checking that the results largely mirrored what we see with the original data model. We only considered retention in this simulation, and not cumulative spend.</p><p>In the first of these stress tests, we consider the case where the average subscriber retention time in a cohort is fixed at some number of days, but where the average retention of an individual varies with respect to when they initially make a purchase during the experiment. Fixing some day-zero bias \(b\), the average retention of an individual is linearly interpolated between \(C + b\) and \(C - b\) over the duration of the experiment in a way such that the population average stays the same. 
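A sketch of that interpolation (illustrative, with the direction as described here):

```python
def seasonal_mean_retention(t: float, T: float, C: float, b: float) -> float:
    """Mean retention time for a customer cohorted at time t in [0, T]:
    linearly interpolated from C + b at t = 0 down to C - b at t = T, so
    that uniform arrivals keep the population average at C."""
    return C + b * (1 - 2 * t / T)

# C = 60 days, b = 10, T = 70: cohorts start at a 70-day mean retention,
# end at 50 days, and average 60 days over the experiment.
```
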
For biases chosen from a predefined set of candidates, we track the bias of the \(n\)-day retention probability estimates made by the status quo and Kaplan-Meier approaches as we re-calculate the metrics after the experiment ends.</p><p>The Kaplan-Meier retention estimator has uniformly lower bias than the status quo approach where they are statistically comparable. Indeed, in situations with positive day-zero bias, the bias of the Kaplan-Meier estimator is on the order of 50% of the bias of the status quo approach. Moreover, the bias of the Kaplan-Meier estimator decreases super-linearly with respect to how long we wait to make the measurement, while the bias of the status quo estimator decreases linearly. Here, linearly means that the error is a line with a negative slope; this differs from the convergence-rate sense of “linear decrease,” which denotes a geometric decay in error. This result increases our confidence that the new estimators won’t return worse results in situations that have within-experiment seasonality.</p><p>In the second stress test, we modified the data generating model so that the amount a person spends each day is not a uniform constant multiple of whether or not they are subscribed, but instead a constant multiple of whether or not they are subscribed that varies across individuals according to some heavy-tailed and bi-modal distribution reflective of actual spend distributions in Yelp products. Bi-modality emulates the inclusion of non-spenders and those who subscribe to much cheaper products in the experiment, while the heavy tail simply reflects the distribution over purchase amounts for people who do subscribe to a variable-cost product like advertisements.</p><p>In short, the distribution over speed-ups seen in the table above is largely the same with this new noise added, although the speed-ups are slightly reduced. 
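One way to sketch such a bi-modal, heavy-tailed spend multiplier (all parameters are illustrative, not Yelp’s actual spend mix):

```python
import random

def sample_spend_multiplier(rng: random.Random) -> float:
    """Per-individual daily-spend multiplier: a point mass for cheap
    fixed-price products plus a lognormal tail for variable-cost products
    like ads. A mass at zero would emulate non-spenders."""
    if rng.random() < 0.4:
        return 5.0                       # cheap, fixed-price product
    return rng.lognormvariate(3.0, 1.0)  # heavy-tailed variable spend
```
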
Since the reduction is quite small, as seen in the following table, we can be more confident that our simplified data generating process used in the initial power simulations does reflect reality. Nevertheless, it seems prudent to revise our expectations stated earlier about the properties of our “read metrics \(n\) days earlier” policy: about half of experiment settings would see power no more than 1% lower than the status quo approach, and almost all would see power no more than 2% lower than the status quo approach.</p><table><thead><tr><th class="c1"><strong>Relative Speed-up</strong></th>
<th class="c1"><strong>0.1% Power Tolerance</strong></th>
<th class="c1"><strong>1% Power Tolerance</strong></th>
<th class="c1"><strong>2% Power Tolerance</strong></th>
</tr></thead><tbody><tr><td class="c2"><strong>Mean</strong></td>
<td class="c2">19-26%</td>
<td class="c2">22-30%</td>
<td class="c2">25-34%</td>
</tr><tr><td class="c2"><strong>25th Percentile</strong></td>
<td class="c2">0-0%</td>
<td class="c2">13-17%</td>
<td class="c2">15-21%</td>
</tr><tr><td class="c2"><strong>50th Percentile</strong></td>
<td class="c2">14-18%</td>
<td class="c2">16-22%</td>
<td class="c2">19-25%</td>
</tr><tr><td class="c2"><strong>75th Percentile</strong></td>
<td class="c2">38-51%</td>
<td class="c2">38-51%</td>
<td class="c2">38-51%</td>
</tr></tbody></table><p>The Kaplan-Meier and mean cumulative function estimators are simple-to-use tools that can return reduced-variance estimates of \(n\)-day retention and cumulative spend. In simulations, these estimators afford a speed-up in non-engineering experiment runtime of 12-16% over uncensored approaches. Combining this computational evidence with real-world experimentation has increased Yelp’s ability to iterate on our product and operations more efficiently.</p><p>I would like to thank Anish Balaji, Yinghong Lan, and Jenny Yu for crucial advice and discussion needed to implement the changes to experimentation described here across Yelp. In addition, I genuinely appreciate all the comments from Blake Larkin, Yinghong Lan, Jenny Yu, Woojin Kim, Daniel Yao, Vishnu Purushothaman Sreenivasan, and Jeffrey Seifried that helped refine this blog post from its initial draft into its current form.</p><ol><li>Wasserman, L. “All of statistics: a concise course in statistical inference.” Springer-Verlag New York, 2004.</li>
<li>Bitouzé, D., B. Laurent, and P. Massart. “A Dvoretzky–Kiefer–Wolfowitz type inequality for the Kaplan–Meier estimator.” In Annales de l’Institut Henri Poincare (B) Probability and Statistics, vol. 35, no. 6, pp. 735-763. 1999.</li>
<li>Luo, D., and S. Saunders. “Bias and mean-square error for the Kaplan-Meier and Nelson-Aalen estimators.” In Journal of Nonparametric Statistics, vol. 3, no. 1, pp. 37-51, 1993.</li>
<li>Nelson, W. “Confidence Limits for Recurrence Data – Applied to Cost or Number of Product Repairs.” In Technometrics, vol. 37, no. 2, pp. 147-157, 1995.</li>
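</ol><p>As a concrete companion to the estimator discussed above, here is a minimal pure-Python Kaplan-Meier estimate of \(n\)-day retention from right-censored data. This is an illustrative sketch only, not Yelp’s implementation:</p>

```python
# Minimal Kaplan-Meier estimator for right-censored retention data.
# An illustrative sketch -- not Yelp's implementation.
# Each observation is (time, observed): observed=True means the event
# (e.g. churn) happened at `time`; False means censoring at `time`.

def kaplan_meier(observations, t):
    """Return S(t): the estimated probability of surviving past time t."""
    event_times = sorted({time for time, observed in observations if observed})
    survival = 1.0
    for time in event_times:
        if time > t:
            break
        at_risk = sum(1 for obs_t, _ in observations if obs_t >= time)
        died = sum(1 for obs_t, observed in observations
                   if observed and obs_t == time)
        survival *= 1.0 - died / at_risk
    return survival

# Five users: churn at day 2 and day 5; three still active (censored).
data = [(2, True), (3, False), (5, True), (6, False), (7, False)]
retention_7d = kaplan_meier(data, 7)  # S(7) = (1 - 1/5) * (1 - 1/3) = 8/15
```

<p>Note how the censored users still contribute to the at-risk counts for as long as they are observed, which is exactly where the variance reduction over discarding partially observed users comes from.</p><ol>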
</ol><div class="island job-posting"><h3>Become an Applied Scientist at Yelp</h3><p>Want to impact our product with statistical modeling and experimentation improvements?</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/bc9dbcab-f8c8-475b-8637-4dc3becb790c?description=Applied-Scientist_Engineering_San-Francisco-CA?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/02/accelerating-retention-experiments-with-partially-observed-data.html</link>
      <guid>https://engineeringblog.yelp.com/2020/02/accelerating-retention-experiments-with-partially-observed-data.html</guid>
      <pubDate>Thu, 20 Feb 2020 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Yelp Takes on Grace Hopper 2019!]]></title>
      <description><![CDATA[<p>Last October we sent a group of Yelpers to the 2019 Grace Hopper Celebration! Here are a few takeaways and reflections from some of our attendees.</p><h2 id="who-attended">Who attended?</h2><ul><li>Surashree K., software engineer on Semantic Business Information</li>
<li>Clara M., product design lead on Content</li>
<li>Anna F., machine learning engineer on Semantic Business Information</li>
<li>Nikunja G., software engineer on Infrastructure Security</li>
<li>Catlyn K., software engineer on Stream Processing</li>
</ul><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/grace-hopper-2019/GHC_group_photo.jpg" alt="" /></div><h2 id="what-was-your-favorite-session">What was your favorite session?</h2><p><strong>Surashree</strong>: Honestly, it’s hard to choose, but the one that stuck with me was the talk by Jackie Tsay and Matthew Dierker on Google’s Smart Compose, the Gmail feature that helps people write emails faster by auto-completing sentences. It was interesting to learn about inherent biases the earlier versions of the model had, and the engineering decisions that went into combatting those. The speakers also talked about some of the feedback they received; one that was especially moving was from a non-native English speaker who was happy to have a feature that would make writing emails in English easier.</p><p><strong>Clara</strong>: Definitely the talk on AI Meets Creativity by Dr. Pinar Yanardag. It was fascinating how she analyzed the way AI algorithms can actually inspire the creative process. She shared an example of an algorithm that analyzed dress patterns and generated a pattern that a fashion designer went on to create. She also shared examples of how this could work for graffiti art, pizza recipes, and even making perfume. The talk was not only visually compelling but also made me think a lot more about how AI can actually help boost creativity rather than stifle it.</p><p><strong>Anna</strong>: One of my favorite sessions was FarmBeats, Microsoft’s AI and IoT system for agriculture by Zerina Kapetanovic. She first described the challenges in setting up “smart” data-driven agriculture, including low rural internet connectivity, electricity access, and the high cost of sensors. She then walked us through creative solutions for each problem, ranging from clever uses of solar panels to dangling smartphones from balloons to approximate drone footage. 
It was inspiring to see how this collection of workarounds and approximations came together into a coherent and precise solution.</p><p>A recurring theme throughout many of the sessions I attended was how AI can enhance human capabilities by putting better tools in more people’s hands. AI enables us to get great results even in uncontrolled situations or where precision hardware isn’t available. The FarmBeats talk, described above, demonstrates this in the field of agriculture. I’m excited to see what specialist tools AI will make commonplace in the future.</p><p><strong>Nikunja</strong>: I work in Security, and to see a good representation of women in this field was a welcome change. One of my favorite sessions was the interactive security game put together by three engineers working at OneMedical. The game challenged you to secure a fictional organization by prioritizing security projects within an ever-changing threat landscape. The session was highly interactive, informational, and most of all, extremely fun! I never anticipated that such a session could be presented at a conference like Grace Hopper, and I brought back some major takeaways to share with my team at Yelp.</p><h2 id="what-was-the-best-career-advice-you-received">What was the best career advice you received?</h2><p><strong>Surashree</strong>: My one key takeaway from the conference was the importance of standing up for yourself and others. One of the talks by the CEO of AnitaB.org, Brenda Wilkerson, and COO Jacqueline Copeland, highlighted the still pertinent issue of the gender pay gap in tech and how it isn’t enough for companies to simply hire more diverse people; they also need to create an environment where all groups feel supported. 
One of our mottos here at Yelp is “Play well with others,” and this talk reminded me that the confidence we have in our daily lives comes from a certain level of privilege that we have to recognize.</p><p><strong>Catlyn</strong>: Don’t be ashamed to ask for more. According to one of the execs from Uber, women ask for less than their worth during the hiring process, resulting in a skewed sense of self-evaluation. We also tend to refrain from responsibility unless we’re certain we’ll be the perfect candidate for the role. But no one’s perfect and it’s okay to figure things out along the way. You’ll never be 100% ready, so why not just seize the opportunity and enjoy the challenge!</p><p><strong>Surashree</strong>: Be prepared to be totally overwhelmed! There are some things you can do to make your life easy during those three days—download the app, check your schedule every day, carry a bottle of water, talk to as many people as possible, and attend as many talks as you can. But really, the sheer size of the conference and the activity around you will be hard to take in at first. Our recruiting team does a wonderful job of organizing, so our job as attendees is really to just make the most of the GHC experience.</p><p><strong>Nikunja</strong>: I feel that however much you prepare, in the end you’ll still feel overwhelmed and unprepared, so my number one suggestion is to go with the flow once you’re there. Having said that, it’s absolutely essential to do some prep before going, like organizing your schedule (pre-registering for sessions and having a good balance of booth duty and conference talks). Also, try attending a mix of sessions! GHC is unique in that it has so many different tracks in one place, so take advantage of it.</p><p><strong>Anna</strong>: Take travel time into account when signing up for sessions! 
The conference center is half a mile long, and you don’t want to miss a session or show up out of breath due to poor planning.</p><h2 id="what-was-your-most-memorable-moment">What was your most memorable moment?</h2><p><strong>Surashree</strong>: My most memorable moment came right at the beginning of the conference: the keynote by Aicha Evans, the CEO of Zoox. An immigrant from Senegal, she went from a domestic life to being the CEO of a company that’s building the next generation of autonomous cars. Her story was soft, gritty, and inspiring—all at once. Her question, “Whose genius are you going to ignite?” highlighted the importance of mentorship and giving back, something I believe we do very well here at Yelp.</p><p><strong>Clara</strong>: The closing keynotes with the DJ playing! It was such a fun atmosphere and a great way to close out the conference by hearing from so many amazing women doing innovative things in their industry.</p><p><strong>Nikunja</strong>: To be honest, this is a tough one, as the whole experience was very memorable in itself. However, there’s one that takes the cake. The conference has several award categories; one of them is the Student of Vision award, which was given to Jhilika Kumar, an undergrad student at Georgia Tech. At such a young age, Jhilika is the founder of AxisAbility, an organization she started to improve the lives of differently abled people. Her passion for this cause stems from personal experience. Her brother faced so many challenges in his youth, and she wants to help him and others like him lead a better life. Her video, speech, and determination left so many people inspired, and showed us that no matter how young you are, you can create change.</p><h2 id="why-should-one-go-to-ghc">Why should one go to GHC?</h2><p><strong>Clara</strong>: Honestly, at first I was skeptical about going to GHC as a product designer since I always thought it was a conference for software engineers. 
However, once I was there I realized how impactful it is for any woman working in the tech industry. I not only learned about new technologies, but also got the chance to be inspired by and network with other women in my industry. I was also surprised by how many people came by the booth looking to talk specifically with me since they knew Yelp had sent a product designer, which apparently not many other companies had. In general, it’s a great opportunity for everyone—product designers included!</p><p><strong>Catlyn</strong>: It was a great experience to be surrounded by so many brilliant women engineers who either are or once were facing the same career challenges that I am right now. I felt enlightened and empowered attending the talks and chatting with others from companies all over the world. I would strongly encourage everyone, especially those early on in their career, to attend some of the workshops to help you find out what kind of path you want to pave going forward and how you can get there.</p><p><strong>Surashree</strong>: Knowing what I know now, the biggest reason to go to GHC for me is to hear the stories from the lives of other female engineers. Working at Yelp, in the harmonious and safe environment that we have, it can be easy to overlook that not everyone has had the same advantages as I have, and not everyone’s experiences in tech have been the same. There are women who’ve had to deal with difficult situations, perhaps a toxic work culture or misogyny in some form, and have come out stronger and brave enough to talk about it at conferences like this. 
So I’d say, go to GHC to learn about other people’s experiences and gain new perspectives!</p><div class="island job-posting"><h3>Become a Web Developer at Yelp Toronto!</h3><p>Join our Engineering team and help millions of people connect with local businesses on Yelp.</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/a6cfee89-2dd0-4451-bf52-746b9547dfb7?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/02/grace-hopper-yelp-2019.html</link>
      <guid>https://engineeringblog.yelp.com/2020/02/grace-hopper-yelp-2019.html</guid>
      <pubDate>Wed, 12 Feb 2020 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Open-Sourcing Varanus and Rusty Jetpack]]></title>
      <description><![CDATA[<p><em>The <strong>monitor lizards</strong> are large lizards in the genus <strong><a href="https://en.wikipedia.org/wiki/Monitor_lizard">Varanus</a>.</strong></em></p><p>Some time ago, our Android app got into a loop of sending data, due to some unlikely interactions between several different systems, which briefly overwhelmed our servers before we were able to turn it off. Fortunately, key code was behind an experiment. Otherwise, apps could have continued misbehaving for days, as there is no guarantee users would immediately update the app. It took an unusual combination of circumstances for this to happen, but this kind of problem seems to be a pervasive concern across the industry, and there are few tools to prevent it.</p><p>Furthermore, even at the best of times, mobile data can be hard to manage. Every now and then an article comes up about how a widely used app has eaten up users’ data. Unfortunately, there aren’t very many good tools for tracking how much data is sent, or what exactly is responsible for sending too much data in the first place.</p><p>Also, because updates are optional, all the code you’ve ever written is out there, somewhere (and we’ve had an Android app for almost as long as Android has existed!). If something goes wrong, you may not be able to push a fix to enough people, and with millions of users, all sorts of strange things can happen. While it’s unlikely that something goes catastrophically wrong and you can’t get enough people to update, it’s not impossible.</p><h2 id="what-does-varanus-do">What Does Varanus Do?</h2><p>In building out Varanus, we had two main goals:</p><ol><li>Always be able to turn off unwanted data on the client, no matter what.</li>
<li>Observe how much traffic is generally being sent so we can spot if something weird happens.</li>
</ol><p>We also had three constraints:</p><ol><li>It should be exceptionally simple and hard to break.</li>
<li>It should work without anyone having to do anything.</li>
<li>It can be dropped into our different apps with minimal effort.</li>
</ol><p>Also, since it seemed that a lot of people were concerned with this problem but lacked the resources to spend time fixing it, we saw this as an opportunity to contribute something useful to the community.</p><h2 id="how-does-it-work">How Does It Work?</h2><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-02-06-open-sourcing-varanus-and-rusty-jetpack/varanus-diagram.png" alt="All traffic on the app automatically passes through Varanus" /><p class="subtle-text"><small>All traffic on the app automatically passes through Varanus</small></p></div><p>Basically, since all network traffic passes through Varanus, Android developers don’t even have to think about it for it to work. It counts the number of bytes and requests, and bins them by arbitrary categories of traffic that can be specified programmatically. An error message from the server (or <a href="https://en.wikipedia.org/wiki/Content_delivery_network">CDN</a>) then tells the app to hold off on sending more traffic for a bit—one message says to stop sending all traffic, and the other says to stop a specific category of traffic.</p><p>The code is entirely client-side, and no new backend infrastructure is needed (as long as you have a way of sending custom HTTP error codes from your server if necessary). Also, no coordination between devices is required, and turning off traffic is simple: all you need is a runbook.</p><p>Varanus is built around OkHttp interceptors, but with a bit of extra work, there’s no reason other clients couldn’t be supported. It’s also written entirely in Kotlin (like all new code at Yelp).</p><h2 id="where-can-i-find-more-details">Where Can I Find More Details?</h2><p>Take a look at the <a href="https://github.com/Yelp/android-varanus/blob/master/README.md">README</a>, or the code itself. 
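</p><p>To make the mechanism concrete, here’s a minimal, language-agnostic sketch in Python of the client-side counting and shutoff logic described above. Varanus itself is Kotlin built on OkHttp interceptors, and the status codes and names below are invented for illustration, not Varanus’s actual API:</p>

```python
import time

# Hypothetical custom HTTP status codes a server/CDN might use to signal
# "hold off on traffic" -- Varanus's real signaling is documented in its README.
STOP_ALL_TRAFFIC = 598
STOP_CATEGORY = 599

class TrafficMonitor:
    """Counts bytes/requests per category and honors server-directed shutoffs."""
    def __init__(self, hold_off_seconds=60.0):
        self.bytes_sent = {}
        self.requests_sent = {}
        self.category_blocked_until = {}  # category -> timestamp
        self.all_blocked_until = 0.0
        self.hold_off_seconds = hold_off_seconds

    def allow(self, category, now=None):
        """Should this request be sent at all?"""
        now = time.time() if now is None else now
        if now < self.all_blocked_until:
            return False
        return now >= self.category_blocked_until.get(category, 0.0)

    def record(self, category, num_bytes):
        """Bin outgoing traffic by category."""
        self.bytes_sent[category] = self.bytes_sent.get(category, 0) + num_bytes
        self.requests_sent[category] = self.requests_sent.get(category, 0) + 1

    def on_response(self, category, status, now=None):
        """React to a server/CDN shutoff message."""
        now = time.time() if now is None else now
        if status == STOP_ALL_TRAFFIC:
            self.all_blocked_until = now + self.hold_off_seconds
        elif status == STOP_CATEGORY:
            self.category_blocked_until[category] = now + self.hold_off_seconds

monitor = TrafficMonitor()
monitor.record("analytics", num_bytes=2048)
monitor.on_response("analytics", STOP_CATEGORY, now=100.0)
monitor.allow("analytics", now=110.0)  # False: this category is held off
monitor.allow("search", now=110.0)     # True: other traffic is unaffected
```

<p>A real interceptor would wrap every outgoing request with the equivalent of <code>allow</code>/<code>record</code> and inspect each response with the equivalent of <code>on_response</code>, which is exactly what the OkHttp interceptor arrangement gives Varanus for free.</p><p>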
We have a sample app that explains how it should be used.</p><p>In preparation for targeting Android 10, Yelp’s Android apps were migrated to use AndroidX libraries. Unfortunately, with the size of our apps’ codebases, the provided migration tool in Android Studio didn’t work for us. Rusty Jetpack was then born as a <a href="https://engineeringblog.yelp.com/2018/11/all-about-yelp-hackathon.html">Hackathon</a> project to help ensure seamless adoption across many developers with little downtime.</p><h2 id="what-does-rusty-jetpack-do">What Does Rusty Jetpack Do?</h2><p>The tool migrates all files in a git repository to use the new AndroidX package namespaces. This includes imports, fully qualified references, ProGuard declarations, and warnings about Gradle packages that need to be changed. While this does mean the code won’t compile immediately after using the tool, most of the mundane work is taken care of. And best of all, it achieves all of this in under one second for our largest repository!</p><p>Rusty Jetpack is critical to preventing downtime during migrations with rapidly changing codebases. A migration can easily be kept up to date with the latest changes by re-running the tool; once pushed, it can be distributed to developers for quick adoption without major disruption.</p><p>To learn more, <a href="https://github.com/Yelp/rusty_jetpack">check out the repository here</a>!</p><div class="island job-posting"><h3>Become an Android Software Engineer at Yelp</h3><p>Want to help us make even better tools for our Android engineers?</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/2c6736d6-7c8e-4f57-8912-15a71815eef0?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/02/open-sourcing-varanus-and-rusty-jetpack.html</link>
      <guid>https://engineeringblog.yelp.com/2020/02/open-sourcing-varanus-and-rusty-jetpack.html</guid>
      <pubDate>Thu, 06 Feb 2020 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Modernizing Ads Targeting Machine Learning Pipeline]]></title>
      <description><![CDATA[<p>Yelp’s mission is to connect users with great local businesses. As part of that mission, we provide local businesses with an <a href="https://biz.yelp.com/support/advertising">ads product</a> to help them better reach out to users. This product strives to showcase the most relevant ads to the user without taking away from their overall search experience on Yelp. In this blog post, we’ll walk through the architecture of how this is made possible by using one of the largest machine learning systems at Yelp: <strong>Ads Targeting System</strong>.</p><p>The Ads Targeting System is a machine learning (ML) system designed to serve only the most relevant ads to users based on their intentions and context on Yelp. There are two primary types of ML models in the ads targeting domain: Click Through Rate (CTR) prediction, and Objective Targeting (OT). Both help determine the likelihood of downstream actions, such as calling a business after clicking on an ad.</p><p>In this post, we’ll primarily focus on architecting ML systems at scale rather than on algorithmic details or feature engineering. For more info on the algorithmic side of our CTR prediction model, check out one of our previous <a href="https://engineeringblog.yelp.com/2018/01/growing-cache-friendly-trees-part2.html">posts</a> where we discuss optimizations made to the XGBoost prediction library.</p><p>Below is a simplified version of the Ads Targeting and Delivery System:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-01-30-modernizing-ads-targeting-machine-learning-pipeline/overview_ads_targeting_system.png" alt="" /></div><p>The <strong>Ad Delivery</strong> service is a low-latency online service written in Java that processes incoming ad requests. 
It generates features for the incoming request using the <strong>Ad Feature Generation</strong> library, loads the model from the <strong>Ad Model Store</strong>, generates CTR predictions, and then ranks ads accordingly.</p><p>The <strong>Ad Targeting service</strong> is a batch processing service written with <a href="https://github.com/Yelp/mrjob">mrjob</a>, a Python MapReduce library open-sourced by Yelp. This is the service that we’ll discuss and redesign in this blog post. Its main features include processing logs using the same Ad Feature Generation library, training ML models, and storing them in the <strong>Ad Model Store</strong>. mrjob also comes with a feature that allows you to call Java code from Python to carry out MapReduce operations (as can be seen <a href="https://github.com/Yelp/mrjob/blob/master/mrjob/step.py#L421">here</a>). Using the same Feature Generation library ensures that all feature computation, both online and offline, remains consistent.</p><p>This <a href="https://engineeringblog.yelp.com/2018/01/building-a-distributed-ml-pipeline-part1.html">blog post</a> on CTR prediction illustrates how the Ads Targeting Machine Learning Pipeline used to look:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-01-30-modernizing-ads-targeting-machine-learning-pipeline/old_pipeline_stages.png" alt="" /></div><p>We processed Ad Event JSON logs, downsampled them in Spark, and extracted features from the set of logs with Hadoop MR jobs. We then proceeded to model training with XGBoost and model evaluation with Hadoop MR jobs using AWS EMR as the compute infrastructure. This pipeline served us well and helped us iterate in an ad hoc fashion to create newer and better ad targeting models. That being said, we did face several issues as the system matured due to the following:</p><ul><li>As all stages of the pipeline were closely coupled, failure in any of the intermediate steps required restarting the pipeline</li>
<li>This close coupling also meant that changing the feature generation logic or sampling strategy required running the entire pipeline</li>
<li>The pipeline was closely tied to certain EMR instance types and AMI images, which blocked upgrades to newer versions of Java and kept us from trying newer EMR instance types (e.g., upgrades in other Java dependencies and online Java services wouldn’t work with the current Ad Targeting service, making it impossible to retrain a model or add a new feature)</li>
<li>As our system matured and we started adding more models to our ads targeting system, the cost of training grew</li>
</ul><p>To solve the above issues, we decided to re-architect the Ad Targeting Service and its interaction with the other main components of the Ad Targeting and Delivery Systems. Keeping an eye on the big picture and setting goals is very important when re-designing a system as large as this. For us, that meant focusing on:</p><ul><li>Making it easy to retrain existing models</li>
<li>Making feature generation cheaper, easier, and faster</li>
<li>Leveraging Yelp’s internal Spark tooling and infrastructure (rather than relying on EMR)</li>
<li>Improving monitoring and alerting, and providing easy promotion of models in production</li>
</ul><p>We decided to use Spark as the underlying engine for this ML system as it allowed us to leverage our own in-house Spark on Mesos infrastructure. This infrastructure provides us with a quick and cheap way to spin up clusters and get started with writing big data workflows. Moreover, moving away from Hadoop map-reduce jobs on EMR increased speed and cut costs. This, coupled with the availability of PySpark (the official Python API for Spark), made the decision even easier, since most of our code and infrastructure is built with Python.</p><p>Armed with better infrastructure and tooling around Spark and its natural fit to our big data ML use-case, we decided to rewrite the Ads Targeting Service in PySpark. The new service now contains the same stages as before: <code class="highlighter-rouge">Sampling -&gt; Feature generation -&gt; Training -&gt; Evaluation</code>, just with all the stages computed with PySpark.</p><h2 id="overview-of-modernized-architecture-based-on-spark">Overview of Modernized Architecture Based on Spark</h2><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-01-30-modernizing-ads-targeting-machine-learning-pipeline/new_pipeline_stages.png" alt="" /></div><p>Above is the current machine learning pipeline powered by the updated Ad Targeting Service. Three significant changes were made here:</p><h3 id="use-spark-as-the-compute-infrastructure">Use Spark as the Compute Infrastructure</h3><p>Spark batches were more efficient both in terms of time and cost. Moving to Spark allowed us to leverage the existing infrastructure at Yelp that enabled us to write ETL jobs and carry out distributed machine learning with XGBoost. 
This was a very cost-effective move since now we only pay for spot EC2 compute resources (and not for the EMR stack on top of it!).</p><h3 id="decouple-the-ml-pipeline-into-stages">Decouple the ML Pipeline into Stages</h3><p>We created batches that process logs, perform sampling, and generate features; they are scheduled to run daily and checkpoint results to S3. Decoupling these batches gave us flexibility: we can now build different feature generation strategies on top of the same sampling output, whereas in the older architecture each new feature generation strategy required re-computing the sampling output. It also made the system more robust: since failures in later stages (say, training or evaluation) didn’t disrupt the whole pipeline, engineering and operating costs were reduced.</p><h3 id="automated-monitoring-and-alerting">Automated Monitoring and Alerting</h3><p>We leveraged Yelp’s modernized Data Landscape (<a href="https://engineeringblog.yelp.com/2016/07/billions-of-messages-a-day-yelps-real-time-data-pipeline.html">1</a>, <a href="https://engineeringblog.yelp.com/2016/08/more-than-just-a-schema-store.html">2</a>, <a href="https://engineeringblog.yelp.com/2016/11/open-sourcing-yelps-data-pipeline.html">3</a>) and built our monitoring capabilities on top of this infrastructure. Instead of manually running Jupyter notebooks to monitor metrics, we computed these in batches, loaded them into our <strong>AWS Redshift</strong> data warehouse, and created <strong>Splunk</strong> dashboards on top of them. This made it really easy for PMs and engineers to make model promotion/deployment decisions.</p><h2 id="feature-generation-with-java-and-pyspark">Feature Generation with Java and PySpark</h2><p>The online Ad Delivery Java service and the offline Ad Targeting Python service share the same Ad Feature Generation Java library. The question then arises: how do we leverage PySpark to generate features? 
<em>Hint: What language is Spark written in?</em> Let’s unpack this!</p><h3 id="dataflow-in-pyspark">DataFlow in PySpark:</h3><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-01-30-modernizing-ads-targeting-machine-learning-pipeline/pyspark_dataflow.png" alt="" /></div><p>Spark is written in Scala (a JVM language), and PySpark is a Python wrapper on top of it. PySpark relies on <a href="https://www.py4j.org/">Py4J</a> to execute Python code that can call on objects that reside in the JVM. To do that, Py4J uses a <a href="https://www.py4j.org/py4j_java_gateway.html">gateway</a> between the JVM and the Python interpreter, and PySpark sets it up for you with SparkContext. This SparkContext has access to the JVM and all packages and classes known to the JVM. You can see where this is heading…</p><p>To carry out distributed feature generation via PySpark, all we had to do was add our feature generation JAR to the Spark JVM and use SparkContext to refer to these classes. Since Yelp executes Spark within Docker, we added the JARs to our service’s Docker images, then loaded the image in Spark drivers and executors. We then had a feature generation Java library accessible via PySpark! The diagram above, taken from the <a href="https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals">PySpark Wiki</a>, illustrates the above design. As you can see, Py4J essentially carries out all data communication with the JVM.</p><h3 id="implementation">Implementation:</h3><p>You can imagine a simple example of a Java class with a method that prints “Hello World!” that is then called from PySpark to get the printed string: “Hello World!”. This implementation was illustrated in <a href="https://aseigneurin.github.io/2016/09/01/spark-calling-scala-code-from-pyspark.html">this blog</a> (in Scala, but the same principle applies for Java), so we won’t get into it here. 
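</p><p>Before diving into the JVM plumbing, here is a toy, pure-Python analogue of the parameterized-mapper pattern we use (all class and field names below are invented for illustration, not Yelp’s actual classes): a flat-map function object is constructed with a list of mappers and applied to every record in a partition.</p>

```python
# Toy, JVM-free analogue of the parameterized-mapper pattern.
# All names are illustrative -- not Yelp's actual classes.

class ExtractField:
    """Mapper that pulls a single field out of a raw JSON-like log record."""
    def __init__(self, field):
        self.field = field

    def __call__(self, record):
        return record.get(self.field)

class FlatMapWithMappers:
    """Analogue of a Java class implementing FlatMapFunction: it is
    constructed with parameterized mappers (which a zero-argument-constructor
    Java UDF cannot be) and applied to each record in a partition."""
    def __init__(self, mappers):
        self.mappers = mappers

    def __call__(self, partition):
        for record in partition:
            yield [mapper(record) for mapper in self.mappers]

ad_logs = [{"ad_id": 1, "clicked": 1}, {"ad_id": 2, "clicked": 0}]
fn = FlatMapWithMappers([ExtractField("ad_id"), ExtractField("clicked")])
rows = list(fn(iter(ad_logs)))  # one feature row per input record
# rows == [[1, 1], [2, 0]]
```

<p>In the real pipeline the analogous object lives on the JVM and is driven through Py4J, but the shape of the computation is the same.</p><p>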
Instead, we’ll demonstrate how to apply this principle to our use-case.</p><p>Say we have some JSON logs containing information about ads that can be read in PySpark as a PythonRDD. Now, we want to extract/transform features from these logs using our Java library. One way of doing this is via a Java UDF (as is illustrated <a href="https://dzone.com/articles/pyspark-java-udf-integration-1">here</a>). However, there’s a limitation to this approach: it requires Java classes to have a zero-argument constructor. This can be seen in the official <a href="https://github.com/apache/spark/blob/be4faafee43d7b8810cf19deacd22e91b19ccfc6/sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala#L685">Spark UDF Registration</a> code. Since we wanted the ability to parameterize our classes for our use case, this approach didn’t work for us.</p><p>Hence, we went with the following: first, we created a Java class that implements the <a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/function/FlatMapFunction.html"><code class="highlighter-rouge">FlatMapFunction</code></a> interface. This allowed us to generate any number of output rows per row of the input RDD by passing an object of this class to Spark’s <a href="https://spark.apache.org/docs/latest/api/python/pyspark.html?#pyspark.RDD.mapPartitions"><code class="highlighter-rouge">mapPartitions</code></a> function. The Java class also gets a list of Java mappers that we want to apply to the input RDD to generate the output fields. One of these Java mappers calls into the feature generation library to extract the transformed features. The library itself essentially consists of simple Java classes that can extract features by applying simple transforms or business logic on raw JSON logs.</p><p>Now that we have all the Java classes ready, we can do the following on the Python side:</p><div class="language-python highlighter-rouge highlight"><pre>
from pyspark.mllib.common import _java2py
from pyspark.mllib.common import _py2java
# Step 1: First convert the PythonRDD object into a java RDD object.
java_rdd_object = _py2java(python_rdd.ctx, python_rdd)
# Step 2: Get the Java class that implements the FlatMapFunction interface, initialize it,
# and pass some mappers to it to apply to the Java RDD
java_flat_map_function_object = flat_map_function_package.ClassWithFlatMapFunctionInterface(
    initParamA,
    initParamB,
    [
      MapperA(arg_a),
      MapperB(arg_b),
      MapperForFeatures(),
      MapperForLabels()
    ]
)
# NOTE: As one can see above, we can parameterize our mappers as opposed to JavaUDF functions
# Step 3: Call mapPartitions on that Java object (effectively calling Java code)
# and get the output as a Java RDD instance.
mapped_java_rdd_object = java_rdd_object.mapPartitions(
        java_flat_map_function_object
    )
# The above mapped_java_rdd_object now contains the results of all 4 mappers applied
# Step 4: Convert the Java RDD object back into a Python RDD object.
mapped_python_rdd = _java2py(python_rdd.ctx, mapped_java_rdd_object)
</pre></div><p>Voila! Now we have a PythonRDD of features that was generated via the Java feature generation code.</p><h2 id="model-training-with-distributed-xgboost-on-spark">Model Training with Distributed XGBoost on Spark</h2><p>We use <a href="https://www.mlflow.org/docs/latest/tracking.html">MLFlow-tracking</a> to track and log our model training runs. This provides us with a lot of visibility into our model training metrics, an easy way of logging and visualizing hyperparameters, and even sharing the model-training reports. Another cool feature of MLFlow-tracking is the ability to query the model-training runs and retrieve the best models based on metrics. We leverage this feature to automate our evaluation and monitoring pipelines.</p><p>To train our ads targeting models, we heavily rely on <a href="https://xgboost.readthedocs.io/en/latest/">XGBoost</a>. However, distributed training with XGBoost on Spark took some work to accomplish. Since the official library (version &lt;= 0.9) doesn’t provide a Python/PySpark interface, we wrote our own wrapper on top of <a href="https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html">XGBoost4J-Spark</a>. We also implemented a SparseVectorAssembler instead of using the <a href="http://spark.apache.org/docs/latest/ml-features#vectorassembler">VectorAssembler</a> provided by Spark, since the default implementation doesn’t integrate well with XGBoost on Spark (<a href="https://github.com/dmlc/xgboost/pull/4805">issues</a> dealing with missing values). Another limitation of XGBoost on Spark is that it’s not as fault-tolerant as native Spark algorithms. This becomes an issue when trying to use AWS Spot instances for model training: when spot instances become unavailable, the training job dies. 
Thus, we created a separate pool of on-demand resources to carry out large-scale distributed training with XGBoost on Spark.</p><h2 id="automated-retraining-monitoring-and-alerting">Automated Retraining, Monitoring, and Alerting</h2><p>We use <a href="https://github.com/Yelp/Tron">Yelp’s Tron</a> scheduler to schedule our batch processing jobs. The entire new pipeline is scheduled via Tron: log-processing and feature-generation batches run daily, and model training runs every few days. Through A/B experiments, we’ve observed that simply retraining models on newer data leads to a <strong>~1% improvement</strong> in our primary metric, which compounds over time.</p><p>While scheduling model retraining is simple, deployment, monitoring, and alerting are harder. To give developers and PMs the confidence to go into production with newly trained models, we built a solid monitoring infrastructure that provides the following:</p><ul><li>Daily model evaluation that replays traffic for all models in production and models yet to be deployed in production (this helps us capture model drift and decay)</li>
<li>Live Splunk dashboards of business, online model, and offline model evaluation metrics</li>
<li>Scoring verification systems that verify that online and offline scores match and ensure that features don’t drift between online and offline modes</li>
</ul><p>Having a good monitoring infrastructure improves developer velocity in deploying newly trained models. It’s analogous to having a good CI/CD infrastructure for code deployment.</p><p>With this newly designed service and regular retraining-deployment cycle, we’ve seen a vast improvement in our model metrics, which has translated into better business metrics such as click-through rate, sell-through rate, and lower cost per click for our advertisers.</p><p>This means that we’re not only serving more relevant and useful ads to our users, but have also reduced the cost for our advertisers to serve ads, making Yelp a more cost-effective platform for their business.</p><h2 id="conclusion">Conclusion</h2><ul><li>Designing large ML systems is hard due to additional complexities introduced by data and models, but it’s especially important when it’s a big part of your product</li>
<li>Sometimes ML systems need to evolve (from Hadoop MR to Spark); ML engineers shouldn’t shy away from this just because it’s infrastructure and not modeling</li>
<li>Decouple system components and checkpoint data often so that each component can be worked on and improved independently</li>
<li>Create infrastructure that makes training, evaluating, and monitoring models easy and automated, and that instills confidence in developers to deploy newly trained models</li>
<li>Retraining models with newer data can provide good gains with almost zero effort, so take advantage of it!</li>
</ul><h2 id="acknowledgements">Acknowledgements</h2><p>A huge thanks to engineers from the Applied ML, Core ML, and Ads Platform teams, without whom such a broad cross-team collaborative effort wouldn’t have been possible. Credit to the contributors: Chris Farrell, Jason Sleight, Aditya Mukherjee, Abhy Vytheeswaran, Vincent Kubala.</p><div class="island job-posting"><h3>Become a Machine Learning Engineer at Yelp</h3><p>Want to build state of the art machine learning systems at Yelp? Apply to become a Machine Learning Engineer today.</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/f674cef9-b635-4f25-8dd9-66663494392a?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/01/modernizing-ads-targeting-machine-learning-pipeline.html</link>
      <guid>https://engineeringblog.yelp.com/2020/01/modernizing-ads-targeting-machine-learning-pipeline.html</guid>
      <pubDate>Thu, 30 Jan 2020 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Streams and Monk – How Yelp is Approaching Kafka in 2020]]></title>
      <description><![CDATA[<p><a href="https://engineeringblog.yelp.com/2020/01/streams-and-monk-how-yelp-approaches-kafka-in-2020.html">Streams and Monk – How Yelp is Approaching Kafka in 2020</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/01/streams-and-monk-how-yelp-approaches-kafka-in-2020.html</link>
      <guid>https://engineeringblog.yelp.com/2020/01/streams-and-monk-how-yelp-approaches-kafka-in-2020.html</guid>
      <pubDate>Wed, 22 Jan 2020 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Automated IDOR Discovery through Stateful Swagger Fuzzing]]></title>
      <description><![CDATA[<p>Scaling security coverage in a growing company is hard. The only way to do this effectively is to empower front-line developers to easily discover, triage, and fix vulnerabilities before they make it to production servers.</p><p>Today, we’re excited to announce that we’ll be open-sourcing <a href="https://github.com/Yelp/fuzz-lightyear">fuzz-lightyear</a>: a testing framework we’ve developed to identify <a href="https://blog.detectify.com/2016/05/25/owasp-top-10-insecure-direct-object-reference-4/">Insecure Direct Object Reference (IDOR) vulnerabilities</a> through stateful <a href="https://www.wired.com/2016/06/hacker-lexicon-fuzzing/">Swagger fuzzing</a>, tailored to support an enterprise, microservice architecture. This integrates with our Continuous Integration (CI) pipeline to provide consistent, automatic test coverage as web applications evolve.</p><h2 id="the-problem">The Problem</h2><p>As a class of vulnerabilities, IDOR is arguably one of the most difficult to systematically defend against in an enterprise codebase. Its ease of exploitation, combined with its potential for impact, makes it a high-risk vulnerability that we want to minimize as much as possible.</p><p>In the security industry, there are two main approaches to defending against threats. First, try to <strong>prevent</strong> them from happening. If this isn’t possible, make sure you can <strong>detect</strong> them for fast remediation.</p><p>The problem with IDOR is that it’s difficult to do either one.</p><h3 id="hard-to-prevent">Hard to Prevent</h3><p>The main problem with preventing IDOR vulnerabilities is that there’s no system that can be easily implemented to mitigate it. For <a href="https://www.acunetix.com/websitesecurity/cross-site-scripting/">Cross Site Scripting (XSS)</a> attacks, you can leverage an effective templating system.
For <a href="https://portswigger.net/web-security/sql-injection">SQL Injection attacks</a>, you can use parameterized queries. For IDOR, a common industry recommendation is to leverage a mapping (e.g., random string) to make it harder to enumerate values as an attacker. However, practically speaking, this is not as easy as it seems.</p><p>Maintaining a mapping leads to two categories of caveats:</p><ol><li>
<p>Cache Management</p>
<p>Let’s assume you have an endpoint that’s currently vulnerable to IDOR attacks: <code class="highlighter-rouge">/resource/1</code>. Now, you want to implement a mapping that masks this ID in the URL with a random string: <code class="highlighter-rouge">/resource/abcdef</code>, where <code class="highlighter-rouge">abcdef</code> maps to 1.</p>
<p>In this contrived example, you may be tempted to deprecate the old endpoint and just use the new one. However, this may break browser caches, user bookmarks, and pages indexed by search engines. Imagine taking an unexpected SEO hit when trying to roll out your IDOR-prevention system!</p>
<p>The alternative is to redirect traffic from the old endpoint to the new one and let it bake in production for an extended period of time. However, for as long as the redirect is in place, you would still be susceptible to IDOR vulnerabilities. Furthermore, the mapping is publicly harvestable during this period, so someone may store it and use it later to perform the same attacks, just with harder-to-enumerate values.</p>
</li>
<li>
<p>Handling Internal References</p>
<p>ID references are littered throughout many different internal systems: various logs, Kafka messages, and database entries to name a few. When you transition from one reference method to another, how do you make sure that none of these systems break?</p>
<p>One good approach is to use the mapped string only for public-facing assets and its numeric counterpart for internal references. However, how do you enforce this? There will always be new data ingresses, and the problem can devolve into a whack-a-mole approach: either translate IDs at each new ingress or handle both types of IDs downstream.</p>
</li>
</ol><p>Another common industry recommendation is to merely perform access control checks before manipulating resources. While this is easier to do, it’s more suitable for spot-fixing, as it’s a painfully manual process to enforce via code audits. Furthermore, it requires <strong>all</strong> developers to know when and where to implement these access control checks. For example, if you put it at the ORM level, you may need to consider legitimate administrative cases for when you need to “bypass” these checks. If you put it at your view layer (assuming MVC layout), you may find yourself duplicating code everywhere.</p><p>How can you ensure all developers are actively thinking about this attack vector, <em>and</em> know how to mitigate it?</p><h3 id="slow-to-detect">Slow to Detect</h3><p>Detection strategies for this class of vulnerabilities are also somewhat lackluster. While manual code audits are effective, they don’t scale and are often expensive. Off-the-shelf static code analyzers prove more noisy than they’re worth, and a complicated taint analysis model would be required due to the various number of places that access control checks can be done.</p><p>Traditional API fuzzing may seem like another valid option, but this is not the case. The issue with traditional fuzzing is that it seeks to break an application with the assumption that failures allude to vulnerabilities. However, this is not necessarily true. As a security team, we care less about errors that attackers may or may not receive. Rather, we want to identify when a malicious action <strong>succeeds</strong>, which will be completely ignored by traditional fuzzing.</p><h2 id="the-solution">The Solution</h2><p>In February 2019, Microsoft released a <a href="https://www.microsoft.com/en-us/research/uploads/prod/2019/02/paper2.pdf">research paper</a> that describes how stateful Swagger fuzzing was able to detect common vulnerabilities in REST APIs, including IDOR vulnerabilities. 
The premise of this strategy is as follows:</p><ol><li>
<p>Have a user session execute a sequence of requests.</p>
</li>
<li>
<p>For the same sequence of requests, have an attacker’s session execute them. This is to ensure that the user and the attacker are able to reach the same state.</p>
</li>
<li>
<p>For the last request in the sequence, have the attacker’s session execute the user’s request. If this is successful, a potential vulnerability is found.</p>
</li>
</ol><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-11-07-fuzz-lightyear/diagram.png" alt="Detecting IDOR in a hypothetical sequence of requests" /><p class="subtle-text"><small>Detecting IDOR in a hypothetical sequence of requests</small></p></div><h3 id="stateful-fuzzing-vs-traditional-fuzzing">Stateful Fuzzing vs. Traditional Fuzzing</h3><p>Generally speaking, the art of using fuzzing requests to find vulnerabilities relies on one core assumption: <strong>applications should be able to handle any input thrown at them</strong>. This means that when the application breaks due to “malformed” input, it’s indicative of a potential exploit and warrants further investigation.</p><p>The issue with this approach is that as a security team, we care less about whether an application breaks for a specific user and more about successful requests in situations where they should have failed.</p><p><a href="https://swagger.io">Swagger</a>, as a standardized API specification, is fantastic for programmatically defining the rules of engagement for the fuzzing engine. Furthermore, by making it stateful, we can simulate user behavior through proper API requests/responses which keep state between each response. This state can then be used to fuzz future request parameters so that a single request sequence is able to accurately simulate a user’s session, enabling <a href="https://principlesofchaos.org/">chaos engineering testing</a>.</p><p>Finally, user session testing allows for carefully crafted scenarios to assert various security properties of a given API. In this case, we leveraged this to check whether users are able to access private resources that don’t belong to them.</p><p>The simplicity of this concept was profound. It provided a means to scale IDOR detection in an automated fashion through integration with our CI pipeline. 
However, while our solution was inspired by Microsoft’s research, we encountered several issues when adapting it to our ecosystem.</p><h2 id="issues">Issues</h2><h3 id="infrastructure-dependencies">Infrastructure Dependencies</h3><p>With a microservice architecture, services often have dependencies on other services. This means that in order to fuzz a given service, we would need to spin up its dependent services along with any other nested dependent services. To address this, we leveraged <a href="https://docs.docker.com/compose/">Docker Compose</a> to spin up a sandbox environment so we could perform acceptance testing with the service.</p><p>Acceptance testing is the practice of treating your service as a blackbox and testing whether the entire system as a whole behaves as expected. Through a microservice lens, this differs from integration tests (that mock out external dependencies), as acceptance tests spin up sandboxed instances for more realistic end-to-end testing. Since fuzz-lightyear identifies potential IDOR vulnerabilities by analyzing successful requests, it complements this framework nicely. Running tests in sandboxed instances also prevents leaving after-effects on staging or production databases so we don’t pollute our data with fuzzed, random input. Acceptance tests are typically integrated into CI/CD pipelines but can also be run locally by developers.</p><p>One popular tool we use at Yelp to facilitate running acceptance tests is Docker Compose. This allows developers to define service dependencies in one single YAML file and enables them to start/stop them easily. By leveraging this tooling, we gain two advantages. First, we empower developers by seamlessly integrating into their established development/testing workflow. 
Second, it integrates effortlessly with our existing CI pipeline to provide continuous coverage, testing for IDOR vulnerabilities in a freshly generated sandboxed environment containing all the new changes.</p><h3 id="incomplete-resource-lifecycle">Incomplete Resource Lifecycle</h3><p>A fundamental assumption in the original research paper is that the tested application supports all CRUD (Create, Retrieve, Update, Delete) methods. This allows for stateful fuzzing, as any resource can be created and manipulated within the application’s API.</p><p>However, this is not the case at Yelp. Often, services only provide interfaces to retrieve and update resources directly corresponding to that service, but rely on other services to create such resources. This means that stateful fuzzing would not be effective, since there’s no way to test the retrieval of a resource that we didn’t create within the request sequence.</p><p>For example, service A has an endpoint X which takes a <code class="highlighter-rouge">business_id</code> as an input, but service A itself doesn’t have the ability to create businesses. By itself, the stateful fuzzing algorithm would never be able to test endpoint X since we have no way of generating a business!</p><p>We can’t just tack on another service’s API to the request sequence generation process, since this would expand the search space of the algorithm too much. Therefore, our solution is to provide developers the ability to define factory fixtures that can be used while fuzzing. This is what a fixture looks like:</p><div class="highlight"><pre>@fuzz_lightyear.register_factory('userID')
def create_biz_user_id():
  return do_some_magic_to_create_business()</pre></div>
To address these issues, we designed a testing framework that allows developers to easily configure dynamic tests and integrate them smoothly into our CI pipeline. In doing so, we can achieve continuous, automated IDOR coverage, as well as empower developers to be able to address these issues independently.</p><p>Curious to check it out? View more details on fuzz-lightyear on our <a href="https://github.com/Yelp/fuzz-lightyear">Github</a> page.</p><h2 id="contributors">Contributors</h2><p>I would like to credit the following people (in alphabetical order) for their hard work in building this system and in continuing to bolster Yelp’s security.</p><ul><li><a href="https://www.linkedin.com/in/aaronloo">Aaron Loo</a></li>
<li><a href="https://www.linkedin.com/in/joeysclee">Joey Lee</a></li>
<li><a href="https://github.com/OiCMudkips">Victor Zhou</a></li>
</ul><div class="island job-posting"><h3>Security Engineering at Yelp</h3><p>Want to transform industry-leading ideas into actionable, scalable solutions to help keep the Yelps secure? Apply to join!</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/30bfc49d-efdd-4543-9748-d95bef5692ae?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/01/automated-idor-discovery-through-stateful-swagger-fuzzing.html</link>
      <guid>https://engineeringblog.yelp.com/2020/01/automated-idor-discovery-through-stateful-swagger-fuzzing.html</guid>
      <pubDate>Thu, 09 Jan 2020 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Streaming Cassandra into Kafka in (Near) Real-Time: Part 2]]></title>
      <description><![CDATA[<p>The <a href="https://engineeringblog.yelp.com/2019/12/cassandra-source-connector-part-1.html">first half</a> of this post covered the requirements and design choices of the Cassandra Source Connector and dove into the details of the CDC Publisher. As described, the CDC Publisher processes Cassandra CDC data and publishes it as loosely ordered PartitionUpdate objects into Kafka as intermediate keyed streams. The intermediate streams then serve as input for the DP Materializer.</p><h2 id="data-pipeline-materializer">Data Pipeline Materializer</h2><p>The DP Materializer ingests the serialized PartitionUpdate objects published by the CDC Publisher, transforms them into fully formed Data Pipeline messages, and publishes them into the Data Pipeline.</p><p>The DP Materializer is built on top of Apache Flink, a stream processing framework. Flink has been used in production at Yelp for a few years now across various streaming applications. It provides an inherent state backend in the form of RocksDB, which is essential for guaranteeing in-order CDC publishing. In addition, Flink’s checkpoint and savepoint capabilities provide extremely powerful fault tolerance.</p><p>The application has two main phases:</p><ul><li>Schema Inference (or the “bootstrap phase”)</li>
<li>ETL (or the “transform phase”)</li>
</ul><h3 id="schema-inference">Schema Inference</h3><p>During the bootstrap phase, the avro schema necessary for publishing to the Data Pipeline is derived from the Cassandra table schema. The process begins by building the Cassandra table metadata objects (<em>CFMetaData</em>) used by the Cassandra library. Loading this metadata is required to use library functionality to act on the serialized Cassandra data from the CDC Publisher stream. The metadata object contains information on the table primary key, column types, and all other properties specified in a table CREATE statement. This schema representation is processed to produce an avro schema where each Cassandra column is represented by an equivalent avro type.</p><p>As the DP Materializer is deployed outside of the Cassandra cluster, it cannot load the table metadata from files on the local node (like the CDC Publisher). Instead, it uses the Cassandra client to connect to Cassandra and derive the CFMetaData from the schema of the table being streamed. This is done in the following steps:</p><ol><li>Once connected to a cluster, the create table and type (for UDTs) statements are retrieved.</li>
<li>Cassandra’s query processor is used to parse the retrieved create statements into the table metadata objects.</li>
<li>Information about columns previously dropped from the table is retrieved and added to the metadata built in the previous step. Loading the dropped column information is required to read table data created prior to the column being dropped.</li>
</ol><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-18-csource-part-2/loading-cassandra-metadata.jpeg" alt="Loading Table Metadata from Cassandra" /><p class="subtle-text"><small>Loading Table Metadata from Cassandra</small></p></div><p>Once the metadata is loaded, the DP Materializer builds the avro schema from the metadata. A couple of key things happen in this derivation phase:</p><ol><li>The table’s partition key and clustering key(s) are mapped as the primary keys of the avro schema.</li>
<li>All other columns in the table (except the partition and clustering keys) are created as nullable. In the event of schema changes in the table, this guarantees that the corresponding avro schemas are always compatible to their previous versions (except when re-adding a column with a different type, which in itself <a href="https://issues.apache.org/jira/browse/CASSANDRA-14843">can</a> <a href="https://issues.apache.org/jira/browse/CASSANDRA-14948">cause</a> <a href="https://issues.apache.org/jira/browse/CASSANDRA-14913">issues</a>).</li>
</ol><p>Schema generation currently supports nearly all valid Cassandra column types (except when prohibited by Avro), including collections, tuples, UDTs, and nesting thereof.</p><h4 id="schema-change-detection">Schema Change Detection</h4><p>As the above schema inference is part of the bootstrap phase, the DP Materializer needs the ability to detect Cassandra schema changes online and update the output Avro schema automatically. To achieve this, it implements Cassandra’s schema change listener interface, provided by the Cassandra client, to detect when a change is made to the schema of the tracked table. Once detected, the corresponding Cassandra metadata is updated and the avro schema is rebuilt from the updated metadata.</p><h3 id="etl-or-consume-transform-and-publish">ETL (or Consume, Transform, and Publish)</h3><p>This phase of the DP Materializer is where the serialized PartitionUpdate objects from the CDC Publisher are consumed, processed, and transformed into Data Pipeline messages for publishing into the Pipeline. The consumer and publisher are provided out-of-the-box by Flink, so this section primarily focuses on the transformer portion of the DP Materializer.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-18-csource-part-2/dp-materializer-high-level.jpeg" alt="Data Pipeline Materializer" /><p class="subtle-text"><small>Data Pipeline Materializer</small></p></div><h4 id="state-architecture">State Architecture</h4><p>The transformer is backed by Flink’s RocksDB state. This state is abstracted as a collection of map objects, with each map corresponding to a partition key from the Cassandra table. Each map object has, as its keys, the clustering keys from that partition in Cassandra. A PartitionUpdate, containing at most one row, is stored as the value for its corresponding clustering key in the map. 
For tables which do not have defined clustering keys, each map contains a single entry with a null key.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-18-csource-part-2/state-architecture.jpeg" alt="State Structure" /><p class="subtle-text"><small>State Structure</small></p></div><p>State loading and memory management is handled internally by Flink. In addition, Flink’s stream keying mechanism guarantees that all updates for a partition key will be routed to the same worker and processed against the same map object persistently across application restarts.</p><p>Note that the PartitionUpdate objects from the CDC Publisher can be both duplicated multiple times and out-of-order (by writetime). In addition, oftentimes a PartitionUpdate may not contain the full content of a Cassandra row.</p><h4 id="the-transformer">The Transformer</h4><p>The central piece of the application is the transformer, which:</p><ul><li>Processes the Cassandra CDC data into a complete row (with preimage) for the given avro primary key (Cassandra partition key + clustering key[s]) for publishing to the Data Pipeline.</li>
<li>Produces final output message with the appropriate Data Pipeline message type.</li>
</ul><p>The transformer uses the row (PartitionUpdate) saved in the map objects in the state, along with the incoming PartitionUpdate objects from the CDC Publisher to generate the complete row content, the previous row content (in the case of UPDATE, DELETE messages), and the type of the output message.</p><p>This is achieved by deserializing the input PartitionUpdate and merging it with the saved PartitionUpdate. This is done using the same PartitionUpdate merge functionality Cassandra uses to combine data from SSTables during reads. The merge API takes in two PartitionUpdate objects, one from the Flink state and the other from the CDC Publisher’s output stream. This produces a merged PartitionUpdate which is used to build an avro record with the schema derived during the bootstrap phase. If the previous row value is needed, it is derived from the saved PartitionUpdate in the Flink state. In the end, the state is updated with the merged PartitionUpdate.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-18-csource-part-2/merge-partition-update.jpeg" alt="Determining Row States" /><p class="subtle-text"><small>Determining Row States</small></p></div><p>This process handles duplicate and out-of-order PartitionUpdate objects. The use of Cassandra’s merge functionality results in the same “last write wins” conflict resolution as a Cassandra read. To avoid publishing duplicate messages, it is verified that the input PartitionUpdate changes the row state. This is done by computing the md5 digests of the saved and merged PartitionUpdate objects. If the digests are the same, the PartitionUpdate is ignored.</p><p>The merge, update state, and publish logic can be summarized below:</p><ul><li>The incoming PartitionUpdate is merged with the saved PartitionUpdate (if it exists) and the corresponding Data Pipeline message is determined:
<ul><li>If the merged PartitionUpdate contains live (non-tombstoned) data and the saved does not, a CREATE message is published.</li>
<li>If both the merged and saved PartitionUpdate objects contain live data, an UPDATE message is published if the md5 digests of the objects are different.</li>
<li>If the merged PartitionUpdate contains tombstoned data but the saved one contains live data, a DELETE message is published.</li>
</ul></li>
<li>If the md5 digests of the saved and merged PartitionUpdate objects are different, then the merged PartitionUpdate is saved in the state.</li>
</ul><p>Thus, at the end of the transform phase, a message with the appropriate Data Pipeline message type and the full row content is ready to be published into the Data Pipeline.</p><h2 id="supporting-backfills">Supporting Backfills</h2><h3 id="bootstrapping-a-stream">Bootstrapping a Stream</h3><p>A limited amount of CDC logs can be <a href="http://cassandra.apache.org/doc/latest/operating/cdc.html#warnings">stored on a Cassandra node</a>. Thus, when a table is set up to be streamed by the connector, only the data available in the CDC directory at the time (and going forward) will be processed. However, to maintain the stream-table duality, all of the existing data in the Cassandra table needs to be replayed into the stream.</p><p>To achieve this, the backfill bootstrap process reads through the data stored on disk as SSTables. To ensure that the set of SSTable files are not modified by compaction during the backfill, the table’s snapshot is taken and the SSTables are processed off of that snapshot. The Cassandra SSTable reader returns the scanned data as a series of PartitionUpdate objects. The CDC Publisher processes these PartitionUpdate objects in the same way as commit log segments and publishes them into Kafka, where they’re subsequently transformed into Data Pipeline messages by DP Materializer.</p><p>This process is followed whenever a Cassandra table is first set up to be tracked by the connector. This is also done if there’s a need to rebuild the state in the DP Materializer.</p><h3 id="rebuilding-a-stream">Rebuilding a Stream</h3><p>If a tracked table’s output stream becomes corrupted or is deleted (unlikely but possible), the stream can be rebuilt by replaying the stored state of the DP Materializer. 
As all of the serialized PartitionUpdate objects are stored in the state, there’s no need to republish data from the SSTables.</p><h2 id="limitations-and-future-work">Limitations and Future Work</h2><h3 id="partition-level-operations">Partition Level Operations</h3><p>The current system design processes each row change independently. A single input message to the DP Materializer will emit at most one message into the Data Pipeline. Changes at a partition level that affect the value of multiple rows are not currently supported. These include:</p><ul><li>Full partition deletion (only when using clustering)</li>
<li>Ranged tombstones</li>
<li>Static columns</li>
</ul><p>There is, however, a potential path to supporting these operations. The DP Materializer stores all rows in a single Cassandra partition as entries of the same map object during processing. It is conceivable to also store the partition-level state separately. When this state changes, the DP Materializer could iterate through the entire map (Cassandra partition) and produce Data Pipeline messages for all affected rows.</p><h3 id="ttl">TTL</h3><p>TTL’ed data is currently not supported by the connector. TTL values are ignored and data is considered live based on its writetime.</p><h3 id="dropping-tombstones">Dropping Tombstones</h3><p>There’s no support for dropping tombstones from the DP Materializer’s Flink state. They will remain there indefinitely unless overridden with new data. It may be possible to drop old tombstones when updating row state, similar to the gc_grace_seconds parameter on tables. However, this would not help for rows that are never updated. In addition, great care would need to be taken to ensure that backfilling or repairing a table does not create zombie data in the output stream.</p><h3 id="publishing-latency">Publishing Latency</h3><p>As mentioned earlier, commit log segments must be full and no longer referenced by memtables before being made available for processing by Cassandra. In spite of the CDC log filler implementation, some latency is introduced in publishing to the Data Pipeline. This limitation should be overcome in Cassandra 4, which introduces the capability to read live commit log segments and will thus ensure that the publishing latency is as close to real time as possible.</p><h2 id="learnings">Learnings</h2><p>The Cassandra Source Connector has been running in production at Yelp since Q4 2018. 
It supports multiple use cases, which have helped surface some quirks in its design choices:</p><h3 id="avro-as-a-serialization-format">Avro as a Serialization Format</h3><p>The maximum number of cells (rows * columns) allowed by Cassandra in a single partition is two billion. This means that a row could potentially have two billion columns. However, Avro serialization and deserialization become a bottleneck once the number of columns grows into the hundreds, and cannot keep up with the potential maximum number of columns. Horizontal scaling might be needed for consumers depending on the throughput requirements and size (in number of columns) of the Cassandra table being streamed.</p><p>In addition, a few Cassandra data types (such as DECIMAL) don’t have intuitive Avro data type equivalents. In such cases, either the columns cannot be supported or custom Avro data types have to be defined.</p><h3 id="flink-state-size">Flink State Size</h3><p>As every single row from the table is stored as a serialized PartitionUpdate in the state, the state can grow huge for large tables. The state size becomes a bottleneck for code pushes and maintenance, as the state has to be reloaded on every deployment and restart of the application. Additional work is required to minimize the time spent saving and loading state for huge tables.</p><h2 id="tldr">TL;DR?</h2><p>Yelp presented the Cassandra Source Connector at Datastax Accelerate 2019. You can watch it <a href="https://www.youtube.com/watch?v=p2GLvYActRw">here</a>.</p><div class="post-gray-box">This post is part of a series covering Yelp's real-time streaming data infrastructure. 
Our series explores in depth how we stream MySQL and Cassandra data in real time, how we automatically track &amp; migrate schemas, how we process and transform streams, and finally how we connect all of this into data stores like Redshift, Salesforce, and Elasticsearch.<p>Read the posts in the series:</p><ul><li><a title="Billions of Messages a Day - Yelp's Real-time Data Pipeline" href="https://engineeringblog.yelp.com/2016/07/billions-of-messages-a-day-yelps-real-time-data-pipeline.html">Billions of Messages a Day - Yelp's Real-time Data Pipeline</a></li>
<li><a title="Streaming MySQL tables in real-time to Kafka" href="https://engineeringblog.yelp.com/2016/08/streaming-mysql-tables-in-real-time-to-kafka.html">Streaming MySQL tables in real-time to Kafka</a></li>
<li><a title="More Than Just a Schema Store" href="https://engineeringblog.yelp.com/2016/08/more-than-just-a-schema-store.html">More Than Just a Schema Store</a></li>
<li><a title="PaaStorm: A Streaming Processor" href="https://engineeringblog.yelp.com/2016/08/paastorm-a-streaming-processor.html">PaaStorm: A Streaming Processor</a></li>
<li><a title="Data Pipeline: Salesforce Connector" href="https://engineeringblog.yelp.com/2016/09/data-pipeline-salesforce-connector.html">Data Pipeline: Salesforce Connector</a></li>
<li><a title="Streaming Messages from Kafka into Redshift in near Real-Time" href="https://engineeringblog.yelp.com/2016/10/redshift-connector.html">Streaming Messages from Kafka into Redshift in near Real-Time</a></li>
<li><a title="Open-Sourcing Yelp's Data Pipeline" href="https://engineeringblog.yelp.com/2016/11/open-sourcing-yelps-data-pipeline.html">Open-Sourcing Yelp's Data Pipeline</a></li>
<li><a title="Making 30x Performance Improvements on Yelp’s MySQLStreamer" href="https://engineeringblog.yelp.com/2018/02/making-30x-performance-improvements-on-yelps-mysqlstreamer.html">Making 30x Performance Improvements on Yelp’s MySQLStreamer</a></li>
<li><a title="Black-Box Auditing: Verifying End-to-End Replication Integrity between MySQL and Redshift" href="https://engineeringblog.yelp.com/2018/04/black-box-auditing.html">Black-Box Auditing: Verifying End-to-End Replication Integrity between MySQL and Redshift</a></li>
<li><a title="Fast Order Search Using Yelp’s Data Pipeline and Elasticsearch" href="https://engineeringblog.yelp.com/2018/06/fast-order-search.html">Fast Order Search Using Yelp’s Data Pipeline and Elasticsearch</a></li>
<li><a title="Joinery: A Tale of Un-Windowed Joins" href="https://engineeringblog.yelp.com/2018/12/joinery-a-tale-of-unwindowed-joins.html">Joinery: A Tale of Un-Windowed Joins</a></li>
<li><a title="Streaming Cassandra into Kafka in (Near) Real-Time: Part 1" href="https://engineeringblog.yelp.com/2019/12/cassandra-source-connector-part-1.html">Streaming Cassandra into Kafka in (Near) Real-Time: Part 1</a></li>
<li><a title="Streaming Cassandra into Kafka in (Near) Real-Time: Part 2" href="https://engineeringblog.yelp.com/2019/12/cassandra-source-connector-part-2.html">Streaming Cassandra into Kafka in (Near) Real-Time: Part 2</a></li>
</ul></div><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>Interested in solving problems like these? Apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/2cfdf523-06dd-41d9-b025-3db1b45f0548?description=Software-Engineer-Data-Production-Backend_Engineering_London-UK?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2019/12/cassandra-source-connector-part-2.html</link>
      <guid>https://engineeringblog.yelp.com/2019/12/cassandra-source-connector-part-2.html</guid>
      <pubDate>Wed, 18 Dec 2019 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Architecting Restaurant Wait Time Predictions]]></title>
      <description><![CDATA[<p>Is there a restaurant you’ve always wanted to check out, but haven’t been able to because they don’t take reservations and the lines are out the door?</p><p>Here at Yelp, we’re trying to solve problems just like these and delight consumers with streamlined dining experiences. Yelp Waitlist is part of the Yelp Restaurants product suite, and its mission is to take the mystery out of everyday dining experiences, enabling you to get in line at your favorite restaurant through just the tap of a button.</p><p>For diners, in addition to joining an online waitlist, Yelp Waitlist provides live wait times and queue updates. For restaurants, it facilitates table management and reduces stress and chaos by the door by allowing guests to sign up remotely. The flow is simple: diners see the current wait times at a Waitlist restaurant and virtually get in line right from the Yelp app.</p><p>If you want to know more about the product, check out <a href="https://blog.yelp.com/2019/09/yelp-waitlist-new-predictive-wait-time-and-notify-me-features">this related</a> post!</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-11-architecting-wait-time-estimations/shizen_gil_flow.png" alt="" /></div><p>Wait estimates are modeled as a machine learning problem. When you request to be seated at a restaurant through Waitlist, a machine learning model is alerted behind the scenes to generate a prediction. The ability of this model to provide reasonable wait estimates is what makes the online waitlist possible, so you have some bit of AI to thank the next time you enter a line from the comfort of your home.</p><p>The prediction endpoint is part of a larger system that enables the generation of the estimated time. 
This blog post describes the Waitlist machine learning system that connects hungry diners to their tasty food.</p><h3 id="the-system">The System</h3><p>As you can imagine, the system needs to stay as up to date as possible with the state of the restaurant (e.g., how many people are currently in line) and the many other contextual factors required to make an estimate as accurate as possible. For example, you cannot expect the wait time to extend beyond the closing time of the restaurant. Additionally, the system must meet strict latency requirements while serving a high volume of queries per second.</p><p>The system can be broken down into three components:</p><ol><li>The offline training pipeline where model iteration, data-wrangling, and ETLs happen.</li>
<li>Online serving which tracks the current state of the restaurant and responds to requests.</li>
<li>Analytics providing model performance reports and analyses.</li>
</ol><p>We chose XGBoost as the model to generate wait estimates. Offline training happens via Spark, while an <a href="https://engineeringblog.yelp.com/2018/01/growing-cache-friendly-trees-part2.html">optimized XGBoost Java library</a> that helps us meet latency requirements is used for online serving.</p><p>We faced two main challenges while architecting the machine learning system:</p><ol><li>Serving users live predictions from a Spark ML model through an online service written in Python.</li>
<li>The cold start problem when adding new businesses to the product.</li>
</ol><p>Most of the system was initially designed to overcome the first challenge. We slowly added components to enable training and prediction with more features once we felt confident in the system’s ability to work seamlessly on its own. The second challenge was addressed by the use of XGBoost, which can make predictions with partial feature-sets. Though these predictions may not be very accurate at first, retraining helps improve them over time.</p><p>Below is a simplified view of the system:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-11-architecting-wait-time-estimations/wait_list_system.png" alt="" /></div><h3 id="the-various-components-in-the-above-diagram-are">The various components in the above diagram are:</h3><ul><li><strong>Data Warehouse</strong>: Source of data for training, backed by Redshift.</li>
<li><strong>Offline Service</strong>: Service responsible for training the model. This is written in Python and uses Spark for model training due to the quantity of data involved (tens of millions of instances after sanitization).</li>
<li><strong>Feature ETLs</strong>: Spark-based ETLs for generating additional features. These are non-time-sensitive features which are shared both online and offline.</li>
<li><strong>Model Server</strong>: In-house Java service which stores the trained model and is optimized for high-throughput traffic.</li>
<li><strong>Online Stores</strong>: Available features generated from Spark-ETL, as well as up-to-date restaurant data. This encompasses:
<ul><li>Cassandra for storing results from Spark-ETLs</li>
<li>MySQL for storing the restaurant’s state</li>
</ul></li>
<li><strong>Online Service</strong>: Service responsible for generating predictions in real time and making calls to the online stores and model server to do so. This service is written in Python.</li>
</ul><p>As hinted above, we rely heavily on Spark for building models, as well as for deriving additional features. It’s important to note, however, that the online service does not make use of Spark, which can result in different data access and manipulation patterns before data is fed into the model to make a prediction.</p><p>A lot of care goes into ensuring that the set of features we compute offline matches those we compute online. A theoretical example of a mismatched online/offline feature would be different orderings for one-hot encoded feature columns, which, despite having identical raw data, can result in different feature vectors.</p><p>Figure 2 (below) breaks down the model development, evaluation, and launch pipelines:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-11-architecting-wait-time-estimations/pipelines.png" alt="" /></div><h3 id="model-development-pipeline">Model development pipeline:</h3><p>At this stage, the model flows from human intuition/ideation to reality. This encompasses:</p><ul><li>Feature-extraction ETLs</li>
<li>Feature-set blueprints: Feature definitions intended to enforce online/offline consistency (e.g., what subset of features this particular feature-set contains, its data types, etc.)</li>
</ul><h3 id="evaluation-pipeline">Evaluation pipeline:</h3><p>This ensures that the newly trained model obtains an acceptable performance with regard to business metrics. This pipeline is a combination of automation and human decision making. For example, a metric could track the percentage of diners who waited more than five minutes beyond their quoted estimate.</p><p>The steps for evaluation include:</p><ul><li>Running an evaluation batch for the freshly trained model.</li>
<li>Comparing performance against previous benchmarks.</li>
<li>Evaluating if the new model is a viable candidate for experimentation/release. (Unfortunately not all candidates are viable; this can be attributed to the probabilistic nature of machine learning projects.)</li>
<li>The models that pass this stage promise superior performance compared to the status quo model.</li>
</ul><h3 id="experimentation-model-launch-pipeline">Experimentation/Model-Launch pipeline:</h3><p>At this stage, we’re convinced of the model’s promise and want to experiment with it in the real world. To maintain confidence that the model will operate in production as it did in offline evaluation, we promote the model to “dark-launch” mode.</p><p>To do this, we need to be able to reproduce the feature-set in the online service. This means:</p><ul><li>Each incoming user request contains a partial feature-set.</li>
<li>The rest of the features are pulled from online data stores.</li>
<li>The feature-set is guaranteed to maintain the same format as the training data (thanks to feature-blueprints).</li>
</ul><p>Once we have the ability to make predictions from the online service, we can proceed to the dark-launch phase. Here, we:</p><ul><li>Surface our candidate model as a ghost/dark model.</li>
<li>Enable the model to see live incoming requests and produce estimates for these requests (without surfacing them to the user).</li>
<li>Use the event logs generated from the experiment launch to measure the performance of all models across all samples.</li>
</ul><p>We’ve seen several benefits from dark-launching our model:</p><ul><li>Comparing performance across different cohorts of businesses without affecting estimates.</li>
<li>Weeding out any differences between the online and offline model pipelines. Since each performs its own set of computations, there’s plenty of scope for mistakes; dark-launching lets us verify that the offline pipeline and the dark-launched model give identical prediction estimates for the same candidate.</li>
<li>Checking the latency of the new model and ensuring we don’t violate any SLOs.</li>
</ul><p>At any given time, we can have several models launched live, several dark-launched, and several under development.</p><p>We can typically verify within a few days whether the dark-launched model is working as expected; if not, we can begin to investigate any discrepancies. If the results are as expected, we can slowly start rolling out the new model. This slow rollout is intended to capture feedback loops that we’re not exposed to during dark-launch.</p><h3 id="whats-a-feedback-loop">What’s a Feedback Loop?</h3><p>Whenever we surface an estimate to the user, we set an expectation of the time they’ll be seated, thereby affecting when the user shows up to the restaurant. If, for instance, this causes the user to arrive at the restaurant after their table is actually available, they may have a longer overall wait time (our label) than if we’d given them a shorter estimate. These instances are tracked in our logs and we try our best to reduce such inaccuracies. The feedback loop here happens when our label data is influenced by our prediction.</p><p>Factors like this add sensitivity to our system, which underscores the importance of providing accurate wait estimations.</p><h3 id="measuring-success">Measuring Success</h3><p>Within this problem area are a variety of metrics we can track, and choosing the right ones is always a challenge. We need to cater to the needs of not only the users (by ensuring they wait only as long as expected), but also the restaurant and its staff (not sending enough people to occupy empty tables vs. sending too many people at the same time, which puts pressure on the hosts).</p><p>We can observe a few of these metrics using data streamed into our logs. A few others can be gauged through user-feedback surveys (which are themselves prone to bias), and for whatever cannot be observed, we hypothesize. 
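To make the log-derived flavor of metric concrete, here is a minimal sketch of how the example mentioned earlier (the percentage of diners who waited more than five minutes beyond their quoted estimate) might be computed. All field names are hypothetical; this is not Yelp's actual log schema or tooling.

```python
from dataclasses import dataclass

@dataclass
class WaitRecord:
    """One seated-party log record. Field names are illustrative only."""
    quoted_minutes: float  # wait time quoted to the diner when they joined the line
    actual_minutes: float  # wait time actually observed before seating

def pct_over_quote(records, slack_minutes=5.0):
    """Fraction of diners who waited more than `slack_minutes` past their quote."""
    if not records:
        return 0.0
    over = sum(
        1 for r in records if r.actual_minutes > r.quoted_minutes + slack_minutes
    )
    return over / len(records)

records = [
    WaitRecord(quoted_minutes=20, actual_minutes=22),  # within the 5-minute slack
    WaitRecord(quoted_minutes=15, actual_minutes=25),  # 10 minutes over quote
    WaitRecord(quoted_minutes=30, actual_minutes=28),  # seated early
    WaitRecord(quoted_minutes=10, actual_minutes=40),  # 30 minutes over quote
]
```

A metric like this is easy to compare across live, dark-launched, and candidate models, since all of them produce quotes for the same logged requests.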
We’re constantly trying to collect as much data as possible to improve the coverage of each quantitative and qualitative metric.</p><p>Measuring success is not trivial, especially given that the set of restaurants we serve is constantly growing and providing more opportunities to observe new behavioral patterns. With each model that we build and deploy, we learn a little more about our system, helping us better measure success. So far this strategy has worked well for us.</p><h3 id="conclusion">Conclusion</h3><p>Wait-time estimation is a unique problem we could only begin to address because of the state-of-the-art tooling and support from the wonderful people at Yelp! We continue to make updates to the algorithms and migrate our system to use more efficient tooling to make our estimates as accurate as possible so that you - our customer - don’t have to wait longer than you need at your favorite restaurant.</p><h3 id="acknowledgements">Acknowledgements</h3><p>Huge thanks to my indispensable team for all their contributions: Chris Farrell, Steve Thomas, Steve Blass, Aditi Ganpule, Saeed Mahani, Kaushik Dutt, and Sanket Sharma.</p><div class="island job-posting"><h3>Become a Software Engineer at Yelp</h3><p>Passionate about solving problems with Machine Learning?</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/f674cef9-b635-4f25-8dd9-66663494392a?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2019/12/architecting-wait-time-estimations.html</link>
      <guid>https://engineeringblog.yelp.com/2019/12/architecting-wait-time-estimations.html</guid>
      <pubDate>Thu, 12 Dec 2019 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Streaming Cassandra into Kafka in (Near) Real-Time: Part 1]]></title>
      <description><![CDATA[<p>At Yelp, we use Cassandra to power a variety of use cases. As of the date of publication, there are 25 Cassandra clusters running in production, with deployments of varying sizes. The data stored in these clusters is often required as-is or in a transformed state by other use cases, such as analytics, indexing, etc. (for which Cassandra is not the most appropriate data store).</p><p>As seen in previous posts from our Data Pipeline series, Yelp has developed a robust connector ecosystem around its data stores to stream data both into and out of the Data Pipeline. This two-part post will dive into the Cassandra Source Connector, the application used for streaming data from Cassandra into the Data Pipeline.</p><h2 id="data-pipeline-recap">Data Pipeline Recap</h2><p>Yelp’s Data Pipeline is an abstraction on top of Apache Kafka (explained in <a href="https://engineeringblog.yelp.com/2016/07/billions-of-messages-a-day-yelps-real-time-data-pipeline.html">this blog post</a>) and is backed by a schema registry called <a href="https://engineeringblog.yelp.com/2016/08/more-than-just-a-schema-store.html">Schematizer</a>. It currently serves as the backbone of hundreds of use cases at Yelp, ranging from analytics and experimentation to notifications, ranking, and search indexing.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-04-csource-part-1/data-pipeline.jpeg" alt="The Data Pipeline Ecosystem at Yelp" /><p class="subtle-text"><small>The Data Pipeline Ecosystem at Yelp</small></p></div><p>Here’s a quick recap of the Data Pipeline:</p><ul><li>Data published into the Data Pipeline must be schematized. In essence, data cannot be published if it doesn’t have a predefined schema.</li>
<li>For data backed by data stores, the corresponding streams in the Data Pipeline must conform to the <a href="https://docs.confluent.io/3.1.1/streams/concepts.html#duality-of-streams-and-tables">stream-table duality</a>.</li>
<li>Every message in the Data Pipeline must contain the full content of an equivalent row in the data store. In addition, UPDATE and DELETE messages must also contain the previous snapshot of the equivalent row before the change.</li>
</ul><h2 id="challenges-with-streaming-data-from-cassandra">Challenges With Streaming Data From Cassandra</h2><p>Due to the nature of how Cassandra works, meeting the aforementioned Data Pipeline requirements can present some challenges.</p><h3 id="achieving-ordering-of-writes">Achieving Ordering of Writes</h3><p>Cassandra uses multiple replicas of data for availability. However, there’s no actual concept of a global replication stream. Each write is independently replicated, with all nodes eligible to coordinate. As a result, concurrent writes may be processed in different orders on different replicas. Cassandra uses several mechanisms (hinted handoffs, repairs, last write wins) to ensure that data is eventually consistent. Although the replicas eventually agree on the final value of the data, this does not resolve the differences in write order. Thus, the Cassandra Source Connector needs to provide write ordering guarantees similar to those of Cassandra itself.</p><h3 id="obtaining-complete-row-content">Obtaining Complete Row Content</h3><p>There’s no requirement for Cassandra writes to contain all table columns. Even if this were the case, the current state of the row would depend on both the data in the write and all previously written data that shadows it. Thus, the write data alone is not sufficient to determine the new row state.</p><h3 id="obtaining-previous-row-content">Obtaining Previous Row Content</h3><p>As is the case when determining the new row value, knowledge of the row state prior to a given mutation is required. This prior row state represents the accumulation of all previous writes.</p><h3 id="distributed-data-ownership">Distributed Data Ownership</h3><p>The ownership of data in Cassandra is distributed between the nodes in each datacenter. There’s no special “master”; all nodes are able to coordinate writes. 
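The last-write-wins behavior described above can be illustrated with a toy merge over per-column write timestamps. This is a deliberately simplified model (no tombstones, TTLs, or clustering keys), not Cassandra's implementation:

```python
def lww_merge(state, write):
    """Merge a (possibly partial) write into row state via last-write-wins.

    `state` and `write` map column name -> (value, timestamp). A toy model
    of timestamp-based reconciliation, not Cassandra's actual code.
    """
    merged = dict(state)
    for column, (value, ts) in write.items():
        # Keep whichever cell carries the higher write timestamp.
        if column not in merged or ts >= merged[column][1]:
            merged[column] = (value, ts)
    return merged

# Two concurrent writes to the same row, with distinct timestamps.
w1 = {"name": ("Alice", 100)}
w2 = {"name": ("Bob", 101), "city": ("SF", 101)}

# Replicas may apply them in different orders...
replica_a = lww_merge(lww_merge({}, w1), w2)
replica_b = lww_merge(lww_merge({}, w2), w1)
# ...yet both converge on the same final row state.
```

Note that although both replicas end at the same state, each observed a different intermediate state (one briefly held "Alice"), which is exactly the write-order discrepancy a change-capture consumer has to contend with.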
Thus, processing these writes to a cluster involves combining information from multiple nodes.</p><h2 id="possible-approaches">Possible Approaches</h2><p>Several approaches were considered when designing the Cassandra Source Connector. <a href="https://wecode.wepay.com/posts/streaming-cassandra-at-wepay-part-1">This post</a> by WePay gives a solid description of the primary streaming options available along with the pros and cons of each, including:</p><ul><li>Writing to both Cassandra and Kafka (“Double Writing”)</li>
<li>Writing directly to Kafka and using a Cassandra Sink to load the data in Cassandra (“Kafka as Event Source”)</li>
<li>Processing the commit log exposed by Cassandra’s Change Data Capture or CDC (“Parsing Commit Logs”)</li>
</ul><p>The use of Kafka Connect’s <a href="https://docs.lenses.io/connectors/source/cassandra.html">Cassandra Source</a> was also investigated. This connector streams data from a Cassandra table into Kafka using either “bulk” or “incremental” update modes. Both modes function by periodically polling the table for data. Bulk mode performs a full table scan, publishing the entire result, while incremental mode queries the rows written since the last sampling. Both modes have their disadvantages:</p><ul><li>Bulk mode table scans are very expensive on large tables, and each scan publishes a lot of duplicate data.</li>
<li>Incremental mode is only viable for a certain type of workload. The writes must be append-only with monotonically increasing columns (such as timestamps) as part of the primary key. Additionally, polling for this data can cause extra cluster load.</li>
</ul><p>Ultimately, a solution based on processing Cassandra CDC made the most sense for the connector.</p><p>Cassandra’s distributed deployment characteristics, coupled with the need to both achieve an ordering of writes and meet Data Pipeline semantics, made creating a single application quite challenging. Thus, the Cassandra Source Connector was built as two separate components, each addressing a subset of these issues:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-04-csource-part-1/csource-high-level.jpeg" alt="Cassandra Source Connector at a High Level" /><p class="subtle-text"><small>Cassandra Source Connector at a High Level</small></p></div><p><strong>CDC Publisher</strong>: A service running locally on Cassandra nodes that uses CDC to publish raw Cassandra writes into intermediate Kafka streams. These streams serve as unified commit logs, removing the aspect of distributed data ownership and defining an order of events to process.</p><p><strong>Data Pipeline Materializer</strong> (<strong>DP Materializer</strong>): An application running on Apache Flink which processes raw Cassandra writes produced by the CDC Publisher and publishes them as Data Pipeline messages.</p><h2 id="cdc-publisher">CDC Publisher</h2><p>The CDC Publisher produces all writes made in Cassandra tables as serialized partition updates into table-specific Kafka streams.</p><h3 id="processing-cassandra-writes-with-cdc">Processing Cassandra Writes with CDC</h3><p>The <a href="http://cassandra.apache.org/doc/latest/operating/cdc.html">Change Data Capture (CDC)</a> capability introduced in version 3.8 of Cassandra is used by the CDC Publisher to process writes.</p><p>Normally (with CDC disabled), writes are stored by Cassandra in the following manner:</p><ul><li>Client writes are persisted to memtables and the commit log by every node</li>
<li>Memtables are periodically flushed to SSTables on disk</li>
</ul><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-04-csource-part-1/cassandra-write-path.jpeg" alt="Cassandra Write Path" /><p class="subtle-text"><small>Cassandra Write Path</small></p></div><p>The commit log is composed of a series of fixed-size files (32MB by default) called “commit log segments”. Once the memtables are flushed to SSTables, these segments are discarded by Cassandra.</p><p>If CDC is enabled, all Cassandra commit log segment files containing writes to a tracked table are flagged. When the files are no longer referenced by corresponding memtables, they’re moved into a separate directory (instead of being discarded).</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-04-csource-part-1/cassandra-write-path-with-cdc.jpeg" alt="Cassandra Write Path with CDC" /><p class="subtle-text"><small>Cassandra Write Path with CDC</small></p></div><p>There are several challenges with using the current implementation of Cassandra’s CDC:</p><ul><li>Per-node processing: As each node stores only a portion of the complete table data, CDC must be processed on multiple nodes.</li>
<li>Replication: The same write is stored on each data replica, resulting in duplicate processing.</li>
<li>Partial data: Commit log segments only contain the information from incoming writes and do not have the full view of the corresponding rows.</li>
<li>CDC does not contain schema information about the tables.</li>
<li>CDC directory size limit: If the CDC directory gets too large in size, the node will reject new table writes.</li>
<li>Poorly bounded latency: Commit log segments must be full and no longer referenced by memtables before being made available for processing. For clusters with low write rates, the commit log segments can take a while to fill up, affecting latency.</li>
</ul><p>Despite these drawbacks, CDC was used because it is the solution developed by the Cassandra open source community for processing committed data. This also means that any future improvements to the CDC implementation can be leveraged by upgrading Cassandra versions.</p><h3 id="wrangling-cdc">Wrangling CDC</h3><h4 id="deployment">Deployment</h4><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-04-csource-part-1/region-deployment.jpeg" alt="CDC Datacenter Deployment" /><p class="subtle-text"><small>CDC Datacenter Deployment</small></p></div><p>To ensure that processing CDC doesn’t cause any performance issues on the actual cluster, a virtual Cassandra datacenter is created, which is logically separate from the standard region-specific datacenters. The CDC Publisher is deployed only on the nodes of this datacenter. As all writes go to data replicas in all datacenters, this is sufficient to ensure coverage of all table changes. Additionally, nodes in this datacenter can be provisioned differently as they don’t serve live client read requests.</p><h4 id="bounding-latency">Bounding Latency</h4><p>As mentioned earlier, one of the issues with using CDC is that the latency (defined as the time between the write to Cassandra and the data being made available for processing) is poorly bounded. CDC only allows processing of commit log files that are no longer needed, meaning they should be full and not referenced by an existing memtable. To introduce predictable latency bounds to the connector, the following approaches were adopted:</p><h6 id="removing-memtable-references">Removing Memtable References</h6><p>Memtables are periodically flushed by Cassandra to SSTables when they get too large. However, a table with a low write rate will rarely be flushed, thus delaying CDC processing for the whole cluster. 
To ensure this does not happen, an explicit flush of all memtables is triggered at periodic intervals (typically 5-10 minutes) for nodes in the CDC datacenter. This ensures that a full commit log segment will only wait, at most, one flush interval before it can be processed. As only the CDC datacenter nodes are flushed, there’s no impact on client read performance in the other datacenters.</p><h6 id="filling-segments">Filling Segments</h6><p>Commit log segment sizes are fixed. If the tracked table has a slow write rate, it may be a while before a segment completely fills up. This fill-up time is bounded by a separate process (outside the CDC Publisher) that writes to a “filler” table at a predictable rate. This table is replicated only within the CDC datacenter, where it is fully replicated to all nodes. To limit any performance impact, fewer large writes (~100K) are performed, only a single key is written to, and the data is aggressively TTL’ed.</p><h3 id="processing-cdc">Processing CDC</h3><p>To aid with the processing of CDC commit log segments, the Cassandra library provides a handler interface for applications to implement. This interface allows processing of a stream of all mutations (writes) present in a commit log segment. The <em>Mutation</em> class is the Java object Cassandra uses to represent data, namely:</p><ul><li>A <em>Mutation</em> contains <em>PartitionUpdate</em> objects for multiple tables</li>
<li>A <em>PartitionUpdate</em> contains <em>Row</em> objects for a single partition key value</li>
<li>A <em>Row</em> contains data for a single clustering key value</li>
</ul><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-04-csource-part-1/mutations.jpeg" alt="Structure of a Cassandra Mutation" /><p class="subtle-text"><small>Structure of a Cassandra Mutation</small></p></div><p>The primary function of the CDC Publisher is to break these mutations up into individual PartitionUpdate objects. If a PartitionUpdate contains multiple rows, these are further broken down into a series of updates with single rows. Thus, each update contains data only for a single Cassandra primary key.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-04-csource-part-1/mutation-breakdown.jpeg" alt="Breakdown of a Mutation into Individual Row Objects" /><p class="subtle-text"><small>Breakdown of a Mutation into Individual Row Objects</small></p></div><p>Each of the resulting PartitionUpdate objects is serialized for publishing to Kafka streams. Serializers provided by the Cassandra library are used for serialization before publishing.</p><h3 id="publishing-to-kafka">Publishing to Kafka</h3><p>The PartitionUpdate payloads are used to build messages to publish to the intermediate Kafka stream. Each message includes:</p><ul><li>The serialized PartitionUpdate</li>
<li>The Cassandra messaging version used for serialization</li>
<li>Metadata for auditing (host, file, position, etc.)</li>
</ul><p>The messages are then published to table-specific Kafka streams. A stream can have multiple partitions for scalable publishing, in which case messages are routed to Kafka partitions based on the Cassandra partition key. Thus, all writes for a single partition key will end up in the same topic-partition.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-04-csource-part-1/cdc-publisher-pkey-partitioning.jpeg" alt="Publishing CDC to a Multi-Partition Kafka Topic" /><p class="subtle-text"><small>Publishing CDC to a Multi-Partition Kafka Topic</small></p></div><h4 id="intermediate-kafka-streams">Intermediate Kafka Streams</h4><p>The resulting Kafka streams contain all writes to the tracked Cassandra tables. As all updates to a primary key reside in the same topic partition, this sets an ordering of writes for each key.</p><p>While there’s no guarantee events will be in writetime order, there’s also no guarantee that writes will commit to a Cassandra replica in writetime order. Additionally, there will be a duplicate write copy for each data replica. Even though this is the case, the intermediate streams act as unified commit logs for the tables. They provide an order of events per key that can be deterministically processed into the ordered stream of row updates needed for publishing to the Data Pipeline.</p><h4 id="stream-consistency">Stream Consistency</h4><p>Given that the connector uses the Cassandra write path, the resulting Kafka stream can be no more consistent than the underlying datastore. As writes are published from each replica in their local commit order, the processed stream should initially be no less consistent than reading from a single replica. As data from additional replicas is processed, the stream becomes eventually consistent.
When all replicas have published updates, the consistency will be equivalent to a read covering all CDC datacenter nodes.</p><p>How quickly this eventual consistency is reached is determined by the write consistency level used by the Cassandra clients. If an update must show up in the stream immediately, a high consistency level (e.g., EACH_QUORUM) must be used to ensure commits to nodes in the CDC datacenter. If a lower/local consistency is used for writes, the PartitionUpdate may not appear in the output stream (in the worst case) until the next table repair. Note that this is in line with the guarantees given to clients reading Cassandra directly.</p><h2 id="whats-next">What’s Next?</h2><p>At this point, the intermediate Kafka streams contain Cassandra PartitionUpdate objects, partitioned by key and loosely ordered. These objects must now be deserialized, converted into ordered Data Pipeline messages, and published into the pipeline. This is done through the DP Materializer.</p><p>The DP Materializer will be covered in the second half of this two-part post. Stay tuned!</p><div class="post-gray-box">This post is part of a series covering Yelp's real-time streaming data infrastructure. Our series explores in-depth how we stream MySQL and Cassandra data in real time, how we automatically track &amp; migrate schemas, how we process and transform streams, and finally how we connect all of this into datastores like Redshift, Salesforce, and Elasticsearch.<p>Read the posts in the series:</p><ul><li><a title="Billions of Messages a Day - Yelp's Real-time Data Pipeline" href="https://engineeringblog.yelp.com/2016/07/billions-of-messages-a-day-yelps-real-time-data-pipeline.html">Billions of Messages a Day - Yelp's Real-time Data Pipeline</a></li>
<li><a title="Streaming MySQL tables in real-time to Kafka" href="https://engineeringblog.yelp.com/2016/08/streaming-mysql-tables-in-real-time-to-kafka.html">Streaming MySQL tables in real-time to Kafka</a></li>
<li><a title="More Than Just a Schema Store" href="https://engineeringblog.yelp.com/2016/08/more-than-just-a-schema-store.html">More Than Just a Schema Store</a></li>
<li><a title="PaaStorm: A Streaming Processor" href="https://engineeringblog.yelp.com/2016/08/paastorm-a-streaming-processor.html">PaaStorm: A Streaming Processor</a></li>
<li><a title="Data Pipeline: Salesforce Connector" href="https://engineeringblog.yelp.com/2016/09/data-pipeline-salesforce-connector.html">Data Pipeline: Salesforce Connector</a></li>
<li><a title="Streaming Messages from Kafka into Redshift in near Real-Time" href="https://engineeringblog.yelp.com/2016/10/redshift-connector.html">Streaming Messages from Kafka into Redshift in near Real-Time</a></li>
<li><a title="Open-Sourcing Yelp's Data Pipeline" href="https://engineeringblog.yelp.com/2016/11/open-sourcing-yelps-data-pipeline.html">Open-Sourcing Yelp's Data Pipeline</a></li>
<li><a title="Making 30x Performance Improvements on Yelp’s MySQLStreamer" href="https://engineeringblog.yelp.com/2018/02/making-30x-performance-improvements-on-yelps-mysqlstreamer.html">Making 30x Performance Improvements on Yelp’s MySQLStreamer</a></li>
<li><a title="Black-Box Auditing: Verifying End-to-End Replication Integrity between MySQL and Redshift" href="https://engineeringblog.yelp.com/2018/04/black-box-auditing.html">Black-Box Auditing: Verifying End-to-End Replication Integrity between MySQL and Redshift</a></li>
<li><a title="Fast Order Search Using Yelp’s Data Pipeline and Elasticsearch" href="https://engineeringblog.yelp.com/2018/06/fast-order-search.html">Fast Order Search Using Yelp’s Data Pipeline and Elasticsearch</a></li>
<li><a title="Joinery: A Tale of Un-Windowed Joins" href="https://engineeringblog.yelp.com/2018/12/joinery-a-tale-of-unwindowed-joins.html">Joinery: A Tale of Un-Windowed Joins</a></li>
<li><a title="Streaming Cassandra into Kafka in (Near) Real-Time: Part 1" href="https://engineeringblog.yelp.com/2019/12/cassandra-source-connector-part-1.html">Streaming Cassandra into Kafka in (Near) Real-Time: Part 1</a></li>
</ul></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2019/12/cassandra-source-connector-part-1.html</link>
      <guid>https://engineeringblog.yelp.com/2019/12/cassandra-source-connector-part-1.html</guid>
      <pubDate>Thu, 05 Dec 2019 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Organizing and Securing Third-Party CDN Assets at Yelp]]></title>
      <description><![CDATA[<p>At Yelp, we use a <a href="http://engineeringblog.yelp.com/2015/03/using-services-to-break-down-monoliths.html">service-oriented architecture</a> to serve our web pages. This consists of a lot of frontend services, each of which is responsible for serving different pages (e.g., the search page or a business listing page).</p><p>In these frontend services, we use a couple of third-party JavaScript/CSS assets (<a href="https://reactjs.org">React</a>, <a href="https://babeljs.io/docs/en/babel-polyfill">Babel polyfill</a>, etc.) to render our web pages. We chose to serve such assets using a third-party Content Delivery Network (CDN) for better performance.</p><p>In the past, if a frontend service needed to use a third-party JavaScript/CSS asset, engineers had to hard-code its CDN URL. For example:</p><div class="language-html highlighter-rouge highlight"><pre>&lt;script
  src="https://cdnjs.cloudflare.com/ajax/libs/jquery/1.8.3/jquery.min.js"
&gt;&lt;/script&gt;
</pre></div><p>With hundreds of engineers working at Yelp, it was difficult to ensure the following (for each third-party asset):</p><ul><li><code class="highlighter-rouge">&lt;script&gt;</code> or <code class="highlighter-rouge">&lt;link&gt;</code> tags had a subresource integrity checksum via the <code class="highlighter-rouge">integrity</code> attribute <em>(see the section on <a href="https://engineeringblog.yelp.com#subresource-integrity-checksums">Subresource integrity checksums</a> below)</em></li>
<li>URLs used the HTTPS protocol</li>
<li>Only public CDN providers (approved by our security team) were used</li>
<li>Engineers could update to the latest versions easily</li>
</ul><p>Here at Yelp, we’ve built our frontend services using a Python service stack, with <a href="https://trypyramid.com">Pyramid</a> as our web framework and <a href="https://uwsgi-docs.readthedocs.io/en/latest">uWSGI</a> as our web server.</p><p>We created a shared Python package, <code class="highlighter-rouge">cdn_assets</code>, for storing the URLs and subresource integrity checksums of our third-party JavaScript/CSS assets.</p><p>For each asset, we simply used a Python dictionary with the asset’s semantic version as the key. For example:</p><div class="language-py highlighter-rouge highlight"><pre># React (facebook.github.io/react)
CDN_SCRIPT_REACT = {
    '16.8.6': CDNAsset.construct_asset(
        cdn=CDNDomain.CDNJS,
        library='react',
        version='16.8.6',
        filename='umd/react.production.min',
        filename_unminified='umd/react.development',
        extension='js',
        integrity='sha384-qn+ML/QkkJxqn4LLs1zjaKxlTg2Bl/6yU/xBTJAgxkmNGc6kMZyeskAG0a7eJBR1',
        integrity_unminified='sha384-u6DTDagyAFm2JKvgGBO8jWd9YzrDzg6FuBPKWkKIg0/GVA6HM9UkSxH2rzxEJ5GF',
    ),
    '16.8.5': CDNAsset.construct_asset(
        # … similar properties for this version
    ),
    # … more versions…
}
# Babel Polyfill (babeljs.io/docs/usage/polyfill)
CDN_SCRIPT_BABEL_POLYFILL = {
    '6.23.0': CDNAsset.construct_asset(
        cdn=CDNDomain.CDNJS,
        library='babel-polyfill',
        version='6.23.0',
        filename='polyfill.min',
        filename_unminified='polyfill',
        extension='js',
        integrity='sha384-FbHUaR69a828hqWjPw4PFllFj1bvveKOTWORGkyosCw720HXy/56+2hSuQDaogMb',
        integrity_unminified='sha384-4L0QKU4TUZXBNNRtCIbt9G73L2fXYHnzgCjL65qwFxsXPvuAf1aB6D3X+LIflqu3',
    ),
    # … more versions…
}
# … more assets…
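# Illustrative helper (not part of the original package): with plain
# "X.Y.Z" version strings as dict keys, the newest entry can be selected
# numerically rather than lexicographically.
def latest_version(asset_versions):
    return max(asset_versions, key=lambda v: tuple(int(x) for x in v.split('.')))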
</pre></div><h2 id="usage">Usage</h2><p>Here’s a Python code snippet which shows how the asset is included in our <a href="https://github.com/Yelp/yelp_cheetah">Yelp-Cheetah</a> templates:</p><div class="language-py highlighter-rouge highlight"><pre>CDN_SCRIPT_REACT['16.8.6'].generate_script_tag(minified=True)
# returns &lt;script src="https://cdnjs.cloudflare.com/ajax/libs/react/16.8.6/umd/react.production.min.js" integrity="sha384-qn+ML/QkkJxqn4LLs1zjaKxlTg2Bl/6yU/xBTJAgxkmNGc6kMZyeskAG0a7eJBR1" crossorigin="anonymous"&gt;&lt;/script&gt;
</pre></div><h2 id="scaffolding-infrastructure">Scaffolding Infrastructure</h2><p>To facilitate ease of use and maintenance, we developed some scaffolding infrastructure to:</p><ul><li>Define public CDN providers (e.g., <a href="https://cdnjs.com/about">Cloudflare CDNJS</a>, <a href="https://developers.google.com/speed/libraries">Google CDN</a>, etc.)</li>
<li>Render minified scripts &amp; styles in the production environment and unminified scripts &amp; styles in the development environment</li>
<li>Create a helpful <code class="highlighter-rouge">generate_script_tag</code> method, which allows consumers of this package to easily generate an HTML <code class="highlighter-rouge">&lt;script&gt;</code> tag with the correct subresource integrity SHA <em>(see the section on <a href="https://engineeringblog.yelp.com#comparing-cryptographic-hash-functions">Comparing cryptographic hash functions</a> below)</em></li>
</ul><p>We made it easy for engineers to add a new version by creating a <a href="https://www.gnu.org/software/make"><code class="highlighter-rouge">make</code></a> target to calculate the integrity checksum, like so:</p><div class="language-sh highlighter-rouge highlight"><pre># Usage: make sri-hash --urls="URL1[ URL2 ... URLn]"
$ make sri-hash --urls="https://cdnjs.cloudflare.com/ajax/libs/react/16.8.6/umd/react.production.min.js"
sha384-qn+ML/QkkJxqn4LLs1zjaKxlTg2Bl/6yU/xBTJAgxkmNGc6kMZyeskAG0a7eJBR1
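
# Without the make target, the same checksum can be produced with standard
# tools (assumes curl and openssl are installed); prefix the output with
# "sha384-" to form the integrity value:
$ curl -s https://cdnjs.cloudflare.com/ajax/libs/react/16.8.6/umd/react.production.min.js \
    | openssl dgst -sha384 -binary | openssl base64 -A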
</pre></div><h2 id="testing">Testing</h2><p>We wrote tests which iterate over all versions of all assets to ensure that:</p><ul><li>URLs point to a valid asset on the CDN</li>
<li>Integrity SHA checksums are correct</li>
<li>URLs begin with <code class="highlighter-rouge">https://</code> and end with <code class="highlighter-rouge">.js</code> or <code class="highlighter-rouge">.css</code></li>
</ul><p>Here’s a snippet from one of our test files:</p><div class="language-py highlighter-rouge highlight"><pre>import base64
import hashlib

import pytest
import requests

# `all_cdn_scripts` is a Pytest fixture; it’s not shown in this snippet.
# (URLValidator, used below, comes from a URL-validation library such as Django’s.)
@pytest.mark.parametrize('script', all_cdn_scripts)
def test_integrity_hashes_match(script):
    # Test that the unminified URL doesn’t error and has the right integrity hash.
    resp = requests.get(script.url_unminified)
    resp.raise_for_status()
    assert (
        'sha384-{}'.format(base64.b64encode(hashlib.sha384(resp.content).digest()).decode('utf8')) ==
        script.integrity_unminified
    )
    # Test that the minified URL doesn’t error and has the right integrity hash.
    resp = requests.get(script.url)
    resp.raise_for_status()
    assert (
        'sha384-{}'.format(base64.b64encode(hashlib.sha384(resp.content).digest()).decode('utf8')) ==
        script.integrity
    )

def test_sha384_for_all_checksums(all_cdn_scripts):
    SHA384_CHECKSUM_LENGTH = 64
    for cdn_script in all_cdn_scripts:
        assert cdn_script.integrity.startswith('sha384-')
        assert cdn_script.integrity_unminified.startswith('sha384-')
        checksum = cdn_script.integrity.replace('sha384-', '')
        assert len(checksum) == SHA384_CHECKSUM_LENGTH
        checksum = cdn_script.integrity_unminified.replace('sha384-', '')
        assert len(checksum) == SHA384_CHECKSUM_LENGTH

def test_valid_https_urls(all_cdn_scripts):
    https_url_validator = URLValidator(schemes=['https'], message='HTTPS URL validation failed')
    for cdn_script in all_cdn_scripts:
        https_url_validator(cdn_script.url)

def test_valid_script_files(all_cdn_scripts):
    for cdn_script in all_cdn_scripts:
        assert cdn_script.url.endswith('.js')

def test_minified_and_unminified_urls(all_cdn_scripts):
    for cdn_script in all_cdn_scripts:
        assert cdn_script.url.endswith('.min.js')
        assert not cdn_script.url_unminified.endswith('.min.js')
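
# Hypothetical additional test (not in the original post) covering the
# "approved public CDN providers" requirement from the checklist above;
# the approved host set here is only an example.
def test_approved_cdn_providers(all_cdn_scripts):
    from urllib.parse import urlparse
    approved_hosts = {'cdnjs.cloudflare.com', 'ajax.googleapis.com'}
    for cdn_script in all_cdn_scripts:
        assert urlparse(cdn_script.url).netloc in approved_hosts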
</pre></div><p>Yelp serves tens of millions of users every month. Ensuring that these users are protected should an attacker gain control of the CDN we’re using is of prime importance. That’s where subresource integrity checksums come into the picture.</p><h2 id="subresource-integrity-checksums">Subresource Integrity Checksums</h2><p>The <a href="https://developer.mozilla.org/docs/Web">web docs on Mozilla Developer Network</a> define <a href="https://developer.mozilla.org/docs/Web/Security/Subresource_Integrity">Subresource Integrity</a> as:</p><blockquote>
<p>A security feature that enables browsers to verify that resources they fetch (for example, from a CDN) are delivered without unexpected manipulation. It works by allowing you to provide a cryptographic hash that a fetched resource must match.</p>
</blockquote><p>Support for subresource integrity checksum verification is achieved by adding an <a href="https://developer.mozilla.org/en-US/docs/Web/HTML/Element/script#attr-integrity"><code class="highlighter-rouge">integrity</code></a> attribute on the <code class="highlighter-rouge">&lt;script&gt;</code> or <code class="highlighter-rouge">&lt;link&gt;</code> tags. For example:</p><div class="language-html highlighter-rouge highlight"><pre>&lt;script
  src="https://cdnjs.cloudflare.com/ajax/libs/react/16.8.6/umd/react.production.min.js"
  integrity="sha384-qn+ML/QkkJxqn4LLs1zjaKxlTg2Bl/6yU/xBTJAgxkmNGc6kMZyeskAG0a7eJBR1"
  crossorigin="anonymous"
&gt;&lt;/script&gt;
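
&lt;!-- The integrity attribute works the same way on stylesheets loaded via a
     link tag; the href and hash below are placeholders, not real values: --&gt;
&lt;link
  rel="stylesheet"
  href="https://cdnjs.cloudflare.com/ajax/libs/EXAMPLE/1.0.0/example.min.css"
  integrity="sha384-EXAMPLE"
  crossorigin="anonymous"
/&gt;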
</pre></div><p>The web browser will calculate a hash from the contents of the <code class="highlighter-rouge">&lt;script&gt;</code> or <code class="highlighter-rouge">&lt;link&gt;</code> tag. It will then compare this hash with the <code class="highlighter-rouge">integrity</code> attribute’s value. If they don’t match, the browser will stop the <code class="highlighter-rouge">&lt;script&gt;</code> or <code class="highlighter-rouge">&lt;link&gt;</code> tag from executing.</p><p>As per the <a href="https://www.w3.org/TR/SRI/#cryptographic-hash-functions">Subresource Integrity (SRI) specification</a>:</p><blockquote>
<p>Conformant user agents must support the SHA-256, SHA-384 and SHA-512 cryptographic hash functions for use as part of a request’s integrity metadata and may support additional hash functions.</p>
</blockquote><p>Although both SHA-256 and SHA-512 are supported, we recommend using the SHA-384 cryptographic hash function for the integrity attribute. This is largely because SHA-384 is <a href="https://en.wikipedia.org/wiki/SHA-2#cite_note-9">less susceptible</a> to <a href="https://en.wikipedia.org/wiki/Length_extension_attack">length extension attacks</a>. (See <a href="https://github.com/w3c/webappsec/issues/477">github.com/w3c/webappsec — SRI: upgrade examples to sha384?</a> and <a href="https://github.com/mozilla/srihash.org/issues/155">github.com/mozilla/srihash.org — Why SHA384?</a> for further information.)</p><h2 id="always-using-https-for-loading-cdn-assets">Always Using HTTPS for Loading CDN Assets</h2><p>At Yelp, we’ve migrated web traffic to be served exclusively using <a href="https://en.wikipedia.org/wiki/HTTPS">HTTPS</a> and <a href="https://en.wikipedia.org/wiki/HTTP_Strict_Transport_Security">HSTS</a>. If you’re interested in learning more, check out these excellent blog posts by my colleagues: <a href="https://engineeringblog.yelp.com/2016/09/great-https-migration.html">The Great HTTPS Migration</a> and <a href="https://engineeringblog.yelp.com/2017/09/the-road-to-hsts.html">The Road To HSTS</a>.</p><h3 id="protocol-relative-urls">Protocol Relative URLs</h3><p>It’s recommended to use HTTPS while serving CDN assets instead of protocol-relative URLs. Quoting the article <a href="https://www.paulirish.com/2010/the-protocol-relative-url">“The Protocol-relative URL”</a> by <a href="https://www.paulirish.com">Paul Irish</a>:</p><blockquote>
<p>Now that SSL is <a href="https://www.eff.org/encrypt-the-web-report">encouraged for everyone</a> and <a href="https://istlsfastyet.com">doesn’t have performance concerns</a>, this technique is now an anti-pattern. If the asset you need is available on SSL, then always use the https:// asset. Allowing the snippet to request over HTTP opens the door for attacks like the <a href="http://www.netresec.com/?page=Blog&amp;month=2015-03&amp;post=China%27s-Man-on-the-Side-Attack-on-GitHub">recent Github Man-on-the-side attack</a>. It’s always safe to request HTTPS assets even if your site is on HTTP, however the reverse is not true. More guidance and details in <a href="https://github.com/konklone/cdns-to-https#conclusion-cdns-should-redirect-to-https">Eric Mills’ guide to CDNs &amp; HTTPS</a> and <a href="https://www.digitalgov.gov/2015/08/14/secure-central-hosting-for-the-digital-analytics-program">digitalgov.gov’s writeup on secure analytics hosting</a>.</p>
</blockquote><p>The work described in this blog post has been carried out and supported by numerous members of the Engineering Team here at Yelp. Particular credit goes to engineers on our Core Web Infrastructure (Webcore) team.</p><div class="island job-posting"><h3>Become a Software Engineer at Yelp</h3><p>Want to help us make even better tools for our full stack engineers?</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/bd07a618-9b6f-4920-91c6-99280f1b268d?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2019/11/organizing-and-securing-third-party-cdn-assets-at-yelp.html</link>
      <guid>https://engineeringblog.yelp.com/2019/11/organizing-and-securing-third-party-cdn-assets-at-yelp.html</guid>
      <pubDate>Wed, 20 Nov 2019 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Remember Clusterman? Now It's Open-Source, and Supports Kubernetes Too!]]></title>
      <description><![CDATA[<p>Earlier this year, I wrote a <a href="https://engineeringblog.yelp.com/2019/02/autoscaling-mesos-clusters-with-clusterman.html">blog post</a> showing off some cool features of our in-house compute cluster autoscaler, Clusterman (our Cluster Manager). This time, I’m back with two announcements that I’m really excited about! Firstly, in the last few months, we’ve added another supported backend to Clusterman; so not only can it scale Mesos clusters, it can also scale Kubernetes clusters. Second, Clusterman is now open-source on <a href="https://github.com/Yelp/clusterman">GitHub</a> so that you, too, can benefit from advanced autoscaling techniques for your compute clusters. If you prefer to just read the code, you can head there now to find some examples and documentation on how to use it; and if you’d like to know a bit more about the new features and why we’ve built them, read on!</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/open-source-clusterman/clusterman_logo.png" alt="" /></div><h2 id="going-from-mesos-to-kubernetes">Going from Mesos to Kubernetes</h2><p>Over the last five years, we’ve <a href="https://www.youtube.com/watch?v=tXbLMRhLQQE">talked</a> (and <a href="https://engineeringblog.yelp.com/2015/11/introducing-paasta-an-open-platform-as-a-service.html">written</a>) a lot about our compute stack at Yelp; we’ve gone from our monolithic <code class="highlighter-rouge">yelp_main</code> repo to a fully-distributed, service-oriented architecture running in the cloud on top of Apache Mesos and our in-house platform-as-a-service, <a href="https://github.com/Yelp/paasta">PaaSTA</a>. And, truthfully, without that move, we wouldn’t have been able to grow to the scale that we are now. 
We’ve been hard at work this year preparing our infrastructure for even more growth, and realized that the best way to achieve this is to move away from Mesos and onto Kubernetes.</p><p>Kubernetes allows us to run workloads (Flink, Cassandra, Spark, and Kafka, among others) that were once difficult to manage under Mesos (due to local state requirements). We strongly believe that managing these workloads under a common platform (PaaSTA) will boost our infrastructure engineers’ output by an order of magnitude (can you imagine spinning up a new Cassandra cluster with just a few lines of YAML? We can!).</p><p>In addition, we’re migrating all of our existing microservices and batch workloads onto Kubernetes. This was a point of discussion at Yelp, but we eventually settled on this approach as both a way to reduce the overhead of maintaining two competing schedulers (Mesos and Kubernetes), and to take advantage of the fast-moving Kubernetes ecosystem. Thanks to the abstractions that PaaSTA provides, we’ve been able to do this migration seamlessly! Our feature developers don’t know their service is running on top of an entirely different compute platform.</p><p>Of course, to make this migration possible, we need to build support for Kubernetes into all our tooling around our compute clusters, including our very important autoscaler, Clusterman. Due to Clusterman’s modular design, this was easy! We simply defined a new connector class that conforms to the interface the autoscaler expects. This connector knows how to talk to the Kubernetes API server to retrieve metrics and statistics about the state of the Kubernetes cluster it’s scaling. These metrics are then saved in our metrics data store and fed to the signals and autoscaling engine to determine how to add or remove compute resources.</p><h2 id="why-clusterman--why-now">Why Clusterman?
Why Now?</h2><p>We’re big proponents of open-source software at Yelp; we benefit from the efforts of many other open-source projects and release what we can back into the community. Ever since Clusterman’s inception, we’ve had the dream of open-sourcing it, and now that it has support for Kubernetes, there’s no better time to do so!</p><p>Whenever a project like this is released, the first question people ask is, “Why should I use your product instead of this other, established one?” Two such products are the <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-fleet-automatic-scaling.html">AWS Auto Scaling for Spot Fleet</a> and the <a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler">Kubernetes Cluster Autoscaler</a>. So let’s compare and contrast Clusterman with them:</p><table><thead><tr><th class="c1">Clusterman</th>
<th class="c1">Auto Scaling for Spot Fleet</th>
<th class="c1">Kubernetes Cluster Autoscaler</th>
</tr></thead><tbody><tr><td class="c2"><em>Supports any type of cloud resource (ASGs, spot fleets, etc)</em></td>
<td class="c2">Only for Spot Fleets</td>
<td class="c2">Only supports homogeneous cloud resources (all compute resources must be identical)</td>
</tr><tr><td class="c2"><em>Pluggable signal architecture</em></td>
<td class="c2">Three different scaling choices: target tracking, step functions, or time-based</td>
<td class="c2">Scales the cluster when pods are waiting to be scheduled</td>
</tr><tr><td class="c2"><em>Can proactively autoscale to account for delays in node bootstrapping time</em></td>
<td class="c2">No proactive scaling</td>
<td class="c2">Waits for nodes to join the cluster before continuing</td>
</tr><tr><td class="c2">Basic Kubernetes support</td>
<td class="c2">No knowledge of Kubernetes</td>
<td class="c2"><em>Supports advanced features like node and pod affinity</em></td>
</tr><tr><td class="c2"><em>Can simulate autoscaling decisions on production data</em></td>
<td class="c2">No simulator</td>
<td class="c2">No simulator</td>
</tr><tr><td class="c2"><em>Extensible (open-source)</em></td>
<td class="c2">Closed-source API</td>
<td class="c2"><em>Extensible (open-source)</em></td>
</tr></tbody></table><p>A few highlights we’d like to call out: firstly, note that Clusterman is the only autoscaler that can support a mixture of cloud resources (Spot Fleets, Auto-Scaling Groups, etc.); it can even handle such a mixture within a single cluster! This allows for a very flexible infrastructure design.</p><p>Moreover, Clusterman’s pluggable signal architecture lets you write any type of scaling signal you can imagine (and write in code). At Yelp, we generally believe that the Kubernetes Cluster Autoscaler approach (scale up when pods are waiting) is right for “most use cases,” but having the flexibility to create more complex autoscaling behavior is really important to us. One example of how we’ve benefitted from this capability is Jolt, an internal tool for running unit and integration tests. The Jolt cluster runs millions of tests every day, and has a very predictable workload; thus, we wrote a custom signal that allows us to scale up and down before pods get queued up in the “waiting” state, which saves our developers a ton of time running tests! To put it another way, the Kubernetes Cluster Autoscaler is reactive, but Clusterman has enough flexibility to be proactive and scale up before resources are required.</p><p>To be fair, not everyone needs the ability to make complex autoscaling decisions; many users will be just fine using something like the AWS Spot Fleet Autoscaler or Kubernetes Cluster Autoscaler. Fortunately for these users, Clusterman can be easily swapped in as needed. For example, it can be configured to read all of the same node labels that the Kubernetes Cluster Autoscaler does, and behave appropriately. Also note that the Kubernetes Cluster Autoscaler does support some Kubernetes features that Clusterman doesn’t (yet) know about, like pod affinity and anti-affinity.
But we’re constantly adding new features to Clusterman, and of course, pull requests are always welcome!</p><h2 id="want-to-know-more">Want to Know More?</h2><p>If you’re as excited as we are about this release, we encourage you to head over to our <a href="https://github.com/Yelp/clusterman">GitHub</a> and check it out! Give it a star if you like it, and if you have any questions about getting Clusterman set up in your environment, feel free to open an issue or send us an email! Also, we’d love to hear any success stories you have about autoscaling with Clusterman, or Kubernetes in general; you can reach us on Twitter (<a href="https://twitter.com/YelpEngineering">@YelpEngineering</a>) or on Facebook (<a href="https://www.facebook.com/pg/yelpengineers/photos/">@yelpengineers</a>).</p><hr /><p>David is going to be at KubeCon 2019 and will happily talk your ear off about Clusterman and Kubernetes; ping him on <a href="https://twitter.com/drmorr0">Twitter</a> or find him in the hallway track.</p><hr /><div class="island job-posting"><h3>Become an Infrastructure Engineer at Yelp</h3><p>Want to work on exciting projects like Clusterman? Apply here!</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/7f3e2412-3736-473e-95ff-5d11a9190080?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2019/11/open-source-clusterman.html</link>
      <guid>https://engineeringblog.yelp.com/2019/11/open-source-clusterman.html</guid>
      <pubDate>Mon, 11 Nov 2019 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Inside TensorFlow]]></title>
      <description><![CDATA[<p>It’s probably not surprising that Yelp utilizes deep neural networks in its quest to connect people with great local businesses. One example is the selection of photos you see in the Yelp app and website, where neural networks try to identify the best quality photos for the business displayed. A crucial component of our deep learning stack is <a href="https://www.tensorflow.org/">TensorFlow</a> (TF). In the process of deploying TF to production, we’ve learned a few things that may not be commonly known in the Data Science community.</p><p>TensorFlow’s success stems not only from its popularity within the machine learning domain, but also from its design. It’s very well-written and has been extensively tested and documented (you can read the documentation offline by simply cloning its <a href="https://github.com/tensorflow/docs">repository</a>). You don’t have to be a machine learning expert to enjoy reading it, and even experienced software engineers can learn a thing or two from it.</p><h2 id="building-tensorflow">Building TensorFlow</h2><p>You can start using TF without the extra build steps by installing the Python package from <a href="https://pypi.org/project/tensorflow/">pypi.org</a>. Doing it this way is straightforward, but also means you won’t have access to any optimization features. Here’s an example of what this can look like in practice:</p><div class="language-bash highlighter-rouge highlight"><pre>$ python3 -c 'import tensorflow as tf; tf.Session().list_devices()' 2&gt;&amp;1 | grep -oE 'Your CPU .*'
Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
</pre></div><p>If you want to hack TF (the second part of this post explains how), then in order to test your changes, you’ll have to build the package yourself. So, assuming you’re interested in building TF for your own requirements, or perhaps with your own code changes, here’s a compilation of hints on how to make it a relatively painless experience. <em>Note: this is not a step-by-step recipe; obvious points (like “copy the <a href="http://github.com/tensorflow/tensorflow">sources</a>”, and “read the <a href="https://www.tensorflow.org/install/source">documentation</a>”) are not included!</em></p><p>We recommend building TensorFlow inside containers like <a href="https://docs.docker.com">Docker</a> or <a href="https://podman.io">Podman</a>. The TF project uses Docker for both continuous integration and <a href="https://hub.docker.com/r/tensorflow/tensorflow">official images</a>. You’ll find Dockerfiles and documentation for the latter in the <code class="highlighter-rouge">tensorflow/tools/dockerfiles</code> directory. However, it is the Continuous Integration (CI) setup that is of more interest in the context of building TF, so make sure to read <code class="highlighter-rouge">tensorflow/tools/ci_build/README.md</code> and check out other files in this directory. Using containers to build TF makes it easier to consistently install all required packages and helps ensure the builds are reproducible (a critical requirement of CI).</p><p>A major required package for building TF is the <a href="https://bazel.build">Bazel Build system</a> (it’s possible, but not recommended, to use make instead of Bazel; for instructions see <code class="highlighter-rouge">tensorflow/contrib/make/README.md</code>). In addition to Bazel, other TF dependencies can be found inside the <code class="highlighter-rouge">configure.py</code> script (in the project root directory). 
TF also depends on a number of Python packages, all of which are listed inside the <code class="highlighter-rouge">tensorflow/tools/pip_package/setup.py</code> file (look for <code class="highlighter-rouge">REQUIRED_PACKAGES</code>). Important among those is NumPy, which may require you to install an extra package in the operating system, such as the <code class="highlighter-rouge">libatlas3-base</code> package for Ubuntu users. Additionally, if you want to build TF for GPU, you’ll need either CUDA with cuDNN (for NVIDIA) or ROCm (for AMD, which we have not tried) installed inside your container. The simplest way to ensure that all CUDA dependencies are present is to use the <a href="https://hub.docker.com/r/nvidia/cuda">official nvidia images</a> as your container base, as demonstrated in the <code class="highlighter-rouge">tensorflow/tools/ci_build/Dockerfile.gpu</code> file.</p><p>You’ll need to execute <code class="highlighter-rouge">configure.py</code> before the actual build. The script will ask many questions, such as “Please specify which C compiler should be used.” For a scripted build, the answer to all questions can be automated with “<code class="highlighter-rouge">yes |</code>” (as demonstrated in <code class="highlighter-rouge">tensorflow/tools/ci_build/builds/configured</code>). Also, if you read the <code class="highlighter-rouge">configure.py</code> source, you’ll quickly discover that individual questions can be suppressed with environment variables, such as <code class="highlighter-rouge">HOST_C_COMPILER</code>. Among these, a very useful variable is <code class="highlighter-rouge">CC_OPT_FLAGS</code>, which by default contains “<code class="highlighter-rouge">-march=native -Wno-sign-compare</code>”. If you want to use the resulting package on a CPU model different from the one where you run your build, you should replace “native” with a <a href="https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#x86-Options">more appropriate value</a>. 
The output of <code class="highlighter-rouge">configure.py</code> is the <code class="highlighter-rouge">.tf_configure.bazelrc</code> file, which you may want to look into.</p><p>After the initial configuration step, you’ll need to run “<code class="highlighter-rouge">bazel build</code>” with options to build TF binaries (but not its Python wheel - yet!). The selection of <a href="https://www.tensorflow.org/install/source#build_the_pip_package">Bazel options</a> can be a little tricky, but the script <code class="highlighter-rouge">tensorflow/tools/ci_build/ci_build.sh</code> may give you some ideas. The build typically takes between 30–60 minutes (or longer when CUDA is enabled) on 40 CPUs - it is quite a large project! After this step is completed, you still need to build the Python wheels. As explained in the documentation, this step is actually performed by the “<code class="highlighter-rouge">build_pip_package</code>” binary you’ve just built!</p><p>Here’s an example of what the above steps may look like in a Dockerfile:</p><div class="language-dockerfile highlighter-rouge highlight"><pre>RUN curl -L https://github.com/bazelbuild/bazel/releases/download/${BAZEL_VER}/bazel-${BAZEL_VER}-installer-linux-x86_64.sh --output bazel.sh &amp;&amp;
    bash bazel.sh --prefix=/opt/bazel &amp;&amp;
    rm bazel.sh
ENV PATH ${PATH}:/opt/bazel/bin
RUN curl -L https://github.com/tensorflow/tensorflow/archive/${VERSION}.tar.gz | tar xz --strip-components=1
ENV TF_NEED_CUDA 0
ENV CC_OPT_FLAGS -mtune=intel -march=haswell -Wno-sign-compare
RUN tensorflow/tools/ci_build/builds/configured CPU
RUN cat .tf_configure.bazelrc
RUN bazel build --config=opt  //tensorflow/tools/pip_package:build_pip_package
RUN bazel-bin/tensorflow/tools/pip_package/build_pip_package /tensorflow
</pre></div><p>This of course implies that you’ll want to actually build TF with a “<code class="highlighter-rouge">docker build</code>”. This may seem counterintuitive at first (running Bazel in the context of “<code class="highlighter-rouge">docker run</code>” will be a more natural choice to some, and in fact will be required for the incremental build), but is actually quite useful as it lets you re-run the build very quickly if no changes have been made, and you don’t have to worry about the build directory. Just remember to “<code class="highlighter-rouge">docker run</code>” with the <code class="highlighter-rouge">--user</code> option to copy your Python wheels out of the container image afterwards.</p><h2 id="tensorflow-project-structure">TensorFlow project structure</h2><p>There are two important top-level directories in the TF project: <code class="highlighter-rouge">tensorflow</code> and <code class="highlighter-rouge">third_party</code>. The latter contains TF dependencies (which you may want to check out). While the list is rather extensive and some third-party libraries can alternatively be brought in as system dependencies (you may see them inside <code class="highlighter-rouge">third_party/systemlibs/syslibs_configure.bzl</code>), our focus is going to be on the <code class="highlighter-rouge">tensorflow</code> directory. It may not be immediately apparent, but most of the TF functionality is, at the lowest level, implemented in C++. This is what the <code class="highlighter-rouge">tensorflow/core</code> directory is for. Next, this low-level functionality is exported as a public API to various programming languages inside directories named after each language. Most TF users are familiar with the Python API inside the <code class="highlighter-rouge">tensorflow/python</code> directory, but there are also subdirectories for C, C++, Java and Go. 
Knowing your way around the Python subdirectory can help you find useful pieces of information without the need to seek external documentation. For example, to find the constants used by selu activation, you can look in <code class="highlighter-rouge">tensorflow/python/keras/activations.py</code>. Another useful Python subdirectory is <code class="highlighter-rouge">debug</code>. If you’ve ever wondered what the computation graph of your deep learning model looks like, then file <code class="highlighter-rouge">tensorflow/python/debug/README.md</code> is a good start. There are also some very useful tools inside the (you guessed it!) <code class="highlighter-rouge">tensorflow/python/tools</code> directory.</p><p>Some C++ functions are imported by Python with the <a href="http://www.swig.org/tutorial.html">SWIG</a> file <code class="highlighter-rouge">tensorflow/python/tensorflow.i</code>, which in turn includes <code class="highlighter-rouge">*.i</code> files in various subdirectories. As you’ll see, most of these files have an accompanying <code class="highlighter-rouge">*.cc</code> with implementation, which in turn include headers from the <code class="highlighter-rouge">tensorflow/core</code> directory (and also from the <code class="highlighter-rouge">tensorflow/c</code> public API directory). However, SWIG is only used for low-level functions, and TF focuses mostly on high-level operations. These are coded and registered in the <code class="highlighter-rouge">tensorflow/core</code> directory as so-called “ops” (look for <code class="highlighter-rouge">REGISTER_OP</code> macro; the majority of ops are inside the <code class="highlighter-rouge">ops</code> subdirectory). Ops are imported by language APIs using their name. 
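<p>As a rough illustration of how a registered op name maps to its Python wrapper, here is a sketch of the CamelCase-to-snake_case conversion. TensorFlow’s real converter lives in its code-generation tooling and handles more edge cases (such as runs of capitals), so treat this as an approximation:</p>

```python
import re

def op_to_python_name(op_name):
    """Approximate the Python wrapper name for a registered op,
    e.g. ApplyGradientDescent -> apply_gradient_descent."""
    # Insert an underscore at every lowercase/digit -> uppercase boundary,
    # then lowercase the whole name.
    with_underscores = re.sub(r'([a-z0-9])([A-Z])', r'\1_\2', op_name)
    return with_underscores.lower()

print(op_to_python_name('ApplyGradientDescent'))  # apply_gradient_descent
```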
Note that in Python, the spelling of each op is changed, replacing CamelCase with snake_case (for example, <code class="highlighter-rouge">ApplyGradientDescent</code> from <code class="highlighter-rouge">tensorflow/core/ops/training_ops.cc</code> is imported inside <code class="highlighter-rouge">tensorflow/python/training/gradient_descent.py</code> as <code class="highlighter-rouge">apply_gradient_descent</code>). Other language APIs refer to ops using the original CamelCase names.</p><p>The C++ implementation of each op is coded in the so-called “kernel” (there can be separate kernels for CPU and GPU as demonstrated in <code class="highlighter-rouge">tensorflow/core/kernels/fact_op.cc</code>), which is then mapped to an op with a <code class="highlighter-rouge">REGISTER_KERNEL_BUILDER</code> macro. Most kernels reside inside the <code class="highlighter-rouge">tensorflow/core/kernels</code> directory. For example, <code class="highlighter-rouge">ApplyGradientDescent</code> is implemented in <code class="highlighter-rouge">tensorflow/core/kernels/training_ops.cc</code>. Unit tests for kernels are written in Python and reside either inside the <code class="highlighter-rouge">tensorflow/python/kernel_tests</code> directory or next to their Python API wrapper, in “*_test.py” files. For example, unit tests for <code class="highlighter-rouge">ApplyGradientDescent</code> are coded in <code class="highlighter-rouge">tensorflow/python/training/training_ops_test.py</code>.</p><p>A complete list of ops is available in two locations: the <code class="highlighter-rouge">tensorflow/core/api_def</code> directory and the <code class="highlighter-rouge">tensorflow/core/ops/ops.pbtxt</code> file. As you can see, TF defines a considerable number of ops which explains the large size of its binary. When building TF, you can minimize its size by enabling only selected ops. 
This is documented inside the <code class="highlighter-rouge">tensorflow/core/framework/selective_registration.h</code> file (note, this is an experimental feature). Interestingly, you don’t need to maintain a fork of TF if you want to add your own custom ops. Instead, TF’s design allows for an external project to extend TF with new functionality. This is demonstrated in the <a href="https://github.com/tensorflow/addons/">TensorFlow Addons project</a>.</p><p>Finally, you may want to check the content of the <code class="highlighter-rouge">tensorflow/core/platform</code> directory. There, you can find files not specific to TensorFlow, but rather low-level operating-system or network-protocol functionality. Files shared by all platforms reside in this directory, but there are also several platform-specific subdirectories. For example, if you’re troubleshooting an S3-related issue, there’s an “<code class="highlighter-rouge">S3</code>” subdirectory to help you. This code is very well-written and potentially useful outside of the TF project (but please do check the license first!). For a high-level overview of the TensorFlow architecture, we recommend you check the official <a href="https://www.tensorflow.org/guide/extend/architecture">documentation</a>.</p><p>We hope you’ll find this collection of hints useful when playing with TensorFlow or deploying it in your machine learning workflow!</p><h3 id="note">Note</h3><p><em>Neither Yelp nor the author of this post is affiliated with Google or TensorFlow authors.</em></p><div class="island job-posting"><h3>Become a Machine Learning Engineer at Yelp</h3><p>Want to build state of the art machine learning systems at Yelp? 
Apply to become a Machine Learning Engineer today.</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/f674cef9-b635-4f25-8dd9-66663494392a?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2019/11/inside-tensorflow.html</link>
      <guid>https://engineeringblog.yelp.com/2019/11/inside-tensorflow.html</guid>
      <pubDate>Fri, 08 Nov 2019 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Winning the Hackathon with Sourcegraph]]></title>
      <description><![CDATA[<p><em>Visualizing how code is used across the organization is a vital part of our engineers’ day-to-day workflow - and we have a *lot* of code to search through! This blog post details our journey of adopting Sourcegraph at Yelp to help our engineers maintain and dig through the tens of gigabytes of data in our git repos!</em></p><hr /><p>Here at Yelp, we maintain hundreds of internal services and libraries that power our website and mobile apps. Examples include our mission-critical “<em>emoji service</em>” which helps translate and localize emojis, as well as our “<em>homepage service</em>” which… you guessed it, serves our venerable homepage, yelp.com!</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-11-01-winning-the-hackathon-with-sourcegraph/yelp_homepage.jpg" alt="Yelp homepage" /><p class="subtle-text"><small>Yelp homepage</small></p></div><h3 id="dont-break-the-website">Don’t Break the Website</h3><p>Imagine you’re a developer tasked with implementing an exciting new feature. Perhaps you need to change the interface of the “<code class="highlighter-rouge">getBusinesses</code>” API endpoint to power a dedicated <em>Find Desserts Near Me</em> button on the homepage. “Piece of cake!” you say to yourself, as you add new parameters to alter the response of the shared resource. In order to not break <em>the rest</em> of the website though, you figure it’s best to see where other code is calling this endpoint so you can create a design that works for all use cases and doesn’t break existing call sites.</p><p>We have over 100,000 Python files alone to power Yelp - that’s a lot of code to search through! In order to figure out a safe rollout plan, we need to scan through all of our existing code to understand where and how the method is being called across multiple git repositories. So how can we do this?</p><p>Combined, our git repositories amount to tens of gigabytes of data. 
So cloning everything down locally whenever you want to perform a search is not a viable solution. Instead, we do this in the background as a scheduled process on a subset of our development machines, powered by <a href="https://github.com/asottile/all-repos">all-repos</a>. Some folks use this workflow, stringing together xargs and git grep, etc. into many homegrown bash scripts. A web interface (historically cgits and opengrok) is generally a more convenient go-to tool for browsing and searching code.</p><p>Tools like this are essential to our workflow. And since we’re always on the lookout for ways we can improve the developer experience at Yelp, we want the best-in-class tool for the job!</p><p>We first heard about <a href="https://about.sourcegraph.com/">Sourcegraph</a> at a React meetup hosted at Yelp. There was a discussion around how different companies view and search code, and Sourcegraph was introduced as an interesting-looking new search tool. One of the participants pulled up sourcegraph.com to demonstrate its capabilities. We tried a couple of searches using the repo and file regex filters and jumped around the codebase using the Jump to Definition feature. Coming from other tools and homegrown scripts, this was a huge step up in the developer experience! It stood out as a clear win on that front, and we decided to look into it some more and see how we could maybe bring Sourcegraph to Yelp.</p><p>We validated the idea to see if it was worth pursuing by first setting it up locally. Sourcegraph is conveniently distributed as a docker image, so we were able to get a proof-of-concept running quickly and share it out with a small group of people. The feedback was positive! 
After using it regularly for a few weeks, we felt that the code browsing experience had been improved significantly and we pushed on to try and roll it out to the rest of Yelp!</p><h2 id="productionizing-sourcegraph">Productionizing Sourcegraph</h2><p>At Yelp, we run a biannual <a href="https://engineeringblog.yelp.com/2018/11/all-about-yelp-hackathon.html">Hackathon</a> – an opportunity for engineers to “scratch their creative itch” on projects outside of their day-to-day work. It was during one of these Hackathons that we started to productionize Sourcegraph at Yelp - which meant graduating the Sourcegraph instance from running on a local machine to being deployed on our PaaS platform, <a href="https://engineeringblog.yelp.com/amp/2015/11/introducing-paasta-an-open-platform-as-a-service.html">PaaSTA</a>. By the end of the three days, we had Sourcegraph ready for the whole company to try out.</p><p>The feedback was great, and Sourcegraph was well received. We even won an award!</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-11-01-winning-the-hackathon-with-sourcegraph/award.jpg" alt="A coveted Hackathon trophy" /><p class="subtle-text"><small>A coveted Hackathon trophy</small></p></div><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-11-01-winning-the-hackathon-with-sourcegraph/demo.jpg" alt="Showing off Sourcegraph to Yelpers at the Hackathon “Science Fair”" /><p class="subtle-text"><small>Showing off Sourcegraph to Yelpers at the Hackathon “Science Fair”</small></p></div><p>Once Sourcegraph was up and running at Yelp, we had to decide whether we wanted to invest more in the product to get features such as Code Intelligence. To come to this decision, we surveyed developers on how they liked Sourcegraph compared to other code search/viewing tools we were using, and the results heavily favored Sourcegraph. 
<strong>70% of developers rated Sourcegraph as very good, and 51% of developers were already using Sourcegraph exclusively as their preferred code analysis tool.</strong> As a result of this feedback, we decided to make Sourcegraph the singular supported tool at Yelp for code search and viewing!</p><h2 id="shipping-code-faster-with-sourcegraph">Shipping Code Faster with Sourcegraph</h2><p>Sourcegraph empowers developers at Yelp to ship code faster and more reliably than ever before. <a href="https://docs.sourcegraph.com/user/code_intelligence">Code intelligence</a> features such as Go-to-Definition and Find References are heavily used and enable developers to understand the plethora of microservices and libraries in our code base. When making large changes, Sourcegraph is the way to discover how your code is being called throughout the rest of the code base. Sourcegraph has also been helpful for onboarding new hires and introducing them to the code base.</p><p>Sourcegraph has proven to be one of the most useful tools for making mass code migrations and deprecations. A quick search can help scope out the magnitude of the change and the difficulty of implementing it, while also providing an easy way to track the progress of long-running migrations and deprecations.</p><p>Sourcegraph’s GraphQL API has also proved to be useful for tooling we have built in-house. Developers at Yelp have used the Sourcegraph API to power services such as our internal npm registry and flaky test analysis engine, both of which heavily utilize source control metadata.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-11-01-winning-the-hackathon-with-sourcegraph/stats.jpg" alt="Daily active users of Sourcegraph at Yelp" /><p class="subtle-text"><small>Daily active users of Sourcegraph at Yelp</small></p></div><h2 id="future-work">Future Work</h2><p>We are evaluating running Sourcegraph as a clustered deployment. 
While we are currently able to serve all Sourcegraph usage on a single host, we are looking into running all of Sourcegraph’s different services individually. This would allow us to scale up more resource-intensive instances of Sourcegraph’s services. We are planning to put it on Kubernetes, an initiative that is underway for a lot of Yelp’s infrastructure.</p><h2 id="written-by">Written By</h2><ul><li>Mark Larah, Software Engineer (<a href="https://twitter.com/mark_larah">@mark_larah</a>)</li>
<li>Dennis Coldwell, Engineering Manager</li>
<li>Kevin Chen, Software Engineer</li>
</ul><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp. If you're interested, apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/a0fc4d3d-1fd2-495b-94d4-cc2ed1d80cf3?description=Software-Engineer-New-Grad-Backend_College-Engineering-Product_San-Francisco-CA?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2019/11/winning-the-hackathon-with-sourcegraph.html</link>
      <guid>https://engineeringblog.yelp.com/2019/11/winning-the-hackathon-with-sourcegraph.html</guid>
      <pubDate>Fri, 01 Nov 2019 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Beyond Labels: Stories of Asian Pacific Islanders at Yelp]]></title>
      <description><![CDATA[<p>During <em>Asian Pacific American Heritage Month</em>, ColorCoded (a Yelp employee resource group) hosted a panel discussion called <strong>“Beyond Labels: Stories of Asian Pacific Islanders (API)* at Yelp.”</strong></p><p>We heard stories from five API Yelpers about their cultural backgrounds, identities, and thoughts on what it means to be an API in today’s world. Their stories helped us understand that identity is both multilayered and contextual, and that individuality goes beyond labels.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2019-10-28-beyond-labels-stories-of-asian-pacific-islanders-at-yelp/api-blog-image.jpg" alt="Beyond Labels: Stories of Asian Pacific Islanders at Yelp" /></p><p>Read more from their unique perspectives below.</p><h4 id="tenzin-kunsal-events--partnerships-engineering-recruiting">Tenzin Kunsal, Events + Partnerships, Engineering Recruiting</h4><p>From a young age, I knew the concept of “home” was complicated. Like many refugees, my family called multiple countries home. My grandparents left my first home, Tibet, in the 1960s, after it was taken over by China. My second home, India, is where I was born and where I grew up, in a Tibetan refugee community. I was not automatically granted Indian citizenship, so for the first few years of my life, I was state-less, born without a country. That was until 1996, when Minneapolis became my third home. Soon after, I became an American citizen and finally officially “belonged” to a country. Growing up, this was all very confusing. I never felt like I fully fit in anywhere. It wasn’t until college that I started to accept the multifacetedness of my identity and that it’s okay to call multiple places “home.”</p><h4 id="nivedita-mittal-software-engineer-reader-experience">Nivedita Mittal, Software Engineer, Reader Experience</h4><p>I moved to the U.S. four years ago to get my Master’s in Computer Science. 
Since then, it’s been a journey of self-discovery. When I moved from Mumbai to Boston, I always said “I’m from Mumbai, India.” Then, after moving to San Francisco, it became “I’m from Boston.” Something that has always stuck with my identity is how my immigration status defined whether I “belonged.” Whether it’s finding a job that sponsors your H-1B visa, or filling out your green card, defining who you are and whether you belong in the first place is an ongoing insecurity. It didn’t help that during grad school, every conversation I had with other international students revolved around my visa situation. The same applied to recruiting conversations with companies—I would always get questions like, “Did you get your H-1B yet? Did they file your green card already?” Once this is all said and done, I wonder if I’ll finally find that sense of belonging, or whether it’ll still be a conscious thought in my head to remind people that I belong here.</p><h4 id="gabe-ramos-director-corpeng">Gabe Ramos, Director, CorpEng</h4><p>I identify as Filipino American, a person of color, and a Hapa. “Hapa” is a Hawaiian word that’s used to describe people who are part Asian and part Caucasian. Growing up in the Bay Area, I bounced around schools that had different ethnic make-ups. People often can’t tell what race I am. When I was in a predominantly Black and Latino school, classmates teased me for being “white.” When I was in a mostly white Palo Alto public school, classmates teased me for being “Japanese” because they didn’t know what race I was. I felt like I was between worlds because I didn’t pass for white yet often didn’t feel Filipino enough. Learning about different racial identities in college was pivotal for me. I have a liberal arts background, and my education really helped me learn about other Asian Americans’ experiences, the history of racial violence in the U.S., and anti-miscegenation laws. This helped me gain more of a sense of shared history. 
Most importantly, this empowered me to feel more ownership over my opinions of my own racial and cultural identity.</p><h4 id="julie-truong-software-engineer-restaurant-plan">Julie Truong, Software Engineer, Restaurant Plan</h4><p>From my last name, you may assume that I’m Vietnamese; I’m actually Chinese. My family immigrated from China to Vietnam (and later to the U.S.), and in order to blend in, my paternal grandfather changed our last name. My family is a mix of Chinese and Vietnamese cultures. At any given family gathering, you can hear English, Cantonese, and Vietnamese—all within the span of a couple minutes. I grew up in a primarily Latinx/Black/Samoan/Filipino neighborhood in the East Bay. When I was younger, I had an idea of what being a “cool Asian” entailed, and Chinese people weren’t necessarily portrayed in this light. So I actually wished I were Filipino, just like the cool kids in school. Now, as an adult living in the Bay Area, I feel I’m actually quite privileged. There’s a large Asian American population here, and I don’t have to think about my cultural identity very often. Interestingly, I find I have to think more about my gender and sexual orientation and how these parts of my identity show up in my personal and professional life.</p><h4 id="wing-yung-vice-president-engineering">Wing Yung, Vice President, Engineering</h4><p>I grew up near Arcadia, California, in a community with many other Asian Americans. Most of my classmates in public school were like me—our parents immigrated here, and we were born here. I can speak three dialects of Chinese (poorly): Mandarin (which I learned through lessons), Cantonese (which my parents speak at home because they grew up in Hong Kong), and Wenzhounese (my grandparents’ dialect). Throughout college I became more aware of my Asian identity, but didn’t seek out opportunities to explore it. Early on in my career at IBM, one of my managers sent me to an Asian leadership development program. 
In retrospect, it was one of the first times I became aware that leadership comes in many forms. I’m very much aware of the fact that I’m often the only (or one of the few) Asians in leadership settings. It’s important to me to be a role model for others so that they know there are paths to these roles.</p><h3 id="conclusion">Conclusion</h3><p>What ties all of these stories together is a sense of belonging that impelled us to redefine our identities on our own terms. Finding the right communities and support groups was critical for our journeys of self-discovery. The process of preparing for this panel was in itself extremely empowering, as it allowed us to dig deeper and reflect on what makes us who we are. Opportunities like these provide a platform to learn about others’ experiences and to realize how much representation influences our lives. It’s important to remind ourselves that sharing these stories makes us stronger and is an important part of cultivating community.</p><p>Want to be a part of the dialogue? Here are a few steps you can take right now!</p><ul><li>Join a resource group/meetup/support group that focuses on diversity and inclusion. We have <a href="https://www.yelp.com/careers/who-we-are">employee resource groups</a> here at Yelp, including ColorCoded, Diverseburst, and Awesome Women in Engineering (AWE).</li>
<li>For a more personal conversation, grab coffee with someone who identifies as an API to hear more about their journey.</li>
</ul><p>*In the context of this conversation, API stands for Asian Pacific Islanders—people with origins in Asia or the Pacific Islands.</p><div class="island job-posting"><h3>Engineering at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/3021acac-2237-4288-bb84-73e770fc2c90?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2019/10/beyond-labels-stories-of-asian-pacific-islanders-at-yelp.html</link>
      <guid>https://engineeringblog.yelp.com/2019/10/beyond-labels-stories-of-asian-pacific-islanders-at-yelp.html</guid>
      <pubDate>Mon, 28 Oct 2019 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Open sourcing spark-redshift-community]]></title>
      <description><![CDATA[<p>At Yelp, we are heavy users of both Spark and Redshift. We’re excited to announce <a href="https://github.com/spark-redshift-community/spark-redshift">spark-redshift-community</a>, a fork of <a href="https://databricks.com">databricks</a>’ original <a href="https://github.com/databricks/spark-redshift">spark-redshift</a> project.</p><p>spark-redshift is a Scala package that uses Amazon S3 to efficiently read and write data between AWS Redshift and Spark DataFrames. After the open source project was abandoned in 2017, the community struggled to keep dependencies updated and bugs fixed. The situation came to a head with the release of Spark 2.4, which was incompatible with the latest spark-redshift. Developers looking for a solution turned to online threads on sites like Stack Overflow and GitHub, but none of the answers offered even a simple workaround.</p><p>At Yelp, it was only a matter of time before we jumped into action. The inability to upgrade Spark from 2.3.3 to 2.4 meant that:</p><ul><li>We could not use highly sought-after features from Spark 2.4,</li>
<li>
<p>Our move to Kubernetes was at risk. To run our infrastructure on Kubernetes, we needed Spark 2.4:</p>
<blockquote>
<p>“Spark can run on clusters managed by <a href="https://kubernetes.io/">Kubernetes</a>. This feature makes use of native Kubernetes scheduler that has been added to Spark [2.4].” <sup id="fnref:1"><a href="https://engineeringblog.yelp.com#fn:1" class="footnote">1</a></sup></p>
</blockquote>
</li>
</ul><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/2019-10-25-open-sourcing-spark-redshift-community/nounprojbuildsoftware.png" class="c1" alt="image" /></div><p>The <a href="https://github.com/snowflakedb/spark-snowflake">spark-snowflake</a> open source project is a stable spark-redshift fork for Snowflake. We considered adapting spark-snowflake to work with Redshift, but the estimated effort was higher than forking and upgrading the original spark-redshift. At databricks’ suggestion, we did exactly that.</p><p>We focused on porting the functionality we use the most, like performant reads from Redshift. Given the timeline and workload, we had to make tradeoffs and support only a subset of features. While some made the cut (reading from Redshift, parsing of various data types, implementing an InMemoryS3AFileSystem for testing), others didn’t (Postgres driver support, AWS IAM Authentication, some SaveMode options). We have already seen great internal adoption, and several teams are now unblocked in their move to Spark 2.4.</p><p>Our plans for the future include supporting the project by focusing on the features we use the most, in the hope that the community will carry forward the features it finds useful. As its name suggests, <a href="https://github.com/spark-redshift-community/spark-redshift">spark-redshift-community</a> is a project for the community. Any support in the form of GitHub issues or pull requests is very welcome.</p><div class="footnotes"><ol><li id="fn:1">
<p><a href="https://spark.apache.org/docs/latest/running-on-kubernetes.html">https://spark.apache.org/docs/latest/running-on-kubernetes.html</a> <a href="https://engineeringblog.yelp.com#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol></div><div class="island job-posting"><h3>Become a Backend (Big Data) Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/461e8999-1bb8-4d37-9212-da7558ebdc21?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2019/10/open-sourcing-spark-redshift-community.html</link>
      <guid>https://engineeringblog.yelp.com/2019/10/open-sourcing-spark-redshift-community.html</guid>
      <pubDate>Fri, 25 Oct 2019 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Redesigning Yelp for Apple Watch with SwiftUI]]></title>
      <description><![CDATA[<p>At this year’s WWDC, Apple unveiled <a href="https://developer.apple.com/xcode/swiftui/">SwiftUI</a>, a framework that helps developers build declarative user interfaces. At Yelp, we were immediately excited about it and looked for a way to start adopting it. We decided that our Apple Watch application was the perfect candidate for modernization and set out to explore a redesign with this new framework.</p><p>One of the things we pride ourselves on at Yelp is the quality of our content. Yelp users have posted hundreds of millions of reviews and photos. As we set out to re-imagine the user interface for our Apple Watch app, we knew that our gorgeous photos should be the star.</p><p>Here is a side-by-side comparison of the old interface and the new one.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-10-21-redesigning-yelp-for-apple-watch-with-swiftui/image1.png" alt="Star ratings as of October 16, 2019" /><p class="subtle-text"><small>Star ratings as of October 16, 2019</small></p></div><p>As you can see, we’ve adopted an interface similar to the Audiobooks and Music apps, which put a strong emphasis on the thumbnail image. Users of the Apple Watch Series 5 will also see a compass showing the direction and distance to each business in their search results. We hope this will help users in their search for great local businesses near them.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-10-21-redesigning-yelp-for-apple-watch-with-swiftui/image2.gif" alt="Star ratings as of October 16, 2019" /><p class="subtle-text"><small>Star ratings as of October 16, 2019</small></p></div><p>In contrast with WatchKit, SwiftUI gives us much more freedom when building our user interface. It feels much more like developing for the iPhone, with the added constraint of designing for a small screen. 
One notable aspect of the search listings design is its simplicity in code. This scrollable card stack took less than 120 lines of code, animations included! The magic resides in the custom <a href="https://developer.apple.com/documentation/swiftui/viewmodifier">view modifiers</a> you can create and apply to your SwiftUI views. Let’s dive into a simplified example.</p><p>Here is a slightly simplified modifier that shifts the cards vertically and doesn’t handle any scaling down or rotation.</p><p>Given a cardOffset that represents the difference between the current index and the card’s index, we return a custom view modifier that offsets the view’s origin on the y-axis and modifies its opacity if it goes into the background. Our own implementation also adds a scale effect for the impression of depth, and a zRotation effect to give the animation more flavor when the cards are scrolled off-screen.</p><p>Now that we have view modifiers, let’s create the scrollable stack.</p><p>We create a <a href="https://developer.apple.com/documentation/swiftui/zstack">ZStack</a> that fills the remaining screen space left by the Spacer. We then compute the cardOffset needed to return the correct view modifier, and apply the modifiers to their respective cards.</p><p>SwiftUI smoothly interpolates the animation parameters for the offset and opacity whenever the modifier changes for a given card. This means the animation logic is handled for us whenever the current index is changed within an animation block. Since this code hooks into the digitalCrownRotation modifier and passes the animated binding that represents the current index, the animation is performed automatically when the crown is rotated. How convenient!</p><p>This redesign made us eager to see where Apple will take the framework, and what we’ll be able to build with it in the coming years. 
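The modifier and card stack described above can be sketched roughly as follows. This is a hypothetical reconstruction, not Yelp’s actual implementation: the type names (CardOffsetModifier, CardStack, CardView) and the offset, opacity, and sizing constants are all illustrative assumptions, and the real code also adds the scale and rotation effects mentioned above.

```swift
import SwiftUI

/// Shifts a card down the y-axis based on its distance from the current
/// index, and hides it once it scrolls into the background.
/// (Illustrative sketch; constants and names are assumptions.)
struct CardOffsetModifier: ViewModifier {
    /// Difference between this card's index and the currently selected index.
    let cardOffset: CGFloat

    func body(content: Content) -> some View {
        content
            .offset(y: cardOffset * 20)       // stack the cards vertically
            .opacity(cardOffset < 0 ? 0 : 1)  // fade cards scrolled into the background
    }
}

/// A single search-result card with placeholder content.
struct CardView: View {
    let title: String

    var body: some View {
        Text(title)
            .frame(maxWidth: .infinity, minHeight: 100)
            .background(Color.gray.opacity(0.3))
            .cornerRadius(12)
    }
}

/// A card stack scrolled with the Digital Crown.
struct CardStack: View {
    let cards: [String]
    @State private var currentIndex: CGFloat = 0

    var body: some View {
        VStack {
            Spacer()
            // The ZStack fills the space left by the Spacer; each card
            // receives the modifier computed from its offset to the
            // current index.
            ZStack {
                ForEach(cards.indices, id: \.self) { index in
                    CardView(title: cards[index])
                        .modifier(CardOffsetModifier(
                            cardOffset: CGFloat(index) - currentIndex))
                }
            }
            .focusable(true)
            // Passing an animated binding lets SwiftUI interpolate each
            // card's offset and opacity as the crown rotates.
            .digitalCrownRotation($currentIndex.animation(),
                                  from: 0,
                                  through: CGFloat(max(cards.count - 1, 0)),
                                  by: 1)
        }
    }
}
```

Passing `$currentIndex.animation()` as the crown binding is the detail that makes this work without explicit animation code: every crown tick mutates the index inside an animation transaction, and SwiftUI animates the resulting modifier changes on each card.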
We’re thrilled to launch the new Yelp for Apple Watch application today and hope you will love it as much as we do!</p><div class="island job-posting"><h3>Become an iOS Software Engineer at Yelp</h3><p>Want to build more great looking products? Come join us!</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/d38ed5fc-bbfa-4f96-92fd-0d194b0433fb?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2019/10/redesigning-yelp-for-apple-watch-with-swiftui.html</link>
      <guid>https://engineeringblog.yelp.com/2019/10/redesigning-yelp-for-apple-watch-with-swiftui.html</guid>
      <pubDate>Mon, 21 Oct 2019 02:00:00 +0200</pubDate>
    </item>
  </channel>
</rss>
