<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:webfeeds="http://webfeeds.org/rss/1.0" version="2.0">
  <channel>
    <atom:link href="http://pubsubhubbub.appspot.com/" rel="hub"/>
    <atom:link href="https://f43.me/yelp-engineering.xml" rel="self" type="application/rss+xml"/>
    <title>Yelp Engineering</title>
    <description>News from the Yelp Engineering and Product Teams</description>
    <link>http://engineeringblog.yelp.com</link>
    <webfeeds:icon>https://s2.googleusercontent.com/s2/favicons?alt=feed&amp;domain=engineeringblog.yelp.com</webfeeds:icon>
    <webfeeds:logo>https://engineeringblog.yelp.com/css/assets/img/structural/biz_header_logo.png</webfeeds:logo>
    <webfeeds:accentColor>BE0E02</webfeeds:accentColor>
    <generator>f43.me</generator>
    <lastBuildDate>Fri, 13 Mar 2026 05:06:50 +0100</lastBuildDate>
    <item>
      <title><![CDATA[How Yelp Built a Back-Testing Engine for Safer, Smarter Ad Budget Allocation]]></title>
      <description><![CDATA[<p>Modern advertising platforms are fast-paced and interconnected: even small adjustments can have ripple effects on how ads are shown, how budgets are spent, and the value advertisers get from their ad spend.</p><p>At Yelp, Ad Budget Allocation means splitting each campaign’s spend between on‑platform inventory (our website, mobile site, and app) and off‑platform inventory (the Yelp Ad Network). We optimize this split to meet advertisers’ performance goals while growing overall revenue. Due to the complexity of the budget allocation system and its feedback loop, even small changes can lead to unexpected system‑wide effects.</p><p>To help us safely evaluate changes, we developed a Back-Testing Engine. This tool allows us to simulate the entire Ad Budget Allocation ecosystem with proposed algorithm changes, giving us a preview of real-world effects before we run full A/B tests or launch new code. All simulations use aggregated campaign data, with no personal user information involved.</p><p>In this post, we’ll share why we built this Engine, explain how it works, and reflect on how it’s improving our decision-making process.</p><h2 id="what-is-a-back-testing-engine">What is a Back-Testing Engine?</h2><p>A Back-Testing Engine allows us to simulate “what if” scenarios by applying alternative algorithms or parameters against historical campaign data. Instead of testing changes live, where mistakes could impact real budgets and advertisers, we can safely preview the effects of updates in a controlled environment.</p><p>For the Yelp Ad Budget Allocation team, this means virtually rerunning past campaigns with proposed allocation strategies and measuring outcomes like spend, leads, or revenue. 
This approach offers a key advantage over traditional simulation methods or “back-of-the-envelope” calculations using aggregate data, which often miss important day-to-day dynamics and interactions.</p><p>As our allocation logic and partner integrations have become more sophisticated, rapid and safe innovation has become essential. The Back-Testing Engine gives us the confidence to explore improvements, validate ideas, and iterate faster, while keeping advertiser trust and system performance front and center.</p><p>Yelp’s advertising system handles budget allocation for hundreds of thousands of campaigns each month. Advertisers typically set a monthly budget, but behind the scenes, our infrastructure makes daily decisions on how much to spend, and where.</p><p>In particular, a campaign goes through the following steps:</p><ol><li><strong>Beginning of the day</strong>: Our system calculates how much of the campaign’s budget to allocate that day, and how to split it between Yelp and our ad network based on the campaign’s goals.</li>
<li><strong>Throughout the day</strong>: Once the budget is set, the campaign generates outcomes (such as impressions, clicks, and leads) as the day progresses. While we can’t directly control the number of these outcomes, we closely monitor them as the ad budget is spent.</li>
<li><strong>End of day</strong>: Our system collects the day’s results and uses them to bill the campaign.</li>
</ol><p>Importantly, each day’s budget decisions depend on the outcomes of previous days, so the system constantly adapts as new outcomes come in. This feedback loop is a fundamental property our Back-Testing Engine must capture: even small changes can have cascading, system-wide impacts over the billing period.</p><p>Below is a visual example of this day-by-day process (here, using December 2025 as the billing period) for two campaigns:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2026-02-02-how-yelp-built-a-back-testing-engine-for-safer-smarter-ad-budget-allocation/campaign_journey.png" alt="Figure 1. Campaign journey" /><p class="subtle-text"><small>Figure 1. Campaign journey</small></p></div><p>Our Back-Testing Engine is designed to replay this daily process using historical data and simulated changes together, helping us forecast the effects of changes before we ever touch production systems.</p><h2 id="system-overview">System overview</h2><p>The Back-Testing Engine is built from eight interconnected components, each playing a distinct role in the simulation process:</p><ol><li><strong>Parameter search space</strong>: Defines the parameters and values to explore.</li>
<li><strong>Optimizer</strong>: Selects the most promising candidates to test.</li>
<li><strong>Candidate</strong>: Represents a specific set of parameter values to be tested (one value for each parameter).</li>
<li><strong>Production repositories</strong>: Mirror production code (e.g., budgeting, billing).</li>
<li><strong>Historical daily campaign data</strong>: Actual historical data used for simulation.</li>
<li><strong>Machine-learning (ML) Models for clicks, leads, etc.</strong>: Predict daily outcomes such as impressions, clicks, and leads.</li>
<li><strong>Metrics</strong>: Store main KPIs for each candidate.</li>
<li><strong>Logging and visualization</strong>: Collects and displays all results.</li>
</ol><p>The diagram below shows how they interact during the simulation process.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2026-02-02-how-yelp-built-a-back-testing-engine-for-safer-smarter-ad-budget-allocation/system_architecture.png" alt="Figure 2. System architecture" /><p class="subtle-text"><small>Figure 2. System architecture</small></p></div><p>Below, we break down each component in more detail.</p><h3 id="component-1---parameter-search-space-yaml">Component 1 - Parameter search space [YAML]</h3><p>To run a back-test, we first define which parameters we want to tune or evaluate. These might include algorithm choices, thresholds, or weights, all specified in a YAML file—a human-readable format widely used for configuration.</p><p>The file includes:</p><ul><li>A <strong>date range</strong> for the simulation.</li>
<li>A <strong>run name</strong> to identify the test.</li>
<li>The <strong>search space</strong> for each parameter: allowed values or intervals.</li>
</ul><p>For example, suppose our budget allocation system currently uses a standard allocation approach, but we want to experiment with a new method called Algorithm X. We’re also interested in tuning a constant (called parameter Alpha) which we believe will impact allocation performance, with reasonable values ranging between -10 and +10.</p><p>To run this back-test for December 2025, we’d configure the YAML file as follows:</p><div class="language-yaml highlighter-rouge highlight"><pre>date_interval:
  - '2025-12-01'
  - '2025-12-31'
experiment_name: 'algorithm_x_vs_status_quo'
searches:
  - search_type: 'scikit-opt'
    minimize_metric: 'average-cpl'
    max_evals: 25
    search_space:
      allocation_algo: skopt.space.Categorical(['status-quo', 'algorithm_x'])
      alpha: skopt.space.Real(-10, 10)
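
# Illustrative sketch only: the exact YAML keys for the other supported
# search types are assumptions, shown here to make the alternatives concrete.
#
# A grid search would back-test every combination of the listed values
# (here 2 x 5 = 10 candidates):
#   - search_type: 'grid'
#     search_space:
#       allocation_algo: ['status-quo', 'algorithm_x']
#       alpha: [-10, -5, 0, 5, 10]
#
# A listed search would back-test exactly the candidates given:
#   - search_type: 'listed'
#     candidates:
#       - {allocation_algo: 'algorithm_x', alpha: 0.0}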
</pre></div><p>Once this configuration is set, the optimizer can begin exploring different combinations of these parameters during the simulation.</p><h3 id="component-2---optimizer-scikit-optimize">Component 2 - Optimizer [Scikit-Optimize]</h3><p>To efficiently explore the parameter space, our Back-Testing Engine uses an optimizer: specifically, a Bayesian optimizer from the Scikit-Optimize library. The optimizer’s goal is to propose parameter combinations (candidates) that are likely to improve a chosen metric, defined in the YAML file as <code class="language-plaintext highlighter-rouge">minimize_metric</code>, in this case <code class="language-plaintext highlighter-rouge">average-cpl</code> (cost per lead).</p><p>The process begins with the optimizer suggesting an initial candidate, which is typically a random sample since no prior data exists. For example, the first candidate might be <code class="language-plaintext highlighter-rouge">{'allocation_algo': 'status-quo', 'alpha': 3.53}</code>. The Engine simulates this candidate and returns its performance metrics. In turn, the optimizer uses this feedback to select the next candidate, learning from previous results to propose combinations more likely to optimize the target metric.</p><p>This iterative loop continues until a specified number of candidates (<code class="language-plaintext highlighter-rouge">max_evals</code> in the YAML file, in this case 25) have been evaluated.</p><p>Scikit-Opt search is just one possible search strategy. Two other strategies are also supported:</p><ul><li><strong>Grid search</strong>: All the possible combinations of parameter values are back-tested. 
This approach requires limiting the number of values to be tested, as the number of possible combinations grows quickly. For instance, if we have a parameter with 5 values, another parameter with 3 values, and a third parameter with 10 values, the total number of candidates would be 5 × 3 × 10 = 150.</li>
<li><strong>Listed search</strong>: Each candidate is directly specified by the user in the YAML file.</li>
</ul><p>Note that for every search type except Scikit-Opt, the optimizer doesn’t actually optimize; it simply acts as a wrapper that yields the next candidate to try.</p><h3 id="component-3---candidate">Component 3 - Candidate</h3><p>As we have seen, each candidate is a specific combination of parameter values. Concretely, a candidate is a key-value dictionary. In the example above (see <em>Component 2 - Optimizer [Scikit-Optimize]</em>), <em>Candidate #1</em> is a dictionary: <code class="language-plaintext highlighter-rouge">{'allocation_algo': 'status-quo', 'alpha': 3.53}</code>.</p><h3 id="component-4---production-repositories-git-submodules">Component 4 - Production repositories [Git Submodules]</h3><p>To support accurate back-testing, our Engine uses the same code as production by including key repositories (like Budgeting and Billing) as Git Submodules. This lets us simulate current logic or proposed changes by pointing to specific Git branches.</p><p>For example, to test a new budgeting algorithm, we add it on a separate branch, configure the Back-Testing Engine to use that branch, and run simulations. This setup enables our tests to closely match production and allows us to validate code changes in a controlled environment before rollout.</p><h3 id="component-5---historical-daily-campaign-data-redshift">Component 5 - Historical daily campaign data [Redshift]</h3><p>For the back-test, the system needs to retrieve historical campaign and advertiser data from Redshift, limited to the selected simulation period (e.g., December 1–31, 2025). This data is relevant because:</p><ul><li>The budgeting logic may vary depending on specific campaign attributes.</li>
<li>These attributes also serve as input features for the ML models (see <em>Component 6 - ML models for clicks, leads, etc. [CatBoost]</em>), improving the accuracy of predicted outcomes.</li>
</ul><p>All data is ingested at the campaign and date level to match the granularity of our production environment.</p><h3 id="component-6---ml-models-for-clicks-leads-etc-catboost">Component 6 - ML models for clicks, leads, etc. [CatBoost]</h3><p>Once daily budget allocations are set (see <em>Component 4 - Production repositories [Git Submodules]</em>) and campaign characteristics are loaded (see <em>Component 5 - Historical daily campaign data [Redshift]</em>), the next step is to estimate each campaign’s outcomes, such as impressions, clicks, and leads. Accurately predicting these results is challenging because:</p><ol><li>These outcomes depend on external systems we don’t directly control (e.g., partner ad networks).</li>
<li>There is intrinsic randomness in user behavior, such as whether someone chooses to click on an ad.</li>
</ol><p>To address this, we leverage ML models (specifically, CatBoost) trained to predict expected impressions, clicks, and leads based on daily budget and campaign features.</p><p>Using a non-parametric ML approach, instead of making simplistic assumptions (e.g., a constant cost per click), allows us to accurately capture complex effects such as diminishing returns on budget, resulting in simulations that more closely reflect real-world behavior.</p><p>Using the same ML models for all candidates promotes fair comparisons. To further improve reliability, we monitor these models to prevent overfitting, checking that performance is consistent between training and hold-out datasets.</p><p>Because our models output average expected values (not integers), we apply a Poisson distribution to simulate integer outcomes. This approach captures the randomness seen in live systems.</p><p>Note: The use of ML models to predict counterfactual outcomes means this is not a pure back-testing approach, but rather a hybrid that combines elements of both simulation and back-testing.</p><h3 id="component-7---metrics">Component 7 - Metrics</h3><p>For each candidate, we track a set of metrics that serve as important indicators of campaign performance or economic results for Yelp: for instance, per-campaign average cost-per-click, average cost-per-lead, and Yelp margin. These metrics are calculated from the raw simulation results for each campaign and day, including daily budgets, impressions, clicks, leads, and billing.</p><p>As already mentioned (see <em>Figure 1. Campaign journey</em>), the raw simulation results of each candidate are obtained from “replaying” each campaign for each day. This simulation process works as follows:</p><ul><li><strong>Beginning of the day</strong>: The Engine, using the Budgeting repository (configured with the candidate parameters), determines each campaign’s daily budget and allocates spend across channels.</li>
<li><strong>Throughout the day</strong>: ML models predict the campaign’s impressions, clicks, and leads based on the allocated budget and campaign features.</li>
<li><strong>End of day</strong>: The Billing repository (configured with the candidate parameters) computes each campaign’s billing using the simulated outcomes and candidate parameters.</li>
</ul><p>This process is repeated for each campaign and for each day in the period.</p><p>At the end, we aggregate these raw results into summary metrics, stored as key-value pairs for each candidate (e.g., <code class="language-plaintext highlighter-rouge">{'avg_cpc': 1.39, 'avg_cpl': 18.48, 'margin': 0.35}</code>). These global metrics make it easier to compare candidates.</p><h3 id="component-8---logging-and-visualization-mlflow">Component 8 - Logging and Visualization [MLFlow]</h3><p>For every candidate, we log both the input parameters and the resulting metrics to MLFlow, which runs on a remote server.</p><p>This setup offers two main advantages:</p><ul><li><strong>Centralized collaboration</strong>: All experiment results are stored in one place, making it easy for developers and applied scientists to access, review, and share findings.</li>
<li><strong>Effortless visualization</strong>: MLFlow’s built-in tools allow users to quickly compare and visualize candidate results without extra coding, streamlining analysis and decision-making.</li>
</ul><h2 id="insights--learnings">Insights &amp; Learnings</h2><p>Since adopting the Back-Testing Engine, we’ve seen clear improvements in the accuracy, speed, and safety of our experimentation. Here are the key ways it’s changed our workflow and decision-making.</p><h3 id="the-impact-on-our-experimentation-process">The impact on our experimentation process</h3><p>Before the Back-Testing Engine, we’d typically test algorithmic changes by running A/B experiments. We’d split campaigns into control and treatment groups, measuring results and assessing risk after the fact.</p><p>While statistically sound, this approach has major limitations in our setting:</p><ul><li><strong>Limited data</strong>: We experimented at the advertiser (not user) level, so sample sizes were often too small for some effects to be detected.</li>
<li><strong>Slow results</strong>: Since most advertisers set monthly budgets, we had to wait one month to fully measure the effect of an A/B test.</li>
<li><strong>High risk</strong>: Mistakes or unintended consequences could have affected real advertisers.</li>
</ul><p>The Back-Testing Engine changes this dynamic. Instead of relying solely on A/B tests, we can affordably and safely simulate a wide range of changes using historical data. This allows us to quickly filter out weaker candidates and focus A/B tests only on the most promising ideas, preserving A/B testing for final validation rather than discovery.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2026-02-02-how-yelp-built-a-back-testing-engine-for-safer-smarter-ad-budget-allocation/back_testing_and_a_b_testing.png" alt="Figure 3. How back-testing fits into our experimentation workflow" /><p class="subtle-text"><small>Figure 3. How back-testing fits into our experimentation workflow</small></p></div><h3 id="operational-benefits">Operational benefits</h3><p>The introduction of back-testing has provided several additional advantages:</p><ul><li><strong>Faster productionization</strong>: By allowing teams to implement changes directly in dedicated Git branches and immediately simulate their impact, we’re able to move promising ideas into production much more quickly. This effectively blurs the line between prototyping and production, streamlining our workflows.</li>
<li><strong>Improved collaboration</strong>: Scientists and engineers can now work side-by-side with production code, turning experiments into reusable, production-ready artifacts, rather than disconnected notebooks.</li>
<li><strong>Increased prediction accuracy</strong>: Our ML-driven simulations provide more realistic estimates of the business impact of each change, capturing complexities, like varying cost per click and cost per lead at different budget levels, that simplistic estimates often miss.</li>
<li><strong>System fidelity</strong>: By replaying the daily budgeting process, our Engine closely mirrors real-world operations, avoiding naive extrapolations and making results far more trustworthy.</li>
<li><strong>Early bug detection</strong>: Running simulations across a broad set of real data helps us catch code bugs or edge cases that would be tricky to find with unit tests alone.</li>
</ul><p>Overall, the Back-Testing Engine acts as both a safety net and a launchpad, empowering us to explore, evaluate, and improve our ad system with confidence.</p><h3 id="caveats-risks-and-limitations">Caveats, risks, and limitations</h3><p>While back-testing brings significant benefits, it’s important to acknowledge its limitations:</p><ul><li><strong>Not a perfect predictor</strong>: Back-testing relies on historical data and model assumptions, which may not capture major shifts in user, market, or partner behavior.</li>
<li><strong>Risk of overfitting to history</strong>: Relying too heavily on historical simulations could bias development toward optimizations that perform well on past data, potentially limiting innovation.</li>
<li><strong>ML model dependency</strong>: The accuracy of this methodology depends heavily on the quality and generalizability of the underlying ML models.</li>
</ul><p>Being aware of these caveats helps us use back-testing more effectively, complementing it with A/B tests and real-world monitoring to ensure robust, reliable improvements.</p><h2 id="conclusion">Conclusion</h2><p>The introduction of our Back-Testing Engine has transformed the way we experiment and optimize Ad Budget Allocation at Yelp. By leveraging production code and historical data, we can evaluate changes safely and efficiently, enabling faster iteration and more informed decision-making. This approach has reduced the risks associated with live experimentation, improved collaboration between teams, and provided a more accurate picture of the impact any proposed update can have on our ad ecosystem.</p><p>While there are limitations, such as reliance on historical data and ML model accuracy, acknowledging these caveats ensures that back-testing remains a reliable tool in our experimentation toolkit. Throughout this process, we ensure that all campaign simulations use aggregated, anonymized data, prioritizing the privacy of our users and advertisers.</p><p>Altogether, the Back-Testing Engine has proven to be both a safety net and an accelerator, empowering our team to drive continuous improvement and deliver greater value to advertisers.</p><div class="island job-posting"><h3>Join Our Team at Yelp</h3><p>We're tackling exciting challenges at Yelp. Interested in joining us? Apply now!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2026/02/how-yelp-built-a-back-testing-engine-for-safer-smarter-ad-budget-allocation.html</link>
      <guid>https://engineeringblog.yelp.com/2026/02/how-yelp-built-a-back-testing-engine-for-safer-smarter-ad-budget-allocation.html</guid>
      <pubDate>Mon, 02 Feb 2026 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[S3 server access logs at scale]]></title>
      <description><![CDATA[Introduction Yelp heavily relies on Amazon S3 (Simple Storage Service) to store a wide variety of data, from images, logs, database backups, and more. Since data is stored on the cloud, we need to carefully manage how this data is accessed, secured, and eventually deleted—both to control costs and uphold high standards of security and compliance. One of the core challenges in managing S3 buckets is gaining visibility into who is accessing your data (known as S3 objects), how frequently, and for what purpose. Without robust logging, it’s difficult to troubleshoot access issues, respond to security incidents, and ensure we...]]></description>
      <link>https://engineeringblog.yelp.com/2025/09/s3-server-access-logs-at-scale.html</link>
      <guid>https://engineeringblog.yelp.com/2025/09/s3-server-access-logs-at-scale.html</guid>
      <pubDate>Fri, 26 Sep 2025 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Exploring CHAOS: Building a Backend for Server-Driven UI]]></title>
      <description><![CDATA[<p>A little while ago, we published a blog post on <a href="https://engineeringblog.yelp.com/2024/03/chaos-yelps-unified-framework-for-server-driven-ui.html">CHAOS: Yelp’s Unified Framework for Server-Driven UI</a>. We strongly recommend reading that post first to gain a solid understanding of SDUI and the goals of CHAOS. This post builds on those concepts to delve into the inner workings of the CHAOS backend and how it generates server-driven content. To briefly recap, CHAOS is a server-driven UI framework used at Yelp. When a client wants to display CHAOS-powered content, it sends a GraphQL query to the CHAOS API. The API processes the query, requests the CHAOS backend to construct the configuration, formats the response, and returns it to the client for rendering.</p><p>The CHAOS backend accepts client requests through the GraphQL-based CHAOS API. At Yelp, we have adopted <a href="https://www.apollographql.com/docs/graphos/schema-design/federated-schemas/federation">Apollo Federation</a> for our GraphQL architecture, utilizing <a href="https://strawberry.rocks/">Strawberry</a> for federated Python subgraphs to leverage type-safe schema definitions and Python’s type hints. The CHAOS-specific GraphQL schema resides in its own CHAOS Subgraph, hosted by a Python service.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-07-08-chaos-inside-yelps-sdui-framework/chaos_backend_overview.png" alt="Diagram of CHAOS API and Backend Architecture" /></p><p>This federated architecture allows us to manage our CHAOS-specific GraphQL schema independently while seamlessly integrating it into Yelp’s broader Supergraph.</p><p>Behind the GraphQL layer, we support multiple CHAOS backends that implement a CHAOS REST API to serve CHAOS content in the form of CHAOS Configurations. 
This architecture allows different teams to manage their CHAOS content independently on their own services, while the GraphQL layer provides a unified interface for client requests. The CHAOS API authenticates requests and routes them to the relevant backend service, where most of the build logic is handled.</p><p>The primary goal of a CHAOS backend is to construct a CHAOS SDUI Configuration. This data model encompasses all the information needed for a client to configure a CHAOS-powered SDUI view. Below is an example of a CHAOS view called “consumer.welcome” and its configuration:</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-07-08-chaos-inside-yelps-sdui-framework/chaos-view.png" alt="CHAOS View Example" /></p><div class="language-json highlighter-rouge highlight"><pre>{"data":{"chaosView":{"views":[{"identifier":"consumer.welcome","layout":{"__typename":"ChaosSingleColumn","rows":["welcome-to-yelp-header","welcome-to-yelp-illustration","find-local-businesses-button"]},"__typename":"ChaosView"}],"components":[{"__typename":"ChaosJsonComponent","identifier":"welcome-to-yelp-header","componentType":"chaos.text.v1","parameters":"{\"text\": \"Welcome to Yelp\", \"textStyle\": \"heading1-bold\", \"textAlignment\": \"center\"}"},{"__typename":"ChaosJsonComponent","identifier":"welcome-to-yelp-illustration","componentType":"chaos.illustration.v1","parameters":"{\"dimensions\": {\"width\": 375, \"height\": 300}, \"url\": \"https://media.yelp.com/welcome-to-yelp.svg\"}"},{"__typename":"ChaosJsonComponent","identifier":"find-local-businesses-button","componentType":"chaos.button.v1","parameters":"{\"text\": \"Find local businesses\", \"style\": \"primary\", \"onClick\": [\"open-search-url\"]}"}],"actions":[{"__typename":"ChaosJsonAction","identifier":"open-search-url","actionType":"chaos.open-url.v1","parameters":"{\"url\": \"https://yelp.com/search\"}"}],"initialViewId":"consumer.welcome","__typename":"ChaosConfiguration"}}}</pre></div><p>The 
configuration includes a list of views, each with a unique identifier and a layout. If there are multiple views, the initialViewId specifies which view should be displayed first. The layout, such as the single-column layout in this example, organizes components into sections based on their component IDs, helping the client determine the positioning of components within the CHAOS view.</p><p>Additionally, the configuration lists components and actions, detailing their settings as referenced by their respective IDs. Each component may have its own action, such as an onClick action for a button. A screen may also have actions triggered at specific stages, such as onView, for purposes like logging.</p><p>In CHAOS, components and actions are the fundamental building blocks. Instead of defining individual schemas for each element in the GraphQL layer, we use JSON strings for element content. This approach maintains a stable GraphQL schema and allows for rapid iteration on new elements or versions.</p><p>To ensure proper configuration, each element is defined as a Python dataclass, providing a clear interface. Type hinting guides developers on the expected parameters. These components and actions are available through a shared CHAOS Python package. For example, a text component could be structured as follows:</p><div class="language-python highlighter-rouge highlight"><pre>@dataclass
class TextV1(_ComponentData):
    value: str
    style: TextStyle
    color: Optional[Color] = None
    textAlignment: Optional[TextAlignment] = None
    margin: Optional[Margin] = None
    onView: Optional[List[Action]] = None
    onClick: Optional[List[Action]] = None
    component_type: str = "chaos.text.v1"
</pre></div><div class="language-python highlighter-rouge highlight"><pre>text = Component(
  component_data=TextV1(
      value="Welcome to Yelp!",
      style=TextStyle.HEADING_1_BOLD,
      textAlignment=TextAlignment.CENTER,
  )
)
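
# Illustrative sketch only (an assumption, not the shared CHAOS package's
# actual code): serialization conceptually drops unset optional fields and
# JSON-encodes the remaining dataclass fields under "parameters", keeping
# "component_type" at the top level.
import dataclasses
import json

def serialize_component_data(component_data) -> dict:
    # Convert the dataclass to a plain dict of its fields.
    fields = dataclasses.asdict(component_data)
    component_type = fields.pop("component_type")
    # Omit optional fields that were left unset.
    parameters = {k: v for k, v in fields.items() if v is not None}
    return {"component_type": component_type, "parameters": json.dumps(parameters)}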
</pre></div><p>These dataclasses internally handle serialization to JSON strings, as shown below:</p><div class="language-json highlighter-rouge highlight"><pre>{"component_type":"chaos.text.v1","parameters":"{\"text\": \"Welcome to Yelp\", \"textStyle\": \"heading1-bold\", \"textAlignment\": \"center\"}"}</pre></div><p>These basic components and actions, when combined with container-like components such as vertical and horizontal stacks—which organize elements in a vertical or horizontal sequence—enable powerful UI building capabilities.</p><p>In this section, we will explore how CHAOS constructs a configuration. Although the process can be complex, the shared CHAOS Python Package, which also contains CHAOS elements, provides Python classes that manage most of the build process in the background. This allows backend developers using the CHAOS SDUI framework to focus on configuring their content. Below is a high-level overview of the build process, with subsequent sections examining each step in detail.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-07-08-chaos-inside-yelps-sdui-framework/chaos_build_flow.png" alt="Configuration Build Flow" /></p><h2 id="step-1-request">Step 1: Request</h2><p>When a client sends a GraphQL query to the CHAOS API, it provides a view name and context. The view name is used to route the request to the relevant CHAOS backend to build the configuration. The context, a JSON object, is forwarded to the backend and includes information such as client specifications or feature specifications. This allows the backend to customize the build for each client request.</p><p>To illustrate this process, here is a simplified request from a mobile device that demonstrates how to retrieve a CHAOS view named “consumer.welcome”:</p><div class="language-plaintext highlighter-rouge highlight"><pre>POST /graphql
Request Body:
{
  "query": "
  query GetChaosConfiguration($viewName: String!, $context: ContextInput!) {
    chaosConfiguration(name: $viewName, context: $context) {
      # The actual fields of ChaosConfiguration would be specified here
      ...ChaosConfiguration Schema...
    }
  }
  ",
  "variables": {
    "viewName": "consumer.welcome",
    "context": "{\"screen_scale\": \"foo\", \"platform\": \"bar\", \"app_version\": \"baz\"}"
  }
}
</pre></div><p>Upon receiving the request, the CHAOS subgraph routes it to a CHAOS backend service for further processing.</p><h2 id="step-2-view-selection">Step 2: View Selection</h2><p>An individual CHAOS backend can support various CHAOS views. The <code class="language-plaintext highlighter-rouge">ChaosConfigBuilder</code> allows backend developers to register their <code class="language-plaintext highlighter-rouge">ViewBuilder</code> classes, which manage individual view builds. Upon receiving a request, the encapsulated logic in <code class="language-plaintext highlighter-rouge">ChaosConfigBuilder</code> selects the relevant <code class="language-plaintext highlighter-rouge">ViewBuilder</code> based on the request’s view name and executes the view build steps, constructing the final configuration. Here is a simplified example of using <code class="language-plaintext highlighter-rouge">ChaosConfigBuilder</code> in practice:</p><h4 id="simplified-example-to-illustrate-the-use-of-chaosconfigbuilder">Simplified example to illustrate the use of ChaosConfigBuilder</h4><div class="language-python highlighter-rouge highlight"><pre>from chaos.builders import ChaosConfigBuilder
from chaos.utils import get_chaos_context
from .views.welcome_view import ConsumerWelcomeViewBuilder
def handle_chaos_request(request):
    # Obtain the context for the CHAOS request
    context = get_chaos_context(request)
    # Register the view builders supported by this service.
    ChaosConfigBuilder.register_view_builders([
        ConsumerWelcomeViewBuilder,
        # Add other view builders here
    ])
    # Build and return the final configuration
    return ChaosConfigBuilder(context).build()
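# Illustrative sketch only: the dispatch performed by ChaosConfigBuilder can
# be pictured as a registry keyed by each builder's view_id(). The names
# below (_SketchConfigBuilder, build_view, the "view_name" context key) are
# assumptions for illustration, not the real CHAOS internals.
class _SketchConfigBuilder:
    _builders = {}

    @classmethod
    def register_view_builders(cls, builders):
        # Key each registered ViewBuilder by the view name it declares.
        for builder in builders:
            cls._builders[builder.view_id()] = builder

    def __init__(self, context):
        self._context = context

    def build(self):
        # Route on the requested view name and run that builder's build.
        builder_cls = self._builders[self._context["view_name"]]
        return builder_cls(self._context).build_view()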
</pre></div><h2 id="step-3-layout-selection">Step 3: Layout Selection</h2><p>Each view has a ViewBuilder class, which selects the appropriate layout and manages the construction of the view.</p><p>CHAOS supports different layouts. For example, a single-column layout, as shown in the previous example, has only one “main” section. Other layouts, such as a basic mobile layout, include additional sections like a toolbar and footer. This flexibility allows content to be presented differently across various clients, such as web and mobile, to accommodate different client characteristics.</p><p>Each supported layout type in CHAOS has a corresponding LayoutBuilder. This class accepts a list of <code class="language-plaintext highlighter-rouge">FeatureProvider</code> classes (described in detail later) for each section. The order of FeatureProviders within each section determines their order when rendered on the client.</p><p>Continuing with the welcome_consumer example, the ViewBuilder looks like this:</p><h4 id="simplified-and-illustrative-example-of-a-viewbuilder-in-chaos">Simplified and illustrative example of a ViewBuilder in CHAOS.</h4><div class="language-python highlighter-rouge highlight"><pre>
from typing import List, Type

from chaos.builders import ViewBuilderBase, LayoutBuilderBase, SingleColumnLayoutBuilder
from .features import WelcomeFeatureProvider
class ConsumerWelcomeViewBuilder(ViewBuilderBase):
    @classmethod
    def view_id(cls) -&gt; str:
        return "consumer.welcome"
    def subsequent_views(self) -&gt; List[Type[ViewBuilderBase]]:
        """Refer to the 'Advanced Features - View Flows' section for details."""
        return []
    def _get_layout_builder(self) -&gt; LayoutBuilderBase:
        """
        Logic to select the appropriate layout builder based on the context.
        """
        return SingleColumnLayoutBuilder(
            main=[
                WelcomeFeatureProvider,
            ],
            context=self._context
        )
</pre></div><p>When the <code class="language-plaintext highlighter-rouge">ChaosConfigBuilder</code> executes the <code class="language-plaintext highlighter-rouge">ViewBuilder</code>’s build steps, it internally invokes the _get_layout_builder() method to determine the appropriate <code class="language-plaintext highlighter-rouge">LayoutBuilder</code> and execute its build steps. In this example, the method returns a SingleColumnLayoutBuilder, which is structured with a single section named “main”. This section contains only one feature provider: WelcomeFeatureProvider. The LayoutBuilder will then execute the FeatureProvider’s build process, which constructs the configuration for the feature’s SDUI.</p><h2 id="step-4-build-features">Step 4: Build Features</h2><p>A feature’s SDUI comprises one or more components and actions that collectively fulfill a product purpose, allowing users to view and interact with it on the Yelp app. Feature developers define each feature by inheriting from the FeatureProvider class, which encapsulates all the logic required to load feature data and configure the user interface appropriately.</p><p>Each FeatureProvider builds its feature by going through the following major steps:</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-07-08-chaos-inside-yelps-sdui-framework/provider-build-flow.png" alt="Feature Provider Build Flow" /></p><div class="language-python highlighter-rouge highlight"><pre>class FeatureProviderBase:
    def __init__(self, context: Context):
        self.context = context
    @property
    def registers(self) -&gt; List[Register]:
        """Sets platform conditions and presenter handler."""
    def is_qualified_to_load(self) -&gt; bool:
        """Checks if data loading is allowed."""
        return True
    def load_data(self) -&gt; None:
        """Initiates asynchronous data loading."""
    def resolve(self) -&gt; None:
        """Processes data for SDUI component configuration."""
    def is_qualified_to_present(self) -&gt; bool:
        """Checks if configuration of the feature is allowed."""
        return True
    def result_presenter(self) -&gt; List[Component]:
        """Defines component configurations."""
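# Sketch (an assumption, not the real CHAOS build loop) of the two-pass
# build described below: pass 1 starts async loading for every qualified
# provider; pass 2 resolves results and collects the presented components.
def build_features(providers):
    started = []
    for provider in providers:
        if provider.is_qualified_to_load():
            provider.load_data()  # kick off async requests, do not block
            started.append(provider)
    components = []
    for provider in started:
        provider.resolve()  # wait for the responses
        if provider.is_qualified_to_present():
            components.extend(provider.result_presenter())
    return components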
</pre></div><p>A view can contain multiple features, and during the build process, all features are built in parallel to enhance performance. To achieve this, the feature providers are iterated over twice. In the first loop, the build process is initiated, triggering any asynchronous calls to external services. This includes the steps: <code class="language-plaintext highlighter-rouge">registers</code>, <code class="language-plaintext highlighter-rouge">is_qualified_to_load</code>, and <code class="language-plaintext highlighter-rouge">load_data</code>. The second loop waits for responses and completes the build process, encompassing the steps: <code class="language-plaintext highlighter-rouge">resolve</code>, <code class="language-plaintext highlighter-rouge">is_qualified_to_present</code>, and <code class="language-plaintext highlighter-rouge">result_presenter</code>. (It is worth mentioning that the latest CHAOS backend framework introduces the next generation of builders using Python asyncio, which simplifies the interface. This will be explored in a future blog post.)</p><h3 id="check-registers">Check Registers</h3><p>The <code class="language-plaintext highlighter-rouge">Register</code> class in CHAOS is crucial for ensuring that any SDUI content returned to the client is supported. Each register specifies:</p><ul><li><strong>Platform</strong>: The platforms (e.g., iOS, Android, web) for which the registered configuration is intended.</li>
<li><strong>Elements</strong>: The required components and actions in this configuration that the client must support. Internally, we maintain information about which components and actions are supported by a given client platform type and app version, which is used for verification.</li>
<li><strong>Presenter Handler</strong>: The associated handler (e.g., result_presenter) responsible for constructing the configuration if all conditions are met.</li>
</ul><p>During setup, developers can define multiple registers, each linked to a different handler. Based on the client information provided to the backend, the presenter handler of the first qualifying register is selected to build the configuration. If no register qualifies, the feature is omitted from the final response.</p><h3 id="check-qualification-to-load">Check Qualification to Load</h3><p>The qualification step, <code class="language-plaintext highlighter-rouge">is_qualified_to_load</code>, allows developers to perform additional checks to decide whether the feature building process should continue and if feature data should be loaded. This is typically where feature toggles are applied or experimental checks are conducted. If this step returns false, the feature will be excluded from the final configuration.</p><h3 id="async-data-loading-and-resolve">Async Data Loading and Resolve</h3><p>During the <code class="language-plaintext highlighter-rouge">load_data</code> stage, we initiate asynchronous requests to upstream services in parallel. We defer resolving and blocking for results to the <code class="language-plaintext highlighter-rouge">resolve</code> stage. This approach enables efficient dispatch of requests and data sharing in all feature providers, optimizing performance by resolving data at a later stage.</p><h3 id="check-qualification-to-present">Check Qualification to Present</h3><p>The qualification step, <code class="language-plaintext highlighter-rouge">is_qualified_to_present</code>, allows developers to perform additional checks to determine whether a feature should be included in the configuration. This is especially useful when data fetched during the loading step is needed to decide if the feature should be displayed. 
If this returns false, the feature will be dropped from the final configuration.</p><h3 id="configure-the-feature">Configure the Feature</h3><p>This is the stage where we configure the components and actions that constitute the feature. In the <code class="language-plaintext highlighter-rouge">FeatureProvider</code> code, this is represented by the <code class="language-plaintext highlighter-rouge">result_presenter</code> method. Developers can define multiple presenter handlers. The one selected in the registers will serve as the final handler for the feature.</p><p>Back to the example, the <code class="language-plaintext highlighter-rouge">WelcomeFeatureProvider</code> feature is shown to users when it meets the following conditions: the requesting client is on an iOS or Android platform, and the client supports the required CHAOS elements (<code class="language-plaintext highlighter-rouge">TextV1</code>, <code class="language-plaintext highlighter-rouge">IllustrationV1</code>, <code class="language-plaintext highlighter-rouge">ButtonV1</code>). If satisfied, an asynchronous request fetches button text in the <code class="language-plaintext highlighter-rouge">load_data</code> method, which is then processed in the <code class="language-plaintext highlighter-rouge">resolve</code> method. The <code class="language-plaintext highlighter-rouge">result_presenter</code> method configures and displays the welcome text, illustration, and button with the fetched text.</p><div class="language-python highlighter-rouge highlight"><pre>class WelcomeFeatureProvider(ProviderBase):
    @property
    def registers(self) -&gt; List[Register]:
        return [
            Register(
                condition=Condition(
                    platform=[Platform.IOS, Platform.ANDROID],
                    library=[TextV1, IllustrationV1, ButtonV1],
                ),
                presenter_handler=self.result_presenter,
            )
        ]
    def is_qualified_to_load(self) -&gt; bool:
        return True
    def load_data(self) -&gt; None:
        self._button_text_future = AsyncButtonTextRequest()
    def resolve(self) -&gt; None:
        button_text_results = self._button_text_future.result()
        self._button_text = button_text_results.text
    def result_presenter(self) -&gt; List[Component]:
        return [
            Component(
                component_data=TextV1(
                    text="Welcome to Yelp!",
                    style=TextStyleV1.HEADER_1,
                    text_alignment=TextAlignment.CENTER,
                )
            ),
            Component(
                component_data=IllustrationV1(
                    dimensions=Dimensions(width=375, height=300),
                    url="https://media.yelp.com/welcome-to-yelp.svg",
                ),
            ),
            Component(
                component_data=ButtonV1(
                    text=self._button_text,
                    button_type=ButtonType.PRIMARY,
                    onClick=[
                        Action(
                            action_data=OpenUrlV1(
                                url="https://yelp.com/search"
                            )
                        ),
                    ],
                )
            )
        ]
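# Hypothetical helper showing the register selection described earlier: the
# first register whose platform and element requirements the client
# satisfies wins; None means the feature is omitted. Plain dicts stand in
# for the real Register/Condition classes, purely for illustration.
def select_presenter(registers, client_platform, supported_elements):
    for register in registers:
        if client_platform in register["platforms"] and all(
            element in supported_elements for element in register["elements"]
        ):
            return register["presenter_handler"]
    return None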
</pre></div><p>In an SDUI view with multiple features, error handling is essential. In a data-intensive backend, upstream requests might fail, or unexpected issues could occur. To prevent a complete CHAOS configuration failure due to a single feature’s issue, each FeatureProvider is wrapped in an error-handling wrapper during the CHAOS build process. If an exception occurs, the individual feature is dropped and the rest of the view remains unaffected, unless developers choose to mark the feature as “essential,” in which case its failure affects the entire view.</p><h4 id="simplified-pseudo-code-example-for-error-handling-in-a-feature-provider">Simplified pseudo-code example for error handling in a feature provider.</h4><div class="language-python highlighter-rouge highlight"><pre>def error_decorator(f: F) -&gt; F:
    @wraps(f)
    def wrapper(self, *args, **kwargs):
        try:
            return f(self, *args, **kwargs)
        except Exception as e:
            if self._is_essential_provider:
                raise
            log_error(exception=e, context=self._context)
        return []
    return cast(F, wrapper)
class ErrorHandlingExecutionContext:
    def __init__(self, wrapped_element: ProviderBase) -&gt; None:
        self._wrapped_element: ProviderBase = wrapped_element
        self._context: Context = self._wrapped_element.context
        self._is_essential_provider: bool = self._wrapped_element.IS_ESSENTIAL_PROVIDER
    # Other methods are omitted for brevity.
    @error_decorator
    def final_result_presenter(self) -&gt; List:
        ...
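# Self-contained illustration of the drop-vs-raise behavior described above
# (simplified from the pseudo-code: type hints removed, log_error stubbed
# out as a comment; the real wrapper also records request context).
from functools import wraps

def error_decorator(f):
    @wraps(f)
    def wrapper(self, *args, **kwargs):
        try:
            return f(self, *args, **kwargs)
        except Exception:
            if self._is_essential_provider:
                raise
            # log_error(exception=..., context=...) would be called here
        return []
    return wrapper

class FailingProvider:
    def __init__(self, essential):
        self._is_essential_provider = essential

    @error_decorator
    def final_result_presenter(self):
        raise RuntimeError("upstream request failed")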
</pre></div><p>When an error occurs, we record details such as the feature name, ownership info, exception specifics, and additional request context. This logging facilitates the monitoring of issues, the generation of alerts, and the automatic notification of the responsible team when problems reach a specified threshold.</p><p>The example above covers a pretty basic configuration build example. Now, here’s a quick look at some advanced CHAOS features.</p><h2 id="view-flows">View Flows</h2><p>In the CHAOS configuration schema, the “ChaosView - views” is defined as a list, with the initial view specified by “ChaosView - initialViewId.”</p><p>The CHAOS framework is engineered to allow a view to be linked with multiple “subsequent views.” The configurations for these subsequent views are also contained within “ChaosView - views,” with each view having its own unique ViewId.</p><p>Subsequent views are accessed through the “CHAOS Action - Open Subsequent View.” This action enables navigation to another view using its associated ViewId. This action can be attached to the onClick event of a component, such as a button, thereby allowing users to navigate seamlessly.</p><div class="language-python highlighter-rouge highlight"><pre>@dataclass
class OpenSubsequentView(_ActionData):
    """`"""
    viewId: str
    """The name of subsequent view this action should open."""
    action_type: str = field(init=False, default="chaos.open-subsequent-view.v1", metadata=excluded_from_encoding)
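# Hypothetical sketch of how an action dataclass could serialize to the
# {"actionType", "parameters"} wire shape used in this post. The standalone
# dataclass and encode_action helper are illustrative stand-ins, not the
# real _ActionData / excluded_from_encoding machinery.
import json
from dataclasses import dataclass, field, asdict

@dataclass
class OpenSubsequentViewSketch:
    viewId: str
    action_type: str = field(init=False, default="chaos.open-subsequent-view.v1")

def encode_action(action):
    # Everything except the type tag is JSON-encoded into "parameters".
    params = {k: v for k, v in asdict(action).items() if k != "action_type"}
    return {"actionType": action.action_type, "parameters": json.dumps(params)}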
</pre></div><p>The process for constructing subsequent views is identical to that of the primary view builder. To register a view builder as a subsequent view to the primary one, the ViewBuilder class provides the subsequent_views method.</p><div class="language-python highlighter-rouge highlight"><pre>def subsequent_views(self) -&gt; List[Type[ViewBuilderBase]]:
    return [
        # Add View Builders for Subsequent Views here.
    ]
</pre></div><p>Each view builder in this list is constructed alongside the primary view builder and stored in the “ChaosView - views” list within the final configuration. This design allows developers to define sequences of views, known as “flows,” which are interconnected using the “OpenSubsequentView” action. This approach is particularly beneficial in scenarios where users need to navigate quickly through a series of closely related content. By preloading these views, we eliminate the need for additional network requests for each view configuration, thereby enhancing the user experience by reducing latency.</p><p>Below is an example of a CHAOS Flow utilized in our Yelp for Business mobile app, specifically designed to support a customer support FAQ menu.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-07-08-chaos-inside-yelps-sdui-framework/chaos-flow-example.png" alt="CHAOS Flow Example" /></p><h4 id="simplified-flow-chaos-config">Simplified Flow CHAOS Config</h4><p>This basic configuration demonstrates a three-view flow. Each view contains a button that, when clicked, triggers an action to open the next view. 
In this example, the views will navigate sequentially from View 1 to View 2 to View 3, and then loop back to View 1.</p><div class="language-json highlighter-rouge highlight"><pre>{"data":{"chaosView":{"views":[{"identifier":"consumer.view_one","layout":{"__typename":"ChaosSingleColumn","rows":["button-one"]},"__typename":"ChaosView"},{"identifier":"consumer.view_two","layout":{"__typename":"ChaosSingleColumn","rows":["button-two"]},"__typename":"ChaosView"},{"identifier":"consumer.view_three","layout":{"__typename":"ChaosSingleColumn","rows":["button-three"]},"__typename":"ChaosView"}],"components":[{"__typename":"ChaosJsonComponent","identifier":"button-one","componentType":"chaos.button.v1","parameters":"{\"text\": \"Next\", \"style\": \"primary\", \"onClick\": [\"open-subsequent-one\"]}"},{"__typename":"ChaosJsonComponent","identifier":"button-two","componentType":"chaos.button.v1","parameters":"{\"text\": \"Next\", \"style\": \"primary\", \"onClick\": [\"open-subsequent-two\"]}"},{"__typename":"ChaosJsonComponent","identifier":"button-three","componentType":"chaos.button.v1","parameters":"{\"text\": \"Back to start\", \"style\": \"primary\", \"onClick\": [\"open-subsequent-three\"]}"}],"actions":[{"__typename":"ChaosJsonAction","identifier":"open-subsequent-one","actionType":"chaos.open-subsequent-view.v1","parameters":"{\"viewId\": \"consumer.view_two\"}"},{"__typename":"ChaosJsonAction","identifier":"open-subsequent-two","actionType":"chaos.open-subsequent-view.v1","parameters":"{\"viewId\": \"consumer.view_three\"}"},{"__typename":"ChaosJsonAction","identifier":"open-subsequent-three","actionType":"chaos.open-subsequent-view.v1","parameters":"{\"viewId\": \"consumer.view_one\"}"}],"initialViewId":"consumer.view_one","__typename":"ChaosConfiguration"}}}</pre></div><h2 id="view-placeholders">View Placeholders</h2><p>In CHAOS, we allow a CHAOS view to be nested within another CHAOS view, which the client loads once the parent view is displayed. 
This is achieved using a special CHAOS component called a view placeholder. When rendering this component, the parent view initially shows a loading spinner by default until the nested view’s CHAOS configuration is successfully loaded asynchronously. Once loaded, the nested view is seamlessly integrated with the surrounding content of the parent view.</p><p>This approach enables the main content to be displayed to the user more quickly, while additional content is loaded in the background as the user engages with other items on the screen.</p><p>The view placeholder component can also be optionally configured to handle different states during the loading process, including loading, error, and empty states.</p><div class="language-python highlighter-rouge highlight"><pre>@dataclass
class ViewPlaceholderV1(_ComponentData):
    """
    Used to provide a placeholder that clients should use to fetch the indicated CHAOS Configuration and then load the retrieved content in the location of this component.
    """
    viewName: str
    """The name of the CHAOS view to fetch, e.g. "consumer.inject_me"."""
    featureContext: Optional[ChaosJsonContextData]
    """
    A feature-specific JSON object to be passed to the backend for the view building process by view placeholder.
    """
    loadingComponentId: Optional[ComponentId]
    """An optional component that provides a custom loading state."""
    errorComponentId: Optional[ComponentId]
    """An optional component that provides a custom error state."""
    emptyComponentId: Optional[ComponentId]
    """An optional component that provides a custom empty state."""
    headerComponentId: Optional[ComponentId]
    """An optional component that provides a static header."""
    footerComponentId: Optional[ComponentId]
    """
    An optional component that provides a static footer.
    Use the footer to provide a separator between the component and content below it.
    If the view is closed, the separator will be removed along with the view content.
    """
    estimatedContentHeight: Optional[int]
    """An optional estimate for the height of the content so that space can be allocated when loading."""
    defaultLoadingComponentPadding: Optional[Padding]
    """Specifies whether padding should be added around the shimmer."""
    component_type: str = field(init=False, default="chaos.component.view-placeholder.v1")
</pre></div><p>Here’s an example of the View Placeholder in action on our Yelp for Business home screen. The full home screen is supported by CHAOS. The “Reminders” feature is another standalone CHAOS view supported by a different CHAOS backend service. A ViewPlaceholder is used to asynchronously fetch the Reminders after the home screen has loaded and position it in the appropriate location.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-07-08-chaos-inside-yelps-sdui-framework/view-placeholder-example.png" alt="CHAOS View Placeholder Example" /></p><p>This post provided a high-level overview of how the backend build process for CHAOS comes together. We walked through how configurations are built, how features are composed and validated, and how advanced capabilities like view flows and nested views help create dynamic, responsive user experiences.</p><p>In upcoming posts, our client engineering teams will take a deeper dive into how CHAOS is implemented across Web, iOS, and Android, and how each platform adapts the server-driven configurations to deliver a seamless experience to users. We’ll also explore more advanced topics, such as strategies for making CHAOS even more dynamic, optimizing performance, and scaling the framework to support increasingly complex product needs.</p><p>We’re excited to continue sharing what we’ve learned as we evolve CHAOS to power even richer, faster, and more flexible user experiences across Yelp. Stay tuned!</p><div class="island job-posting"><h3>Join Our Team at Yelp</h3><p>We're tackling exciting challenges at Yelp. Interested in joining us? Apply now!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2025/07/chaos-inside-yelps-sdui-framework.html</link>
      <guid>https://engineeringblog.yelp.com/2025/07/chaos-inside-yelps-sdui-framework.html</guid>
      <pubDate>Tue, 08 Jul 2025 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Revenue Automation Series: Testing an Integration with Third-Party System]]></title>
      <description><![CDATA[<h2 id="background"><strong>Background</strong></h2><p>As described in the <a href="https://engineeringblog.yelp.com/2025/02/revenue-automation-series-building-revenue-data-pipeline.html">second blog post</a> of Revenue Automation series, Revenue Data Pipeline processes a large amount of data via complex logic transformations to recognize revenue. Thus, developing a robust production testing and integration strategy was essential to the success of this project phase.</p><p>The status quo testing process utilized the <a href="https://engineeringblog.yelp.com/2016/10/redshift-connector.html">Redshift Connector</a> for data synchronization once the report was generated and published to the data warehouse (Redshift). This introduced a latency of approximately 10 hours before the data was available in the data warehouse for verification. This delay impacted our ability to verify whether the changes were accurately reflected and the data was updated as required. Additionally, the initial process involved manual data verification, increasing the risk of human error.</p><p>To enhance efficiency and minimize manual effort, we implemented a new testing strategy leveraging the concept of a “staging pipeline” which is discussed further in this blog. This improvement significantly accelerated the testing process as the data was available immediately after the reports were generated. This allowed the pipeline to detect errors earlier in the process.</p><p>Due to the unique nature of Yelp’s product implementation, we faced some challenges in testing the pipeline:</p><ul><li>Since the development environments have limited data, the different edge cases that occur in production could not be covered during dev testing. This was discovered when the data pipeline was executed in production for the first time.</li>
<li>We needed to ensure that the changes implemented in the development region do not affect data correctness in the production pipeline.</li>
<li>We had to find a way to test the pipeline behavior with production data before release.</li>
</ul><h2 id="execution-plan"><strong>Execution Plan</strong></h2><p>Glossary</p><ul><li>Billed Revenue — This is the amount that Yelp invoices the customer for.</li>
<li>Earned Revenue / Estimated Revenue — This revenue is calculated when Yelp fulfills the delivery of a purchased product.</li>
<li>Revenue Period — Time period during which we recognize revenue for a product.</li>
<li>Revenue Contract — This defines the terms under which services are delivered, and revenue is recognized.</li>
<li>Redshift Connector — A <a href="https://engineeringblog.yelp.com/2016/10/redshift-connector.html">Data Connector</a> that loads data from Data Pipeline streams into <a href="https://aws.amazon.com/redshift/">AWS Redshift</a>.</li>
<li>Latency — The time it takes for the Redshift connector to process input data.</li>
<li>Data Pipeline - Data Pipeline ecosystem refers to Yelp’s infrastructure for <a href="https://en.wikipedia.org/wiki/Streaming_data">streaming data</a> across services.</li>
</ul><p>To ensure data correctness, we followed the steps discussed below:</p><h2 id="step-1-staging-pipeline-setup"><strong>Step 1: Staging Pipeline Setup</strong></h2><p>We set up a separate data pipeline configuration, referred to as the staging pipeline, in parallel with the production pipeline. Its results were published to <a href="https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html">AWS Glue</a> tables, which enabled us to solve the Redshift Connector latency problem by making data immediately available for querying via <a href="https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-overview.html">Redshift Spectrum</a>. When new changes to the code were implemented, the staging pipeline was modified to reflect them and was executed using production data. This approach allowed us to validate changes in a production environment by easily comparing staging and production results, eliminating the need to revert changes in the production pipeline when bugs were introduced.</p><p>The diagram below illustrates the parallel pipeline.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-05-20-revenue-automation-series-testing-an-integration-with-third-party-system/staging_pipeline_setup-parallel_pipeline.png" alt="Staging Pipeline Setup - Parallel Pipeline" /></p><p>With this approach, the production pipeline and its data were left untouched until the new changes had been verified.</p><p>The diagram below shows both pipelines after staging verification.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-05-20-revenue-automation-series-testing-an-integration-with-third-party-system/staging_pipeline_setup-after_verification_stage.png" alt="Staging Pipeline Setup - After Verification Stage" /></p><h2 id="step-2-test-data-generation"><strong>Step 2: Test Data Generation</strong></h2><p>Test data in the development environment was very limited as Yelp has various products with different edge cases and revenue calculation requirements in production, thus increasing the gap between the two data sets.</p><p>We had to outline different scenarios and edge cases that might occur in production and create data in the dev environment to mimic those scenarios. Once an edge case was discovered via a manual verification and integrity checker (as explained in <a href="https://engineeringblog.yelp.com/2025/05/revenue-automation-series-testing-an-integration-with-third-party-system.html#step-3-data-integrity-checkers">step 3</a>), new data points were created in the dev environment to replicate that scenario.</p><p>The diagram below illustrates the process.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-05-20-revenue-automation-series-testing-an-integration-with-third-party-system/test_data_generation_process.png" alt="Test Data Generation Process" /></p><p>The large number of database tables required as input to the pipeline made this process very tedious as it involved manual creation of data points. Hence, for the development and testing of some scenarios, we relied heavily on the staging pipeline for verification. 
In the future, we plan to automate the creation of test data in the development environment, as discussed in the <a href="https://engineeringblog.yelp.com/2025/05/revenue-automation-series-testing-an-integration-with-third-party-system.html#future-improvements">Future Improvements</a> section.</p><h2 id="step-3-data-integrity-checkers"><strong>Step 3: Data Integrity Checkers</strong></h2><p>Ideally, integrity checks are developed to compare the output of a process against its source data. In our case, this was a comparison between the revenue contract data pipeline results, which include estimated revenue, and the billed revenue produced by Yelp’s billing system. Various metrics were explored to improve the reliability of the pipeline results. Some of the critical metrics include:</p><ul><li>
<p>Number of contracts matching invoice (billed transaction) gross revenue: We wanted to verify that at least 99.99 percent of the contracts matched the billed revenue as this level of accuracy was required to ensure the reliability of the system.</p>
</li>
<li>
<p>Number of contracts with a mismatched invoice discount: With Yelp offering different discounts and promotions applied at the business and product offering level, we had to ensure that the pipeline predicted the discount application as closely as possible to the billing system.</p>
</li>
<li>
<p>Number of contract lines with no equivalent invoice: Some of Yelp’s product offerings are billed differently. Some are manually billed and don’t necessarily pass through the automated billing process. We wanted to figure out how many of these contracts were being reported by the pipeline and if an invoice generated through the billing system was not recognized by the automated pipeline.</p>
</li>
<li>
<p>Number of duplicate or past contract lines: We also wanted to confirm that we sent unique contracts and prevented redundant past contracts from entering the Revenue Recognition system.</p>
</li>
</ul><p>After finalizing the required reporting metrics, the next steps in developing the integrity checker presented further challenges:</p><ul><li>
<p>The pipeline used snapshots of the database as its daily source, that is, the daily state of the source database is saved in a data lake, but the database itself is constantly changing.</p>
</li>
<li>
<p>Products are mostly billed at the end of the monthly period. Thus, we would have to rely on monthly comparisons for confirming data accuracy.</p>
</li>
</ul><p>The Revenue Data Pipeline is designed to send data for revenue recognition on a daily basis and the ideal data source of truth is the revenue data produced each month by Yelp’s billing system. To resolve this, we decided to develop both monthly and daily checks.</p><ul><li>Monthly Integrity Check: Revenue Recognition pipeline data was published to Redshift tables. Data from the billing system and contract pipeline were compared via SQL queries. Determining the right SQL queries was done iteratively, accounting for different edge cases based on the variety of products offered at Yelp.</li>
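<li><p>The monthly comparison boils down to one aggregate query joining the pipeline output to billed invoices. Here is a hedged sketch of its shape; the table and column names are hypothetical, and the real queries were refined iteratively for Yelp’s product edge cases.</p>

```python
# Hypothetical shape of the monthly integrity query; the table and
# column names are illustrative, not the real Redshift schema.
MONTHLY_CHECK_SQL = """
SELECT
    COUNT(*)                                   AS contracts_reviewed,
    SUM(CASE WHEN c.gross_revenue = i.gross_revenue
              AND c.net_revenue  = i.net_revenue
             THEN 1 ELSE 0 END)                AS matched_gross_and_net,
    SUM(CASE WHEN c.discount != i.discount
             THEN 1 ELSE 0 END)                AS discount_mismatches,
    SUM(CASE WHEN i.contract_id IS NULL
             THEN 1 ELSE 0 END)                AS no_equivalent_invoice
FROM contract_pipeline_results c
LEFT JOIN billing_invoices i USING (contract_id)
WHERE c.report_month = %(report_month)s
"""

def line_match_pct(matched: int, reviewed: int) -> float:
    """Line match % as reported in the monthly check output."""
    return round(100.0 * matched / reviewed, 3) if reviewed else 0.0
```
</li>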
</ul><div class="language-plaintext highlighter-rouge highlight"><pre>Sample Monthly Check Result format
---- Metrics ----
Number of contracts reviewed: 100000
Gross Revenue:  xxxx
Net Revenue:  xxxx
Number of contracts matching both gross &amp; net:  99799
Gross Revenue: xxxx
Net Revenue:  xxxx
Number of contracts with mismatch discount: 24
Gross Revenue: xx
Net Revenue:  xx
Number of contracts with mismatch gross revenue:  6
Gross Revenue: xx
Number of contracts with no equivalent invoice:  1
Gross Revenue:  xx
Net Revenue:  xx
Number of invoice with no equivalent contract:  10
Gross Revenue:  xx
Net Revenue:  xx
Line match %: 99.997
Net revenue mismatch difference:  1883.12
</pre></div><ul><li>Daily Integrity Check: Waiting for the end of the month to determine data accuracy would lead to challenges when implementing quick code modifications and data correction backfills. Data published by the pipeline also takes more than 10 hours to become available in the Redshift database. To perform checks as quickly as possible after the pipeline run completes, we implemented daily checks in two steps.
<ul><li>
<p>We published data to the data lake and created <a href="https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html">AWS Glue</a> data catalog tables to read from the data storage as soon as the data became available.</p>
</li>
<li>
<p>We implemented SQL queries on the staging pipeline results, as explained in the <a href="https://engineeringblog.yelp.com/2025/05/revenue-automation-series-testing-an-integration-with-third-party-system.html#step-1-staging-pipeline-setup">Staging Pipeline Setup</a> section.</p>
</li>
</ul></li>
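<li><p>Once the Glue catalog tables are queryable, the daily check reduces to a handful of count queries plus a decision about whether to investigate and rerun. A hypothetical sketch of that decision; the metric names mirror the daily check output, and the thresholds are our own illustrative values.</p>

```python
# Decide whether a daily pipeline run needs investigation and a rerun.
# Metric names mirror the daily check output; thresholds are hypothetical.
DAILY_THRESHOLDS = {
    "contracts_with_negative_revenue": 0,    # must always be zero
    "contracts_with_unknown_program": 0,     # must always be zero
    "contracts_passed_or_expired": 100,      # small counts are expected
    "contracts_missing_parent_category": 100,
}

def needs_rerun(daily_metrics: dict) -> list:
    """Return the metrics that exceed their allowed threshold."""
    return [
        name for name, limit in DAILY_THRESHOLDS.items()
        if daily_metrics.get(name, 0) > limit
    ]
```
</li>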
</ul><p>If any discrepancies were discovered, we would swiftly determine the cause, correct it, and rerun the pipeline for the day.</p><div class="language-plaintext highlighter-rouge highlight"><pre>Sample Daily Check Result Format
        ---- Metrics ----
number of contracts with negative revenue:  0
number of contracts passed or expired:  28
number of contracts with unknown program:   0
number of contracts missing parent category:   66
</pre></div><h2 id="step-4-data-format-validation"><strong>Step 4: Data Format Validation</strong></h2><p>To support a streamlined data upload process, we had to verify that the data required for upload to a third-party system met system-specific validation rules. We faced a few challenges where the internal file data format, column mapping sequence, or differences in the external table schema led to upload failures.</p><p>To avoid such failures, we developed a Schema Validation Batch, which retrieves the field mapping from the external system using a REST API and compares it against the revenue data schema before proceeding with the data upload.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-05-20-revenue-automation-series-testing-an-integration-with-third-party-system/schema_validation_batch_process.png" alt="Schema Validation Batch Process" /></p><p>This validation layer minimized failure risks and improved system reliability. Here is an example of a request/response to retrieve the required details.</p><div class="language-plaintext highlighter-rouge highlight"><pre>-- Request
curl -X GET --header "token: RandomBasicToken" "https://yourHost/api/integration/v1/upload/mapping?templatename=SAMPLE_TEMPLATE"
-- Response
{
        "Dateformat":"MM/DD/YYYY",
        "Mapping":[
                {
                        "sample_column_id":1,
                        "sample_column_name":"random",
                        "sample_column_data_type":"string"
                }
        ]
}
</pre></div><h2 id="step-5-data-ingestion"><strong>Step 5: Data Ingestion</strong></h2><p>We developed an Upload Batch in the integration layer, which compressed large data files for a specific report date and uploaded them to the external system. Logging and monitoring were enabled to identify issues and failures in a timely manner, contributing to building a more resilient process.</p><p>We implemented two methods for data upload:</p><ul><li>Upload using REST APIs: It is common practice to opt for standard HTTP methods, which are easy to use and scalable. This option was initially explored and used for uploads during the development phase. An example of an upload request/response is shown below:</li>
</ul><div class="language-plaintext highlighter-rouge highlight"><pre>-- Request
curl -X POST https://yourHost/api/integration/v1/upload/file \
-H 'cache-control: no-cache' \
-H 'content-type: multipart/form-data; boundary=----WebKitFormBoundary' \
-H 'templatename: Sample Template' \
-H 'token: &lt;generated token&gt;' \
-F 'file=sample_data_file.csv'
--Response
{"Message":"File consumed successfully","Result":{"file_name":"sample_data_file.csv","file_request_id":"1000"},"Status":"Success"}
</pre></div><ul><li>SFTP Upload: This method uses the SSH protocol, which provides encrypted data channels and protects sensitive data from unauthorized access. We implemented individual flag-level options for each environment to use the SFTP upload feature.</li>
</ul><p>While testing these upload methods, we observed a few differences which are briefly discussed below:</p><ul><li>
<p>Reliability issue: Testing revealed reliability concerns with the availability and performance of the REST APIs. Responses were found to be flaky, which led to inconsistent uploads and multiple retries.</p>
</li>
<li>
<p>File size limitations: The REST APIs have a predefined limit of 50,000 records per file, which resulted in approximately 15 files being generated on a daily basis and approximately 50 files during the month-end closing period. By using the SFTP upload method, we increased the file size limit to between 500,000 and 700,000 records. As a result, the pipeline now generates only 4-5 files, making the upload process more efficient and easier to maintain.</p>
</li>
<li>
<p>Setup process: The SFTP upload setup process was found to be less complex than the API setup. To maintain a standardized and more efficient process across multiple pipelines at Yelp, such as the revenue contract and invoice data pipelines, the SFTP upload approach emerged as our preferred method.</p>
</li>
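<li><p>The file-count difference above follows directly from the per-file record caps. A small illustrative sketch of the batching arithmetic; the helper name and record counts are ours, not Yelp’s actual code.</p>

```python
# Split a day's records into upload files under a per-file record cap.
# The caps come from the limits discussed above; helper names are ours.
def plan_upload_files(records, max_records_per_file):
    """Yield successive slices of at most max_records_per_file records."""
    for start in range(0, len(records), max_records_per_file):
        yield records[start:start + max_records_per_file]

daily_records = list(range(730_000))                          # ~730k contract lines
rest_files = list(plan_upload_files(daily_records, 50_000))   # REST API cap
sftp_files = list(plan_upload_files(daily_records, 500_000))  # SFTP cap
```
</li>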
</ul><p>After analyzing Yelp’s system and daily data volume requirements, we decided to enable SFTP upload as the primary method for the daily batch upload process.</p><h2 id="step-6-external-system-support"><strong>Step 6: External System Support</strong></h2><p>We confirmed that files were successfully uploaded to the external Revenue Recognition System for the production region on a daily basis. In cases where the file became stuck on the external SFTP server, we relied on the external third-party team to resolve the issue on their end before services could resume. These issues could occur for several reasons, such as:</p><ul><li>SFTP Server downtime or service disruption</li>
<li>Failure of the external upload job trigger after files were uploaded to the SFTP server</li>
<li>Insufficient table space in the external system</li>
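<li><p>Transient external failures like these are a good fit for retries with backoff before escalating to the third-party support team. A generic sketch, not Yelp’s actual tooling; <code>upload_fn</code> stands in for whatever performs the SFTP or API upload.</p>

```python
import time

def upload_with_retry(upload_fn, attempts=3, base_delay=1.0):
    """Call upload_fn, retrying transient failures with exponential backoff.
    Re-raises the last error so the failure can be escalated."""
    for attempt in range(attempts):
        try:
            return upload_fn()
        except OSError:                       # e.g. SFTP connection errors
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```
</li>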
</ul><p>To streamline the integration process, it is essential to maintain clear documentation and support channels to enable faster communication and issue resolution. We established internal guidelines for escalating high-priority issues and contacting the appropriate support teams.</p><h2 id="future-improvements"><strong>Future Improvements</strong></h2><p>As we continue to enhance the existing implementation, future opportunities could include:</p><ul><li>
<p>Automate test data generation: Manually creating test data is impractical due to the number of tables and unforeseeable edge cases. Automating test data creation could reduce development time and improve the overall testing process.</p>
</li>
<li>
<p>Optimize data generation pipeline: To reduce redundant contracts, discount calculations could be moved to the raw data generation phase, followed by an eligibility filter. This would enhance pipeline performance and improve report clarity.</p>
</li>
<li>
<p>Improve maintainability: Unexpected product types in reports can cause failures in external systems. Allowing users to include or exclude specific product types could ease new product onboarding, aid debugging, and prevent processing failures.</p>
</li>
<li>
<p>Enhance integrity checker: Improving discount matching could better align billed and earned revenue, increasing data accuracy.</p>
</li>
<li>
<p>Streamline third-party data upload: Creating a user interface for data uploads would give downstream teams better control across environments. Including upload history tracking could also reduce manual coordination, team dependencies, and delays.</p>
</li>
</ul><h2 id="acknowledgements"><strong>Acknowledgements</strong></h2><p>We have come a long way with this project, and it has been a valuable and rewarding experience. Special thanks to everyone on the Revenue Recognition Team for their commitment to delivering this project, and to all the stakeholders from the Commerce Platform and Financial Systems teams for their support.</p><div class="island job-posting"><h3>Join Our Team at Yelp</h3><p>We're tackling exciting challenges at Yelp. Interested in joining us? Apply now!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2025/05/revenue-automation-series-testing-an-integration-with-third-party-system.html</link>
      <guid>https://engineeringblog.yelp.com/2025/05/revenue-automation-series-testing-an-integration-with-third-party-system.html</guid>
      <pubDate>Tue, 27 May 2025 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Nrtsearch 1.0.0: Incremental Backups, Lucene 10, and More]]></title>
<description><![CDATA[<p>It has been over 3 years since we published our <a href="https://engineeringblog.yelp.com/2021/09/nrtsearch-yelps-fast-scalable-and-cost-effective-search-engine.html">Nrtsearch blog post</a> and over 4 years since we started using Nrtsearch, our Lucene-based search engine, in production. We have since migrated over 90% of Elasticsearch traffic to Nrtsearch. We are excited to announce the release of <a href="https://github.com/Yelp/nrtsearch">Nrtsearch 1.0.0</a> with several new features and improvements over the initial release.</p><p>First, a quick glossary of terms used in this post:</p><ul><li><a href="https://aws.amazon.com/ebs/">EBS</a> (Elastic Block Store): Network-attached block storage volumes in AWS.</li>
<li><a href="https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world">HNSW</a> (Hierarchical Navigable Small World): A graph-based approximate nearest neighbor search technique.</li>
<li><a href="https://lucene.apache.org/core/">Lucene</a>: An open-source search library used by Nrtsearch.</li>
<li><a href="https://aws.amazon.com/s3/">S3</a>: Cloud object storage offered in AWS.</li>
<li><a href="https://www.enterpriseintegrationpatterns.com/patterns/messaging/BroadcastAggregate.html">Scatter-gather</a>: A pattern where a request is sent to multiple nodes, and their responses are combined to create the result.</li>
<li><a href="https://lucene.apache.org/core/10_1_0/core/org/apache/lucene/codecs/lucene101/package-summary.html#Segments">Segment</a>: Sub-index in a Lucene index which can be searched independently.</li>
<li>SIMD (Single Instruction, Multiple Data): CPU instructions that perform the same operation on multiple data points, which can make some operations like vector search faster and more efficient.</li>
</ul><h2 id="incremental-backup-on-commit">Incremental Backup on Commit</h2><p>Nrtsearch now does an incremental backup to S3 on every commit, and this backup is used to start replicas. The motivation behind this change and more details are discussed below.</p><h3 id="initial-architecture">Initial Architecture</h3><p>A quick refresher of the Nrtsearch architecture from the first <a href="https://engineeringblog.yelp.com/2021/09/nrtsearch-yelps-fast-scalable-and-cost-effective-search-engine.html">Nrtsearch blog</a> - the primary flushes Lucene index segments to locally-mounted network storage (Amazon EBS in our case) when commit is called. This guarantees that all data up to the last commit is available on the EBS volume. If a primary node restarts and moves to a different instance, we don’t need to download the index to the new disk; instead, we only need to attach the EBS volume, allowing the primary node to start up in under a minute and reducing the indexing downtime. We call the backup endpoint occasionally on the primary to back up the index to S3. When replicas start, they download the most recent backup of the index from S3 and then sync the updates from the primary. When the primary indexes documents, it publishes updates to all replicas, which can then pull the latest segments from the primary to stay up to date.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2025-05-08-nrtsearch-v1-release/initial_nrtsearch_architecture.png" alt="Initial Nrtsearch architecture" /><p class="subtle-text"><small>Initial Nrtsearch architecture</small></p></div><h3 id="drawbacks-of-architecture">Drawbacks of Architecture</h3><p>This architecture mostly worked well for us, but there were some drawbacks:</p><ol><li>The EBS volume was the source of truth. If the EBS was lost, corrupted, or took too long to resize, we would have to reindex all data.</li>
<li>The EBS movement was not as smooth as expected. At times, the EBS volume would not be correctly dismounted from the old node, and then the new node would take some time to mount it.</li>
<li>Ingestion-heavy clusters would need to back up the entire index frequently so that replicas did not have to spend too much time catching up with the primary after downloading the index.</li>
</ol><h3 id="switching-to-ephemeral-local-disks-and-incremental-backup-on-commit">Switching to Ephemeral Local Disks and Incremental Backup on Commit</h3><p>To avoid these drawbacks, we wanted to use an ephemeral local disk instead of an EBS volume. There were two blockers for using local disk:</p><ol><li>Ensuring that restarting the primary wouldn’t result in losing changes made between the last backup and the most recent commit</li>
<li>Making sure the primary could download the index quickly enough to remain competitive with the speed of mounting an EBS volume.</li>
</ol><p>To ensure that the primary retained all changes until the last commit, we needed to back up the index after every commit instead of doing it periodically. To do this in a feasible manner, we switched to incrementally backing up individual files instead of creating an archive of the index. Lucene segments are immutable, so when we perform a backup, we only need to upload the new files since the last backup. On every commit, Nrtsearch checks the files in S3, determines the missing files, and uploads them. This makes a commit slightly slower, as we are now uploading files to S3 where we were previously only flushing them to EBS. The additional time is generally a few milliseconds to 20 seconds depending on the size of the data, which is small enough not to cause any issues.</p><p>To address the second blocker, we started downloading multiple files from S3 in parallel to make full use of the available network bandwidth. Combined with a local SSD, this yielded a 5x increase in download speed. With both blockers resolved, we were able to stop using EBS volumes in favor of local disks.</p><p>We still have the ability to take full consistent backups (snapshots) to use in case the index gets corrupted. Instead of having the primary perform this operation, we now directly copy the latest committed data between locations in S3. Since these full backups are not involved in replica bootstrapping, they can be less frequent than before.</p><p>The updated Nrtsearch architecture can be seen below. Large Nrtsearch indices are split into multiple clusters, and the Nrtsearch coordinator directs all requests to the correct Nrtsearch primaries and replicas. It also does scatter-gather for search requests if needed. On the ingestion side, the Nrtsearch coordinator receives index, commit, and delete requests from indexing clients, and forwards the requests to the correct Nrtsearch primaries. 
Check out <a href="https://engineeringblog.yelp.com/2023/10/coordinator-the-gateway-for-nrtsearch.html">Coordinator - The Gateway For Nrtsearch</a> blog for more information about sharding in Nrtsearch.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2025-05-08-nrtsearch-v1-release/updated_nrtsearch_architecture.png" alt="Updated Nrtsearch architecture" /><p class="subtle-text"><small>Updated Nrtsearch architecture</small></p></div><h2 id="lucene-10">Lucene 10</h2><p>We used the latest version of the Lucene library available during the initial development of Nrtsearch, version 8.4.0. We have now updated to use the latest release of Lucene 10 (10.1.0). This update includes a host of improvements, optimizations, bug fixes, and new features. The most notable new feature is vector search using the HNSW algorithm. Additionally, combined with our update to Java 21, Lucene 10 can leverage newer java features such as SIMD vector instructions and the foreign memory API.</p><h3 id="legacy-state-management">Legacy State Management</h3><p>The original management of cluster and index state was simplistic and difficult to work with.</p><p>The cluster state only contained the names of the indices that had been created. The servers did not know which of these indices should be started, so it was up to the deployment manager (<a href="https://kubernetes.io/docs/concepts/extend-kubernetes/operator/">Kubernetes operator</a>) to start the necessary indices during server bootstrapping. This resulted in more complexity for the operator.</p><p>Index state is composed of three main sections:</p><ul><li>Settings: properties that can only be configured before an index is started</li>
<li>Live settings: properties that can be updated dynamically</li>
<li>Fields: index schema specifying the field types and their properties</li>
</ul><p>The process for updating index state on a cluster was time consuming and error prone:</p><ol><li>Issue requests to the primary to change index state</li>
<li>Issue an index commit request on the primary, which makes the data and state durable on local storage (EBS)</li>
<li>Issue a backup request on the primary, which makes the latest committed state and index data durable in remote storage (S3)</li>
<li>Restart all cluster replicas to load the latest remote state and data</li>
</ol><h3 id="pain-points">Pain Points</h3><p>There were several issues with the state update process:</p><ol><li>The commit of index state and data were coupled together. This was unnecessary, since the only allowed state changes were backwards compatible with previous data. Specifically, new fields can be added to the index, but existing fields cannot be removed or modified.</li>
<li>State changes were not durable until after a commit request. This meant that an update could be lost if the primary server restarted between the state modification and the commit.</li>
<li>The local disk (EBS) on the primary was the source of truth for cluster state. However, there is only a single primary, which is unavailable during restarts and re-deployments. Since the state was sometimes inaccessible, building tools around it was difficult.</li>
<li>Needing to backup an index and restart all replicas significantly extended the time needed to fully propagate state changes across a cluster.</li>
</ol><p>Internally, when a state change was applied, it was not done in an isolated way. As a result, state values could change when sampled multiple times during the processing of a single request. This could lead to inconsistencies and more edge cases to handle.</p><h3 id="new-state-system">New State System</h3><p>The state management system was redesigned to address the above issues.</p><p>The cluster state was updated to include additional information about the indices. The ‘started’ state of each index is tracked. This allows the necessary indices to be automatically started during server bootstrapping, removing the need for external coordination. The state also contains a unique identifier for each index to isolate data/state in the case that an index is recreated.</p><p>Committing state changes is decoupled from committing data. Because of this, state changes are now committed within the life of the update request. This prevents changes made by successful requests from being lost. Clients will also no longer see state values applied to the primary that have not yet been committed.</p><p>The location of state data can be set to either local (EBS) or remote (S3). Setting the location to remote means that the local data is no longer the source of truth for state, and the primary disk no longer needs to be durable to maintain cluster state.</p><p>The ability to hot reload state was added to replicas, allowing the application of changes without needing a server restart. The state update process is now simplified to:</p><ol><li>Issue requests to the primary to change state</li><li>Hot reload state on all replicas</li></ol><p>This greatly reduces the time needed to apply a change to the whole cluster.</p><p>Internally, the index state was reworked into an immutable representation. When a change is processed, it is merged into the existing state to produce a new immutable representation. 
After the change is committed to the state backend (EBS or S3), the new state atomically replaces the reference to the old state. This prevents changes from being observed before they are committed. Client requests retrieve the current state once, and reference it for the remainder of the request. Since the state object is immutable, changes will not be visible during the processing of a single request.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2025-05-08-nrtsearch-v1-release/nrtsearch_state_management.png" alt="Legacy vs modern state management" /><p class="subtle-text"><small>Legacy vs modern state management</small></p></div><h2 id="vector-search">Vector Search</h2><p>Starting in major version 9, Lucene added support for vector data and nearest neighbor search with the HNSW algorithm. Nrtsearch leverages this API to provide kNN vector search for float and byte vector data with up to 4096 elements. A number of different similarity types are available: cosine (with or without auto normalization), dot product, Euclidean, and maximum inner product.</p><p>Nrtsearch also exposes several additional advanced features:</p><ul><li>Float vectors may be configured to use scalar quantized values for search, allowing a tradeoff between accuracy and memory usage</li>
<li>Vector search is available for fields in nested documents</li>
<li>Intra merge parallelism may be configured to speed up merging vector graph data</li>
<li>Optimized SIMD instruction support, provided by Java, can be enabled to accelerate vector computations</li>
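<li><p>For intuition, kNN vector search ranks documents by a similarity function over their vectors. Below is a brute-force Python sketch of cosine-similarity ranking; it is illustrative only, since Nrtsearch itself searches Lucene’s HNSW graphs in Java rather than scanning linearly.</p>

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length float vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def knn(query, docs, k=2):
    """Top-k doc ids by cosine similarity; brute force for illustration."""
    ranked = sorted(docs, key=lambda d: cosine(query, d["vector"]), reverse=True)
    return [d["id"] for d in ranked[:k]]
```
</li>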
</ul><p>See <a href="https://nrtsearch.readthedocs.io/en/latest/vector_search.html">Vector Search and Embeddings in Nrtsearch</a> for more details.</p><h2 id="aggregations">Aggregations</h2><p>One of the requirements for migrating our existing search applications from Elasticsearch was providing limited support for aggregations. We initially tried to replicate any needed functionality by using and extending Lucene facets. This worked for some of the simpler use cases, but was not suitable for more complex and/or nested aggregations.</p><p>There were two main issues with using faceting. First, complex aggregation logic could be added using the Nrtsearch plugin system, but this required creating custom code for every use case, which was not a scalable or maintainable process. Second, facet processing was not integrated with parallel search. Collecting and ranking documents is done in parallel by dividing the index into slices. However, facet result processing happened after collection and was single-threaded. For large indices with complex aggregations, this could noticeably add to request latency.</p><p>As an alternative to facets, we added an aggregation system that integrates with parallel search. Aggregations are tracked independently for each index slice. When a slice document is recalled and collected for ranking, it is also processed by the aggregations to update internal state (such as term document counts).</p><p>When parallel search finishes for each index slice, it performs a reduce operation to merge all the slice top hits together to form the global document ranking. This reduction also happens for the aggregations. The aggregation state from each slice is merged together to produce the global state. The global state is used to produce the aggregation results returned to the client in the search response. 
In the case where aggregations are nested, this merge happens recursively.</p><p>Currently, this system supports the following aggregations:</p><ul><li>term (text and numeric) - creates bucket results with counts of documents that contain the terms</li>
<li>filter - filter documents to nested aggregations based on a given criteria</li>
<li>top hits - top k ranked documents based on relevance score or sorting</li>
<li>min - minimum observed value</li>
<li>max - maximum observed value</li>
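<li><p>The per-slice collect-then-reduce flow described above can be sketched as follows for a term aggregation. This Python version is purely illustrative; the real implementation lives in Nrtsearch’s Java code.</p>

```python
from collections import Counter

def collect_slice(docs, field):
    """Per-slice aggregation state: term -> document count."""
    state = Counter()
    for doc in docs:
        state[doc[field]] += 1
    return state

def reduce_slices(slice_states):
    """Merge per-slice states into the global aggregation result,
    mirroring the reduce step that also merges top hits."""
    merged = Counter()
    for state in slice_states:
        merged.update(state)
    return merged
```
</li>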
</ul><h2 id="support-for-more-plugins">Support for More Plugins</h2><p>Nrtsearch is highly extensible and supports a variety of plugins, enabling customization and enhanced functionality. Some of the key plugins include:</p><ol><li>Script - custom scoring logic written in Java (simple custom logic can also be specified using Lucene JavaScript expressions without a plugin)</li>
<li>Rescorer - allows custom rescore operations to refine search results by recalculating scores for a subset of documents.</li>
<li>Highlight - provides custom highlighting to emphasize relevant sections in search results.</li>
<li>Hits Logger - facilitates custom logging of search result hits, useful for collecting data to train machine learning models.</li>
<li>Fetch Task - enables custom processing of search hits to retrieve and enrich data as needed.</li>
<li>Aggregation - custom aggregation implementation. Together, these plugins empower users to tailor Nrtsearch to their specific search and retrieval requirements.</li>
</ol><p>We have exposed more Lucene query types in our search request. We are still missing some queries though, and we plan to add more query types on an as-needed basis. You can find the currently available query types in our <a href="https://nrtsearch.readthedocs.io/en/latest/querying_nrtsearch.html">documentation</a>.</p><p>We are planning to do NRT replication via S3 instead of gRPC to allow replicas to scale without the primary being a bottleneck. We are also planning to replace virtual sharding with parallel search in a single segment, a feature added in Lucene 10.</p><div class="island job-posting"><h3>Become a Data Backend Engineer at Yelp</h3><p>Do you love building elegant and scalable systems? Interested in working on projects like Nrtsearch? Apply to become a Data Backend Engineer at Yelp.</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2025/05/nrtsearch-v1-release.html</link>
      <guid>https://engineeringblog.yelp.com/2025/05/nrtsearch-v1-release.html</guid>
      <pubDate>Thu, 08 May 2025 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Journey to Zero Trust Access]]></title>
<description><![CDATA[<h2 id="glossary">Glossary</h2><ul><li><strong>ZTA:</strong> Zero Trust Architecture</li>
<li><strong>SAML:</strong> Security Assertion Markup Language (an SSO facilitation protocol)</li>
<li><strong>Devbox:</strong> a remote server used to develop software</li>
</ul><h2 id="zero-trust-access"><strong>Zero Trust Access</strong></h2><h3 id="remote-future"><strong>Remote Future</strong></h3><p>Yelp is now a <strong>fully remote company</strong>, which means our employee base has become increasingly distributed across the world, making secure access to resources from anywhere a critical business function. Yelp historically used Ivanti Pulse Secure as its employee VPN, but the need for a more reliable solution made it clear that a change was necessary to ensure secure and consistent access to internal resources. The Corporate Systems and Client Platform Engineering teams began looking for alternative connectivity options to replace Pulse in late 2023. Early discussions raised the question of what role a VPN should play at Yelp. We knew that a large population of daily VPN users did not require full network access and needed only select web applications to perform their job duties. Work was already underway to shift less sensitive applications to alternate access methods like our mTLS-based Edge Gateway. However, this was not an immediate solution for all employees and required significant effort for widespread implementation. This led us to understand that we needed a solution that could support all Yelpers today, with the goal of reducing its use to more granular use cases in the future.</p><p>We also recognized that the vast majority of use cases that would be difficult to migrate off VPN were engineering oriented. Whether it was SSH access to a devbox or downloading files from internal servers, these activities involved complex and diverse access patterns. Additionally, they would benefit from improved throughput, something significantly constrained by the Pulse solution. Finally, we understood that Zero Trust Architecture (ZTA) was the future. 
ZTA was not only becoming an industry trend, but also aligned with our long-term goals of reducing VPN utilization and creating more fine-grained access control structures in the future, as opposed to broad, binary policies on huge subnets and network segments. Implementing a secure, modern, high-availability solution was paramount to maintaining productivity for our employees.</p><h3 id="wireguard-and-netbird"><strong>WireGuard and Netbird</strong></h3><p>WireGuard debuted in 2015, and since then it has quickly become the protocol of choice for secure network access. When alternatives to Ivanti were being evaluated, we found ourselves gravitating towards WireGuard-based solutions. To say the least, there are plenty. To narrow down our options, we started compiling a list of must-haves that we considered to be pillars of a great user experience.</p><p>The core pillars we identified as essential for a solution that met our operational and user needs are listed below. Netbird ultimately stood strong, supporting all five, which made it our solution of choice:</p><ol><li>Support for Okta as an identity and authentication provider</li>
<li>A simple and intuitive user interface</li>
<li>Open source and extensible</li>
<li>Capable of high throughput and low latency</li>
<li>Fault tolerant</li>
</ol><p>Let’s expand on some items in this feature wishlist and how Yelp engineering worked towards realizing them using Netbird and WireGuard.</p><p>At Yelp, Okta is used to employ a Zero Trust Authentication model as part of our overall Zero Trust Access strategy. Our previous solution initially used LDAP for authentication, which lacked advanced user and device trust verification. Later, we transitioned to SAML, but the implementation within Ivanti’s product led to a suboptimal user experience due to a cumbersome browser-to-VPN client handoff for session authentication. To address these issues, we sought a solution that supported OpenID Connect (OIDC), specifically with Okta integration. This approach empowers us to enforce policies that ensure only users on managed devices with a strong security posture are granted access. In today’s environment, it’s not enough to simply verify the identity of the user.</p><p>While intuitive authentication was not our sole goal, OIDC was the first step to ensure a great user experience. Yelp is proud to employ a very diverse workforce with different levels of technical expertise and abilities. Therefore, we wanted an application that was simple and intuitive. Netbird’s implementation of their client largely met this goal, but we found that simplicity is key when supporting less-technical users. We began to tailor the client experience for our specific needs, removing some of the more advanced options from the UI. Additionally, we added elements that empowered users to quickly and easily self-repair, access user-friendly documentation, and request assistance from our awesome helpdesk teams. We didn’t stop there though. We personalized the icons and added feedback for each specific stage of the connection process. 
All these modifications would not have been possible without a code base that was approachable, well thought out, and open source.</p><p>Additionally, open source products would allow us to keep a finger on the pulse of a project by tracking its commit and issue histories. What’s more, if critical security issues ever arose, we would not be beholden to the maintainers alone - we ourselves could provide fixes if need be. Open source also means Yelp has the opportunity to contribute back to the community, enhancing the software for everyone’s benefit. To date, multiple changes have been pushed upstream to Netbird’s main branch from Yelpers who encountered, debugged, and ultimately solved issues along the way.</p><p>While most users do not demand high throughput and low latency to complete their day-to-day business functions, improving these aspects was a clear quality-of-life enhancement we aimed to achieve with a new solution. Nothing drags on like downloading large logs at a snail’s pace, cloning big Git repos and watching the commits trickle in, or connecting to a terminal or remote desktop and feeling like you are moving in slow motion. Pulse, optimistically, had a peak download speed in the low tens of megabits per second for most users. With Netbird being plugged into a 10-gigabit backbone and supported by the blazing fast cryptography of WireGuard, testing showed users could achieve speeds upwards of 1 gigabit per second - mostly restricted by their home internet speed limits. Latency was also close to the pure cost of the wire users traversed, with single-digit milliseconds of overhead added by wrapping packets in the WireGuard protocol. Simply put, it was fast, and this is exactly what we wanted.</p><p>Finally, the worst user experience arises when things simply don’t work. We needed a solution that was robust and fault tolerant. Relying on users to connect to their optimal endpoint was a needless complication. 
Furthermore, outages, regardless of their cause, should not require user intervention to mitigate and ideally should be totally transparent to the end user. WireGuard’s mesh topology aspires to add this redundancy by creating multiple paths a user can take and be simultaneously connected to. Each of these paths or endpoints is referred to in Netbird as a router peer. All members of the mesh are peers, but router peers serve the special role of being able to accept and egress traffic from other peers. Clients intrinsically have a one-to-many relationship with the router peers they are permitted to use. This allows for maintenance or service interruption on one router peer without causing a user to reconnect to the network or experience noticeable degradation. Our testing with Netbird showed that a router peer that was actively handling traffic for a given peer could suddenly halt operation, and the client would experience a sub-2-second connectivity interruption while their traffic was rerouted to another host. This addressed the final pillar in our user experience aspirations, as we could respond to incidents, whether security, operational, or otherwise, with ease and confidence that Yelpers could continue working without disruption.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-04-15-journey-to-zero-trust-access/rollout-plan.png" alt="netbird-rollout-plan" /></p><h2 id="implementation-and-outlook"><strong>Implementation and Outlook</strong></h2><p>In summary, our desire to learn from our previous challenges with a legacy VPN solution led to a dramatic shift in corporate connectivity. 
We aimed to provide Yelpers with secure and consistent access to internal resources, ultimately enhancing the daily experiences of Yelpers who work tirelessly to connect people with great local businesses.</p><p>Despite encountering hurdles and bumps along the road to delivering this next generation of connectivity, the outcome shows the juice was well worth the squeeze. We look forward to our next post, where we talk about the implementation, challenges, and initial architecture of Yelp’s deployment of Netbird.</p><div class="island job-posting"><h3>Join Our Team at Yelp</h3><p>We're tackling exciting challenges at Yelp. Interested in joining us? Apply now!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2025/04/journey-to-zero-trust-access.html</link>
      <guid>https://engineeringblog.yelp.com/2025/04/journey-to-zero-trust-access.html</guid>
      <pubDate>Tue, 15 Apr 2025 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Revenue Automation Series: Building Revenue Data Pipeline]]></title>
      <description><![CDATA[<h2 id="background"><strong>Background</strong></h2><p>As Yelp’s business continues to grow, the revenue streams have become more complex due to the increased number of transactions, new products and services. These changes over time have challenged the manual processes involved in <a href="https://en.wikipedia.org/wiki/Revenue_recognition">Revenue Recognition</a>.</p><p>As described in the <a href="https://engineeringblog.yelp.com/2024/12/modernization-of-yelp's-legacy-billing-system.html">first post</a> of the Revenue Automation Series, Yelp invested significant resources in modernizing its Billing System to fulfill the pre-requisite of automating the revenue recognition process.</p><p>In this blog, we would like to share how we built the Revenue Data Pipeline that facilitates the third party integration with <strong>a Revenue Recognition SaaS solution</strong>, referred to hereafter as the <strong>REVREC</strong> <strong>service</strong>.</p><h2 id="our-journey-to-automated-revenue-recognition"><strong>Our Journey to Automated Revenue Recognition</strong></h2><p>The REVREC service provides the following benefits in recognizing revenue:</p><ul><li>Recognize any <a href="https://en.wikipedia.org/wiki/Revenue_stream">revenue stream</a> with minimal cost and risk for one time purchases and subscriptions, offered at either flat rates or variable prices.</li>
<li>Unblock a continuous accounting process and reconcile revenue data in real time, so the Accounting team can <a href="https://en.wikipedia.org/wiki/Financial_close_management">close the books</a> up to 50% faster.</li>
<li>Forecast revenue in real time, which gives us access to out-of-the-box reports and dashboards for revenue analytics.</li>
</ul><p>In order to get all the benefits above, we needed to ensure the REVREC service had the right data to recognize revenue. Therefore, a centerpiece of this project was to build a data pipeline that collects and produces quality revenue data from Yelp’s ecosystem to the REVREC service.</p><h3 id="step-1-handle-ambiguous-requirement"><strong>Step 1: Handle Ambiguous Requirements</strong></h3><p>We started with a list of product requirements around a standard revenue recognition process as below:</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-02-19-revenue-automation-series-building-revenue-data-pipeline/revrec-flow.drawio.png" alt="Handling Ambiguous Requirements" /></p><p>Let’s take a look at one of the example requirements under the <strong>Define Transaction Amount</strong> section: <strong>The total contract value (TCV) should be captured during booking as revenue to be recognized based on fair value allocation by each line will be different from invoice amount.</strong></p><p>These requirements were written for accountants but were quite challenging for engineers to comprehend. In order to translate them into engineering-friendly requirements and make sure we reached cross-functional alignment, we followed the below methodology to break them down:</p><p>First, we needed to create a <strong>Glossary Dictionary</strong> that mapped business terms to engineering terms at Yelp. Having such a mapping quickly aligned the understanding among cross-functional teams. For the example requirement, the Glossary Dictionary is illustrated as below:</p><p><strong>Requirement: The total contract value (TCV) should be captured during booking as revenue to be recognized based on fair value allocation by each line will be different from invoice amount.</strong></p><p><strong>Glossary Dictionary</strong></p><ul><li><strong>Revenue Contract:</strong> A subscription order placed on Yelp’s Biz Site.</li>
<li><strong>Booking:</strong> The moment when a user completes a subscription purchase request.</li>
<li><strong>Fair Value:</strong> The gross amount if we sell this product standalone.</li>
<li><strong>Each Line:</strong> The minimum granularity at which we track fulfillment and billing of a product feature.</li>
<li><strong>Invoice Amount:</strong> The net amount on the ledger for a product feature in a revenue period.</li>
</ul><p>Second, we needed to extract the <strong>purpose</strong> of this requirement so that we understood why we needed certain data in the first place and what data was closest to the business need. It was also helpful to explain it with an <strong>example calculation</strong>.</p><p><strong>Purpose:</strong></p><ul><li><strong>Make sure revenue allocation is based on the price at which we would sell this product alone, and not based on how much we actually bill for this product.</strong></li>
<li><strong>Example Calculation:</strong>
<ul><li><strong>Product A</strong> standalone selling price is $100</li>
<li><strong>Product B</strong> standalone selling price is $20</li>
<li>A subscription bundle of <strong>Products A and B</strong> sells Product A for $100 and Product B as a free add-on.</li>
<li>The subscription order’s total revenue of $100 should be split based on the standalone selling prices: A’s share is 100/120 * 100 = $83.333 and B’s share is 20/120 * 100 = $16.667</li>
</ul></li>
</ul><p>Finally, we translated the original requirement into an engineering-friendly requirement as below:</p><p><strong>The REVREC service requires a purchased product’s gross amount to be sent over whenever a user completes a subscription order; this amount may differ from the amount we actually bill the user.</strong></p><h3 id="step-2-data-gap-analysis"><strong>Step 2: Data Gap Analysis</strong></h3><p>The project team also faced challenges integrating Yelp’s custom order-to-cash system (details can be found in this <a href="https://engineeringblog.yelp.com/2024/12/modernization-of-yelp's-legacy-billing-system.html">blog post</a>) with a standard ETL (Extract, Transform, and Load) architecture that usually works well in such projects. This led to a data gap analysis to align Yelp’s data with the required template for integration.</p><p>Some of the main challenges and the solutions included:</p><ul><li><strong>Data Gaps</strong>
<ul><li>Issue: No direct mapping between fields in Yelp’s system and the 3rd Party System.</li>
<li>Immediate Solutions:
<ul><li>Use approximations, e.g., using gross billing amounts as list prices.</li>
<li>Composite data, e.g., creating unique monthly contract identifiers by combining product unique IDs with revenue periods.</li>
</ul></li>
<li>Long-term Solution: Develop a Product Catalog system for common product attributes.</li>
</ul></li>
<li><strong>Inconsistent Product Implementation</strong>
<ul><li>Issue: Data attributes were scattered across various databases for different products.</li>
<li>Immediate Solutions:
<ul><li>Pre-process data from different tables into a unified schema.</li>
<li>Categorize and preprocess data by type (e.g., subscription, billing, fulfillment).</li>
</ul></li>
<li>Long-term Solution: Propose unification of billing data models into a centralized schema for future needs.</li>
</ul></li>
</ul><p>The outcome of the data gap analysis not only clarified the immediate solutions to support automated revenue recognition with status quo data, but also gave us better direction on long-term investments that would bring the custom order-to-cash system closer to industry standards so that the data mapping would be more straightforward.</p><h3 id="step-3-system-design-evaluations"><strong>Step 3: System Design Evaluations</strong></h3><p>At Yelp, there are many options for processing, streaming, and storing data at a large scale. We considered the available options and evaluated their pros and cons before picking the option that would suit our requirements.</p><p>We had the following design choices for the storage and data processing framework:</p><p><strong>Option 1: MySQL + Python Batch</strong>:</p><ul><li>Traditional method for generating financial reports.</li>
<li>Rejected due to inconsistent rerun results from changing production data and slow batch processing times during peak data volumes.</li>
</ul><p><strong>Option 2</strong>: <strong>Data Warehouse + <a href="https://www.getdbt.com/product/what-is-dbt">dbt</a></strong>:</p><ul><li>Uses SQL for data transformation, allowing non-engineers to update jobs.</li>
<li>Rejected due to difficulty representing complex logic in SQL and insufficient production use cases for confidence in its reliability.</li>
</ul><p><strong>Option 3</strong>: <strong>Event Streams + Stream Processing</strong>:</p><ul><li>Proven technology with near-real-time data processing capability.</li>
<li>Rejected because immediate data presentation isn’t necessary, and third-party interfaces don’t support stream integration, adding complexity without benefits.</li>
</ul><p><strong>Option 4</strong>: <strong>Data Lake + Spark ETL</strong>:</p><ul><li>MySQL tables are snapshotted daily and stored in a Data Lake, then processed with Spark ETL.</li>
<li>Preferred choice, benefits include independent reproducibility of data sources and processing, scalability during peak times, and strong community support.</li>
</ul><p>Ultimately, we chose the <strong>Data Lake + Spark ETL</strong> option. However, it presented challenges such as managing complex data processing DAGs and translating existing business logic from Python to PySpark efficiently. We will discuss how those challenges are addressed in the following section.</p><h2 id="address-technical-challenges"><strong>Address Technical Challenges</strong></h2><h3 id="manage-complex-spark-etl-pipeline"><strong>Manage Complex Spark ETL Pipeline</strong></h3><p>Reporting Revenue Contract data requires comprehensive data from different sources, including raw MySQL table snapshots and pre-computed data sources from external systems.</p><p>It also involves multiple stages of data transformation, where we need to aggregate data into categories, add additional fields by applying transformation logic, and then join and map them into final data templates. You can get a sense of the Spark ETL pipeline by viewing a simplified diagram below:</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-02-19-revenue-automation-series-building-revenue-data-pipeline/simplified_revpro_diagram.png" alt="Spark ETL Pipeline Structure" /> Managing such complex ETL pipelines can be a daunting task, especially when dealing with intricate revenue recognition logic. At Yelp, we used an internal package called <em>spark-etl</em> to streamline this process. In this section, we’ll explain how spark-etl helps us manage and maintain our ETL pipelines effectively.</p><h4 id="building-blocks---spark-features"><strong>Building Blocks - Spark Features</strong></h4><p>The building blocks of a spark-etl program are Spark features, which define the input, transformation logic, and output. 
These features resemble web APIs with their request-response schemas.</p><p>In our design, we classified Spark features into two categories: source data snapshot features and transformation features.</p><h4 id="source-data-snapshot-features"><strong>Source Data Snapshot Features</strong></h4><p>Source data snapshot features read database snapshots from S3 and pass the data downstream without any transformation. This raw data can then be reused by various transformation features. Here’s an example of how source data is retrieved from an S3 location:</p><div class="language-py highlighter-rouge highlight"><pre>class ARandomDatabaseTableSnapshotFeature(SparkFeature):
    alias = f"{TABLE_NAME}_snapshot"
    def __init__(self) -&gt; None:
        self.sources = {
            TABLE_NAME: S3PublishedSource(
                base_s3_path=get_s3_location_for_table(TABLE_NAME),
                source_schema_id=get_schema_ids_for_data_snapshot(TABLE_NAME),
                date_col="_dt",
                select_cols=TABLE_SCHEMA,
            )
        }
    def transform(
        self, spark: SparkSession, start_date: date, end_date: date, **kwargs: DataFrame
    ) -&gt; DataFrame:
        return kwargs[TABLE_NAME]
</pre></div><h4 id="transformation-features"><strong>Transformation Features</strong></h4><p>Transformation features take in source data snapshot features or other transformation features as pyspark.DataFrame objects. They perform various transformations like projection, filtering, joining, or applying user-defined functions. Here’s an example of pyspark.DataFrame operations in a transform function:</p><div class="language-py highlighter-rouge highlight"><pre>class ARandomTransformationFeature(SparkFeature):
    def __init__(self) -&gt; None:
        self.sources = {
            "feature_x": ConfiguredSparkFeature(),
            "feature_y": ConfiguredSparkFeature(),
        }
    def transform(
        self, spark: SparkSession, start_date: date, end_date: date, **kwargs: DataFrame
    ) -&gt; DataFrame:
        feature_x = kwargs["feature_x"]
        feature_y = kwargs["feature_y"]
        # Transform DataFrame based on needs
        feature_x = feature_x.withColumn(
            "is_flag", lit(False).cast(BooleanType())
        )
        feature_y = feature_y.withColumn(
            "time_changed", lit(None).cast(IntegerType())
        )
        aggregated_feature = feature_x.unionByName(feature_y).drop("alignment")
        return aggregated_feature.filter(
            (
                aggregated_feature.active_period_start
                &lt;= aggregated_feature.active_period_end
            )
            | aggregated_feature.active_period_end.isNull()
            | aggregated_feature.active_period_start.isNull()
        )
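
# Note on the filter above: rows with a valid active period
# (active_period_start &lt;= active_period_end) are kept, as are open-ended rows
# where either bound is NULL; only rows where both bounds are present but
# inverted (start &gt; end) are dropped.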
</pre></div><h4 id="dependency-management"><strong>Dependency Management</strong></h4><p>The dependency relationship is handled by a user-defined YAML file which contains all the related Spark features. There is no need to draw a complex diagram of dependency relationships in the YAML file. At runtime, spark-etl figures out the execution sequence according to the topology.</p><p>For instance, if the relationship is presented as the below DAG, the corresponding YAML configuration only needs to specify the nodes but not the edges to keep the configuration DRY, since the edges are already defined in SparkFeature.sources.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-02-19-revenue-automation-series-building-revenue-data-pipeline/dependency.png" alt="Complex Feature Dependency Relationships" /></p><div class="language-plaintext highlighter-rouge highlight"><pre>features:
    &lt;feature1_alias&gt;:
        class: &lt;path.to.my.Feature1Class&gt;
    &lt;feature2_alias&gt;:
        class: &lt;path.to.my.Feature2Class&gt;
    &lt;feature3_alias&gt;:
        class: &lt;path.to.my.Feature3Class&gt;
    &lt;feature4_alias&gt;:
        class: &lt;path.to.my.Feature4Class&gt;
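    # note: edges between features are omitted on purpose; at runtime spark-etl
    # infers the execution order from each SparkFeature's `sources` mapping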
publish:
    s3:
        - &lt;feature4_alias&gt;:
            path: s3a://bucket/path/to/desired/location
            overwrite: True
</pre></div><h4 id="debugging"><strong>Debugging</strong></h4><p>Given the distributed nature and <a href="https://medium.com/@john_tringham/spark-concepts-simplified-lazy-evaluation-d398891e0568">lazy evaluation</a> of Spark programs, it’s hard to set a breakpoint and debug interactively. Since this new data pipeline often deals with data frames that have thousands or even millions of rows, checkpointing intermediate data frames to a scratch path is a convenient way to inspect data for debugging and to resume the pipeline faster by specifying computationally expensive features’ paths.</p><p>Yelp’s spark-etl package enables features to be checkpointed to a scratch path, for example:</p><div class="language-plaintext highlighter-rouge highlight"><pre>spark-submit \
        /path/to/spark_etl_runner.py \
        --team-name my_team \
        --notify-email my_email@example.com \
        --feature-config /path/to/feature_config.yaml \
        --publish-path s3a://my-bucket/publish/ \
        --scratch-path s3a://my-bucket/scratch/ \
        --start-date 2024-02-29 \
        --end-date 2024-02-29 \
        --checkpoint feature1,feature2,feature3
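
# Features named via --checkpoint are persisted under --scratch-path, so later
# runs can inspect the intermediate DataFrames or resume without recomputing them.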
</pre></div><p>Then <a href="https://engineeringblog.yelp.com/2023/07/overview-of-jupyterhub-ecosystem.html">Jupyterhub</a> came in handy for reading that checkpointed data, making the debugging experience more straightforward and shareable among the team.</p><h3 id="translating-python-logic-to-pyspark"><strong>Translating Python Logic to PySpark</strong></h3><p>At Yelp, our revenue recognition process involves numerous business rules due to the variety of products we offer. Converting such logic into a PySpark transformation function requires careful design and possibly several iterations. While PySpark’s SQL-like expressions (such as selects, filters, and joins) are powerful, they may not be flexible enough for complex business logic. In such cases, PySpark UDFs (User Defined Functions) provide a more flexible solution for implementing intricate rules, like discount applications. We used PySpark UDFs for various pieces of logic that were too complex for SQL-like expressions to handle.</p><p>There are two types of UDFs in PySpark: pyspark.sql.functions.udf and pyspark.sql.functions.pandas_udf. While both serve similar purposes, this demonstration will focus on the former for simplicity.</p><h4 id="udf-example-business-rules-for-discount-application"><strong>UDF Example: Business Rules for Discount Application</strong></h4><p>Consider the following simplified business rules for applying discounts:</p><ul><li>A product can consider a discount as applicable if its active period completely covers the discount’s period.</li>
<li>Each product can receive only one discount.</li>
<li>Type A products receive discounts before Type B products.</li>
<li>If products have the same type, the one with the smaller ID receives the discount first.</li>
<li>Discounts with smaller IDs are applied first.</li>
</ul><h4 id="udf-example-implementation"><strong>UDF Example: Implementation</strong></h4><p>After grouping both products and discounts by business_id, we need to determine the discount application for each product. Here’s a Python function that encapsulates this logic, how it can be applied to the grouped data frames and how to retrieve the results:</p><div class="language-py highlighter-rouge highlight"><pre>@udf(ArrayType(DISCOUNT_APPLICATION_SCHEMA))
def calculate_discount_for_products(products, discounts):
    # Sort products and discounts based on priority
    products = sorted(products, key=lambda x: (x['type'], x['product_id']))
    # Rows are immutable inside a UDF, so copy discounts into dicts before
    # decrementing their remaining amounts below
    discounts = sorted((d.asDict() for d in discounts), key=lambda x: x['discount_id'])
    results = []
    for product in products:
        for discount in discounts:
            if period_covers(discount['period'], product['period']) and discount['amount'] &gt; 0:
                amount = min(product['budget'], discount['amount'])
                discount['amount'] -= amount
                results.append((product['product_id'], product['type'], product['period'], amount, discount['discount_id']))
                break
    return results
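
# Worked example with hypothetical values: products
#   {'product_id': 1, 'type': 'A', 'budget': 100} and
#   {'product_id': 2, 'type': 'B', 'budget': 50}
# compete for one discount {'discount_id': 10, 'amount': 120} covering both
# periods. Product 1 sorts first (Type A before B) and takes
# min(100, 120) = 100, leaving 20 on the discount; product 2 then takes
# min(50, 20) = 20.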
# Say that we have previously grouped products and discounts by business_id (code not shown here)
# Apply the UDF above
result_df = (
    grouped_contracts_with_biz_discounts.withColumn(
        "results",
        calculate_discount_for_products("products", "discounts"),
    )
)
# Then explode the discount application by selecting the key of grouping products and discounts,
# in this case business_id
result_exploded = result_df.select(
    "business_id", explode("results").alias("exploded_item")
)
# Retrieve product and the applied discount amount/id from exploded items
result_exploded = result_exploded.select(
    "business_id",
    "exploded_item.product_id",
    "exploded_item.amount",
    "exploded_item.discount_id",
)
</pre></div><p>Without using UDFs, implementing this logic would require multiple window functions, which can be hard to read and maintain. UDFs provide a more straightforward and maintainable approach to applying complex business rules in PySpark.</p><h3 id="future-improvements"><strong>Future Improvements</strong></h3><p>We understand that the solutions mentioned above don’t come without a cost:</p><ul><li>We still have 50+ features in the two-stage Spark jobs, which can be really challenging for a single team to maintain and develop in the future. Adding a new product would require changes and testing across the whole Spark job.</li>
<li>We rely heavily on UDFs for calculating revenue periods and contract amounts. These UDFs are expensive to run and can degrade the performance and reliability of the job over time.</li>
</ul><p>We are exploring future improvements to address these challenges:</p><ul><li><strong>Enhanced Data Interfaces and Ownership</strong>: Feature teams will own and manage standardized data interfaces for offline consumption, ensuring consistent data availability for analysis and reporting. These interfaces abstract implementation details, offering flexibility for teams to make changes without disrupting reporting processes.</li>
<li><strong>Simplified Data Models</strong>: Simplifying source data models minimizes the need for custom UDFs, as mappings can be handled with standard PySpark functions.</li>
<li><strong>Unified Implementation</strong>: Standardizing implementation across products and leveraging high-level data interfaces reduces input tables, streamlining Spark feature topology and lowering maintenance overhead.</li>
</ul><h2 id="more-on-revenue-automation-series"><strong>More on Revenue Automation Series</strong></h2><p>In this blog, we talked about how we handled ambiguous requirements in building a data pipeline for automating revenue recognition at Yelp. We also discussed system design decisions we made and technical challenges we addressed.</p><p>In the next article in the series, we will discuss topics related to ensuring data integrity, reconciling data discrepancies, and all the learnings from working with third-party systems. Stay tuned!</p><h2 id="acknowledgement"><strong>Acknowledgement</strong></h2><p>This is a multi-year, cross-organization project that relied on all teams’ tenacity and collaboration to scope, design, implement, test, and take it to the finish line. As of today, we have automated nearly all of Yelp’s revenue using this new system. A huge thank you to everyone in the project team for the great work. Special thanks for the support from Commerce Platform and Financial Systems Leadership and all stakeholder teams involved in this project.</p><div class="island job-posting"><h3>Join Our Team at Yelp</h3><p>We're tackling exciting challenges at Yelp. Interested in joining us? Apply now!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2025/02/revenue-automation-series-building-revenue-data-pipeline.html</link>
      <guid>https://engineeringblog.yelp.com/2025/02/revenue-automation-series-building-revenue-data-pipeline.html</guid>
      <pubDate>Wed, 19 Feb 2025 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Search Query Understanding with LLMs: From Ideation to Production]]></title>
      <description><![CDATA[<p>From the moment a user enters a search query to when we present a list of results, understanding the user’s intent is crucial for meeting their needs. Were they looking for a general category of business for that evening, a particular dish or service, or one specific business nearby? Does the query contain nuanced location or attribute information? Is the query misspelled? Is their phrasing unusual, so that it might not align well with our business data? All of the above questions represent Natural Language Understanding tasks where Large Language Models (LLMs) might well do better than traditional techniques. In this blog post, we detail our development process and elucidate the steps we’ve taken at Yelp to enhance our query understanding using LLMs, from ideation to full-scale rollouts in production.</p><p>Yelp has integrated Large Language Models (LLMs) into a wide array of features, from creating <a href="https://blog.yelp.com/news/winter-product-release-2024/">business summaries</a><sup id="fnref:1"><a href="https://engineeringblog.yelp.com/2025/02/search-query-understanding-with-LLMs.html#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> that highlight what a business is best known for based on first-hand reviews, to <a href="https://blog.yelp.com/news/spring-product-release-2024/">Yelp Assistant</a><sup id="fnref:2"><a href="https://engineeringblog.yelp.com/2025/02/search-query-understanding-with-LLMs.html#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> which intelligently guides consumers through the process of requesting a quote from a service provider with personalized relevant questions. Among these applications, query understanding was the pioneering project and has become the most refined, laying the groundwork for Yelp’s innovative use of LLMs to improve user search experiences. 
In particular, query understanding tasks such as spelling correction, segmentation, canonicalization, and review highlighting all share a few common and advantageous features: (1) they can be cached at the query level, (2) the amount of text to be read and generated is relatively low, and (3) the query distribution follows a power law, meaning a small number of queries are very popular. These features make query understanding a particularly efficient ground for applying LLMs.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-01-31-search-query-understanding-with-LLMs/image0.png" alt="Running examples" /></p><p>In this post, we will discuss our generic approach for leveraging LLMs across these query understanding tasks. To showcase this approach, we use the following two running examples:</p><ul><li><strong>Query Segmentation</strong>: Given a query, we want to segment and label semantic parts of that query. For example, “pet-friendly sf restaurants open now” might have the following segmentation: {topic} pet-friendly {location} sf {topic} restaurants {time} open now. This can be used to further refine the search location when suitable, implicitly rewriting the geographic bounding box (geobox) to match the user’s intent.</li>
<li><strong>Review Highlights</strong>: Given a query, we want a creatively expanded list of phrases to match on – particularly to help us find interesting “review snippets” for each business. Review snippets help the user see how each shown business is relevant to their search query. For example, if a user searched for “dinner before a broadway show,” bolding the phrase “pre-show dinner” in a short review snippet is a very helpful hint for their decision making.</li>
</ul><p>Yelp had older pre-LLM systems for both of these tasks, but they were fragmented (i.e. several different systems stitched together) and often lacked intelligence, leaving room for improvement to provide an exceptional user experience. As we progress, we’ll continue to refer to these examples to highlight our path from conceptualization to full-scale production rollouts.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-01-31-search-query-understanding-with-LLMs/image1.png" alt="Running examples" /><em><strong>[Figure 1] Generic approach for leveraging LLMs across two query understanding tasks.</strong> We formulate and scope the use case for LLMs, build and validate a small proof of concept, and then aggressively scale if the POC indicates a positive impact.</em></p><p>In this step, our initial goals are to: (1) determine if an LLM is the appropriate tool for the problem, (2) define the ideal scope and output format for the task, and (3) assess the feasibility of combining multiple tasks into a single prompt. Here, we also consider the potential for Retrieval Augmented Generation (RAG) to assist in the task, by identifying extra information (besides the query text) that could help the model make better decisions. This typically entails quick prototyping with the most powerful LLM available to us, such as the latest stable GPT-4 model, and creating many iterations of the prompt. At this stage, we also welcome changes to the task’s formulation itself, as we gain a deeper understanding of how the LLM perceives the task.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-01-31-search-query-understanding-with-LLMs/image2.png" alt="Formulation" /></p><h2 id="query-segmentation">Query Segmentation</h2><p>Compared to traditional Named Entity Recognition techniques, LLMs excel at segmentation tasks and are flexible enough to allow for easy customization of the individual classes. 
After several iterations, we settled on six classes for query segmentation: topic, name, location, time, question, and none. This involved a number of small but important decisions:</p><ol><li>Our legacy models had several subclasses all akin to “topic,” but this would have required the LLM to understand intricate details of our internal taxonomy that are both unintuitive and subject to change.</li>
<li>We introduced a new “question” tag for searches that want an answer beyond just “a list of businesses.” For example, the query “magic kingdom upcoming events” might be classified as {name} magic kingdom {question} upcoming events.</li>
<li>We aligned the model outputs with potential downstream applications that can benefit from a more intelligent labeling of these tags, such as implicit location rewrite, improved name intent detection, and more accurate auto-enabled filters.</li>
</ol><p><strong>Few-shot examples within query segmentation prompt:</strong></p><div class="language-plaintext highlighter-rouge highlight"><pre>1) chicago riverwalk hotels
   =&gt; {location} chicago riverwalk {topic} hotels
2) grand chicago riverwalk hotel
   =&gt; {name} grand chicago riverwalk hotel
3) healthy fod near me
   =&gt; {topic} healthy food {location} near me [spell corrected - high]
</pre></div><p>We also took note that spell correction is not only a prerequisite for segmentation, but is also a conceptually related task. Throughout the process we learned that spell-correction and segmentation can be done together by a sufficiently powerful model, so we added a meta tag to mark spell corrected sections and decided to combine these two tasks into a single prompt. On the RAG side, we augment the input query text with the names of businesses that have been viewed for that query. This helps the model learn and distinguish the many facets of business names from common topics, locations, and misspellings. This is highly useful for both segmentation and spell correction (so was another reason for combining the two tasks).</p><p><strong>RAG examples (using businesses names):</strong></p><div class="language-plaintext highlighter-rouge highlight"><pre>1) barber open sunday [Fade Masters, Doug's Barber Shop]
   =&gt; {topic} barber {time} open sunday
2) buon cuon [Banh Cuon Tay Ho, Phuong Nga Banh Cuon]
   =&gt; {topic} banh cuon [spell corrected - high]
</pre></div><h2 id="review-highlights">Review Highlights</h2><p>LLMs also excel at creative tasks by expanding on concepts using their world knowledge. In this task, we used the LLM to generate terms that are suitable to be highlighted, and we agreed on a low bar for inclusion – opting to include any phrase that would be better than showing no snippet at all.</p><p>The hardest part of this task was devising great examples of phrase lists. Using only the words in the query text, we have very limited options as to what to highlight in the review snippet. Not only that, there are also many subtleties within this complex task that make it difficult for traditional text similarity models to solve, such as:</p><ol><li>What does a query mean in the context of Yelp - regarding reservations and pick ups, food near me searches, and/or Yelp guaranteed searches for services.</li>
<li>If a user searches for seafood, it would be too limited to only consider reviews containing the “seafood” term. However, we can highlight adjacent terms such as “fresh fish,” “fresh catch,” “salmon roe,” “shrimp,” etc., which are interesting and sufficiently relevant to the business.</li>
<li>In contrast, we might also need to go up the semantic tree when appropriate, generalizing searches like “vegan burritos” to “vegan,” “vegan options,” and so on.</li>
<li>Generating multi-word or casual phrases like “watch the game,” which are highly relevant to searches like “best bar to watch lions games.”</li>
<li>Being cognizant of whether phrases generated are likely to produce spurious matches in actual reviews, or what to prioritize when a query contains multiple concepts, such as “ayce korean bbq for under $10 near me.”</li>
</ol><p>In essence, the way we define phrase expansions requires critical reasoning to resolve such subtleties, and we taught the LLMs to replicate that thought process through carefully curated examples.</p><p>On the RAG side, we enhanced the input raw query text with the most relevant business categories with respect to that query (from our in-house predictive model). This helps the LLM to generate more relevant phrases for our needs, especially for searches with a non-obvious topic (like the name of a specific restaurant) or ambiguous searches (like pool - swimming vs billiards).</p><p><strong>Evolutions of curated examples within review highlighting prompt:</strong></p><div class="language-plaintext highlighter-rouge highlight"><pre>May 2022
Query: healthy food
-&gt; Key concepts: healthy food, healthy, organic
March 2023
healthy food
-&gt; healthy food, healthy, organic, low calorie, low carb
September 2023
healthy food
-&gt; healthy food, healthy options, healthy | nutritious, organic, low calorie, low carb, low fat, high fiber | fresh, plant-based, superfood
December 2023 (with RAG)
search: healthy food, categories: [healthmarkets, vegan, vegetarian, organicstores]
-&gt; healthy food, healthy options, healthy | nutritious, organic, low calorie, low carb, low fat, high fiber | fresh, plant-based, superfood
</pre></div><p>After formulating the task and defining our input/output formats, our focus shifts to building a proof of concept to demonstrate the effectiveness of the new approach in practice. Up to this point, we iterated on our ideas using the most powerful LLM model available, which typically entails significant latency and cost. However, this setup is not conducive to a real-time system dealing with a vast array of distinct queries.</p><p>To address this challenge, we leverage the fact that distribution of query frequencies can be estimated by the power-law. By caching (pre-computing) high-end LLM responses for only head queries above a certain frequency threshold, we can effectively cover a substantial portion of the traffic and run a quick experiment without incurring significant cost or latency. We then integrated the cached LLM responses to the existing system and performed offline and online (A/B) evaluations.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-01-31-search-query-understanding-with-LLMs/image3.png" alt="Proof of Concept" /></p><h2 id="query-segmentation-1">Query Segmentation</h2><p>To evaluate offline, we observed the impact of the new segmentation on downstream tasks as well as specialized datasets. We compared the accuracy of LLM provided segmentation with the status quo system on human labeled datasets of name match and location intent. Among the different applications of this segmentation signal, we were able to (a) leverage token probabilities for (name) tags to improve our query to business name matching and ranking system and (b) achieve online metric wins with implicit location rewrite using the (location) tags.</p><table><thead><tr><th class="c1"><strong>Original Query Text</strong></th>
<th class="c1"><strong>Original Location</strong></th>
<th class="c1"><strong>Rewritten Location (only used by search backend)</strong></th>
</tr></thead><tbody><tr><td class="c2">Restaurants near Chase Center</td>
<td class="c2">San Francisco, CA</td>
<td class="c2">1 Warriors Way, San Francisco, CA 94158</td>
</tr><tr><td class="c2">Ramen Upper West Side</td>
<td class="c2">New York, NY</td>
<td class="c2">Upper West Side, Manhattan, NY</td>
</tr><tr><td class="c2">Epcot restaurants</td>
<td class="c2">Orlando, FL</td>
<td class="c2">Epcot, Bay Lake, FL</td>
</tr></tbody></table><h4 id="status-quo">Status Quo</h4><p><img src="https://engineeringblog.yelp.com/images/posts/2025-01-31-search-query-understanding-with-LLMs/image4.png" alt="Status Quo" /></p><h4 id="rewritten">Rewritten</h4><p><img src="https://engineeringblog.yelp.com/images/posts/2025-01-31-search-query-understanding-with-LLMs/image5a.png" alt="Rewritten" /><em><strong>[Figure 2] Better understanding of location intent lets us return more relevant results to the users.</strong> One of our POCs leverages query segmentation to implicitly rewrite text within location boxes to a refined location within 30 miles of the user’s search if we have high confidence in the location intent. For example, the segmentation “epcot restaurants =&gt; {location} epcot {topic} restaurant” helps us to understand the user’s intent in finding businesses within the Epcot theme park at Walt Disney World. By implicitly rewriting the location text from “Orlando, FL” to “Epcot” in the search backend, our geolocation system was able to narrow down the search geobox to the relevant latlong.</em></p><h2 id="review-highlights-1">Review Highlights</h2><p>Offline evaluation of the quality of generated phrases is subjective and requires very strong human annotators with good product, qualitative, and engineering understanding. We cross-checked the opinions of human annotators and also applied quantitative checks to the snippets, such as measuring how common the generated phrases are. After a thorough review, we carried out online A/B experiments using the new highlight phrases.</p>
By highlighting the relevant phrase to the user’s query, we increased Session / Search CTR across our platforms. Further iteration from GPT3 to GPT4 also improved Search CTR on top of previous gains. The impact was also higher for less common queries in the tail query range. A large portion of the wins came from incremental quality improvement as we addressed all of the nuances listed above for the task.</em></p><p>If an online experiment for the proof of concept indicates a meaningful positive impact, it’s time to improve the model, and also expand its utilization to a larger volume of queries. However, scaling to millions of queries (or to a real-time model, in order to support never-before-seen queries) poses cost and infra challenges. For example, we’re building out new signal datastores to support larger pre-computed signals. And though we’d love for next-gen technology to work on all searches, the investment can get harder to justify. Particularly, given the distribution of queries, scaling up to millions of queries may require a disproportionately high investment to achieve only a marginal increase in traffic coverage.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-01-31-search-query-understanding-with-LLMs/image7.png" alt="Scaling Up" /></p><p>Furthermore, as queries get further into the long tail, understanding user intent also becomes more challenging. So to scale up effectively, we need a more precise model that is also more cost-effective. At the moment, we’ve landed on a multi-step process for scaling from the prototype stage to a model that serves 100% of traffic:</p><ol><li>
<p><strong>Iterate on the prompt using the “expensive” model (GPT-4/o1).</strong> This is mainly testing the prompt against real or contrived example inputs, looking for errors that could be teachable moments, and then augmenting the examples in the prompt. One approach we used to narrow the search for problematic responses was tracking query-level metrics to find queries with nontrivial traffic whose metrics were clearly worse than the status quo.</p>
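<p>As a rough illustration of this metric-based triage (a minimal sketch with hypothetical field names and thresholds, not our actual pipeline):</p>

```python
# Sketch: flag queries whose metric clearly regressed versus the status quo,
# ignoring low-traffic noise. Field names and thresholds are hypothetical.

def flag_problem_queries(rows, min_traffic=1000, max_rel_drop=-0.05):
    """rows: iterable of (query, traffic, metric_new, metric_baseline)."""
    flagged = []
    for query, traffic, new, base in rows:
        if traffic < min_traffic or base == 0:
            continue  # too little traffic to trust the comparison
        rel_change = (new - base) / base
        if rel_change < max_rel_drop:  # clearly worse than status quo
            flagged.append((query, rel_change))
    return sorted(flagged, key=lambda x: x[1])  # worst regressions first

examples = [
    ("plumber near me", 50000, 0.30, 0.32),    # regressed -> flagged
    ("healthy food", 80000, 0.33, 0.32),       # improved -> ignored
    ("rare misspelled qery", 12, 0.10, 0.40),  # low traffic -> ignored
]
```

<p>Each flagged query is then a candidate “teachable moment” to turn into a prompt example.</p>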
</li>
<li>
<p><strong>Create a golden dataset for fine tuning smaller models.</strong> We ran the GPT-4 prompt on a representative sample of input queries. The sample size should be large (but not unmanageably so, since quality &gt; quantity) and it should cover a diverse distribution of inputs. For newer and more complex tasks that require logical reasoning, we have begun using o1-mini and o1-preview in some use cases, depending on the difficulty of the task.</p>
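<p>One simple way to draw such a representative sample is to stratify by query frequency so the head cannot crowd out the tail. The log-frequency bucketing below is a hypothetical illustration, not our exact scheme:</p>

```python
import math
import random

# Sketch: stratified sampling of queries for a fine-tuning "golden" dataset.
# Bucketing by order of magnitude of frequency is an illustrative assumption.

def stratified_sample(query_counts, per_bucket=2, seed=0):
    """query_counts: dict of query -> observed frequency."""
    rng = random.Random(seed)
    buckets = {}
    for q, c in query_counts.items():
        buckets.setdefault(int(math.log10(max(c, 1))), []).append(q)
    sample = []
    for _, qs in sorted(buckets.items(), reverse=True):
        rng.shuffle(qs)
        sample.extend(qs[:per_bucket])  # cap per bucket so the head can't dominate
    return sample
```

<p>Capping each frequency bucket keeps the sample diverse while staying small enough that quality can beat quantity.</p>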
</li>
<li>
<p><strong>Improve the quality of the dataset if possible, prior to using it for fine tuning.</strong> With hard work here, it can be possible (for many tasks) to improve upon GPT-4’s raw output. Try to isolate sets of inputs that are likely to have been mislabeled and target these for human re-labeling or removal.</p>
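<p>The post does not spell out the exact filtering method, but one common heuristic for isolating likely mislabels is to compare labels from two independent runs (or two prompt variants) and route disagreements to human review. A minimal sketch:</p>

```python
# Sketch: agreement-based filtering of a machine-labeled dataset.
# Examples where two labeling passes agree are kept; disagreements are
# sent for human re-labeling or removal. Purely illustrative.

def split_by_agreement(labels_a, labels_b):
    agreed, review = {}, []
    for query, label in labels_a.items():
        if labels_b.get(query) == label:
            agreed[query] = label   # keep for fine-tuning
        else:
            review.append(query)    # re-label by a human, or drop
    return agreed, review
```
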
</li>
<li>
<p><strong>Fine tune a smaller model (GPT4o-mini)</strong> that we can run offline at the scale of tens of millions, and utilize this as a pre-computed cache to support the vast bulk of all traffic. Because fine-tuned query understanding models only require very short inputs and outputs, we have seen up to a 100x savings in cost, compared to using a complex GPT-4 prompt directly.</p>
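<p>The economics of this cache follow from the power-law query distribution mentioned earlier. Under a Zipf-like model (the exponent below is illustrative, not a measured value), pre-computing a modest fraction of distinct queries covers most traffic:</p>

```python
# Sketch: fraction of traffic covered by caching the most frequent queries,
# assuming query frequency ~ 1 / rank**s (a Zipf-like power law).

def zipf_coverage(n_cached, n_total, s=1.0):
    """Coverage from caching the n_cached most frequent of n_total queries."""
    weights = [1.0 / rank**s for rank in range(1, n_total + 1)]
    return sum(weights[:n_cached]) / sum(weights)
```

<p>For example, with an exponent of 1, caching the top 10% of 100 distinct queries already covers more than half of the simulated traffic; at larger scales the head’s share is similarly outsized, which is what makes the pre-computed cache so cost-effective.</p>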
</li>
<li>
<p><strong>Optionally, fine tune an even smaller model</strong> that is less expensive and fast (to run in real-time only for long-tail queries). Specifically, at Yelp, we have used BERT and T5 models for real-time serving. These models are optimized for speed and efficiency, allowing us to process user queries rapidly and accurately during the complete rollout phase. As the cost and latency of LLMs improve, as seen with GPT4o-mini and smaller prompts, real-time calls to OpenAI’s fine-tuned models may also become achievable in the near future.</p>
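<p>At serving time, the two tiers can be combined with a cache-first lookup. The sketch below uses hypothetical names; the cache stands in for the pre-computed signal store and the callable stands in for the small real-time model:</p>

```python
# Sketch: cache-first retrieval of a query understanding signal.
# Head/torso queries hit the pre-computed cache; long-tail queries fall
# back to a small real-time model. Names are hypothetical.

def get_query_signal(query, cache, realtime_model, default=None):
    key = query.strip().lower()    # normalize before lookup
    if key in cache:               # pre-computed offline by the larger model
        return cache[key]
    if realtime_model is not None: # small, fast model for unseen queries
        return realtime_model(key)
    return default
```
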
</li>
</ol><h2 id="review-highlights-2">Review Highlights</h2><p>After fine-tuning our model and validating the responses on a diverse and random test set, we scaled to 95% of traffic by pre-computing snippet expansions for those queries using OpenAI’s batch calls. The generated outputs were quality checked and uploaded to our query understanding datastores. Cache-based systems such as key/value DBs were used to improve retrieval latency due to the power law distribution of search queries. With this pre-computed signal, we further leveraged the “common sense” knowledge embedded in these expansions in other downstream tasks. For instance, we used CTR signals for the relevant expanded phrases to further refine our ranking models, and additionally used the phrases (averaged over business categories) as heuristics to get highlight phrases for the remaining 5% of traffic not covered by our pre-computations.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2025-01-31-search-query-understanding-with-LLMs/image8.png" alt="Review Highlights 2" /><em><strong>[Figure 4] Example of a tail query in our review highlights system.</strong> For the query “dinner before a broadway show,” the model outputs a creative list of phrases that can be used to match relevant and interesting phrases within real reviews written by real users. This not only enhances user trust by aligning with their search intentions, but also enables quick decision-making by allowing users to easily assess the experiences of others and determine the business that can meet their needs.</em></p><p>The deeper integration of LLMs into our search systems holds great potential for transforming user search experiences. As the landscape of LLMs evolves, we continue to adapt to those new capabilities, which can unlock new ways to use our content. 
For some search tasks that require complex logical reasoning, we’re starting to see large benefits in the quality of outputs generated by the latest reasoning models compared to previous generative models. As we aim to develop even more advanced use cases, considering the trade-offs, we will continue to follow a multi-step validation and gradual scaling strategy. By staying agile and responsive to these new advancements, we can better showcase and highlight the best and most authentic content from our data, enhancing the overall user experience within the app.</p><p>LLMs hold immense potential for transforming user search experiences. To realize these possibilities, a strategic approach involving ideation, proof of concept testing, and full-scale production rollout is essential. This requires continuous iteration and adaptability to new advances in foundation models, as some query understanding tasks may require more complex logical reasoning while others require a deeper knowledge base. Thus far, Yelp has successfully leveraged LLMs and our depth of content to improve the user experience and bring greater value to our business. 
We remain committed to staying at the forefront of LLM advancements and rapidly adapting these innovations to our use cases.</p><p>For more information on this topic, check out our more detailed talks at Haystack this year <sup id="fnref:3"><a href="https://engineeringblog.yelp.com/2025/02/search-query-understanding-with-LLMs.html#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup><sup id="fnref:4"><a href="https://engineeringblog.yelp.com/2025/02/search-query-understanding-with-LLMs.html#fn:4" class="footnote" rel="footnote" role="doc-noteref">4</a></sup>.</p><h2 id="acknowledgement">Acknowledgement</h2><p>The authors would like to acknowledge the Search Quality team for their exceptional contributions to this initiative, especially Cem Aksoy, Akshat Gupta, Alexander Levin, Brian Johnson, Arthur Cruz de Araujo, and Ashwani Braj. This blog reflects the collaborative effort and technical expertise that each member has brought to the table. Your dedication and innovative approach have been crucial in advancing our engineering goals. Thank you!</p><h2 id="footnotes"><strong>Footnotes</strong></h2><div class="footnotes" role="doc-endnotes"><ol><li id="fn:1">
<p><a href="https://blog.yelp.com/news/winter-product-release-2024/">Yelp releases a series of new discovery, contribution, services and AI-powered features</a> <a href="https://engineeringblog.yelp.com/2025/02/search-query-understanding-with-LLMs.html#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2">
<p><a href="https://blog.yelp.com/news/spring-product-release-2024/">Yelp launches new AI assistant to help consumers easily find and connect with service professionals</a> <a href="https://engineeringblog.yelp.com/2025/02/search-query-understanding-with-LLMs.html#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3">
<p><a href="https://haystackconf.com/eu2024/talk-11/">Relevance Proof in Yelp Search: LLM-Powered Annotations</a> <a href="https://engineeringblog.yelp.com/2025/02/search-query-understanding-with-LLMs.html#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4">
<p><a href="https://haystackconf.com/us2024/talk-12/">Search Query Understanding with LLMs: From Ideation to Production.</a> <a href="https://engineeringblog.yelp.com/2025/02/search-query-understanding-with-LLMs.html#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol></div><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp. If you're interested, apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2025/02/search-query-understanding-with-LLMs.html</link>
      <guid>https://engineeringblog.yelp.com/2025/02/search-query-understanding-with-LLMs.html</guid>
      <pubDate>Tue, 04 Feb 2025 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Enhancing Neural Network Training at Yelp: Achieving 1,400x Speedup with WideAndDeep]]></title>
<description><![CDATA[<p>At Yelp, we encountered challenges that prompted us to enhance the training time of our ad-revenue generating models, which use a <a href="https://arxiv.org/abs/1606.07792">Wide and Deep</a> Neural Network architecture for predicting ad click-through rates (pCTR). These models handle large tabular datasets with small parameter spaces, requiring innovative data solutions. This blog post delves into our journey of optimizing training time using <a href="https://www.tensorflow.org/">TensorFlow</a> and <a href="https://horovod.ai/">Horovod</a>, along with the development of ArrowStreamServer, our in-house library for low-latency data streaming and serving. Together, these components have allowed us to achieve a 1,400x speedup in training for business-critical models compared to using a single GPU with <a href="https://github.com/uber/petastorm">Petastorm</a>.</p><p>At Yelp, we encountered several challenges in optimizing the training time of our machine learning models, particularly the pCTR model. Our primary tool for machine learning tasks is Spark, and most of our datasets are stored in S3 in <a href="https://parquet.apache.org/">Parquet</a> format. Initially, training on 450 million tabular samples took 75 hours per epoch on a p3-2xlarge instance. Our goal was to scale this efficiency to 2 billion tabular samples while also achieving a per epoch training time of less than 1 hour. This required innovative solutions to address several key challenges:</p><ul><li><strong>Data Storage</strong>: Efficiently manage large<sup id="fnref:1"><a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> distributed tabular datasets stored in Parquet format on S3 to ensure compatibility with our Spark ecosystem. 
Petastorm was inefficient for our needs because of the tabular nature of our data, leading to the development of ArrowStreamServer, which improved streaming performance.</li>
<li><strong>Distributed Training</strong>: Efficiently scale across multiple GPUs by transitioning from TensorFlow’s MirroredStrategy to Horovod.</li>
</ul><h2 id="data-storage"><strong>Data Storage</strong></h2><p>We have opted to store materialized training data in a Parquet dataset on S3 for the following reasons:</p><ul><li><strong>Compatibility</strong>: It integrates seamlessly with Yelp’s Spark ecosystem.</li>
<li><strong>Efficiency</strong>: Parquet is highly compressed and optimized for I/O operations.</li>
<li><strong>Scalability</strong>: S3 offers virtually unlimited storage capacity.</li>
<li><strong>Performance</strong>: S3 can support very high throughput<sup id="fnref:2"><a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> when accessed in parallel.</li>
</ul><p>After materializing the training data in S3, it needs to be read and converted into a TensorFlow Dataset. Our first approach was to use Petastorm, which is very easy to use and integrates well with Spark. However, we found that Petastorm did not fit our use case as it was much slower when using the rebatch approach to achieve the desired batch size. This inefficiency is due to Yelp’s focus on tabular datasets with hundreds of features and hundreds of thousands of rows per row group<sup id="fnref:3"><a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup>, where unbatching causes tensor explosions and creates millions of tensors.</p><p>To address this challenge, we explored alternative solutions and discovered that <a href="https://github.com/tensorflow/io">Tensorflow I/O</a> has an implementation called ArrowStreamDataset. By using TensorFlow Dataset directly, we can avoid relying on a Python generator, which is known for <a href="https://www.tensorflow.org/guide/data#consuming_python_generators">limited portability and scalability</a> issues. Upon testing, we found that this approach performed significantly better for our use case.</p><table><thead><tr><th class="c1">Method</th>
<th class="c1">Time</th>
</tr></thead><tbody><tr><td class="c2">boto3 read raw</td>
<td class="c2">18.5 s</td>
</tr><tr><td class="c2">Petastorm(a batch per row group)</td>
<td class="c2">76.4 s</td>
</tr><tr><td class="c2">Petastorm(rebatch to 4096)</td>
<td class="c2">815 s</td>
</tr><tr><td class="c2">ArrowStream(batch to 4096)</td>
<td class="c2">19.2 s</td>
</tr><tr><td class="c2"><em>Time to convert 9 million samples</em></td>
<td class="c2"> </td>
</tr></tbody></table><p>Unfortunately, ArrowStreamDataset only supports datasets stored on local disk in <a href="https://arrow.apache.org/docs/python/feather.html">Feather</a> format, which is incompatible with our ecosystem and inhibits scaling to our necessary dataset sizes. To work with ArrowStreamDataset, and to more effectively stream data from S3, we implemented an ArrowStreamServer with PyArrow. ArrowStreamServer reads and batches datasets, serving them as RecordBatch streams over a socket. Each ArrowStreamServer runs in a separate process for better parallelism.</p><p>On the consumer side, we utilized ArrowStreamDataset to read RecordBatches from endpoints, enabling efficient batching and interleaving of datasets.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-22-enhancing-neural-network-training-at-yelp/image3.png" alt="System diagram of ArrowStream Dataset and Server" /><em>System diagram of ArrowStream Dataset and Server</em></p><h2 id="distributed-training"><strong>Distributed Training</strong></h2><p>As training data grew from hundreds of millions to billions of samples, we adopted distributed training across multiple GPUs<sup id="fnref:4"><a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fn:4" class="footnote" rel="footnote" role="doc-noteref">4</a></sup> to improve training time. Ultimately, we chose Horovod as our default distribution method.</p><p>Initially, we used TensorFlow’s built-in MirroredStrategy for distributed training. We got an almost linear speedup on 4 GPUs. However, as we scaled from 4 GPUs to 8 GPUs, we discovered that MirroredStrategy was not optimal, resulting in low GPU, CPU, and I/O utilization. 
The bottleneck was identified in the Keras data handler, which struggled with sharding the dataset across devices.</p><p>After we observed bottlenecks with TensorFlow’s built-in strategies, we decided to try Horovod’s more sophisticated distributed training capabilities. In our tests, we found Horovod provided linear performance scaling up to 8 GPUs compared to using a single GPU. We believe the reasons for Horovod’s superior performance are as follows:</p><ul><li><strong>Efficient Process Management</strong>: Horovod uses one process per device, which optimizes resource utilization and avoids contention on the Python <a href="https://wiki.python.org/moin/GlobalInterpreterLock">GIL</a>.</li>
<li><strong>Gradient Conversion</strong>: By converting sparse gradients to dense gradients, Horovod significantly improves memory efficiency and speed during all-reduce operations. We recommend converting sparse gradients to dense gradients for models with a large batch size and numerous categorical features<sup id="fnref:5"><a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fn:5" class="footnote" rel="footnote" role="doc-noteref">5</a></sup>.</li>
</ul><p>Switching to Horovod presented several challenges, primarily due to the use of Keras’s premade WideDeepModel and the complexities of resource management on a multi-GPU machine.</p><p>To support Keras, Horovod implemented DistributedOptimizer to wrap and override the <em>_compute_gradients</em> method, but Keras’s premade WideDeepModel calls <em>GradientTape</em> directly instead of calling <em>minimize</em>, in order to support its two-optimizer setup. To address this issue, we had to override the <em>train_step</em> method of the WideDeepModel.</p><p>We also encountered a “threading storm” and out-of-memory issues as thousands of threads were created. To prevent oversubscription of cores and memory, we devised a way to downsize the thread pools used by these libraries from all available resources to only the resources available per GPU.</p><table><thead><tr><th class="c1">setting</th>
<th class="c1">value</th>
</tr></thead><tbody><tr><td class="c2">tf.data.Options.threading.private_threadpool_size</td>
<td class="c2">number of cpu cores per GPU</td>
</tr><tr><td class="c2">tf.data.Options.autotune.ram_budget</td>
<td class="c2">available host memory per GPU</td>
</tr><tr><td class="c2">OMP_NUM_THREADS</td>
<td class="c2">number of cpu cores per GPU</td>
</tr><tr><td class="c2">TF_NUM_INTEROP_THREADS</td>
<td class="c2">1</td>
</tr><tr><td class="c2">TF_NUM_INTRAOP_THREADS</td>
<td class="c2">number of cpu cores per GPU</td>
</tr><tr><td class="c2"><em>Settings for splitting resources based on GPU</em></td>
<td class="c2"> </td>
</tr></tbody></table><h2 id="summarize"><strong>Summary</strong></h2><p>To put all of this together and incorporate it into Yelp’s existing Spark ML ecosystem, we designed a KerasEstimator to act as the Spark Estimator in our ML pipeline. As shown in the following figure, we first materialize transformed features into S3 and serve them with ArrowStreamServer on the training Spark Executors, then we make use of ArrowStreamDataset to stream training data from the ArrowStreamServer. TFMirrorRunner<sup id="fnref:6"><a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fn:6" class="footnote" rel="footnote" role="doc-noteref">6</a></sup> and HorovodRunner<sup id="fnref:7"><a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fn:7" class="footnote" rel="footnote" role="doc-noteref">7</a></sup>, as two concrete implementations of the SparkRunner, are used to set up and train the Keras model in Spark executors.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-22-enhancing-neural-network-training-at-yelp/image2.png" alt="Overview of KerasEstimator fits into Spark Pipeline" /><em>Overview of how the KerasEstimator fits into the Spark pipeline</em></p><h2 id="results-and-benefits"><strong>Results and Benefits</strong></h2><h3 id="performance-improvements"><strong>Performance Improvements</strong></h3><p>Our benchmarks with the pCTR model on a dataset of 2 billion samples demonstrated substantial improvements. By optimizing data storage with ArrowStream, we achieved an 85.8x speedup from a starting point of 75 hours with 450 million samples. Additionally, implementing distributed training provided a further 16.9x speedup, resulting in a total speedup of approximately 1,400x. These results underscore the effectiveness of our approach in optimizing both speed and cost. 
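The per-GPU resource split from the settings table in the previous section can be sketched as follows (a minimal illustration with hypothetical host sizes; real values would come from the training machine):

```python
import os

# Sketch of the per-GPU resource split from the settings table.
# The helper name and the host sizes below are illustrative.
def per_gpu_settings(cpu_cores: int, host_memory_bytes: int, num_gpus: int) -> dict:
    cores_per_gpu = cpu_cores // num_gpus
    return {
        # tf.data.Options.threading.private_threadpool_size
        "private_threadpool_size": cores_per_gpu,
        # tf.data.Options.autotune.ram_budget
        "ram_budget": host_memory_bytes // num_gpus,
        "OMP_NUM_THREADS": cores_per_gpu,
        "TF_NUM_INTEROP_THREADS": 1,
        "TF_NUM_INTRAOP_THREADS": cores_per_gpu,
    }

# Example: a 32-core host with 244 GiB of RAM and 4 GPUs.
settings = per_gpu_settings(32, 244 * 2**30, 4)
for var in ("OMP_NUM_THREADS", "TF_NUM_INTEROP_THREADS", "TF_NUM_INTRAOP_THREADS"):
    # Each per-GPU worker process exports its own copy of these variables.
    os.environ[var] = str(settings[var])
```

The key idea is that every worker process sees only its share of the host, so no library can spawn threads or reserve memory against the full machine.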
<img src="https://engineeringblog.yelp.com/images/posts/2024-11-22-enhancing-neural-network-training-at-yelp/image1.png" alt="" /></p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-22-enhancing-neural-network-training-at-yelp/image5.png" alt="" /></p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-22-enhancing-neural-network-training-at-yelp/image6.png" alt="" /><em>note: all tests are done using AWS P3 instances except A10G, which is a G5dn instance.</em></p><h2 id="conclusion"><strong>Conclusion</strong></h2><p>Our transition to using TensorFlow with Horovod has significantly accelerated our machine learning training processes, reducing costs and improving developer velocity. This approach not only addresses the challenges we faced but also sets a foundation for future scalability and efficiency improvements. IO is often the bottleneck for Neural Network training, especially for tabular datasets and relatively simple models. Improving IO equates to improving training performance. For training small to medium-sized models, AWS G series instances often outperform P series instances especially if we take cost into account.</p><h2 id="acknowledgements"><strong>Acknowledgements</strong></h2><p>Thanks Nathan Sponberg for implementing the KerasEstimator.</p><h2 id="footnotes"><strong>Footnotes</strong></h2><div class="footnotes" role="doc-endnotes"><ol><li id="fn:1">
<p>Up to 10s of terabytes. <a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2">
<p>AWS S3 can provide up to 100Gb/s to a single EC2 instance. <a href="https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/introduction.html">Introduction - Best Practices Design Patterns: Optimizing Amazon S3 Performance</a> <a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3">
<p>Parquet row group is the default batch size in Petastorm <a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4">
<p>We are using p3.8xlarge and p3.16xlarge <a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5">
<p>In our case we have hundreds of categorical features and a batch size of 32K-256K. <a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6">
<p>Wrapper for Tensorflow MirrorStrategy on Spark <a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7">
<p>Wrapper for Horovod on Spark <a href="https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol></div><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp. If you're interested, apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html</link>
      <guid>https://engineeringblog.yelp.com/2025/01/enhancing-neural-network-training-at-yelp.html</guid>
      <pubDate>Wed, 22 Jan 2025 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Revisiting Compute Scaling]]></title>
      <description><![CDATA[<p>As mentioned in our earlier blog post <a href="https://engineeringblog.yelp.com/2024/05/fine-tuning-AWS-ASGs-with-attribute-based-instance-selection.html">Fine-tuning AWS ASGs with Attribute Based Instance Selection</a>, we recently embarked on an exciting journey to enhance our Kubernetes cluster’s node autoscaler infrastructure. In this blog post, we’ll delve into the rationale behind transitioning from our internally developed Clusterman autoscaler to AWS Karpenter. Join us as we explore the reasons for our switch, address the challenges with Clusterman, and embrace the opportunities with Karpenter.</p><p>At Yelp, we used <a href="https://engineeringblog.yelp.com/2019/11/open-source-clusterman.html">Clusterman</a> to handle autoscaling of nodes in Kubernetes clusters. It is an open-source tool we initially designed for Mesos clusters and later adapted for Kubernetes. Instead of managing entire clusters, Clusterman focuses on the management of pools, wherein each pool comprises groups of nodes backed by <a href="https://docs.aws.amazon.com/autoscaling/ec2/userguide/auto-scaling-groups.html">AWS Auto Scaling Groups</a> (ASGs). These pools are governed by a config called the setpoint, representing the desired reservation ratio between “Requested Resources by Workloads” and “Total Allocatable Resources.” Clusterman actively maintains this reservation ratio by adjusting the desired capacity of ASGs. It has also got some nifty features for safely <a href="https://engineeringblog.yelp.com/2023/01/recycling-kubernetes-nodes.html">recycling nodes</a>, using custom signals to scale clusters, and a simulator to experiment with different scaling parameters.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-12-06-revisiting-compute-scaling/pool-asgs.jpg" alt="An example Clusterman pool" /></p><p>Despite its capabilities, Clusterman had its challenges. 
If your clusters are frequently scaled up or down, finding the perfect setpoint can be tricky. A lower setpoint ensures cluster stability but at a higher cost. Conversely, a higher setpoint can make the desired resource ratio difficult to maintain. When Clusterman attempted to maintain the setpoint, it would sometimes delete nodes (and the pods on those nodes) to increase the ratio, causing the deleted pods to become unschedulable due to insufficient resources in the pool. To address these unschedulable pods, Clusterman would launch new instances. However, this process could lead to an endless cycle of scaling up and down as Clusterman continuously tried to balance the cluster, resulting in inefficiencies and making it difficult to achieve a stable and cost-effective scaling strategy.</p><p>Another hitch was workload requirements. Clusterman relies on ASGs and doesn’t consider the specific needs of your pending pods. If those pods demand certain resources like ‘R category EC2s,’ whether the ASG will actually provide them is often a matter of luck. Consequently, when dealing with such specific requirements, you’d often find yourself creating new pools and specifying the instance types in the ASGs, which also meant managing another Clusterman instance for each pool. This way of managing pools and their requirements eventually increases the operational cost of running Kubernetes clusters.</p><p>Furthermore, Clusterman had speed issues. Due to its interval-based logic that iterated every ‘X’ minutes, it sometimes struggled to keep up with rapidly changing workload demands, making it less than ideal for dynamic clusters. 
Lastly, customizing recycling criteria meant tinkering with the code, making it less flexible for unique requirements like recycling <a href="https://aws.amazon.com/ec2/instance-types/g5/">G5 family instances</a>.</p><h2 id="requirements">Requirements</h2><p>As the challenges with Clusterman became increasingly apparent, we embarked on a mission to gather input from various teams that run workloads on Kubernetes. We discovered numerous specific requirements that highlighted the burden of managing Clusterman. These requirements included:</p><ul><li>The need for an autoscaler capable of identifying the availability zone of stateful pending pods’ volumes and launching new instances in the corresponding zone.</li>
<li>Diverse machine learning workloads, each with different GPU requirements.</li>
<li>The importance of accommodating workload constraints, such as topology spread constraints and affinities. The overarching theme among the teams’ requirements: ‘Find the right instances for my dynamic workload requirements’.</li>
</ul><p>In addition to these, we had our own set of demands:</p><ul><li>The ability for the autoscaler to react to pending pods in a matter of seconds.</li>
<li>The guarantee that autoscaling remains cost-efficient.</li>
</ul><h2 id="alternative-options">Alternative options</h2><p>In our quest for an alternative to Clusterman, we explored two primary options: <a href="https://github.com/kubernetes/autoscaler">Kubernetes Cluster Autoscaler</a> and <a href="https://karpenter.sh/">Karpenter</a>.</p><p>We opted against Kubernetes Cluster Autoscaler because we would have faced problems similar to those we experienced with Clusterman. Notably, it organizes nodes into groups where all nodes must be identical, posing a challenge for our diverse workloads with varying requirements.</p><p>Karpenter is an open-source node lifecycle management project built for Kubernetes, developed by AWS. Karpenter not only launches the right boxes for our requirements but also brings some features to the table:</p><ul><li><strong>Better Bin Packing:</strong> Clusterman relied on ASGs to decide instance types and sizes for workloads, which resulted in higher cost and lower resource utilization. Karpenter, on the other hand, takes a far better approach: it batches pending pods and considers their resource requirements before launching instances.</li>
<li><strong>Pool Based Segregation:</strong> Karpenter’s nodepools preserve the same pool-based segregation we already had, giving us a better migration path (as other Yelp infrastructure relies on the pool-based approach).</li>
<li><strong>Customizable TTL:</strong> Karpenter empowers you to specify a Time-to-Live (TTL) for your nodes, enabling graceful node recycling after a designated time (e.g., ‘Please recycle my nodes after 10 days’) while respecting Pod Disruption Budgets (PDBs).</li>
<li><strong>Cost Optimization:</strong> Karpenter can be your partner in cost efficiency. It offers features like:
<ul><li>Automatically deleting a node if all of its pods can run on available capacity of other nodes in the cluster.</li>
<li>Replacing on-demand instances with more cost-effective options when available.</li>
<li>Terminating a couple of instances to replace them with larger, more cost-efficient ones.</li>
</ul></li>
<li><strong>Enhanced Scheduling:</strong> Karpenter enriches nodes with useful labels, contributing to smarter workload scheduling. Now workloads don’t need to create a new pool for their specific EC2 requirements. They can just add requirements (EC2 category, family, generation etc. visit <a href="https://karpenter.sh/docs/concepts/scheduling/#well-known-labels">Well-Known Labels</a> for the list) to node selector/affinity.</li>
<li><strong>Fall-back mechanism for spot market:</strong> With Clusterman, managing periods of high spot market demand (e.g., Black Friday week) was challenging, requiring us to manually adjust our ASG specifications. But Karpenter can run spot and on-demand instances in the same pool. It will launch on-demand instances if there is no spot capacity in the region/availability zone. It helped us handle Black Friday week.</li>
<li><strong>Insightful Metrics:</strong> Karpenter provides a suite of useful metrics that give a deeper understanding of autoscaling state and compute costs; previously, understanding the cost impact of a change to our autoscaling specifications was difficult without real-time cost metrics.</li>
</ul><p>In choosing Karpenter, we found a solution that not only meets our fundamental requirements but also equips us with the tools and features to navigate the ever-evolving landscape of infrastructure management.</p><p>As we began our transition from Clusterman to Karpenter, a key consideration was Clusterman’s reliance on <a href="https://aws.amazon.com/blogs/aws/new-attribute-based-instance-type-selection-for-ec2-auto-scaling-and-ec2-fleet/">ASGs (with Attribute-Based Instance Type Selection)</a>, where you can specify a set of instance attributes that describe your compute requirements instead of manually choosing instance types. Our ASGs were attribute-based, making it relatively straightforward to convert ASG requirements to Karpenter nodepools. However, it’s essential to note that nodepools in Karpenter don’t mirror all the attributes present in ASGs. For instance, we couldn’t directly match attributes like CPU manufacturers.</p><p>Early in the migration process, we explored the possibility of transferring ownership of nodes from ASGs to Karpenter to ensure a seamless transition. Unfortunately, we discovered that this approach <a href="https://github.com/aws/karpenter-provider-aws/issues/4176">wasn’t feasible</a> with Karpenter.</p><p>Instead, we decided to gradually replace nodes by scaling down ASG capacity, which allowed Clusterman to delete nodes at a slower pace. While node deletions led to unschedulable pods in the pool, Karpenter efficiently detected them and quickly provisioned the necessary capacity, ensuring the pods were smoothly scheduled on the newly provisioned nodes.</p><p>Overall, the transition was smoother than anticipated thanks to some strategic decisions we had made previously. 
We had proactively added <a href="https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-budgets">Pod Disruption Budgets</a> (PDBs) for all our workloads to protect workloads from voluntary disruptions. These PDBs played a pivotal role in mitigating any potential issues for our workloads during the migration.</p><p>To closely monitor and track the migration process, we developed a comprehensive dashboard. This dashboard included various charts and metrics, such as:</p><ul><li>ASGs’ capacity compared to Karpenter’s capacity.</li>
<li>Hourly resource cost to keep a close eye on spending.</li>
<li>Spot interruption rate for monitoring instance stability.</li>
<li>Autoscaler spending efficiency to ensure cost-effectiveness.</li>
<li>Scaling up and down events.</li>
<li>Insights into unschedulable pods and pod scheduling times.</li>
<li>Workload error rates.</li>
</ul><h2 id="allocation-strategies-for-spot-instances">Allocation strategies for Spot Instances</h2><p>One valuable lesson we learned during the migration process revolved around allocation strategies for Spot Instances. In our previous setup, we had a few AWS Auto Scaling Groups (ASGs) configured with the lowest-price allocation strategy. However, AWS Karpenter utilizes a price-capacity-optimized allocation strategy, and what’s more, it’s not configurable.</p><p>At the outset of our migration journey, this seemingly rigid strategy prompted some concerns. We worried that it might lead to increased, less predictable costs. However, as our experience with Karpenter unfolded, we were pleasantly surprised. Its capabilities ensured not only a significant reduction in spot interruptions but also cost-effectiveness.</p><h2 id="keeping-free-room-for-critical-services-hpas">Keeping free room for critical services’ HPAs</h2><p>One key lesson we learned during our Clusterman journey was the need to keep some free resources available for our critical workloads. These workloads could require more replicas in a short amount of time, especially during sudden spikes in demand. With Clusterman, we could easily solve this by lowering the setpoint. However, AWS Karpenter, while highly efficient, didn’t provide a built-in feature to reserve free capacity explicitly.</p><p>To address this challenge, we decided to run dummy pods with a specific ‘PriorityClass.’ This PriorityClass allows the Kubernetes scheduler to preempt those dummy pods if there are any unschedulable pods, creating space for our critical services’ Horizontal Pod Autoscalers (HPAs). 
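A minimal sketch of this headroom trick, assuming a pause-container Deployment and a negative-priority PriorityClass (all names, replica counts, and resource sizes here are hypothetical, not our production manifests):

```python
# Hypothetical sketch of the placeholder-pod headroom trick: a low-priority
# PriorityClass plus pause pods that the scheduler can preempt on demand.
def headroom_manifests(replicas: int, cpu: str, memory: str) -> tuple:
    priority_class = {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": "headroom"},
        # Below the default priority (0), so any normal pending pod
        # preempts these placeholders.
        "value": -10,
        "globalDefault": False,
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "headroom-placeholder"},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": "headroom"}},
            "template": {
                "metadata": {"labels": {"app": "headroom"}},
                "spec": {
                    "priorityClassName": "headroom",
                    "containers": [{
                        "name": "pause",
                        "image": "registry.k8s.io/pause:3.9",
                        # Each placeholder reserves this much schedulable room.
                        "resources": {"requests": {"cpu": cpu, "memory": memory}},
                    }],
                },
            },
        },
    }
    return priority_class, deployment

pc, deploy = headroom_manifests(replicas=10, cpu="1", memory="2Gi")
```

When real workloads spike, the scheduler evicts the pause pods first, and the evicted placeholders themselves become pending, prompting the autoscaler to replace the buffer.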
This approach effectively ensured that we always had some buffer capacity available for our high-priority workloads.</p><h2 id="aligning-karpenter-with-cluster-configuration-practices">Aligning Karpenter with Cluster Configuration Practices</h2><p>A crucial lesson we gained from our migration experience was the significance of addressing workloads with high ephemeral-storage requirements. As stated above, our setup uses ASGs, and ASGs use launch templates for their EC2 configuration (e.g., AMI ID, network, storage). We had increased the instance root volume size in our <a href="https://docs.aws.amazon.com/autoscaling/ec2/userguide/launch-templates.html">launch templates</a>. However, Karpenter cannot discover this volume size, even though we specified our launch template name in the Karpenter configuration. Karpenter wasn’t aware of these modifications and assumed that all instances had the standard 17 GB of storage. This caused those workloads to stay unschedulable indefinitely.</p><p>This misunderstanding presented a roadblock, as Karpenter struggled to find suitable instances for workloads with high ephemeral-storage needs. To overcome this challenge, we initiated an evaluation process to explore the use of blockDeviceMappings in each NodeClass. By utilizing this Karpenter feature, we aim to provide Karpenter with the necessary information about our instances’ actual volume sizes.</p><p>We also encountered another ephemeral-storage issue with our <a href="https://flink.apache.org/">Flink</a> workloads pool. Since Flink workloads demand fast and large storage operations, we selected EC2 instances with local NVMe-based SSD block-level storage (e.g., c6id, m6id, etc.). However, Karpenter was unable to recognize the local SSD storage as ephemeral storage for pods, which blocked our Flink pool migration. 
Fortunately, the Karpenter team <a href="https://github.com/aws/karpenter-provider-aws/pull/4735">addressed</a> our concern and released a new version that introduced the <a href="https://karpenter.sh/docs/concepts/nodeclasses/#specinstancestorepolicy">instanceStorePolicy</a> setting to resolve this issue.</p><p>Another noteworthy aspect of our migration involved the realization that Karpenter wasn’t inherently aware of our kubelet settings, which reside in our configuration management system. For instance, we had specified that 2 vCPUs and 4 GB of memory should be reserved for system processes using the ‘system-reserved’ configuration. Unfortunately, Karpenter lacked awareness of these specifics, leading to miscalculations in resource allocation. Additionally, we encountered disparities between our ‘max-pods’ setting and Karpenter’s internal calculations. The nuanced differences in how our configuration management system and Karpenter interpreted these settings highlighted the need for a more seamless integration between external configurations and Karpenter’s resource management algorithms.</p><p>These experiences taught us how important it is to follow Karpenter native configurations. When we make sure that AWS Karpenter understands our workload needs and the resources we have, it can do a much better job at managing everything efficiently.</p><p>AWS Karpenter is faster than Clusterman. The key distinction lies in their approaches to resource monitoring. Clusterman relies on periodic checks (minimum 1 minute), causing delays in detecting and responding to unschedulable pods. Instead, Karpenter leverages the power of Kubernetes events, allowing it to promptly detect and react to unschedulable pods in real-time (a couple of seconds). This event-driven model significantly enhances performance, ensuring a more responsive and dynamic scaling experience.</p><p>Karpenter not only outshines Clusterman in performance but also takes the lead in scalability. 
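The contrast between interval-based polling and event-driven reaction described above can be sketched in a few lines (a stdlib-only toy model with made-up event names; Karpenter itself reacts to real Kubernetes watch events, not this code):

```python
import queue

# Toy model of event-driven scaling: react to each pod event as it
# arrives instead of re-scanning the cluster every N minutes.
def event_driven_scaler(events):
    actions = []
    while True:
        event = events.get()
        if event is None:  # sentinel: event stream closed
            break
        if event["phase"] == "Pending" and event.get("unschedulable"):
            # Karpenter-style reaction: provision capacity immediately.
            actions.append(f"provision-node-for:{event['pod']}")
    return actions

q = queue.Queue()
q.put({"pod": "flink-taskmanager-0", "phase": "Pending", "unschedulable": True})
q.put({"pod": "web-1", "phase": "Running"})
q.put(None)
print(event_driven_scaler(q))  # → ['provision-node-for:flink-taskmanager-0']
```

An interval-based loop would only notice the pending pod on its next tick (up to a full interval later), while the event-driven loop handles it in the same iteration it arrives.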
Clusterman, with its memory-intensive approach of storing all pod and node information, faces challenges as the cluster size grows. The potential for Out-of-Memory errors looms, impacting its scalability. Conversely, Karpenter adopts a more streamlined approach by storing only essential information in memory. Moreover, Karpenter avoids the performance bottleneck of reading all resources from the kube-apiserver, making it a more scalable solution as your cluster expands. This dual focus on enhanced performance and scalability positions Karpenter as a reliable and efficient choice for managing Kubernetes clusters.</p><p>We created a new metric (spending efficiency) to track the computing cost improvements during the migration. Spending efficiency is the price of running one unit of resource (CPU or memory). Karpenter improved our spending efficiency by an average of 25% across all pools.</p><p>Initially, Clusterman was an optimal and practical solution for us at Yelp, especially during our transition from Mesos to Kubernetes. At that time, extending Clusterman’s capability from Mesos to scaling Kubernetes workloads was a strategic decision. This made Clusterman the only open-source autoscaler that supported both Kubernetes and Mesos workloads, simplifying our migration process.</p><p>However, as we moved all workloads to Kubernetes, maintaining Clusterman became an overhead, and it lacked key features required to run current workloads. This was particularly true when superior open-source autoscalers such as Karpenter became available, offering more advanced features and better support for Kubernetes.</p><p>This was a team project inside Yelp’s Compute Infrastructure team. Many thanks to Ajay Pratap Singh, Max Falk and Wilmer Bandres for being part of the project, and to the many engineering teams at Yelp that contributed to making the new system a success. 
Additionally, thanks to Matthew Mead-Briggs for his managerial support.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp. If you're interested, apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2024/12/revisiting-compute-scaling.html</link>
      <guid>https://engineeringblog.yelp.com/2024/12/revisiting-compute-scaling.html</guid>
      <pubDate>Fri, 13 Dec 2024 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Revenue Automation Series: Modernizing Yelp's Legacy Billing System]]></title>
<description><![CDATA[<p>This blog focuses on how Yelp successfully implemented a multi-year, cross-organizational initiative to modernize its billing processes. The goal was to automate its revenue recognition system by enhancing integration capabilities with third-party financial systems, all while maintaining the accuracy and reliability our users expect.</p><p>When Yelp first developed its billing system a decade ago, the database design was based on the requirements known at that time. These initial choices laid the foundation for the billing system, upon which multiple Yelp systems and processes were built. However, as the company evolved, it became evident that these design choices were not ideal and led to various challenges.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-25-modernization-of-yelps-legacy-billing-system/foundational.png" alt="Source: https://xkcd.com/2347/" /></p><p>The legacy design choices of Yelp’s billing system became a significant blocker when the company sought to integrate with a third-party revenue automation tool due to data format discrepancies. This integration was essential for scaling Yelp’s revenue system to match the company’s growth, with a target completion date of July 2024. However, Yelp’s unique handling of invoices led to misalignments, much like trying to fit mismatched pieces into a jigsaw puzzle.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-25-modernization-of-yelps-legacy-billing-system/jigsaw.png" alt="Jigsaw Puzzle" /></p><p>To address these challenges, Yelp decided to overhaul its billing system to align with industry standards. This initiative required making changes to business-critical systems, where any disruption could have severe consequences, such as the inability to bill or charge customers. Executing this initiative was about as complex as changing the tire of a car while it’s still running. 
To ensure a smooth transition, Yelp developed an execution plan, coordinating efforts across multiple teams over several years.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-25-modernization-of-yelps-legacy-billing-system/tire.jpg" alt="Changing Tire" /></p><p>Yelp’s legacy billing system introduced the concept of “Invoice Obviation” around a decade ago, which caused a customer’s unpaid balance to be carried over from one invoice to the next. This concept became central to the billing behavior, with the most recent invoice reflecting the total balance of the account.</p><p>The diagram below shows how, in Yelp’s legacy billing system, a payment could only be applied to the most recent invoice, requiring a one-to-one relationship between payment and invoice.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-25-modernization-of-yelps-legacy-billing-system/yelp-payment.png" alt="Payment Application in Yelp’s Legacy State:: Created on Canva by the author" /></p><p>A few years later, we realized that standard billing systems do not roll over balances from one invoice to another, which caused Yelp’s solution to behave very differently from industry norms.</p><p>The diagram below illustrates how, in a standard billing system, a single payment can be collected once but applied to multiple invoices, allowing one-to-many relationships between a payment and invoices.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-25-modernization-of-yelps-legacy-billing-system/industry-standard-payment.png" alt="Payment Application in a Standard Billing System: Created on Canva by the author" /></p><p>This caused data discrepancies between Yelp’s system and any standard billing system, as shown in the table below:</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-25-modernization-of-yelps-legacy-billing-system/table-in-summary.png" alt="Table Summary" /></p><p>In Yelp’s legacy system, payments intended for 
January, February, and March were incorrectly applied solely to March’s invoice, resulting in the revenue being recorded in the wrong revenue period and making it hard to determine invoice aging. This misapplication affected not only payments but also other concepts like credit and debit memos. These inconsistencies also hindered Yelp’s ability to integrate with third-party revenue automation and other financial tools.</p><p>To address this, Yelp decided to invest in a more robust billing model that aligns with current requirements and supports long-term growth. The new model aims to:</p><ul><li>Eliminate the concept of invoice obviation.</li>
<li>Enable one-to-many relationships, allowing payments and credit/debit memos to be applied across multiple invoices.</li>
<li>Correct data discrepancies by ensuring accurate revenue allocation within specific revenue periods, thereby improving financial accuracy and reporting.</li>
</ul><p>This blog post does not delve into the solution itself but rather explains how Yelp implemented an execution plan and delivered this initiative at scale.</p><p>Implementing such changes required over two years of collaborative effort from a team of more than 50 people because we had to fundamentally change the foundation and then rebuild the functionality on top of it. Therefore, developing a robust execution plan was the key to achieving the successful delivery of this initiative. This blog post focuses on the approach taken to execute this massive initiative and roll it out without impacting any of Yelp’s systems, ensuring users were billed seamlessly and accurately.</p><h2 id="step-1-requirement-gathering">Step 1: Requirement Gathering</h2><p>It was crucial to ensure that the system being redesigned using significant engineering resources was not short-sighted and truly matched the needs of the business. We maintained close collaboration with business stakeholders to ensure we gathered the right long-term requirements.</p><p>Cross-functional requirement gathering often mirrors the popular childhood game of Telephone. The game illustrates how a message can end up distorted as it is conveyed from person to person. The more people involved, the higher the likelihood of distortion. A common cause of this distortion is an individual’s choice of words and the difference in how the involved parties comprehend those words. Early on, we settled on the ubiquitous language concept from Domain-Driven Design principles, modeled after pre-existing accounting terminology and relying on domain experts as the single source of truth. 
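The one-to-many payment application described earlier can be sketched as follows (a simplified illustration with made-up invoice data and helper names, not Yelp's billing code):

```python
# Illustrative sketch of one-to-many payment application: a single payment
# is collected once, then spread across open invoices, oldest first, so
# revenue lands in the correct periods instead of all on the latest invoice.
def apply_payment(payment, open_invoices):
    remaining = payment
    applications = []
    for invoice in sorted(open_invoices, key=lambda inv: inv["period"]):
        if remaining <= 0:
            break
        applied = min(remaining, invoice["balance"])
        invoice["balance"] -= applied
        remaining -= applied
        applications.append({"invoice": invoice["id"], "applied": applied})
    return applications

invoices = [
    {"id": "INV-03", "period": "2024-03", "balance": 100.0},
    {"id": "INV-01", "period": "2024-01", "balance": 100.0},
    {"id": "INV-02", "period": "2024-02", "balance": 100.0},
]
# A single $250 payment covers January and February in full and March
# partially, leaving a $50 balance only on the March invoice.
print(apply_payment(250.0, invoices))
```

Under the legacy obviation model, the same $250 would have been applied solely to the March invoice, which is exactly the discrepancy the table above illustrates.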
Armed with a common vocabulary, we were able to reach a high-level consensus on the final state of the system.</p><h2 id="step-2-target-architecture">Step 2: Target Architecture</h2><p>To meet the specific set of requirements outlined by our stakeholders, we explored a variety of third-party solutions that are widely used in the industry. However, none of the off-the-shelf products could fully satisfy all the new requirements along with Yelp’s existing use case. After careful consideration, we decided to develop a custom billing architecture tailored specifically to our business model.</p><p>We had the option to patch the existing system, which might have seemed like a simpler solution. However, this approach carried the risk of introducing unintended side effects. Instead, we took inspiration from industry standards for billing architectures to define our target state. Although the target state represented a long roadmap, we committed to moving towards it gradually through multiple projects over the following quarters and years. This strategic decision was key to successfully delivering the project.</p><h2 id="step-3-project-planning">Step 3: Project Planning</h2><h3 id="identify-dependencies-and-order-of-execution">Identify Dependencies and Order of Execution</h3><p>After defining the target architecture, we mapped out all necessary processes and their dependencies. For instance, to support account-level payment collection, it was essential to enable the new payments functionality, implement account-level refunds, and ensure that the user interfaces accurately displayed both payments and refunds at the account level. We prioritized these processes based on stakeholder and business needs. 
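Sequencing deliverables this way is essentially a topological sort over the dependency graph. The sketch below is purely illustrative (the process names are simplified, and this is not Yelp's actual planning tooling); it shows how dependency ordering guarantees that prerequisites ship first:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each process maps to the processes it depends on.
dependencies = {
    "account-level payment collection": {
        "new payments functionality",
        "account-level refunds",
        "payments/refunds UI",
    },
    "account-level refunds": {"new payments functionality"},
    "payments/refunds UI": {"new payments functionality"},
    "new payments functionality": set(),
}

# static_order() yields each process only after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # "new payments functionality" first, payment collection last
```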
All features required to support a specific process were identified as a single deliverable, helping us manage interdependencies effectively.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-25-modernization-of-yelps-legacy-billing-system/dependencies.png" alt="Project Dependencies" /></p><p>Tackling projects of this scale required changes across multiple systems and the participation of several teams. Instead of having each team take ownership of work within their respective domains, we created multiple cross-functional teams of 4-6 engineers, each assigned to an ongoing project. We called them the Tiger Teams—focused, invincible, and goal-driven! The majority of the teams lasted for the duration of a project, and at any time, 2-4 teams worked in parallel, adapting to evolving priorities.</p><h3 id="coordination-across-teams">Coordination Across Teams</h3><p>An Engineering Manager was assigned to oversee these tiger teams, managing the timelines, staffing, and prioritization. Multiple new processes were introduced to help the EM manage these cross-functional teams:</p><ul><li>Sprint planning is typically done for individual teams at Yelp. For the tiger teams, however, we developed a new sprint planning process that facilitated cross-functional collaboration.</li>
<li>A Gantt chart was maintained by the EM to track timelines, staffing, and blockers. This chart was also used to update leadership and stakeholders, building trust in the path forward and keeping them informed.</li>
<li>Regular sync-up meetings were set up to track blockers, and if any arose, the teams worked proactively to resolve them.</li>
</ul><p>The Engineering Manager’s role was crucial in ensuring that each tiger team could deliver their projects within realistic timelines.</p><h2 id="step-4-incremental-delivery">Step 4: Incremental Delivery</h2><p>We learned the true meaning of iterative development by delivering smaller changes in multiple iterations, always ensuring we provided a minimum viable product. While we sometimes compromised on the number of features delivered, we never delivered unfinished ones. Initially, we limited rollouts to a small group of customers who didn’t require advanced functionality. This approach allowed us to validate correctness in a controlled environment while continuing to support these customers as additional features were developed.</p><p>Naming our rollouts proved to be a successful strategy. We chose F1 racetrack names for the rollouts, adding an element of fun and improving communication with stakeholders. This made it easy to convey that feature X was part of the Baku rollout, while feature Y would be available soon in the Suzuka rollout.</p><p>A significant achievement was our successful collaboration with business stakeholders. The Finance teams, accustomed to a 100% cutover strategy, initially lacked experience with iterative rollouts, which are more common in engineering. They were skeptical of its reliability due to the nature of their domain. However, through effective communication and demonstration of the approach’s benefits, they became confident in and appreciative of the iterative method.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-25-modernization-of-yelps-legacy-billing-system/incremental-delivery.jpg" alt="Incremental Delivery: Created on Canva by the authors" /></p><p>A/B testing is commonly used in the industry to measure the performance of new features by subjecting users to different experiences. 
By measuring differences in key metrics between the two groups, we can ensure that new features do not negatively impact our business.</p><p>Rather than A/B testing individual features introduced in this project, we opted to incrementally release new features to a single group of users. This avoided the complications of managing multiple experiments and simplified the measurement of key metrics.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-25-modernization-of-yelps-legacy-billing-system/stack.png" alt="Iterative Rollout Strategy: Created on Canva by the authors" /></p><h2 id="step-6-user-acceptance-testing">Step 6: User Acceptance Testing</h2><p>The impact of making such a fundamental change was widespread. We were unable to rely solely on automated tests, as the technical changes affected the business processes of many non-engineering teams, including Finance, Accounting, Customer Support, Analytics, and more. Therefore, we decided to be extremely thorough with user acceptance testing, allowing us to verify the correctness of both the system and the surrounding processes.</p><p>To ensure comprehensive coverage, we involved stakeholders from all the affected teams in the user acceptance testing process. We created detailed test plans and scenarios that covered every aspect of the new billing system. Each stakeholder was responsible for verifying that their systems and processes functioned correctly and as desired.</p><p>We documented our testing process in a Google Sheet, and stakeholders were asked to execute the test cases relevant to their domain and mark them as pass or fail. 
This collaborative approach helped us identify any unintentional impact on internal processes and systems early, ensuring a smooth transition to the new system.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-25-modernization-of-yelps-legacy-billing-system/uat.png" alt="UAT: Created on Google Sheet" /></p><h2 id="step-7-system-observability">Step 7: System Observability</h2><p>Ensuring system observability was crucial to roll out a foundational change quickly to all customers to meet the company’s revenue automation timeline of July 2024. We implemented a multi-layered approach to monitor and maintain the integrity of the system:</p><ul><li>
<p>Alerts/Logging and Monitoring: We set up comprehensive alerts, logging, and monitoring dashboards to provide real-time visibility into system performance and anomalies.</p>
</li>
<li>
<p>Integrity Checkers: We developed integrity checkers as our last line of defense. These checkers were designed to continuously validate production data for consistency and report any anomalies that deviated from the expected behavior.</p>
</li>
<li>
<p>Stakeholder Dashboards: We created dashboards for the stakeholders, providing them with relevant metrics and insights. This transparency helped build trust and allowed stakeholders to monitor the progress and stability of the system.</p>
</li>
</ul><p>By combining these observability practices, we were able to ensure that any discrepancies were caught and addressed before they could impact the customers.</p><p>Even though the initiative required a massive effort from over 50 people working collaboratively for more than two years, it was successfully delivered by following the structured execution plan outlined above. Unlike traditional rollouts at Yelp for critical systems, which usually involve a gradual release, we adopted an accelerated approach to meet our tight timelines for integrating with third-party revenue automation systems. We initiated the update with a small group of customers in October 2023, and by steadily increasing the rollout pace, we achieved 100% customer adoption by July 2024.</p><p>The rollout speed was unusually high for our organization, as the billing system requires high accuracy and customers typically receive only one bill per month, providing limited opportunities to verify the updates’ correctness.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-25-modernization-of-yelps-legacy-billing-system/rollout.png" alt="Rollout" /></p><p>This execution plan allowed us to meet our deadlines while ensuring a seamless transition without compromising system integrity or functionality, thanks to the dedication and cross-functional coordination of the team.</p><p>We would like to thank everyone across multiple organizations at Yelp for their continued support and tenacity in making this a reality. Their efforts have been crucial in helping Yelp automate 90% of its revenue by strengthening the data foundation.</p><p>Keep an eye out for our next blog posts on Yelp’s integration with the third party financial tool, where we’ll dive into how we automated revenue processes once the new billing system was in place.</p><div class="island job-posting"><h3>Join Our Team at Yelp</h3><p>We're tackling exciting challenges at Yelp. Interested in joining us? 
Apply now!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2024/12/modernization-of-yelp's-legacy-billing-system.html</link>
      <guid>https://engineeringblog.yelp.com/2024/12/modernization-of-yelp's-legacy-billing-system.html</guid>
      <pubDate>Fri, 06 Dec 2024 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Loading data into Redshift with DBT]]></title>
<description><![CDATA[<p>At Yelp, we embrace innovation and thrive on exploring new possibilities. With our consumers’ ever-growing appetite for data, we recently revisited how we could load data into Redshift more efficiently. In this blog post, we explore how DBT can be used seamlessly with Redshift Spectrum to read data from Data Lake into Redshift, significantly reducing runtime, resolving data quality issues, and improving developer productivity.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-06-loading-data-into-redshift-with-dbt/image1.png" alt="architecture before" /></p><p>Our method of loading batch data into Redshift had been effective for years, but we continually sought improvements. We primarily used Spark jobs to read S3 data and publish it to our in-house Kafka-based Data Pipeline (which you can read more about <a href="https://engineeringblog.yelp.com/2016/07/billions-of-messages-a-day-yelps-real-time-data-pipeline.html">here</a>) to get data into both Data Lake and Redshift. However, we began encountering a few pain points:</p><ol><li><strong>Performance</strong>: Larger datasets (100M+ rows daily) were beginning to face delays. This was mostly due to table scans to ensure that primary keys were not being duplicated upon upserts.</li>
<li><strong>Schema changes</strong>: Most tables were configured with an <a href="https://avro.apache.org/docs/1.11.1/specification/">Avro schema</a>. Schema changes were sometimes complex, as they required a multi-step process to create and register new Avro schemas.</li>
<li><strong>Backfilling</strong>: Correcting data with backfills was poorly supported, as there was no easy way to modify rows in-place. We often resorted to manually deleting data before writing the corrected data for the entire partition.</li>
<li><strong>Data quality</strong>: Writing to Data Lake and Redshift in parallel posed a risk of data divergence, such as differences in data typing between the two data stores.</li>
</ol><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-06-loading-data-into-redshift-with-dbt/image2.png" alt="architecture after" /></p><p>When considering how to move data around more efficiently, we chose to leverage <a href="https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-overview.html">AWS Redshift Spectrum</a>, a tool built specifically to make it possible to query Data Lake data from Redshift. Since Data Lake tables usually had the most up-to-date schemas, we decided to use the Data Lake as the data source instead of S3 for our Redshift batches. Not only did this help reduce data divergence, it also aligned with our best practice of treating the Data Lake as the single source of truth.</p><p>For implementation, Spectrum requires a defined schema, which already exists in Glue for our Data Lake tables. The only additional setup needed was to add the Data Lake tables as external tables, making them accessible from Redshift with a simple SQL query.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-06-loading-data-into-redshift-with-dbt/image3.png" alt="external schema snippet" /></p><p>We had already started adopting <a href="https://www.getdbt.com/product/what-is-dbt">DBT</a> for other datasets, and it also seemed like the perfect candidate to capture our Redshift Spectrum queries in our pipeline. DBT excels at transforming data and helps enforce writing modularized and version-controlled SQL. Instead of a Spark job reading from S3 to Redshift, we used DBT to simply copy the data from Data Lake directly to Redshift. 
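As a rough sketch of what such a copy can look like (the schema, table, and column names below are hypothetical, not Yelp's actual models), a dbt model can select directly from the Spectrum external schema:

```sql
-- models/staging/stg_events.sql (illustrative, assumed names)
-- Append one day of a Data Lake table, exposed to Redshift through a
-- Spectrum external schema, into a native Redshift table managed by dbt.
{{ config(materialized='incremental') }}

select
    event_id,
    event_time,
    payload
from spectrum_data_lake.events          -- external table backed by Glue/S3
where partition_date = '{{ var("run_date") }}'
```

With a shape like this, a single partition can be loaded by passing the date through `dbt run --vars`.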
Not only did DBT provide its usual trademark benefits of reproducibility, flexibility, and data lineage, but it also helped us combat some of the pain points mentioned above.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-06-loading-data-into-redshift-with-dbt/image4.png" alt="dbt model snippet" /></p><h2 id="simplified-schema-changes">Simplified schema changes</h2><p>To simplify schema changes, we took advantage of DBT’s <strong>on_schema_change</strong> configuration argument. By setting it to <strong>append_new_columns</strong>, we ensured that columns would not be deleted if they were absent from the incoming data. We also used DBT contracts as a second layer of protection to ensure that the data being written matched the model’s configuration.</p><h2 id="backfills-less-manual">Less manual backfills</h2><p>Backfilling also became a lot easier with DBT. By using DBT’s <strong>pre_hook</strong> configuration argument, we could specify a query to execute just before the model runs. This enabled us to automatically delete the data for the partition about to be written. Now that we could guarantee idempotency, backfills could be done without worrying about stale data not being removed.</p><h2 id="data-deduplication">Data deduplication</h2><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-06-loading-data-into-redshift-with-dbt/image5.png" alt="dbt test snippet" /></p><p>To tackle duplicate rows, we added a deduplication layer to the SQL, which was validated with a DBT test. While DBT has built-in unique column tests, they weren’t feasible for our large tables since they required scanning the entire table. Instead, we used the <strong>expect_column_values_to_be_unique</strong> test from the <strong>dbt_expectations</strong> package. 
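As an illustrative sketch of how a test from the dbt_expectations package might be wired up (the model and column names here are hypothetical), the test is declared in the model's schema file:

```yaml
# models/schema.yml (illustrative, assumed names)
version: 2
models:
  - name: stg_events
    columns:
      - name: event_id
        tests:
          - dbt_expectations.expect_column_values_to_be_unique:
              # limit the scan to the partition that was just written
              row_condition: "partition_date = '{{ var('run_date') }}'"
```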
This allowed us to specify a row condition to scan only the rows recently written.</p><h2 id="performance-gains">Performance Gains</h2><p><img src="https://engineeringblog.yelp.com/images/posts/2024-11-06-loading-data-into-redshift-with-dbt/image6.png" alt="performance gains" /></p><p>The most noticeable win was in performance, especially for our largest and most problematic Redshift dataset:</p><ul><li>Writing used to take about 2 hours, but now it typically runs in just 10 minutes.</li>
<li>Before, there were sometimes up to 6 hours of delays per month. Now we no longer experience any delays! This has greatly reduced the burden on our on-call incident response efforts.</li>
<li>Schema upgrades used to be a lengthy multi-step process. This has been improved to a 3-step process that takes only a few hours.</li>
</ul><h2 id="better-data-consistency">Better data consistency</h2><p>By eliminating the forking of data flows, we increased our confidence that data wouldn’t diverge between different data stores. Since any data entering Redshift must first pass through Data Lake, we could better ensure that Data Lake remained our single source of truth.</p><p>Following the success of the migration, we rolled out these changes to approximately a dozen other datasets and observed similar benefits across the board. By leveraging tools like AWS Redshift Spectrum and DBT, we better aligned our infrastructure with our evolving data needs, providing even greater value to our users and stakeholders.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp. If you're interested, apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2024/11/loading-data-into-redshift-with-dbt.html</link>
      <guid>https://engineeringblog.yelp.com/2024/11/loading-data-into-redshift-with-dbt.html</guid>
      <pubDate>Wed, 06 Nov 2024 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[How we improved our Android navigation performance by ~30%]]></title>
<description><![CDATA[<p>In 2019, Yelp’s Core Android team led an effort to boost navigation performance in Yelp’s Consumer app. We switched from building screens with multiple separate activities to using fragments inside a single activity. In this blog post, we’ll cover our solution, explain how we approached the migration, and share learnings from along the way as well as performance wins.</p><h2 id="where-we-started-circa-2018">Where we started circa 2018</h2><p>Navigating between screens in an Android app is often when the app and device are under the most strain. The new screen and its dependencies are created in quick succession, which can lead to slow or frozen frames. Prior to 2019, almost every page in Yelp’s Consumer application was in its own activity. Transitioning from one page to another was not always smooth, and the UI was often visibly slow while navigating around.</p><p>Clicking on one of the bottom tabs meant recreating everything from scratch each time a user navigated to a screen. We mitigated this in the short term by bringing an activity to the foreground instead of recreating it if it was present in the activity stack. Although this helped, it still didn’t result in the visibly silky-smooth navigation transitions we were hoping for.</p><p>To understand where we could have the most impact and to help focus our efforts, we first ran some local benchmarks, then monitored performance data from real users to verify our hypothesis that navigation performance was slow.</p><h2 id="local-benchmarks">Local Benchmarks</h2><p>In 2018, we performed some basic navigation tests on a Pixel device running Android 7.1.1 and measured how long it took from the moment a button was clicked on Screen 1 until Screen 2’s onCreate() lifecycle method completed. The following numbers are averages derived from 10 iterations per scenario:</p><table><thead><tr><th>Scenario</th>
<th>1st Time Navigation to Page (ms)</th>
<th>2nd Time Navigation to Page (ms)</th>
</tr></thead><tbody><tr><td>Plain Activity with no animation</td>
<td>152</td>
<td>65</td>
</tr><tr><td>Plain YelpActivity with no animation</td>
<td>420</td>
<td>116</td>
</tr></tbody></table><p>We had to ask ourselves, why is the Yelp base activity so much slower? It turned out it was a combination of many things.</p><ol><li>
<p>We were creating the Navigation Drawer as soon as the activity was created, instead of doing it lazily—only when the user opens it. This was true even for the bottom tab activities.</p>
</li>
<li>
<p>The layout hierarchy was deep, containing unused and unnecessary layers that could be removed.</p>
</li>
<li>
<p>Each page had to create the entire layout hierarchy, instead of just the content above the bottom navigation bar.</p>
</li>
<li>
<p>We had some slower calls during onCreate(), such as analytics calls and accidental disk I/O that should have been run on a background thread instead.</p>
</li>
<li>
<p>We had lots of small, cheap objects that were set up on each screen to ease development, the sum of which amounted to a significant portion of the slowdown.</p>
</li>
</ol><h2 id="production-data">Production Data</h2><p>We also collected some navigation performance data from real users over a five-month period. Below is the data for two high-traffic flows:</p><table><thead><tr><th>Flow</th>
<th>Average (ms)</th>
<th>P99 (ms)</th>
</tr></thead><tbody><tr><td>Home to Search Overlay</td>
<td>~200</td>
<td>~1000</td>
</tr><tr><td>Search List to the Business Page</td>
<td>~240</td>
<td>~380</td>
</tr></tbody></table><p>We learned the performance was lacking, verified our local benchmark numbers, and proved our seen-with-the-naked-eye hypothesis about the navigation transitions.</p><p>Based on this research, Yelp’s Core Android team decided it would be invaluable to tackle this problem. The recommended architecture for a bottom tab screen is to use a “single” activity with multiple fragments or views. The theory behind this is that creating a fragment or view is much faster and cheaper than creating an activity. However, there are many ways to implement this, so we had a big decision to make.</p><h3 id="fragments-vs-views">Fragments vs Views</h3><p>To determine what screens should be made of in our new single activity setup, we did some more local benchmarking. These measurements were also taken from a Pixel device running Android 7.1.1.</p><table><thead><tr><th>Scenario</th>
<th>1st Time Navigation to Page (ms)</th>
<th>2nd Time Navigation to Page (ms)</th>
</tr></thead><tbody><tr><td>Plain View with no animation</td>
<td>6</td>
<td>3</td>
</tr><tr><td>Plain View with animation</td>
<td>6</td>
<td>3</td>
</tr><tr><td>Plain Fragment with no animation</td>
<td>14</td>
<td>12</td>
</tr><tr><td>Plain Fragment with animation</td>
<td>15</td>
<td>11</td>
</tr></tbody></table><p>We found that either of these solutions resulted in significantly faster navigation performance than using activities. We also benchmarked with and without a shared element transition between screens to evaluate their impact on performance and found they had a negligible negative impact.</p><p>Based on the above, views were clearly the fastest in terms of the timings, but these numbers didn’t tell the whole story. Firstly, the difference in the timings is not visible to the naked eye, and both represent a significant performance gain over the status quo. Secondly, besides considering performance, we also had to consider the development experience for our Android community and the ongoing support available to us in the future from whatever solution we selected to build our new single activity.</p><p>Google didn’t directly support View-based navigation at the time (it has since become possible with Compose-based navigation). In order to use views, we would have needed to either find an existing view-based navigation architecture library or build one ourselves. There <em>were</em> some promising open-source solutions, such as Scoop by Lyft, Flow &amp; Mortar by Square, and the Conductor library by BlueLine Labs. However, third-party open-source libraries come with their own set of risks and challenges, such as being dropped or deprecated over time, as happened to two of the libraries mentioned above.</p><p>We evaluated Conductor, and it had many advantages, such as:</p><ol><li>Lightning-fast transitions</li>
<li>Great API</li>
<li>Support for shared element transitions out-of-box</li>
<li>Support for RxLifecycle and other architecture components via add-on libraries</li>
</ol><p>However, ultimately, we deemed the risk of using a 3rd party library for navigation to be too great. While views were technically faster, after taking everything into account, we decided to use fragments.</p><p>To use fragments, there are a variety of old and new options provided by Google. Unlike views, fragments are intended to be used as screens within a single activity flow. So by choosing fragments as our solution, we benefit from all the support that comes with it, such as documentation, testing, and lifecycle management.</p><p>The first fragment-based solution we evaluated was Google’s Jetpack Navigation Library. The library was quite new, but it seemed like it should have suited our needs. Developers define a navigation graph in XML and the library auto-generates code to make navigating between the screens defined in the graph simple. However, we quickly discovered various limitations and obstacles to using this.</p><h3 id="blocker-1-feature-modules">Blocker #1: Feature Modules</h3><p><a href="https://engineeringblog.yelp.com/2018/06/how-yelp-modularized-the-android-app.html">Yelp’s Android build is modularized</a> with each feature residing in its own Gradle module. To keep our build speed lean, we don’t allow Gradle modules in the same layer of the build hierarchy to depend on each other. This allows modules to build in parallel, which unlocks a slew of build performance wins.</p><p>Defining navigation routes in an XML file meant having an app-wide navigation graph in a build-layer higher than the feature layer in the build hierarchy. Fragment IDs also had to be declared down the hierarchy to be accessible in all modules and permit inter-module navigation.</p><h3 id="blocker-2-scalability">Blocker #2: Scalability</h3><p>Declaring all screens in a single XML file would also have led to a major scalability issue, where we would have one giant and hard-to-read file which all teams would iterate on frequently. 
XML is also not dynamic enough for our use-cases. Due to a performance issue in the Android Gradle Plugin, our build times also tripled when attempting to declare the fragment IDs in a lower-level module. Lastly, even with the above approach, inter-module navigation became tricky and negated most of the benefits the library provided.</p><p>After five years of improvements, the Jetpack navigation library can handle more use cases. It is now possible to create a navigation graph dynamically in Kotlin, which should help with some of the issues we faced. We reevaluate this regularly and may switch to using it at a later stage. Overall, this is a great navigation library and we currently use it for small flows within a larger screen.</p><h2 id="selected-approach-plain-old-fragments">Selected Approach: Plain Old Fragments</h2><p>We decided to use plain fragments without using the Jetpack navigation library. Fragments are a well-supported part of the Android ecosystem and are familiar to most developers. By using plain fragment navigation, we could get the performance benefit we wanted, get visually pleasing transitions, and solve the cross-Gradle module navigation issue we encountered in the Jetpack Navigation library.</p><p>Android provides a FragmentTransaction API for showing and hiding fragments, which is what we use under the hood. However, we added a layer of abstraction which hides FragmentTransaction and other navigation specific code from features. We use layers of abstractions when we can to great success. This gives our future selves (thanks, us!) a great advantage by allowing us to switch implementations if necessary, but without updating every navigation point in the app. 
This abstraction layer exists as an interface we imaginatively named “SingleActivityNavigator”.</p><p>Navigating from one screen to another in the single activity requires creating an instance of the new fragment and then calling <code class="language-plaintext highlighter-rouge">displayInSingleActivity</code>, which, at minimum, requires an Android context and a fragment tag.</p><p>We built the bottom tabs like a regular feature using <a href="https://engineeringblog.yelp.com/amp/2023/04/performance-for-free-on-android-with-our-mvi-library.html">our MVI library “auto-mvi”</a>, which is both performant and easily testable. Now in the single activity, there’s only one instance of the bottom tab bar, and it’s shared among many screens. This speeds up fragment creation, since fragments in the single activity only need to inflate the content above the bottom tab bar.</p><p>We removed the navigation drawer, as it was already an outdated Android design trend at the time, and instead moved its content to a “More Tab” accessible via the bottom navigation bar. This boosted performance both for the fragments within the single activity and for the single activity itself, as the drawer was no longer required on every screen.</p><p>We allow each fragment in our single activity to configure screen-level properties through the SingleActivityNavigator interface. These properties are applied when a fragment is displayed and are restored to the previous fragment’s requirements when navigating backwards. Configurable options include the status bar color, the status bar icon color, whether the fragment content should draw under the status bar, and the window background color.</p><p>We use dependency injection to retrieve fragments based on a dependency injection string key. This lets us keep fragments in separate feature modules and retrieve an instance of them from anywhere else in the app. 
One advantage of using the SingleActivityNavigator interface is that, while we mostly use dependency injection to retrieve the fragments, it’s not a hard requirement. We can retrieve fragments by other means, which for our use-case was important to allow backwards compatibility with some legacy code.</p><p>Another advantage of this approach is that it keeps build times fast with our modular Android Gradle build.</p><h3 id="handling-deeplinks">Handling Deeplinks</h3><p>In the Yelp Consumer app, each external deeplink first passes through an activity whose sole purpose is to process the deeplink’s URL parameters and decide if the URL is safe and/or correct. These activities are called URLCatcherActivities. Each deeplink destination has its own designated URLCatcherActivity. After processing the URL and parsing whatever relevant data there is, this activity is then responsible for navigating to the actual target destination within the app. While these intermediate activities during app launch are not ideal for our app’s cold start timings, we benefit from avoiding a monolithic deeplink-handling class, as well as from improved readability and testing.</p><p>This brings us to how we added support for deeplink navigation to fragments. Building on the above section, we know we can use dependency injection to retrieve a fragment based on a string key. The key is used to fetch an instance of a fragment from the dependency graph. To navigate to a fragment based on a deeplink, we use an intent extra that denotes the fragment to display. After parsing the data from the URL, we pass it in an Intent. The single activity then uses this Intent extra to fetch an instance of the fragment from the dependency graph. 
It passes data from the URL into the fragment’s arguments and then finally displays the fragment.</p><p>While this solution satisfies our requirements under the constraints (requiring a URLCatcherActivity), further performance improvements became possible once we introduced the single activity. To improve deeplink navigation and cold start performance further, we can now deeplink directly to the single activity and display a fragment, which is a significant improvement over the status quo.</p><h2 id="migration-path">Migration Path</h2><p>There were three phases to the migration to fragments. Before we could begin the actual fragment migration, we first had to address the navigation drawer and move it to the More Tab. Next, we migrated each activity to a fragment while leaving the original activities in place; these activities were mostly empty shells at this point and used the pre-existing navigation code. Then, we gradually rolled out a version where each fragment was displayed in the single activity. Lastly, we monitored navigation performance to verify that we achieved the expected improvements.</p><h2 id="performance-results">Performance Results</h2><p>When recording measurements from production, we focused on tracking the highest-traffic screens in the Consumer app. We only tracked the first time a transition occurred within each session, because this is where the change is most impactful and noticeable. We found that after screens are already created, navigating among them is exceptionally fast. So bear in mind that the following results include creating the fragments’ views too.</p><p>On average, across all Android versions and device models (low &amp; high end), we saw a ~30% navigation performance boost. Sometimes, we saw as high as a ~60% improvement in navigation time.
The performance improvement really depends on the screen: what it’s doing and how it’s built internally.</p><h2 id="conclusion">Conclusion</h2><p>We learned that multiple fragments in a single activity perform much faster than multiple separate activities. We accomplished visibly smooth animations between our screens while leaving our fragments in separate feature modules. Doing a migration like this gradually and safely is totally achievable.</p><p>Although the performance of the underlying Android components (activity, fragment, view) varied quite a bit, the performance gains in any project always depend on the use-case-specific code and solutions already in place. That’s why on the Core Android team, we try to tackle performance holistically with performant-by-default solutions when and where we can.</p><p>Our single activity implementation has been working well for many years now, with many teams that work on the Consumer app having adopted the pattern for their screens. Our business owner app also followed suit and migrated to a fragment-based single activity. While our apps are smoother now, we remain optimistic that the Jetpack navigation library will someday solve all of our requirements.</p><h2 id="acknowledgements">Acknowledgements</h2><p>A huge thanks to Core Android’s managers at the time before and during this project, David Brick and Antonio Hernández Niñirola, who helped us make a case to do this work and push it forward. A special thanks to all the feature teams and developers who migrated their screens and provided code reviews, such as Diego Waxemberg, Tyler Argo, Lasya Boddapati, and Sreenivasen Ramasubramanian. Finally, a big thank you to my fellow Core Android members for providing invaluable thoughts, feedback and insight along the way.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp.
If you're interested, apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2024/10/how-we-improved-our-android-navigation-performance-by-~30.html</link>
      <guid>https://engineeringblog.yelp.com/2024/10/how-we-improved-our-android-navigation-performance-by-~30.html</guid>
      <pubDate>Tue, 08 Oct 2024 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Migrating in-place from PostgreSQL to MySQL]]></title>
      <description><![CDATA[<p>The Yelp Reservations service (yelp_res) is the service that powers <a href="https://www.yelp.com/reservations">reservations on Yelp</a>. It was acquired along with <a href="https://blog.yelp.com/news/welcoming-seatme-to-yelp/">Seatme in 2013</a>, and is a Django service and webapp. It powers the reservation backend and logic for <a href="https://restaurants.yelp.com/products/yelp-guest-manager/">Yelp Guest Manager</a>, our iPad app for restaurants, and handles diner and partner flows that create reservations. Along with that, it serves a web UI and backend API for our Yelp Reservations app, which has been superseded by Yelp Guest Manager but is still used by many of our restaurant customers.</p><p>This service was built using a DB-centric architecture, and uses a “DB sync” paradigm – a method where clients maintain a local database with a copy of data relevant to them – to sync data with legacy clients. It also relies on database triggers to enforce some business logic. The DB used is PostgreSQL, which is not used anywhere else at Yelp; this meant that only a small rotation of long-tenured employees knew Postgres well enough to do outage response. This caused issues in maintenance, visibility, and outage response times. The teams working on the Restaurants products are not infra teams, and the Yelp-wide infra teams (understandably) focus on Yelp-standard infrastructure. As a result, when we did see issues with Postgres it was often a scramble to find people with relevant expertise.</p><p>So, we switched out this DB in-place with a Yelp-standard MySQL DB.</p><p>As restaurants rely on our product to run their business, this system can’t be taken offline for maintenance, and any data loss is unacceptable: we can’t have someone make a reservation and then have it disappear. This led to much of the complexity of this project, as switching gradually between two data stores on the fly introduced new challenges.
Much of the existing documentation we could find on this used toy examples or assumed a clean stop, migration, and restart, so this was also somewhat unexplored territory (hence this blog post!).</p><p>Django has MySQL support. As a proof of concept in mid-2022, we switched the development DB (which is local and set up as needed) to a MySQL DB and updated migration code. We got to the point where the service was starting, correctly setting up the DB in MySQL, and responding successfully to some requests. While this ended up being the easy part, it helped prove that the migration was feasible.</p><p>Postgres has a lot of functionality that isn’t supported in MySQL. We also used some features that, while supported by MySQL, are not supported by our infra teams.</p><p>One example: Postgres has native support for array columns. We used these to store the schedule for each table at a restaurant in our database as an array of integers. We re-implemented this behavior to pack the data into a string, which worked cleanly since both the length of the array and the size of each element are constant.</p><p>A more complicated set of changes was needed to get rid of database triggers. Triggers are supported by MySQL in general, but are not supported by our MySQL infrastructure. Our code used them to propagate data (triggering when certain database tables are changed) and to enforce constraints around preventing double-booking of tables.</p><p>For data propagation, our old system relied on DB changes for certain tables being published as Advanced Message Queuing Protocol (<a href="https://en.wikipedia.org/wiki/Advanced_Message_Queuing_Protocol">AMQP</a>) events into <a href="https://www.rabbitmq.com/">rabbitmq</a>, which were then consumed by multiple clients that subscribed to the changes relevant to them.
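As a sketch of that packing approach (the 4-digit slot width and 96-slot length are invented bounds for illustration; the real schema differs), a fixed-length array of bounded integers round-trips cleanly through a string column:

```python
# Sketch: replace a Postgres integer-array column with a packed string for
# MySQL. This works because the array length and the range of each element
# are constant. The widths below are assumptions for the example.
SLOT_WIDTH = 4   # digits per element
NUM_SLOTS = 96   # e.g. one slot per 15-minute interval in a day


def pack_schedule(values):
    assert len(values) == NUM_SLOTS
    assert all(0 <= v < 10 ** SLOT_WIDTH for v in values)
    # Zero-pad each element so every slot occupies a fixed byte range.
    return "".join(f"{v:0{SLOT_WIDTH}d}" for v in values)


def unpack_schedule(packed):
    return [int(packed[i:i + SLOT_WIDTH])
            for i in range(0, len(packed), SLOT_WIDTH)]


schedule = [0] * NUM_SLOTS
schedule[32] = 1250
packed = pack_schedule(schedule)
```

Because every slot sits at a fixed offset, the packed column can even be sliced in SQL if needed, though the application normally unpacks the whole string.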
This was powered by a Postgres-specific database extension that integrated with Postgres’ transaction management, ensuring clients never received a message until the corresponding transaction was committed. In our new system, we added logic to the Django model’s <code class="language-plaintext highlighter-rouge">save()</code> function to add a post-commit trigger to publish to a new AMQP topic, and refactored our code to eliminate “bulk” operations, which write to the DB without calling <code class="language-plaintext highlighter-rouge">save()</code>. This means our existing watchers could instead listen to this new AMQP topic even when updating MySQL tables. We also introduced transaction grouping by generating a universally unique identifier (UUID) for each transaction at the start of the transaction block. This identifier was written along with the change data to group changes by transaction. We then monitored the existing topic and the new topic and ensured that the data matched, before switching to using the new topic.</p><p>For preventing double-bookings, we had used a system called ‘Atomic Block Holds’. This used a DB trigger to raise an exception and prevent a write if a block (a ‘block’ is a reservation, or anything else, that means a table cannot be reserved at a certain time) on a table would overlap with an existing block. To replicate this behavior without triggers, we created a new table called <code class="language-plaintext highlighter-rouge">TableTimeSlotBlock</code> which contains rows keyed on both the table id and 15-minute timeslots for each existing blocked period. Then the application code checks for conflicts and locks the rows (even if they don’t exist yet) by performing a <code class="language-plaintext highlighter-rouge">SELECT … FOR UPDATE</code> query. 
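The core of that check, expanding a block into its 15-minute timeslot keys and refusing any overlap, can be sketched in pure Python as follows (an in-memory stand-in: the real implementation locks the corresponding rows with SELECT … FOR UPDATE inside a transaction, and the table/field names here are illustrative):

```python
from datetime import datetime, timedelta

SLOT = timedelta(minutes=15)


def timeslot_keys(table_id, start, end):
    """Expand a block into (table_id, slot_start) keys, one per 15-minute slot."""
    keys = []
    t = start
    while t < end:
        keys.append((table_id, t))
        t += SLOT
    return keys


def try_hold(existing, table_id, start, end):
    """Refuse the hold if any timeslot row already exists; otherwise record them.
    Stand-in for the row-locking SELECT ... FOR UPDATE conflict check."""
    keys = timeslot_keys(table_id, start, end)
    if any(k in existing for k in keys):
        return False
    existing.update(keys)
    return True


blocks = set()
dinner = datetime(2024, 10, 7, 19, 0)
ok = try_hold(blocks, table_id=7, start=dinner, end=dinner + timedelta(hours=1))
conflict = try_hold(blocks, table_id=7, start=dinner + timedelta(minutes=45),
                    end=dinner + timedelta(hours=2))
```

The second hold fails because its first timeslot overlaps the last slot of the first hold, which is exactly the double-booking the trigger used to prevent.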
We put this logic earlier in the code than the existing DB trigger, so by examining logs we could ensure that the existing trigger was no longer exercised, meaning that the new solution was at least as restrictive as the status quo.</p><p>To migrate to this system, we also had to create rows in this new table for all future reservations – and since each existing ‘block’ covered multiple timeslots, that meant adding millions of records to the new table.</p><p>This was the scary part. We wanted to be able to release the new DB gradually and be able to roll back to Postgres if needed. This meant that we needed to keep both DBs in sync for some period of time. Django has multi-DB support, but that is intended for writing/reading different things to different DBs, not keeping data exactly in sync across multiple DBs.</p><p>To achieve this for writes, we:</p><ul><li>
<p>Added a new model called <code class="language-plaintext highlighter-rouge">AlsoWriteToMysqlModel</code> in the inheritance hierarchy of all models in our code. This model redefined save() and other object-level DB write functions to write first to a ‘primary’ DB, and then save the object to the ‘secondary’ DB</p>
</li>
<li>Did the same with <code class="language-plaintext highlighter-rouge">AlsoWriteToMysqlQuerySet</code> and queryset operations for all querysets in our code
<ul><li>In Django, not all operations are performed on objects; for example, you can apply a filter to a QuerySet and then call <code class="language-plaintext highlighter-rouge">delete()</code> on it, which performs a single DB query with that filter and never loads the actual objects.</li>
</ul></li>
<li>Added post_save and pre_delete signal handlers for models we don’t control (like the <code class="language-plaintext highlighter-rouge">User</code> model or third party models) that do the same
<ul><li>We could have used this technique for all models, but we felt that having the logic inside the model where we could was easier to reason about and kept the DB writes as close together as possible.</li>
</ul></li>
<li>Replaced the default Django transaction decorator with a decorator that nested a Postgres transaction inside a MySQL transaction. This meant that almost any DB failure would roll back both DBs, as long as we were in a transaction.
<ul><li>The exception is for failures at MySQL commit-time; the logic here is that DB triggers made Postgres commits sometimes fail, while MySQL commits should always succeed unless there’s an infra issue. We learned this the hard way after originally having the order reversed in an attempt to reduce the risk of introducing new failing Postgres writes during the rollout, and then having some transactions fail due to DB triggers after committing writes to MySQL, leading to inconsistent data across the databases. This is an interesting example where “playing it safe” in one dimension actually caused a bug.</li>
</ul></li>
</ul><p>For reads, we:</p><ul><li>Added logic to the router (the Django class which determines which DB we read/write to) to separate the ‘read db’ and the ‘write db’</li>
<li>Added middleware to set a flag if and when we wanted a request to read from MySQL, which was respected by the router
<ul><li>This flag was set before any DB reads/writes in the middleware stack, to ensure each request only reads from one DB</li>
</ul></li>
</ul><p>During the release, we first kept reads on Postgres, to keep behavior identical to the status quo while also writing to MySQL. This let us cross-check the databases and fix inconsistencies and bugs at our leisure without affecting customers. We then gradually switched requests to read from MySQL, then switched the write logic to write to MySQL first, and finally (several months later) turned off Postgres writes entirely and cleaned up much of the code we had written.</p><p>The release process went relatively smoothly over the course of several months, with a few surprises we describe below.</p><ul><li>
<p>Originally, we planned a transition period where the ‘primary’ database could differ on a per-request basis. However, this causes issues with autoincrement primary keys. Specifically, PostgreSQL maintains a sequence that’s incremented only when a row is inserted without the primary key set. This means that you should either always set the key, or never set the key. Otherwise, each write with the key set (like when we write to MySQL and then save the object to Postgres) leaves the sequence behind the table’s actual maximum key, so a future Postgres write with the key unset will eventually generate a duplicate key and fail. This took some time to figure out during rollout, as the symptom was a small number of errors in status quo flows, but no errors in the MySQL-pinned requests.</p>
</li>
<li>
<p>Django names DB savepoints with random strings. ProxySQL, which in our infrastructure sits between clients and the databases, stores query digests for use in metrics. These digests are meant to be generic representations of queries and not depend on the actual data written or read, but savepoint names are included in the digests, leading to each query using a savepoint having a unique digest. This led to escalating ProxySQL memory usage and a few instances of production issues until we figured it out. We fixed this by changing a setting in our ProxySQL instances.</p>
</li>
<li>Switching from ‘bulk’ operations to individual object-level operations is significant and can lead to logic issues (since things like <code class="language-plaintext highlighter-rouge">save()</code> aren’t called in bulk operations) and performance issues (since an order of magnitude more DB queries could be executed).
<ul><li>In a single instance, this meant rewriting an archival batch job to use raw SQL, but otherwise it turned out that MySQL could easily handle the volume of writes we do.</li>
</ul></li>
<li>
<p>Performing an initial data load (backfilling) early is very useful for testing, but we should have done a full, clean, second backfill once we fixed all the bugs. Instead, we patched broken data as we discovered it, which led to bugs triggered by data written by code that had since been fixed; a clean second backfill would have avoided those.</p>
</li>
<li>
<p>Don’t overlook other users of the database. Our analytics pipeline was getting data directly from Postgres, and moving it over to use MySQL ended up being time-consuming. This was the final blocker to decommissioning the old database.</p>
</li>
<li>Using the Yelp-standard stack improved performance. This isn’t due to MySQL being inherently more performant, but by using the same stack the company uses, we benefit from many people’s efforts monitoring and optimizing our database performance.</li>
</ul><p>This was a large project that took the better part of a year to implement. In the interests of brevity, I’ve focused on a subset of the work Restaurants did, but it was all vital for the success of this project. Special thanks to the Database Reliability Engineering and Production Engineering teams, and everyone from Restaurants who worked on this, especially Boris Madzar, Carol Fu, and Daniel Groppe.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp. If you're interested, apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2024/10/migrating-from-postgres-to-mysql.html</link>
      <guid>https://engineeringblog.yelp.com/2024/10/migrating-from-postgres-to-mysql.html</guid>
      <pubDate>Mon, 07 Oct 2024 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Boosting ML Pipeline Efficiency: Direct Cassandra Ingestion from Spark]]></title>
      <description><![CDATA[Machine Learning Feature Stores ML Feature Store at Yelp Many of Yelp’s core capabilities such as business search, ads, and reviews are powered by Machine Learning (ML). In order to ensure these capabilities are well supported, we have built a dedicated ML platform. One of the pillars of this infrastructure is the Feature Store, which is a centralized data store for ML Features that are the input of ML models. Having a centralized dedicated datastore for ML Features serves a number of purposes: Data Quality and Data Governance Feature discovery Improved operational efficiency Availability of Features in every required environment...]]></description>
      <link>https://engineeringblog.yelp.com/2024/09/boosting-ml-pipeline-efficiency-direct-cassandra-ingestion-from-spark.html</link>
      <guid>https://engineeringblog.yelp.com/2024/09/boosting-ml-pipeline-efficiency-direct-cassandra-ingestion-from-spark.html</guid>
      <pubDate>Thu, 19 Sep 2024 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Boosting ML Pipeline Efficiency: Direct Cassandra Ingestion from Spark]]></title>
      <description><![CDATA[<h2 id="ml-feature-store-at-yelp">ML Feature Store at Yelp</h2><p>Many of Yelp’s core capabilities such as business search, ads, and reviews are powered by Machine Learning (ML). In order to ensure these capabilities are well supported, we have built a dedicated ML platform. One of the pillars of this infrastructure is the Feature Store, which is a centralized data store for ML Features that are the input of ML models.</p><p>Having a centralized dedicated datastore for ML Features serves a number of purposes:</p><ul><li>Data Quality and Data Governance</li>
<li>Feature discovery</li>
<li>Improved operational efficiency</li>
<li>Availability of Features in every required environment</li>
</ul><p>ML Models at Yelp are usually trained on historical data and used for inference in real-time systems. Thus we need to be able to serve the Features (which are the inputs for the model both during inference and during training) in real time during inference, and as a historical log of all previous values during training.</p><p>The Feature Store is an abstraction over real-time and historical datastores to provide a unified Feature API to the models.</p><p>Specifically, our historical Feature Store is implemented in our <a href="https://engineeringblog.yelp.com/2021/04/powering-messaging-enabledness-with-yelps-data-infrastructure.html">Data Lake</a> and the real-time Feature Stores are implemented in <a href="https://engineeringblog.yelp.com/2020/11/orchestrating-cassandra-on-kubernetes-with-operators.html">Cassandra</a> or <a href="https://engineeringblog.yelp.com/2021/09/nrtsearch-yelps-fast-scalable-and-cost-effective-search-engine.html">NrtSearch</a>.</p><p>Here, we will discuss how we improved the automated data sync from the historical Feature Store in the Data Lake to the online Feature Store in Cassandra. At a high level, the data movement is carried out by Sync jobs between the different data stores, as depicted below.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-08-27-boosting-ml-pipeline-efficiency-direct-cassandra-ingestion-from-spark/feature_store_sync_job.png" alt="Feature Store Sync Job" /></p><p>Our <a href="https://engineeringblog.yelp.com/amp/2022/08/spark-data-lineage.html">Spark ETL framework</a>, an in-house wrapper around PySpark, did not support direct interactions with any of the online datastores.
So any writes had to be routed through <a href="https://engineeringblog.yelp.com/2016/07/billions-of-messages-a-day-yelps-real-time-data-pipeline.html">Yelp’s Data Pipeline</a>.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-08-27-boosting-ml-pipeline-efficiency-direct-cassandra-ingestion-from-spark/flow_before.png" alt="Flow before" /></p><p>Thus, publishing ML Features to Cassandra required a longer route involving multiple steps:</p><ol><li>Create a <strong>Sync job</strong> that reads Features from Data Lake and republishes them to Data Pipeline.</li>
<li>Create and register an <a href="https://avro.apache.org/docs/1.11.1/specification/"><strong>Avro Schema</strong></a>, which is required for publishing data to the Data Pipeline.</li>
<li><strong>Schedule</strong> the Spark job in <a href="https://engineeringblog.yelp.com/2010/09/tron.html">Tron</a>, our centralized scheduling system.</li>
<li>Make <strong>Schema changes</strong> to add the new Feature columns in Cassandra. We have strict controls in place around Cassandra Schema changes at Yelp which require following a separate process.</li>
<li>Create a <strong>Cassandra Sink connection</strong> to push the data into Cassandra from the Data Pipeline.</li>
</ol><p><img src="https://engineeringblog.yelp.com/images/posts/2024-08-27-boosting-ml-pipeline-efficiency-direct-cassandra-ingestion-from-spark/dev_complexity_before.png" alt="Dev complexity before" /></p><p>This process had a few disadvantages.</p><ul><li>The data first needs to be duplicated into the Data Pipeline, which has some cost implications.</li>
<li>The engineer would have to ensure all five steps are executed successfully when publishing the Features.</li>
<li>Our <strong>Cassandra Sink Connector</strong> relies on eventual publishing of data from the Data Pipeline. This means engineers often have less visibility into when the Feature is completely published and available for reads from Cassandra.</li>
</ul><p>In order to deal with the above challenges, the Cassandra datastore was made a first-class citizen in the Spark ETL framework. This support is built on top of the <a href="https://github.com/datastax/spark-cassandra-connector">open source Spark Cassandra Connector</a>, allowing us to ingest Spark dataframes into Cassandra tables as well as extract data from Cassandra into Spark dataframes.</p><p>One of the key considerations when supporting Direct Feature Publication was to avoid any impact on the live traffic that our Cassandra clusters serve. One option we considered was spinning up a dedicated datacenter for Spark workloads. We ruled that out primarily for the following two reasons.</p><ol><li>Running Cassandra clusters would contribute additional costs.</li>
<li>Our Spark workloads rely more on writes to Cassandra than on reads. As data needs to be replicated across datacenters, having a dedicated datacenter doesn’t add much value.</li>
</ol><p>Throughout the rest of the article, we are going to focus on the Cassandra publisher aspects only. A number of design decisions made to ensure the reliability of our Cassandra production fleet are discussed below.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-08-27-boosting-ml-pipeline-efficiency-direct-cassandra-ingestion-from-spark/cassandra_perspective.png" alt="Cassandra perspective" /></p><h2 id="batch-mode-disabled-for-cassandra-writes">Batch Mode Disabled for Cassandra Writes</h2><p>From our experiments, we found that a Spark dataframe could be partitioned by a column that isn’t a partition key in Cassandra. This means that if we enabled batching without re-partitioning, a single request to Cassandra from a Spark job could involve multiple different partitions. Re-partitioning the Spark dataframe appeared excessive here, so we kept batching disabled for Cassandra writes.</p><h2 id="limiting-concurrent-writers">Limiting Concurrent Writers</h2><p>Another control we implemented was to limit the number of concurrent writers to Cassandra, to avoid putting pressure on Cassandra’s Native Transport Request (NTR) queue, instead letting Cassandra’s backpressure handle the load.</p><p>One of the major challenges was preventing Spark jobs from overloading the Cassandra cluster. The online nature of the datastore means the impact would be sudden and obvious. This was a challenge as there’s no adaptive rate control mechanism in the Spark Cassandra Connector (<a href="https://datastax-oss.atlassian.net/browse/SPARKC-594">SPARKC-594</a>). The Spark Cassandra Connector provides static rate-limiting configurations, but those are defined at the per-executor-core level (per Spark task). These configuration options look like:</p><div class="language-plaintext highlighter-rouge highlight"><pre>spark.cassandra.output.throughputMBPerSec
spark.cassandra.output.concurrent.writes
</pre></div><p>A couple of situations where a Cassandra cluster can be stressed include:</p><ol><li>A Spark job launched with a large number of cores/executors, which means there are a large number of parallel workers ingesting into or reading from the Cassandra cluster.</li>
<li>There are many Spark jobs launched in parallel interacting with a particular Cassandra cluster.</li>
</ol><p>To avoid these situations, we configured a few tuning parameters for Spark jobs. A major one was the capability to rate-limit a Spark job irrespective of the number of executors or cores launched. However, with Spark’s <a href="https://spark.apache.org/docs/3.4.1/job-scheduling.html/#dynamic-resource-allocation">Dynamic Resource Allocation</a> (DRA) enabled, it’s tricky to get the exact number of resources. Therefore, we computed the maximum possible executor cores as follows.</p><p><strong><em>max.executor.cores = min(max.executors * max.cores, max.spark.partitions)</em></strong></p><h2 id="limiting-number-of-concurrent-spark-jobs">Limiting Number of Concurrent Spark Jobs</h2><p>To effectively limit the number of concurrent Spark jobs accessing a Cassandra cluster, we needed a concurrency control mechanism. We implemented it with distributed locks in Zookeeper. In addition, we kept the <em>lock contention</em> time configurable so that Spark jobs can wait in case the semaphore lock is fully acquired. The positioning of the lock acquisition mattered: we deliberately acquired it just before initiating the Spark job, to prevent a scenario where resources are allocated but remain idle in a waiting state. The potential request-handling capacity of a Cassandra cluster is proportional to the computational resources allocated to it, so we kept the semaphore’s maximum count configurable.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-08-27-boosting-ml-pipeline-efficiency-direct-cassandra-ingestion-from-spark/concurrent_jobs.png" alt="Concurrent Spark jobs" /></p><p>Direct Feature publication to Cassandra yields some significant benefits, which are discussed below.</p><h2 id="infrastructure-cost-savings">Infrastructure Cost Savings</h2><p>We used to have 4 different components that contributed to the cost of moving a feature.
These included:</p><ul><li>The cost of computational resources allocated for executing a <strong>Spark</strong> job.</li>
<li>The cost of storing data inside Yelp’s <strong>Data Pipeline</strong>.</li>
<li>The cost associated with the Cassandra <strong>Sink Connection</strong> for ingesting data from the Data Pipeline into Cassandra.</li>
<li>The cost of I/O Operations on the <strong>Cassandra</strong> side for publishing data.</li>
</ul><p><img src="https://engineeringblog.yelp.com/images/posts/2024-08-27-boosting-ml-pipeline-efficiency-direct-cassandra-ingestion-from-spark/flow_after.png" alt="Flow after" /></p><p>Using direct Feature publication, we observe the following improvements:</p><ul><li>Spark jobs now take longer to complete, but they use far fewer executors.</li>
<li>The Data Pipeline is <strong>eliminated completely.</strong></li>
<li>The Cassandra Sink Connection is <strong>eliminated completely.</strong></li>
<li>The cost of I/O Operations in Cassandra remains almost unchanged.</li>
</ul><p>Overall, we observed around <strong>30% in ML Infrastructure Cost Savings</strong>.</p><h2 id="developer-velocity">Developer Velocity</h2><p>There were also benefits in terms of Engineering Efficiency. Compared to the previous mechanism, engineers can worry less about setting up the Sink Connections for Cassandra. The definition of Avro Schemas was also downgraded from a <em>hard requirement</em> to a <em>soft requirement</em>, mainly assisting the engineer in early data validation and verification. These Avro Schemas were primarily needed to define schemas for Yelp’s Data Pipeline (more details can be found in <a href="https://engineeringblog.yelp.com/2016/08/more-than-just-a-schema-store.html">our Schema Store blog</a>). In total, there’s a <strong>25% improvement in engineering effectiveness</strong> with respect to Feature publishing.</p><h2 id="developer-visibility">Developer Visibility</h2><p>As mentioned earlier, our Cassandra Sink Connector relies on eventual publication of data into the Data Pipeline. This made it slightly more complicated for developers to track when a Feature was completely published to Cassandra. Relying on direct ingestion to Cassandra means data is readily available for reads as soon as the Spark job succeeds.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-08-27-boosting-ml-pipeline-efficiency-direct-cassandra-ingestion-from-spark/dev_complexity_after.png" alt="Dev complexity after" /></p><p>Reducing the complexity of Feature publishing also improved the maintainability of the Feature Store systems.</p><p>The transition to direct publication to Cassandra has yielded considerable advantages, enhancing engineering effectiveness and reducing overall infrastructure costs. However, adaptive rate-limiting in the Spark Cassandra Connector would have helped us improve the developer experience further.
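The static cap described earlier, deriving a worst-case core count under Dynamic Resource Allocation and dividing a job-wide budget across it, can be sketched as follows (the 64 MB/s budget and the specific executor counts are invented figures for illustration):

```python
# Sketch of job-level rate limiting built from the formula in this post:
#   max.executor.cores = min(max.executors * max.cores, max.spark.partitions)
# The cluster budget below is an assumed number, not a Yelp setting.
def max_executor_cores(max_executors, max_cores_per_executor, max_spark_partitions):
    # Worst-case number of concurrently running executor cores under DRA.
    return min(max_executors * max_cores_per_executor, max_spark_partitions)


def per_core_throughput_mb(job_budget_mb_per_sec, total_cores):
    # spark.cassandra.output.throughputMBPerSec applies per executor core,
    # so the job-wide budget is divided by the worst-case core count.
    return job_budget_mb_per_sec / total_cores


cores = max_executor_cores(max_executors=10,
                           max_cores_per_executor=4,
                           max_spark_partitions=32)
setting = per_core_throughput_mb(job_budget_mb_per_sec=64.0, total_cores=cores)
```

Because the per-core setting is derived from the worst case, the job can never exceed the budget even if DRA scales it all the way up.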
A future potential improvement is a switch to <a href="https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics">reading/writing data with Spark Bulk Analytics</a>. This will allow us to bypass Cassandra’s Native Transport Request limits, and theoretically the read/write throughput can reach the maximum supported by the hardware (i.e., the disks).</p><p>We would like to thank Adel Atallah, Manpreet Singh and Talal Riaz for their contribution towards the successful completion of this work.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp. If you're interested, apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2024/08/boosting-ml-pipeline-efficiency-direct-cassandra-ingestion-from-spark.html</link>
      <guid>https://engineeringblog.yelp.com/2024/08/boosting-ml-pipeline-efficiency-direct-cassandra-ingestion-from-spark.html</guid>
      <pubDate>Wed, 28 Aug 2024 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[dbt Generic Tests in Sessions Validation at Yelp]]></title>
      <description><![CDATA[Sessions, Where Everything Started For the past few years, Yelp has been using dbt as one of the tools to develop data products that power data marts, which are one-stop shops for high-visibility dashboards pertaining to top-level business metrics. One of the key data products that’s owned by my team, Clickstream Analytics, is the Sessions Data Mart. This product is our in-house solution to understand what consumers do during their session interaction with Yelp products and provide insights on top of it. This blog post will walk you through how dbt is used as an important test...]]></description>
      <link>https://engineeringblog.yelp.com/2024/08/dbt-Generic-Tests-in-Sessions-Validation-at-Yelp.html</link>
      <guid>https://engineeringblog.yelp.com/2024/08/dbt-Generic-Tests-in-Sessions-Validation-at-Yelp.html</guid>
      <pubDate>Wed, 14 Aug 2024 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Implementing multi-metric scaling: making changes to legacy code safely]]></title>
      <description><![CDATA[<p>We’re excited to announce that multi-metric horizontal autoscaling is available for all services at Yelp. This allows us to scale services using multiple metrics, such as the number of in-flight requests and CPU utilization, rather than relying on a single metric. We expect this to provide us with better resilience and faster recovery during outages.</p><p>This year, PaaSTA (Yelp’s platform-as-a-service, which we use to manage all of the applications running on our infrastructure) turns eleven years old! The first commit was on August 20th, 2013, and the first <a href="https://github.com/Yelp/paasta/commit/201646347d4f8f630cbda979dadc15839f963008">public commit</a> was on October 22nd, 2015. That’s over half of Yelp’s lifetime! It’s quite remarkable that this tool has lasted for so long without being replaced by something else. We think its longevity really speaks to the vision and skill of the original PaaSTA authors. Of course, PaaSTA has changed a lot since then, and in this post we discuss how we were able to make a potentially risky change to a bit of legacy PaaSTA code without causing any downtime (our approach in this project was heavily inspired by <a href="https://gaultier.github.io/blog/you_inherited_a_legacy_cpp_codebase_now_what.html">Philippe Gaultier’s post</a> on making changes to legacy codebases, though PaaSTA is in a lot better shape than what he described).</p><p>The feature that we wanted to implement is called “multi-metric scaling.” PaaSTA serves as a platform on top of <a href="http://kubernetes.io">Kubernetes</a>, and as such, it uses the Kubernetes <a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/">Horizontal Pod Autoscaler</a> (HPA) to scale applications running on the platform in response to load. 
In essence, the HPA watches a variety of metrics such as CPU utilization, worker thread count, and others, and uses those input metrics to determine the number of pods (or replicas) that an application should run at any point in time.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2024-08-07-multi-metric-paasta/jira1.png" alt="A screenshot from Jira showing the ticket for supporting multiple metrics in PaaSTA was created on 2021-01-07" /></p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-08-07-multi-metric-paasta/jira2.png" alt="A screenshot from Jira showing the ticket for supporting multiple metrics in PaaSTA was created on 2021-01-07." /><p class="subtle-text"><small>A screenshot from Jira showing the ticket for supporting multiple metrics in PaaSTA was created on 2021-01-07.</small></p></div><p>But here we are in 2024, and PaaSTA still doesn’t support multi-metric scaling, despite having it on its roadmap since at least 2021! Why not?</p><p>The technical change here wasn’t hard. Internally, PaaSTA already represented the HPA metrics source <a href="https://github.com/Yelp/paasta/blob/d945ff1af51a0703379419881f9d9a2ae7c69bac/paasta_tools/kubernetes_tools.py#L815">as an array</a>: an array containing only a single element. “All” we needed to do was expose this functionality to our developers. The challenge here was twofold. Firstly, depending on the implementation, this could result in a non-backwards-compatible API change. Secondly, changes to autoscaling are always scary because they have the potential to cause an outsized impact on the systems relying on them. So how did we manage it?</p><p>In the first two weeks of the project, we didn’t write a line of code. We had a lot of experience with PaaSTA in the past, but our knowledge was several years out of date, so we spent most of our time reading code, asking questions, writing docs, and getting buy-in for the work. 
It was clear early on that the biggest concern people had was how this work would impact on-call load, particularly since we wouldn’t be on-call for the resulting changes. Having seen first-hand the damage bad autoscaling changes can cause, we weren’t about to argue with this! Additionally, we needed to ensure that several systems interacting with our autoscaling services could handle these changes. For example, Yelp has a system which we call Autotune that automatically rightsizes resource allocations for our workloads, and this system has special-cased behavior for the various types of service autoscaling that we support.</p><p>The plan we proposed had several steps. Since this was a legacy part of the PaaSTA codebase, the first change we wanted to make was to clean up some of the deprecated or old features. We hoped that doing so would make the codebase easier to work with and rebuild our familiarity with making changes to this code. Next, we suggested adding increased validation via the <code class="language-plaintext highlighter-rouge">paasta validate</code> command. This command is intended to verify that the configs in the services’ <a href="https://paasta.readthedocs.io/en/latest/soa_configs.html">soaconfigs</a> directory are correct, but the validation around the autoscaling configuration was fairly lax. By providing much stricter validation, we would be able to ensure that application owners couldn’t accidentally make incorrect changes to their service configuration, thereby improving safety and reliability overall. Lastly, we suggested that we spend time improving our dashboards and alerts around the HPA, to help us understand what “baseline” behavior was before making any changes.</p><p>The actual API change we agreed upon was straightforward. We would change this code:</p><div class="language-yaml highlighter-rouge highlight"><pre>autoscaling:
  metrics_provider: cpu
  setpoint: 0.8
</pre></div><p>to this:</p><div class="language-yaml highlighter-rouge highlight"><pre>autoscaling:
  metrics_providers:
    - type: cpu
      setpoint: 0.8
</pre></div><p>Since we control both PaaSTA and the soaconfigs directory, we can make a non-backwards-compatible change like this more easily. The procedure we proposed was to have PaaSTA temporarily support both the old and new autoscaling formats, then migrate everything in soaconfigs to the new format, and then remove support from PaaSTA for the old format. Once all of these changes were made, we were finally in a position to start adding multi-metric support to the handful of applications that needed it.</p><p>You might have noticed that 90% of the work we wanted to do was just to change the API. The underlying metrics sources wouldn’t change for any PaaSTA service until the very last step! This meant that, even though the code was old and potentially hard to reason about, we could use <a href="https://insta.rs/">snapshot testing</a> to ensure that behaviors didn’t change. (Snapshot testing is the process of recording the output, or “snapshot,” of a program, and then comparing future output from that program to the snapshot.) So that’s what we did.</p><p>PaaSTA uses the <a href="https://github.com/kubernetes-sigs/prometheus-adapter">Prometheus Adapter</a> to collect metrics from Prometheus (our timeseries metrics datastore) and forward them to the HPA. The Prometheus Adapter takes as input a large list of <a href="https://prometheus.io/docs/prometheus/latest/querying/basics/">PromQL</a> queries that configure the metric values seen by the HPA. So, in the first phase of the project, we figured out how to generate the Prometheus adapter config locally. Ultimately, if the Prometheus Adapter config didn’t change, then the HPA behavior wouldn’t change! We could simply compare the before-and-after output of this config file for each change that we wanted to make to ensure that the overall system would stay stable.</p><p>We also set up some dashboards using <a href="https://grafana.com/">Grafana</a> to monitor the HPA. 
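Temporarily supporting both autoscaling formats can be done by normalizing the legacy shape into the new list-based shape at parse time. The following is a hypothetical sketch, not PaaSTA’s actual implementation (the `normalize_autoscaling` helper and the default values are assumptions):

```python
from typing import Any, Dict, List


def normalize_autoscaling(config: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Accept either the legacy single-metric autoscaling format or the
    new list-based format, and always return the new list form.

    Legacy:  {"metrics_provider": "cpu", "setpoint": 0.8}
    New:     {"metrics_providers": [{"type": "cpu", "setpoint": 0.8}]}
    """
    if "metrics_providers" in config:
        # Already in the new format: pass through unchanged.
        return config["metrics_providers"]
    # Legacy format: wrap the single provider in a one-element list,
    # mirroring the single-element array already used internally.
    return [{
        "type": config.get("metrics_provider", "cpu"),
        "setpoint": config.get("setpoint", 0.8),
    }]
```

With a shim like this in place, soaconfigs can be migrated piecemeal; once everything emits the new shape, the legacy branch can be deleted.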
Since we knew that we would eventually have different services using different combinations of metrics sources, we used Grafana’s “<a href="https://grafana.com/blog/2020/06/09/learn-grafana-how-to-automatically-repeat-rows-and-panels-in-dynamic-dashboards/">Panel Repeat</a>” feature to automatically detect which metric sources a particular application was using and show only the panels relevant to those sources. Even though nothing was using multiple metrics yet, we wanted to have the dashboards in place for when we started.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-08-07-multi-metric-paasta/scaling.png" alt="A screenshot of a Grafana dashboard showing multiple input metric sources for a single application." /><p class="subtle-text"><small>A screenshot of a Grafana dashboard showing multiple input metric sources for a single application.</small></p></div><p>Once all that was in place, we were ready to make the actual changes. Because we were constantly testing against “what was in prod,” we didn’t want to use a static snapshot that might give us an incorrect output or test against old versions of soaconfigs. Instead, for each change, we followed this process: we checked out the “main” branch, generated a snapshot, checked out our feature branch, generated another snapshot, and finally compared the two.</p><p>The command that we used to generate our snapshots looked like this:</p><div class="language-plaintext highlighter-rouge highlight"><pre>paasta list-clusters | xargs -I{} bash -c "python -m paasta_tools.setup_prometheus_adapter_config -d \
    ~/src/yelpsoa-configs -c {} --dry-run &amp;&gt; ~/tmp/{}-prom-conf-rules"
</pre></div><p>This is the shortest part of the blog post, because it was the least eventful part of the project. Using the graphs and snapshot testing framework described above, we were able to get all of our soaconfigs migrated to the new API format, and a handful of services are now using the Kubernetes multi-metric scaling feature, all with no downtime or outages. As it turns out, there was nothing magical or particularly hard about the rollout; just a careful application of testing and a close eye on our graphs and charts.</p><div class="island job-posting"><h3>Become a Software Engineer at Yelp</h3><p>Want to help us make even better tools for our full stack engineers?</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/bd07a618-9b6f-4920-91c6-99280f1b268d?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2024/08/multi-metric-paasta.html</link>
      <guid>https://engineeringblog.yelp.com/2024/08/multi-metric-paasta.html</guid>
      <pubDate>Wed, 07 Aug 2024 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Fine-tuning AWS ASGs with Attribute Based Instance Selection]]></title>
      <description><![CDATA[This is the next installment of our blog series on improving our autoscaling infrastructure. In the previous blog posts (Open-sourcing Clusterman, Recycling kubernetes nodes) we explained the architecture and inner workings of Clusterman. This time we are discussing how attribute-based instance selection in the autoscaling group has helped us make our infrastructure more reliable and cost-effective, while also decreasing operational overhead. This will also cover how these changes enabled us to migrate from Clusterman to Karpenter. (Spoiler alert: Karpenter blog post is coming soon!) Motivation At Yelp we run most of our workload on AWS spot instances, and...]]></description>
      <link>https://engineeringblog.yelp.com/2024/05/fine-tuning-AWS-ASGs-with-attribute-based-instance-selection.html</link>
      <guid>https://engineeringblog.yelp.com/2024/05/fine-tuning-AWS-ASGs-with-attribute-based-instance-selection.html</guid>
      <pubDate>Wed, 01 May 2024 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Moderating Inappropriate Video Content at Yelp]]></title>
      <description><![CDATA[<p>One of Yelp’s top priorities is the <a href="https://trust.yelp.com/">trust and safety</a> of our users. Yelp’s platform is most well-known for its reviews, and its moderation practices have been recognised in <a href="https://blog.yelp.com/news/academic-research-finds-yelps-content-moderation-practices-mitigate-misinformation-and-build-consumer-trust/">academic research</a> for mitigating misinformation and building consumer trust. In addition to reviews, Yelp’s Trust and Safety team takes significant measures when it comes to protecting its users from inappropriate material posted through other content types. This blog post discusses how Yelp protects its users from inappropriate content in videos.</p><p>Recently, Yelp revamped its review experience by giving users the ability to <a href="https://blog.yelp.com/news/yelp-consumer-product-updates-april-2023/">upload videos</a> alongside their review text. This has led to a significant increase in the total number of videos uploaded to the platform.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-27-moderating-inappropriate-video-content-at-yelp/video-uploads-graph.png" alt="Starting April 2023, video uploads increased significantly at Yelp." /><p class="subtle-text"><small>Starting April 2023, video uploads increased significantly at Yelp.</small></p></div><p>Videos provide an immersive way to capture and share our experiences. However, this also opens the door to bad actors who may attempt to post disturbing videos to the platform. While such content is very rarely posted on Yelp’s platform, examples of such videos include:</p><ul><li>Nudity, sexual activity and suggestive material</li>
<li>Intense violence, graphic gore and disturbing scenes</li>
<li>Extremist imagery and hate symbols</li>
</ul><p>It is extremely important to Yelp to proactively prevent such videos from being displayed to users on our platform, which protects consumers and businesses alike.</p><p>Yelp has been committed to providing more value to consumers and businesses by leveraging AI. We recently announced how we are rapidly <a href="https://blog.yelp.com/news/yelp-enhances-ads-photos-search-waitlist-and-more-with-neural-networks-providing-more-value-for-consumers-and-businesses/">expanding the use of neural networks</a> to enhance ad relevance, search quality, and wait time estimates, among many others. AI-based systems also play a key role at Yelp to detect inappropriate content across various content types, from <a href="https://trust.yelp.com/recommendation-software/">reviews</a> to <a href="https://engineeringblog.yelp.com/2021/05/moderating-promotional-spam-and-inappropriate-content-in-photos-at-scale-at-yelp.html">photos</a>. Videos are no exception.</p><p>Any machine learning model has a non-zero chance of classifying a legitimate video as inappropriate; how often this happens is known as the false positive rate. On the other hand, a model’s recall — in this case the measure of how well it can correctly flag a problematic video — should be maximized. There is always a tradeoff between keeping recall high and the false positive rate low. While flagging and removing inappropriate content as swiftly as possible is extremely important, any model that incorrectly removes legitimate content can be extremely frustrating to users and can discourage them from actively participating on the platform. 
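To make this tradeoff concrete, here is a toy sketch (made-up scores and labels, not Yelp data or Yelp’s actual metrics code) showing how lowering the flagging threshold raises recall but also raises the false positive rate:

```python
def recall_and_fpr(scores, labels, threshold):
    """Recall and false positive rate for a given flagging threshold.

    scores: model scores in [0, 1]; labels: 1 = truly inappropriate.
    """
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return recall, fpr


# Toy data: a strict threshold misses one bad video; a loose
# threshold catches it but also flags more legitimate videos.
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]
strict = recall_and_fpr(scores, labels, 0.5)
loose = recall_and_fpr(scores, labels, 0.25)
```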
Therefore, in order to maintain a high recall and effectively handle false positives, we include human evaluation of flagged videos as part of our moderation pipeline.</p><p>Yelp’s <a href="https://trust.yelp.com/content-moderation/">User Operations team</a> strives to review flagged videos and promptly restore any false positives to enforce the <a href="https://www.yelp.com/guidelines">Content Guidelines</a> in a fair and effective manner. However, manual moderation can be time consuming and difficult to scale. On top of that, dealing with large volumes of false positives can be frustrating for employees. Therefore, even with human moderators in the loop, an effective content moderation system should keep the number of false positives to a minimum.</p><p>When a video is uploaded to the platform, the moderation pipeline kicks off in parallel to the video ingestion system. The video first gets checked by our matching service, which computes similarity hashes against other videos that were previously removed for violating content guidelines. Matched videos get automatically discarded, which helps manage overall moderation volume by blocking submissions from repeat offenders.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-27-moderating-inappropriate-video-content-at-yelp/video-moderation-pipeline.png" alt="An overview of the video moderation pipeline at Yelp." /><p class="subtle-text"><small>An overview of the video moderation pipeline at Yelp.</small></p></div><p>Videos that pass the check are then fed to a deep learning model, which returns a multi-label classification. If the classification scores are above our thresholds, the videos are hidden and sent to the User Operations team for review. These thresholds are carefully fine-tuned to keep false positives at a minimum, while still catching and flagging inappropriate content. 
Inappropriate videos are removed, whereas the ones that were incorrectly flagged are restored.</p><p>Moderating videos presents its own unique set of challenges. Videos are much larger in size than other common content types such as reviews and photos. As a result, it takes a lot more time to process and feed them through a neural network. However, it is important to have near real-time classification to remove inappropriate content as quickly as possible. One solution to this challenge is simply to reduce the number of videos going through the neural network by pre-emptively blocking uploads from users with suspicious activity patterns.</p><p>Another strategy to overcome this problem involves selectively sampling frames to pass through the deep learning model instead of passing all video frames. We ran experiments to find the optimal frame sampling technique and frequency that would minimize the inference time without sacrificing classification performance. The classification scores for the sampled frames are combined to give a final score.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-27-moderating-inappropriate-video-content-at-yelp/video-ml-model.png" alt="Sampled frames are fed into the model. The individual scores are combined to give a final score." /><p class="subtle-text"><small>Sampled frames are fed into the model. The individual scores are combined to give a final score.</small></p></div><p>The model used for classifying video frames is built upon the model currently in use for moderating photos, given the close similarities between the photos and video frames classification tasks. 
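The frame sampling and score combination described above could look roughly like the following sketch; the fixed sampling interval and the max-score aggregation are illustrative assumptions, since the post doesn’t specify the exact scheme Yelp chose:

```python
def sample_frame_indices(num_frames: int, every_nth: int) -> list:
    """Select every Nth frame index instead of scoring all frames."""
    return list(range(0, num_frames, every_nth))


def video_score(frame_scores: dict, sampled: list) -> float:
    """Combine per-frame scores into a single video-level score.

    Taking the max is one conservative aggregation choice: a single
    highly inappropriate sampled frame flags the whole video.
    """
    return max(frame_scores[i] for i in sampled)


# Toy example: a 300-frame video, scoring one frame out of every 30.
sampled = sample_frame_indices(300, 30)
scores = {i: 0.02 for i in sampled}  # mostly benign frames
scores[120] = 0.97                   # one sampled frame looks bad
final = video_score(scores, sampled)
```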
The photo moderation model has an excellent track record when it comes to protecting Yelp from inappropriate photos, and building on top of it helps us minimize engineering development costs and maintenance burden.</p><p>At Yelp, trust and safety is a top priority and we are committed to protecting our consumers and business owners. As video submissions to the platform grow, a robust and efficient moderation system is more important than ever, which is why Yelp combines automated and human moderation to protect our platform from inappropriate videos. The Trust &amp; Safety team continuously strives to improve its moderation pipelines to keep Yelp one of the most trusted review platforms on the web.</p><p>This project would not have been possible without the support and collaboration from the Yelp Connect and Consumer Contributions teams. Special thanks to Marcello Tomasini, Gouthami Senthamaraikkannan, Jonathan Wang, Jiachen Zhao, Sandhya Giri, Curtis Wong, and Anka Granovskaya for contributing to the design and implementation of the pipeline.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp. If you're interested, apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2024/03/moderating-inappropriate-video-content-at-yelp.html</link>
      <guid>https://engineeringblog.yelp.com/2024/03/moderating-inappropriate-video-content-at-yelp.html</guid>
      <pubDate>Wed, 27 Mar 2024 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Phone Number Masking for Yelp Services Projects]]></title>
      <description><![CDATA[<p>In this blog post, we highlight how phone number masking helps build consumer trust in the services marketplace at Yelp, decreases the friction in communication with service professionals, and allows for seamless switching between the Yelp app and a user’s phone. We present a high-level overview of our in-house phone masking system and dive into the details of the engineering challenge of optimizing the usage of proxy phone number resources at Yelp’s scale.</p><p>Every year, millions of requests for quotes, consultations or other messages are sent to businesses on Yelp. These users choose Yelp to connect with local services professionals for their projects because they value our <a href="https://trust.yelp.com/">trustworthy reviews</a> and seamless search experience. Yelp also provides users with a dedicated <a href="https://blog.yelp.com/news/yelp-introduces-projects/">project workspace</a> where they can outline their request, use our Request a Quote product to get matched with relevant businesses, and use our in-app messaging platform to easily compare quotes and communicate with pros.</p><p>While the messaging platform is a convenient tool for communication, we’ve observed the following pain points:</p><ul><li>Some customers, especially new users, may not be in the habit of checking the Yelp app for new business replies.</li>
<li>Businesses may sometimes feel that communicating via a phone call is more engaging and personal.</li>
<li>When it comes to more urgent or complex projects, it can be inefficient to describe the issue in a message.</li>
</ul><p>To remedy these pain points, we’ve seen a lot of businesses simply ask for the consumer’s phone number. While this solves the problems above, many customers may feel reluctant to share their contact information out of concern for receiving spam calls and unsolicited promotional messages. They want to feel confident that they can trust the business before providing a phone number.</p><p>To facilitate communication between customers and businesses via phone calls, while providing peace of mind that the user’s number is protected, Yelp recently introduced an evolution of the <a href="https://blog.yelp.com/news/yelp-launches-request-a-call-to-help-consumers-connect-quickly-and-seamlessly-with-services-businesses/">Request a Call</a> feature where customers can communicate with pros via masked phone numbers through both calls and SMS. Upon submitting a Request a Quote, the user can opt in to receiving calls and texts about their project. If they opt in to sharing their number, Yelp assigns a temporary masked number to the customer and the business, which allows both parties to communicate seamlessly through calls, SMS, and the Yelp app.</p><div class="two-images-parent"><div class="image-caption two-images-child"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-26-phone-number-masking-for-yelp-services-projects/phone-masking-opt-in-screen.png" alt="image" /></div><div class="image-caption two-images-child"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-26-phone-number-masking-for-yelp-services-projects/call-text-in-app.png" alt="image" /></div><p class="subtle-text"><small>After the customer shares and enters their phone number (left), the business can call and text the customer’s masked phone number (right).</small></p></div><p>Masked phone numbers provide the following benefits over calling or texting directly:</p><ul><li><strong>Privacy</strong>: Neither party’s real phone number is shared with the other, and both can opt out of 
communicating via phone calls at any time.</li>
<li><strong>Protection</strong>: Masked numbers cannot be shared with third parties—only the business can reach the customer through the masked number, and vice versa.</li>
<li><strong>Continuity</strong>: The full history of texts and calls is mirrored on both the app and the user’s phone, which allows for easy switching between the communication channels.</li>
</ul><div class="two-images-parent"><div class="image-caption two-images-child"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-26-phone-number-masking-for-yelp-services-projects/sms-thread.png" alt="image" /></div><div class="image-caption two-images-child"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-26-phone-number-masking-for-yelp-services-projects/app-view.png" alt="image" /></div><p class="subtle-text"><small>The conversation history between the customer and the business is synced between the SMS messages and the Yelp messaging platform.</small></p></div><p>In the next sections, we’ll take you through a high level overview of Yelp’s phone masking process, and highlight the key technical design decisions that we made in order to build consumer trust and provide the convenient benefits outlined above, while minimizing system costs and enabling the system to be scaled to Yelp’s large user base.</p><p>Fortunately for us, when it comes to working with phone numbers there is no need to start from scratch. Telephony API providers make it easy to purchase phone numbers, send or receive SMS messages, and initiate or receive phone calls. Additionally, they allow a phone number’s owner to react immediately to any event that occurs on the number, like an incoming call, through sending webhooks to a custom URL and accepting a response with custom instructions on how to handle the event. For example, the incoming call can get an automatic response or could be redirected to another number.</p><p>Using these building blocks, setting up a phone number masking application is straightforward. Two parties can have a phone call or engage in an SMS conversation through a proxy number without revealing their real numbers. We can simply forward the messages from one number to the other when receiving a webhook. 
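A minimal, provider-agnostic sketch of that forwarding logic follows; the in-memory session store and field names here are assumptions for illustration, not Yelp’s actual schema or a real telephony provider’s API:

```python
# Hypothetical masking sessions keyed by proxy number: each proxy
# links one customer's real number to one business's real number.
SESSIONS = {
    "+15550100": {"customer": "+15551234", "business": "+15559876"},
}


def route_inbound(proxy_number: str, from_number: str):
    """Given an inbound call/SMS webhook, decide where to forward it.

    Only the two parties in the session may use the proxy number;
    events from anyone else are dropped.
    """
    session = SESSIONS.get(proxy_number)
    if session is None:
        return None  # unknown proxy number
    if from_number == session["customer"]:
        return session["business"]
    if from_number == session["business"]:
        return session["customer"]
    return None  # third parties cannot reach either side


# The business texts the proxy number; we forward to the customer.
target = route_inbound("+15550100", "+15559876")
```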
<strong>And it all works seamlessly as if you’re communicating directly with the other person.</strong></p><p>Below is a high-level architecture of how Yelp integrated with a telephony API provider to offer the utility of phone masking, while keeping all phone events in sync with the Yelp conversation.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-26-phone-number-masking-for-yelp-services-projects/high-level-phone-masking-flow.png" alt="For calls, we proxy the call immediately to the second number and reflect the call on the user’s inbox. For SMS, the flow looks very similar, except instead of a call we persist and forward a text message." /><p class="subtle-text"><small>For calls, we proxy the call immediately to the second number and reflect the call on the user’s inbox. For SMS, the flow looks very similar, except instead of a call we persist and forward a text message.</small></p></div><p>The only thing left is an appropriate data model to integrate phone masking with the Yelp conversation. Our key requirement is that only the Yelp business can call the proxy number to reach the customer, and vice versa. Therefore, we need a data model which encapsulates the customer and business numbers and links them via a proxy number, so that we can route messages and calls to the proxy number to the intended recipient. We call this model a “masking session” because it provides a temporary connection between the two real numbers while not exposing them directly to each other. It looks like this at a high level:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-26-phone-number-masking-for-yelp-services-projects/masking-session-model.png" alt="Minimal masking session data model. 
Each of Yelp’s Services conversations has an associated session which allows us to route messages and calls between the numbers seamlessly, while reflecting the message and call events on the conversation feed." /><p class="subtle-text"><small>Minimal masking session data model. Each of Yelp’s Services conversations has an associated session which allows us to route messages and calls between the numbers seamlessly, while reflecting the message and call events on the conversation feed.</small></p></div><p>The diagram above is a good high-level outline for how phone masking works. But in reality Yelp connects hundreds of thousands of businesses and customers every month, and this model requires that we allocate one number for every connection that we facilitate.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-26-phone-number-masking-for-yelp-services-projects/number-allocation-one-number-per-session.png" alt="The most basic proxy number allocation strategy is to assign a unique number for every masking session." /><p class="subtle-text"><small>The most basic proxy number allocation strategy is to assign a unique number for every masking session.</small></p></div><p>This can quickly get prohibitively expensive, not to mention that phone numbers are a finite resource and even the telephony API provider couldn’t provide us with that many numbers.</p><p>There is a solution though, and it is 2-fold: <strong>recycle</strong> and <strong>reuse</strong>.</p><p>In the following few sections, we walk through several possible phone number allocation strategies that recycle and reuse proxy numbers between sessions in various ways. 
Ultimately, we arrive at the one that minimizes the size of the proxy number pool that we need to maintain in order to support our system.</p><h2 id="phone-number-recycling">Phone number recycling</h2><p>For certain use cases such as delivery apps or connecting with your rideshare driver, the masking session needs to be relatively short-lived (typically several minutes). In those situations an application can get away with having a relatively small pool of proxy numbers by using a very aggressive recycling policy. The two determining factors for the size of the proxy number pool are the number of sessions needed per unit time and the average lifespan of a session.</p><p>For example, if a delivery app has on average 1000 deliveries per hour, typically lasting under 30 minutes, then it would need 500 proxy numbers on average at a given time.</p><p>Yelp’s phone masking system implements recycling, but we need to keep sessions active for a longer period of time, given that conversations between customers and businesses often last for weeks. There is a potential workaround where we recycle a number after N hours of inactivity and then allocate a new number if the conversation resumes. However, we would then risk breaking the continuity of the SMS conversation if the later messages start coming from a new number, and we may cause confusion when a conversation with a new business starts abruptly from the same number. Because of these considerations, we typically mark a proxy number as recyclable only after 30 days.</p><p>Therefore, we only ever need to maintain as many phone numbers as the number of connections per month, i.e., our costs scale as <strong>O(conversations per month)</strong>. 
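The sizing arithmetic above is an instance of Little’s law: the average number of proxy numbers in use equals the session arrival rate times the average session lifespan. A quick sketch using the numbers from the delivery-app example (the 30-day figure is the recycling window described above):

```python
def avg_pool_size(sessions_per_hour: float, avg_lifespan_hours: float) -> float:
    """Average number of proxy numbers in use at once (Little's law)."""
    return sessions_per_hour * avg_lifespan_hours


# Delivery-app example: 1000 sessions/hour lasting ~30 minutes each.
short_lived = avg_pool_size(1000, 0.5)       # 500 numbers on average

# With a 30-day recycling window, each session ties up a number for
# ~720 hours, so the pool scales with a month's worth of sessions.
long_lived = avg_pool_size(1000, 30 * 24)    # 720,000 numbers
```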
This is definitely an improvement, but it still requires purchasing millions of phone numbers, which means that we need further optimizations to our phone number use.</p><h2 id="phone-number-reuse">Phone number reuse</h2><p>The idea behind proxy phone number reuse is to use the same number in multiple masking sessions simultaneously instead of it only taking part in one session at a time. The tricky part is to assign the numbers in such a way that all phone calls and texts are routed unambiguously to the intended recipients. Below we describe some options we evaluated.</p><h3 id="unique-number-for-every-business">Unique number for every business</h3><p>One approach would be to not actually use a unique number for every conversation, but instead have a constant proxy number for each business on Yelp, such that two different customers see the same number for a given business, and we can disambiguate the conversation based on the sender/caller number. It is also quite natural that the business number doesn’t change between different users.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-26-phone-number-masking-for-yelp-services-projects/number-allocation-one-number-per-business.png" alt="With this proxy number allocation strategy we assign a unique proxy number to each business. Then all customers that have conversations (masking sessions) with the same business always interact with the same proxy number." /><p class="subtle-text"><small>With this proxy number allocation strategy we assign a unique proxy number to each business. Then all customers that have conversations (masking sessions) with the same business always interact with the same proxy number.</small></p></div><p>Unfortunately, this approach still has a couple of problems. First, it doesn’t allow the business to call or text the proxy number because we wouldn’t know which customer we should forward the call to. 
(This problem is actually solvable with the two-number-pools approach we describe later). And second, it scales as <strong>O(businesses using Request a Quote)</strong> which still doesn’t reduce costs sufficiently, but it’s closer to the optimal solution.</p><h3 id="each-party-sees-unique-numbers-single-proxy-pool">Each party sees unique numbers (single proxy pool)</h3><p>What’s interesting about the unique number per business approach is that it touches on the actual constraints at hand. Namely they are:</p><ul><li><strong>Constraint 1</strong>: Each customer should be interacting with a different number for each different business they are contacting.</li>
<li><strong>Constraint 2</strong>: Each business owner should be interacting with a different number for each different customer they are working with.</li>
</ul><p>However, there is no problem if two different customers see the same number for two different businesses because we can disambiguate who they are calling/texting based on the caller phone number. The same holds true for the business side. Therefore, we only need enough numbers to satisfy the above constraints for all of the customer-business connections every month at Yelp (with recycling).</p><p>We can demonstrate how this works out to be a small number with a hypothetical example. If most customers contact less than 10 businesses per month, and most businesses receive less than 100 requests per month<sup><a href="https://engineeringblog.yelp.com/2024/03/phone-number-masking-for-yelp-services-projects.html#footnote1">1</a></sup>, we only need 100 numbers (max of the two) to satisfy both constraints. We can also add a safety factor to account for outliers, but the number pool size still ends up being a small constant size.</p><p>More importantly, this allocation strategy minimizes our proxy number costs because the number pool size does not need to increase (it is <strong>O(1)</strong>) with the volume of customer quote requests sent on Yelp or the number of businesses we onboard on the platform.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-26-phone-number-masking-for-yelp-services-projects/number-allocation-reuse-numbers-single-pool.png" alt="In this example, there are 2 customers and 2 businesses having a total of 4 unique conversations (masking sessions), facilitated by a pool of 2 proxy numbers. Notice how the proxy numbers are mapped to participants such that each party sees unique numbers for each of their conversations (i.e. both constraints are satisfied)." /><p class="subtle-text"><small>In this example, there are 2 customers and 2 businesses having a total of 4 unique conversations (masking sessions), facilitated by a pool of 2 proxy numbers. 
Notice how the proxy numbers are mapped to participants such that each party sees unique numbers for each of their conversations (i.e. both constraints are satisfied).</small></p></div><h3 id="each-party-sees-unique-numbers-multiple-proxy-pools">Each party sees unique numbers (multiple proxy pools)</h3><p>As a final improvement, we actually use two pools of proxy numbers, one for the customer side and another for the business side. This way the masking is still seamlessly maintained because the customer always communicates with the same number and so does the business owner. They just happen to be different numbers. The final masking session model looks like this:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-26-phone-number-masking-for-yelp-services-projects/number-allocation-reuse-numbers-two-pools.png" alt="Like before, there are 2 customers and 2 businesses having a total of 4 unique conversations (masking sessions), but now they are facilitated by 2 pools of 2 proxy numbers each. Each party still sees unique numbers for each of their conversations, but the customer and business in a particular session see distinct proxy numbers (each from their respective pool)." /><p class="subtle-text"><small>Like before, there are 2 customers and 2 businesses having a total of 4 unique conversations (masking sessions), but now they are facilitated by 2 pools of 2 proxy numbers each. Each party still sees unique numbers for each of their conversations, but the customer and business in a particular session see distinct proxy numbers (each from their respective pool).</small></p></div><p>This strategy still satisfies both constraints and keeps costs constant, but it has the following benefits:</p><ul><li><strong>Less risk of exhausting numbers</strong>: Customer proxy numbers only need to satisfy constraint 1 from the previous section while business numbers only need to satisfy constraint 2. 
This makes it less likely that we run out of proxy numbers to assign to a session: the more sessions we create, the harder it becomes for a single number to satisfy both constraints at once.</li>
<li><strong>Simpler allocation and routing logic</strong>: The code is easier to maintain and understand.</li>
<li><strong>Greater flexibility</strong>: We can configure each number pool independently. For example, each pool can have a different size, a distinct path for webhooks, specific alerting, etc. We could even change the assignment strategy of each pool if necessary, or we can have additional pools if we needed a different assignment strategy for a new participant type. (E.g. having a constant number per business like mentioned above for the customer side for specific subsets of businesses).</li>
</ul><p>The only downside of this final strategy is that we need to purchase slightly more proxy numbers overall. However, this tradeoff is worth it given the added flexibility and ease of maintenance.</p><p>In this blog post we learned how Yelp’s engineering team developed an in-house phone masking system for the Services Marketplace. The feature helps us uphold our core value of “Protecting the Source” by prioritizing the privacy of consumers when connecting them with professionals over the phone, and maintains professionals’ trust that Yelp connects them with high-intent customers who are more eager to get their projects done.</p><p>At the same time, it poses an interesting technical challenge to prevent costs from increasing linearly with the volume of traffic. We managed to overcome this problem through good data modeling and intelligent allocation of resources, which allows us to offer the convenience and flexibility of masked phone communication for all Request a Quote projects.</p><p>This project required significant cross-team collaboration, and I would like to thank everyone in the Services group and other Yelp teams who contributed to the development and made it possible. Special thanks goes to Yi Qi, Billy Barbaro, James Coles-Nash, Michelle Tan, and Rich Schreiber for your technical and editorial reviews of this article.</p><p><a name="footnote1" id="footnote1">1</a>: These numbers are for illustrative purposes only.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2024/03/phone-number-masking-for-yelp-services-projects.html</link>
      <guid>https://engineeringblog.yelp.com/2024/03/phone-number-masking-for-yelp-services-projects.html</guid>
      <pubDate>Tue, 26 Mar 2024 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[CHAOS: Yelp's Unified Framework for Server-Driven UI]]></title>
      <description><![CDATA[<p>Yelp develops two major applications, <a href="https://yelp.com">Yelp</a> &amp; <a href="https://business.yelp.com">Yelp for Business</a>, for Web (Desktop &amp; Mobile), iOS, and Android platforms. That’s eight unique clients! Keeping a fresh, consistent UI on all these clients is a major challenge. Server-driven UI (SDUI) has become a standard industry technique for managing UI on multiple platforms. At Yelp, many product teams created SDUI frameworks for their features. Though successful, these frameworks were expensive to develop and maintain, and no single SDUI framework supported all our clients. In late 2021, we began building a unified SDUI framework called <strong>CHAOS</strong> or “Content Hosting Architecture with Optimization Strategies”.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/chaos-yelps-unified-framework-for-server-driven-ui/chaos.png" alt="" /></div><p>CHAOS is a backronym. Initially, we thought it would make a good blog post! But we found deeper meaning in the name. According to chaos theory, small changes to a system can dramatically alter its state. CHAOS would simplify the process of deploying major UI changes on our clients, leading to our slogan: “Small changes have big results”.</p><p>Though we chose CHAOS quickly, we went through many proposals for the phrase behind the acronym:</p><ul><li>Creative and Humorous Acronym for Our System</li>
<li>Content Helps Accelerate Our Success</li>
<li>Components Help Accelerate Our Screens</li>
</ul><p>We eventually settled on “Content Hosting Architecture with Optimization Strategies”.</p><p>“Content Hosting Architecture” made sense. UI is the content the user sees and interacts with. We were building an architecture for hosting interactive content. The content could be anything from an entire mobile screen or desktop browser page to a single UI element, often called a component.</p><p>We added “Optimization Strategies” because we planned to use machine learning (ML) to optimize content. For example, some consumers prefer to see photos when searching for businesses while others prefer to see reviews. Sometimes, the consumer’s preference changes depending on the type of business; photos might be more important for finding a good restaurant and reviews more important for finding a plumber. An ML model could select the best search experience automatically.</p><p>SDUI is a popular technique for managing UI on multiple platforms. In a standard UI, the client developer writes both presentation and data fetching logic. Updating the UI requires changing the client. For mobile clients, changes require going through the platform’s app release process and waiting for users to upgrade to the new version. If multiple clients require the same UI changes, the cost of making the changes increases dramatically.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/chaos-yelps-unified-framework-for-server-driven-ui/client-without-sdui.png" alt="" /></div><p>In SDUI, the backend developer writes the presentation and data fetching logic, returning the configured UI to the client. 
The backend code can be updated without requiring changes to the client, and a single backend change can update the UI on multiple clients.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/chaos-yelps-unified-framework-for-server-driven-ui/client-with-sdui.png" alt="" /></div><p>At Yelp, we’ve built many successful SDUI frameworks. <a href="https://engineeringblog.yelp.com/2021/11/building-a-server-driven-foundation-for-mobile-app-development.html">Building a server-driven platform for mobile app development</a> described one such framework, the Biz Native Foundation or BNF, for managing the UX on the iOS and Android versions of Yelp for Business.</p><p>The BNF has a very typical server-driven architecture for mobile clients. It supports server-driven mobile screens that host a list of <strong>components</strong>. Interacting with a component, such as tapping a button, triggers an <strong>action</strong> that updates the UI directly or indirectly through a <strong>property</strong> – a piece of observable application state.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/chaos-yelps-unified-framework-for-server-driven-ui/sdui-architecture.png" alt="" /></div><p>While the BNF was being developed, several other major SDUI frameworks were being developed for Yelp’s consumer clients, and more teams were considering SDUI for their use cases. We organized an internal SDUI community to foster knowledge sharing and collaboration. Still, each SDUI framework was an independent effort. Some clients even had multiple SDUI frameworks controlling different aspects of the UI. A single product request might require changes to multiple SDUI frameworks!</p><p>Having a single, cross-platform SDUI framework would eliminate duplicate effort and simplify UI changes across multiple clients. 
We started CHAOS as a community-driven effort to build that framework.</p><p>Historically, we’ve built and maintained multiple REST APIs for our clients. Having different APIs, each with its own <a href="https://swagger.io/">Swagger</a> spec and backend Python service, was a big reason why we couldn’t unify our SDUI frameworks.</p><p>Fortunately, for the last several years, we’ve been switching all Yelp clients to a unified GraphQL API. Therefore, using GraphQL was a requirement for CHAOS. Even if we wanted to use REST for SDUI, our clients would need to support both REST &amp; GraphQL. When Yelp introduced GraphQL, we wanted to replace REST entirely.</p><p>We were initially excited about using GraphQL for SDUI. We thought we could evolve our SDUI graph more easily than a REST API, which requires explicit versioning. We thought the explicitness of client queries would help maintain backwards compatibility because each request would document the supported types and fields. As we’ll discuss in the next section, GraphQL presented some challenges when designing the CHAOS API, and we ultimately embedded some REST objects for pragmatic reasons.</p><p>We’ll start by outlining the original requirements for CHAOS, then discuss the use model and how it was translated into a GraphQL API.</p><h2 id="requirements">Requirements</h2><ul><li>Use GraphQL</li>
<li>Support a variety of use cases on web and mobile clients</li>
<li>Handle forwards &amp; backwards compatibility when making changes</li>
</ul><h2 id="use-model">Use model</h2><p>A <strong>view</strong> is a piece of UI managed by CHAOS. Every view has a unique <strong>name</strong> and a <strong>layout</strong>, which arranges a set of <strong>components</strong>. Components can trigger <strong>actions</strong> to implement side-effects. Every layout, component, or action has a unique versioned type.</p><p>For example, a product manager wants a simple view to help new Yelp users find local businesses. The initial design requires a single column layout with text, illustration, and button components. Clicking the button opens a deep link to a Yelp search.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/chaos-yelps-unified-framework-for-server-driven-ui/chaos-view.png" alt="" /></div><p>We can easily extend CHAOS to support more use cases by adding more layouts, components, and actions. Layouts can be a single column, a row, or a full web page/mobile screen with multiple sections. Components can be a single piece of text, a button, or an entire section. Actions can open URLs, log analytics, or update application state.</p><p>A Yelp client queries the CHAOS GraphQL API for a view. The GraphQL API loads the view by calling a standardized REST API on a CHAOS backend implemented as a Python service.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/chaos-yelps-unified-framework-for-server-driven-ui/chaos-system.png" alt="" /></div><p>There’s no single CHAOS backend for all views. Rather, CHAOS backends are microservices for UI. They can be responsible for a single view or multiple related views, and the CHAOS API dispatches client queries based on the view name.</p><p>CHAOS provides React, Android, and iOS client libraries for making GraphQL queries and rendering views. 
CHAOS provides a Python package for building views in CHAOS backends.</p><h2 id="dream-query">Dream Query</h2><p>At Yelp, when building new GraphQL APIs, we start by writing a <a href="https://engineeringblog.yelp.com/2020/10/dream-query.html">Dream Query</a>. We need a query to fetch a CHAOS view by its unique name:</p><div class="language-graphql highlighter-rouge highlight"><pre>query GetChaosView($name: String!) {
  chaosView(name: $name) {
    views {
      identifier
      layout
    }
    initialViewId
    components
    actions
  }
}</pre></div><p>The query returns a <code class="language-plaintext highlighter-rouge">ChaosConfiguration</code> with an array of views and an initial view ID. Though many CHAOS use cases have a single view, some use cases have a sequence of related views. We could always fetch subsequent views with additional GraphQL queries, but they would require extra round trips over a potentially slow and unreliable network connection. Consequently, CHAOS supports returning multiple views within the same configuration for better performance and reliability.</p><p>Each view has a layout that arranges components by ID. 
Layouts are represented by the <code class="language-plaintext highlighter-rouge">ChaosLayout</code> union type:</p><div class="language-graphql highlighter-rouge highlight"><pre>union ChaosLayout = ChaosSingleColumn | ChaosMobilePhoneScreen</pre></div><p>CHAOS supports a single column layout that arranges components in a vertical stack, which is great for adding some SDUI to an existing web page or mobile screen.</p><div class="language-graphql highlighter-rouge highlight"><pre>type ChaosSingleColumn implements ChaosLayout {
  rows: [String!]!
}</pre></div><p>CHAOS also supports a layout for controlling an entire mobile phone screen, a common use case for many of our existing SDUI frameworks.</p><div class="language-graphql highlighter-rouge highlight"><pre>type ChaosMobilePhoneScreen implements ChaosLayout {
  toolBar: String
  main: [String!]!
  footer: String
}</pre></div><p>We’ve been experimenting with layouts for entire web pages and will report on those efforts in subsequent blog posts. More commonly, our web clients use single column layouts to add some SDUI content to a page that otherwise uses traditional data fetching and presentation logic.</p><p>Layouts refer to components by ID, and all components in a <code class="language-plaintext highlighter-rouge">ChaosConfiguration</code> are stored in the top-level <code class="language-plaintext highlighter-rouge">components</code> field. Similarly, components refer to actions by ID, and all actions are stored in the top-level <code class="language-plaintext highlighter-rouge">actions</code> field.</p><p>Storing components and actions in the top-level configuration has some practical benefits. First, it reduces response size when components or actions are referenced multiple times. 
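</p><p>To illustrate the first benefit: a component referenced from several rows or views is serialized once and looked up by ID wherever it appears. A minimal sketch using plain dicts and made-up identifiers:</p><div class="language-python highlighter-rouge highlight"><pre># Components live once in the top-level list; layouts reference them by ID.
configuration = {
    "views": [
        {"identifier": "welcome", "layout": {"rows": ["header", "cta"]}},
        {"identifier": "return-visit", "layout": {"rows": ["header"]}},
    ],
    "components": [
        {"identifier": "header", "componentType": "chaos.text.v1"},
        {"identifier": "cta", "componentType": "chaos.button.v1"},
    ],
}

by_id = {c["identifier"]: c for c in configuration["components"]}

# "header" appears in both views but was transferred a single time.
resolved = {
    view["identifier"]: [by_id[row] for row in view["layout"]["rows"]]
    for view in configuration["views"]
}</pre></div><p>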
Second, it improves readability because layouts are compact and focused on how components are arranged.</p><h2 id="modeling-components--actions">Modeling components &amp; actions</h2><p>Initially, we planned to use explicit GraphQL types to model each component and action. We defined interfaces that all components and actions must satisfy. Because we reference components and actions by ID, they must have a unique string <code class="language-plaintext highlighter-rouge">identifier</code>. The other fields depend on the particular component or action.</p><p>Let’s say CHAOS supports a single component (<code class="language-plaintext highlighter-rouge">ChaosButton</code>) and action (<code class="language-plaintext highlighter-rouge">ChaosOpenUrl</code>) with the following GraphQL types:</p><div class="language-graphql highlighter-rouge highlight"><pre>type ChaosButton implements ChaosComponent {
  identifier: String!
  text: String!
  onClick: [String!]!
}

type ChaosOpenUrl implements ChaosAction {
  identifier: String!
  url: String!
}</pre></div><p>The client’s query uses fragments to specify the supported component and action types:</p><div class="language-graphql highlighter-rouge highlight"><pre>query GetChaosView($name: String!) {
  chaosView(name: $name) {
    views {
      identifier
      layout {
        ... on ChaosSingleColumn {
          rows
        }
      }
    }
    components {
      ... on ChaosButton {
        identifier
        text
        onClick
      }
    }
    actions {
      ... on ChaosOpenUrl {
        identifier
        url
      }
    }
    initialViewId
  }
}</pre></div><p>Though this seems like a sensible approach, we found a number of issues in practice.</p><p>First, components and actions aren’t like traditional GraphQL types for data fetching. A main selling point for GraphQL is that clients fetch only the fields they require. Well, the client can’t query some button fields and not others; the button won’t work without <code class="language-plaintext highlighter-rouge">onClick</code>!</p><p>Second, adding new fields must be done carefully. 
Let’s add a new <code class="language-plaintext highlighter-rouge">style</code> parameter to control the appearance of the button:</p><div class="language-graphql highlighter-rouge highlight"><pre>type ChaosButton implements ChaosComponent {
  identifier: String!
  text: String!
  style: ChaosButtonStyle
  onClick: [String!]!
}</pre></div><p>Unfortunately, we’ve already released the original button to mobile clients, and there are older app versions that don’t support <code class="language-plaintext highlighter-rouge">style</code>. How do we communicate to the CHAOS backend that the mobile client supports the new field?</p><p>The GraphQL server knows whether the client’s query includes the new field. We use <a href="https://www.apollographql.com/docs/apollo-server/">Apollo Server</a>, and it supplies an <a href="https://www.apollographql.com/docs/apollo-server/data/resolvers#resolver-arguments">info</a> argument to the component’s resolver with an abstract syntax tree (AST) representing the query. But we need to traverse through several nested arrays and objects to find whether <code class="language-plaintext highlighter-rouge">style</code> is part of the <code class="language-plaintext highlighter-rouge">ChaosButton</code> fragment:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/chaos-yelps-unified-framework-for-server-driven-ui/info-new-field.png" alt="" /></div><p>We also need to communicate to the CHAOS backend that the field is available. We’ll be constantly adding and (less frequently) removing fields. Do we send a list of supported fields for every component and action to the backend? That would add a considerable amount of overhead to each request.</p><p>The third issue is that adding a type has the same problem. 
Let’s add a new component to represent a block of styled text:</p><div class="language-graphql highlighter-rouge highlight"><pre>type ChaosText implements ChaosComponent {
  identifier: String!
  text: String!
  textStyle: ChaosTextStyle
  textAlignment: ChaosTextAlignment
}</pre></div><p>The client’s query must be updated to support the new component type:</p><div class="language-graphql highlighter-rouge highlight"><pre>query GetChaosView($name: String!) {
  chaosView(name: $name) {
    views {
      identifier
      layout {
        ... on ChaosSingleColumn {
          rows
        }
      }
    }
    components {
      ... on ChaosButton {
        identifier
        text
        style
        onClick
      }
      ... on ChaosText {
        identifier
        text
        textStyle
        textAlignment
      }
    }
    actions {
      ... on ChaosOpenUrl {
        identifier
        url
      }
    }
    initialViewId
  }
}</pre></div><p>To determine if the query includes the <code class="language-plaintext highlighter-rouge">ChaosText</code> fragment, the component’s GraphQL resolver must delve deep into the AST, then pass that information along to the CHAOS backend in a list of supported components (and actions):</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/chaos-yelps-unified-framework-for-server-driven-ui/info-new-type.png" alt="" /></div><p>In the end we decided that explicit, unversioned GraphQL types weren’t practical. We’d spend too much time and effort maintaining our GraphQL layer without much real benefit. The clients would be writing large queries, and the server would be parsing them. Instead, we modeled each component or action as a versioned REST object in JSON format.</p><p>Every component or action has a unique type string with an integer version number, such as <code class="language-plaintext highlighter-rouge">chaos.button.v1</code> and <code class="language-plaintext highlighter-rouge">chaos.open-url.v1</code>. GraphQL doesn’t natively support JSON or map fields, so parameters are stored in a stringified JSON object.</p><div class="language-plaintext highlighter-rouge highlight"><pre>type ChaosJsonComponent implements ChaosComponent {
    identifier: String!
    componentType: String!
    parameters: String!
}
type ChaosJsonAction implements ChaosAction {
    identifier: String!
    actionType: String!
    parameters: String!
}
</pre></div><p>For example, a button component in our GraphQL response looks like:</p><div class="language-json highlighter-rouge highlight"><pre>{
  "identifier": "primacy-cta",
  "componentType": "chaos.button.v1",
  "parameters": "{\"text\": \"Find local businesses\", \"onClick\": [\"open-search-url\"]}",
  "__typename": "ChaosJsonComponent"
}</pre></div><p>Clearly, the stringified JSON isn’t very readable. We’ve created developer tools to edit and debug CHAOS configurations.</p><p>We still use GraphQL types for views and layouts. These types change less frequently and contain the high-level structure of the UI, so direct readability is more useful. Internally, we still associate layouts with a unique versioned type string, e.g. <code class="language-plaintext highlighter-rouge">chaos.single-column.v1</code>, and we may switch to embedded REST objects for layouts, too. We’re still figuring out the right balance between GraphQL and REST, but we’ve been using the approach in production for more than two years without revisiting the decision.</p><p>Here’s a complete CHAOS configuration to see how everything comes together:</p><div class="language-json highlighter-rouge highlight"><pre>{
  "data": {
    "chaosView": {
      "views": [
        {
          "identifier": "consumer.welcome",
          "layout": {
            "__typename": "ChaosSingleColumn",
            "rows": [
              "welcome-to-yelp-header",
              "welcome-to-yelp-illustration",
              "find-local-businesses-button"
            ]
          },
          "__typename": "ChaosView"
        }
      ],
      "components": [
        {
          "__typename": "ChaosJsonComponent",
          "identifier": "welcome-to-yelp-header",
          "componentType": "chaos.text.v1",
          "parameters": "{\"text\": \"Welcome to Yelp\", \"textStyle\": \"heading1-bold\", \"textAlignment\": \"center\"}"
        },
        {
          "__typename": "ChaosJsonComponent",
          "identifier": "welcome-to-yelp-illustration",
          "componentType": "chaos.illustration.v1",
          "parameters": "{\"dimensions\": {\"width\": 375, \"height\": 300}, \"url\": \"https://media.yelp.com/welcome-to-yelp.svg\"}"
        },
        {
          "__typename": "ChaosJsonComponent",
          "identifier": "find-local-businesses-button",
          "componentType": "chaos.button.v1",
          "parameters": "{\"text\": \"Find local businesses\", \"style\": \"primary\", \"onClick\": [\"open-search-url\"]}"
        }
      ],
      "actions": [
        {
          "__typename": "ChaosJsonAction",
          "identifier": "open-search-url",
          "actionType": "chaos.open-url.v1",
          "parameters": "{\"url\": \"https://yelp.com/search\"}"
        }
      ],
      "initialViewId": "consumer.welcome",
      "__typename": "ChaosConfiguration"
    }
  }
}</pre></div><h2 id="versioning-components--actions">Versioning components &amp; actions</h2><p>When changing a component or action, we increment the version. For example, adding <code class="language-plaintext highlighter-rouge">style</code> to the CHAOS button introduces <code class="language-plaintext highlighter-rouge">chaos.button.v2</code>.</p><p>Clients have their own internal component libraries and use factories associated with each component type to map the CHAOS component to the internal component’s interface. Actions go through a similar mapping process.</p><p>CHAOS backends use a YAML config file to determine what component or action types can be used in a CHAOS configuration. The GraphQL layer passes information about the platform (React, iOS, or Android) to the CHAOS backend. For mobile clients, the GraphQL layer also passes the app version.</p><p>For web, we can update all our React clients simultaneously using <a href="https://engineeringblog.yelp.com/2023/03/gondola-an-internal-paas-architecture-for-frontend-app-deployment.html">Gondola</a>, Yelp’s PaaS for front-end deployment. Therefore, we use <code class="language-plaintext highlighter-rouge">web: true</code> to indicate that a type is available for web clients.</p><p>For mobile clients, we can’t update older versions. We also have distinct apps for consumers &amp; business owners on each platform. 
Therefore, we use <code class="language-plaintext highlighter-rouge">start: &lt;app version&gt;</code> to indicate the first app version that supports a type, and each app/platform combination has its own value.</p><div class="language-yaml highlighter-rouge highlight"><pre>components:
  - type: chaos.button.v1
    web: true
    consumer-ios:
      start: 22.1.0
    consumer-android:
      start: 22.3.0
    biz-ios:
      start: 22.1.0
    biz-android:
      start: 22.6.0
actions:
  - type: chaos.open-url.v1
    web: true
    consumer-ios:
      start: 22.1.0
    consumer-android:
      start: 22.3.0
    biz-ios:
      start: 22.1.0
    biz-android:
      start: 22.6.0
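# A change to a type bumps its version and gets its own gated entry.
# For example, chaos.button.v2 (which adds the `style` parameter) would
# require newer app versions; the version numbers below are hypothetical:
#
# components:
#   - type: chaos.button.v2
#     web: true
#     consumer-ios:
#       start: 23.1.0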
</pre></div><p>We shipped the first CHAOS use case to production in early 2022, only a few months after starting development. Since then, we’ve been regularly shipping new use cases. CHAOS development is entirely use-case driven. We add new layouts, components, and actions when they are required.</p><p>CHAOS isn’t intended to replace traditional UI development. We use CHAOS where it makes sense. Usually, a good use case for CHAOS satisfies one or more of the following conditions:</p><ul><li>It must be consistent across multiple clients.</li>
<li>It has dynamic, highly contextual content.</li>
<li>It must be updated quickly on mobile clients.</li>
</ul><p>For example, CHAOS manages the Yelp for Business support flow on web and mobile clients. When a business owner opens the support flow, we show a CHAOS view with a list of support options:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/chaos-yelps-unified-framework-for-server-driven-ui/chaos-support-flow.png" alt="" /></div><p>Some business owners use multiple clients, and some businesses are managed by multiple owners who use different clients. Therefore, we want to show consistent support options on all clients.</p><p>Support options are also dynamic and highly contextual. Live chat or phone support isn’t available 24/7, and the phone number depends on location.</p><p>Finally, if there’s a technical issue such as an outage, we want to update our mobile clients quickly without waiting for an app release. By adding a note that we’re aware of the issue and working on it, we can keep business owners informed and avoid unnecessary support calls.</p><p>With CHAOS, the support options can be updated on all clients by deploying a change to a single backend service.</p><p>As we adopt CHAOS more broadly within Yelp, we’ve identified some key areas for future investment.</p><h2 id="automated-previews">Automated previews</h2><p>To verify changes to a CHAOS view, a backend developer tests each client manually.</p><p>Though testing web clients is relatively straightforward – everyone has access to a browser – testing mobile clients requires access to simulators or physical devices. Before Yelp switched to remote work, we maintained a mobile device library in each engineering office. After the switch, we integrated with a cloud-based testing solution from a vendor. Even so, manual testing is cumbersome for a backend developer who needs to verify multiple platforms or app versions.</p><p>In the future, we plan to support automated previews. 
When a backend developer publishes a GitHub PR with changes to a CHAOS view, we’ll automatically generate previews for each platform and attach them to the PR when ready.</p><p>Currently, when a product manager or designer wants to change a CHAOS view, they must ask a backend developer. The backend developer changes the Python code that configures the CHAOS view, creates a PR, gets it approved, and deploys the changes to production. Even simple changes, such as updating copy, can take 30 minutes to several hours.</p><p>In the future, we plan to support no-code configuration updates for product managers and designers through internal editing tools.</p><h2 id="optimization-strategies">Optimization strategies</h2><p>Although optimization is a core part of the CHAOS backronym, we haven’t implemented any optimization strategies for CHAOS content. Selecting, ordering, and configuring CHAOS content must be done manually in Python code.</p><p>In the future, we plan to use ML to automatically select, order, and configure some CHAOS content.</p><p>This is the first in a series of blog posts about CHAOS. In upcoming blog posts, our client engineers will explain how CHAOS works on Web, iOS, and Android clients, and our backend engineers will explain how to build a CHAOS backend in Python.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2024/03/chaos-yelps-unified-framework-for-server-driven-ui.html</link>
      <guid>https://engineeringblog.yelp.com/2024/03/chaos-yelps-unified-framework-for-server-driven-ui.html</guid>
      <pubDate>Thu, 14 Mar 2024 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Keeping track of engineering-wide goals and migrations]]></title>
      <description><![CDATA[<p>EE Metrics was envisioned as a hub that helps teams manage their technical debt. EE Metrics provides every team with a detailed web page that contains information about technical debt that needs to be addressed. It also serves as a platform to highlight top engineering initiatives at the organization level.</p><p>EE Metrics empowers infrastructure teams to surface important migrations or metrics that could improve the health of software projects. Organization-wide migrations of technologies can often be difficult to surface and keep track of.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-13-keeping-track-of-engineering-wide-goals-and-migrations/1_lifecycle.png" alt="Figure 1: Diagram showing how EE Metrics is interacted with and consumed" /><p class="subtle-text"><small>Figure 1: Diagram showing how EE Metrics is interacted with and consumed</small></p></div><p>Most users browse their team’s health reports within their team-specific page to understand which migrations and health metrics they need to address based on impact and priority.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-13-keeping-track-of-engineering-wide-goals-and-migrations/2_overview.png" alt="Figure 2: High level overview of the architecture of EE Metrics" /><p class="subtle-text"><small>Figure 2: High level overview of the architecture of EE Metrics</small></p></div><p>EE Metrics contains two key components - a backend service that collects and calculates audit results, and a frontend service that exposes a web application. The primary interface used by users is the web application. The web application allows audit authors to create audits. These audits can be viewed in full detail within their respective pages and are surfaced through Team Health Reports. 
Team Health Reports attempt to analyze a team’s health in various categories and identify areas of improvement, which is the primary purpose of EE Metrics.</p><p>The Team Health Reports act as a data-driven communication platform between infrastructure teams and product teams. There are two primary categories of metrics that comprise “Audits” in EE Metrics. First, there are org-wide initiatives called “Migrations” that are created by infrastructure teams. These initiatives include code and infrastructure updates that improve the health of software projects from a velocity, quality, reliability, and security perspective. Another set of org-wide initiatives that EE Metrics surfaces is called “Health Checks”. These tend to be recurring long-term metrics that teams attempt to keep within certain thresholds. An example would be Test Run Times. Keeping the run times of all owned services under a certain threshold gives the team confidence that it can continue to ship features reliably and quickly.</p><p>The EE Metrics Team Health Report allows teams to view the overall health of their developer velocity, code quality, reliability, and security, and gives them their top-priority action items to improve in each of these areas. This helps with balancing the pressure of shipping new product features versus maintenance work.</p><h2 id="how-do-team-health-reports-work">How do team health reports work?</h2><p>Team Health Reports are driven by a series of audits that are run against all of a team’s entities (services, libraries, files, directories, etc.). Entities can be any piece of technology or concept that can be owned by teams. To help assign audits to teams, we use the Ownership service to determine which entities fall under the team’s health report (for more information about ownership, check out our <a href="https://engineeringblog.yelp.com/2021/01/whose-code-is-it-anyway.html">blog post on Ownership</a>). 
Once the health report is generated for a team, it lists the action items teams can take to make improvements, ranked in order of impact and priority. The results of these audits are collected once a day and can be viewed in the EE Metrics web application, or through a monthly email report sent to the team and org leaders. The statuses of previous audits are also preserved so that users can view historical results and spot any trends.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-13-keeping-track-of-engineering-wide-goals-and-migrations/3_report.png" alt="Figure 3: This is a snapshot of a team’s health report as seen in the web application" /><p class="subtle-text"><small>Figure 3: This is a snapshot of a team’s health report as seen in the web application</small></p></div><h2 id="what-are-these-scores">What are these scores?</h2><p>The scores in the figure above (figure 3) represent how effective your team is based on the number of audits outstanding or completed. Audits have a weight assigned to them based on their priority. This helps users understand which audits require more immediate attention. These scores are primarily driven by the following weighting:</p><ul><li>60% of your score is attributed to audits weighted as HIGH.</li>
<li>30% of your score is attributed to audits weighted as MED.</li>
<li>10% of your score is attributed to audits weighted as LOW.</li>
</ul><p>There are other factors that affect scores, such as whether a migration is overdue, whether it’s an informational audit, or whether it is a pending new audit. Primarily, we came up with this scoring to ensure that if a team has completed all their high-weighted audits, they are deemed to be in good standing.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-13-keeping-track-of-engineering-wide-goals-and-migrations/4_email.png" alt="Figure 4: This is a snapshot of a team’s health report in the form of an email surfacing important notes" /><p class="subtle-text"><small>Figure 4: This is a snapshot of a team’s health report in the form of an email surfacing important notes</small></p></div><h2 id="audit-creation-and-guidelines">Audit Creation and Guidelines</h2><p>Audits are created by infrastructure teams. These can be one-time initiatives such as migrating off a deprecated service. Audits can also be long-term measurements of metrics that must pass a specific threshold or be within acceptable bounds. An example would be measuring how often a test fails during the release process. If the number of test failures exceeds a specific threshold, it suggests unreliable tests and needs to be addressed.</p><p>Infrastructure teams are empowered to add new audits to EE Metrics when they are trying to enact change in their areas of ownership. These audits are powered by various data sources collected by the EE Metrics Events Pipeline and additional platform services: these are called metrics and are required for audits to determine the state of an entity. Once a metric is tracked, writing a new audit to the platform is simple. 
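</p><p>As an illustration, an audit definition plus the weighting scheme described earlier (60% HIGH, 30% MED, 10% LOW) might be sketched as follows. The real EE Metrics data model and scoring internals aren’t public, so every name here is hypothetical:</p>

```python
from dataclasses import dataclass

# Share of the overall score attributed to each weight tier, per the
# scheme described above (hypothetical implementation).
WEIGHT_SHARE = {"HIGH": 0.60, "MED": 0.30, "LOW": 0.10}

@dataclass
class Audit:
    name: str
    weight: str   # "HIGH", "MED", or "LOW"
    passing: bool

def team_score(audits: list) -> float:
    """Each tier contributes its share of the score, scaled by the fraction
    of that tier's audits the team has completed."""
    score = 0.0
    for tier, share in WEIGHT_SHARE.items():
        tier_audits = [a for a in audits if a.weight == tier]
        if not tier_audits:
            score += share  # no audits in this tier: full credit
            continue
        score += share * sum(a.passing for a in tier_audits) / len(tier_audits)
    return round(100 * score, 1)

audits = [
    Audit("migrate-off-deprecated-service", "HIGH", passing=True),
    Audit("test-run-time-under-threshold", "MED", passing=False),
    Audit("readme-up-to-date", "LOW", passing=True),
]
print(team_score(audits))  # HIGH and LOW complete, MED outstanding -> 70.0
```

<p>Note how, under this weighting, completing all HIGH audits alone already puts a team at a respectable score, matching the intent that such teams are deemed to be in good standing.</p><p>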
After many iterations of audits, we came up with a set of guidelines for writing a new audit:</p><ul><li>Audits should contain enough context for teams to address and solve them - if an audit requires a lot of external context, teams should be directed to additional documentation to help them understand its requirements.</li>
<li>Audits should be actionable by the teams themselves. If improvements require heavy lifting from an infrastructure team, the infrastructure team should directly drive those improvements.</li>
<li>Audits should be targeted at the team level across the engineering organization. For example, checking for a particular antipattern one specific developer introduced is not the goal for audits.</li>
</ul><p>Once a new audit configuration is deployed, the Team Health Reports are updated to include the new audit.</p><p>We’ve taken a democratic approach, allowing infrastructure teams to define their audits’ thresholds and impact levels by establishing clear criteria and providing guidance. While we initially had concerns that infrastructure teams would view their audits as always having the highest impact, we found metric owners have a good understanding of how their audits fit into the bigger picture of a team’s overall health.</p><p>Required Migrations are any engineering efforts highlighted at the organizational level that are deemed important. These are engineering initiatives that are to be completed by their respective due dates. Some examples of a Required Migration could be an internal migration of services from a deprecated technology to a new one, or organization-level upgrades to repositories. These are migrations for technologies that pose the most risk or have outsized benefits across the entire engineering org.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-13-keeping-track-of-engineering-wide-goals-and-migrations/5_required.png" alt="Figure 5: Example of failing Required Migrations" /><p class="subtle-text"><small>Figure 5: Example of failing Required Migrations</small></p></div><h2 id="why-is-ee-metrics-important-for-required-migrations">Why is EE Metrics important for Required Migrations?</h2><p>It can be difficult to highlight and keep track of engineering initiatives that are important at the organization level. Since EE Metrics collects and displays audit results, it can provide an accurate assessment of an engineering initiative’s completion. It also provides a platform to keep track of and send detailed reports on the progress of these initiatives at the team and organization level. 
Teams often do not have the bandwidth to address all of the audits surfaced. To alleviate this, Required Migrations serve as a way to prioritize engineering initiatives. Required Migrations are part of the org-wide roadmap planning process, where teams must commit time to addressing these migrations. The goal of EE Metrics is to further increase visibility of these initiatives within the organization.</p><p>Determining whether a migration is a top initiative or not depends on several factors. Generally, the overall process is as follows:</p><ul><li>Migration authors work with their Engineering Manager to propose escalating migrations based on importance, severity, and the potential consequences of leaving them undone.</li>
<li>Various EMs, TPMs, and Directors coordinate the tentative list of required migrations.</li>
<li>VPs approve the list of migrations, which is then labeled as Required Migrations.</li>
<li>Migration authors are designated as the owners overseeing the completion of their migrations. A corresponding migration and audit are created in the EE Metrics services for each Required Migration.</li>
</ul><p>Once teams and organization leaders are aware of the required migrations, it becomes easier to ensure these migrations are completed by a specific date.</p><p>EE Metrics serves as a hub for employees to easily identify engineering initiatives and issues that need to be resolved. By handling these issues and performing migrations early on, teams reduce technical debt and improve developer effectiveness. As an organization grows and expands, identifying and communicating engineering initiatives and potential issues becomes harder without a centralized platform.</p><p>A team at Yelp had a cohorting issue with an experiment they were running. This caused a lot of headaches: the problem was difficult to identify. The team in question checked their EE Metrics Team Health Report and found the audit pointing out deficiencies in their experiment.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-13-keeping-track-of-engineering-wide-goals-and-migrations/6_experiment.png" alt="Figure 6: The audit that was pointing out that one of their experiments was deficient" /><p class="subtle-text"><small>Figure 6: The audit that was pointing out that one of their experiments was deficient</small></p></div><p>The team was able to solve their issue and strived to keep improving their EE Metrics scores. 
EE Metrics proved helpful enough that the team decided to share their experience with us and describe how it had helped them.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-13-keeping-track-of-engineering-wide-goals-and-migrations/7_thanks.png" alt="The team at Yelp provided a nice testimonial for our team" /><p class="subtle-text"><small>The team at Yelp provided a nice testimonial for our team</small></p></div><p>We’re delighted by all the internal usage of EE Metrics and we will continue to iterate and develop tools to better surface technical debt at the company. We hope to see EE Metrics continue evolving into a powerful tool for addressing technical debt.</p><p>We would like to send a warm thank you to all past, present and future individuals who have contributed to the development of EE Metrics.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2024/03/keeping-track-of-engineering-wide-goals-and-migrations.html</link>
      <guid>https://engineeringblog.yelp.com/2024/03/keeping-track-of-engineering-wide-goals-and-migrations.html</guid>
      <pubDate>Wed, 13 Mar 2024 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Yelp’s AI pipeline for inappropriate language detection in reviews]]></title>
      <description><![CDATA[<p>Yelp’s mission is to connect consumers with great local businesses by giving them access to reliable and useful information. Consumer trust is one of our top priorities, which is why we make significant investments in technology and human moderation to protect the integrity and quality of content on Yelp. As a platform for user-generated content, we rely on our community of users and business owners to help report reviews that they believe may violate our <a href="https://terms.yelp.com/tos/en_us/20240222_en_us/">Terms of Service</a> and <a href="https://www.yelp.com/guidelines">Content Guidelines</a>. Our User Operations team investigates all flagged content and, if it’s found to be in violation of our policies, may remove it from the platform.</p><p>Beyond user reporting, Yelp also has proactive measures in place that help mitigate hate speech and other forms of inappropriate content through the use of automated moderation systems. In this pursuit, Yelp recently enhanced its technology stack by deploying <strong>Large Language Models (LLMs)</strong> to help surface or identify egregious instances of threats, harassment, lewdness, personal attacks, or hate speech.</p><p>Automating inappropriate content detection in reviews is a complex task. Given the potential complexities of different contexts, several considerations go into creating a tool that can confidently flag content violating our policies. In the absence of high precision, such a tool can have significant consequences, including delays in evaluating reviews, while less stringent measures can result in the publication of inappropriate and unhelpful content to the public. To address this, we have iterated through several approaches to achieve higher precision and recall in the detection of inappropriate content. These precision-recall tradeoffs drove us to adopt LLMs, which have been largely successful in the field of natural language processing. 
In particular, we explored the efficacy of LLMs to identify egregious content, such as:</p><ul><li>Hate speech (including disparaging content targeting individuals or groups based on their race, ethnicity, religion, nationality, gender, sexual orientation, or disability)</li>
<li>Lewdness (including sexual innuendos, pickup lines, solicitation of sexual favors, as well as sexual harassment)</li>
<li>Threats, harassment, or other extreme forms of personal attacks</li>
</ul><p>Unrelated to this automated system, as previously mentioned, Yelp allows both consumers and business owners to report reviews they believe violate our content policies, including reviews that contain threats, harassment, lewdness, hate speech, or other displays of bigotry. In 2022, <a href="https://issuu.com/yelp10/docs/2022_yelp_trust_safety_report?fr=sZmZkYzU3NDM2NzY">26,500+ reported reviews were removed</a> from Yelp’s platform for containing threats, lewdness, and hate speech. These reported reviews, along with Yelp’s pre-existing systems that curb inappropriate reviews in real-time, provided us with a large dataset to fine-tune LLMs for the given binary classification task, where the goal was to classify reviews as appropriate or inappropriate, in real-time.</p><p>To train the LLM for classification, we had access to a sizeable dataset of reviews identified as inappropriate in the past. However, given the inherent complexity of language, especially in the presence of metaphors, sarcasm and other figures of speech, it was necessary to more precisely define the task of inappropriate language detection to the LLM. To accomplish this, we collaborated with Yelp’s User Operations team to curate a high-quality dataset comprising the most egregious instances of inappropriate reviews, as well as reviews that adhered to our content guidelines. A pivotal strategy here was the introduction of a scoring scheme that enabled moderators to signal to us the severity level of inappropriateness in a review. To further augment the dataset, we also implemented similarity techniques using sentence embeddings from LLMs, and identified additional reviews that were similar to the high-quality samples we obtained from moderator annotation.</p><p>Apart from this, we also applied sampling strategies on the training data specifically to increase model recall. 
In order to train a model that can recognize different forms of inappropriate content, it is necessary to have a dataset with enough samples from different sub-categories of inappropriate content. Unfortunately, a large number of reviews that we curated did not contain this information. To solve this problem, we leveraged the zero-shot and few-shot classification capabilities of LLMs to identify the sub-category of inappropriate content and performed under-sampling or over-sampling where needed.</p><p>Using the carefully curated data, we began investigating the effectiveness of large language models for the given text classification task. We downloaded LLMs from the <a href="https://huggingface.co/docs/hub/models-the-hub">HuggingFace model hub</a> and computed sentence embeddings on the preprocessed review samples. Using these embeddings, we determined the separation between appropriate and inappropriate samples by evaluating the silhouette score between the two groups, as well as by plotting them in a two-dimensional space after dimensionality reduction with t-SNE. 
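</p><p>The separation check can be illustrated with a small pure-Python sketch. The points below are made-up two-dimensional stand-ins; the real inputs are high-dimensional sentence embeddings produced by the LLM:</p>

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient for two groups: for each point,
    a = mean distance to its own group, b = mean distance to the other
    group, s = (b - a) / max(a, b). Values near 1 mean well separated."""
    scores = []
    for i, p in enumerate(points):
        same = [math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        other = [math.dist(p, q) for j, q in enumerate(points)
                 if labels[j] != labels[i]]
        a, b = sum(same) / len(same), sum(other) / len(other)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy stand-ins: one tight "appropriate" cluster, one tight "inappropriate" one.
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
labels = ["ok", "ok", "ok", "bad", "bad", "bad"]
print(round(silhouette(points, labels), 2))  # close to 1: clearly separated
```

<p>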
The separation was fairly apparent as can be seen in the figure below.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-12-ai-pipeline-inappropriate-language-detection/ham_spam_separation.png" alt="Visualizing separation between appropriate/inappropriate reviews on model embeddings" /><p class="subtle-text"><small>Visualizing separation between appropriate/inappropriate reviews on model embeddings</small></p></div><p>Encouraged by this, we minimally fine-tuned the same model on the dataset for the given classification task and saw successful results on the class-balanced dataset (see metrics below).</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-12-ai-pipeline-inappropriate-language-detection/balanced_data_model_metrics.png" alt="Trained model metrics on balanced test data" /><p class="subtle-text"><small>Trained model metrics on balanced test data</small></p></div><p>Although the metrics were promising, we still needed to assess the false positive rate generated by the model in real-time traffic. This is because the spam prevalence in actual traffic is very low, so we needed to be extremely careful in our assessment of the model’s performance in real-time and choose a threshold that helps generate high precision.</p><p>In order to simulate the model’s performance in real-time, we generated many sets of mock traffic data with different degrees of spam prevalence. The result of this analysis allowed us to determine the model threshold at which we can identify inappropriate reviews with an accepted range of confidence. Now we were ready to push the model’s deployment to actual traffic on Yelp.</p><p>The following flow diagram illustrates the deployment architecture. Historical reviews stored in Redshift were selected for labeling and similarity matching (as described in the data curation section). 
The curated dataset is stored in an S3 bucket and fed into the model training batch script. The model generated from the batch is registered in MLflow, from which it is loaded into MLeap for serving predictions inside a service container (the model server component in the picture below). Please refer to this <a href="https://engineeringblog.yelp.com/2020/07/ML-platform-overview.html">blog post</a> from 2020 for more details on Yelp’s ML platform.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2024-03-12-ai-pipeline-inappropriate-language-detection/deployment_architecture.png" alt="Model training &amp; deployment process" /><p class="subtle-text"><small>Model training &amp; deployment process</small></p></div><p>Since incorporating LLMs to help detect harmful and inappropriate content, we have enabled our moderators to proactively prevent <strong>23,600+ reviews from ever publishing to Yelp in 2023</strong>.</p><p>Yelp makes significant investments in its content moderation efforts to protect consumers and businesses. Recent advancements in Large Language Models have showcased their potential in understanding context, presenting us with a significant opportunity in the field of inappropriate content detection. Through a series of strategies, we have now deployed a Large Language Model to live traffic for the purpose of identifying reviews that contain egregious instances of hate speech, vulgar language, or threats and are thereby not in compliance with our Content Guidelines. The flagged reviews are manually reviewed by our User Operations team, and through this combined effort, we have proactively prevented several harmful reviews from ever being published on Yelp. However, we still continue to rely on our community of users to report inappropriate reviews. 
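</p><p>To see why the mock-traffic analysis above was necessary, note that for a fixed model, precision falls sharply as spam prevalence drops. A small sketch makes this concrete; the recall and false-positive-rate figures below are invented purely for illustration:</p>

```python
# With a fixed per-review recall and false positive rate, precision is a
# function of prevalence (Bayes' rule). Numbers are illustrative only.
def precision(recall: float, fpr: float, prevalence: float) -> float:
    true_pos = recall * prevalence
    false_pos = fpr * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

for prev in (0.5, 0.05, 0.001):
    print(f"prevalence={prev:>6}: precision={precision(0.95, 0.05, prev):.3f}")
```

<p>On a class-balanced test set such a model looks excellent, but at realistic prevalence most flags would be false positives, which is why the decision threshold had to be chosen against traffic-like class balance.</p><p>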
Based on the decisions made by moderators and subsequent retraining of the model, we anticipate further improvements in the model’s recall in the future.</p><p>I would like to acknowledge everyone who was involved in this project. Special thanks to Marcello Tomasini, Jonathan Wang, and Jiachen Zhao for contributing to the design and implementation of the work described here. I’d also like to thank members of the ML infra team, Yunhui Zhang, Ludovic Trottier, Shuting Xi, and Jason Sleight for enabling LLM deployment, and the members of the User Operations team.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2024/03/ai-pipeline-inappropriate-language-detection.html</link>
      <guid>https://engineeringblog.yelp.com/2024/03/ai-pipeline-inappropriate-language-detection.html</guid>
      <pubDate>Tue, 12 Mar 2024 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Building data abstractions with streaming at Yelp]]></title>
      <description><![CDATA[<p>Yelp relies heavily on streaming to synchronize enormous volumes of data in real time. This is facilitated by Yelp’s underlying <a href="https://engineeringblog.yelp.com/2016/07/billions-of-messages-a-day-yelps-real-time-data-pipeline.html">data pipeline infrastructure</a>, which manages the real-time flow of millions of messages originating from a plethora of services. This blog post covers how we leverage Yelp’s extensive streaming infrastructure to build robust data abstractions for our offline and streaming data consumers. We will use Yelp’s Business Properties ecosystem (explained in the upcoming sections) as an example.</p><p>Let’s start by covering certain key terms used throughout the post:</p><ul><li>
<p><strong>Offline systems</strong> - data warehousing platforms such as AWS Redshift or <a href="https://engineeringblog.yelp.com/2021/04/powering-messaging-enabledness-with-yelps-data-infrastructure.html">Yelp’s Data Lake</a>, which are intended for large-scale data analysis</p>
</li>
<li>
<p><strong>Online systems</strong> - systems designed around high-performance SQL and NoSQL database solutions like MySQL or Cassandra, specifically built to handle and serve live traffic in real time, typically via REST APIs over HTTP. These databases are optimized for swiftly processing and delivering data as it’s generated or requested, making them crucial for applications and services that require immediate access to up-to-date information</p>
</li>
</ul><p>Generally speaking, ‘Business Property’ can be any piece of data that is associated with a Yelp business. For example, if we’re talking about a restaurant, its business properties could include things like what payment methods it accepts, what amenities it provides, and when it is open for business.</p><p>There are two types of business properties: Business Attributes and Business Features. You may notice that the terms ‘attributes’ and ‘features’ are synonymous with each other, and that’s no accident. The primary distinction is that Business Attributes belong to the legacy system, <strong>yelp-main</strong>, while Business Features are in a dedicated microservice, aligning with Yelp’s transition to Service-Oriented Architecture.</p><p>We also gather additional metadata about business properties themselves, such as when they were last modified, how confident we are in their accuracy, and where they originated from. This additional information is referred to as “properties metadata.” We store this metadata in a separate table, which contains data about both Business Features and Business Attributes.</p><p>Business properties data is accessed via two primary methods: HTTP APIs for real-time online applications and streaming for offline data synchronization. This post mainly focuses on the streaming aspect.</p><h2 id="existing-business-properties-streaming-architecture">Existing Business Properties’ streaming architecture</h2><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/building-data-abstractions-with-streaming-at-yelp/existing_streaming_architecture.png" alt="Existing Business Properties' streaming architecture" /><p class="subtle-text"><small>Existing Business Properties' streaming architecture</small></p></div><ol><li>
<p>In yelp-main’s MySQL database, data for Business Attributes is scattered across more than a dozen tables. To share this data efficiently, we employ the <a href="https://engineeringblog.yelp.com/2016/08/streaming-mysql-tables-in-real-time-to-kafka.html">MySQL Replication Handler</a> to push it to <a href="https://kafka.apache.org/intro">Kafka</a></p>
</li>
<li>
<p>Business Features and metadata for business properties are stored in their respective tables in Cassandra, and we use the <a href="https://engineeringblog.yelp.com/2019/12/cassandra-source-connector-part-1.html">Cassandra Source Connector</a> to publish their data into Kafka</p>
</li>
<li>
<p>Ultimately, we use <a href="https://engineeringblog.yelp.com/2016/10/redshift-connector.html">Redshift Connector</a> to synchronize data from all these tables with their corresponding tables in Redshift. This process allows us to maintain an up-to-date dataset in Redshift for analysis and reporting</p>
</li>
</ol><h2 id="challenges-with-the-existing-workflow">Challenges with the existing workflow</h2><ul><li>
<p><strong>Weak Encapsulation</strong>: Storing data in offline systems exactly as it is stored in source databases forces our clients to understand the inner workings of the source data, which weakens data encapsulation. Ideally, we wanted to abstract away distinctions like ‘Business Features’ and ‘Business Attributes’ and hide implementation details from clients to simplify their interactions. Furthermore, exposing raw data to offline consumers can lead to the disclosure of outdated or incorrect information. Transformation layers via REST APIs prevented online users from facing data discrepancies. However, offline users analyzing raw data still had to grapple with data accuracy issues, such as managing soft-deleted entries.</p>
</li>
<li>
<p><strong>Discovery and consumption</strong>: The lack of proper abstractions also made data analysis and consumption challenging, as it meant that consumers, whether Product Managers, Data Analysts, or batch processing systems, had to create multiple workflows to collect data from various sources. Dealing with edge cases and transforming data into a consistent schema added significant effort and cost, increasing the friction of consumption and reducing the general utility of the data.</p>
</li>
<li>
<p><strong>Maintenance challenges</strong>: It also posed certain maintenance challenges as any alteration in the source schema necessitated corresponding changes in the destination store. Ideally, we would prefer the destination store’s schema to be more flexible, dynamic, and less susceptible to changes. This minimizes disruptions for users and mitigates the risk of infrastructure problems due to frequent schema upgrades. It also underscores the fact that a storage schema suitable for one database system might not be ideal for another.</p>
</li>
</ul><p>We did explore various alternatives, including a non-streaming solution that involved using Apache Spark for routine batch executions to generate data dumps in diverse formats. However, as some of the data consumer use cases required relatively real-time updates, we had to lean towards a streaming approach.</p><h2 id="building-robust-data-abstractions-for-both-offline-and-streaming-data-consumers">Building robust data abstractions for both offline and streaming data consumers</h2><p>We tackled the aforementioned challenges by treating both streaming and offline data consumption as just additional channels for accessing and utilizing data, much like online HTTP clients. Similar to how we simplify complexities for online data consumers through REST APIs, we aimed to provide a consistent experience for streamed data by abstracting away internal implementation details. This means that if a client service transitions from consuming data directly through REST APIs to an asynchronous streaming approach, it will encounter similar data abstractions. For example, just as online consumers won’t see stale or invalid data, the same principle applies to streamed data consumers.</p><p>In order to achieve the same, we implemented a unified stream that delivers all relevant business property data in a consistent and user-friendly format. This approach ensures that Business Property consumers are spared from navigating the nuances between Business Attributes and Features or understanding the intricacies of data storage in their respective online source databases.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/building-data-abstractions-with-streaming-at-yelp/new_streaming_architecture.png" alt="New consolidated business properties streaming architecture" /><p class="subtle-text"><small>New consolidated business properties streaming architecture</small></p></div><ol><li>
<p><strong>Business Attributes data collection and transformation</strong>: we utilize <a href="https://beam.apache.org/">Apache Beam</a> with <a href="https://flink.apache.org/">Apache Flink</a> as the distributed processing backend for data transformation and formatting Business attribute data. Apache Beam transformation jobs process data originating from various input streams generated by the MySQL replication handler. These streams contain replicated data from their corresponding MySQL tables. The transformation jobs are responsible for standardizing the incoming streaming data, transforming it into a consistent format across all business properties. The transformed data is then published into a single unified stream.</p>
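<p>As an illustration, the standardization step can be sketched in plain Python (the real jobs are Apache Beam transforms; the field names below are hypothetical):</p>

```python
# Plain-Python stand-in for the per-record standardization performed by the
# Beam transformation jobs. Field names are hypothetical.

def standardize_attribute(raw):
    """Map a replicated MySQL attribute row to the unified property format."""
    return {
        "business_id": raw["business_id"],
        "property_name": raw["attribute_name"],
        "property_value": raw["attribute_value"],
        "source": "business_attribute",  # lets consumers ignore the origin store
        "updated_at": raw["row_updated_at"],
    }

unified = standardize_attribute({
    "business_id": 42,
    "attribute_name": "accepts_credit_cards",
    "attribute_value": "true",
    "row_updated_at": "2024-01-15T10:00:00Z",
})
```

<p>In the real pipeline this mapping would run inside a Beam transform, with one such mapping per source table, all emitting into the single unified stream.</p>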
</li>
<li>
<p><strong>Streaming Business Features</strong>: in a similar fashion, the output stream for Business Features, sourced from Cassandra using a <a href="https://engineeringblog.yelp.com/2019/12/cassandra-source-connector-part-1.html">source connector</a>, also has its dedicated Apache Beam transformer job. This job formats the data to match the unified format used for Business Attributes, and the resulting data is published into the same unified output stream.</p>
</li>
<li>
<p><strong>Enrich data with properties metadata</strong>: we employed a <a href="https://engineeringblog.yelp.com/2018/12/joinery-a-tale-of-unwindowed-joins.html">Joinery Flink</a> job - a homegrown solution at Yelp commonly used for joining data across multiple Kafka topics - to amalgamate the business data for both Business Attributes and Features with the corresponding metadata. As a result, the data stream not only contains the business properties data but also the relevant metadata linked to each property.</p>
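<p>Conceptually, the join keeps the latest metadata seen for each property and enriches each property event with it. A minimal sketch (not the actual Joinery/Flink code; field names are hypothetical):</p>

```python
# Minimal keyed-join sketch: buffer the latest metadata per property name and
# enrich each property event with it. Joinery does this with Flink state over
# Kafka topics; the field names here are hypothetical.

metadata_by_property = {}

def on_metadata(event):
    # Remember the most recent metadata for this property name.
    metadata_by_property[event["property_name"]] = event["metadata"]

def on_property(event):
    # Attach whatever metadata we have seen for this property so far.
    return {**event, "metadata": metadata_by_property.get(event["property_name"], {})}

on_metadata({"property_name": "accepts_credit_cards",
             "metadata": {"display_name": "Accepts Credit Cards"}})
enriched = on_property({"business_id": 42,
                        "property_name": "accepts_credit_cards",
                        "property_value": "true"})
```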
</li>
<li>
<p><strong>Final data formatting</strong>: a final transformation job addresses data inconsistencies, removes invalid data entries, and adds any necessary supplementary fields before the consolidated business-properties-with-metadata stream is exposed for consumption.</p>
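<p>A sketch of what such a cleanup pass might do, assuming hypothetical field names and a soft-delete flag:</p>

```python
# Hypothetical cleanup pass: drop soft-deleted or malformed entries and keep
# only the latest record per (business_id, property_name).

def finalize(records):
    latest = {}
    for rec in records:
        if rec.get("is_deleted") or "property_value" not in rec:
            continue  # hide soft-deleted / invalid data from consumers
        key = (rec["business_id"], rec["property_name"])
        if key not in latest or rec["updated_at"] > latest[key]["updated_at"]:
            latest[key] = rec  # newer record wins, removing duplicates
    return list(latest.values())

cleaned = finalize([
    {"business_id": 1, "property_name": "wifi", "property_value": "free", "updated_at": 1},
    {"business_id": 1, "property_name": "wifi", "property_value": "paid", "updated_at": 2},
    {"business_id": 2, "property_name": "wifi", "property_value": "no", "updated_at": 1,
     "is_deleted": True},
])
```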
</li>
<li>
<p><strong>Offline data storage</strong>: the processed business properties data, complete with metadata, is made available for offline consumption and ends up in Redshift through the Redshift Connector. Additionally, it is ingested into Yelp’s Data Lake using a Data Lake connector, making it available for a broader range of analytics and data processing tasks.</p>
</li>
<li>
<p><strong>Real-time consumption and integration</strong>: the same consolidated data stream can cater to real-time consumption by other services within the organization. We use the same stream to sync business property data with Marketing systems, as they require timely syncs for their campaigns.</p>
</li>
</ol><p>To summarize, with the architecture described above, we have created a unified business properties stream addressing the challenges with the existing workflow mentioned above. This stream is utilized to sync business properties data into offline systems, enabling users to access all business properties through a singular schema, thereby facilitating data discovery, consumption, and overall ease of use.</p><p>Additionally, this approach allowed us to enrich business property data with associated metadata and resolve data inconsistencies, such as removing duplicate business properties etc. We used the <a href="https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80%93value_model">entity–attribute–value (EAV) model</a>, which accommodates the frequent introduction of new business properties without requiring modifications to the destination store schemas, hence reducing some of the maintenance overhead.</p><p>This post shows how Yelp’s robust data pipeline infrastructure can be leveraged to create sophisticated data pipelines that provide data in formats which are more suited and beneficial for both offline and streaming users. 
This doesn’t imply that streaming and exposing raw data is never appropriate; in such situations, it may be more effective to offer multiple streams: one with the raw data and others with processed data better suited to analysis and consumption.</p><p>I would like to thank the members of the Semantic Business Information team and the different streaming teams at Yelp that helped in making this project a reality.</p><p>Special thanks to Joshua Flank, Abhishek Agarwal, Ryan Irwin and Sudhakar Duraiswamy for providing insightful inputs and reviewing the blog.</p><div class="island job-posting"><h3>Become a Data Backend Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2024/03/building-data-abstractions-with-streaming-at-yelp.html</link>
      <guid>https://engineeringblog.yelp.com/2024/03/building-data-abstractions-with-streaming-at-yelp.html</guid>
      <pubDate>Fri, 08 Mar 2024 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Coordinator - The Gateway For Nrtsearch]]></title>
      <description><![CDATA[<p>While we once used Elasticsearch at Yelp, we have since built a replacement called Nrtsearch. The benefits and motivations of this switch can be found in our blog post: <a href="https://engineeringblog.yelp.com/2021/09/nrtsearch-yelps-fast-scalable-and-cost-effective-search-engine.html">Nrtsearch: Yelp’s Fast, Scalable and Cost Effective Search Engine</a>. However in this blog post, we will discuss the motivations behind building Nrtsearch Coordinator - a gateway for Nrtsearch clusters. We will also go over how Nrtsearch Coordinator adds sharding logic to Nrtsearch, handles scatter-gather queries, and adds support for dark/live launching cluster improvements.</p><p>We traditionally used a gateway to call Elasticsearch, which provides metrics, isolation rate-limiting per client, and geo sharding, and it also eases Elasticsearch upgrades (see <a href="https://www.youtube.com/watch?v=1D1ED4KxxWQ">Yelp’s Elasticsearch-based Ranking Platform - Indexing and Defense Mechanisms</a> for more details). However, we couldn’t use the same gateway for Nrtsearch for a few reasons:</p><ol><li>It was using the <a href="https://github.com/Netflix/Hystrix">Hystrix</a> library for rate-limiting and isolation which has been deprecated for a while.</li>
<li>It was running on Java 1.8 since Hystrix is not supported on newer Java versions.</li>
<li>It exposed a REST API with JSON while Nrtsearch uses gRPC and Protobuf. Converting the Protobuf messages to JSON would make the responses much larger and harder to parse for clients.</li>
<li>It was built for geo sharding but we needed to shard the data using multiple strategies.</li>
<li>It used a Cassandra-based system instead of our more recent <a href="https://engineeringblog.yelp.com/2018/06/fast-order-search.html">Flink-based Elasticpipe</a> for indexing.</li>
</ol><p>We considered modernizing the gateway and supporting the required features, but it would have required a lot of changes in the gateway and also in existing applications. Instead we decided to build <strong>Nrtsearch Coordinator</strong> to address all the issues with the previous gateway. It runs on the latest Java version, uses gRPC and Protobuf, and also has more required features. These features are discussed in detail below.</p><h2 id="sharding">Sharding</h2><p>Nrtsearch clusters have a single primary (which does all the indexing) and multiple replicas which serve search requests. The replicas start up by downloading a copy of the index from S3, and then connect to the primary to get the real-time indexing updates. We also have the replicas keep the docvalues (column-based per-field data structures that are read sequentially) for the entire index in memory using OS disk cache for faster retrieval for search requests. This design presents two challenges:</p><ol><li>Index size is limited by the amount of memory we can get in an instance. Larger instances are also more expensive.</li>
<li>Replicas will require more time to bootstrap the larger an index is – since the download from S3 will take longer – increasing the time it takes to scale up the number of replicas when there is an increase in search traffic.</li>
</ol><p>While these challenges won’t present issues for small indices (sized in 10s of GBs), they will for larger indices (100s of GBs). This is a typical problem faced by databases since data size can easily grow beyond the space available on a disk. Databases typically “shard” (create chunks of) large amounts of data and distribute them across multiple nodes so that each node has a manageable data size. The Nrtsearch Coordinator allows us to do the same for Nrtsearch, but instead of distributing data across multiple nodes in a cluster, we do it across multiple Nrtsearch clusters. We call this logical grouping of clusters a “cluster group.”</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/coordinator-the-gateway-for-nrtsearch/indexing_and_search_with_coordinator.png" alt="Interactions between Nrtsearch primaries and replicas of clusters in a cluster group, and Nrtsearch Coordinator" /><p class="subtle-text"><small>Interactions between Nrtsearch primaries and replicas of clusters in a cluster group, and Nrtsearch Coordinator</small></p></div><p>We can easily create the required number of Nrtsearch clusters, and then Nrtsearch Coordinator will direct both indexing (including add document, delete and commit requests) and search requests to the right clusters. All of these requests include a sharding parameter object which contains the required information for Nrtsearch Coordinator to send the request to the right cluster. Nrtsearch Coordinator also needs a sharding configuration which defines how the sharding will be performed. The information within the sharding parameter and the required configuration will depend on the type of sharding being used:</p><ol><li>
<p><strong class="c1">ID sharding</strong></p>
<p>ID sharding simply takes an integer modulo the number of clusters/shards to determine which cluster to index into or search. While the name implies that the integer must be an ID, it may or may not be the document ID. The sharding configuration needs to map the numbers 0 to n-1 (where n is the number of Nrtsearch clusters) to an Nrtsearch cluster and index name. Example ID sharding configuration:</p>
<div class="language-plaintext highlighter-rouge highlight"><pre>clusters_to_indices:
  0:
    cluster_1: index_name_1
  1:
    cluster_2: index_name_2
  2:
    cluster_3: index_name_3
</pre></div>
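<p>With a configuration like the one above, routing reduces to a modulo and a lookup. A sketch (assuming three clusters, as in the example):</p>

```python
# Sketch of ID-sharding routing: mod the sharding integer by the number of
# clusters and look up the target cluster/index, mirroring the config above.

CLUSTERS_TO_INDICES = {
    0: ("cluster_1", "index_name_1"),
    1: ("cluster_2", "index_name_2"),
    2: ("cluster_3", "index_name_3"),
}

def route_by_id(sharding_id):
    shard = sharding_id % len(CLUSTERS_TO_INDICES)
    return CLUSTERS_TO_INDICES[shard]

# Requests sharing a sharding integer always land on the same cluster,
# e.g. all reviews sharded on the same business ID.
cluster, index = route_by_id(1000003)  # 1000003 % 3 == 1
```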
</li>
<li>
<p><strong class="c1">Geo sharding</strong></p>
<p>With geo sharding, data in the same region is stored in a single cluster. The sharding parameter may contain a geo point (latitude and longitude) or a geo box (two geo points representing opposite corners of a rectangular area). The sharding configuration needs to contain a mapping from geo box to an Nrtsearch cluster and index name. A request is mapped to an Nrtsearch cluster if its point or box is contained in the corresponding geo box. We add some fudge factor to index businesses at the boundary to keep the search behavior consistent. Example geo sharding configuration:</p>
<div class="language-plaintext highlighter-rouge highlight"><pre>geoshards:
  - index_name: west_americas
    cluster_name: search_west
    bounds:
      min_latitude: -90.0
      max_latitude: 90.0
      min_longitude: -170.0
      max_longitude: -100.0
  - index_name: east_americas
    cluster_name: search_east
    bounds:
      min_latitude: -90.0
      max_latitude: 90.0
      min_longitude: -100.0
      max_longitude: -30.0
</pre></div>
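<p>Routing for geo sharding amounts to a containment check against the configured bounds. A sketch using the example configuration above (ignoring the boundary fudge factor):</p>

```python
# Sketch of geo-sharding routing: pick the shard whose bounding box contains
# the request's geo point (mirrors the example configuration above).

GEOSHARDS = [
    {"index_name": "west_americas", "cluster_name": "search_west",
     "bounds": {"min_latitude": -90.0, "max_latitude": 90.0,
                "min_longitude": -170.0, "max_longitude": -100.0}},
    {"index_name": "east_americas", "cluster_name": "search_east",
     "bounds": {"min_latitude": -90.0, "max_latitude": 90.0,
                "min_longitude": -100.0, "max_longitude": -30.0}},
]

def route_by_point(lat, lon):
    for shard in GEOSHARDS:
        b = shard["bounds"]
        if (b["min_latitude"] <= lat <= b["max_latitude"]
                and b["min_longitude"] <= lon <= b["max_longitude"]):
            return shard
    raise ValueError("point not covered by any geo shard")

# New York (~40.7, -74.0) falls inside the east_americas bounds.
shard = route_by_point(40.7, -74.0)
```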
</li>
<li>
<p><strong class="c1">Default sharding</strong></p>
<p>This implies that we are only using a single Nrtsearch cluster and not sharding the data. The sharding parameter need not contain anything, while the sharding configuration needs the single Nrtsearch cluster and index name. Example default sharding configuration:</p>
<div class="language-plaintext highlighter-rouge highlight"><pre>cluster_name: search
index_name: business_v1
</pre></div>
</li>
</ol><p>We select one of these sharding strategies:</p><ul><li>If the index size is small enough to fit on a single cluster, use default sharding.</li>
<li>If the index is large, can be split by location, and every search query only has a single geo area, use geo sharding.</li>
<li>Use ID sharding for everything else.</li>
</ul><p>When sharding data, databases generally try to split the data evenly across all shards. Queries are fanned out to all shards and then the results are combined. As you can see with ID sharding (unless using document IDs as the sharding parameter) or geo sharding, there is no guarantee that the data will be evenly distributed across Nrtsearch clusters. These sharding strategies can only be used with search queries that access a single shard. Say you have a geo shard for the Eastern U.S. and you have a search request that only needs results within the area of New York. You can direct the search request to the New York shard by setting the sharding parameter to the geo box containing New York. In addition to that you can also add a <a href="https://nrtsearch.readthedocs.io/en/latest/queries/geo_bounding_box.html">geo bounding box</a> to the query to limit the results to New York.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/coordinator-the-gateway-for-nrtsearch/geosharding.png" alt="Geo sharding example" /><p class="subtle-text"><small>Geo sharding example</small></p></div><p>This works with ID sharding too. You can search over all reviews of a single business by ID sharding on business ID instead of review ID. Also since we run Nrtsearch on Kubernetes we can individually set the resources for primaries and replicas in each cluster, and also the number of replicas. For example:</p><ul><li>If a cluster has a small index we can set it to have less memory.</li>
<li>If a cluster has only a few updates we can reduce the CPU on the primary.</li>
<li>If a cluster receives more traffic than other clusters, its replicas can scale up and service the traffic. There is no need to increase the number of replicas for other clusters.</li>
</ul><p>All we need is that the index sizes on each cluster are small enough that the docvalues fit in memory and that Nrtsearch can download the index and startup within a few minutes. But if your search query requires searching over all data across multiple shards, we can ID shard on the document ID to have all data evenly spread across all clusters and use scatter-gather.</p><h2 id="scatter-gather">Scatter-Gather</h2><p>Nrtsearch Coordinator also supports scatter-gather, in other words, it can fan out search requests to all clusters and combine the responses for use-cases where we cannot apply application level sharding logic. This can be used with any type of sharding but is best used with ID sharding using document ID in the sharding parameter to evenly distribute the data and also search load.</p><p>Processing a search request this way enables parallel processing and improves performance for searches over huge datasets contained in a cluster group. Consider an Nrtsearch index that contains reviews and is sharded by review ID. Scatter-Gather can be used to query all reviews containing the word pizza across all clusters. In this case we can send the same query to all the clusters and combine the responses to rank them accordingly.</p><p>We implemented scatter-gather to distribute an incoming search request across multiple clusters using multi-threading to invoke all the search tasks in parallel and with appropriate timeouts to process the request. Nrtsearch Coordinator acts as a collector for these individual search responses. All the logic needed to merge and sort these responses are built into Nrtsearch Coordinator. This requires scatter-gather to be performant to take advantage of Nrtsearch’s high performance searches on each cluster.</p><p>The Nrtsearch Coordinator merges the responses as they are received. The hits are ranked either according to the relevance scores or the query’s sort field type. 
We use a heap data structure to merge the results and to retain the top N document IDs requested by the client. Currently if any request to a cluster errors out we return an error in the response. Support for partial responses is discussed in the future work section.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/coordinator-the-gateway-for-nrtsearch/scater_gather_query_and_fetch.png" alt="Nrtsearch Coordinator Scatter-Gather feature" /><p class="subtle-text"><small>Nrtsearch Coordinator Scatter-Gather feature</small></p></div><p>An Nrtsearch search response contains the hits results, search diagnostics, collector or aggregation results and several other metrics and information about the search query that is processed. All of these fields are merged accordingly to enrich the combined search response with all the useful information.</p><p>When <a href="https://nrtsearch.readthedocs.io/en/latest/additional_collectors.html">aggregations</a> such as Terms aggregation are requested, Nrtsearch uses collectors to get results from individual segments of an index and a reduce logic computes the aggregations per cluster. If topN results are requested, for example, we get the topN from each shard to combine and sort the individual responses. We use a query-and-fetch approach here instead of query-then-fetch since we did not experience any latency concerns for our current use cases. However in the future, we plan to implement a query-then-fetch approach to handle large search requests to clusters with a higher number of shards. For search clients that require higher accuracy when dealing with imbalanced shards, we will be fetching more than the requested number of results from each shard so that the final topN results have the highest accuracy and relevance.</p><p>In Nrtsearch Coordinator, we recursively process the results of these collectors and the nested collectors within them to merge the responses. 
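</p>
<p>The heap-based top-N merge of hits described above can be sketched as follows (the hit shape is hypothetical):</p>

```python
import heapq

# Sketch of the heap-based merge: retain only the top N hits (by score)
# across the per-cluster responses. The hit shape is hypothetical.

def merge_top_n(cluster_responses, n):
    heap = []     # min-heap of (score, tiebreak, hit); root is the worst kept hit
    counter = 0   # unique tiebreak so equal scores never compare the hit dicts
    for hits in cluster_responses:
        for hit in hits:
            heapq.heappush(heap, (hit["score"], counter, hit))
            counter += 1
            if len(heap) > n:
                heapq.heappop(heap)  # evict the current lowest-scoring hit
    return [hit for _, _, hit in sorted(heap, reverse=True)]

responses = [
    [{"id": "a", "score": 0.9}, {"id": "b", "score": 0.2}],
    [{"id": "c", "score": 0.7}],
]
top2 = merge_top_n(responses, 2)  # hits "a" and "c", best first
```

<p>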
These results are then ordered and processed using a priority queue to have top buckets of certain size in the final aggregation result.</p><p>Some search requests can take too long to be processed, which can cause timeouts in the Nrtsearch cluster. The reasons why the query could not be processed within a reasonable time may vary from queries that require ranking a large number of documents, to a lack of resources in the Nrtsearch cluster. We log these slow queries along with the time taken to understand the root cause behind the slow processing time. The slow query is logged in Nrtsearch Coordinator because sharding is not part of Nrtsearch. It would not be possible to investigate a sharding problem if we were logging the slow query through Nrtsearch instead of Nrtsearch Coordinator.</p><p>It is important to note that the information in the slow query log does not contain any sort of sensitive information that could harm users’ privacy. The term “slow” is subjective and configurable in the Nrtsearch Coordinator configuration file. This is an example of a slow query configuration:</p><div class="language-plaintext highlighter-rouge highlight"><pre>queryLogger:
  defaultStreamName: all_slow_queries
  timeTakenMsToLoggingPercentage:
    # 1% of the queries that took more than 150ms but not more than 350ms
    # will be logged into the default all_slow_queries stream
    150: 0.01
    350: 1.0
  timeTakenMsToStreamName:
    # 100% of the queries that took more than 350ms will be logged in the
    # stream name defined below instead of all_slow_queries
    350: slow_queries_over_350_ms
# fields that should be skipped when logging a search response/request
sensitiveFieldsInSearchResponse: [response_sensitive_field]
sensitiveFieldsInSearchRequest: [request_sensitive_field]
</pre></div><h2 id="dark-and-live-launch">Dark and live launch</h2><p>Many changes on Nrtsearch clusters are only infrastructural and not behavioral. For such infrastructural changes, we look for the following:</p><ol><li>Client code should not require any changes.</li>
<li>The new cluster group should return the same response.</li>
<li>The response from the new cluster group should not be slower than the status quo cluster group.</li>
</ol><p>Dark and live launches (also known as blue-green deployment) are a great way for developers to safely test a new Nrtsearch cluster group by slowly shifting incoming traffic to the new cluster group. A comparison between the responses from the status quo and the new cluster groups is very useful to build confidence in the new cluster group behavior before actually serving live traffic to it, avoiding any negative impact on the clients.</p><p>Nrtsearch Coordinator is a good place to add the dark/live launch features because it already routes requests to the proper Nrtsearch cluster based on the sharding parameters. Dark/live launches also route requests to the proper Nrtsearch cluster group, but based on a traffic percentage. Having this logic in Nrtsearch Coordinator instead of client services also means that any client using Nrtsearch Coordinator during a dark/live launch would have the new Nrtsearch cluster changes without the need of any change on the client side.</p><p>All of the traffic percentage and launch type (status quo, dark launched, and live launched) definitions are configurable in the Nrtsearch Coordinator configuration file. Currently, dark/live launches only work for search requests. We can define the different types of launches as follows:</p><ul><li><strong>Status quo</strong> - Status quo is the cluster group that Nrtsearch Coordinator currently sends all search requests to.</li>
<li><strong>Dark launch</strong> - Dark launched cluster groups are the cluster groups that we want to test in a way that does not have any user impact. Dark launching should not affect the status quo response in any way, including the content or timings. To achieve that, Nrtsearch Coordinator sends any search request to the status quo <strong>AND</strong> the dark launched cluster groups. Only the search response from the status quo cluster group is returned to the client. In more detail, the same request is first sent to and processed by the status quo cluster group. Then, the same request is sent to the dark launched cluster group, but in a different thread such that the response from the status quo cluster group is not blocked and it can be returned right away to the client. As a result, we can keep track of both the status quo and the dark launched cluster group responses for the same request. These responses and the search request are logged so that we can later compare if both cluster groups behave the same (more in Comparison Report section).</li>
<li><strong>Live launch</strong> - Live launched cluster groups are cluster groups that usually went through a dark launch first and can now be gradually exposed to users. When live launching, Nrtsearch Coordinator sends any search request to the status quo <strong>OR</strong> one of the live launched cluster groups. The response from the selected (status quo or live launched) cluster group is returned to the user. Since the same request is not sent to both the status quo and the live launched cluster groups, we do not have a comparison log similar to what we have during dark launch.</li>
</ul><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/coordinator-the-gateway-for-nrtsearch/launch_router_example.png" alt="How dark/live launch works in Nrtsearch Coordinator" /><p class="subtle-text"><small>How dark/live launch works in Nrtsearch Coordinator</small></p></div><p>Besides defining the status quo as well as the dark/live launched cluster groups, Nrtsearch Coordinator also needs to know by how much it should route the search traffic to these cluster groups, which can happen from 0% to 100%. A common dark/live launch flow looks like the following:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/coordinator-the-gateway-for-nrtsearch/dark_live_launch_flowchart.png" alt="Dark/live launch flow" /><p class="subtle-text"><small>Dark/live launch flow</small></p></div><h3 id="comparison-report">Comparison report</h3><p>We developed a comparison report tool with the purpose of facilitating the comparison of Nrtsearch search responses between the status quo and dark launched cluster groups. Since we log the status quo and dark launched responses for the same request, we can use these logs to check the behavior of the dark launched cluster group against the status quo. Each line in this log contains the search request, the search response of the status quo, and the search response of the dark launched cluster groups. The comparison report tool uses this log to compare the responses and generates a summary of the comparison, by checking the response equality in the following order: total hits → hit fields that are ids → remaining hit fields → hit scores. The complete Nrtsearch response structure can be found <a href="https://github.com/Yelp/nrtsearch/blob/0aec087cc083d07ea39802d1574b3ae2e19732d1/clientlib/src/main/proto/yelp/nrtsearch/search.proto#L551-L630">here</a>. 
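</p>
<p>A minimal sketch of that comparison order, using a hypothetical, simplified response shape:</p>

```python
# Sketch of the equality check, in the order described above:
# total hits -> id fields -> remaining hit fields -> hit scores.
# The response shape here is a hypothetical simplification.

def compare(sq, dark, id_fields):
    if sq["total_hits"] != dark["total_hits"]:
        return "total_hits_mismatch"
    for sq_hit, dark_hit in zip(sq["hits"], dark["hits"]):
        for field in id_fields:
            if sq_hit["fields"].get(field) != dark_hit["fields"].get(field):
                return "id_mismatch"
        if sq_hit["fields"] != dark_hit["fields"]:
            return "field_mismatch"
        if sq_hit["score"] != dark_hit["score"]:
            return "score_mismatch"
    return "match"

status_quo = {"total_hits": 1,
              "hits": [{"fields": {"id": "a", "name": "x"}, "score": 1.0}]}
dark = {"total_hits": 1,
        "hits": [{"fields": {"id": "a", "name": "x"}, "score": 0.9}]}
result = compare(status_quo, dark, id_fields={"id"})  # "score_mismatch"
```

<p>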
This is what the comparison report summary looks like:</p><div class="language-plaintext highlighter-rouge highlight"><pre>----- COMPARISON REPORT SUMMARY -----
Dark launch cluster group: test-cluster-group
Total log lines compared: 293
Number of error messages: 15 (5.12% of total log lines)
Number of matching responses: 178 (60.75% of total log lines)
Number of mismatching responses: 100 (34.13% of total log lines)
-- Total hits mismatch stats --
Number of mismatching total hits: 70 (23.89% of total log lines)
Total hits average difference: 60
-- Top hits mismatch stats --
Number of mismatching ids: 7 (2.39% of total log lines)
Number of mismatching fields: 23 (7.85% of total log lines)
Number of mismatching scores: 0 (0.00% of total log lines)
Comparison report saved at nrtsearch_coordinator/generated/comparison_reports/comparison_report_20221109-155500.txt
</pre></div><p>The comparison report is a command-line tool that is part of the Nrtsearch Coordinator repository. While this tool could have been released separately from Nrtsearch Coordinator, we deploy them together to avoid installing and deploying the tool in different environments. It also makes sense to deploy the comparison report tool and Nrtsearch Coordinator together because the comparison tool is tightly coupled with the dark launch log formatting, which is defined in Nrtsearch Coordinator.</p><h2 id="future-work">Future work</h2><ul><li>Support pagination, partial responses, and combining facet results in scatter-gather</li>
<li>Translating coordinator requests to work with API changes in Nrtsearch to avoid changes in clients</li>
<li>Add more sharding strategies which work better for a variety of use-cases</li>
</ul><p>We would like to thank all current and past members of Ranking Infrastructure team at Yelp who have contributed to building Nrtsearch Coordinator including Andrew Prudhomme, Erik Yang, Karthik Alle, Mohammad Mohtasham, Tao Yu, Ziqi Wang, Umesh Dangat, Jedrzej Blaszyk and Samir Desai.</p><div class="island job-posting"><h3>Become a Data Backend Engineer at Yelp</h3><p>Do you love building elegant and scalable systems? Interested in working on projects like Nrtsearch? Apply to become a Data Backend Engineer at Yelp.</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2023/10/coordinator-the-gateway-for-nrtsearch.html</link>
      <guid>https://engineeringblog.yelp.com/2023/10/coordinator-the-gateway-for-nrtsearch.html</guid>
      <pubDate>Fri, 06 Oct 2023 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Overview of JupyterHub Ecosystem]]></title>
      <description><![CDATA[<p>At Yelp, <a href="https://spark.apache.org/">Apache Spark</a> and <a href="https://jupyter.org/">JupyterHub</a> are heavily used for batch processing and interactive use-cases, such as building feature models, conducting ad-hoc data analysis, sharing templates, making onboarding materials, creating visualizations, and producing sales reports.</p><p>Our initial deployments of Jupyter at Yelp were IPython notebooks managed at an individual level. Later, when JupyterLab was released (2018), our notebook ecosystem was extended to Jupyter Servers running on dev boxes, managed by individual engineering teams. Over time, with growing use-cases and data flow, this introduced unnecessary version variability, became error-prone due to the number of manual steps, caused config duplication, lacked comprehensive resource usage and cost monitoring, created security issues, and added maintenance overhead at an organizational level.</p><p>In this blog post, we will discuss the evolution of our JupyterHub ecosystem, which is now managed by a single team and presents an easy-to-use, scalable, robust, and monitored system for all engineers at Yelp. This blog will focus on each major component of the ecosystem and describe its purpose and evolution over time. Finally, we will illustrate the evolution of all the components in a unified chronological order in a diagram.</p><p>The Yelp JupyterHub ecosystem encompasses JupyterHub, our internal notebook archiving service <a href="https://engineeringblog.yelp.com/2020/10/introducing-folium-enabling-reproducible-notebooks-at-yelp.html">Folium</a>, <a href="https://papermill.readthedocs.io/en/latest/">Papermill</a>, <a href="https://engineeringblog.yelp.com/2020/03/spark-on-paasta.html">Spark on PaaSTA</a>, and a Spark Job Scheduling Service (e.g. Mesos or Kubernetes). 
We solved many of our problems through a combination of novel feature development, extension integrations, and migrations of infrastructure components, all while minimizing the impact on existing Jupyter workflows. The diagram below shows the most common workflow for a user to launch a notebook and upload it to Folium.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/overview-of-jupyterhub-ecosystem/jupyterhub-system-overview.png" alt="High-level Architecture of JupyterHub Ecosystem at Yelp" /><p class="subtle-text"><small>High-level Architecture of JupyterHub Ecosystem at Yelp</small></p></div><p><strong>Scale</strong>: Over the years, we have scaled our usage of the JupyterHub ecosystem to several teams owning thousands of batches. To put this into perspective, Spark batch runs have doubled every year. As of today, over 100 service owners own more than 1200 batches, and hundreds of Jupyter and Folium notebooks are executed daily. These run across different underlying hardware (EMR, GPU, spot, on-demand), processing billions of messages and terabytes of data daily.</p><p>Jupyter notebook usage started at Yelp with users launching notebook servers from within a service virtual environment on individual dev boxes. As mentioned earlier in this post, as the scale of usage of our ecosystem increased, it brought a number of challenges, making it harder to manage use-cases at an organizational level.</p><p>As a result, our Spark and JupyterHub infrastructure went through a series of migrations to adapt to newer technologies. The chronological stages of the migrations were as follows:</p><ul><li>The JupyterHub setup on individual dev boxes was later extended to team-based JupyterHub instances running on team dev boxes, which used the open-source Docker spawner to launch user servers. 
This let teams share common infrastructure without each user having to set up and maintain their own server.</li>
<li>A centralized JupyterHub ecosystem was built on top of <a href="https://engineeringblog.yelp.com/2015/11/introducing-paasta-an-open-platform-as-a-service.html">PaaSTA</a>. We started by launching and managing our Jupyter notebooks using a Marathon spawner, while the Spark cluster used a Mesos scheduler to launch its executors. This meant a single team was able to manage the JupyterHub ecosystem, while also providing a single point of entry for launching Spark sessions integrated with PaaSTA infrastructure.</li>
<li>We then adopted Kubernetes, a widely used and well-maintained open-source orchestration platform.
<ul><li>The initial phase involved moving notebook launching away from the Marathon spawner in favor of <a href="https://jupyterhub-kubespawner.readthedocs.io/en/latest/">Kubespawner</a>. At this stage, Spark jobs launched from Jupyter notebooks ran Spark drivers on Kubernetes while executors were still running on Mesos. Moving to Kubespawner opened the door to many <a href="https://jupyterhub-kubespawner.readthedocs.io/en/latest/#features">features</a>: it provided smarter bin packing, centralized management, and improved monitoring of Jupyter nodes inside a Kubernetes cluster.</li>
<li>The next phase involved migrating the Spark executor scheduler from Mesos to Kubernetes. This took us one step further towards Mesos deprecation and enabled auto-scaling of executor instances with Dynamic Resource Allocation. It also opened the door to security-related improvements, such as adding IAM roles for containers through Pod Identity for Spark drivers.</li>
</ul></li>
</ul><p>All this was done under the hood, without impacting the user experience or requiring any service-based migrations.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/overview-of-jupyterhub-ecosystem/launching-spark-session.png" alt="Launching Jupyter notebook and writing Spark job without having to deal with underneath components" /><p class="subtle-text"><small>Launching a Jupyter notebook and writing a Spark job without having to deal with the underlying components</small></p></div><p>One of the goals of the ML Compute team – a team focused on batch and machine learning infrastructure – is to continuously work towards a ‘one-click-set-up-everything’ philosophy. This helps Jupyter and Spark users shift their focus to notebook development instead of infrastructure management. It starts with providing a single web URL entry point for any internal user, as shown in the diagram below. The entry point lets the user launch a Jupyter Server after logging in with their LDAP credentials and using two-factor authentication (2FA).</p><p>The Jupyter Server is run from a Docker image, which users can use directly or customize based on their requirements. 
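For illustration only (this is not Yelp's actual configuration), a Kubespawner-based JupyterHub deployment can offer a choice of images and resource pools at launch time via <code>profile_list</code> in <code>jupyterhub_config.py</code>; the image names and limits below are made up:

```python
# Illustrative jupyterhub_config.py fragment (not Yelp's actual config).
# Kubespawner's profile_list lets users pick an image and resource pool
# at launch time; image names and limits here are hypothetical.
c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

c.KubeSpawner.profile_list = [
    {
        "display_name": "CPU pool (default)",
        "default": True,
        "kubespawner_override": {
            "image": "example.registry/jupyter-cpu:latest",
            "cpu_limit": 4,
            "mem_limit": "8G",
        },
    },
    {
        "display_name": "GPU pool",
        "kubespawner_override": {
            "image": "example.registry/jupyter-gpu:latest",
            "extra_resource_limits": {"nvidia.com/gpu": "1"},
        },
    },
]
```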
These images have all the permissions, environment, packaging, and recommended configurations required to install and run Spark, an otherwise onerous task.</p><p>Customizations to our Jupyter launcher set up user credentials based on assigned AWS roles to access various internal data resources (S3, Redshift), and allow users to select between GPU and CPU pools with custom resource configurations at launch time.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/overview-of-jupyterhub-ecosystem/entrypoint-jupyter-server.png" alt="Single entry point to launch secured and customized notebook server" /><p class="subtle-text"><small>Single entry point to launch secured and customized notebook server</small></p></div><h2 id="customized-jupyter-kernels">Customized Jupyter Kernels</h2><p>The single entry point leads to spawning a JupyterHub server. Most users have to select the right coding environment (Python, SQL, etc.) with the relevant dependencies installed, often referred to as a kernel. Jupyter notebook comes with a default ipykernel built on top of IPython. We built our own internal custom kernels for IPython and SQL, tailored to data-science and other Yelp Jupyter users. Our SQL kernel lets users connect to multiple Datalake or Redshift clusters and execute SQL queries interactively.</p><h2 id="creating-spark-session">Creating Spark Session</h2><p>Now that we have a notebook server ready to use, one can create a Spark Session with a single API call, <em>create_spark_session</em>. Besides returning an active Spark Session, this API internally takes care of the following:</p><ul><li>Deduces the final set of relevant Spark parameters based on different input sources</li>
<li>Deduces the optimal default AWS resource and Docker container configurations</li>
<li>Sets up the required environment variables (e.g., AWS credentials)</li>
<li>Emits a resource-usage monitoring link, a Spark history link, and an estimated cost</li>
<li>Sends a request to Clusterman, another internal system, to spin up a Spark cluster in our shared Spark pool</li>
</ul><p>Once the Spark session is created, a notebook user can focus on developing and iterating on Spark batches and building data-science models on the live Spark cluster. The diagram below shows an example of launching a Spark cluster through a single API call.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/overview-of-jupyterhub-ecosystem/creating-spark-session.png" alt="Creating a Spark Session" /><p class="subtle-text"><small>Creating a Spark Session</small></p></div><h2 id="managing-access-controls">Managing Access Controls</h2><p>Notebook users often want to connect to various AWS resources like Yelp’s Datalake, S3 paths, and Redshift. To keep Yelp’s infrastructure secure, we want to make sure that each notebook developer can only access a designated set of clusters and resources based on their team roles or privileges. Each user at Yelp has a designated set of roles giving them the required access controls to AWS resources and databases, with session-based credentials accessible only after 2FA. To keep the development experience free of manual, multi-step, and error-prone setup for managing access controls, we provide simple UI-based prompts and reminders for initializing and refreshing session-based credentials.</p><p>During the early years of our JupyterHub usage, we relied on syncing each user’s static AWS credentials from a secured S3 location at the time a Jupyter Server launched. Later we moved to federated credentials for batches run by human users. These federated credentials have a lifespan of less than 12 hours and need to be refreshed once they expire. Notebook extensions were added for users to refresh dev or prod credentials with a few button clicks, as shown in the diagram below. 
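To make the credential lifespan concrete, here is a minimal sketch of the kind of check that decides when a refresh prompt is needed. The helper name and the 30-minute safety margin are illustrative assumptions, not Yelp's implementation; only the under-12-hour lifespan comes from the text above.

```python
# Minimal sketch of a credential-expiry check (names and margin assumed).
from datetime import datetime, timedelta, timezone

MAX_LIFESPAN = timedelta(hours=12)  # federated creds live under 12 hours

def needs_refresh(issued_at: datetime, now: datetime,
                  margin: timedelta = timedelta(minutes=30)) -> bool:
    """True once the credentials are expired or within `margin` of expiry."""
    return now >= issued_at + MAX_LIFESPAN - margin
```

A notebook extension could poll such a check and surface the 2FA refresh pop-up only when it returns true.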
The refresh mechanism generates federated credentials, also referred to as temporary credentials, using two-factor authentication linked to one of the designated roles associated with the triggering user. Later, this multi-step process was improved so that users generate credentials as part of a single sign-on process for their designated role. The future plan is to auto-refresh credentials on expiry, so that ongoing jobs, or jobs requiring more than 12 hours of runtime, are not impacted.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/overview-of-jupyterhub-ecosystem/refresh-creds.png" alt="Pop-up option to refresh the credentials using 2FA authentication" /><p class="subtle-text"><small>Pop-up option to refresh the credentials using 2FA</small></p></div><p>Many use-cases of the JupyterHub ecosystem involve multiple users re-running notebooks with different inputs over time. For example, data scientists receive multiple requests to recreate a past report with different sets of inputs. Relying solely on Jupyter notebooks involved a lot of manual steps: starting a Jupyter server, finding notebooks locally or in S3 buckets, updating the code, running it manually, and emailing the outputs to stakeholders. These steps consumed a lot of development time and coordination, were error-prone, and reduced developer velocity.</p><p>To solve this challenge, we built a notebook archiving and sharing service called Folium. Folium integrates with JupyterHub to enable notebook reproducibility and improve developer velocity. A notebook developer can upload their notebook to Folium, then share or re-run it with a single click to get the desired results (e.g., business data, machine learning model outputs, graphs). 
Later versions of Folium introduced tagging, grouping, and versioning of notebooks, followed by integrated generation of temporary AWS role-based credentials for the user re-running a notebook. For more details, refer to our previous engineering blog on Folium: Introducing Folium: Enabling Reproducible Notebooks at Yelp.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/overview-of-jupyterhub-ecosystem/workflow-upload-folium.png" alt="Typical workflow for uploading notebook to Folium" /><p class="subtle-text"><small>Typical workflow for uploading a notebook to Folium</small></p></div><h2 id="parameterizing-notebook-reruns">Parameterizing Notebook Reruns</h2><p>We used the open-source <a href="https://papermill.readthedocs.io/en/latest/">Papermill</a> library for parameterizing and executing Jupyter notebooks. Papermill’s built-in support only allows input/output to/from the local filesystem, and only supports running notebooks on the local machine. Our integration allowed users to directly rerun a templated notebook with different parameters in Folium, without needing to start a Jupyter server, update notebook code with different inputs, or monitor the running status manually. To do this, we adapted Papermill to use an I/O handler, letting Papermill read input notebooks from Folium and write output notebooks with computed results back to Folium, and provided a UI that launches new k8s pods for running individual notebooks.</p><p>Providing a smooth user experience is one of the key goals of our JupyterHub ecosystem’s evolution. As our ecosystem scaled in terms of usage and teams, and with the integration of more systems like Folium, Papermill, and federated credentials, it became necessary to add new features and extensions.</p><p>Here is a summary of some of the JupyterLab extensions and features we added as part of the JupyterHub ecosystem:</p><ul><li><strong>Monitoring</strong>
<ul><li>Slack Notifications for long-running and expensive Jupyter notebooks.</li>
<li>The open-source JupyterLab extension <a href="https://pypi.org/project/jupyterlab-sparkmonitor/">Spark Monitor</a> shows the live status of Spark job execution within the notebook’s cells, which helps users focus on the current job execution status without having to switch between the Spark UI and the Jupyter notebook.</li>
<li>Cluster-level Monitoring on SignalFx/Prometheus: Active notebook run count, percentage of pool (CPU, GPU, on-demand) usage, individual notebook resource usage, data for all the customizations (like kernel, container, user) being used.</li>
</ul></li>
<li><strong>Usability</strong>
<ul><li>Menu buttons to upload and download notebooks from Folium.</li>
<li>Menu buttons to refresh or generate temporary AWS credentials for both development and production access.</li>
<li>Menu button to list a user’s assigned AWS roles and identify their privileges.</li>
</ul></li>
<li><strong>Features</strong>
<ul><li>Side-tabs with a list of available Redshift and Datalake tables. Selecting a particular table auto-generates a code template to connect to and query the respective database.</li>
<li>Integration of <a href="https://pypi.org/project/black/">black</a> and <a href="https://pycqa.github.io/isort/">isort</a> code formatter menu buttons inside JupyterLab.</li>
</ul></li>
<li><strong>Cost Savings</strong>
<ul><li>Extension of the <a href="https://jupyterhub.readthedocs.io/en/stable/tutorial/getting-started/services-basics.html">cull idle notebook</a> server script to identify, report, and kill long-running Spark clusters to save cost. This is in addition to our regular cron job that kills idle notebook servers.</li>
<li>Dynamic Resource Allocation integration to scale down the Spark Cluster when no spark action is in progress.</li>
<li>Shutdown Server menu button to let users manually shut down or restart their server.</li>
</ul></li>
</ul><p>The diagram below summarizes the evolution of the different components of the JupyterHub ecosystem at Yelp in a timeline view.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/overview-of-jupyterhub-ecosystem/timeline-jupyterhub-evolution.png" alt="Timeline flow graph of JupyterHub ecosystem evolution" /><p class="subtle-text"><small>Timeline flow graph of JupyterHub ecosystem evolution</small></p></div><p>At Yelp, our team is committed to the continuous evolution of the JupyterHub ecosystem. We have scaled its usage from individual engineers, to team-based deployments, to our current organization-wide deployments. In the process, we learned a lot about reducing complexity and increasing reliability, allowing our current setup to be maintained and evolved by a single machine learning compute infrastructure team.</p><p>Our vision of increasing development velocity and ease-of-use of our systems, reducing onboarding time, and ensuring security is at the forefront of our team’s continuous efforts and roadmap. We have accomplished this through a combination of adapting open-source projects and current best practices to Yelp infrastructure, while focusing our internal development on developer pain points specific to Yelp’s internal ecosystem.</p><p>Some of our future initiatives include enabling code navigation, expanding support for different types in parametrized notebooks, making Folium notebooks schedulable, increasing adoption of GPU servers for model processing, and auto-refreshing federated credentials.</p><p>Special thanks to everyone on the Core ML, Compute Infrastructure, Security, and other dependent teams for their tireless contributions to building and continuously evolving the JupyterHub ecosystem and keeping it up to date. 
Thanks to Zeke Koziol, Blake Larkin, Jason Sleight, Ryan Irwin, and Jonathan Budning for providing insightful input, sharing historical context, and reviewing this post.</p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2023/07/overview-of-jupyterhub-ecosystem.html</link>
      <guid>https://engineeringblog.yelp.com/2023/07/overview-of-jupyterhub-ecosystem.html</guid>
      <pubDate>Tue, 25 Jul 2023 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Speeding Up Delivery With Merge Queues]]></title>
      <description><![CDATA[<p>Merging code safely can be quite time-consuming for busy repositories. A common method is to test and merge branches serially, one at a time, to ensure the safety of the main branch. However, this method does not scale well when many developers want to merge code at the same time. In this blog post, you’ll see how we’ve sped up code merging at Yelp by creating a batched merge queue system!</p><p>In our <a href="https://engineeringblog.yelp.com/2023/03/gondola-an-internal-paas-architecture-for-frontend-app-deployment.html">blog post about Gondola</a>, our frontend Platform as a Service (PaaS), we talked about the benefits of moving to a monorepo. As we onboarded more teams and developers into our monorepo, we experienced a bottleneck when integrating code changes (merge requests) during peak hours. Ensuring quick code delivery is important to us at Yelp, as it enables us to iterate quickly and ship fast. Whether it’s bundling JavaScript or running mobile builds, we wanted a system that could speed up all our repositories without changing the developer experience (DX).</p><p>We’ve traditionally run pipelines in serial to keep a clean main branch and prevent merge conflicts. However, this does not scale well when many developers want to push code at the same time (we’ve observed our repo is busiest in the morning). As such, we’ve explored different ways to merge code while guaranteeing the same branch safety.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-07-11-speeding-up-delivery-with-merge-queues/status-quo.gif" alt="An illustration showing how branches were integrated traditionally." /><p class="subtle-text"><small>An illustration showing how branches were integrated traditionally.</small></p></div><p>During our exploratory phase, a common method we saw to expedite code delivery was to run pipelines in parallel when merge requests overlap. 
The idea behind this approach is to merge any in-progress merge requests along with the new request in our pipeline. This decreases the time spent waiting between pipelines compared to merging in serial, while still guaranteeing merge safety. If a pipeline fails, however, the failing merge request is removed and new pipelines are started for every merge request that came after it.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-07-11-speeding-up-delivery-with-merge-queues/merge-parallel.gif" alt="An illustration showing how branches could be integrated in parallel." /><p class="subtle-text"><small>An illustration showing how branches could be integrated in parallel.</small></p></div><p>This approach is quite resource-intensive, since for N branches/merge requests, there would be N pipelines running at the same time. On systems with shared/limited resources, this is a heavy burden to carry, as resource constraints may also negatively affect pipelines currently running.</p><p>The merge queue strategy involves batching up merge requests into merge groups and integrating these merge groups sequentially. This approach keeps our resource usage low, as we still run at most one pipeline at a time. However, instead of merging one merge request at a time, we can merge in as many good, non-conflicting merge requests as possible.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-07-11-speeding-up-delivery-with-merge-queues/merge-queue.gif" alt="An illustration showing how branches are integrated with merge queues." /><p class="subtle-text"><small>An illustration showing how branches are integrated with merge queues.</small></p></div><p>When a merge group fails, we perform a binary search to find the bad merge request(s) by splitting the merge group into two child merge groups. 
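The splitting logic can be sketched as a simple recursive bisection. The following is a toy model of the idea, with <code>pipeline_passes</code> standing in for running the real CI pipeline on a candidate merge group:

```python
# Toy model of merge-group bisection. `pipeline_passes` stands in for a
# real CI run: any callable that returns True when the group is safe to merge.

def process_merge_group(group, pipeline_passes, merged=None, failed=None):
    """Merge every passing subset; bisect failing groups down to single MRs."""
    if merged is None:
        merged, failed = [], []
    if not group:
        return merged, failed
    if pipeline_passes(group):
        merged.extend(group)         # the whole group merges at once
    elif len(group) == 1:
        failed.extend(group)         # cannot split further: a bad merge request
    else:
        mid = (len(group) + 1) // 2  # split into two child merge groups
        process_merge_group(group[:mid], pipeline_passes, merged, failed)
        process_merge_group(group[mid:], pipeline_passes, merged, failed)
    return merged, failed
```

Run on six merge requests A to F with C bad, this merges A, B, D, E, and F and isolates C as the only failure.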
With this strategy, a child merge group that no longer contains any bad merge requests can merge its subset of merge requests all at once. However, if a child merge group continues to fail, we continue the binary search until we get a merge group of one merge request, which either passes or fails and does not split further.</p><p>To illustrate this we created the diagram below. We start with a merge group with six merge requests (labeled A to F), with one of them being unable to merge (labeled C). The first merge group A, B, C, D, E, and F on the left gets split because we are unable to merge C. The next merge group being evaluated contains merge requests A, B, C which gets split again. We are able to evaluate and merge in merge groups containing D, E, F and A, B afterwards. Eventually, we reach a point where there is a merge group containing merge request C, which fails to merge and does not get split into any child merge groups.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-07-11-speeding-up-delivery-with-merge-queues/merge-group-example.png" alt="Example showing how a merge group of 6 merge requests are merged and split over time." /><p class="subtle-text"><small>Example showing how a merge group of 6 merge requests are merged and split over time.</small></p></div><p>For a bit of background, code delivery pipelines at Yelp start with a magic pull request comment, <code class="language-plaintext highlighter-rouge">!integrate</code>. This triggers a pipeline to perform common actions like merging code, running tests, and pushing upstream. With this in mind, we wanted the new system to preserve the developer UX, while still being flexible enough to rollout to any repo.</p><p>To create the merge queue, we began by building a new service that would execute the run loop. This service manages state/logic (such as merge group creation, splitting, etc.) 
and periodically checks if a pipeline should run for a subsequent merge group. In addition, we extended the <code class="language-plaintext highlighter-rouge">!integrate</code> comment logic to seamlessly replace the old workflow with this new approach. Repos can choose to use the merge queue by adding a config file that specifies which pipeline to run. The existence of this config file also indicates that the new magic comment logic should be used. As a result, the magic comment for such a repo will direct a pull request to join the merge queue (after checking mergeability) instead of running a pipeline immediately.</p><p>In our delivery pipelines, most repositories follow the three steps mentioned earlier: merge, test, and push. To account for more complex repos, we allowed developers to perform additional actions before/after these standardized steps, or to replace them entirely. This structure also helps standardize and simplify pipeline code for our repository owners as they onboard to merge queues. With these changes, pipelines can continue performing the necessary actions, while being managed by the merge queue to speed up previously sequential builds.</p><p>Implementing merge queues was a huge improvement over our serial integration pipeline. On the extreme end, we’ve even seen merge groups with over 10 merge requests merge successfully! Using this system over the past year, our frontend monorepo averaged about 1.2 merge requests per merge group. In a hypothetical world where a pipeline takes one hour to run, this translates to saving 12 minutes of developer time per pipeline run compared to running pipelines in serial! For busy repos, which can easily have thousands of merge requests a year, those time savings add up.</p><p>The merge queue project was a collaboration between Webcore and our talented Continuous Integration and Delivery team. 
Many thanks to the developers who’ve contributed to this project from idea generation to further optimizations.</p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2023/07/speeding-up-delivery-with-merge-queues.html</link>
      <guid>https://engineeringblog.yelp.com/2023/07/speeding-up-delivery-with-merge-queues.html</guid>
      <pubDate>Tue, 11 Jul 2023 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Dependency Management at Scale]]></title>
      <description><![CDATA[<p>Keeping project dependencies up to date is an ever-growing concern. An increasing number of dependencies is used for even the simplest applications. It’s easy for teams to deprioritize maintaining them, resulting in numerous security vulnerabilities. As dependencies become increasingly out of date, the level of effort to get a project into a good state increases significantly. Teams may even get blocked by outdated dependencies when doing critical development work.</p><p>Being proactive about applying upgrades goes a long way. Tools like Dependabot can really help with this. But what if you’re trying to enforce these practices across hundreds of teams and thousands of projects? And what if you have complex requirements that need to be enforced? At Yelp, this is where the Yokyo Drift service comes in.</p><p>Yokyo Drift actively scans all repositories in use at Yelp. It submits pull requests that upgrade any outdated dependencies, and tracks and monitors the progress of these upgrades.</p><p>Building a generic solution that works for the majority of projects is challenging. Projects should be relatively standard. This is encouraged by providing a variety of tooling and quality of life upgrades to repositories that adhere to the Yelp standard. The more a project deviates from the standard, the more difficult it becomes to keep it automatically up to date.</p><p>In addition, projects must have a robust testing pipeline and good test coverage. Thorough automated testing should run as part of the CI pipeline before any change is accepted. Upgrading dependencies is likely to introduce bugs, and inadequate testing means that teams may not feel confident merging upgrades, thus encouraging them to stick to outdated dependencies.</p><h2 id="tracking-project-state">Tracking Project State</h2><p>Batch jobs regularly collect and index a variety of information about Yelp repositories. Yokyo Drift monitors the specific dependencies that are used throughout the organization. 
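Conceptually, such an index behaves like a mapping from package to the repositories pinning it. The sketch below is a toy illustration only; the data model and names are assumptions, not Yokyo Drift's actual schema:

```python
# Toy dependency index (names and shapes assumed, not Yokyo Drift's schema).
from collections import defaultdict

index = defaultdict(dict)  # package name -> {repo: pinned version tuple}

def record(repo, package, version):
    """Index a repo's pinned version of a package."""
    index[package][repo] = tuple(int(part) for part in version.split("."))

def affected_repos(package, first_fixed_version):
    """Repos pinning `package` below the first fixed (non-vulnerable) version."""
    fixed = tuple(int(part) for part in first_fixed_version.split("."))
    return sorted(repo for repo, ver in index[package].items() if ver < fixed)
```

With an index like this, a newly disclosed vulnerability becomes a single lookup: every repo below the fixed version is a target for an upgrade pull request.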
When a vulnerability is discovered in a dependency, we can immediately identify all affected repositories and dispatch a fix to rapidly eliminate the vulnerability. All indexed information is available in a simple UI.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-05-17-dependency-management-at-scale/1_status.png" alt="Figure 1: Project status screen" /><p class="subtle-text"><small>Figure 1: Project status screen</small></p></div><h2 id="scheduled-upgrades">Scheduled Upgrades</h2><p>We encourage teams to always keep their project dependencies up to date. Small, frequent updates are much easier to manage. Repository owners can configure how frequently they’d like to receive updates. Yokyo Drift performs both major and minor version upgrades, typically on a monthly or quarterly basis.</p><p>Yelp projects rely on curated package repositories, and we are only able to upgrade to these pre-vetted versions, thereby ensuring we don’t introduce any unwanted security issues.</p><p>Scheduled upgrades are randomly distributed throughout the month. This ensures a consistent use of resources with few spikes. More importantly, it allows our teams to provide support to repository owners and not overwhelm them with too many pull requests at the same time. Performing upgrades for all repositories in one day would result in an overwhelming number of questions in a short amount of time.</p><h2 id="targeted-upgrades">Targeted Upgrades</h2><p>Targeted upgrades allow us to upgrade specific libraries to specific minimal versions across the entire organization. 
These can be invoked dynamically by other teams using the Yokyo Drift API, or manually using the UI shown below.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-05-17-dependency-management-at-scale/2_target.png" alt="Figure 2: Performing a targeted upgrade" /><p class="subtle-text"><small>Figure 2: Performing a targeted upgrade</small></p></div><p>This functionality is frequently used by security teams. Once a vulnerability is discovered in a specific version of a library, we can immediately see the impact and deploy a mass upgrade across all of Yelp’s projects. We then actively monitor the progress and ensure the vulnerability is eliminated in all of Yelp’s systems.</p><p>Library owners are also frequent users of targeted upgrades. They can rapidly deploy bug fixes and other improvements to all relevant projects.</p><h2 id="pull-requests">Pull Requests</h2><p>All changes are submitted as pull requests in GitHub. Since changes go through the existing CI pipeline, a variety of security and automated tests are executed. We rely on the Ownership service to determine the relevant team responsible for each repository. Pull requests are assigned for review to one of the repository’s owners, who is responsible for manually fixing small changes that may be required by library upgrades. The change automatically gets merged once all checks pass and the repository owner approves the pull request.</p><p>Occasionally, teams will be unable to review these pull requests in a timely manner, so automated reminders are sent to the reviewer at a set interval. In addition, Yokyo Drift attempts to always keep the pull request up to date. Merge conflicts are avoided by regularly pulling the latest changes from the master branch and performing the upgrade again if needed.</p><p>Updating dependencies on one repository can be time-consuming. 
It may involve building the project, performing dependency resolution, and even running some automated checks. This is manageable when upgrading a single repository, but quickly becomes untenable when upgrading hundreds or even thousands of repositories. To address this, we need to be able to automatically scale up and down as needed.</p><p>Creating a new upgrade job enqueues a payload for each repository that needs upgrading. Workers are then responsible for taking items off the queue, performing the necessary changes, and submitting the pull requests. Workers are configured to automatically scale up as queue size increases and scale back down when the queue clears. Because of this, thousands of complex upgrades can be executed quickly.</p><p>The Yokyo Drift UI tracks the progress of each task. A typical successful task will move through the following stages: pending, in_progress (the upgrade is in progress), open (pull request is open), and merged.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-05-17-dependency-management-at-scale/3_progress.png" alt="Figure 3: Upgrade progress tracking" /><p class="subtle-text"><small>Figure 3: Upgrade progress tracking</small></p></div><p>The job progress page keeps track of how these updates affect repositories. A status of “checks_failed” indicates that the repository is failing automated tests. This status is not uncommon; however, a large number of repositories failing tests may indicate a fundamental problem with the upgrade. Migration authors such as package owners can investigate this and determine if any changes should be made, the end goal being to reduce friction with teams and make these upgrades as easy as possible to integrate.</p><p>This progress screen can also be used to directly control job progress. 
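The task lifecycle can be modeled as a small state machine. In the hedged sketch below, the stage names (including checks_failed) come from the description above, but the set of allowed transitions is an illustrative assumption:

```python
# Sketch of the upgrade-task lifecycle. Stage names are from the post;
# the allowed transitions are an illustrative assumption.
ALLOWED_TRANSITIONS = {
    "pending": {"in_progress"},
    "in_progress": {"open", "checks_failed"},
    "open": {"merged", "checks_failed"},
    "checks_failed": {"in_progress"},  # e.g., rerun after a fix lands
}

def advance(state, new_state):
    """Move a task to `new_state`, rejecting transitions not in the table."""
    if new_state not in ALLOWED_TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```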
Upgrades can be rerun on individual repositories, the entire change can be canceled or reverted, and teams can be nudged to review and approve the changes if necessary.</p><p>Dependencies can easily become outdated and cause significant problems for development teams. Updating them regularly makes the process more manageable and reduces the number of security vulnerabilities. Enabling teams to upgrade a single dependency across thousands of projects is valuable, both for security teams and dependency developers.</p><p>Thanks to Luis Perez, Kyle Deal, James Flinn, Jason Tran, Rebecca Fan, Mitali Parthasarathy, Hanna Farah, and many others who have contributed to Yokyo Drift over the years.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp. If you're interested, apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2023/05/dependency-management-at-scale.html</link>
      <guid>https://engineeringblog.yelp.com/2023/05/dependency-management-at-scale.html</guid>
      <pubDate>Wed, 17 May 2023 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Performance for Free on Android with our MVI Library]]></title>
      <description><![CDATA[<p>In 2018, Yelp switched from using the MVP architecture to the MVI architecture for Android development. Since then, adoption of our new MVI architecture library has risen and we’ve seen some great performance and scalability wins. In this blog post, we’ll cover why we switched to MVI in the first place, how we managed to get performant screens by default, and our take on unit testing MVI.</p><h2 id="what-is-mvi">What is MVI?</h2><p>One of the main reasons to use an architecture is to make things easier to test by separating concerns. For Android, this means keeping the Android SDK out of our presenters and abstracting away all the code that will cause issues for unit tests.</p><p>The general idea of Model View Intent (MVI) is that when the user interacts with the UI, a view event is sent to be processed in the model. The model can make network requests, manipulate some view state and send the state back to the view. They’re connected by an event bus or stream so no direct references to Android are required (thus concerns are separated for testing).</p><h2 id="why-we-switched-away-from-mvp">Why we switched away from MVP</h2><h3 id="our-mvp-implementation-did-not-scale-well">Our MVP implementation did not scale well</h3><p>Although <a href="https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93presenter">Model-View-Presenter</a> (MVP) is a great architecture with a lot of benefits, we found that it didn’t scale well for our larger, more complicated pages. Our presenters grew to have far too many lines of code and became unwieldy and awkward to maintain as we needed to add more state-management and create more complex presenter logic for MVP pages. It was possible to scale an MVP page using multiple presenters, but there was no one approach documented. 
Our MVP contracts also contained many duplicated interface methods.</p><h3 id="we-wanted-free-performance-by-default">We wanted free performance by default</h3><p>When Google introduced the <a href="https://developer.android.com/topic/performance/vitals">Android Vitals</a> dashboard and announced that performance can affect our listing and promotability in the Play Store, Yelp’s Core Android team invested effort in improving our cold start timings, frame rendering timings, and frozen frames percentages. Although we made significant improvements in those areas, we found that performance regressions were easy to come by and our performance degraded again over time.</p><p>There are a few ways to prevent performance regressions: we could set up performance alerts, we could try to catch regressions before they’re merged, or we could also try to make our apps run smoothly by default. While we did try all of these in the end, our performance came to us for free through auto-mvi, our new MVI library.</p><h2 id="why-we-chose-mvi-and-not-mvvm">Why we chose MVI and not MVVM</h2><p>We evaluated both the MVI and the <a href="https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93viewmodel">Model-View-ViewModel</a> (MVVM) architectures before ultimately deciding on MVI. First, we looked at the basic requirements in our apps. Both of Yelp’s apps require a lot of scrolling and clicking in comparison to, for example, video streaming applications. Next, we looked at what other technologies we were using and determined which architecture would be most compatible with them.</p><p>We rely heavily on our in-house <a href="https://github.com/Yelp/bento">Bento</a> library which is a wrapper around RecyclerView. In Bento, a Component is a part of the UI which can be slotted into any RecyclerView. 
We set up each Component to be its own mini MVP-universe that has its own view, model, and presenter.</p><p>In our prototypes, we found that combining Bento with the MVVM pattern was confusing and led to difficult-to-read code. However, MVI complemented Bento and allowed click events to be fired from within view holders without the need for direct references to the encompassing Fragment or Activity. Additionally, since some of our screens have a lot of UI elements, MVVM would require some data classes with many (greater than 30) fields, which would not scale well.</p><h2 id="how-does-auto-mvi-work">How does auto-mvi work?</h2><p>When the user interacts with the app, view events are emitted from the view (Fragment or Activity). A view event might be a click or scroll event. A presenter (note: to avoid confusion, at Yelp, we refer to the Model in MVI as the “presenter”) receives the events and sends back view states. The view then responds to these states and decides what to show accordingly. These view events and states are represented as sealed classes in Kotlin. They are emitted over an event bus, which both the view and presenter can listen to for new events and states.</p><h3 id="scaling-and-readability-with-annotations">Scaling and readability with annotations</h3><p>Both the presenter and view must handle all of these incoming states. Most Android MVI implementations accomplish this with a <code class="language-plaintext highlighter-rouge">when</code> statement in Kotlin. However, the <code class="language-plaintext highlighter-rouge">when</code> statement wouldn’t scale very well for Yelp. It would be difficult to read. Imagine the following but with fifty other <code class="language-plaintext highlighter-rouge">is</code> clauses:</p><div class="language-kotlin highlighter-rouge highlight"><pre>private fun onViewEvent(viewEvent : MyFeatureEvents) {
  when (viewEvent) {
      is HeaderClicked -&gt; onHeaderClick()
      is FooterClicked -&gt; onFooterClick()
  }
}
</pre></div><p>To get around the <code class="language-plaintext highlighter-rouge">when</code> condition problem, the general idea was to route states and events to function references using a map. That meant going from the above code to:</p><div class="language-kotlin highlighter-rouge highlight"><pre>private val functionMap = mapOf(
    HeaderClicked::class to ::onHeaderClick,
    FooterClicked::class to ::onFooterClick
)
private fun onViewEvent(viewEvent: MyFeatureEvents) {
   // Look up the handler registered for this event's class and invoke it.
   // The unchecked cast assumes a zero-argument handler (KFunction0).
   ((functionMap[viewEvent::class]) as KFunction0&lt;Unit&gt;).invoke()
}
</pre></div><p>Then all onViewEvent() needs to do is look up the map.</p><p>So we could avoid the big <code class="language-plaintext highlighter-rouge">when</code> statement. Writing the function map is gross though and still defeats our scalability goal. We’d just be trading a large <code class="language-plaintext highlighter-rouge">when</code> statement for a large map. We would also need to handle the number of parameters the functions can have. The above code only covers the easiest, zero-parameter case.</p><p>This is how we arrived at the idea to annotate the functions instead. When the presenter and view are created, we use reflection (on a background thread) to create the map of states to functions. Our interface AutoFunction (which is where “auto” comes from) provides the mechanism for this and also routes incoming states and events to relevant functions, and then executes the function with reflection. Again, taking the following example:</p><div class="language-kotlin highlighter-rouge highlight"><pre>private fun onViewEvent(viewEvent : MyFeatureEvents) {
  when (viewEvent) {
      is HeaderClicked -&gt; onHeaderClick()
      is FooterClicked -&gt; onFooterClick()
  }
}
</pre></div><p>Instead we have:</p><div class="language-kotlin highlighter-rouge highlight"><pre>@Event(HeaderClicked::class)
fun onHeaderClick() {
  // do something
}
@Event(FooterClicked::class)
fun onFooterClick() {
  // make network request etc
}
</pre></div><p>With this approach, the scaling issue is solved. There is no <code class="language-plaintext highlighter-rouge">when</code> statement at all, no function map, and not even a specific function responsible for handling incoming events or states. It also has the advantage that it’s incredibly easy to read.</p><h3 id="scaling-with-sub-presenters">Scaling with sub presenters</h3><p>One of the issues we found while using MVP was that for the most complex screens in Yelp’s consumer app, the presenters quickly grew difficult to maintain and understand. With this in mind, the auto-mvi library has a strategy for scaling presenters for such complex screens. A page will define one main presenter, and within it there can be multiple sub presenters. A sub presenter can handle the logic for a particular feature or part of the UI. For example, for a page with these click events defined in the contract:</p><div class="language-kotlin highlighter-rouge highlight"><pre>sealed class MyFeatureEvents : AutoMviViewEvent {
   object MyButton1Clicked : MyFeatureEvents()
   object MyButton2Clicked : MyFeatureEvents()
   object MyButton3Clicked : MyFeatureEvents()
}
</pre></div><p>We could respond to them all in one presenter like this:</p><div class="language-kotlin highlighter-rouge highlight"><pre>class MyFeaturePresenter(
   eventBus: EventBusRx
) : AutoMviPresenter&lt;MyFeatureEvents, MyFeatureStates&gt;(eventBus) {
   @Event(MyButton1Clicked::class)
   fun onMyButton1Clicked() {
       // do something
   }
   @Event(MyButton2Clicked::class)
   fun onMyButton2Clicked() {
       // do something
   }
   @Event(MyButton3Clicked::class)
   fun onMyButton3Clicked() {
       // do something
   }
}
</pre></div><p>But with a sub presenter, we can handle a subset of events elsewhere:</p><div class="language-kotlin highlighter-rouge highlight"><pre>class MyFeaturePresenter(
  eventBus: EventBusRx
) : AutoMviPresenter&lt;MyFeatureEvents, MyFeatureStates&gt;(eventBus) {
    // The rest of click events are handled in here
   @SubPresenter private val subPresenter = MyFeatureSubPresenter(eventBus)
   @Event(MyButton1Clicked::class)
   fun onMyButton1Clicked() {
        // do something
   }
}
</pre></div><p>Since everything is connected via an event bus, it’s simple for one sub presenter to handle a portion of the incoming view events and respond to the view. A bonus win of this pattern is that the organization of unit tests is much improved as each sub presenter can have its own separate unit test. This sub presenter pattern also helps put scaling code at the forefront of one’s mind during planning. If there is a clear division of logic, e.g. header logic vs footer logic, you can easily plan this from the beginning instead of waiting until the presenter is over a thousand lines long at some future point.</p><h3 id="performance-for-free">Performance for free</h3><p>With auto-mvi using reflection to execute functions, an opportunity presented itself. The reflection call is straightforward:</p><div class="language-kotlin highlighter-rouge highlight"><pre>myFunctionReference.invoke()
</pre></div><p>The function, like all the functions in our previous MVP presenters, executes on the main thread. However, by moving the execution of this one line to a background thread instead, we shifted a large portion of the total code that executes in the Yelp apps off the main thread, leading to increased performance overall. This change only affects the presenters. The view code still runs on the main thread as it is required to.</p><p>The code executes on a single background thread to ensure that each unit of work is carried out sequentially. This means that all presenter code, performant or not, now runs on a background thread in the model.</p><h3 id="testing">Testing</h3><p>Writing unit tests for MVP presenters and views is easy and one of the greatest advantages MVP has over other architectures. We used Mockito to verify that functions were called on the interfaces that made up the MVP contract, which is a seamless and straightforward way to test. For example:</p><div class="language-kotlin highlighter-rouge highlight"><pre>fun whenButtonClicked_loadingProgressShown() {
       presenter.buttonClicked() // Simulated UI interaction
       Mockito.verify(view).showLoadingProgress()
}
</pre></div><p>In MVI, we wanted to make sure that the code was still easily testable. The approach we decided on was to record the events and states that are emitted over the event bus and make assertions on them.</p><p>To simplify testing, we created a JUnit test rule called PresenterRule. In addition to abstracting away most of the setup required for the presenter and event bus, the presenter rule also acts as an event bus recorder and provides a set of functions for asserting what happened.</p><p>Taking the example above, this looks like:</p><div class="language-kotlin highlighter-rouge highlight"><pre>fun whenButtonClicked_loadingProgressShown() {
     presenterRule.sendEvent(ButtonClicked)
     presenterRule.assertEquals { listOf(ShowLoadingProgress) }
}
</pre></div><p>Along with verifying that functions are executed, this approach also provides a high-level look at what events and states were triggered and in what order. Lastly, developers can also assert that certain states were <em>not</em> triggered.</p><h2 id="reflecting-4-years-later">Reflecting 4 years later</h2><h3 id="does-it-actually-help-scalability">Does it actually help scalability?</h3><p>Many teams have made use of the sub presenter pattern with great results. In 2020, the Biz Mobile Foundation team rewrote Yelp’s Business Owner App’s home screen using auto-mvi, making extensive use of the sub presenter pattern. By utilizing sub presenters, this complicated page’s presenter size remained small and manageable: fewer than 200 lines, with 8 sub presenters. There are also separate unit test classes for the sub presenters, which are a lot more manageable than if all the tests were in one file.</p><h3 id="does-it-actually-help-performance">Does it actually help performance?</h3><p>From a high level, we can use Android Vitals to gauge our apps’ performance. However, auto-mvi is just one tool in Yelp’s performance arsenal. In combination with the Core Android team’s other performance efforts, Yelp’s consumer app’s frozen frame and rendering statistics on Google Play’s Android Vitals dashboard are significantly better than our competitors’.</p><p>Looking at a more specific use case, in 2020, Yelp’s Growth team migrated the onboarding pages to auto-mvi, analyzed the frame rendering timings of the old flow vs the new MVI one, and found a &gt; 50% improvement in the MVI version. This is precisely the kind of improvement we should expect, as the presenter code isn’t clogging up the main thread anymore. The table below outlines the speed gains we saw on this page with auto-mvi vs MVP.</p><table><thead><tr><th>Avg Frame Render Time Improvement (Relative)</th>
<th>P90 Frame Render Time Improvement (Relative)</th>
<th>Frozen Frame % Improvement (Absolute)</th>
</tr></thead><tbody><tr><td>-51%</td>
<td>-67%</td>
<td>-3.99%</td>
</tr></tbody></table><p>The performance boost resulted in an improvement in product metrics too, with a 6.32% relative lift for the Onboarding Flow Completion rate and an 8.26% relative lift for Signup Rate Completion.</p><p>Without any special, targeted performance effort here, the page’s performance improved. You might even say the performance was free.</p><h3 id="is-unit-testing-still-easy">Is unit testing still easy?</h3><p>Most, if not all, of Yelp’s MVI presenters are accompanied by unit tests, and the provided testing rule has proven to speed up developer workflows. To date, we have thousands of unit tests making sure Yelp’s apps are doing what they’re supposed to do.</p><h2 id="conclusion">Conclusion</h2><p>In summary, every architecture has its advantages and disadvantages, but the most important thing is to choose the one that’s most suitable for your business needs. Auto-mvi has allowed Yelp to tackle the development of everything from simple screens to complex ones in a scalable and testable way while keeping runtime performance a feature and <em>not</em> an afterthought.</p><h2 id="acknowledgments">Acknowledgments</h2><p>Thanks to Diego Waxemberg, Jason Liu, and all the feature teams at Yelp who provided invaluable feedback on our early prototypes and, more importantly, adopted auto-mvi on their screens. On Core Android, shoutout to Kurt Bonatz, Matthew Page, and Ying Chen for their contributions and help maintaining auto-mvi over the years. 
Many thanks to all the past members of Yelp who contributed ideas and feedback too.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp. If you're interested, apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2023/04/performance-for-free-on-android-with-our-mvi-library.html</link>
      <guid>https://engineeringblog.yelp.com/2023/04/performance-for-free-on-android-with-our-mvi-library.html</guid>
      <pubDate>Mon, 24 Apr 2023 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Yelp Content As Embeddings]]></title>
      <description><![CDATA[<p>Yelp aims to offer easily accessible high-quality content. We need to tag, organize and rank online content to attain this goal. For this purpose, Yelp engineers have started using general embeddings on different data. It improves usability and efficiency for all kinds of model development. Having embeddings that encapsulate semantic information readily available for the massive amounts of data Yelp owns makes implementing new deep learning models easier, since it can serve as an excellent baseline for any model input.</p><p>This blog post discusses how the Content and Contributor Intelligence team generates low-dimensional representations of review text, business information and photos for any unspecified machine learning task.</p><h2 id="text-embeddings">Text Embeddings</h2><p>Text embedding has been researched in depth in the scientific community. First, embeddings were generated with sparse vectors representing words. Embeddings developed further with context-aware embeddings since the same word can have different meanings depending on how it is used in a sentence. With the use of transformers in recent years, we now have text snippet embeddings that capture more semantic meaning.</p><p>Semantic comprehension of the text is essential for Yelp. Yelp reviews are our most valuable asset since they contain a lot of business context and sentiment. We want to capture the essence of each review text to serve their information to our users better. We looked for versatility in our embedding as we try to use the same embedding in various tasks: tagging, information extraction, sentiment analysis and ranking.</p><p>Embeddings based on reviews are currently generated by the Universal Sentence Encoder off-the-shelf model offered by Tensorflow. 
This section presents the USE model, the modifications we tested to improve it, and its advantages for the Yelp dataset.</p><h3 id="universal-sentence-encoder">Universal Sentence Encoder</h3><p>The Universal Sentence Encoder (USE) offers many advantages for Yelp data. It transforms sentences of varying lengths into a fixed-length vector representation. The generated representation aims to encode the meaning and context of the text snippet, instead of simply averaging the word vectors together or locating the text in a learned latent space as Latent Dirichlet Allocation (LDA) does.</p><p>The <a href="https://arxiv.org/abs/1803.11175">paper presenting the Universal Sentence Encoder</a> trained a model on various data sources and tasks like text classification, semantic similarity, and clustering. Training a model on varied tasks makes it more general and captures more of the possible expressiveness of a text snippet. The model demonstrates promising results on eight transfer tasks, suggesting that training on diversified data sources and sufficiently varied tasks makes it universal, as the name suggests. Universal embeddings are what we were looking for to exploit our most diverse and deep content, Yelp reviews. With the generated review embeddings, we want to extract the business information and context given in a review, perform sentiment analysis and even rank reviews by their relevance and information diversity.</p><p>The deep averaging network (DAN) version of USE takes word and bigram embeddings and averages them together. 
This resulting embedding serves as input to a feedforward deep neural network that produces the universal sentence embedding we aim to obtain.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-04-20-yelp-content-as-embeddings/DAN-architecture.png" alt="An architecture overview of DAN, taken from https://amitness.com/2020/06/universal-sentence-encoder/" /><p class="subtle-text"><small>An architecture overview of DAN, taken from https://amitness.com/2020/06/universal-sentence-encoder/</small></p></div><h3 id="yelp-exploration">Yelp Exploration</h3><p>By nature, most NLP models will perform better when trained on domain-specific text. With this hypothesis, we developed and compared a Yelp fine-tuned encoder with the pre-trained USE model available on TensorFlow Hub. We aimed to create a model better adapted to the Yelp domain than the pre-trained one. After fine-tuning the model, we wanted to use it to generate embeddings for reviews specifically.</p><p>Yelp data contains different text formats like reviews, captions, searches, and survey responses that can all be used to fine-tune the USE encoder. Since these models are not generative, we needed to create generic supervised learning tasks to fine-tune the model on Yelp domain text.</p><p>Some examples of learning tasks we used:</p><ul><li>Review Category Prediction</li>
<li>Review Rating Prediction</li>
<li>Search Category Prediction</li>
<li>Sentence Order Prediction</li>
<li>Same Business Prediction</li>
</ul><p>For the evaluation task, we chose:</p><ul><li>Photo Caption Classification</li>
<li>Menu Item Classification</li>
<li>Business Property Classification</li>
<li>Synonym Generation for a phrase input</li>
</ul><p>The model evaluation on the Yelp domain showed that the ready-to-use model performed as well as or better than the Yelp fine-tuned encoder on all tasks. This is likely because the Yelp domain touches many generic subjects already covered by the USE model, or because our experiments lacked the task diversity needed to gain an edge. Based on these results, we decided to keep the off-the-shelf USE pre-trained model.</p><h3 id="use-on-yelp-domain">USE on Yelp Domain</h3><p>We can measure two embeddings’ relatedness when they are projected together in the same vector space. This is helpful for semantic search, cluster analysis, and other applications.</p><p>Below is a graph representation of a USE embedding space applied to the Yelp dataset. We wanted to verify that semantically related texts sit close together in the vector space, which is expected of a semantic embedding that captures the general subject of the text snippet it encodes.</p><p>We computed the cosine similarity between embedding representations of reviews from different categories and grouped them into the following heatmap. We verified that reviews from the same category domain were closer in the vector space than reviews from a different domain, as shown by the lighter boxes in the graph below.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-04-20-yelp-content-as-embeddings/USE-heatmap.png" alt="Numbers on the axis reference 44 different review IDs. Those reviews’ business categories are shown in the table below. We can see a clear correlation between reviews from a similar business type." /><p class="subtle-text"><small>Numbers on the axis reference 44 different review IDs. Those reviews’ business categories are shown in the table below. We can see a clear correlation between reviews from a similar business type.</small></p></div><table><thead><tr><th>Labels association table</th>
<th> </th>
</tr><tr><th>Reviews ID</th>
<th>Yelp Business Type</th>
</tr></thead><tbody><tr><td>0 to 10</td>
<td>Restaurants</td>
</tr><tr><td>11 to 21</td>
<td>Dry Cleaning</td>
</tr><tr><td>22 to 32</td>
<td>Groomer</td>
</tr><tr><td>33 to 43</td>
<td>Plastic Surgeon</td>
</tr></tbody></table><h2 id="business-embeddings">Business Embeddings</h2><p>After Yelp created embedding representations of reviews, which showed great potential across several projects, we explored different ways to grow our vector representations. We started by developing a business vector representation using all of a business’s metadata.</p><p>We chose to base our business embedding on user content. We select the 50 most recent reviews and average their vector embeddings to create our first business embedding representation. It’s a great way to start since reviews contain quality content describing the businesses. The next step will be to add the photo embeddings as well.</p><p>Business embeddings help generate a top-k similarity list to relate businesses to other businesses, users to businesses and users to users based on their matching business interaction history. This correlation matrix of similarities helps surface meaningful recommendations like “Users like you also liked…” or “Since you like business A, you might like business B”. You can learn more about this use case in <a href="https://engineeringblog.yelp.com/2022/04/beyond-matrix-factorization-using-hybrid-features-for-user-business-recommendations.html">this blog post</a>.</p><h2 id="photo-embeddings">Photo Embeddings</h2><p>Review and business vector representations have existed at Yelp for some time already. Last year, the publication of the <a href="https://arxiv.org/abs/2103.00020">paper</a> presenting the Contrastive Language-Image Pre-training (CLIP) model inspired us at Yelp to generate more semantic data representations, this time based on photos.</p><p>Research on the semantic representation of photos improved significantly with the use of transformers applied to images. 
This section will present OpenAI’s CLIP model, its known capabilities, the pre-trained model’s effectiveness on the Yelp domain and some vulnerabilities that are good to be aware of before using it.</p><h3 id="clip-model">CLIP model</h3><p>We based our photo encoder on the CLIP model because of its performance and abilities. This model has learned to associate an image with the most relevant of the texts it is given. It is a pre-trained zero-shot model that associates natural language with high-level visual concepts.</p><p>CLIP takes two inputs: an image and a set of candidate texts. The feature embeddings of the image and of each candidate text are generated by their respective encoders. CLIP then pulls similar image-text pairs together in the embedding space and pushes dissimilar ones apart, using contrastive representation learning based on cosine similarity. Our first goal here is to generate photo embeddings. To that end, we experimented with the pre-trained CLIP model, applied the generated embeddings to the Yelp dataset in the next section, and compared the results with our models in production.</p><p>The CLIP model is a zero-shot model, meaning it can infer successfully from an unseen dataset. A zero-shot model is an opportunity for Yelp to better identify and tag unseen photo categories to improve photo search. Our classifier won’t need a thousand examples for each new tag or label added.</p><p><a href="https://openai.com/blog/multimodal-neurons/">Research done on CLIP</a> showed that its neurons are multimodal and respond to abstract concepts. Instead of reacting to a specific image feature like a Convolutional Neural Network model’s neuron, a CLIP neuron responds to a cluster of ideas with a high-level theme.</p><p>In the table below are some examples of high-level themes. You can see tombstones in the image associated with the word Halloween. 
Those images, generated using different tools referenced in the <a href="https://openai.com/blog/multimodal-neurons/">OpenAI blog post</a>, try to maximize a single neuron’s activation with gradient-based optimization for the given input (i.e. “Halloween”) and the distribution of images.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-04-20-yelp-content-as-embeddings/CLIP-visual-neurons.png" alt="Image taken from OpenAI blog post: https://openai.com/blog/multimodal-neurons/" /><p class="subtle-text"><small>Image taken from OpenAI blog post: https://openai.com/blog/multimodal-neurons/</small></p></div><h3 id="evaluation-made-on-yelp-photo-dataset">Evaluation made on Yelp Photo Dataset</h3><p>We compared the CLIP model with three existing ResNet50 classification models to evaluate CLIP’s capability on the Yelp domain. Our 5-way Restaurant, Food and Nightlife classifier identifies <em>Food, Drinks, Menu, Interior or Exterior</em> categories for photos. The food classifier covers 27 food dish categories, and the Home Services Contractor Classifier identifies five categories of repairs. We tested the CLIP model without any fine-tuning applied to the pre-trained model found on <a href="https://huggingface.co/docs/transformers/model_doc/clip">HuggingFace</a>. We manually engineered the classes’ labels to optimize the CLIP model’s performance but didn’t optimize the categories themselves, since we wanted a direct comparison with the existing models.</p><h4 id="5-way-restaurant-food-and-nightlife-classifier">5-Way Restaurant, Food and Nightlife Classifier</h4><p>While experimenting, we quickly concluded that we could not simply reuse our existing class names as input. The paper suggested adding ‘<em>A photo of</em>’ in front of each label, but it didn’t prove effective for all the categories. 
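</p><p>At inference time, zero-shot classification with these engineered labels amounts to embedding the photo and each label prompt, then taking a softmax over the image-text similarity scores. The sketch below illustrates only that scoring step; the similarity logits are made up for illustration and stand in for the output of a real CLIP forward pass:</p>

```python
import numpy as np

# Engineered label prompts following the "A photo of ..." pattern.
prompts = [
    "A photo of a drink",
    "A photo of food",
    "A photo of a menu",
    "A photo of inside a restaurant",
    "A photo of a restaurant exterior",
]

# Made-up image-text similarity scores for one photo (a real pipeline would
# compute cosine similarities between CLIP's image and text embeddings,
# scaled by the model's learned logit scale).
logits = np.array([21.0, 3.0, -5.0, 8.0, 1.0])

# Softmax over the prompts turns similarities into per-label probabilities.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

predicted_label = prompts[int(np.argmax(probs))]
print(predicted_label)  # → "A photo of a drink"
```

<p>Because the labels are plain text, adding a new category is as cheap as writing a new prompt, which is what makes the zero-shot setup attractive compared to retraining a ResNet50 for every new class.</p><p>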
The table below contains the label engineering applied to the 5-Way Restaurant, Food and Nightlife classification problem.</p><table><thead><tr><th>Original Labels</th>
<th>Engineered Labels</th>
</tr></thead><tbody><tr><td>Drink</td>
<td>A photo of a drink</td>
</tr><tr><td>Food</td>
<td>A photo of food</td>
</tr><tr><td>Menu</td>
<td>A photo of a menu</td>
</tr><tr><td>Interior</td>
<td>A photo of inside a restaurant</td>
</tr><tr><td>Outside</td>
<td>A photo of a restaurant exterior</td>
</tr><tr><td> </td>
<td>A photo of other</td>
</tr></tbody></table><p>The following table compares the ResNet50 model currently in production and the zero-shot CLIP model. Results for the 5-way restaurant, food and nightlife classifier show that CLIP has potential and that label engineering could beat a domain-trained deep learning model. These results also encourage us to explore further the potential of a fine-tuned CLIP model on Yelp domain.</p><table><thead><tr><th>Comparison Table of the 5-Way Classifier</th>
<th> </th>
<th> </th>
<th> </th>
<th> </th>
</tr><tr><th> </th>
<th>ResNet50</th>
<th> </th>
<th>CLIP</th>
<th> </th>
</tr><tr><th> </th>
<th>Precision</th>
<th>Recall</th>
<th>Precision</th>
<th>Recall</th>
</tr></thead><tbody><tr><td>Drink</td>
<td>96.8 %</td>
<td>87.1 %</td>
<td>96 %</td>
<td>91 %</td>
</tr><tr><td>Food</td>
<td>96.0 %</td>
<td>92.7 %</td>
<td>88 %</td>
<td>91 %</td>
</tr><tr><td>Menu</td>
<td>95.0 %</td>
<td>80.3 %</td>
<td>51 %</td>
<td>94 %</td>
</tr><tr><td>Interior</td>
<td>89.4 %</td>
<td>92.2 %</td>
<td>92 %</td>
<td>77 %</td>
</tr><tr><td>Outside</td>
<td>84.3 %</td>
<td>94.6 %</td>
<td>96 %</td>
<td>80 %</td>
</tr><tr><td>Other</td>
<td> </td>
<td> </td>
<td>29 %</td>
<td>38 %</td>
</tr></tbody></table><p>Let’s dive deeper into these results with the table below. It shows how precisely the CLIP model predicted each class on the hand-labeled Yelp dataset and, more importantly, which categories get confused with each other. Most notable are the photos the CLIP model classified as Other despite having a real label in the dataset.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-04-20-yelp-content-as-embeddings/5-way-precision.jpg" alt="" /></div><p>On closer inspection, we observe that many <strong>Interior</strong> and <strong>Exterior</strong> photos get classified as <strong>Other</strong> by the CLIP model. Here are some examples for <strong>Interior</strong>.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-04-20-yelp-content-as-embeddings/5-way-imgs.png" alt="Images taken from yelp.com" /><p class="subtle-text"><small>Images taken from yelp.com</small></p></div><p>These misclassifications come down to photo composition: people are often in the foreground of interior and exterior photos. The CLIP model is built to emphasize the embedding representation of the concepts shown in an image, and its attention mechanism favors foreground elements at the cost of background elements.</p><h4 id="food-classifier">Food Classifier</h4><p>The Food Classifier aims to identify the dish showcased in a photo. The production model is a ResNet50 trained on 27 food classes (comparison table in Appendix 1). Overall, CLIP performed well compared to the production model, but it still needs improvement in multiple categories.</p><p>CLIP is a peculiar model, and using it like a ResNet50 can introduce errors. First, we must remember that the category labels were engineered but not the categories themselves.
Having too many labels hindered models like ResNet, since each category is trained from scratch and requires many examples.</p><p>By contrast, using as many dish names as possible would better describe the photos for the CLIP model: CLIP was trained by pairing each image against 32,768 randomly sampled text snippets, so it can work with a wide range of possible outputs. For our comparison tests, however, we didn’t do any category engineering.</p><p>Second, we found that some of the original dish categories confused our results. Images labeled <strong>Waffles</strong> in our dataset were counted as misclassified as <strong>Chicken Wings &amp; Fried Chicken</strong> by the CLIP model, but hand verification showed the classification accurately represents the images, which showcase the Texan dish of <em>Fried Chicken and Waffles</em>.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-04-20-yelp-content-as-embeddings/food-waffle-fc-3.png" alt="Image taken from yelp.com" /><p class="subtle-text"><small>Image taken from yelp.com</small></p></div><table><thead><tr><th>Label</th>
<th>Probability</th>
</tr></thead><tbody><tr><td>Chicken Wings &amp; Fried Chicken</td>
<td>44 %</td>
</tr><tr><td>Waffles</td>
<td>11 %</td>
</tr><tr><td>Ribs</td>
<td>9 %</td>
</tr><tr><td>Dessert</td>
<td>9 %</td>
</tr><tr><td>Tacos</td>
<td>5 %</td>
</tr><tr><td>Steak</td>
<td>5 %</td>
</tr><tr><td>Sandwiches</td>
<td>5 %</td>
</tr></tbody></table><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-04-20-yelp-content-as-embeddings/food-waffle-fc-4.png" alt="Image taken from yelp.com" /><p class="subtle-text"><small>Image taken from yelp.com</small></p></div><table><thead><tr><th>Label</th>
<th>Probability</th>
</tr></thead><tbody><tr><td>Chicken Wings &amp; Fried Chicken</td>
<td>51 %</td>
</tr><tr><td>Waffles</td>
<td>42 %</td>
</tr><tr><td>Ribs</td>
<td>3 %</td>
</tr><tr><td>Pancakes</td>
<td>1 %</td>
</tr><tr><td>Dessert</td>
<td>1 %</td>
</tr></tbody></table><p>Lastly, some dish names describe the protein in the meal even when the dish isn’t plated to showcase it. For example, the CLIP model misclassified some images labeled <strong>Grilled Fish</strong> into the <strong>Salad</strong> category.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-04-20-yelp-content-as-embeddings/food-imgs.png" alt="Images taken from yelp.com" /><p class="subtle-text"><small>Images taken from yelp.com</small></p></div><h4 id="home-services-contractor-classifier">Home Services Contractor Classifier</h4><p>The Home Services Contractor Classifier achieved great results with CLIP for most categories. As with the 27-class food classifier seen previously, the categories had been highly curated in the past to optimize the production ResNet50 model. CLIP removes the constraint of needing a large number of examples for each category the model infers, and reviewing CLIP’s possible output classes should lead to more diversified content tags on Yelp.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-04-20-yelp-content-as-embeddings/plah-precision.png" alt="" /></div><p>In the confusion matrix above, we can see that CLIP doesn’t identify enough of the photos labeled “Other” in our dataset. To remedy that, we tried using a 70% confidence threshold to accept a label, defaulting to Other below it. The table below shows the results: a threshold trades increased precision (fewer false positives) for decreased recall (a smaller percentage of positives identified).</p><table><thead><tr><th>Comparison Table of the Home Services Contractor Classifier</th>
<th> </th>
<th> </th>
<th> </th>
<th> </th>
</tr><tr><th> </th>
<th>CLIP</th>
<th> </th>
<th>CLIP - 70% threshold</th>
<th> </th>
</tr><tr><th> </th>
<th>Precision</th>
<th>Recall</th>
<th>Precision</th>
<th>Recall</th>
</tr></thead><tbody><tr><td>Bathroom, Bathtub and Shower</td>
<td>88 %</td>
<td>87 %</td>
<td>91 %</td>
<td>82 %</td>
</tr><tr><td>Decks and Railing</td>
<td>20 %</td>
<td>84 %</td>
<td>35 %</td>
<td>76 %</td>
</tr><tr><td>Door, Door Repair &amp; Installation</td>
<td>24 %</td>
<td>81 %</td>
<td>38 %</td>
<td>74 %</td>
</tr><tr><td>Kitchen</td>
<td>92 %</td>
<td>85 %</td>
<td>94 %</td>
<td>79 %</td>
</tr><tr><td>Solar Panel</td>
<td>83 %</td>
<td>77 %</td>
<td>89 %</td>
<td>69 %</td>
</tr><tr><td>Other Contractors</td>
<td>80 %</td>
<td>57 %</td>
<td>69 %</td>
<td>77 %</td>
</tr></tbody></table><h3 id="clips-vulnerability">CLIP’s vulnerability</h3><p>Before using CLIP and publishing its results, it’s worth knowing its vulnerabilities and how to optimize its performance. We’ve already covered label and category engineering and thresholding; here we describe another likely pitfall of the model at Yelp.</p><p>As seen previously, some neurons correspond to high-level themes. Let’s focus on more abstract concepts like typographic neurons, which respond to images of word snippets and syllables.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-04-20-yelp-content-as-embeddings/CLIP-visual-neurons-vulnerability.png" alt="Image taken from the OpenAI blog post: https://openai.com/blog/multimodal-neurons/" /><p class="subtle-text"><small>Image taken from the OpenAI blog post: https://openai.com/blog/multimodal-neurons/</small></p></div><p>This demonstrates the model’s capability to “read”, as shown in the image above. The caveat is that the algorithm is easily fooled by typographic attacks.
A prominent word like “iPod” handwritten on a sticker can cause a photo to be classified as an iPod, even if the picture clearly shows an apple.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-04-20-yelp-content-as-embeddings/CLIP-vulnerability-apple-ipod.png" alt="Image taken from the OpenAI blog post: https://openai.com/blog/multimodal-neurons/" /><p class="subtle-text"><small>Image taken from the OpenAI blog post: https://openai.com/blog/multimodal-neurons/</small></p></div><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-04-20-yelp-content-as-embeddings/CLIP-vulnerability-apple-pizza.png" alt="Image taken from the OpenAI blog post: https://openai.com/blog/multimodal-neurons/" /><p class="subtle-text"><small>Image taken from the OpenAI blog post: https://openai.com/blog/multimodal-neurons/</small></p></div><p>For Yelp’s dataset, this means restaurant merchandise lying around in a photo could cause additional misclassifications.</p><h2 id="conclusion">Conclusion</h2><p>While working on this project, we took the opportunity to review and upgrade our storage system for the vector representations we are responsible for. We aimed to make this data as accessible and easy to use as possible for any internal project.</p><p>To complete the project, we generated new embeddings for all of our collected Yelp data using the models and techniques chosen to create our content embeddings.</p><p>Yelp aims to constantly grow the breadth, depth, and accuracy of the data we show to our consumers, and review and text embeddings show great promise in helping us improve along all three dimensions.</p><p>Many teams are working with the extensive datasets Yelp offers, and there are still plenty of unexploited opportunities, especially in deep learning. CLIP-based embeddings are our first version of photo embedding generation, and only the beginning.
Fine-tuning the CLIP model on the Yelp domain should further improve the photo embeddings, and our team is presently exploring this. The business embedding currently incorporates only review embeddings; it could also take photos or other metadata as inputs.</p><p>Thanks to this project, Yelp now owns a database with hundreds of millions of embeddings, and many Yelp teams are already using them to improve their products.</p><h2 id="acknowledgements">Acknowledgements</h2><p>Many people were involved in these projects, but special thanks to Parthasarathy Gopavarapu, Satya Deo, John Roy, Blake Larkin, Shilpa Gopi, and Jason Sleight, who helped with the design and implementation of these projects or with the content of this post.</p><h2 id="appendix">Appendix</h2><p>Comparison of the production ResNet50 model with the zero-shot CLIP model after some label engineering.</p><table><thead><tr><th>Comparison Table of the Food Classifier</th>
<th> </th>
<th> </th>
<th> </th>
<th> </th>
</tr><tr><th> </th>
<th>ResNet50</th>
<th> </th>
<th>CLIP</th>
<th> </th>
</tr><tr><th> </th>
<th>Recall</th>
<th>Precision</th>
<th>Recall</th>
<th>Precision</th>
</tr></thead><tbody><tr><td>Pizza</td>
<td>0.96</td>
<td>0.92</td>
<td>0.90</td>
<td>0.83</td>
</tr><tr><td>Sushi &amp; Sashimi</td>
<td>0.87</td>
<td>0.78</td>
<td>0.79</td>
<td>0.69</td>
</tr><tr><td>Ramen &amp; Noodles</td>
<td>0.82</td>
<td>0.95</td>
<td>0.70</td>
<td>0.55</td>
</tr><tr><td>Sandwiches</td>
<td>0.93</td>
<td>0.97</td>
<td>0.57</td>
<td>0.44</td>
</tr><tr><td>Tacos</td>
<td>0.78</td>
<td>0.75</td>
<td>0.83</td>
<td>0.59</td>
</tr><tr><td>Salads</td>
<td>0.67</td>
<td>0.92</td>
<td>0.65</td>
<td>0.50</td>
</tr><tr><td>Donuts</td>
<td>0.80</td>
<td>0.77</td>
<td>0.55</td>
<td>0.87</td>
</tr><tr><td>Steak</td>
<td>0.84</td>
<td>0.84</td>
<td>0.39</td>
<td>0.46</td>
</tr><tr><td>Burgers</td>
<td>0.84</td>
<td>0.87</td>
<td>0.77</td>
<td>0.59</td>
</tr><tr><td>Bagels</td>
<td>0.91</td>
<td>0.90</td>
<td>0.55</td>
<td>0.85</td>
</tr><tr><td>Cupcakes</td>
<td>0.75</td>
<td>0.81</td>
<td>0.74</td>
<td>0.93</td>
</tr><tr><td>Fish &amp; Chips</td>
<td>0.87</td>
<td>0.77</td>
<td>0.89</td>
<td>0.74</td>
</tr><tr><td>Burritos &amp; Wraps</td>
<td>0.79</td>
<td>0.67</td>
<td>0.47</td>
<td>0.66</td>
</tr><tr><td>Hot Dogs</td>
<td>0.76</td>
<td>0.73</td>
<td>0.54</td>
<td>0.90</td>
</tr><tr><td>Crepes</td>
<td>0.94</td>
<td>0.94</td>
<td>0.69</td>
<td>0.55</td>
</tr><tr><td>Waffles</td>
<td>0.89</td>
<td>0.89</td>
<td>0.49</td>
<td>0.88</td>
</tr><tr><td>Pancakes</td>
<td>0.69</td>
<td>0.79</td>
<td>0.38</td>
<td>0.83</td>
</tr><tr><td>Nachos</td>
<td>0.81</td>
<td>0.86</td>
<td>0.77</td>
<td>0.74</td>
</tr><tr><td>Soups &amp; Chowder</td>
<td>0.70</td>
<td>0.71</td>
<td>0.47</td>
<td>0.69</td>
</tr><tr><td>Ribs</td>
<td>0.67</td>
<td>0.60</td>
<td>0.60</td>
<td>0.69</td>
</tr><tr><td>Curry</td>
<td>0.64</td>
<td>0.61</td>
<td>0.57</td>
<td>0.62</td>
</tr><tr><td>Paella</td>
<td>0.79</td>
<td>0.82</td>
<td>0.90</td>
<td>0.79</td>
</tr><tr><td>Oysters &amp; Mussels</td>
<td>0.69</td>
<td>0.79</td>
<td>0.69</td>
<td>0.87</td>
</tr><tr><td>Grilled Fish</td>
<td>0.86</td>
<td>0.77</td>
<td>0.51</td>
<td>0.59</td>
</tr><tr><td>Pasta</td>
<td>0.65</td>
<td>0.53</td>
<td>0.55</td>
<td>0.85</td>
</tr><tr><td>Chicken Wings &amp; Fried Chicken</td>
<td>0.81</td>
<td>0.83</td>
<td>0.57</td>
<td>0.56</td>
</tr><tr><td>Dessert</td>
<td>0.85</td>
<td>0.86</td>
<td>0.58</td>
<td>0.41</td>
</tr></tbody></table><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2023/04/yelp-content-as-embeddings.html</link>
      <guid>https://engineeringblog.yelp.com/2023/04/yelp-content-as-embeddings.html</guid>
      <pubDate>Thu, 20 Apr 2023 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Gondola: an internal PaaS architecture for frontend app deployment]]></title>
      <description><![CDATA[<p>The Yelp website serves millions of users and business owners each day, and engineers in our product teams are constantly adding and improving features across hundreds of pages. Webcore, Yelp’s frontend infrastructure team, is always looking to ensure that web developers can ship their changes quickly and safely, without the burden of maintaining complex team-specific infrastructure.</p><p>To achieve this, we made some significant changes to our internal deployment model for <a href="https://reactjs.org/">React</a> pages in late 2019. This blog post will explain why we made these changes, describe the new architecture we implemented, and share some of the lessons we learned along the way.</p><p>We ended up with an architectural model based on an immutable <a href="https://en.wikipedia.org/wiki/Key%E2%80%93value_database">key-value (KV) store</a> with clearly defined page boundaries: frontend asset manifests that can be hot-swapped quickly and safely in production. Alongside that platform layer, “Gondola”, we rolled out a new <a href="https://en.wikipedia.org/wiki/Monorepo">monorepo</a>, solving many of the challenges we had begun facing as we scaled the number of feature teams and webpages across the site.</p><p>Yelp’s website was originally served by a large Python <a href="https://en.wikipedia.org/wiki/Monolithic_application">monolith</a>, and over time this has shifted towards a <a href="https://en.wikipedia.org/wiki/Microservices">microservice architecture</a> for backend services, allowing teams to maintain their own Docker images, deployment pipelines, and runbooks. This concept was then expanded to the frontend, which brought over frontend asset build configs (<a href="https://webpack.js.org/">webpack</a>, <a href="https://babeljs.io/">Babel</a>, <a href="https://eslint.org/">ESLint</a>…) for teams to maintain. 
Webcore set up shared configs and CLI tooling to encode recommended best practices in order to ensure a consistent frontend build experience.</p><p>In this environment, each individual feature team at Yelp ended up owning one small “website slice”, from top to bottom. Full-stack developers on these teams would be responsible for their entire stack, encompassing both the frontend and backend as well as the linting, testing, and on-call responsibilities that came along with it. Even with the help of the shared Webcore-provided frontend infra tooling, relying on teams to keep the shared configs up to date wasn’t ideal - especially if certain frontend microservices had minor deviations.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-02-24-gondola-an-internal-paas-architecture-for-frontend-app-deployment/status_quo.png" alt="Our status quo model, where each team owns a potentially-fragmented piece of the website stack" /><p class="subtle-text"><small>Our status quo model, where each team owns a potentially-fragmented piece of the website stack</small></p></div><p>As a result, we often saw a lag between releasing a new version of our shared build infrastructure and seeing its effects on the wider set of web pages. We’d sometimes even have cases where pages would be stuck on an old version of our tooling for months, and so it was difficult for Webcore to have confidence in infrastructure changes we released. Manually testing every frontend microservice wasn’t feasible because they often drifted from Webcore standards, resulting in custom deployment models and unique setups.</p><p>As we started moving to React and away from our Python-powered templating, it was clear that we were becoming less reliant on server-side logic. Much of our UI was starting to be described via React (rendered through Server Side Rendering), and our data fetching was moving to GraphQL on a per-component basis. 
Despite not needing anything other than simple data fetching and stitching on the server, developers would have to deploy a full Python service to make even a simple copy change or style update. This could sometimes take an hour or more for larger deployments when many instances were required, and rolling back or reverting changes could take a similar amount of time even for frontend-only updates!</p><h2 id="a-better-model">A better model</h2><p>When comparing our largest frontend microservices at Yelp, we could see that much of our existing infrastructure concerning the deployment of pages could be simplified. Large amounts of boilerplate code existed in order to fetch data, manipulate it into an appropriate form, and then send it off to be server-side rendered using a specified React component representing the whole page.</p><p>We also saw room for improvement given the fact that our services were now generally “thin”, since they delegate <a href="https://www.youtube.com/watch?v=G8P9njqLwHo">React SSR to an external service powered by Hypernova</a> (something we published <a href="https://engineeringblog.yelp.com/2022/02/server-side-rendering-at-scale.html">an updated blog post talking about</a> recently). We imagined a new, centralized service containing generalized logic built to serve all web pages at Yelp. Essentially an internal <a href="https://en.wikipedia.org/wiki/Platform_as_a_service">Platform-as-a-service</a> for React pages!</p><p>Our service, “<strong>Gondola</strong>”, had the following requirements:</p><ol><li>Deploying and rolling back frontend code should be near-instant</li>
<li>Deployment of assets should be decoupled from the Python code powering Gondola</li>
<li>The service should contain minimal page-specific logic: all page behavior should be described by the rendered React components</li>
<li>Teams should only be required to own product code, not infrastructure, and ownership should be clearly defined</li>
</ol><p>Our first step was to reduce the scope of team ownership from a microservice (the “full website slice”) to a “page”. A Gondola page can be defined as an asset manifest describing all JS and CSS entrypoint files that we need to include in order to fully describe a desired UI, along with appropriate chunk names (including async chunks) mapped to public <a href="https://en.wikipedia.org/wiki/Content_delivery_network">CDN</a> urls for each asset. It gives us a way to fully describe each page’s frontend needs and can be generated at build time by webpack:</p><div class="language-json highlighter-rouge highlight"><pre>{
  "entrypoints": {
    "gondola-biz-details": {
      "js": ["gondola-biz-details.js", "common.js"],
      "css": ["gondola-biz-details.css"]
    },
    "gondola-search": {
      "js": ["gondola-search.js", "common.js"],
      "css": ["gondola-search.css"]
    }
  },
  "common.js": "commons-yf-81b79eb1bc6d156.js",
  "gondola-biz-details.js": "gondola-biz-details_a775bc492d91960a.js",
  "gondola-biz-details.css": "gondola-biz-details_eabd4c9f434f9468.css",
  "gondola-search.js": "gondola-search_69082d627b823fd5.js",
  "gondola-search.css": "gondola-search_d0ef76f21dcbf11d.css"
}</pre></div><p>This choice was very deliberate, as it allows us to embrace the web platform (with URLs at its core) as the primary building block for routing, bundling, and deploying code to yelp.com. This simplified many decisions in the rest of our design once we had settled on this level of granularity as our main abstraction.</p><p>We then took our existing Pyramid React renderer (a <a href="https://docs.pylonsproject.org/projects/pyramid/en/latest/narr/renderers.html">Pyramid renderer</a> designed to take props from Pyramid and produce a rendered SSR page via React), which was built for individual teams to use in their services, and tweaked it to work alongside a fast KV store powered by DynamoDB.
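The deployment model this enables is easy to reason about: manifests are immutable once written, and “deploying” just repoints the active version for a route. A toy sketch of the two-table idea with plain dictionaries standing in for DynamoDB (function, key, and file names here are illustrative, not Yelp’s actual schema):

```python
# Toy in-memory stand-ins for the two tables: immutable manifests keyed by
# (page, version), and the active version currently serving each route.
manifests = {}
active = {}

def publish(page, version, manifest):
    """Build step: write a new immutable manifest. Does not affect traffic."""
    key = (page, version)
    assert key not in manifests, "a (page, version) pair is never rewritten"
    manifests[key] = manifest

def deploy(path, page, version):
    """'Deploying' is flipping a single version row; returns the previous
    value so a rollback is the same near-instant operation."""
    assert (page, version) in manifests, "assets must be published beforehand"
    previous = active.get(path)
    active[path] = (page, version)
    return previous

def render(path):
    """Serve a request: resolve the active manifest for the matched route."""
    page, version = active[path]
    return manifests[(page, version)]

publish("gondola-search", "v41", {"js": ["gondola-search_aaa.js"]})
publish("gondola-search", "v42", {"js": ["gondola-search_bbb.js"]})
deploy("/search", "gondola-search", "v41")
prev = deploy("/search", "gondola-search", "v42")  # near-instant deploy
deploy("/search", *prev)                           # ...and instant rollback
```

Because a (page, version) pair can never change once written, the serving side can cache manifests aggressively with no invalidation logic.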
In our database, we store our page manifest data keyed by Gondola page version, and in a separate table track the active Gondola page for a given path (we use the commonly-adopted <a href="https://github.com/pillarjs/path-to-regexp">path-to-regexp</a> format for matches here).</p><p>All interaction with our KV store is performed via a small CLI tool we distribute across our development environments (including Jenkins) which talks to DynamoDB in a consistent schematised way. The Gondola service itself only requires read-only access to the database so that it can serve the appropriate pages as requests come in.</p><p>This means that the flow for an incoming request to Yelp looks as follows:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-02-24-gondola-an-internal-paas-architecture-for-frontend-app-deployment/request_flow.png" alt="An incoming request hitting the Gondola service to render the Search page" /><p class="subtle-text"><small>An incoming request hitting the Gondola service to render the Search page</small></p></div><ol><li>A user requests a Gondola-powered page such as /search - this goes directly to Gondola, and matches the route via <a href="https://github.com/pillarjs/path-to-regexp">path-to-regexp</a></li>
<li>Gondola queries DynamoDB to determine the active version for /search, and the accompanying asset manifest for that version</li>
<li>A query is made to our dedicated Server Side Rendering (SSR) service which returns rendered html</li>
<li>The appropriate asset tags from the manifest are included in the page shell to hydrate the page</li>
</ol><p>By basing the rendering of the page entirely on the contents of the page manifest, the Gondola service has a lot of flexibility: this model supports our first requirement of near-instant deployments, since “deploying” a Gondola page now consists of updating a single version row in our DB. This assumes you’ve built and uploaded your assets and manifest, but this can happen at any time beforehand: creating a new Gondola page version isn’t tied to deployment.</p><p>This means that our merge pipeline becomes a lot safer. The only thing that can affect production is the DB being updated to flip active versions, and the version can be instantly reverted in the same way if we spot errors during rollout.</p><p>The nature of the KV store model also lends itself to cacheability: a given page and version pair is <strong>immutable</strong>, and we can serve manifests very efficiently from an in-memory cache layer without needing complex cache invalidation.</p><p>One of the most important benefits of this model is that Webcore now has the ability to make changes to all Gondola pages at once, and introduce significant UX and DX improvements across all pages with ease. For example, we can add new metrics to our performance logging infrastructure centrally, or optimise our first-byte times for all pages with a single Pull Request.</p><p>In a world where teams maintain their own frontend microservices, we don’t have the ability to make sweeping changes. This would require either a large amount of onboarding and education or Webcore-led migrations to get everyone onto the latest and greatest libraries containing any improvements we ship out.
This comes with its own set of dependency versioning challenges and is generally no fun for anyone.</p><h3 id="deployment-previews">Deployment Previews</h3><p>Another win for this model is the ability to layer additional logic around the hot-swapping of frontend versions: as one important example, it allows us to implement Deployment Previews internally, where we can tag specific versions as pre-release and view them against the production website instantly via a query param.</p><p>A deployment preview model naturally fits with our routing behavior above. Deployment Preview IDs (using memorable and fun names like cool-purple-hippo-24!) can slot in anywhere that versions are used, and the logic remains almost identical.</p><p>While not a novel feature (most modern static site hosts offer something similar), having Deployment Previews internally allows for:</p><ul><li>Realistic demos against prod data rather than relying on persistent sandboxes or screenshots</li>
<li>The ability to quickly compare two versions against the same environment, including unreleased versions</li>
<li>Audits and automatic smoke tests run on every PR, against the Deployment Preview url</li>
</ul><p>The last point is something that has a great deal of potential in the future, too: we already have several “Page Checks” which run Lighthouse performance audits, checks for console errors or JS exceptions, A11y audits, and automatic screenshots. All of these checks can be run at PR-time without the developer having to do anything, with results conveniently reported back via GitHub status checks:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-02-24-gondola-an-internal-paas-architecture-for-frontend-app-deployment/page_checks.png" alt="Example Page Checks running against a branch’s Deployment Preview at PR time" /><p class="subtle-text"><small>Example Page Checks running against a branch’s Deployment Preview at PR time</small></p></div><p>This all hinges on our ability to switch out the running version of a page near-instantly in any environment, made possible by Gondola. There are likely many other opportunities unlocked by this newfound freedom that we’ve yet to explore!</p><h2 id="the-monorepo">The Monorepo</h2><p>In addition to our work to build out the Gondola service, we needed a pipeline to ferry changes between a Pull Request, asset manifest, and our DB, so that they can subsequently be deployed.</p><p>Our status quo was a loose collection of team-owned Jenkins pipelines spread across many different individual git repositories. This was never ideal for the reasons outlined earlier, but the rethinking of our deployment model gave us a great opportunity to do something about our package dependency model.
The result was a new monorepo for frontend code.</p><p>By moving to a monorepo, we sought to solve some of the largest problems that had been frustrating developers previously:</p><ul><li>No more “<a href="https://en.wikipedia.org/wiki/Dependency_hell">dependency hell</a>” - updating a monorepo package version automatically releases a new page if the package is directly or transitively depended upon, and packages are enforced to only depend upon the latest version on disk
<ul><li>It’s easier to reason about the dependencies that will be bundled in the final page</li>
<li>We can also globally enforce a <a href="https://yarnpkg.com/cli/dedupe#details">deduplicated lockfile</a> to minimise our install and build times</li>
</ul></li>
<li>No backend infrastructure to maintain: we’ve moved all of that to the Gondola service, so the monorepo can be 100% frontend code</li>
<li>Any improvements to the build immediately benefit all developers, with no need for migrations - all build infrastructure and tooling is shared and maintained by Webcore, and any changes can be easily confirmed to work in the monorepo</li>
<li>Faster, more efficient bundling: since all pages and packages live together, we’re able to run a single Webpack build with multiple entry points, and utilize <a href="https://web.dev/granular-chunking-nextjs/">granular chunking strategies</a> that can take advantage of cross-page shared chunks</li>
</ul><p>To avoid the growth of the monorepo slowing down developers, we built and continue to maintain tooling to run tests only against packages that have been affected by the PR in question (we use <a href="https://github.com/lerna/lerna">lerna</a> with additional custom scripts). This was and is one of the biggest concerns that tends to appear when discussing monorepos, and it’s important that we stay on top of build performance to ensure that we’re not frustrating developers.</p><p>We also enforce strict package boundaries and require that each package in the monorepo has a <a href="https://engineeringblog.yelp.com/2021/01/whose-code-is-it-anyway.html">defined owner</a>. We provide a helpful scaffold to make this process simple when first adding code to the monorepo, which has helped significantly with onboarding.</p><h2 id="developers-developers-developers">Developers, developers, developers</h2><p>A major part of the work involved with Gondola was to ensure that developers could be onboarded with minimal disruption. A lot of this work was non-technical: we felt it was important to involve our customers (in this case, internal front-end developers) as early as possible in the design process and make sure that what we were building was actually useful for them! Writing docs as we went and pairing with early adopters directly helped mitigate a vast swathe of potential problems which we may not otherwise have discovered.</p><p>In our case, since we were asking developers to change some of their pre-established patterns of working on frontend code, we sought to maintain as much familiarity as possible with our tooling decisions. As one example, at Yelp we use Make as standard in all repos, so it was important to ensure that a developer opening up the monorepo for the first time would feel at home. 
We set up symlinked Makefiles per-package to ensure that running commands from within a package would feel close-to-identical to the old flow.</p><p>We also set up a dedicated docsite with an in-depth migration guide, and provided clear iterative steps: in particular, we emphasized that step one involved moving frontend code to the monorepo <em>without</em> requiring a move to Gondola. This made it easier for teams to tackle the migration at their own pace without the need for any “big bang” rewrites.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-02-24-gondola-an-internal-paas-architecture-for-frontend-app-deployment/gondola_docs.png" alt="Part of our dedicated Gondola migration guide written internally for developers" /><p class="subtle-text"><small>Part of our dedicated Gondola migration guide written internally for developers</small></p></div><h3 id="supporting-legacy-data-fetching">Supporting legacy data fetching</h3><p>While GraphQL is our primary supported data fetching method at Yelp, there are still some services which continue to fetch their data via Python. Since we don’t expose the Python backend to Gondola users, this poses a problem: how can we allow developers to onboard onto Gondola without requiring them to take on an <em>additional</em> GraphQL migration?</p><p>We solved this by building a custom <a href="https://docs.pylonsproject.org/projects/pyramid/en/latest/narr/renderers.html">Pyramid renderer</a> we call the “Gondola Legacy Renderer”: it’s designed to be plugged into any existing service, firing off a request to Gondola with an additional set of “legacy props” passed via GET request body internally.
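In other words, the legacy renderer turns an existing Pyramid service into a thin shim that bolts its server-fetched data onto an otherwise normal Gondola request. A hypothetical sketch of that glue (function and field names are ours for illustration, not the actual renderer’s API):

```python
def build_gondola_request(path, query_params, legacy_props):
    """Package a service's Python-fetched data as 'legacy props' on an
    internal request to Gondola, which then renders the page as usual."""
    return {
        "path": path,                        # route Gondola matches via path-to-regexp
        "params": dict(query_params),        # original request params, passed through
        "legacy_props": dict(legacy_props),  # server-side data, no GraphQL needed yet
    }

# An existing service proxying one of its pages through Gondola.
req = build_gondola_request(
    "/biz/some-business",
    {"utm_source": "email"},
    {"locale": "en_US", "reviews": [{"rating": 5}]},
)
print(req["legacy_props"]["locale"])  # prints "en_US"
```

The key point is that the existing service keeps owning its data fetching while Gondola owns rendering and deployment, so teams can finish the GraphQL migration later.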
This means that we unlock the ability for any existing service to <em>become a proxy for Gondola itself</em>, gaining the majority of benefits of a “real” Gondola page while teams complete their migration to GraphQL.</p><p>Several teams have adopted the Legacy Renderer and we’re pleased with its ability to bridge the gap for developers who otherwise may not have had the bandwidth to start migrating away from dedicated team-owned services.</p><h2 id="the-future">The future</h2><p>With Gondola, we aimed to build a platform for all of Web at Yelp: we wanted to introduce a large shift in our mental model of deployment and question some of the existing assumptions we had about what was feasible to design.</p><p>So far, we’ve seen positive signs from our customers that our approach was successful. The majority of Yelp’s web traffic is now served by Gondola, but there’s lots more to do: the Gondola platform can never really be “finished”, so we continue to roll out and improve core features and take into account feedback from web developers across the company.</p><p>As teams continue to onboard, we’ve introduced optimistic build queues, started incrementally adopting fast rust-based tooling like <a href="https://swc.rs/">swc</a> in critical areas, and continue to implement Page Checks to provide assurance that PRs created against Gondola meet the company’s web performance goals. 
There’s also room for exciting new Deployment Preview integrations and ways to improve our DX for all developers.</p><p>With releases like <a href="https://reactjs.org/blog/2021/06/08/the-plan-for-react-18.html">React 18</a> and its support for streamed SSR responses, our ability to make sweeping changes across the monorepo (and by extension all Gondola pages at Yelp) gives us confidence that we can perform this and other large migrations in ways that stay out of feature developers’ way: something that’s critical to ensure we’re not negatively affecting deployment velocity while embracing industry best practices.</p><h2 id="conclusion">Conclusion</h2><p>The creation of Gondola itself was years in the making: the journey from our legacy Python/jQuery templates, to React, to GraphQL, and finally to the monorepo model did not happen overnight. It was important to iterate gradually with immediate benefits gained at each stage - <a href="https://www.joelonsoftware.com/2000/04/06/things-you-should-never-do-part-i/">rewrites should always be avoided</a>!</p><p>By simplifying and slimming down our deployment model, we’ve been able to introduce features that were impossible before, removing a large amount of cognitive overhead from feature devs who shouldn’t be required to maintain their own website stacks top-to-bottom.</p><p>It’s been exciting and encouraging to see the positive response from developers, as well as the amount of support we’ve had from all our internal customers! There’s a lot more we want to get done, but Gondola serves as a great platform for us to do it, and the future of web development is looking exciting at Yelp.</p><h2 id="acknowledgements">Acknowledgements</h2><p>Gondola wouldn’t have been possible without the input from many teams and individuals across the company. 
Thanks go out to current and past members of the Webcore team, the many contributors to Gondola’s codebase and docs, as well as the initial spec reviewers from our product teams that helped turn the idea into reality.</p><p>Additional thanks goes out to all the developers in our web tech community that work every day with the platform and offer us honest and direct feedback that helps us shape Gondola’s roadmap!</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/c/engineering-jobs?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2023/03/gondola-an-internal-paas-architecture-for-frontend-app-deployment.html</link>
      <guid>https://engineeringblog.yelp.com/2023/03/gondola-an-internal-paas-architecture-for-frontend-app-deployment.html</guid>
      <pubDate>Fri, 03 Mar 2023 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[How Yelp's Security Team Does Threat Hunting]]></title>
      <description><![CDATA[<p>Here at Yelp, we have multiple security teams specialized in various areas. One thing we all have in common is the fact that we all enjoy a bit of threat hunting occasionally. We opted to take advantage of everyone’s diverse knowledge and began our journey of creating our own threat hunting methodology. This blog post includes the less glamorous details, such as our early beginnings and initiative, our “success in progress” and the multitude of approaches that we considered. In the end, we will present the stable process that we are now using and continuing to improve at every iteration.</p><p><em>Imaginary engineer (working outside Yelp): ‘Wait, so does Yelp conduct threat hunts?’</em></p><p><em>Yelp: ‘Of course, we do! Do you not?’</em></p><p><em>Imaginary engineer: ‘Well… it looks so complex that we don’t know where to start yet’</em></p><p><em>Yelp: ‘Oh, it’s actually only as complex as you allow it to be. We tried a complex process but we also tried a working one. Here, let me tell you what we tried and how it suited us.’</em></p><p>Rather than having a dedicated threat-hunting team, we made participation in threat-hunting exercises voluntary and available to all our great security engineers. Our success story stems from having captured plenty of interest: soon enough, we had more and more people participating. Our threat hunters are your typical security engineers who put their blue hat on and start building security tooling, processes and everything else they see fit to make sure nothing keeps Yelp awake at night. 
They are curious and tenacious and they like putting on other hats for a change, so they are happy to put away their blue hat and try on a red one every once in a while.</p><h2 id="the-lets-start-threat-hunting-phase-aka-phase-0">The “Let’s Start Threat Hunting!” Phase (aka Phase 0)</h2><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-03-01-our-threat-hunting-journey/phase0.png" alt="" /></div><p>Luckily, before our time at Yelp, there were 2 great engineers who started it all. They would get together once in a while. They would think about the security gaps and shortcomings their organization shared but had never recorded anywhere. They’d cherry-pick one and exploit it to the limit. Who doesn’t like breaking stuff? Then, quietly, they would throw away their red hat as if nothing had happened, put back their more comfortable hat, the blue one, and fix those security shortcomings. As quiet as they were, they still caught attention. But the good kind of attention, as that’s how we got buy-in from stakeholders to do more of these! Yay!</p><p>So… what went wrong? Well, the team got bigger and bigger and everyone wanted to be part of their success story. But we couldn’t contribute - we weren’t part of the circle when the knowledge was shared. It didn’t scale. It didn’t fit new employees.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-03-01-our-threat-hunting-journey/before_phase1.png" alt="" /></div><h2 id="the-lets-ramp-up-everyone-phase-aka-phase-1">The “Let’s Ramp up Everyone” Phase (aka Phase 1)</h2><p>What does your instinct tell you to do when you have 10 people instead of 2? Group them in teams! So we split the group into three teams. The red team would plan a threat hunt, emulate their attack and map their recordings to MITRE ATT&amp;CK and Cyber Kill Chain. The blue team would investigate it as per our Incident Response procedure. 
The purple team would analyze what was caught and what was missed, and they would dive deep into anomalous logs to make sure there was no real threat present in our environment, putting aside our emulation. Then we’d all brainstorm security controls to improve our posture for every security gap, and implement them. Then we’d try the attack again and send an executive report to stakeholders with the TL;DR. Then finally we would do a retrospective.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-03-01-our-threat-hunting-journey/phase1.png" alt="" /></div><p>That’s a lot of steps, isn’t it? Can you guess how many were missed during the threat hunts? Plenty. Why? This process was so laborious and intensive that it never fully caught on. And there was always so much work for everyone! Many lost interest and excitement quickly because, as you may recall, we are not a threat hunting team. Threat hunting was seriously competing with our other roadmapped projects and initiatives. We did achieve some success, the greatest being that by working in teams, everyone got ramped up and was happy to contribute. That had been our greatest struggle during Phase 0, so we’re confident that this step was necessary before we were able to move forward.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-03-01-our-threat-hunting-journey/before_phase2.png" alt="" /></div><h2 id="the-lets-get-back-to-basics-phase-aka-phase-2">The “Let’s Get Back to Basics” Phase (aka Phase 2)</h2><p>By this point, our process had become too complex a machine, so we decided to scale it back down. It wasn’t in vain, though - we got plenty of good ideas, some that are now in use and others that are dormant, waiting for us to be ready. 
THIS IS OUR CURRENT PROCESS.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-03-01-our-threat-hunting-journey/phase2.png" alt="" /></div><p>We went back to 2 people working on a threat hunt, while the rest are free to work on their other responsibilities. We’ve gotten better at planning: we only choose granular exploits, similar in size and difficulty to a MITRE TTP. TTP stands for “Tactics, Techniques and Procedures” within the MITRE ATT&amp;CK framework, which is a knowledge base for past and current malicious activity all over the world. A TTP is like a zoom tool with 3 levels. You zoom once and you see why an attacker might perform an action. You zoom twice and you see how they do it at a conceptual level. You zoom thrice and you see the actual tools that they use and other hands-on details.</p><p>A granular exploit that we choose to hunt is often actually a TTP. And if we need to threat hunt a complex scenario, like a real attack from the news, then we break it down into TTPs and conduct one threat hunt per TTP. Depending on the exploit complexity and the engineers’ time availability, the team is free to choose how many iterations a threat hunt needs.</p><p>Also, we carefully choose the scope rather than trying to tackle all of our environments and edge cases at once. As we’re working at such granularity, the 2 people can take the threat hunt from beginning to end and deliver measurable value fast: they plan the threat hunt, exploit it, investigate whether we had any security controls in place to catch it and fix them if needed. They also look at our real logs that match the same criteria and investigate any anomalies, for potential real threats. 
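</p><p>For illustration only (this is not our actual tooling), breaking a complex scenario down into per-TTP hunts can be sketched in a few lines of Python, using tactic and technique names from the public ATT&amp;CK taxonomy:</p>

```python
from dataclasses import dataclass

@dataclass
class TTP:
    """One MITRE ATT&CK triple, matching the three "zoom levels"."""
    tactic: str      # zoom 1: why the attacker performs the action
    technique: str   # zoom 2: how they do it, conceptually
    procedure: str   # zoom 3: hands-on details and tooling

def plan_hunts(scenario, ttps):
    """Break a complex scenario into one threat hunt per TTP."""
    return [f"{scenario}: hunt {t.tactic} / {t.technique}" for t in ttps]

# Illustrative TTPs, using real ATT&CK tactic/technique names.
ttps = [
    TTP("Initial Access", "Phishing", "malicious attachment"),
    TTP("Credential Access", "Brute Force", "password spraying"),
]
hunts = plan_hunts("attack-from-the-news", ttps)
```

<p>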
Finally, they prepare a short presentation (a handful of slides of content) and present it to the rest of the teams and interested stakeholders.</p><p>This process was well embraced by the team, and everyone gets to be part of the success story while still having the time to deliver their other projects. And we get measurable and consistent results that are easy to grasp by anyone interested in our work.</p><h2 id="the-hopes-for-the-future-phase-aka-phase-3">The “Hopes for the Future” Phase (aka Phase 3)</h2><p>So this is our current process. Are we done? No, we are not done. Remember that some ideas that came up during Phase 1 are still dormant? We’d like to wake them up at some point. Not today, but when we’re ready. By then, who knows, maybe we will even automate them. In this way, we would keep the process as light as today, but we would inform it with logic on prioritization, risk assessment, metrics, stakeholder bulletins, etc.</p><p>But until then, <a href="https://i.pinimg.com/originals/46/1e/a2/461ea2bdb2dfd17c166bbd2f7379384e.jpg">we’ll do as Saitama</a> and leave tomorrow’s problems to tomorrow’s us.</p><p><em>Imaginary Engineer: ‘Thanks! That’s quite a journey! I’ll give it a try… But what if we fail?’</em></p><p><em>Yelp: ‘Well, remember what Albert Einstein said: “Failure is success in progress”. We’re confident that we’re progressing towards success with each and every phase. And so are you.’</em></p><p>First and foremost, many thanks to Andrea Dante Bozzola for supporting our threat hunting interest even when it competed with the roadmap we would have liked to see delivered. It goes without saying that the CorpSec team has been of tremendous help, even though it is we, Security Effectiveness, who have institutionalized threat hunting here at Yelp. Finally, we’d like to thank Matteo Piano for reviewing this blog post and for his anime expertise. 
Here is a list of everyone who has been actively contributing to threat hunting since its inception, in no particular order: Matt Carroll, Matteo Piano, Ioana Iliescu, Ramona Tame, Daniel Popa Cristobal, Joey Weate, Andrea Dante Bozzola, Tommy Stallings, Ignacio Rodriguez Paez, Florian Stein.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/bd07a618-9b6f-4920-91c6-99280f1b268d?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2023/02/how-yelps-security-team-does-threat-hunting.html</link>
      <guid>https://engineeringblog.yelp.com/2023/02/how-yelps-security-team-does-threat-hunting.html</guid>
      <pubDate>Mon, 20 Feb 2023 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Rebuilding a Cassandra cluster using Yelp’s Data Pipeline]]></title>
      <description><![CDATA[<p>Robots are frequently used in the manufacturing industry for numerous use cases. One such case is automatically preventing defective products from reaching the finished goods inventory. The same principles can be adopted to filter out malformed data from datastores. This blog post deep dives into how we rebuilt one of our Cassandra (C*) clusters by removing malformed data using Yelp’s Data Pipeline.</p><p><a href="https://cassandra.apache.org/">Apache Cassandra</a> is a distributed wide-column NoSQL datastore and is used at Yelp for storing both primary and derived data. Many different features on Yelp are powered by Cassandra. Yelp orchestrates Cassandra clusters on Kubernetes with the help of operators (explained in <a href="https://engineeringblog.yelp.com/2020/11/orchestrating-cassandra-on-kubernetes-with-operators.html">our Operator Overview post</a>). At Yelp, we tend to use multiple smaller clusters based on the data, traffic and business requirements. This strategy assists in containing the blast radius in case of failure events.</p><p>For us at Yelp, the primary driver for this effort was the discovery of data corruption across multiple nodes inside one of our Cassandra clusters. This corruption was widespread, affecting different tables including those in the <a href="https://docs.datastax.com/en/cql-oss/3.x/cql/cql_using/useQuerySystem.html">system</a> keyspace. The following are some of the events that unfolded as we discovered the issue.</p><ol><li>
<p>Numerous exceptions indicating the corruption started appearing in the Cassandra logs, across multiple nodes in the cluster.</p>
</li>
<li>
<p>Repairs began failing on the Cassandra cluster, which can lead to inconsistencies and data resurrection.</p>
</li>
<li>
<p>The compaction process was seen failing on the Cassandra cluster. Compaction allows <a href="https://cassandra.apache.org/doc/latest/cassandra/architecture/storage_engine.html#sstables">SSTables</a> (Sorted String Tables) to be merged together, leaving fewer SSTables to maintain and hence improving read performance.</p>
</li>
</ol><p>Since the corruption was widespread, removing SSTables and running repairs wasn’t an option as it would have led to data loss. Also, based on corruption size estimates and the value of the recent data, we opted not to restore the cluster to the last corruption-free backup.</p><p>More technical details about the corruption and the initial remediation steps like repairs and data scrubbing are covered in the Appendix. Though those steps didn’t help us fix the issue, they provide vital information about the nature of the corruption.</p><p>In order to mitigate the issue and stop more data from getting corrupted, we decided to rebuild a new cluster by migrating data from the existing cluster.</p><h2 id="overall-strategy">Overall Strategy</h2><p>The overall high-level strategy for rebuilding a new Cassandra cluster to mitigate the issue is quite similar to the sortation systems used for quality checking in the manufacturing industry. Within the industry, automatic sorters installed on conveyors inspect each product and keep defective ones from reaching the finished goods inventory.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-01-30-rebuilding-a-cassandra-cluster-using-yelps-data-pipeline/sortation-system.png" alt="Conceptual Model of Sortation System" /><p class="subtle-text"><small>Conceptual Model of Sortation System</small></p></div><p>Using the same principle, a Data Pipeline was created to rebuild a new Cassandra cluster after eliminating the malformed data, as depicted in the figure below.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-01-30-rebuilding-a-cassandra-cluster-using-yelps-data-pipeline/mitigation-strategy.png" alt="Corruption Mitigation Strategy at a High Level" /><p class="subtle-text"><small>Corruption Mitigation Strategy at a High Level</small></p></div><p>The process extensively relies on the different connectors and pipeline 
tools developed by Yelp’s Data Infrastructure teams. Here’s a quick explanation of the overall dataflow.</p><ul><li>
<p>A new Cassandra cluster, the “Sanitized Cassandra Cluster”, was spun up on Yelp’s modern Kubernetes infrastructure. This allowed the new cluster to benefit from many hardware and software upgrades.</p>
</li>
<li>
<p>The data from the original Cassandra cluster was published into Yelp’s Data Pipeline to create an “Original Data Stream” through Yelp’s Cassandra Source Connector. The Cassandra Source Connector relies on the <a href="https://cassandra.apache.org/doc/latest/cassandra/operating/cdc.html">Change Data Capture (CDC)</a> feature, which was introduced in Cassandra 3.8. More details about the Cassandra Source Connector can be found in the blog post <a href="https://engineeringblog.yelp.com/2019/12/cassandra-source-connector-part-1.html">Streaming Cassandra into Kafka in (near) Real Time</a>.</p>
</li>
<li>
<p>Stream Processors allow transformation of the Data Pipeline streams. This stream processor acts as an “automatic sorter” responsible for keeping the malformed data from reaching the destination. Of the various <a href="https://engineeringblog.yelp.com/2016/08/paastorm-a-streaming-processor.html">stream processors supported</a> by Yelp’s Data Pipeline, Stream SQL was adopted in this case as it allowed writing stream processing applications in a language similar to SQL. While writing the stream processor, a few considerations were required.</p>
<ul><li>
<p><strong>Source and Destination Data Stream identifiers</strong>: The identifiers allow selection of the input &amp; output Data Pipeline topics.</p>
</li>
<li><strong>Sanitization Criteria</strong>: This specifies the valid lists/ranges of values for fields inside the Data Pipeline. Inspecting the data, we figured out that a criterion based on the id &amp; time values could filter out the malformed data. A simple Stream SQL statement sanitizing on a non-negative id and a valid time_created range would look as follows.
<pre>
SELECT
  id, time_created
FROM 
WHERE
id IS NOT NULL
AND id &gt;= 0
AND time_created IS NOT NULL
AND TIMESTAMPDIFF(DAY, CURRENT_TIMESTAMP, time_created) &lt;= 1
AND TIMESTAMPDIFF(YEAR, CAST('2000-01-01 00:00:00' AS TIMESTAMP),
                 time_created) &gt;= 0;</pre></li>
<li><strong>Malformed Stream Criteria</strong>: This allows creation of a data stream containing all the malformed data. That can simply be created by inverting the sanitization stream SQL statement.
<pre>
SELECT
 id, time_created
FROM 
WHERE <strong>NOT(</strong>
id IS NOT NULL
AND id &gt;= 0
AND time_created IS NOT NULL
AND TIMESTAMPDIFF(DAY, CURRENT_TIMESTAMP, time_created) &lt;= 1
AND TIMESTAMPDIFF(YEAR, CAST('2000-01-01 00:00:00' AS TIMESTAMP),
                time_created) &gt;= 0
<strong>)</strong>;</pre></li>
</ul></li>
<li>
<p>The data from the sanitized data stream was ingested into the Sanitized Cassandra Cluster through Yelp’s Cassandra Sink Connector.</p>
</li>
<li>
<p>The data from the malformed data stream was further analyzed to discover</p>
<ul><li>whether the corruption is legit</li>
<li>what percentage of data got corrupted</li>
<li>whether there is a possibility of extracting useful information from it</li>
</ul></li>
</ul><h2 id="data-validation">Data Validation</h2><p>Like any other data migration project, validation of data was of utmost importance. A couple of steps were used for data validation, which ultimately verified the above strategy.</p><h3 id="validation-using-random-sampling">Validation using Random Sampling</h3><p>This is perhaps the most common strategy for validating a data migration, analogous to Quality Control inspections of finished products in manufacturing industries. A random subset of the migrated data was selected, and a value comparison for all the columns was done between the <em>Original Cassandra Cluster</em> and the <em>Sanitized Cassandra Cluster</em>.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-01-30-rebuilding-a-cassandra-cluster-using-yelps-data-pipeline/validation-random-sampling.png" alt="Data Validation using Random Sampling" /><p class="subtle-text"><small>Data Validation using Random Sampling</small></p></div><p>Since this is a statistical sampling technique, the confidence level greatly depends upon the sample size. Cochran’s equation helped us estimate a sample size, since the data residing inside the Cassandra tables was sufficiently large.</p>\[n = Z^2 p (1-p) / e^2\]<p>where n is the sample size; Z is the z-score for the desired confidence interval, chosen as 1.96 for a 95% confidence interval; p(1-p) determines the degree of variability, with p chosen as 0.5 for maximum variability; and e is the sampling error, set at 5%.</p><p>The total number of partitions randomly sampled was 400 per table (&gt;385 from Cochran’s equation). One of our tables holds around 162 GB of data spread across approximately 7.2 million partitions.</p><h3 id="validation-using-comparison-tee">Validation using Comparison Tee</h3><p>The Database Reliability Engineering team at Yelp uses a proxy for our Cassandra datastores in order to isolate the infrastructure complexity from the developers. 
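</p><p>As a quick sanity check on the sampling section above, Cochran’s equation with the stated values (Z = 1.96, p = 0.5, e = 0.05) can be verified in a couple of lines of Python:</p>

```python
import math

def cochran_sample_size(z, p, e):
    """Cochran's equation: n = Z^2 * p * (1 - p) / e^2."""
    return z ** 2 * p * (1 - p) / e ** 2

n = cochran_sample_size(z=1.96, p=0.5, e=0.05)
print(round(n, 2))   # 384.16
print(math.ceil(n))  # 385 -> sampling 400 partitions is comfortably above this
```

<p>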
The proxy supports a few different wrappers, with Tee being particularly relevant here.</p><p>Until this point, the traffic was still being served by the Original Cassandra Cluster. The Teeing feature allowed us to do further verification from the client-request perspective. The conceptual model of Teeing is depicted in the figure below.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-01-30-rebuilding-a-cassandra-cluster-using-yelps-data-pipeline/validation-comparison-tee.png" alt="Data Validation using Comparison Tee" /><p class="subtle-text"><small>Data Validation using Comparison Tee</small></p></div><p>Here is a brief explanation of the model.</p><ul><li>A fraction of read requests were sent to both the Original Cassandra Cluster and the Sanitized Cassandra Cluster before switching the traffic to the sanitized cluster.</li>
<li>Comparison was done on the responses observed from both the clusters, and the comparison results were logged.</li>
<li>Response from the Original Cassandra Cluster was sent back to the requesting client.</li>
<li>Offline Analysis of logged comparison results led to Data validation between the two clusters.</li>
</ul><p>An example client performing Comparison Tee for keyspace <strong>kspace</strong> would look like:</p><pre>
original_client = DataClient(cluster="original_cassandra_cluster")
sanitized_client = DataClient(cluster="sanitized_cassandra_cluster")

def compare_results(main_result, tee_result):
    if main_result != tee_result:
        return {"original": main_result, "sanitized": tee_result}
    return {}

teed_client = ComparisonTee(
    client=original_client,
    tee_client=sanitized_client,
    comparison_fn=compare_results,
)
</pre><h2 id="switching-traffic">Switching Traffic</h2><p>The total amount of corruption observed in the cluster was estimated at roughly 0.009% of the total data. Once the data was completely validated, the traffic was switched from the faulty <em>Original Cassandra Cluster</em> to the <em>Sanitized Cassandra Cluster</em>. The <em>Original Cassandra Cluster</em> was torn down after all traffic had been moved. This allowed a seamless transition with zero downtime and without any visible effect on the user experience.</p><p>The project not only allowed us to rebuild the cluster with sanitized data, but also enabled us to move our cluster to an improved infrastructure with zero downtime. There were quite a few learnings from this project.</p><ul><li>
<p>It is important to have validation plans at each stage (and if possible multiple validation criteria) when carrying out a complex data movement.</p>
</li>
<li>
<p>Cassandra logs provide great insight into the database operations being performed. This includes information about any uncaught exceptions, garbage collector, cluster topology, compaction, repairs etc. Any anomaly observed inside the logs can be pretty useful for debugging errors or performance issues. From an operational perspective, it’s better to create alerts for any new uncaught exceptions and analyze them as they happen.</p>
</li>
<li>
<p>Repairs are essential for a guaranteed data consistency on a Cassandra cluster in case one of the data nodes goes down for an extended duration (greater than <em><a href="https://cassandra.apache.org/doc/3.11/cassandra/operating/hints.html">max_hint_window_in_ms</a></em>). Absence of periodic repairs on a Cassandra cluster can lead to data integrity issues. However, running repairs on an unhealthy, broken or corrupted cluster is <a href="https://www.datastax.com/blog/interpreting-cassandra-repair-logs-and-leveraging-opscenter-repair-service">not recommended</a> and is likely going to make things worse.</p>
</li>
</ul><p>There is so much more to write here with respect to the learnings - Data Pipeline infrastructure tools, datastore connectors, Scribe Log Streams, CI/CD pipelines for Cassandra deployments, and more. If you are interested in learning more about these, what better way is there than to come and work with us?</p><ul><li>Thanks to Adel Atallah, Michael Persinger, Toby Cole and Sirisha Vanteru, who assisted at various stages of the design and implementation of the project.</li>
<li>The authors would like to thank the Database Reliability Engineering team at Yelp for various contributions in handling the issue.</li>
</ul><h2 id="data-corruption-overview">Data Corruption Overview</h2><p>The corruption was detected when engineers observed exceptions of the following form in the Cassandra <a href="https://cassandra.apache.org/doc/latest/cassandra/troubleshooting/reading_logs.html#system-log">system.log</a> file in one of the clusters.</p><pre>
Last written key DecoratedKey(X) &gt;= current key DecoratedKey(Y)
</pre><p>This Cassandra cluster was still on our old <a href="https://aws.amazon.com/ec2/">AWS EC2</a>-based infrastructure as described in our <a href="https://engineeringblog.yelp.com/2020/11/orchestrating-cassandra-on-kubernetes-with-operators.html">Operator overview post</a>. Along with the above exception, the engineers also observed the Cassandra process crashing on a few nodes in the same cluster while trying to deserialize <a href="https://cassandra.apache.org/doc/latest/cassandra/architecture/storage_engine.html#commit-log">CommitLog Mutations</a>. A mutation is synonymous with a database write, since it changes the data inside the database. Exceptions of the following form were observed in the Cassandra logs.</p><pre>
org.apache.cassandra.serializers.MarshalException: String didn't validate
</pre><p><a href="https://cassandra.apache.org/doc/latest/cassandra/operating/repair.html">Repairs</a> are required to guarantee data consistency on a Cassandra cluster in case one of the data nodes goes down. At Yelp, we run periodic repairs on Cassandra clusters to fix any data inconsistencies. However, following this issue, engineers observed that the repairs started to fail on the above cluster, and actually caused the “Last written key” exception to spread to all the nodes inside that cluster. The cluster contained two data centers, each with a replication factor of 3. Even though there wasn’t any observable impact, thanks to the necessary replication and validation safeguards, the exceptions still required further analysis from an operational perspective. An immediate action was taken to stop the repairs from running for this cluster.</p><p>The investigation around the exception revealed that at least one of the <a href="https://cassandra.apache.org/doc/latest/cassandra/architecture/storage_engine.html#sstables">SSTable</a> (Sorted String Table) rows was unordered, which caused the compaction operation to fail. SSTables are immutable files that are always sorted by the primary key. This indicated a corruption event inside the Cassandra SSTables. These SSTable corruptions were observed for different tables, including the tables in the <a href="https://docs.datastax.com/en/cql-oss/3.3/cql/cql_using/useQuerySystem.html">system</a> keyspace, across multiple nodes in that cluster, indicating distributed corruption. 
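</p><p>The invariant behind the “Last written key” exception is simply that rows within an SSTable must appear in strictly increasing key order. A toy check (illustrative only, not Cassandra’s actual code) makes the failure mode concrete:</p>

```python
def find_ordering_violation(keys):
    """Return the index of the first key that breaks the sorted-order
    invariant (mimicking the "last written key >= current key" check),
    or None if the sequence is properly ordered."""
    last = None
    for i, key in enumerate(keys):
        if last is not None and last >= key:
            return i  # corruption: keys must be strictly increasing
        last = key
    return None

print(find_ordering_violation(["a", "b", "d"]))  # None: properly sorted
print(find_ordering_violation(["a", "d", "b"]))  # 2: "d" >= "b"
```

<p>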
This means that using full table scans on user keyspaces via a batch processing framework like <a href="https://www.datastax.com/blog/kindling-introduction-spark-cassandra-part-1">Spark</a> wouldn’t completely solve the problem, as the corruptions would still persist in the system keyspaces.</p><p>Since the SSTable corruption was widespread across all the nodes inside the cluster, removing the SSTables and running the repairs wasn’t an option, as this would lead to data loss.</p><p>Restoring the cluster from the periodic backups was another open option for us. However, this involved a trade-off: losing recent data inserted after the last corruption-free backup. A quick impact analysis revealed that it was more valuable to retain the recent data than the old, corrupted data.</p><h2 id="scrubbing-sstables">Scrubbing SSTables</h2><p>The data scrubbing process is used as a data cleansing step and aims to remove invalid data from the database. With Cassandra, we had 2 options for running the scrubbing process.</p><ol><li>Online Scrubbing</li>
<li>Offline Scrubbing</li>
</ol><h3 id="online-scrubbing">Online Scrubbing</h3><p>Online scrubbing can be invoked using either the <a href="https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/tools/toolsScrub.html">nodetool scrub</a> or <a href="https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/tools/toolsUpgradeSstables.html">nodetool upgradesstables</a> command, with the latter being recommended. Since the online scrubbing process is much slower than the offline one, we opted for offline scrubbing.</p><h3 id="offline-scrubbing">Offline Scrubbing</h3><p>Offline scrubbing can be performed with the open-source tool <a href="https://cassandra.apache.org/doc/latest/cassandra/tools/sstable/sstablescrub.html">sstablescrub</a>, which ships with Cassandra. We stopped the Cassandra node gracefully after running <em>nodetool drain</em>, as this is a prerequisite for running <em>sstablescrub</em>. The data for keyspace <strong>kspace</strong> &amp; table <strong>table</strong> can be scrubbed as follows.</p><p><code class="language-plaintext highlighter-rouge">sstablescrub kspace table</code></p><p>However, the offline scrubbing process failed, and the following logs were observed in the output.</p><div class="language-plaintext highlighter-rouge highlight"><pre>WARNING: Out of order rows found in partition:
</pre></div><div class="language-plaintext highlighter-rouge highlight"><pre>WARNING: Error reading row (stacktrace follows):
WARNING: Row starting at position 491772 is unreadable; skipping to next
........
WARNING: Unable to recover 7 rows that were skipped. You can attempt manual recovery from the pre-scrub snapshot. You can also run nodetool repair to transfer the data from a healthy replica, if any
</pre></div><div class="language-plaintext highlighter-rouge highlight"><pre>WARNING: Row starting at position 22560156 is unreadable; skipping to next
null
Exception in thread "main" java.lang.AssertionError
        at org.apache.cassandra.io.compress.CompressionMetadata$Chunk.&lt;init&gt;(CompressionMetadata.java:474)
        at org.apache.cassandra.io.compress.CompressionMetadata.chunkFor(CompressionMetadata.java:239)
        at org.apache.cassandra.io.util.MmappedRegions.updateState(MmappedRegions.java:163)
        at org.apache.cassandra.io.util.MmappedRegions.&lt;init&gt;(MmappedRegions.java:73)
        at org.apache.cassandra.io.util.MmappedRegions.&lt;init&gt;(MmappedRegions.java:61)
        at org.apache.cassandra.io.util.MmappedRegions.map(MmappedRegions.java:104)
        at org.apache.cassandra.io.util.FileHandle$Builder.complete(FileHandle.java:362)
        at org.apache.cassandra.io.util.FileHandle$Builder.complete(FileHandle.java:331)
        at org.apache.cassandra.io.sstable.format.big.BigTableWriter.openFinal(BigTableWriter.java:336)
        at org.apache.cassandra.io.sstable.format.big.BigTableWriter.openFinalEarly(BigTableWriter.java:318)
        at org.apache.cassandra.io.sstable.SSTableRewriter.switchWriter(SSTableRewriter.java:322)
        at org.apache.cassandra.io.sstable.SSTableRewriter.doPrepare(SSTableRewriter.java:370)
        at org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.prepareToCommit(Transactional.java:173)
        at org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.finish(Transactional.java:184)
        at org.apache.cassandra.io.sstable.SSTableRewriter.finish(SSTableRewriter.java:357)
        at org.apache.cassandra.db.compaction.Scrubber.scrub(Scrubber.java:291)
        at org.apache.cassandra.tools.StandaloneScrubber.main(StandaloneScrubber.java:134)
</pre></div><p>As a result, the corrupted rows inside the SSTables could not be completely removed.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/bd07a618-9b6f-4920-91c6-99280f1b268d?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2023/01/rebuilding-a-cassandra-cluster-using-yelps-data-pipeline.html</link>
      <guid>https://engineeringblog.yelp.com/2023/01/rebuilding-a-cassandra-cluster-using-yelps-data-pipeline.html</guid>
      <pubDate>Mon, 30 Jan 2023 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Recycling Kubernetes Nodes]]></title>
<description><![CDATA[<p><em>Manually managing the lifecycle of Kubernetes nodes can become difficult as the cluster scales, especially if your clusters are multi-tenant and self-managed. You may need to replace nodes for various reasons, such as OS upgrades and security patches. One of the biggest challenges is how to terminate nodes without disturbing tenants. In this post, I’ll describe the problems we encountered administering Yelp’s clusters and the solutions we implemented.</em></p><p>At Yelp we use <a href="https://github.com/Yelp/paasta">PaaSTA</a> for building, deploying and running services. Initially, PaaSTA just supported stateless services. This meant it was relatively easy to replace nodes, since we only needed to gracefully remove the pods from our service mesh on shutdown. However, this could still result in services with fewer replicas than expected. We now run many diverse workloads in our clusters including stateful services, batch jobs and pipeline tasks. Some workloads run on private pools (groups of nodes) but many workloads run in shared pools. At Yelp, we use <a href="https://github.com/Yelp/clusterman">Clusterman</a> to manage our Kubernetes pools. Clusterman is an open source autoscaling engine that we initially wrote to scale our Mesos clusters and subsequently adapted to support Kubernetes.</p><p>There are many challenges in multi-tenant clusters since tenants and cluster administrators often work on different teams (and maybe in different time zones). Cluster administrators often need to perform maintenance on their clusters, including the replacement of nodes for security fixes, OS upgrades, or other tasks. Given the diversity of workloads running on the clusters, it’s very difficult for administrators to do so without working closely with the workload owners to ensure that pods are terminated and replaced safely. This can also be difficult in Yelp’s distributed, asynchronous work environment. 
Maintenance can take a long time given the diverse set of workloads and the large size of the clusters. Additionally, manual work is error-prone: a human might mistakenly delete the wrong node or pod! We decided to tackle the problem in two parts:</p><ol><li>Protecting workloads from disruptions.</li>
<li>Node replacement automation.</li>
</ol><h2 id="1-protecting-workloads-from-disruptions">1. Protecting workloads from disruptions</h2><p>A good place to start is the Kubernetes documentation on <a href="https://kubernetes.io/docs/concepts/workloads/pods/disruptions/">disruptions</a>. There are two types of disruptions:</p><ul><li>Voluntary (by cluster admin): draining a node for an upgrade or a scaling-down</li>
<li>Involuntary: hardware failures, kernel panic, network partition, etc.</li>
</ul><p>We will focus on voluntary disruptions in our case. <a href="https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-budgets">Pod Disruption Budget</a> (PDB) is the industry standard to protect Kubernetes workloads from <a href="https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#voluntary-and-involuntary-disruptions">voluntary disruptions</a>. “As an application owner, you can create a PDB for each application. A PDB limits the number of pods of a replicated application that are down simultaneously from voluntary disruptions. For example, a quorum-based application would like to ensure that the number of replicas running is never brought below the number needed for a quorum. A web front-end might want to ensure that the number of replicas serving load never falls below a certain percentage of the total.” (Kubernetes, <a href="https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-budgets">Pod Disruption Budget</a>). At Yelp we have some sensitive workloads like <a href="https://engineeringblog.yelp.com/2021/09/nrtsearch-yelps-fast-scalable-and-cost-effective-search-engine.html">Nrtsearch</a> and <a href="https://engineeringblog.yelp.com/2020/11/orchestrating-cassandra-on-kubernetes-with-operators.html">Cassandra</a> where we don’t want to disrupt more than one pod at a time in each cluster.</p><p>If you have bare pods (without a controller) in your cluster, you should be aware of some <a href="https://kubernetes.io/docs/tasks/run-application/configure-pdb/#arbitrary-controllers-and-selectors">limitations</a> to using PDBs. Specifically, you cannot use the maxUnavailable and percentage fields.</p><p>Besides PDBs, we also evaluated some alternative ways to prevent voluntary disruptions. 
For example, we considered using a <a href="https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#validatingadmissionwebhook">Validating Admission Webhook</a> and <a href="https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#container-hooks">PreStop Hooks</a> to protect workloads, but we decided to continue with PDBs since they were designed for exactly this use case.</p><h2 id="2-node-replacement-automation">2. Node replacement automation</h2><p>Once we defined PDBs for all the applications running on our clusters, we moved on to thinking about the automation needed to replace nodes. We chose to add features to Clusterman to manage node replacement. Before getting into the solution, it is helpful to know a little about Clusterman’s internal components.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-01-05-recycling-kubernetes-nodes/clusterman-components.png" alt="Clusterman components" /><p class="subtle-text"><small>Clusterman components</small></p></div><ul><li>Metrics Data Store: All relevant data used by scaling signals is written to a single data store for a single source of truth about historical cluster state. At Yelp, we use AWS DynamoDB for this datastore. Metrics are written to the datastore via a separate metrics library.</li>
<li>Pluggable Signals: Metrics (from the data store) are consumed by signals (small bits of code that are used to produce resource requests). Signals run in separate processes configured by <a href="http://supervisord.org/">supervisord</a>, and use Unix sockets to communicate.</li>
<li>Core Autoscaler: The autoscaler logic consumes resource requests from the signals and combines them to determine how much to scale up or down via the cloud provider.</li>
</ul><p>We added two more components to solve the node replacement problem: Drainer and Node Migration Batch</p><p><strong>Drainer</strong></p><p>The <a href="https://clusterman.readthedocs.io/en/latest/drainer.html">Drainer</a> is the component which drains pods from the node before terminating. It may drain and terminate nodes for three reasons:</p><ul><li>Spot instance interruptions</li>
<li>Node migrations</li>
<li>The autoscaler scaling down</li>
</ul><p>The Drainer uses <a href="https://kubernetes.io/docs/concepts/scheduling-eviction/api-eviction/">API-initiated eviction</a> for node migrations and scaling down. API-initiated eviction is the process by which you use the <a href="https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.25/#create-eviction-pod-v1-core">Eviction API</a> to create an Eviction object that triggers graceful pod termination. Crucially, API-initiated evictions respect your configured <a href="https://kubernetes.io/docs/tasks/run-application/configure-pdb/">PodDisruptionBudgets</a> and <a href="https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#pod-termination">terminationGracePeriodSeconds</a>.</p><p>The Drainer <a href="https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration">taints</a> nodes as a first step to prevent the Kubernetes Scheduler from scheduling new pods onto the draining node. Then it tries to evict the pods periodically until the node is empty. After evicting all pods, it deletes the node and terminates the instance. In cases where we have defined PDBs that are very strict or there is not much spare capacity in the pool, this can take a long time. We’ve added a user-configurable threshold to prevent nodes from draining for too long (or indefinitely). Once that threshold is reached, the Drainer will forcibly delete or un-taint the node, depending on the uptime requirements of the workloads running in that pool.</p><p><strong>Node Migration</strong></p><p>The <a href="https://clusterman.readthedocs.io/en/latest/node_migration.html">Node Migration</a> batch allows Clusterman to replace nodes in a pool according to various criteria. This automates the process of replacing nodes running software with security vulnerabilities, upgrading the kernel we run, or upgrading the whole operating system to newer versions. 
It chooses which nodes to replace and sends them to the Drainer to terminate gracefully, continuously monitoring the pool capacity to ensure we don’t impact the availability of workloads running on the cluster.</p><p>We’ve created a <a href="https://clusterman.readthedocs.io/en/latest/node_migration.html#migration-event-trigger">NodeMigration</a> <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/#customresourcedefinitions">Custom Resource</a> to specify migration requirements. We can request a migration based on kernel version, OS version, instance type and uptime. For instance, the target of the following manifest is to keep node uptime below 30 days:</p><div class="language-plaintext highlighter-rouge highlight"><pre>apiVersion: "clusterman.yelp.com/v1"
kind: NodeMigration
metadata:
 name: my-test-migration-220912
 labels:
   clusterman.yelp.com/migration_status: pending
spec:
 cluster: mycluster
 pool: default
 condition:
   trait: uptime
   operator: lt
   target: 30d
</pre></div><h2 id="conclusion">Conclusion</h2><p>Finally, the high-level design of our new system can be described as follows.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2023-01-05-recycling-kubernetes-nodes/high-level-design.png" alt="High level design of the system" /><p class="subtle-text"><small>High level design of the system</small></p></div><p>Now that we have this system running, we can more easily deploy new versions of Ubuntu and keep nodes fresh. We can create migration manifests using a <a href="https://clusterman.readthedocs.io/en/latest/node_migration.html#migration-event-trigger">CLI</a> tool, and Clusterman will gradually replace all the instances whilst ensuring that the workloads are not disrupted and that new nodes are running correctly.</p><h2 id="acknowledgements">Acknowledgements</h2><p>This was a cross-team project between Yelp’s Infrastructure and Security teams. Many thanks to Matteo Piano for leading the project, and to the many teams at Yelp that contributed to making the new system a success. We want to thank Compute Infra, Security Effectiveness and all the other teams that contributed by creating PDBs. Additionally, thanks to Matthew Mead-Briggs and Andrea Dante Bozzola for their managerial support.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/bd07a618-9b6f-4920-91c6-99280f1b268d?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2023/01/recycling-kubernetes-nodes.html</link>
      <guid>https://engineeringblog.yelp.com/2023/01/recycling-kubernetes-nodes.html</guid>
      <pubDate>Thu, 05 Jan 2023 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Lessons from A/B Testing on Bandit Subjects]]></title>
<description><![CDATA[<p><strong>Abstract</strong>   <em>Compared to full-scale ML, multi-armed bandits are a lighter-weight solution that can help teams quickly optimize their product features without major commitments. However, bandits need a candidate selection step when they have too many items to choose from. Using A/B testing to optimize the candidate selection step causes new bandit bias and convergence selection bias. New bandit bias occurs when we try to compare new bandits with established ones in an experiment; convergence selection bias creeps in when we try to solve the new bandit bias by defining and selecting established bandits. We discuss our strategies to mitigate the impacts of these two biases.</em></p><p>We have many multi-armed bandits running at Yelp. They help us select the best content to show on our webpage, choose the optimal ad rendering format on our app, and pick the right channel and timing to reach our users and business owners.</p><p>We typically use the <a href="https://en.wikipedia.org/wiki/Thompson_sampling">Thompson Sampling</a> method. Thompson Sampling is a Bayesian method that combines the domain knowledge we have via prior distributions and the real-world observations we collected for each arm. It is easy to understand for broader audiences and simple to implement. It also introduces noise throughout the day even though our bandits are typically updated nightly. Research has shown that it performs better in the real world than its alternatives (Chapelle and Li 2011).</p><p>Compared to machine learning (ML) models or ML-based contextual bandits, simple multi-armed bandits<sup><a href="https://engineeringblog.yelp.com/2022/12/lessons-from-ab-testing-on-bandit-subjects.html#footnote1">1</a></sup> (bandits henceforth) have several important infrastructural and logistical advantages:</p><ol><li>Code light: our bandit implementation is a Python function with only a couple of lines. 
At serving time, user teams only need to pass in the prior distribution and the real observations of each arm as a dictionary.</li>
<li>Setup light: compared to <a href="https://engineeringblog.yelp.com/2020/07/ML-platform-overview.html">serving a model</a>, bandits do not require a separate service call to make predictions. Typically user teams only need to set up a <a href="https://engineeringblog.yelp.com/2020/11/orchestrating-cassandra-on-kubernetes-with-operators.html">Cassandra table</a> that stores past observations. Past observations can be computed via a nightly batch and piped into the aforementioned Cassandra table.</li>
<li>Resource light: unlike models, bandits do not require features to learn. This means the product owner does not need to staff a sizable team building a feature engineering pipeline and researching the model architecture.</li>
<li>Maintenance light: bandits do not need heavy monitoring and alerting because they have no complex dependencies. By design, bandits balance exploration and exploitation gracefully. With an appropriate data retention window, bandits can also handle data drift without human intervention. From our experience, the on-call person only needs to ensure the bandits are updated correctly, which is typically a light task.</li>
</ol><p>Because of these advantages, bandits are a sweet spot for many teams to try out before they fully commit to ML. For some applications, the bandit performance may be good enough that teams choose to stay in the bandit world.</p><h2 id="a-seemingly-minor-drawback">A seemingly minor drawback</h2><p>As with all the good things in life, bandits do not come without drawbacks. One drawback we face is the difficulty of handling too many items (the curse of dimensionality). With too many arms, exploration requires too much data and takes too long to be practical.</p><p>A common practice to mitigate this issue is performing a candidate selection step and sending only top results to the bandit. The candidate selection step can be anything from a simple heuristic or rule-based formula to a simple model, or a hybrid of these. We only require it to be mostly stable day to day so that the bandit’s historical learnings are still useful today. Because of such freedom, a lot of work can be done to optimize the candidate selection step.</p><p>This seemingly innocuous candidate selection step causes many challenges when it comes to A/B testing different candidate selection models. To show this point, let’s use advertising photo selection as a concrete example.</p><p>When advertisers choose “<a href="https://blog.yelp.com/businesses/getting-started-with-yelp-ads/">Let Yelp optimize</a>” for their advertising photos, we test different photos and learn which one gets the most clicks. Under the hood, this is achieved by a bandit system. In particular, each pull is an impression while each success is a click. We use the standard Beta-Bernoulli Bandit with K arms (K is a small fixed number). 
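<p>To make this concrete, here is a minimal sketch of one Thompson Sampling decision for a Beta-Bernoulli bandit. The arm counts and the uniform Beta(1, 1) priors below are hypothetical, and this is an illustration of the technique rather than Yelp’s actual implementation.</p>

```python
import random

def choose_arm(arms):
    """Pick an arm via Thompson Sampling for a Beta-Bernoulli bandit.

    `arms` maps arm id -> (prior_alpha, prior_beta, clicks, impressions).
    Draw one sample from each arm's Beta posterior and play the arm
    with the largest sample.
    """
    best_arm, best_sample = None, -1.0
    for arm_id, (alpha, beta, clicks, impressions) in arms.items():
        failures = impressions - clicks
        sample = random.betavariate(alpha + clicks, beta + failures)
        if sample > best_sample:
            best_arm, best_sample = arm_id, sample
    return best_arm

# Hypothetical counts: photo 2 has a much higher observed CTR, so it is
# sampled most often, while the other photos still get occasional traffic.
arms = {1: (1, 1, 1, 400), 2: (1, 1, 50, 400), 3: (1, 1, 4, 400)}
chosen_photo = choose_arm(arms)
```

<p>Serving stays cheap because the only per-arm state needed is its click and impression counts, which is why a nightly batch feeding a Cassandra table is sufficient.</p>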
Because many advertisers simply have too many high-quality images for the bandits to learn within a reasonable time window, we have a candidate selection step before the bandit.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-12-21-lessons-from-ab-testing-on-bandit-subjects/fig1-pipeline.png" alt="A high level summary of Yelp’s advertising photo selection pipeline" /><p class="subtle-text"><small>A high level summary of Yelp’s advertising photo selection pipeline</small></p></div><p>For illustration purposes, let’s assume the status quo candidate selection method is a rule-based formula while the challenger is a lightweight model trained on some pre-computed image embeddings. Because these two approaches are quite different, they typically produce distinct top K images.</p><p>To verify the new model selects better performing candidates, we set up an A/B experiment diverted by <code class="language-plaintext highlighter-rouge">advertiser_id</code>. If we stop here and naively run this experiment as is, we may reach a false conclusion caused by the new bandit bias.</p><h2 id="new-bandit-bias">New bandit bias</h2><p>Let’s examine the following mocked-up example. In this example, the top 3 photos produced by the status quo rule are (1, 2, 3) while the top 3 photos produced by the new model are (4, 5, 6). Let’s assume that the true click-through rates (CTR) of 1, 2, 3 are 0.2%, 1.5%, and 1.0% while the true CTRs of 4, 5, 6 are 0.3%, 2.0%, 1.0% respectively. So the new model is superior by construction.</p><p>However, because the bandit has no data about (4, 5, 6), it has to start from scratch. In particular, at the beginning of the experiment, the bandit will evenly allocate impressions to all three. On the contrary, the bandit in the status quo cohort has figured out photo 2 is the best among (1, 2, 3) and most traffic is allocated to photo 2 already. The following table shows a possible scenario on day 1 of the experiment. 
Notice on day 1 the CTR of the status quo group is 1.3% but the treatment group is only 0.9%. The bandit will eventually figure out photo 5 is a better performing one and allocate more traffic to it. But until then, the treatment will continue to underperform.</p><table><thead><tr><th>photo_id</th>
<th>True CTR</th>
<th>Cohort</th>
<th>Day 1 impressions</th>
<th>Day 1 clicks</th>
<th>Observed CTR</th>
</tr></thead><tbody><tr><td>1</td>
<td>0.2%</td>
<td>Status Quo</td>
<td>40</td>
<td>0</td>
<td>1.3%</td>
</tr><tr><td>2</td>
<td>1.5%</td>
<td> </td>
<td>356</td>
<td>5</td>
<td> </td>
</tr><tr><td>3</td>
<td>1.0%</td>
<td> </td>
<td>61</td>
<td>1</td>
<td> </td>
</tr><tr><td>4</td>
<td>0.3%</td>
<td>Treatment</td>
<td>161</td>
<td>0</td>
<td>0.9%</td>
</tr><tr><td>5</td>
<td>2.0%</td>
<td> </td>
<td>149</td>
<td>3</td>
<td> </td>
</tr><tr><td>6</td>
<td>1.0%</td>
<td> </td>
<td>147</td>
<td>1</td>
<td> </td>
</tr></tbody></table><p>What if we wipe out the bandit history in the status quo group as well before the experiment? This indeed is the cleanest way to compare the two groups, but we will nuke the performance of the whole system, which typically is not acceptable from a business perspective.</p><p>What if we remove bandits from the equation during experimentation since we’re comparing the candidate selection methods? This idea does not work. The treatment is the middle step of the system but the success metric is defined only after the bandit does its magic. Because we don’t know which photo will give us the highest CTR a priori, we cannot remove the step that is designed to find the highest CTR. In other words, our experimentation subject has to be the whole system, bandit included.</p><p>Some bandits will have more data and hence learn faster than others. In practice, we typically observe a big performance plunge from the treatment group at the beginning but it will be gradually improving throughout the experiment. The real difficulty is to tell when the new bandit bias is small enough such that we can attribute the difference between treatment and control groups to our new model.</p><p>In summary, the first lesson we learned is that bandits need to be converged to be comparable. So we came up with a definition of convergence such that when a bandit is declared converged, it won’t cause major new bandit bias.</p><h2 id="the-80-80-rule-of-convergence">The 80-80 rule of convergence</h2><p>Intuitively, if a bandit is considered converged, it must be done with exploration and be mainly working on exploitation. We believe this intuition can be further broken down into two subdimensions:</p><ol><li>If there’s a clear best performing arm, then the bandit has found it.</li>
<li>If the bandit can’t distinguish multiple arms, then the bandit must have enough evidence to show they have similar enough performance.</li>
</ol><p>Notice for the bandit to move on to exploitation, it does not need to exactly pinpoint the performance of each arm. For worse performers, knowing “they are worse” is enough.</p><p>Inspired by the Upper Confidence Bound algorithm, we use confidence intervals<sup><a href="https://engineeringblog.yelp.com/2022/12/lessons-from-ab-testing-on-bandit-subjects.html#footnote2">2</a></sup> (CI) of posterior distributions to define convergence. Our definition of convergence for advertising photo selection is as follows. Note that this definition is not necessarily appropriate for your case. But you can use it as an inspiration.</p><ol><li>Compute the 80% CI of the posterior distribution for each arm.</li>
<li>Apply the merge interval algorithm (see, e.g., <a href="https://leetcode.com/problems/merge-intervals/">LeetCode 56</a>) on 80% CIs. That is, put all arms into one group if their CIs have some overlap. If there is no overlap, then the arm is its own group.</li>
<li>Rank the groups by their posterior means. This ranking is well defined because all groups are separated after the previous step.</li>
<li>[80% CI no overlap] Examine the group with the highest CTR (top group henceforth). If the top group has only one arm for the past 7 days, then we call the bandit converged.</li>
<li>[80% CI width drop] Otherwise, if all CIs in the top group are less than 20% width of the prior distribution’s CI for the past 7 days, then we call the bandit converged.</li>
<li>Once the bandit is considered converged, its data may be used for analysis purposes starting from the next day.</li>
</ol><p>The 80% CI no overlap rule captures the case when there is a clear winner. Based on our experience, once any arm’s 80% CI is separated from others, the underperforming ones stop receiving much traffic even if their performance estimates still contain much uncertainty (a.k.a., knowing “they are worse” is enough).</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-12-21-lessons-from-ab-testing-on-bandit-subjects/fig2-no-overlap.png" alt="The posterior and traffic plot of a newly created bandit that converged under the 80% CI no overlap rule. The solid lines are posterior CTRs of photos and the shaded areas are their corresponding 80% CIs. They are plotted on the linear scale. The dotted lines are impressions the bandit allocated to each arm, in log scale. In the initial phase, the bandit is mostly working on exploration so each arm gets a decent amount of traffic. On day 7, the orange arm’s CI is separated from other arms’ and the other arms only receive about 1-5% of the traffic." /><p class="subtle-text"><small>The posterior and traffic plot of a newly created bandit that converged under the 80% CI no overlap rule. The solid lines are posterior CTRs of photos and the shaded areas are their corresponding 80% CIs. They are plotted on the linear scale. The dotted lines are impressions the bandit allocated to each arm, in log scale. In the initial phase, the bandit is mostly working on exploration so each arm gets a decent amount of traffic. On day 7, the orange arm’s CI is separated from other arms’ and the other arms only receive about 1-5% of the traffic.</small></p></div><p>The 80% CI width drop rule captures the case where the differences between multiple arms are not practically significant. 
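<p>The two rules above can be sketched in a few lines of Python. This is a simplified, single-day illustration rather than the production check: it approximates each arm’s Beta-posterior 80% CI with a normal approximation, and it omits the requirement that either rule hold for 7 consecutive days.</p>

```python
from statistics import NormalDist

def beta_ci(alpha, beta, level=0.80):
    """Approximate central CI of a Beta(alpha, beta) posterior
    via a normal approximation (illustrative only)."""
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    half = NormalDist().inv_cdf(0.5 + level / 2) * var ** 0.5
    return (mean - half, mean + half)

def is_converged(posteriors, prior=(1, 1)):
    """`posteriors` maps arm id -> (alpha, beta) of its Beta posterior."""
    cis = sorted(((arm, beta_ci(a, b)) for arm, (a, b) in posteriors.items()),
                 key=lambda kv: kv[1][0])
    # Merge-interval step: arms whose CIs overlap fall into one group.
    groups = [[cis[0]]]
    for arm, ci in cis[1:]:
        if ci[0] <= max(hi for _, (_, hi) in groups[-1]):
            groups[-1].append((arm, ci))
        else:
            groups.append([(arm, ci)])
    top = groups[-1]  # groups are separated, so the last one has the highest CTRs
    if len(top) == 1:
        return True  # 80% CI no overlap rule
    # 80% CI width drop rule: every CI in the top group must be narrow
    # relative to the prior distribution's CI.
    prior_lo, prior_hi = beta_ci(*prior)
    return all(hi - lo < 0.2 * (prior_hi - prior_lo) for _, (lo, hi) in top)
```

<p>Under this sketch, a bandit with one clearly separated arm is declared converged by the first rule, while a fresh bandit whose arms all have wide, overlapping CIs is not.</p>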
In the width-drop case, bandits will continue to allocate traffic to all arms in the top group, so the CI widths in the top group typically drop quickly.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-12-21-lessons-from-ab-testing-on-bandit-subjects/fig3-width-drop.png" alt="The posterior and traffic plot of a newly created bandit that converged under the 80% CI width drop rule. At day 5, the green &amp; orange arms’ CIs are separated from the blue arm’s. While the bandit stops allocating traffic to the blue arm, the bandit cannot significantly differentiate the green and orange arms so both continue to receive significant traffic." /><p class="subtle-text"><small>The posterior and traffic plot of a newly created bandit that converged under the 80% CI width drop rule. At day 5, the green &amp; orange arms’ CIs are separated from the blue arm’s. While the bandit stops allocating traffic to the blue arm, the bandit cannot significantly differentiate the green and orange arms so both continue to receive significant traffic.</small></p></div><p>Under our definition, the new bandit bias is usually of a smaller magnitude than our usual effect size. Moreover, a usual t-test cannot distinguish between newly converged bandits and bandits that have been fully converged for as long as the data retention period permits.</p><h2 id="convergence-selection-bias">Convergence selection bias</h2><p>Unfortunately, even though it may help reduce the new bandit bias, just applying the definition of convergence introduces another bias: convergence selection bias. To explain this bias, let’s consider the following example.</p><table><thead><tr><th>advertiser_id</th>
<th>Cohort</th>
<th>Observed CTR</th>
<th>Is Converged</th>
<th>Average CTR of Converged Bandits</th>
</tr></thead><tbody><tr><td>1</td>
<td>Status Quo</td>
<td>0.7%</td>
<td>Yes</td>
<td>1.0%</td>
</tr><tr><td>2</td>
<td> </td>
<td>1.4%</td>
<td>Yes</td>
<td> </td>
</tr><tr><td>3</td>
<td> </td>
<td>0.9%</td>
<td>Yes</td>
<td> </td>
</tr><tr><td>4</td>
<td>Treatment</td>
<td>0.8%</td>
<td>Yes</td>
<td>0.9%</td>
</tr><tr><td>5</td>
<td> </td>
<td>1.5%</td>
<td>No</td>
<td> </td>
</tr><tr><td>6</td>
<td> </td>
<td>1.0%</td>
<td>Yes</td>
<td> </td>
</tr></tbody></table><p>This example is constructed so that the treatment has superior performance. Notice all bandits in the status quo cohort are converged because they have been collecting data for a longer period, while only some bandits are converged in the treatment cohort as they are shorter-lived. If we compare the average CTR of converged bandits in the two groups, then we would falsely conclude that the treatment is doing worse.</p><p>You might dismiss this example since we conveniently mark the best-performing bandit as unconverged and remove it from comparison. This is not the case; it is a real concern. If we apply the bandit convergence algorithm, then the converged bandits will typically NOT be representative of the whole population. Converged bandits are associated with more traffic in general, and more traffic is associated with more advertising budget, certain advertiser types, more densely populated geolocations and probably some other unknown factors. Because of these factors, the treatment and control balance no longer holds and we re-introduce confounding into a randomized experiment.</p><p>Formally, we are running into the <a href="https://cpb-us-e1.wpmucdn.com/sites.dartmouth.edu/dist/5/2293/files/2021/03/post-treatment-bias.pdf">selection on post-treatment variables</a> issue. That is, in the analysis, we pick samples based on variables that may be affected by the treatment in some unknown way. Such variables may be correlated with the outcome variable in some unknown way. Therefore, the selected sample for analysis is, in some sense, cherry-picked, which is absolutely not okay in experiment analysis. Moreover, because we have to define convergence based on post-treatment variables, in general we cannot get around the post-treatment selection with <em>any</em> definition of convergence.</p><p>We may frame the convergence selection bias as a form of missing data bias: the data from the unconverged bandits are missing. 
Therefore, we can draw some insights from the missing data literature.</p><p>After a literature review, we concluded that a carefully implemented matched pair design with pairwise deletion can help minimize the bias in this situation. In particular, <a href="https://gking.harvard.edu/files/gking/files/spd.pdf">King et al’s (2007)</a> matched pair design insulates their policy experiment from certain selection biases caused by missing data. <a href="https://www.sas.rochester.edu/psc/polmeth/papers/Fukumoto_2015_Polmeth.pdf">Fukumoto (2015)</a> examined the missing data bias in detail for the matched pair design and found that pairwise deletion has a smaller bias than all the other methods considered. <a href="https://imai.fas.harvard.edu/research/files/mismatch.pdf">Imai and Jiang (2018)</a> developed a sensitivity analysis that provides a bias bound for the matched pair design.</p><p>We combined the recommendations from these papers. Our matched pair design with pairwise deletion and sensitivity analysis goes as follows:</p><ol><li>Compute the feature set with respect to the population of interest. In the advertising photo selection case, we can actually get this step for free because Yelp already has <a href="https://engineeringblog.yelp.com/2020/01/modernizing-ads-targeting-machine-learning-pipeline.html">other ML feature pipelines</a> running for advertisers.</li>
<li>Match subjects as closely as possible. Two advertisers in the same pair should have the same values for categorical variables. For numerical variables, we may match using <a href="https://en.wikipedia.org/wiki/Mahalanobis_distance">Mahalanobis distance</a>. After this step, all observed confounders are accounted for.</li>
<li>Within each pair, randomly apply treatment to one subject and control to the other. This step randomizes over potential unobserved confounders.</li>
<li>Apply pairwise deletion. That is, if both bandits in a pair are judged converged, add them to the analysis pool; otherwise, drop both from the analysis.</li>
<li>In the event of advertiser churn, perform pairwise deletion as well.</li>
<li>Perform sensitivity analysis as in <a href="https://imai.fas.harvard.edu/research/files/mismatch.pdf">Imai and Jiang (2018)</a> Section 2.4, Theorem 3.</li>
</ol><p>Unfortunately, this design is no panacea. First, as stated in Fukumoto (2015), the matched pair design can help reduce the bias, but it cannot guarantee the result is bias free.</p><p>Second, the result we get after pairwise deletion is not an estimate of the average treatment effect. The pairs that have little chance to converge within the experiment window are underrepresented in the final analysis pool, and hence their treatment effects count less. Formally, the pairwise deletion estimand can be interpreted as a weighted average treatment effect, where the weights are the relative ex ante probabilities of convergence. In practice, this means it is difficult to communicate how much effect we can expect if we ship the experiment to the whole population.</p><p>Third, this design is complicated and time consuming to perform, so we do not perform it unless we have to. Because of the new bandit bias, a negative is not necessarily a true negative, but a positive is a true positive. So if a vanilla A/B experiment readout (dropping the early period) already provides a positive finding, we conclude and ship the experiment.</p><h2 id="conclusion">Conclusion</h2><p>Compared to full-scale ML, multi-armed bandits are a lighter-weight solution that can help teams quickly optimize their product features without major commitments. However, because of their inability to handle high cardinality, we have to couple bandits with a candidate selection step. This practice creates two biases whenever we want to improve the candidate selection step: new bandit bias and convergence selection bias.</p><p>Our current recommendation for experimental design has two components: an 80-80 definition of bandit convergence and a matched pair design with pairwise deletion. The former reduces the new bandit bias and the latter minimizes the selection bias. 
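</p><p>To make the pairwise deletion step concrete, here is a minimal sketch (not Yelp’s production code; the field names are hypothetical) of reading out a matched pair experiment:</p>

```python
# Minimal sketch of pairwise deletion in a matched pair design.
# A pair enters the analysis pool only if BOTH of its bandits converged.
from dataclasses import dataclass

@dataclass
class Pair:
    treat_converged: bool
    control_converged: bool
    treat_ctr: float    # outcome for the treatment bandit
    control_ctr: float  # outcome for the control bandit

def pairwise_deletion_effect(pairs):
    """Average within-pair difference over fully converged pairs."""
    kept = [p for p in pairs if p.treat_converged and p.control_converged]
    if not kept:
        return None  # nothing to analyze yet
    return sum(p.treat_ctr - p.control_ctr for p in kept) / len(kept)

pairs = [
    Pair(True, True, 0.012, 0.010),
    Pair(True, False, 0.050, 0.011),  # dropped: control bandit unconverged
    Pair(True, True, 0.014, 0.012),
]
effect = pairwise_deletion_effect(pairs)  # average of the two kept pairs
```

<p>As discussed above, this estimand is a convergence-weighted average treatment effect over the retained pairs, not the population average treatment effect.</p><p>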
Working together, they deliver a successful A/B experiment on bandit subjects.</p><h2 id="references">References</h2><ol><li>Chapelle, Olivier, and Lihong Li. “An empirical evaluation of Thompson sampling.” Advances in neural information processing systems 24 (2011).</li>
<li>Fukumoto, Kentaro. “Missing data under the matched-pair design: a practical guide.” Technical Report, Presented at the 32nd Annual Summer Meeting of Society for Political Methodology, Rochester, 2015.</li>
<li>Imai, Kosuke, and Zhichao Jiang. “A sensitivity analysis for missing outcomes due to truncation by death under the matched‐pairs design.” Statistics in medicine 37, no. 20 (2018): 2907-2922.</li>
<li>King, Gary, Emmanuela Gakidou, Nirmala Ravishankar, Ryan T. Moore, Jason Lakin, Manett Vargas, Martha María Téllez‐Rojo, Juan Eugenio Hernández Ávila, Mauricio Hernández Ávila, and Héctor Hernández Llamas. “A “politically robust” experimental design for public policy evaluation, with application to the Mexican universal health insurance program.” Journal of Policy Analysis and Management 26, no. 3 (2007): 479-506.</li>
</ol><h2 id="acknowledgements">Acknowledgements</h2><p>The content of this blog is a multi-year effort and we have lost track of all our talented colleagues who have contributed to this problem space. An incomplete list of contributors (with a lot of recency bias) is: Wesley Baugh, Sam Edds, Vincent Kubala, Kevin Liu, Christine Luu, Alexandra Miltsin, Alec Mori, Sonny Peng, Yang Song, Vishnu Sreenivasan Purushothaman, Jenny Yu. I also thank Marcio Cantarino O’Dwyer for reviewing and helpful suggestions.</p><h3 id="notes">Notes</h3><p><a name="footnote1" id="footnote1">1</a>: Multiple simple bandits and a lookup table based finite state contextual bandit are equivalent. For example, if we set up a simple bandit for each advertiser, it is equivalent to setting up a contextual bandit with the context vector being onehot(advertiser_id). On the contrary, having a lookup table based contextual bandit is equivalent to setting up one multi-armed bandit per state. Therefore, we do not distinguish them in this blog post.</p><p><a name="footnote2" id="footnote2">2</a>: Technically, we should use the term <a href="https://en.wikipedia.org/wiki/Credible_interval">credible interval</a>. But we maintain the terminology confidence interval because, in this context, the difference between the two only introduces unnecessary complications to people who are less familiar with Bayesian statistics.</p><div class="island job-posting"><h3>Become an Applied Scientist at Yelp!</h3><p>Are you intrigued by data? Uncover insights and carry out ideas through statistical and predictive models.</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/cc5ce7e2-26e9-4290-8847-c082632df9e8/Applied-Scientist-Inference-and-Metrics-Remote?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2022/12/lessons-from-ab-testing-on-bandit-subjects.html</link>
      <guid>https://engineeringblog.yelp.com/2022/12/lessons-from-ab-testing-on-bandit-subjects.html</guid>
      <pubDate>Wed, 21 Dec 2022 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Spark Data Lineage]]></title>
      <description><![CDATA[<p>In this blog post, we introduce Spark-Lineage, an in-house product to track and visualize how data at Yelp is processed, stored, and transferred among our services.</p><p><strong>Spark and Spark-ETL:</strong> At Yelp, <a href="https://spark.apache.org/">Spark</a> is considered a <a href="https://engineeringblog.yelp.com/2020/03/spark-on-paasta.html">first-class citizen</a>, handling batch jobs in all corners, from crunching reviews to identify similar restaurants in the same area to running reporting analytics that optimize local business search. Spark-ETL is our in-house wrapper around Spark, providing high-level APIs to run Spark batch jobs and abstracting away the complexity of Spark. Spark-ETL is used extensively at Yelp, saving our engineers the time they would otherwise spend writing, debugging, and maintaining Spark jobs.</p><p><strong>Problem:</strong> Our data is processed and transferred among hundreds of microservices and stored in different formats in multiple data stores including Redshift, S3, Kafka, and Cassandra. We currently have thousands of batch jobs running daily, and it is increasingly difficult to understand the dependencies among them. Imagine yourself in the role of a software engineer responsible for a microservice which publishes data consumed by a few critical Yelp services; you are about to make structural changes to the batch job and want to know who and what downstream of your service will be impacted. Or imagine yourself in the role of a machine learning engineer who would like to add an ML feature to their model and asks: “Can I run a check myself to understand how this feature is generated?”</p><p><strong>Spark-Lineage:</strong> Spark-Lineage is built to solve these problems. 
It provides a visual representation of the data’s journey, including all steps from origin to destination, with detailed information about where the data goes, who owns the data, and how the data is processed and stored at each step. Spark-Lineage extracts all necessary metadata from every Spark-ETL job, constructs graph representations of data movements, and lets users explore them interactively via a third-party data governance platform.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-08-04-spark-lineage/i1.png" alt="Figure 1. Example of Spark-Lineage view of a Spark-ETL job" /><p class="subtle-text"><small>Figure 1. Example of Spark-Lineage view of a Spark-ETL job</small></p></div><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-08-04-spark-lineage/i2.png" alt="Figure 2. Overview of Spark-Lineage" /><p class="subtle-text"><small>Figure 2. Overview of Spark-Lineage</small></p></div><p>Running a Spark job with Spark-ETL is simple; the user only needs to provide (1) the source and target information via a yaml config file, and (2) the logic of the data transformation from the sources to the targets via Python code.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-08-04-spark-lineage/i3.png" alt="Figure 3. An example diagram of a Spark-ETL job" /><p class="subtle-text"><small>Figure 3. An example diagram of a Spark-ETL job</small></p></div><p>On the backend side, we implement Spark-Lineage directly inside Spark-ETL to extract, from every batch job, all pairs of source and target tables that have a dependency relationship. More precisely, we use the <a href="https://networkx.org/">NetworkX</a> library to construct a workflow graph of the job and find all pairs of source and target tables that have a path between them in the corresponding Directed Acyclic Graph (DAG) workflow of that job. 
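</p><p>The path check can be sketched without any dependencies (in production the post uses NetworkX’s graph queries for this; the table and job names below are invented):</p>

```python
# Hypothetical sketch: emit a (source, target) pair whenever a path
# exists from a source table to a target table in the job's workflow DAG.
from collections import deque

def reachable(dag, start):
    """All nodes reachable from `start` in a DAG given as an adjacency dict."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in dag.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def lineage_pairs(dag, sources, targets):
    return [(s, t) for s in sources for t in targets if t in reachable(dag, s)]

# Toy workflow echoing Figure 3: input_2 feeds only output_1, so
# (input_2, output_2) is not a lineage pair.
dag = {
    "input_1": ["transform_a"],
    "transform_a": ["output_1", "output_2"],
    "input_2": ["output_1"],
}
pairs = lineage_pairs(dag, ["input_1", "input_2"], ["output_1", "output_2"])
# [('input_1', 'output_1'), ('input_1', 'output_2'), ('input_2', 'output_1')]
```

<p>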
Intermediate tables in the transformation are not recorded in Lineage because they are temporary. For example, (Input Table 1, Output Table 2) is a pair in Figure 3 since there is a path between them, while (Input Table 2, Output Table 2) is not. For every such pair, we emit a message to Kafka including the identifiers of the source and target, together with other necessary metadata. These messages are then transferred from Kafka to a dedicated table in Redshift.</p><p>The reason we go with a two-step process instead of sending messages directly to one place is that Redshift has maintenance downtime, while Kafka is highly available to receive newly emitted messages at all times. Storing the data in Redshift, on the other hand, is highly durable and easy to query for analytics purposes. At Yelp, we run on the order of thousands of batch jobs per day, and on average each job emits around 10 messages. In total, the Lineage table grows by a couple of million rows per year, which Redshift can handle with ease. Spark-Lineage then reads from the Redshift table and serves users via an ETL tool plug-in.</p><h2 id="building-spark-lineages-ui">Building Spark-Lineages UI</h2><p>First, we parse the metadata made available from the above steps in Redshift and identify the source and target information. This metadata is first read into a staging table in the Redshift database. The reason we stage this data is to identify any new jobs introduced in the daily load and to capture any updates to the existing scheduled jobs.</p><p>We then create a link (a canonical term for tables, files, etc.) for each Spark-ETL table, together with additional information extracted from the metadata. We also add relationships between these jobs and their respective schemas. 
Finally, we establish the connections among source and target tables according to the DAG extracted from Spark-ETL.</p><p>A mock-UI of Spark-Lineages is shown in Figure 1, where the user can browse or search for all Spark tables and batch jobs, read the details of each table and job, and track the dependencies among them from origin to destination.</p><h2 id="understanding-a-machine-learning-feature">Understanding a Machine Learning feature</h2><p>Data scientists working on Machine Learning models often look for existing data when building new features. In some cases the data they find might be based on different assumptions about what data should be included. For example, one team may include background events in a count of all recent events that a given user has performed, even though the model should not include such events. In such a case, Spark-Lineage allows a team to track down what data was used, surface these differing assumptions, and resolve the discrepancies.</p><h2 id="understanding-the-impacts">Understanding the impacts</h2><p>One of the major advantages of having data lineage identified and documented is that it enables Yelpers to understand the downstream and upstream dependencies of any change that will be incorporated into a feature. It also enables easy coordination across relevant teams to proactively measure the impact of a change and make decisions accordingly.</p><h2 id="fixing-data-incidents">Fixing data incidents</h2><p>In a distributed environment, many things can derail a batch job, leading to incomplete, duplicated, and/or partially corrupt data. Such errors may go unnoticed for a while, and by the time they are discovered they have already affected downstream jobs. In such cases, the response includes freezing all downstream jobs to prevent the corrupt data from spreading further, tracing all upstream jobs to find the source of the error, then backfilling from there and all downstream inaccurate data. 
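</p><p>The “freeze everything downstream” step maps directly onto the lineage table described earlier: a recursive query over the (source, target) rows yields the full downstream closure of a corrupted table. As an illustration (table and column names invented), here is the idea demonstrated on SQLite so the sketch is self-contained; Redshift also supports recursive CTEs:</p>

```python
# Hypothetical sketch: find everything downstream of a corrupted table by
# walking the lineage edges with a recursive CTE (demonstrated on SQLite).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineage (source TEXT, target TEXT)")
conn.executemany(
    "INSERT INTO lineage VALUES (?, ?)",
    [
        ("raw_reviews", "review_features"),
        ("review_features", "ranking_model_input"),
        ("raw_photos", "photo_features"),  # unrelated branch, untouched
    ],
)

downstream = conn.execute(
    """
    WITH RECURSIVE affected(name) AS (
        SELECT ?
        UNION
        SELECT l.target FROM lineage l JOIN affected a ON l.source = a.name
    )
    SELECT name FROM affected
    """,
    ("raw_reviews",),
).fetchall()
affected = sorted(row[0] for row in downstream)
# ['ranking_model_input', 'raw_reviews', 'review_features']
```

<p>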
Finally, we restore the jobs once the backfilling is done. All of these steps need to happen as fast as possible, and Spark-Lineage could be the perfect place to quickly identify the suspect jobs.</p><p>Moreover, recording the responsible team in Spark-Lineage establishes accountability for each job, so maintenance or on-point teams can approach the right owners at the right time. This avoids multiple conversations with multiple teams to identify the owners of a job and reduces delays that could adversely affect business reporting.</p><h2 id="feature-store">Feature Store</h2><p>Yelp’s ML Feature Store collects and stores features and serves them to consumers to build Machine Learning models or run Spark jobs, and to data analysts to get insights for decision-making. The Feature Store offers many benefits, among them:</p><ol><li>Avoiding duplicated work, e.g. from different teams trying to build the same features;</li>
<li>Ensuring consistency between training and serving models; and</li>
<li>Helping engineers to easily discover useful features.</li>
</ol><p>Data Lineage can help improve the Feature Store in various ways. We use Lineage to track the usage of features, such as how frequently a feature is used and by which teams, to determine the popularity of a feature or how much performance gain it can bring. From that, we can perform analytics to promote or recommend good features, or to guide the creation of similar features that we think can benefit our ML engineers.</p><h2 id="compliance-and-auditability">Compliance and auditability</h2><p>The metadata collected in Lineage can be used by legal and engineering teams to ensure that all data is processed and stored following regulations and policies. It also makes it easier to adapt the data processing pipeline should new regulations be introduced in the future.</p><p>This post introduces Yelp’s Spark-Lineage and demonstrates how it helps track and visualize the life cycle of data among our services, together with applications of Spark-Lineage in different areas at Yelp. For readers interested in the specific implementation of Spark-Lineage, we have included a server- and client-side breakdown below (Appendix).</p><h2 id="implementation-on-the-server-side">Implementation on the server side</h2><h3 id="data-identifiers">Data identifiers</h3><p>The most basic metadata that Spark-Lineage needs to track are the identifiers of the data. We provide two ways to identify an input/output table: the <em>schema_id</em> and the <em>location</em> of the data.</p><ul><li>
<p><strong>Schema_id:</strong> All modern data at Yelp is schematized and assigned a schema_id, regardless of whether it is stored in Redshift, S3, the Data Lake, or Kafka.</p>
</li>
<li>
<p><strong>Location:</strong> Table location, on the other hand, is not standardized across data stores; generally it is a triplet of (collection_name, table_name, schema_version), although the components are usually named differently in each data store, in line with that store’s terminology.</p>
</li>
</ul><p>Either way, given one identifier, we can look up the other. Schema information can be looked up via a CLI, via PipelineStudio – a simple UI to explore the schemas interactively – or directly in the Spark-Lineage UI, which offers more advanced features than PipelineStudio. By providing one of the two identifiers, we can see the description of every column in the table, how the schema of the table has evolved over time, and so on.</p><p>The two identifiers each have their own pros and cons and complement each other. For example:</p><ul><li>The schema_id provides a more canonical way to access the data information, but the location is easier to remember and more user-friendly.</li>
<li>If the schema is updated, the schema_id will no longer be the latest, whereas looking up the pair (collection_name, table_name) will always return the latest schema. We can also discover the latest schema from a schema_id, but it takes one more step.</li>
</ul><h3 id="tracking-other-information">Tracking other information</h3><p>Spark-Lineage also provides the following information:</p><ul><li>
<p><strong>Run date:</strong> We collect the date of every run of the job. From this we can infer the job’s running frequency, which is more reliable than the description in the yaml file because the frequency can change over time. If we don’t see any run for a month, we keep the job’s output tables available but mark them as deprecated so that users are aware.</p>
</li>
<li><strong>Outcome:</strong> We also track the outcome (success/failure) of every run of the job. We do not notify the owner of the job in case of a failure, because at Yelp we have dedicated monitoring and alerting tools. We use this data for the same purpose as above; if a service fails many times, we mark the output tables to let users know.</li>
<li><strong>Job name and yaml config file:</strong> These help the user quickly locate the information needed to understand the logic of the job, together with the owner of the job in case the user would like to reach out with follow-up questions.</li>
<li><strong>Spark-ETL version, service version, and Docker tag:</strong> This information is also tracked for every run and used for more technical purposes such as debugging. One use case: if an ML engineer notices a recent statistical shift in a feature, they can look up and compare the exact code of a run today versus that of a run last month.</li>
</ul><h2 id="implementation-on-the-client-side">Implementation on the client side</h2><p><strong>Representation of Spark ETL jobs:</strong> As a first step to represent a Spark ETL job, a new domain named “Spark ETL” is created. This enables easy catalog searching and results in a dedicated area for storing the details of Spark-ETL jobs from the Redshift staging table. Once the domain is available, unique links (for the Spark ETL jobs) are created in the data governance platform with the job name as the identifier.</p><p><strong>Adding metadata information:</strong> The details of the Spark ETL job (e.g., repository, source yaml, etc.) are attached to the respective links created above. Each piece of metadata is given a unique id and value, with a relation to the associated job. The current mechanism implemented for the Spark ETL jobs can be extended to represent additional information in the future.</p><p><strong>Assign accountability:</strong> As the information about the owners is fetched from Kafka into Redshift, the responsibility section of the job link in the data governance platform can be modified to include the “Technical Steward” – the engineering team accountable for the Spark ETL job, including producing and maintaining the actual source data, and responsible for the technical documentation of the data and for troubleshooting data issues.</p><p><strong>Establishing the lineage:</strong> Once the Spark-ETL jobs and the required metadata information are available in the data governance platform, we establish the two-way relations depicting source to Spark ETL job and Spark ETL job to target. The relations are established using a REST POST API call. Once the relations are created, the lineage is generated automatically and made available for use. 
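</p><p>A minimal sketch of what such a REST POST might look like (the endpoint, link identifiers, and field names below are invented for illustration; only the source-to-job and job-to-target shape comes from the text):</p>

```python
# Hypothetical sketch of creating one lineage relation via a REST POST.
# Nothing is sent over the network; we only build the request object.
import json
import urllib.request

def lineage_relation_request(base_url, source_link, target_link):
    payload = {"source": source_link, "target": target_link}
    return urllib.request.Request(
        url=f"{base_url}/relations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Each Spark-ETL job yields two relations: source table -> job, job -> target table.
req_in = lineage_relation_request(
    "https://governance.example.com/api", "schema:input_table_1", "spark_etl:my_job")
req_out = lineage_relation_request(
    "https://governance.example.com/api", "spark_etl:my_job", "schema:output_table_1")
# urllib.request.urlopen(req_in) would perform the actual call.
```

<p>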
Multiple views can be used to depict the relations, but the “Lineage View” captures the dependencies all the way to Tableau dashboards (see Figure 1).</p><p>Thanks to Cindy Gao, Talal Riaz, and Stefanie Thiem for designing and continuously improving Spark-Lineage, and thanks to Blake Larkin, Joachim Hereth, Rahul Bhardwaj, and Damon Chiarenza for technical review and for editing the blog post.</p><div class="island job-posting"><h3>Become an ML Platform Engineer at Yelp</h3><p>Want to build state of the art machine learning systems at Yelp? Apply to become an ML Platform Engineer today.</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/b5699bbf-77ac-47ad-abf1-53638d8a5dec?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2022/08/spark-data-lineage.html</link>
      <guid>https://engineeringblog.yelp.com/2022/08/spark-data-lineage.html</guid>
      <pubDate>Thu, 04 Aug 2022 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Android in Analytics Infra]]></title>
      <description><![CDATA[At Yelp, we have a reasonably large Android community for a company of our size. These talented and skilled Android engineers work on Yelp’s client and business applications. We would like to share some of the unique challenges that we’ve experienced along with our various efforts to overcome those challenges. Analytics Infra is a team at Yelp that works on experimentation and logging platforms and supports them across the entire Yelp ecosystem. Within the Analytics Infra team, we have an Android working group. You may consider our team as an infrastructure team - a team that implements end-user functionality -...]]></description>
      <link>https://engineeringblog.yelp.com/2022/08/android-in-analytics-infra.html</link>
      <guid>https://engineeringblog.yelp.com/2022/08/android-in-analytics-infra.html</guid>
      <pubDate>Wed, 03 Aug 2022 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Writing Emails Using React]]></title>
      <description><![CDATA[<p>As part of our effort to connect users with great local businesses, Yelp sends out tens of millions of emails every month. In order to support the scale of those sends, we rely on third-party Email Service Providers (ESPs) as well as our internal email system, Mercury.</p><p>Delivering the emails is just part of the challenge—we also need to give email developers a way to craft sophisticated templates that conform to our <a href="https://www.yelp.com/styleguide">Yelp design guidelines</a>. In the past, Yelp web and full stack engineers would rely on our legacy template language, Cheetah, to write emails. However, as the Yelp design language continued to evolve, this approach began to show its age: the code wasn’t maintained and visuals were no longer consistent with those of our apps and website. Additionally, Cheetah is a little-known language that represents an entirely different development workflow from what Yelp engineers are most accustomed to writing in their day-to-day work. Essentially all new web development is done in React.</p><p>In 2021, we set out to solve these problems by <strong>creating an email development system based on React components</strong>. Since its general release, this system has been used to develop more than a dozen new email types to send at scale, with <strong>millions of emails sent to date</strong>.</p><p>In this blog post, we’ll detail how we’ve repurposed elements of our website’s infrastructure to support email development, and how these systems address common problems encountered by email developers.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2022-07-20-writing-emails-using-react/react_emails.png" alt="React emails" /></p><p>Yelp web developers write React code in our frontend monorepo, where they can create new packages, write components, and deploy pages. 
By using React for email development as well, developers are able to write emails in familiar ways:</p><ul><li>New emails are scaffolded using <a href="https://yeoman.io/">Yeoman</a>.</li>
<li>Each email template is its own React component. The component’s children are made up of shared email components that are imported from elsewhere in the monorepo.</li>
<li>Our React email code is type checked and linted, following the same rules as the rest of the monorepo.</li>
<li>Developers can write <a href="https://storybook.js.org/">Storybook</a> examples for emails and see them displayed in the same way as our web components.</li>
<li>Emails are tightly integrated with our core web infrastructure; building, image imports, CDN asset upload, and i18n are all supported for free.</li>
</ul><p>We’ve also created tooling for email developers that want to send tests to real email clients. <strong>yelp-js-email</strong> extracts markup from Storybook examples and uses it as a basis for generating a preview email. With a little bit of AST modification, we can generate a test application which renders a preview email that closely conforms to the process used in a production send. The outcome is that developers are able to create and test new emails seamlessly without any backend work or campaign configuration.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2022-07-20-writing-emails-using-react/storybook_example.png" alt="A storybook example of an email" /><img src="https://engineeringblog.yelp.com/images/posts/2022-07-20-writing-emails-using-react/yelp_js_email.png" alt="A terminal command to send a preview email" /></p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-07-20-writing-emails-using-react/preview_email.png" alt="An example email rendered in Storybook and sent to a developer’s inbox with a simple command." /><p class="subtle-text"><small>An example email rendered in Storybook and sent to a developer’s inbox with a simple command.</small></p></div><p>When it comes time to send an email, developers release a new version of their package containing the email component. Next, they <strong>add the released version to our email-rendering microservice</strong>. After making a <strong>backend config change</strong> describing the new email campaign, it’s ready to be sent. Our backend developers can then submit email payloads to Mercury through Yelp’s data pipeline which triggers a server-side render of the email component. After some post-processing transformation to prepare the rendered HTML for sending, we then pass the email through the rest of our email pipeline on to our ESP. 
In a matter of seconds, a user receives an email from Yelp that was written with React!</p><p>For more details on the implementation of React emails and challenges we’ve encountered, read on.</p><p>Email clients have a (not undeserved) reputation for being difficult to develop for. Maintaining good compatibility while accounting for the various quirks of dozens of individual clients and platforms is a task that requires expert knowledge, thorough testing, and a lot of patience. Unlike modern evergreen web browsers, which have largely coalesced around a common standard, email clients have gotten away with custom and broken behaviors for years.</p><p>The critical takeaway from this insight is that <strong>email developers ought to explicitly define email clients that they intend to support</strong>. In the same way that developers across the industry have factored declining market share and maintenance costs into their decisions to drop compatibility for Internet Explorer 11, email developers can make choices when it comes to the email clients that they support. Throughout our testing we’ve found that <strong>popular email clients have better HTML and CSS compatibility than most developers realize</strong>. Outside of a few notable “problem clients” (Desktop Outlook, Windows 10 Mail, etc.), the big players (Gmail, Apple Mail, Yahoo, outlook.com, etc.) largely render spec compliant markup and styles more or less correctly. Common email wisdom, such as the recommendation to never use &lt;div&gt; tags, does not apply to these clients in 2022.</p><p>We looked at engagement numbers for our various emails and found that a small percentage of email recipients were opening their mail in legacy email clients. After consulting with our Product team, we determined that we would drop support for Desktop Outlook and related clients. 
Dropping support means that the text content of the email will still render, but we don’t consider it blocking if the email is otherwise visually broken. By explicitly defining the email clients we intend to support, we:</p><ol><li>Give developers confidence when developing emails that they will display correctly for users within our support standards.</li>
<li>Get a better sense of the absolute market share of our emails.</li>
<li>Are able to craft more compelling, visually appealing, and responsive emails for the majority of Yelp users.</li>
</ol><p>Providing developers with drop-in email components that have already been audited against our support standards is critical. In the same way that web engineers compose pages on the Yelp website by arranging a set of common, consistently designed components, we want to allow the possibility of building emails with minimal custom code required.</p><p>We determined early on that our Design Systems team did not have the resources to build and maintain an entirely separate set of React components exclusively for emails. Our design language is constantly evolving across our three major platforms and requires continual upkeep. Adding email into the mix would incur a significant cost. Our approach was then to reuse as much of the implementations of our existing web React components as we could.</p><p>This might seem impossible at first glance, as web React components are built for a browser context that involves lots of functionality not supported in email clients (interactivity, statefulness, and even animation). However, at its most basic, <strong>React functions just like a classic templating language</strong>. Blocks of template code can be conditionally rendered, composed, and injected with data, all in the context of JSX. When server-side rendered, a React component is transformed into HTML. That HTML is what we can assemble into our email body and send as a static email.</p><p>We employ two strategies to ensure that our existing web components are able to render properly in an email context: component wrappers, used to refine the prop APIs of our web components, and CSS in JS transforms, used to make individual style tweaks. 
Where neither approach is suitable, we’ll create custom components built just for emails.</p><h2 id="component-wrappers">Component Wrappers</h2><p>Each of our repurposed web React components is wrapped in a corresponding component prefixed with “Email” (e.g., <code class="language-plaintext highlighter-rouge">Text</code> -&gt; <code class="language-plaintext highlighter-rouge">EmailText</code>, <code class="language-plaintext highlighter-rouge">Container</code> -&gt; <code class="language-plaintext highlighter-rouge">EmailContainer</code>, etc.). This wrapper component allows us to modify the available props for the component and provide one-off tweaks and overrides.</p><p>For example, our standard <code class="language-plaintext highlighter-rouge">Button</code> component supports an <code class="language-plaintext highlighter-rouge">onClick</code> handler, but we don’t want email developers to use it (no JavaScript means it won’t work!). We simply define a new <code class="language-plaintext highlighter-rouge">Props</code> type (at Yelp we use Flow) with the restricted component API. We also use it as an opportunity to document relevant compatibility notes we’ve encountered in our testing, and to make other tweaks to the template before forwarding along props to the underlying <code class="language-plaintext highlighter-rouge">Button</code> component:</p><p><img src="https://engineeringblog.yelp.com/images/posts/2022-07-20-writing-emails-using-react/email_button.png" alt="Sample code for an EmailButton React component" /></p><h2 id="css-in-js-transforms">CSS in JS Transforms</h2><p>We’re big fans of CSS in JS at Yelp, and are in the process of migrating most of our React components from SASS stylesheets to <a href="https://emotion.sh/docs/introduction">Emotion</a>. One of the often overlooked features of writing CSS in JS is the level of control it provides over styles at runtime. 
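</p><p>For example, a transform applied at render time can rewrite a declaration before it ever reaches an email client. Here is a minimal, framework-free sketch of the idea (the function name and regex are illustrative, not Emotion’s plugin API):</p>

```javascript
// Replace `var(--custom-prop, fallback)` usages with their fallback value,
// since many email clients don't support CSS custom properties.
// Illustrative sketch only; a real plugin operates on parsed style nodes.
function inlineCssVarFallbacks(value) {
  return value.replace(/var\(\s*--[\w-]+\s*,\s*([^)]+)\)/g, '$1');
}

// 'color: var(--button-color, #BE0E02);' becomes 'color: #BE0E02;'
```

<p>A real transform works on parsed style nodes rather than raw strings, but the effect is the same: email-incompatible CSS is rewritten while styles are generated.</p><p>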
Emotion actually <a href="https://emotion.sh/docs/cache-provider">facilitates this</a> through custom <a href="https://github.com/thysultan/stylis">Stylis plugins</a>, which allow developers to systematically transform their styles while a component is rendering!</p><p>We put this to good use for our React email components, tweaking CSS to maximize email compatibility and minimize cognitive load for developers. Check out this example Stylis plugin that helps address a compatibility issue with usage of “var” in property values:</p><p><img src="https://engineeringblog.yelp.com/images/posts/2022-07-20-writing-emails-using-react/stylis_plugin.png" alt="Sample code for a custom CSS in JS plugin" /></p><p>Without this plugin, we’d be hard put to reuse our React Button component in emails. Since the problematic “var” style is embedded in Button’s render method, we’d likely be forced to add an email-specific prop with some branching logic to conditionally remove the CSS. This adds a maintenance burden and introduces a concern with email rendering that our web components would not otherwise have.</p><p>Using this custom CSS in JS plugin, we get to <strong>maintain the encapsulation of our web components</strong> while still <strong>making the tweaks we need</strong> to use them effectively in emails. This is particularly important for emails that we send using <a href="https://amp.dev/about/email/">AMP</a>, which validates styles against a very strict subset of web CSS. Using these plugins, we can modify our styles to ensure they conform to the specification.</p><p>One other advantage of using CSS in JS to style our emails is that we know we won’t be wasting bytes with styles that we don’t need. When we’re building, our styles will be tree shaken alongside the rest of our components’ JS.</p><p>We aren’t always able to repurpose an existing web component using these two approaches; sometimes there are fundamental incompatibilities. 
In these cases, we’re able to write an <code class="language-plaintext highlighter-rouge">Email*</code> component from scratch, knowing that we aren’t unnecessarily introducing duplication. We probably want to <strong>revisit the design of these components in an email context anyway</strong>. For example, our <code class="language-plaintext highlighter-rouge">RatingSelector</code> component allows users to start a review on the Yelp website:</p><p><img src="https://engineeringblog.yelp.com/images/posts/2022-07-20-writing-emails-using-react/rating_selector.png" alt="Screenshot of a RatingSelector on the Yelp website" /></p><p>In an email context, we still want to allow users to tap a star rating to begin a review, but we’re not able to replicate the highlighting behavior on hover in email clients. There’s no way to address this difference using a wrapping component or CSS in JS plugin, so we created a custom <code class="language-plaintext highlighter-rouge">EmailRatingSelector</code> inspired by the original design but better suited for rendering statically.</p><p>As another example, at Yelp we use pre-built React components to standardize most of our layout needs. Most notably we use one called <code class="language-plaintext highlighter-rouge">Arrange</code>. As <code class="language-plaintext highlighter-rouge">Arrange</code> relies on some pretty tricky styles and media queries that don’t play nice with most email clients, we decided to create a custom <code class="language-plaintext highlighter-rouge">EmailArrange</code> component in its place. For <code class="language-plaintext highlighter-rouge">EmailArrange</code> we greatly simplified the available options, opting for a fixed layout <code class="language-plaintext highlighter-rouge">&lt;table&gt;</code>, but maintained a similar props API to the web <code class="language-plaintext highlighter-rouge">Arrange</code> component. 
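</p><p>To make this concrete, here is a hypothetical sketch of the kind of static markup an <code class="language-plaintext highlighter-rouge">EmailArrange</code>-style component might render; the names and attributes are illustrative, not Yelp’s actual implementation:</p>

```javascript
// Render a row of children as a fixed-layout table, the most reliable way
// to build columns across email clients. Hypothetical sketch; the real
// EmailArrange is a React component with a richer props API.
function emailArrange(children) {
  const cells = children
    .map((child) => `<td style="vertical-align:top">${child}</td>`)
    .join('');
  return `<table style="table-layout:fixed;width:100%" role="presentation"><tr>${cells}</tr></table>`;
}
```

<p>With a fixed table layout, every cell gets an equal share of the width without relying on media queries.</p><p>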
Developers consuming <code class="language-plaintext highlighter-rouge">EmailArrange</code> will see it work in much the same way that they’re accustomed to, and the <strong>implementation details of the email-specific differences are abstracted away from them</strong>.</p><p>At Yelp, we’ve found SSR (Server-Side Rendering) of React pages to be a critical component of our web application, positively impacting SEO, page performance, and user experience. Since we need to turn our React email components into HTML for sending, rendering them server-side (ideally using our existing infrastructure) was a critical piece of the puzzle.</p><p>The performance characteristics of rendering emails fundamentally differ from those of serving web traffic. Web traffic tends to scale gently, following predictable curves as users frequent Yelp more often at certain times of the day. Email rendering happens entirely differently. Except in cases where they’re sent immediately in response to a particular user action, emails are typically sent in massive, scheduled campaigns that consist of thousands or even millions of emails in one batch.</p><p>We performed some early tests on our legacy SSR system and found that it was a bottleneck—suddenly queuing thousands of requests to render emails at once quickly overwhelmed it. We were forced to throttle the requests to just tens of emails per second, which was unacceptable for the production campaigns we knew we needed to run.</p><p><a href="https://engineeringblog.yelp.com/2022/02/server-side-rendering-at-scale.html">As outlined in a previous blog post</a>, we’d encountered a myriad of similar challenges scaling Server-Side Rendering in other places, so our awesome web infrastructure team invested in a new system that SSRs pages with far greater performance and reliability. 
Since the widespread rollout of this modern system, we’ve been able to easily scale to <strong>thousands of emails sent per second</strong>, such that the rendering step is no longer a bottleneck in our pipeline.</p><p>When a backend developer writes a batch for a massive email campaign (typically using <a href="https://spark.apache.org/">Spark</a>), they queue email-send requests to Mercury, our internal notifications system (powered by our data pipeline and <a href="https://kafka.apache.org/">Kafka</a>). Those sends contain basic information such as the user to send to and the campaign it belongs to, as well as a payload containing data that’s required to render the template. These send requests are ingested by workers in our email-rendering microservice, which in turn triggers a request to our SSR service shard (the payload gets turned into React props). The shard returns our rendered email in HTML, which is then forwarded along to the rest of the email-sending pipeline.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-07-20-writing-emails-using-react/mercury_diagram.png" alt="Services involved in an end to end send of React emails" /><p class="subtle-text"><small>Services involved in an end to end send of React emails</small></p></div><p>There’s another benefit to rendering emails using our existing SSR infrastructure: since our React web pages are already configured to support making GraphQL queries during SSR using <a href="https://www.apollographql.com/">Apollo</a>, we can make online queries to include data in our email templates using our GraphQL API with no additional work needed.</p><p>Through the previous steps we’ve outlined, we’ve prepared our server-side rendered HTML to be sent to email clients. 
Even after the initial render, there’s still a little post-processing work to be done.</p><p>First, we make an effort to clean up the SSR HTML—in its raw state it’s tailored to be rendered and hydrated in a web browser. Using pyquery we can <strong>clean up</strong> extraneous script tags and attributes that we won’t use. Next, we want to <strong>establish the metadata</strong> for the email (subject line, from address, etc.). Inside each React email component, we use a <code class="language-plaintext highlighter-rouge">&lt;title&gt;</code> tag rendered via <a href="https://github.com/staylor/react-helmet-async#readme">react-helmet-async</a> to set our subject line. Data attributes on that tag provide the rest of the metadata we need.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2022-07-20-writing-emails-using-react/post_processing.png" alt="Sample code for post-processing meta data for an email" /></p><p>Using this approach lets us keep the source of truth for email metadata alongside the component itself.</p><p>Finally, <strong>we need to inline our styles</strong>. Even modern email clients like Gmail suffer from <a href="https://github.com/hteumeuleu/email-bugs/issues/90">limitations like <code class="language-plaintext highlighter-rouge">&lt;style&gt;</code> tag byte limits</a> that make it necessary to move <code class="language-plaintext highlighter-rouge">&lt;head&gt;</code> tag CSS into HTML element <code class="language-plaintext highlighter-rouge">style=""</code> attributes. There are a bunch of open source options to accomplish this task (e.g., <a href="https://github.com/Automattic/juice">juice</a> and <a href="https://github.com/premailer/premailer">premailer</a>). In our testing, we found that existing Python implementations were far too slow for our needs, sometimes taking upwards of a second. 
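</p><p>The task itself is simple to state: move each matching rule from the stylesheet into the <code class="language-plaintext highlighter-rouge">style=""</code> attribute of the elements it selects. Here is a toy version handling only single class selectors (sketched in JavaScript for illustration; our production inliner lives in the Python rendering pipeline):</p>

```javascript
// Toy style inliner: copies `.class { ... }` rules from a stylesheet into
// style="" attributes on elements with a matching class attribute.
// Illustrative only; real inliners parse both the CSS and the HTML properly.
function inlineClassStyles(html, css) {
  const rules = {};
  for (const match of css.matchAll(/\.([\w-]+)\s*\{([^}]*)\}/g)) {
    rules[match[1]] = match[2].trim();
  }
  return html.replace(/class="([\w-]+)"/g, (full, cls) =>
    rules[cls] ? `${full} style="${rules[cls]}"` : full
  );
}
```

<p>Handling real-world CSS (specificity, compound selectors, media queries) is what makes production inliners genuinely hard.</p><p>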
Instead, we opted to write a custom style inliner built on <a href="https://pypi.org/project/tinycss2/">tinycss2</a> and <a href="https://pypi.org/project/selectolax/">selectolax</a>. Even with some AST traversal, the implementation is quite straightforward, and we were able to minimize the time we spent inlining styles down to just a few milliseconds.</p><p>After performing all these tasks, we construct a basic email HTML structure and forward it on its way. The email is ready to be sent!</p><p>By relying on React components and existing Yelp web infrastructure, we were able to architect an email template system that’s easy to use for developers, has up-to-date designs, lowers maintenance costs, and surpasses our performance requirements. In aligning product and engineering needs and clearly defining our email compatibility standards, we spend less time concerned with outlier email clients and more time creating compelling campaigns for Yelp users.</p><p>While the approach to sending emails outlined above is from the frame of reference of Yelp’s infrastructure, the overall system can be replicated using some fundamental building blocks like CSS in JS, a mature SSR platform, and an extensible email-rendering pipeline.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>Want to help us make even better tools for our full stack engineers?</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/b970ccef-75bf-45ce-bda5-e6f3f3988e38/Senior-Software-Engineer-Full-Stack-Remote?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2022/07/writing-emails-using-react.html</link>
      <guid>https://engineeringblog.yelp.com/2022/07/writing-emails-using-react.html</guid>
      <pubDate>Wed, 20 Jul 2022 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Migrating from Styleguidist to Storybook]]></title>
      <description><![CDATA[<p>One of the core tenets for our infrastructure and engineering effectiveness teams at Yelp is ensuring we have a best-in-class developer experience. Our React monorepo codebase has steadily grown as developers create new React components, but our existing <a href="https://react-styleguidist.js.org/">React Styleguidist</a> (Styleguidist, for short) development environment has failed to scale in parallel. By transitioning from Styleguidist to <a href="https://storybook.js.org/">Storybook</a>, we were able to offer a faster and more user-friendly development environment for React components along with better alignment to developer and designer workflows. In this post we’ll take a deep dive into how and why we migrated to Storybook.</p><p>Styleguidist is an interactive React component development environment that developers use to develop and view their user interfaces. Styleguidist can also be used to produce static documentation pages (style guides) that can be hosted and shared with stakeholders.</p><p>Documentation is created using Markdown with code blocks that render a React component in an isolated interactive playground. A simple example looks like the following:</p><div class="language-markdown highlighter-rouge highlight"><pre>The `&lt;ButtonGroup /&gt;` component is used to arrange multiple `&lt;Button /&gt;`
components side-by-side.
```jsx
const Button = require('../Button').default;
&lt;ButtonGroup&gt;
    &lt;Button text="Foo" /&gt;
    &lt;Button text="Bar" /&gt;
    &lt;Button text="Baz" /&gt;
&lt;/ButtonGroup&gt;
```
</pre></div><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-07-06-migrating-from-styleguidist-to-storybook/styleguidist_example.png" alt="An example Styleguidist playground" /><p class="subtle-text"><small>An example Styleguidist playground</small></p></div><p>At Yelp, we’ve encountered various drawbacks from using Styleguidist that have led to a subpar React development experience:</p><ul><li>Styleguidist lacks an add-ons ecosystem due to limited support from the wider Web community, so additional functionality would have to be written from scratch.</li>
<li>Styleguidist does not scale well with large packages because it renders an isolated playground for every example in that package, resulting in slow initial load times and slow hot reloads.</li>
<li>Developers have to create many permutations of each of their components to show every possible state a component supports.</li>
<li>Editing Styleguidist markdown to change component state in the UI is not intuitive for developers and non-technical users.</li>
</ul><p><a href="https://storybook.js.org/">Storybook</a> is an open source UI development and documentation tool that has gained popularity in the Web community in the past few years. It has strong community support and a rich add-ons ecosystem, making it easy to extend for accessibility testing, cross-browser testing, and other functionality.</p><p>Storybook allows users to browse and develop component examples one by one via <a href="https://storybook.js.org/docs/react/get-started/whats-a-story">Stories</a>. Stories capture the rendered state of a React Component, just like a Styleguidist Markdown example. This contrasts with the significantly slower Styleguidist, which always renders every example of every component in a package.</p><p>In Styleguidist, developers often create one example per visual permutation of their component, resulting in added maintenance burden (e.g. updating every example after changing a component API). In Storybook, developers can utilize auto generated <a href="https://storybook.js.org/docs/react/essentials/controls">Controls</a> via <a href="https://github.com/reactjs/react-docgen">react-docgen</a> that allow users to mutate and preview components directly in the documentation UI. This further streamlines the experience compared to Styleguidist, because documentation users no longer need to edit Markdown to change a component’s state.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-07-06-migrating-from-styleguidist-to-storybook/storybook_example.png" alt="An example Storybook playground" /><p class="subtle-text"><small>An example Storybook playground</small></p></div><p>Our React monorepo contained thousands of Styleguidist files, each with many examples of component usage within it. It was not feasible to migrate these by hand, and it would be unreasonable to force developers to manually rewrite their examples in the new Storybook format. 
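</p><p>Rewrites like this are mechanical enough to automate. As a flavor of the transformations involved, here is a hedged, regex-based sketch of converting one ES5 require line to an ES6 default import (real codemods operate on ASTs, not regexes):</p>

```javascript
// Rewrite `const X = require('path').default;` as `import X from 'path';`.
// Illustrative sketch; AST-based tools handle the many variants safely.
function requireToImport(line) {
  return line.replace(
    /const\s+(\w+)\s*=\s*require\('([^']+)'\)\.default;/,
    "import $1 from '$2';"
  );
}

// requireToImport("const Button = require('../Button').default;")
// returns "import Button from '../Button';"
```

<p>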
To maintain our existing React component examples and reduce developer overhead in our migration, we developed the following requirements:</p><ul><li>Our existing Styleguidist files used ES5 style imports and syntax. We want to keep our new Storybook syntax consistent with component source code by using ES6 everywhere.</li>
<li>Documentation in Storybook should be familiar to developers who have used Styleguidist.
<ul><li>Storybook supports <a href="https://mdxjs.com/">MDX</a> which is a file format that combines Markdown with JSX to render React components in Markdown for documentation pages, and we can translate existing Styleguidist Markdown to MDX.</li>
</ul></li>
<li>Each example code block in Styleguidist should be translated into a <a href="https://storybook.js.org/docs/react/get-started/whats-a-story">Story</a>, and the component’s stories.js file should contain all examples.</li>
</ul><p>With these goals in mind, we decided to use codemods to refactor our style guide files into the Storybook format. Codemods are a series of scripted actions that transform a codebase programmatically, and allow for large automated changes to be made without manual work.</p><p>First, we extracted the Styleguidist code blocks; the rest of the contents of the Markdown file (e.g. plaintext descriptions) could be directly copied verbatim to the new MDX file. To achieve a one-to-one migration, we consider each code block as its own Story. We were able to leverage existing tools like <a href="https://www.npmjs.com/package/remark-code-blocks">remark-code-blocks</a> to extract JavaScript code blocks, and <a href="https://github.com/5to6/5to6-codemod">5to6-codemod</a> to convert ES5 syntax within these code blocks to ES6 syntax.</p><div class="language-js highlighter-rouge highlight"><pre>// before:
// const Button = require('../Button').default;
import Button from '../Button';
</pre></div><p>To reduce developer friction during this transition, we decided to contain all Stories for a component in the same <code class="language-plaintext highlighter-rouge">component.stories.js</code> file, which is then displayed in the <code class="language-plaintext highlighter-rouge">component.stories.mdx</code> Docs Page. However, we discovered that MDX code blocks are run in the same context, and our assumption of maintained playground isolation from Styleguidist is no longer true. This issue is particularly problematic when dealing with transforming multiple Styleguidist examples in the same file, because joining the code blocks together results in duplicate imports:</p><div class="language-markdown highlighter-rouge highlight"><pre>```jsx
import Button from '../Button';
Full width `ButtonGroup` example:
&lt;ButtonGroup fill&gt;
(omitted for brevity)
```
```jsx
import Button from '../Button'; // &lt;-- this import is duplicated from above!
Disabled `ButtonGroup` example:
&lt;ButtonGroup disabled&gt;
(omitted for brevity)
```
</pre></div><p>After combining the above stories into a single JS file, the Button import is duplicated. Our codemod needs to parse and dedupe these imports to prevent runtime errors. Additionally, we need to include the components that <a href="https://react-styleguidist.js.org/docs/documenting/#writing-code-examples">Styleguidist implicitly imports</a> for us:</p><div class="language-jsx highlighter-rouge highlight"><pre>// ButtonGroup.stories.js
import Button from '../Button'; // deduped
import { ButtonGroup } from './'; // added implicit import explicitly
&lt;ButtonGroup&gt;
    &lt;Button text="Foo" /&gt;
    &lt;Button text="Bar" /&gt;
    &lt;Button text="Baz" /&gt;
&lt;/ButtonGroup&gt;
</pre></div><p>Next, we write the extracted Markdown code blocks with deduped imports and ES6 syntax in <code class="language-plaintext highlighter-rouge">component.stories.js</code>, and a <code class="language-plaintext highlighter-rouge">component.stories.mdx</code> file with standard Storybook boilerplate:</p><div class="language-jsx highlighter-rouge highlight"><pre>// ButtonGroup.stories.mdx
import { ArgsTable, Canvas, Description, Meta, Story } from '@storybook/addon-docs';
import * as stories from './ButtonGroup.stories.js';
import { ButtonGroup } from './';
&lt;Meta
    title="yelp-react-component-button/ButtonGroup"
    component={ButtonGroup}
/&gt;
The `&lt;ButtonGroup /&gt;` component is used to arrange multiple `&lt;Button /&gt;`
components side-by-side.
&lt;Canvas&gt;
  &lt;Story name="Example0" story={stories.Example0} /&gt;
&lt;/Canvas&gt;
</pre></div><p>Lastly, we needed Storybook to understand how to build our components. We were able to extend the <a href="https://storybook.js.org/docs/react/builders/webpack#extending-storybooks-webpack-config">Storybook build configuration</a> with our existing production webpack configuration. This allowed us to preserve Storybook’s automatic docgen functionality, and miscellaneous features like code preview blocks. Using our existing webpack configuration also meant that components would appear and behave exactly as they do in real production pages.</p><p>Migrating our React component examples from Styleguidist to Storybook has massively improved developer experience and component playground performance. We were able to utilize Storybook features like <a href="https://storybook.js.org/docs/react/configure/overview#on-demand-story-loading">on-demand loading</a> to improve performance by generating a smaller bundle at compile time, resulting in faster playground boot times. Using our codemod migration strategy, we were able to transform nearly all of the examples in our monorepo without runtime errors, without blocking developers during the migration process.</p><p>Switching to Storybook opens up new possibilities for Yelp, and we’re excited to onboard add-ons to accelerate frontend developer productivity further.</p><p>We hope that this breakdown in our migration process helps teams facing similar migrations!</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>Want to help us make even better tools for our full stack engineers?</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/a6cfee89-2dd0-4451-bf52-746b9547dfb7/Software-Engineer-Full-Stack-Remote?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2022/07/migrating-from-styleguidist-to-storybook.html</link>
      <guid>https://engineeringblog.yelp.com/2022/07/migrating-from-styleguidist-to-storybook.html</guid>
      <pubDate>Wed, 06 Jul 2022 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Spark Data Lineage in Collibra]]></title>
<description><![CDATA[<p>In this blog post, we introduce Spark-Lineage, an in-house product to track and visualize how data at Yelp is processed, stored, and transferred among our services.</p><p><strong>Spark and Spark-ETL:</strong> At Yelp, <a href="https://spark.apache.org/">Spark</a> is considered a <a href="https://engineeringblog.yelp.com/2020/03/spark-on-paasta.html">first-class citizen</a>, handling batch jobs in all corners of the business, from crunching reviews to identify similar restaurants in the same area to performing reporting analytics for optimizing local business search. Spark-ETL is our in-house wrapper around Spark, providing high-level APIs to run Spark batch jobs and abstracting away the complexity of Spark. Spark-ETL is used extensively at Yelp, helping save time that our engineers would otherwise need for writing, debugging, and maintaining Spark jobs.</p><p><strong>Problem:</strong> Our data is processed and transferred among hundreds of microservices and stored in different formats in multiple data stores including Redshift, S3, Kafka, Cassandra, etc. Currently, we have thousands of batch jobs running daily, and it is increasingly difficult to understand the dependencies among them. Imagine yourself in the role of a software engineer responsible for a microservice which publishes data consumed by a few critical Yelp services; you are about to make structural changes to the batch job and want to know which services and jobs downstream of yours will be impacted. Or imagine yourself in the role of a machine learning engineer who would like to add an ML feature to their model and ask — “Can I run a check myself to understand how this feature is generated?”</p><p><strong>Spark-Lineage:</strong> Spark-Lineage is built to solve these problems. 
It provides a visual representation of the data’s journey, including all steps from origin to destination, with detailed information about where the data goes, who owns the data, and how the data is processed and stored at each step. Spark-Lineage extracts all necessary metadata from every Spark-ETL job, constructs graph representations of data movements, and lets users explore them interactively via <a href="https://www.collibra.com/us/en">Collibra</a>, a third-party data governance platform.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-29-spark-lineage/i1.png" alt="Figure 1. Example of Spark-Lineage view of a Spark-ETL job" /><p class="subtle-text"><small>Figure 1. Example of Spark-Lineage view of a Spark-ETL job</small></p></div><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-29-spark-lineage/i2.png" alt="Figure 2. Overview of Spark-Lineage" /><p class="subtle-text"><small>Figure 2. Overview of Spark-Lineage</small></p></div><p>To run a Spark job with Spark-ETL is simple; the user only needs to provide (1) the source and target information via a yaml config file, and (2) the logic of the data transformation from the sources to the targets via python code.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-29-spark-lineage/i3.png" alt="Figure 3. An example diagram of a Spark-ETL job" /><p class="subtle-text"><small>Figure 3. An example diagram of a Spark-ETL job</small></p></div><p>On the backend side, we implement Spark-Lineage directly inside Spark-ETL to extract all pairs of source and target tables having dependency relationships from every batch job. More precisely, we use the <a href="https://networkx.org/">NetworkX</a> library to construct a workflow graph of the job, and find all pairs of source and target tables that have a path between them in the corresponding Directed Acyclic graph (DAG) workflow of that job. 
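</p><p>The pairing step amounts to a reachability check over that DAG (sketched here in JavaScript for illustration; the actual implementation uses Python’s NetworkX):</p>

```javascript
// Keep a (source, target) pair only if target is reachable from source
// in the job's DAG. Illustrative sketch of the idea, not Yelp's code.
function lineagePairs(edges, sources, targets) {
  const adj = {};
  for (const [from, to] of edges) {
    (adj[from] = adj[from] || []).push(to);
  }
  const reachable = (start, goal) => {
    const stack = [start];
    const seen = new Set();
    while (stack.length) {
      const node = stack.pop();
      if (node === goal) return true;
      if (seen.has(node)) continue;
      seen.add(node);
      for (const next of adj[node] || []) stack.push(next);
    }
    return false;
  };
  const pairs = [];
  for (const s of sources) {
    for (const t of targets) {
      if (reachable(s, t)) pairs.push([s, t]);
    }
  }
  return pairs;
}
```

<p>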
Intermediate tables in the transformation are not recorded in Lineage because they are temporary. For example, (Input Table 1, Output Table 2) is a pair in Figure 3 since there is a path between them, while (Input Table 2, Output Table 2) is not. For every such pair, we emit a message to Kafka including the identifiers of the source and target, together with other necessary metadata. These messages are then transferred from Kafka to a dedicated table in Redshift.</p><p>The reason we go with a two-step process instead of sending messages directly to one place is that Redshift has maintenance downtime while Kafka is highly available to receive newly emitted messages at all times. On the other hand, storing data in Redshift is highly durable and easy to query for analytics purposes. At Yelp, we have on the order of thousands of batch jobs per day, and on average each job emits around 10 messages. In total, the Lineage table grows by a couple of million rows per year, which can be handled with ease by Redshift. Collibra then reads from the Redshift table and serves users, using a Snaplogic plug-in.</p><h2 id="building-spark-lineages-ui-on-collibra">Building the Spark-Lineage UI on Collibra</h2><p><a href="https://www.collibra.com/us/en">Collibra</a> is a platform to collaborate and establish effective governance for data management and stewardship, enabling their customers/users to find meaning in their data and improve business decisions. Collibra is used at Yelp to provide a platform for data cataloging, discovery, and governance. The tool is being used by Engineering, Product, and several business teams across Yelp.</p><p>First, we parse the metadata made available from the above steps in Redshift and identify the source and target information. This metadata is first read into a staging table in the Redshift database using a <a href="https://www.snaplogic.com/">Snaplogic</a> ETL tool. 
The reason we stage this data is to identify any new jobs that have been introduced in the daily load or to capture any updates to the existing scheduled jobs.</p><p>We then create an asset (a canonical term for tables, files, etc., in Collibra) for each Spark-ETL table together with additional information extracted from the metadata. We also add relationships between these assets and existing assets (e.g., schemas). Finally, we establish the connections among source and target tables according to the DAG extracted from Spark-ETL.</p><p>The UI of Spark-Lineage is shown in Figure 1, where the user can browse or search for all Spark tables and batch jobs, read the details of each table and job, and track the dependencies among them from their origin to their end.</p><h2 id="understanding-a-machine-learning-feature">Understanding a Machine Learning feature</h2><p>Data scientists working on Machine Learning models often look for existing data when building new features. In some cases, the data they find might be based on different assumptions about what data should be included. For example, one team may include background events in a count of all recent events that a given user has performed, when the model does not wish to include such events. In such a case, Spark-Lineage allows a team to track down what data was used in these differing decisions and what data can alleviate the discrepancies.</p><h2 id="understanding-the-impacts">Understanding the impacts</h2><p>One of the major advantages of having data lineage identified and documented is that it enables Yelpers to understand any downstream/upstream dependencies for any changes that will be incorporated into a feature. 
It also enables easy coordination across relevant teams to proactively measure the impact of a change and make decisions accordingly.</p><h2 id="fixing-data-incidents">Fixing data incidents</h2><p>In a distributed environment, there are many reasons that can derail a batch job, leading to incomplete, duplicated, and/or partially corrupt data. Such errors may go unnoticed for a while, and by the time they are discovered, they have already affected downstream jobs. In such cases, the response includes freezing all downstream jobs to prevent the corrupt data from spreading further, tracing all upstream jobs to find the source of the error, then backfilling from there through all of the downstream inaccurate data. Finally, we restore the jobs when the backfilling is done. All of these steps need to be done as fast as possible, and Spark-Lineage could be the perfect place to quickly identify the corrupted suspects.</p><p>In addition, recording the responsible team in Spark-Lineage establishes accountability for each job, so maintenance teams or on-point teams can approach the right owners at the right time. This avoids having multiple conversations with multiple teams to identify the owners of a job and reduces delays that could adversely affect business reporting.</p><h2 id="feature-store">Feature Store</h2><p>Yelp’s ML Feature Store collects and stores features and serves them to consumers to build Machine Learning models or run Spark jobs, and to data analysts to get insights for decision-making. Feature Store offers many benefits, among them:</p><ol><li>Avoiding duplicated work, e.g., from different teams trying to build the same features;</li>
<li>Ensuring consistency between training and serving models; and</li>
<li>Helping engineers to easily discover useful features.</li>
</ol><p>Data Lineage can help improve the Feature Store in various ways. We use Lineage to track feature usage, such as how frequently a feature is used and by which teams, to determine a feature’s popularity or how much performance gain it can bring. From that, we can perform data analytics to promote or recommend good features or guide us to produce similar features that we think can be beneficial to our ML engineers.</p><h2 id="compliance-and-auditability">Compliance and auditability</h2><p>The metadata collected in Lineage can be used by legal and engineering teams to ensure that all data is processed and stored following regulations and policies. It also makes it easier to change the data processing pipeline to comply with new regulations introduced in the future.</p><p>This post introduces the Yelp Spark-Lineage and demonstrates how it helps track and visualize the life cycle of data across our services, together with applications of Spark-Lineage in different areas at Yelp. For readers interested in the specific implementation of Spark-Lineage, we have included a server- and client-side breakdown below (Appendix).</p><h2 id="implementation-on-the-server-side">Implementation on the server side</h2><h3 id="data-identifiers">Data identifiers</h3><p>The most basic metadata that Spark-Lineage needs to track are the identifiers of the data. We provide two ways to identify an input/output table: the <em>schema_id</em> and the <em>location</em> of the data.</p><ul><li>
<p><strong>Schema_id:</strong> All modern data at Yelp is schematized and assigned a schema_id, no matter whether it is stored in Redshift, S3, Data Lake, or Kafka.</p>
</li>
<li>
<p><strong>Location:</strong> Table location, on the other hand, is not standardized across data stores, but it is generally a triplet of (collection_name, table_name, schema_version), although the components are usually named differently in each data store, in line with that store’s terminology.</p>
</li>
</ul><p>Either way, if we are given one identifier, we can get the other. Looking up schema information can be done via a CLI, via PipelineStudio – a simple UI for exploring schemas interactively – or directly on Collibra, which offers more advanced features than PipelineStudio. By providing one of the two identifiers, we can see the description of every column in the table, how the schema of the table has evolved over time, etc.</p><p>Each of the two identifiers has its own pros and cons, and they complement each other. For example:</p><ul><li>The schema_id provides a more canonical way to access the data information, but the location is easier to remember and more user-friendly.</li>
<li>If the schema is updated, the schema_id will no longer point to the latest version, while looking up the pair (collection_name, table_name) will always return the latest schema. We can also discover the latest schema from a schema_id, but it takes one more step.</li>
</ul><h3 id="tracking-other-information">Tracking other information</h3><p>Spark-Lineage also provides the following information:</p><ul><li><strong>Run date:</strong> We collect the date of every run of the job. From this we can infer its running frequency, which is more reliable than relying on the description in the yaml file, because the configured frequency can change over time. If we don’t receive any runs for a month, we still keep the output tables of the job available in Collibra but mark them as deprecated so that the users of Collibra are aware of this.</li>
<li><strong>Outcome:</strong> We also track the outcome (success/failure) of every run of the job. We do not notify the owner of the job in case of a failure, because at Yelp we have dedicated tools for monitoring and alerts. We use this data for the same purpose as above; if a service fails many times, we will mark the output tables to let the users know about that.</li>
<li><strong>Job name and yaml config file:</strong> This helps the user quickly locate the necessary information to understand the logic of the job, together with the owner of the job in case the user would like to contact for follow-up questions.</li>
<li><strong>Spark-ETL version, service version, and Docker tag:</strong> This information is also tracked for every run and used for more technical purposes such as debugging. One use case: if an ML engineer notices a recent statistical shift in a feature, they can look up and compare the specific code of a run today versus that of last month.</li>
</ul><h2 id="implementation-on-the-client-side">Implementation on the client side</h2><p><strong>Creating assets in Collibra:</strong> As a first step to create the Spark ETL assets in Collibra, a data domain named “Spark ETL” is created for easy catalog searching and to have a dedicated area for storing the details of these jobs within Collibra. Once the domain is made available, the Spark ETL job details that are staged in the staging Redshift table are loaded as assets using the Collibra API with the job name as the unique identifier.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-29-spark-lineage/i4.png" alt="" /></div><p><strong>Assigning attributes to the assets:</strong> The details of the Spark ETL job (e.g., Repository, source yaml, etc.) are attached to the respective assets created above as attributes. Each of the attributes has a unique id and value with a relation to the associated asset using asset_attribute_key. The current asset attributes for the Spark ETL jobs can be extended to represent additional information in the future.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-29-spark-lineage/i5.png" alt="" /></div><p><strong>Accountability of the asset:</strong> As the information about the owners is fetched from Kafka into Redshift, the responsibility of the asset can be modified to include the “Technical Steward” – an engineering team that is accountable for the Spark ETL job, including producing and maintaining the actual source data, and responsible for technical documentation of data and troubleshooting data issues.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-29-spark-lineage/i6.png" alt="" /></div><p><strong>Establishing the lineage:</strong> Once the assets and the required attributes are made available in Collibra, we establish the 2-way relation to depict source to Spark ETL job and Spark ETL job to target
relation. The relations are established using a Collibra REST API POST call. After the relations are created, the lineage is auto-created and is available under the diagram section of the asset. There are multiple views that can be used for depicting the relations among the established Collibra assets, but “Lineage View” captures the dependencies all the way to Tableau dashboards (see Figure 1).</p><p>Thanks to Cindy Gao, Talal Riaz, and Stefanie Thiem for designing and continuously improving Spark-Lineage, and thanks to Blake Larkin, Joachim Hereth, Rahul Bhardwaj, and Damon Chiarenza for technical review and editing the blog post.</p><div class="island job-posting"><h3>Become an ML Engineer at Yelp</h3><p>Want to build state-of-the-art machine learning systems at Yelp? Apply to become a Machine Learning Engineer today.</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/855b8be8-29b3-40c6-be1f-dd1f22663cc8?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2022/06/spark-data-lineage-in-collibra.html</link>
      <guid>https://engineeringblog.yelp.com/2022/06/spark-data-lineage-in-collibra.html</guid>
      <pubDate>Wed, 29 Jun 2022 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[A Simply, Ordinary Reduction]]></title>
<description><![CDATA[<p>Experimentation has become standard practice for companies, and one of the most important aspects is how to evaluate the results to make ship/no-ship decisions. Have you run into experiments where you don’t have enough data for statistically significant results, or where the performance of your primary metric seemingly disagrees with that of your secondary metrics? If so, leveraging existing features to perform variance reduction may help with coming to a conclusion. At Yelp, we have found that using features typically used in ML modeling, in particular, can help with measuring treatment effects better than solely using t-tests!</p><p>Before deciding to fully launch a new feature, you will typically want to have some confidence that the feature will actually lead to some form of a win (e.g. engagement, revenue, etc.). To test the feature change, one of the most common approaches is an A/B experiment. At its simplest level, start by randomly assigning half of your users to see the new feature and the other half not to. Once the experiment has run for a sufficiently long time, you can compare the results.</p><p>For this comparison of the control and treatment cohorts, standard practice is to use a <a href="https://www.investopedia.com/terms/t/t-test.asp">t-test</a> to determine if the two cohorts have statistically significant differences. First, you need to choose some metric to represent the performance of each cohort. Once you have calculated the metric for each user in the control and treatment cohorts, the treatment lift can simply be the average of treatment metrics minus the average of control metrics.
To determine if that lift is statistically significant, use a t-test to compare the two sets of metrics for the control and treatment cohorts.</p><p>While this all sounds great in theory, one of the key downsides of only using a t-test is that when there is a significant amount of unexplained variation in the comparison metric, you may have to run the experiment longer than you would like to reach a statistically significant difference. This is where variance reduction techniques come into play. To start this blog post, let’s go through a demo of how we would use an Ordinary Least Squares regression to help in our experiment analysis! Ordinary Least Squares regression is reminiscent of a <a href="https://www.youtube.com/watch?v=CUjrySBwi5Q&amp;ab_channel=FunnyTikTok">certain popular TikTok video</a> that will serve as a great guide as we learn more about how it works.</p><h2 id="a-fresh-pie">A Fresh Pie!</h2><p>For our demo, let’s use Yelp, a company you are hopefully very familiar with. One way Yelp helps connect users to local businesses is through ads on various parts of the Yelp website/app. Let’s say we identify a specific segment of advertisers who could really benefit from spending slightly more on their advertising with Yelp, and we build a new feature on the Yelp dashboard to encourage this. We believe that if a business owner sees this on their Yelp dashboard, they will be more likely to increase their advertising budget with Yelp.</p><p>As a side note, in practice, we are actually working on product features that give advertisers the best spending recommendations (see the Budget Design and Infrastructure Updates section of this <a href="https://blog.yelp.com/news/yelp-releases-new-yelp-for-business-features-enabling-more-effective-advertising-and-adding-control-and-value-for-business-owners/">blog post</a>) for every local business here at Yelp!</p><p>Now back to the demo!
We can set this up in Python with the following code snippet:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig1.png" alt="" /></div><p>In this code snippet, we have 50 new Yelp advertisers that visit the Yelp dashboard per day, and we run this experiment for a month. Let’s assume that the current budget of each Yelp advertiser is normally distributed, with a minimum value of $5 since budgets cannot be negative. We also assume that the proposed treatment, on average, results in a $2 increase in the business owner’s advertising budget, which we assume to be normally distributed and independent of the business’s existing advertising budget.</p><p>Thus, the metric we want to compare is the post-treatment advertising budget between the control and treatment cohorts, to see if there is a statistically significant difference. Here is how we would do this with a t-test:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig2.png" alt="" /></div><p>What we observe is that, on average, post-treatment advertising budgets are ~$1.80 higher in the treatment cohort than in the control cohort (well within the variation we set for the expected $2 budget increase). More importantly, we see from the t-test that the <a href="https://www.investopedia.com/terms/p/p-value.asp">p-value</a> for the difference in resulting advertising budgets between the two cohorts is 0.0473, which means this difference is indeed statistically significant. Perfect, we are now more confident that our treatment has the desired effect of increasing advertisers’ budgets!</p><h2 id="save-me-a-slice">Save Me a Slice!</h2><p>Now I know what you’re thinking. That was quite a lot of assumptions we made to simplify our A/B experiment, so let’s complicate things quite a bit. Different advertisers at Yelp have different budgeting needs.
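Since the post's snippets are rendered as images, here is a rough sketch of the kind of simple setup and t-test described above; the exact distributions, parameters, and variable names are assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 50 * 30 // 2  # 50 new advertisers/day for a month, split evenly into two cohorts

# Pre-treatment budgets: normally distributed, floored at $5
control_pre = np.maximum(rng.normal(50, 10, n), 5)
treatment_pre = np.maximum(rng.normal(50, 10, n), 5)

# The treatment adds ~$2 on average, independent of the existing budget
control_post = control_pre
treatment_post = treatment_pre + rng.normal(2, 1, n)

lift = treatment_post.mean() - control_post.mean()
t_stat, p_value = stats.ttest_ind(treatment_post, control_post)
```

With enough samples relative to the budget spread, the t-test detects the ~$2 lift; shrink `n` or widen the budget distribution and the p-value climbs, which is exactly the problem variance reduction addresses.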
Let’s incorporate this difference into our code and try running the same t-test.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig3.png" alt="" /></div><p>This defines three possible types of advertiser budgets: <code class="language-plaintext highlighter-rouge">low</code>, <code class="language-plaintext highlighter-rouge">mid</code>, and <code class="language-plaintext highlighter-rouge">high</code>, each with a different distribution of advertising budgets, making up 25%, 50%, and 25% of Yelp’s advertisers respectively.</p><p>From the results of the t-test, we can see that the difference between the advertising budgets of the treatment and control cohorts is negative and, more importantly, we do not observe a statistically significant difference. The problem with running a t-test here is that the added variance from the three different types of advertiser budgets is treated as noise when, in reality, it can be explained.</p><p>As an exercise, let’s see what would happen if we ran the experiment longer to reduce this “noise.” If we were to run this experiment for 3 months, we would actually still see the same results, and it is not until the 4th month that we see a statistically significant lift in advertising budget in the treatment cohort.</p><p>We probably don’t want to be running an experiment for 4 months for a variety of reasons (e.g. this subset of advertisers might not even benefit from the increased advertising budget after that long). Let’s see if we can come to a different conclusion using an <a href="https://en.wikipedia.org/wiki/Ordinary_least_squares">Ordinary Least Squares</a> (OLS) regression where we define the dependent variable as the post-treatment advertising budget.
We can define it as the following:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig4.png" alt="" /></div><p>As an example, if we choose <code class="language-plaintext highlighter-rouge">cohort</code> as the only feature, which is equivalent to having an empty <code class="language-plaintext highlighter-rouge">X</code>, we will actually get the same results as our t-test.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig5.png" alt="" /></div><p>We can ignore most of the numbers in the summary of our OLS regression and focus on the <code class="language-plaintext highlighter-rouge">coef</code> and <code class="language-plaintext highlighter-rouge">P&gt;|t|</code> values for our treatment indicator feature (<code class="language-plaintext highlighter-rouge">cohort[T.Treatment]</code>). This treatment indicator feature is simply 1 if the advertiser belonged to the treatment cohort and 0 otherwise. One thing to note is that there is no feature for the Status Quo cohort since our OLS regression has selected it to be the reference group for the <code class="language-plaintext highlighter-rouge">cohort</code> feature. We can manually select the reference group as an input to the OLS regression if necessary; otherwise, one is chosen automatically.</p><p>The coefficient of this feature, <code class="language-plaintext highlighter-rouge">coef</code>, represents the average effect that being in the treatment group has on the advertiser’s post-treatment budget. Thus, like the t-test, we are seeing that the treatment leads to a $1.85 lower post-treatment advertising budget. <code class="language-plaintext highlighter-rouge">P&gt;|t|</code> represents the statistical significance of this feature’s coefficient, which for this OLS regression has the same p-value we calculated in our t-test.
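This equivalence between a treatment-indicator-only OLS and a two-sample t-test is easy to verify with statsmodels' formula API; the data and column names below are made up for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cohort": ["Status Quo"] * 400 + ["Treatment"] * 400,
    "post_budget": np.concatenate([rng.normal(50, 10, 400),
                                   rng.normal(52, 10, 400)]),
})

# OLS with the treatment indicator as the only feature...
fit = smf.ols("post_budget ~ cohort", data=df).fit()
ols_coef = fit.params["cohort[T.Treatment]"]
ols_p = fit.pvalues["cohort[T.Treatment]"]

# ...reproduces the two-sample t-test (equal variances assumed)
t_stat, t_p = stats.ttest_ind(df.loc[df.cohort == "Treatment", "post_budget"],
                              df.loc[df.cohort == "Status Quo", "post_budget"])
```

The fitted coefficient equals the raw mean difference between cohorts, and its p-value matches the t-test's, up to floating-point noise.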
Again, in this example, we see that the coefficient is not statistically significant, with a value of 0.386.</p><p>Rather than just use <code class="language-plaintext highlighter-rouge">cohort</code> as a feature, however, let’s see what happens when we add pre-treatment advertising budgets as part of <code class="language-plaintext highlighter-rouge">X</code>. Since cohort assignments are randomly picked, there should be no violation of independence between the treatment label and this second feature.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig6.png" alt="" /></div><p>What we observe now is that the coefficients of both features in our OLS are significant. The values themselves are also informative! The coefficient for the treatment cohort matches our expectations of a $2 increase in advertising budget, and the coefficient for pre-treatment advertising budget is 1. Thus, our model essentially believes that the post-treatment budget can be represented by the pre-treatment budget plus $2 if the advertiser was in the treatment cohort.</p><p>To truly understand how much time using an OLS regression with informative predictors can save in an experiment, we can create an A/A test and run a power analysis. For the A/A test, we will run the same two OLS regressions as above, but set the <code class="language-plaintext highlighter-rouge">treatment_effect</code> equal to the <code class="language-plaintext highlighter-rouge">sq_effect</code>. Once we have these two regressions, we can calculate an estimate of the population standard deviation of our treatment indicator feature from the std err output and use that in our power analysis.</p><p>Let’s assume a relatively standard alpha of 0.05 and beta of 0.2. If we wanted to detect a minimum effect size of $0.10 without pre-treatment budget as a feature, we would need over 10,000,000 samples.
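A power calculation along these lines can be sketched with statsmodels; the residual standard deviations below are illustrative assumptions, not the post's actual numbers:

```python
from statsmodels.stats.power import TTestIndPower

# Minimum detectable effect in dollars, converted to Cohen's d by dividing by
# the residual standard deviation of the outcome.
mde = 0.10
analysis = TTestIndPower()

# Without the pre-treatment covariate, most of the budget spread is unexplained
# (assume a residual sd of $15), so the standardized effect is tiny.
n_without = analysis.solve_power(effect_size=mde / 15.0, alpha=0.05, power=0.8)

# With a highly predictive covariate, the residual sd shrinks (assume $1).
n_with = analysis.solve_power(effect_size=mde / 1.0, alpha=0.05, power=0.8)
```

Because required sample size scales with 1/d², cutting the residual standard deviation by 15x cuts the required samples by roughly 225x.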
Note, this is equivalent to our initial methodology of just running a t-test.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig7.png" alt="" /></div><p>Instead, if we add pre-treatment budgets as a feature to detect the same minimum effect size, we’ll need 800 samples. This illustrates the immense impact that informative predictors can have on making the correct ship/no-ship decision with significantly shortened experiment lengths.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig8.png" alt="" /></div><h2 id="thats-enough-slices">That’s Enough Slices!</h2><p>If you’re still not convinced that using an OLS regression is necessary, I would absolutely agree. In the previous example, we could have run a t-test to look at the differences between post and pre-treatment budgets as our primary metric.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig9.png" alt="" /></div><p>Let’s make things even more complicated then! Our other assumption was that the treatment would have the same effect on all advertisers, which is very rarely the case in practice. Let’s replicate this behavior in our demo by varying the treatment effect for the category that the advertiser is a part of. For example, let’s say that <code class="language-plaintext highlighter-rouge">restaurant</code>, <code class="language-plaintext highlighter-rouge">plumber</code>, <code class="language-plaintext highlighter-rouge">electrician</code> categories make up 25%, 25%, and 50% of all Yelp advertisers. 
Although not true in practice, let’s also assume that the category of advertiser and their advertising budget are independent of one another.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig10.png" alt="" /></div><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig11.png" alt="" /></div><p>Let’s now run the same OLS and add category as a feature, since we know that the treatment effect depends on what category the business is a part of.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig12.png" alt="" /></div><p>As before, our reference group for <code class="language-plaintext highlighter-rouge">category_type</code> is the <code class="language-plaintext highlighter-rouge">electrician</code> value, so we do not see an indicator feature for that specific category of advertiser.</p><p>Unfortunately, the results aren’t exactly what we would have expected. For example, if a business owner is an <code class="language-plaintext highlighter-rouge">electrician</code>, our model would predict that the treatment would increase their advertising budget by ~$0.75, whereas, in reality, it should have decreased their budget by $0.50. This $0.75 represents the average treatment effect on advertising budget, since over all advertisers, 50% (electricians) will see a budget decrease of $0.50, 25% (plumbers) will see a budget increase of $4, and 25% (restaurants) will see no treatment effect (<code class="language-plaintext highlighter-rouge">-$0.50*50% + $4*25% + $0*25% = $0.75</code>). Sometimes this is actually all you need, especially when you just want to understand what will happen if Yelp decides to treat a randomly selected advertiser.</p><p>Say we want to dive deeper and understand the treatment effect for each category.
The problem with our current OLS model is that we are unable to capture the interaction effect between categories and what the conditional treatment effect will be. To remedy this, let’s leverage interaction variables in our OLS by multiplying the treatment label with each categorical feature in the following manner:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig13.png" alt="" /></div><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-06-27-a-simply-ordinary-reduction/Fig14.png" alt="" /></div><p>Now we can see the OLS regression has two additional interaction features. For example, <code class="language-plaintext highlighter-rouge">cohort[T.TREATMENT]:category_type[T.plumber]</code> will be 1 if an advertiser is a plumber and in the treatment group and 0 otherwise. Essentially, this feature, combined with our treatment indicator feature, will give us the average treatment effect on advertising budget for plumbers. It is also worth noting that the <code class="language-plaintext highlighter-rouge">category_type</code> features alone are not statistically significant, which makes sense since category alone should not affect a business’s advertising budget in our example.</p><p>This is both more interpretable and more consistent with the data we generated in our demonstration. For each category of advertiser, we can see that:</p><ol><li><code class="language-plaintext highlighter-rouge">electrician</code>: There is no interaction term because this is our reference group for <code class="language-plaintext highlighter-rouge">category_type</code>. Thus, the treatment effect is simply the coefficient of the treatment label, -0.4450, so the advertising budget is roughly $0.50 less, as expected.</li>
<li><code class="language-plaintext highlighter-rouge">plumber</code>: The coefficient of the interaction term is 4.4844, so if we add the (negative) coefficient of the treatment label, the advertising budget is roughly $4 more, also as expected.</li>
<li><code class="language-plaintext highlighter-rouge">restaurant</code>: The coefficient of the interaction term is 0.5002, so if we add the (negative) coefficient of the treatment label, the advertising budget is roughly neutral, also as expected.</li>
<li>Also, note that the coefficients for all the features mentioned in this section are statistically significant.</li>
</ol><p>Thus, we have been able to show that using an OLS for variance reduction can significantly help with two parts of experiment analysis: decreasing the amount of time we need to run the experiment as well as giving insight into the varying effects that the proposed treatment will have on different populations in the experiment.</p><h2 id="requirements-for-ols">Requirements for OLS</h2><p>Now that we have gone through a demonstration of how an OLS regression may help with experiment analysis, let’s talk about some of the caveats of performing this type of analysis.</p><ol><li>The first is fairly straightforward; using an OLS regression is a conditional expectation and will give the average effect of a feature if we do not include any interaction terms.
<ul><li>In practice, the treatment effect will likely not be uniform across all subjects. For example, if our treatment has a larger effect on businesses with more reviews on Yelp, an OLS regression without interaction terms would not be able to distinguish the treatment effect between two identical businesses with 0 and 10 reviews from that between two identical businesses with 1000 and 1010 reviews.</li>
<li>Despite this, starting with an OLS regression can still help with identifying predictive features and can sometimes be all you may want or need in your experiment analysis.</li>
</ul></li>
<li>The second caveat is that the selection of businesses for treatment must be independent of other features used in the OLS regression.
<ul><li>When running a randomized experiment, this criterion will usually be met, as whether or not a business receives the treatment is random.</li>
</ul></li>
<li>Variance reduction will only be noticeable when the features are highly predictive of the dependent variable.
<ul><li>The theory behind variance reduction is that we want to attribute what a t-test would consider unexplainable noise to other features that can explain it. If these features are unable to explain much, we would not significantly reduce variance and would be no better off than running a t-test for analyzing the experiment.</li>
</ul></li>
<li>Be careful with regularization!
<ul><li>Adding a regularization term to your OLS regression can give you a biased read on the coefficients, because they will likely be smaller than their original, unbiased values.</li>
</ul></li>
</ol><p>Arguably, the most important requirement when we perform variance reduction is that the features we use must be pre-treatment values. If we do use post-treatment features, there are two possible scenarios:</p><ol><li>The treatment has no effect on the post-treatment feature.
<ul><li>If the post-treatment feature has no effect on what we are trying to predict, we actually don’t accomplish anything by including the post-treatment feature. In fact, if we add too many useless features, we may incorrectly inflate the standard error of the treatment indicator due to a decrease in degrees of freedom from the extra features.</li>
<li>If the post-treatment feature does have an effect on what we are trying to predict, our coefficient for the treatment indicator feature will remain the same, but the statistical significance of that feature may change. Since we are reducing the amount of total variance in the predictor with a post-treatment feature, the standard error of the treatment indicator feature will decrease.</li>
<li>Ultimately, if we are absolutely sure that the treatment will have no influence on some feature, it should be safe to add the post-treatment feature, but in practice, it is hard to make and prove such a statement.</li>
</ul></li>
<li>The treatment does have an effect on the post-treatment feature.
<ul><li>First, this violates one of our previous requirements since the treatment indicator is no longer independent of all other features.</li>
<li>Let’s also take an example from the literature to illustrate the problems with doing this in more detail. Let’s suppose our treatment is a Yelp advertising tutorial, and we are trying to measure the effect the tutorial has on businesses purchasing advertisements. Our post-treatment feature will be a sentiment score for each business towards Yelp, and for this scenario, assume that the Yelp advertising tutorial does lead to higher sentiment scores. This example is adapted from <a href="https://doi.org/10.7910/DVN/EZSJ1S">Montgomery et al</a>:</li>
</ul></li>
</ol><p><strong>Scenario 1</strong>: Sentiment scores have a relationship with purchasing ads (let’s assume a positive one). If this is the case and we include it as a feature, the higher ads purchase rate will be attributed to higher sentiment scores rather than to the treatment, causing the coefficient of the treatment indicator of the OLS model to be biased. Simply removing the feature will correctly attribute the higher levels of ads purchases to the treatment label feature.</p><p><strong>Scenario 2</strong>: Sentiment score does not have any effect on purchasing ads. While it may seem harmless to include post-treatment sentiment scores as a feature, it actually is not. Let’s say there is a confounding feature such as business age, where older businesses tend to have higher rates of purchasing Yelp ads and higher sentiment scores towards Yelp.</p><ul><li>For businesses with higher sentiment scores, there are now two possibilities: they belong to an older business demographic, or they received the treatment. All else being equal, businesses that are older will have higher purchase rates than those that received the treatment. This will cause our OLS model to falsely associate the treatment with negatively impacting ads purchase rates when we hold the post-treatment sentiment score feature equal.</li>
<li>Note that if we include business age as a feature in the OLS, this would no longer be an issue. However, because we cannot identify every such unknown confounder, we will always face the possibility of a biased coefficient on our treatment indicator if we decide to include post-treatment features.</li>
</ul><p>We also want to note that stale features can be problematic as well: features that are too old are no longer informative predictors of our dependent variable, which in turn weakens the variance reduction in our experiment analysis (see Caveat 3 above).</p><p>This highlights the importance of having time-travelable features, or more specifically, an ETL with the ability to generate event-based features. There exist numerous online resources (e.g. from <a href="https://netflixtechblog.com/distributed-time-travel-for-feature-generation-389cccdd3907">Netflix</a>) discussing the benefits of time travelability and feature logging, but those primarily focus on how this infrastructure supports robust training processes for machine learning (ML) models. Time travelability allows ML practitioners to generate features as of prediction time, since features generated any later would result in label leakage.</p><p>What these articles don’t cover, and what we have done at Yelp, is leverage the same ETL to generate pre-treatment feature sets, since we know exactly when treatment occurs for our population (essentially, replace prediction time with treatment time in the ETL). This allows a proper setup of pre-treatment features if we decide to use an OLS regression in an experiment analysis.</p><h2 id="conclusion">Conclusion</h2><p>TL;DR: Using an OLS regression may be superior to a t-test for interpreting experiment results!</p><ul><li>The simplest form of an OLS regression is equivalent to a t-test, where the only feature is the treatment indicator label.</li>
<li>The more variance introduced into an experiment, which happens naturally in the real world, the more likely it is that a t-test will not be sufficient.</li>
<li>With an OLS regression, we can also leverage interaction terms if there is reason to believe that treatment will affect separate populations differently.</li>
<li>Of all the criteria of using an OLS regression, we would like to emphasize the importance of not using post-treatment features as this can significantly distort the interpretations of treatment effects.</li>
<li>Overall, an OLS regression can more accurately capture treatment effects on specific segments of your population, and with significantly less experiment run time, when we have highly predictive pre-treatment features.</li>
</ul><p>As a side note, we would also like to call out that this is not the first time this technique has been used for experiment analysis. Please see a prior <a href="https://engineeringblog.yelp.com/2021/07/analyzing-experiments-with-changing-cohort-allocations.html">Blog Post</a> by Alexander Levin about how we can use the same technique to account for mixshift changes over the course of an experiment.</p><h2 id="acknowledgments">Acknowledgments</h2><ul><li>Shichao Ma for the idea to try this when analyzing an experiment we designed and ran</li>
<li>Yang Song for reviewing and adding helpful comments</li>
</ul><div class="island job-posting"><h3>Become an Applied Scientist at Yelp!</h3><p>Are you intrigued by data? Uncover insights and carry out ideas through statistical and predictive models.</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/5b9e5f45-b501-447f-857b-72ee24699765/Applied-Scientist-Remote?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2022/06/a-simply-ordinary-reduction.html</link>
      <guid>https://engineeringblog.yelp.com/2022/06/a-simply-ordinary-reduction.html</guid>
      <pubDate>Mon, 27 Jun 2022 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Data Sanitization with Vitess]]></title>
<description><![CDATA[<p>Our community of users will always come first, which is why Yelp takes significant measures to protect sensitive user information. In this spirit, the Database Reliability Engineering team implemented a data sanitization process long ago to prevent any sensitive information from leaving the production environment. The data sanitization process still enables developers to test new features and asynchronous jobs against a complete, real-time dataset without complicated data imports. MySQL and other open source project innovations over the last decade have led us on a journey to Vitess, which is now responsible for over 1500 workflows across more than 100 database schemas that serve the sanitized data needs of all of our developers at the click of a button.</p><h2 id="vitess-concepts">Vitess Concepts</h2><p>The following are excerpts or paraphrases from the vitess.io site and will be helpful to know when seeing these terms used later on:</p><ul><li><a href="https://vitess.io/">Vitess</a> is a database clustering system for horizontal scaling of MySQL</li>
<li><a href="https://vitess.io/docs/14.0/concepts/vstream/">VReplication</a> is a system where a subscriber can indirectly receive events from the binary logs of one or more MySQL instance shards, and then apply them to a target instance</li>
<li><a href="https://vitess.io/docs/14.0/concepts/tablet/">vt-tablet</a> processes connect to a MySQL database, local or remote</li>
<li><a href="https://vitess.io/docs/14.0/concepts/vtctld/">vtctld</a> is an HTTP server useful for troubleshooting or getting a high-level overview of the state of Vitess</li>
</ul><h2 id="why-did-yelp-choose-vitess">Why did Yelp choose Vitess?</h2><p>Yelp began exploring Vitess in late 2019, when the need for new capabilities within our MySQL infrastructure was growing. Data sanitization was the most pressing need at the time, and the newly developed VReplication features would help improve the reliability and scalability of our existing sanitization system. The potential for Vitess to also serve as a data migration tool and multi-version replication medium in the future further tipped the scales in its favor.</p><h2 id="basics-of-our-mysql-setup">Basics of our MySQL Setup</h2><p>MySQL is the primary datastore for all transactional workloads at Yelp. The production environment contains more than 20 distinct replication clusters across cloud datacenters in multiple regions of the United States. Nearly every action a user takes on Yelp will be handled on the backend by MySQL. Our largest three MySQL clusters are responsible for serving over 300,000 queries per second over data measured in the tens of terabytes, not even counting the queries satisfied by the caching in front of them.</p><p>Each MySQL cluster is organized with a single source of row-based replication, depicted in the diagram below as “Primary”. Replication then continues on to an intermediary, which serves as the replication source to all leaves below it. Our leaves can have different roles, and may be consumer-facing or internal-facing. 
The “Replica” role is restricted to the leaf level and serves as the data sanitization source for our development environment.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/2022-05-25-data-sanitization-with-vitess/mysql-cluster-replication-hierarchy.png" class="c1" alt="image" /></div><h2 id="legacy-data-sanitization">Legacy Data Sanitization</h2><p>The ability to query data, test batches, and run developer playgrounds outside of the production environment against sanitized data was first provided using trigger-based standard MySQL 5.5.x Replication (statement-based replication).</p><div class="c4"><img src="https://engineeringblog.yelp.com/images/posts/2022-05-25-data-sanitization-with-vitess/firewall.png" class="c3" alt="image" /></div><p>Statement-based sanitization was inherently flawed, but usable as a rough approximation of production for many years. When rows are written, triggers on the sanitized database replica match patterns for fields such as addresses, emails, or names and obfuscate the values in a variety of ways.</p><p>Trigger-based sanitization came in various forms, the simplest of which was to clear the column, and then clear the column continuously going forward:</p><figure class="code"><figure class="highlight"><pre class="language-sql" data-lang="sql">UPDATE user SET last_name = '' ;
DROP TRIGGER IF EXISTS user_insert ;
DELIMITER ;;
CREATE TRIGGER user_insert BEFORE INSERT ON user FOR EACH ROW BEGIN SET NEW.last_name = '' ; END ;;
DELIMITER ;
DROP TRIGGER IF EXISTS user_update ;
DELIMITER ;;
CREATE TRIGGER user_update BEFORE UPDATE ON user FOR EACH ROW BEGIN SET NEW.last_name = '' ; END ;;
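-- Illustrative addition, not from the original system: other trigger rules
-- obfuscated values instead of clearing them, e.g. masking a hypothetical
-- email column in place (shown commented out):
-- CREATE TRIGGER user_email_update BEFORE UPDATE ON user FOR EACH ROW
--   BEGIN SET NEW.email = CONCAT('user', NEW.id, '@example.com') ; END ;;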
DELIMITER ;</pre></figure></figure><h2 id="flaws-in-the-trigger-based-system">Flaws in the Trigger-based System</h2><p>Among the trigger-based system’s worst flaws was that data correctness, even for the unsanitized columns, was never really achievable. Once data is obfuscated, it cannot always be updated or deleted through statement-based replication in the future.</p><figure class="code"><figure class="highlight"><pre class="language-sql" data-lang="sql">CREATE TABLE user (
  id int NOT NULL AUTO_INCREMENT,
  first_name varchar(32) COLLATE utf8_unicode_ci DEFAULT NULL,
  last_name varchar(32) COLLATE utf8_unicode_ci DEFAULT NULL,
  PRIMARY KEY (id),
  KEY last_name_idx (last_name)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci</pre></figure></figure><p>To illustrate, take this simple example, with the user_insert trigger in place and the simplified table structure above:</p><figure class="code"><figure class="highlight"><pre class="language-sql" data-lang="sql">INSERT INTO user (first_name,last_name) VALUES ('john','smith');
TRIGGER ACTION: BEFORE INSERT ON user FOR EACH ROW BEGIN SET NEW.last_name = '' ;
UPDATE user SET first_name = 'james' WHERE first_name = 'john' and last_name = 'smith';
No Rows Affected</pre></figure></figure><p>The result of the trigger sanitization is that the statement no longer matches as it would on an unsanitized host, effectively leaving rows impossible to reference in this manner.</p><p>Multiple terabytes of data had to be intermittently rebuilt from scratch due to infrastructure failures, migrations to the cloud, functionally sharding the source cluster, and version upgrades. When executing this process of copying, then sanitizing, and applying triggers, the engineering time required rose dramatically from hours to days and, towards the end of this implementation’s life, to a full week. To reduce manual intervention, backups were enabled so that these hosts rarely needed to be rebuilt from scratch. Testing was implemented to ensure the backups worked, but even so they became increasingly unwieldy as they were less and less able to use our standard tooling. Innovations in MySQL enabled by upgrading to newer versions eventually led to the failure of the trigger system in standard MySQL, as triggers do not execute on the replicas in row-based replication (RBR).</p><h2 id="mariadb-workaround-for-trigger-based-system">MariaDB Workaround for Trigger-based System</h2><p>Having overlooked the inability of triggers to execute on replicas, we quickly pivoted to find an alternative upon the rollout of RBR across our fleet. MariaDB proved to be a serviceable option, providing the ability to execute triggers on row-based events.</p><p>The downside to running with MariaDB, which we did for just over a year, was the necessity of maintaining two versions of every tool. 
While largely compatible with MySQL, the MariaDB tools subtly renamed a lot of the commands, implemented backups a little bit differently, and required maintaining two versions of packages.</p><h2 id="vitess-setup">Vitess Setup</h2><p>Our Vitess deployment consists of more than 2000 vt-tablets deployed across dozens of machines residing in our dev, staging, and production environments. These vt-tablets are responsible for VReplication of over 6000 distinct workflows that materialize data from one database instance to another that share no traditional MySQL replication. Several hundred of these vt-tablets are responsible for over 1500 workflows involved in the data sanitization process.</p><p>Much of our core setup is off the shelf, and the best resource for deploying Vitess can be found <a href="https://vitess.io/docs/get-started/operator/">here</a>. The implementation we went with for our initial deployment of Vitess was couched in the knowledge that we had no consumer-facing use cases, little local knowledge of Vitess, and a likely need to implement and materialize data in the future. Knowing that, and that no sharding was needed for this use-case, we created a slimmed-down deployment to only include vtctld and vt-tablet containers.</p><p>Our tablets all connect to external MySQL databases, and were deployed on dedicated servers for vt-tablets and vtctlds. The tablets were launched in pairs, with a source tablet and a target tablet for each MySQL schema living on the same physical machine to minimize network transit. 
Tablet state is stored in Zookeeper, and actual tablet deployments are coordinated with a scheduled job and static configuration file managed by humans based on resource consumption of different tablets.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-05-25-data-sanitization-with-vitess/tablet-diagram.png" alt="Data flow through tablets" /><p class="subtle-text"><small>Data flow through tablets</small></p></div><p>Another role of MySQL hosts was created for each physical MySQL cluster Vitess would materialize data from, which we denoted as ‘migration’ hosts explicitly to serve the needs of Vitess VReplication. Like the ‘replica’ role, this role is exclusive to the leaf level of the replication hierarchy. The migration role is advertised via Envoy/Smartstack, <a href="https://engineeringblog.yelp.com/2020/11/minimizing-read-write-mysql-downtime.html">the discovery system used at Yelp</a>, and discovered by the appropriate vt-tablets. With its own role, the target (the writable, sanitized server) is discovered the same way by the vt-tablets, and sits in a full-fledged replication hierarchy with automatic failover targets available to maintain uptime and ease maintenance.</p><table><thead><tr><th>Ecosystem</th>
<th>Non-Prod</th>
<th>Prod</th>
</tr></thead><tbody><tr><td>Zookeeper</td>
<td>m5d.xlarge</td>
<td>m5d.xlarge</td>
</tr><tr><td>Tablet Hosts</td>
<td>c6i.4xlarge</td>
<td>c6i.12xlarge</td>
</tr></tbody></table><h2 id="vitess-materialization-logic">Vitess Materialization Logic</h2><p>The logic for data sanitization was previously captured as what to change a column value to after it was already seen in replication, and was not directly compatible with Vitess. Another way of thinking about this is that the unsanitized data was actually replicated to the sanitized server in the relay log, and then modified on write based on the trigger rules. With materialization rules, the unsanitized data is never replicated to the sanitized server; instead, data is retrieved directly from the source in a modified, or custom, fashion. In the process of creating this setup, we iterated over every table and created a purpose-built rule for sanitizing (or not) the data for use by our developers. All of our workflows are stored in a simple git repository for later re-use, such as for re-materialization or schema changes necessitating modification of the custom rules.</p><h3 id="example-of-a-simple-custom-materialization-rule">Example of a simple custom materialization rule:</h3><figure class="code"><figure class="highlight"><pre class="language-jql" data-lang="jql">{
  "workflow": "user_notes_mview",
  "sourceKeyspace": "yelp_source",
  "targetKeyspace": "yelp_target",
  "stop_after_copy": false,
  "tableSettings": [
    {
      "targetTable": "user_notes",
      "sourceExpression": "SELECT id, user_id, 'REDACTED' AS note, note_type, time_created FROM user_notes",
      "create_ddl": "copy"
    }
  ]
}</pre></figure></figure><h3 id="example-of-a-normal-materialization-rule">Example of a normal materialization rule:</h3><figure class="code"><figure class="highlight"><pre class="language-jql" data-lang="jql">{
  "workflow": "user_mview",
  "sourceKeyspace": "yelp_source",
  "targetKeyspace": "yelp_target",
  "stop_after_copy": false,
  "tableSettings": [
    {
      "targetTable": "user",
      "sourceExpression": "SELECT * FROM user",
      "create_ddl": "copy"
    }
  ]
}</pre></figure></figure><p>There are over 1500 materialization rules in place to vreplicate some or all of the tables from over 100 database schemas into one monolithic database from multiple physical source clusters. At any given time there is near real-time VReplication happening between the originating write and the downstream sanitized write for each of the workflows. Co-locating all of the sanitized data was a conscious choice: it provides a single target for the playgrounds our developers connect to, eases management, and in the case of data corruption is simple to re-seed.</p><h2 id="vitess-performance-considerations">Vitess Performance Considerations</h2><p>We learned early on that workflows are not created equal: the more workflows that run on a schema, the more resources the source and target tablets use to manage the binary logs and data streaming. As a result of these heavyweight tablets, we had to scale up our instances and further coordinate which tablets run on which hosts in order to spread the load as evenly as possible. Load wasn’t the only limiting factor either, as running too many containers on a single server can become unstable and will eventually result in dockerd issues. In the final deployment, we are running over 250 tablets and attempt to keep the number of tablets per node to no more than 50 to limit the dockerd issues we encounter. These tablets are always paired, source and target, as seen below.</p><div class="c4"><img src="https://engineeringblog.yelp.com/images/posts/2022-05-25-data-sanitization-with-vitess/tablet-series.png" class="c5" alt="image" /></div><p>For deployments like this, it’s important to understand the impact large numbers of workflows will have on recovery in the event of failure. When enabling workflows you could also encounter throughput issues that are easier to intuit because the data is being actively copied by Vitess. 
Doing these materializations in chunks is an obvious optimization, and largely fixes the issues encountered during the course of standing up a sanitized database as we have done. If instead, though, an existing system fails (the host a tablet runs on dies, the target writable host dies, the service mesh goes down, etc.), recovering is not trivial.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-05-25-data-sanitization-with-vitess/workflow-pie-chart.png" alt="Workflows per database" /><p class="subtle-text"><small>Workflows per database</small></p></div><p>This chart shows the relative number of Vitess workflows per database schema. The bigger the slice, the more workflows.</p><p>We have more than 100 database schemas, and many have few workflows as visualized in the above chart. Upon failure, these smaller sets of workflows are able to rapidly re-read binary logs and pick up where their local state indicates they should. There are also three schemas with upwards of 100 tables, one with nearly 600, and these workflows each must re-establish their positions independently of each other (our workflows are all created 1:1 to tables). When a failure involves hundreds of workflows on one tablet, we found that stopping and starting them in a staggered way (for example, 25 every 3 minutes) can help the system recover to working order where it might never have recovered otherwise.</p><h2 id="vitess-to-the-future">Vitess to the Future</h2><p>With Vitess, Yelp was able to eliminate mountains of technical debt, bring in a tool with boundless potential, and improve the security and speed of our sanitization process. Our old system was no longer scaling, and we started to have lengthy manual maintenance cycles whenever a problem came up. 
Problems with Vitess are easy to fix, and best of all can be automated in most situations.</p><p>We have plans in motion for using k8s <a href="https://github.com/Yelp/paasta">paasta</a> instead of managing the infrastructure directly. Using the standard k8s operator and a more broadly understood deployment will help as we begin to utilize more Vitess components.</p><p>Other projects include one dubbed internally “Dependency Isolation”, where an existing binlog-based data-pipeline system is being moved away from the source clusters to one driven by Vitess. This allows us to decouple our consumer-facing cluster upgrades from the data pipeline databases, and to perform the upgrades consciously and independently. A third project in flight is designed to harness the ability to materialize read-only view tables into different database schemas, a common enough use of Vitess. Providing local read-only views of tables can allow for faster development cycles, and easier extraction of data from our monolith.</p><div class="island job-posting"><h3>Become a Database Reliability Engineer at Yelp</h3><p>Do you want to be a Database Reliability Engineer who builds and manages scalable, self-healing, globally distributed systems?</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/b3e09e7e-736a-4ca0-9d45-6fc6368b2796/Database-Reliability-Engineer-Remote?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2022/06/data-sanitization-with-vitess.html</link>
      <guid>https://engineeringblog.yelp.com/2022/06/data-sanitization-with-vitess.html</guid>
      <pubDate>Wed, 22 Jun 2022 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Beyond Matrix Factorization: Using hybrid features for user-business recommendations]]></title>
<description><![CDATA[<p>Yelp’s mission is to connect people with great local businesses. On the Recommendations &amp; Discovery team, we sift through billions of user-business interactions to learn user preferences. Our solutions power several products across Yelp such as personalized push notifications, email engagement campaigns, the home feed, Collections and more. Here we discuss the generalized user-to-business recommendation model which is crucial to many of these applications.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-04-25-beyond-matrix-factorization-blog/high-level-overview.png" alt="High level overview of our recommendation system." /><p class="subtle-text"><small>High level overview of our recommendation system.</small></p></div><p>Our previous approach for user-to-business recommendation was based on <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.recommendation.ALS.html">Spark’s Alternating Least Squares (ALS)</a> algorithm, which factorized the user-business interaction matrix into user-vectors and business-vectors. By performing a dot-product on top of these vectors we are able to come up with top-k recommendations for each user. We explained the approach in detail in a prior blog - <a href="https://engineeringblog.yelp.com/2018/05/scaling-collaborative-filtering-with-pyspark.html">“Scaling Collaborative Filtering with PySpark”</a>.</p><p>In this blog, we discuss how we switched from a collaborative filtering approach to a <strong>hybrid approach</strong> - which can handle multiple features and be trained on different objectives. The new approach doubled the number of users we could recommend for while also vastly improving performance for all users. 
The main takeaway here is that we were able to achieve these results quickly by having a <strong>clearly defined objective</strong> and following a <strong>cost-efficient design</strong>, which saved huge development costs for our initial Proof of Concept.</p><p>We start by discussing the drawbacks of matrix factorization, followed by some guidelines that shaped our approach. We then present the solution along with the challenges and the improvements gained.</p><p>Matrix factorization learns ID-level vectors for each user and business and requires a good number of user/business level interactions. This leads to a couple of major drawbacks:</p><ol><li>Worse performance on tail users (users who have very few interactions).</li>
<li>An inability to add content-based features such as business reviews, ratings, user segment, etc.</li>
</ol><p>Because of drawback #1, we identify two segments of users - head and tail.</p><ul><li><strong>Head users</strong> have enough interactions with businesses to learn vector representations using the matrix factorization approach.</li>
<li><strong>Tail users</strong> have very few interactions and suffer from the cold-start problem. They were excluded from matrix factorization which resulted in better performance on head users and also made the approach more scalable.</li>
</ul><p>The solution for drawbacks 1 and 2 is to use a hybrid approach which uses content-based features in addition to interaction features. In the evaluation section, we show how content and collaborative features could play different roles for these user types in a hybrid model which results in a better model performance.</p><p>In our initial exploration phase we considered approaches like <a href="http://staff.ustc.edu.cn/~hexn/papers/www17-ncf.pdf">Neural Collaborative Filtering</a>, a <a href="https://www.tensorflow.org/recommenders">Two tower model</a>, a <a href="https://www.tensorflow.org/api_docs/python/tf/keras/experimental/WideDeepModel">WideDeep model</a> and <a href="https://snap.stanford.edu/graphsage/">GraphSage</a>. Even though implementations for these approaches were readily available, we found them to be either hard to scale for our problem size or poorly performant when used off the shelf.</p><p>To be cost-efficient and gather early feedback, we took an iterative approach towards building a custom solution. We set the following guidelines to adhere to the <strong>iterative design</strong>:</p><ul><li><strong><em>Model infrastructure first:</em></strong> Build a training, evaluation and prediction pipeline with a few clearly defined objectives.</li>
<li><strong><em>Reduce dev-effort when you can:</em></strong> Use a supervised technique like <a href="https://xgboost.readthedocs.io/en/stable/">XGBoost</a> (or <a href="https://en.wikipedia.org/wiki/Logistic_regression">Logistic Regression</a>) which were better supported by our ML infrastructure team.</li>
<li><strong><em>Know your friend:</em></strong> Replacing matrix factorization seemed like a farther out goal as it is known to work pretty well. So instead of replacing it, we planned to build our hybrid model on top of it by taking its scores as one of the key features.</li>
<li><strong><em>Gain more friends:</em></strong> Enrich signals used by the recommender by deriving a good set of content-based features.</li>
</ul><p>We used a <strong>supervised <a href="https://en.wikipedia.org/wiki/Learning_to_rank">learning to rank technique</a></strong> to combine both the content and collaborative approaches. The entire approach is summarized in the diagram below:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-04-25-beyond-matrix-factorization-blog/diagram-of-hybrid-ranking-approach.png" alt="A diagram of our hybrid recommendation approach." /><p class="subtle-text"><small>A diagram of our hybrid recommendation approach.</small></p></div><p>Similar to many other machine learning projects, we anticipated feature engineering to play a key role. Hence, most of our effort went into building a good set of features for the model to learn from. Features were extracted at the user-business level for a specific date which marks the end of the feature period.</p><p>The set of features can be categorized into two major buckets:</p><ul><li><strong>Interaction features:</strong> Include output affinity scores from matrix factorization and aggregates for different interaction types at user, business and user ✕ business level.</li>
<li><strong>Content-based features:</strong> Include features like categories of a business, review rating, review count, user type, user metadata, etc. Apart from general content features, we also added a <strong>text-based similarity</strong> feature computed between a user and a business.</li>
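As a toy illustration of such a text-based similarity feature (a hypothetical sketch in Python with numpy; the production feature is built from Universal Sentence Encoder review embeddings, as described below): review vectors are pooled into a business vector, the vectors of a user’s interacted businesses are pooled into a user vector, and the feature is their cosine similarity.

```python
# Hypothetical sketch (not Yelp's code) of a pooled-embedding text similarity
# feature between a user and a business.
import numpy as np

def mean_pool(vectors):
    """Average-pool a list of embedding vectors into one vector."""
    return np.mean(np.asarray(vectors, dtype=float), axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-d "review embeddings" standing in for sentence-encoder output
biz_a = mean_pool([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])  # business-level vector
biz_b = mean_pool([[0.0, 1.0, 0.0]])
user = mean_pool([biz_a])  # this user has interacted with business A only

# The user should look more similar to business A than to business B
assert cosine(user, biz_a) > cosine(user, biz_b)
```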
</ul><p>We derived the text-based similarity feature from Yelp’s business reviews. Reviews are encoded with a <strong><a href="https://tfhub.dev/google/universal-sentence-encoder-large/3">Universal Sentence encoder</a></strong> and later aggregated at the business level by either <strong>max or average <a href="https://d2l.ai/chapter_convolutional-neural-networks/pooling.html">pooling</a></strong>. The business-level embeddings were then aggregated at the user level by associating a user with all the businesses they have interacted with. The text-based similarity is computed as a cosine similarity between the user-level and the business-level aggregate embeddings. This feature turned out to be the most important content-based feature, as discussed in the evaluation section.</p><p>With all these features, we need an objective to optimize for, which we discuss below.</p><p>As we aimed to come up with a <strong>personalized ranked order of businesses</strong>, we used <a href="https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG">Normalized Discounted Cumulative Gain (NDCG)</a> as our primary metric. The relevance level for NDCG was defined based on the <strong>strength of interaction</strong> between a user and a business. For example, the gain from business views could be 1.0 as it’s a low-intent interaction, whereas bookmarks could be 2.0 as they are a stronger-intent interaction. To ensure there isn’t any label leakage, we made sure there is a clear separation between the time periods that features and labels were generated from.</p><p>To optimize for NDCG, we relied on <a href="https://xgboost.readthedocs.io/en/stable/">XGBoost’s</a> <strong><code class="language-plaintext highlighter-rouge">rank:ndcg</code></strong> objective, which internally uses the <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/MSR-TR-2010-82.pdf">LambdaMART</a> approach. 
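The graded-relevance NDCG described above can be sketched as follows (a hypothetical Python snippet, not Yelp's implementation; the relevance grades, such as 1.0 for a view and 2.0 for a bookmark, follow the post's example):

```python
# Hypothetical sketch (not Yelp's code) of NDCG with graded relevance labels.
import numpy as np

def dcg(relevances):
    relevances = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, len(relevances) + 2))  # log2(rank + 1)
    return float(np.sum((2.0 ** relevances - 1.0) / discounts))

def ndcg(relevances_in_ranked_order):
    ideal = sorted(relevances_in_ranked_order, reverse=True)
    best = dcg(ideal)
    return dcg(relevances_in_ranked_order) / best if best > 0 else 0.0

# A perfect ranking (bookmark, view, nothing) scores 1.0; reversing it scores less
assert ndcg([2.0, 1.0, 0.0]) == 1.0
assert ndcg([0.0, 1.0, 2.0]) < 1.0
```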
One thing that’s worth mentioning here is how we defined “groups” for the ranking task. XGBoost uses the group information to construct a pairwise loss where two training rows from the same group are compared against each other. Since our objective was to get personalized recommendations based on where the user is located, we defined our <strong>group based on both user and location</strong>. We use the same definition of groups when evaluating our models.</p><h2 id="negative-sampling">Negative sampling</h2><p>Since we are using supervised learning, our model will be most effective if it has both positive and negative interaction examples to learn from. For the hybrid user-business model, we don’t have negatives as most of the implicit user-business interactions (e.g. get directions, visit the website, order food, etc.) are positive. Deriving an implicit negative interaction like a user viewing but not interacting with a business is tricky and can be heavily biased, as it depends on what businesses were shown to them (sample bias), how they were shown to them (presentation bias), and so forth. Handling these biases is usually product-specific (e.g. the presentation biases for search vs. recommendation could be very different), which makes it harder to build a generalized user-to-business model. A more generalized approach for negative sampling would be to consider all non-interacted businesses as negatives. In fact, some of the common techniques to fetch negatives involve subsampling the non-positive candidates either randomly or based on popularity (see <a href="http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/">word2vec negative sampling</a>).</p><p>We take a generalized approach to negative sampling but we also introduce a <strong>recall-step</strong>. Candidate businesses are recalled using specific selection criteria like user preferences, location radius, etc., and only the recalled candidates are used to train the model. 
Candidates can be labeled as either positive or negative based on whether they had future interactions. This approach worked well for our use case for a couple of reasons:</p><ul><li>A recall step means we only evaluate a few candidates per user, which is more efficient and allows us to scale predictions up to millions of users and businesses.</li>
<li>A recall step ensures the relationship with the label is learned without bias, as long as a similar recall strategy is used at both training and prediction time. Common negative sampling techniques rely on resampling negatives multiple times during training (e.g. at each training iteration) to reduce bias, but this approach can be difficult to implement with supervised training like XGBoost.</li>
</ul><p>During training, we used a special type of recall that allowed the model to learn generic preferences instead of being very application specific. Users were associated with a sampled set of locations from their past history and a user’s top-k businesses for the locations were recalled using matrix factorization scores or business popularity. We downsampled the user, location pairs and businesses to make the training data size manageable.</p><h2 id="scaling-predictions">Scaling predictions</h2><p>Prediction is an intensive job where we need to identify top-k recommendations for tens of millions of users and businesses. When using the matrix factorization based approach we <a href="https://engineeringblog.yelp.com/2018/05/scaling-collaborative-filtering-with-pyspark.html">scaled the naive approach of evaluating all pairs of dot products</a> using numpy BLAS optimizations and a file-based broadcast on Pyspark. We couldn’t use the same approach here as both feature computation and XGBoost model evaluation are more expensive than just doing a dot-product.</p><p>To speed up prediction we added a <strong>recall step</strong> based on the downstream product application. These applications restrict recommendation candidates based on the following criteria:</p><ul><li><strong><em>User location:</em></strong> For localized recommendations we need to consider only businesses near the city or neighborhood where the user is located.</li>
<li><strong><em>Product level constraints:</em></strong> Candidate businesses are further restricted by category or attribute constraints based on the product application (e.g. new restaurants for the Hot &amp; New business push-notification campaign, businesses with Popular Dishes for the Popular Dish push-notification campaign, etc.) These criteria let us narrow down the set of user-business pairs for which the model needs to be evaluated, thereby making the predictions more scalable.</li>
</ul><h2 id="evaluation">Evaluation</h2><p>To prove to ourselves that the hybrid approach works better, we evaluated the models offline based on historical data. We also evaluated the model subjectively by running a survey among a few Yelp employees who were tasked with rating recommendation rankings from different approaches. Both these evaluations suggested that the new hybrid approach performs much better than the baseline approaches. Here, we share the metrics-based results.</p><p>Since this is a ranking task, we chose <strong><a href="https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG">Normalized Discounted Cumulative Gain</a> (NDCG)</strong> and <strong><a href="https://queirozf.com/entries/evaluation-metrics-for-ranking-problems-introduction-and-examples#map-mean-average-precision">Mean average precision</a> (MAP)</strong> as metrics. The hybrid approach was compared against a couple of baselines:</p><ol><li>Popular businesses in the user’s location - available for both head and tail users</li>
<li>Matrix factorization - available only for head users</li>
</ol><p>In order to mimic production settings using historical data, we created test sets which are in the future of the model’s training period (i.e. both feature generation period and label period were shifted into the future for the test set).</p><p>At first, we look at the relative improvement from the business popularity baseline at different values of rank (i.e. rank k=1, 3, 5, 10, 20, 30, .., 100). We find that the model <strong>more than doubles</strong> the NDCG and MAP metrics compared to a “locally popular” baseline at k=1!</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-04-25-beyond-matrix-factorization-blog/performance-against-business-popularity.png" alt="Relative percentage improvement of hybrid approach vs. the popularity baseline. We see positive improvements overall. At k=100 and user_type=head we see an improvement of 30% in the NDCG metric and 81% improvement in the MAP metric. At k=100 and user_type=tail we see a 20% improvement in NDCG metric and 52% improvement in the MAP metric." /><p class="subtle-text"><small>Relative percentage improvement of hybrid approach vs. the popularity baseline. We see positive improvements overall. At k=100 and user_type=head we see an improvement of 30% in the NDCG metric and 81% improvement in the MAP metric. At k=100 and user_type=tail we see a 20% improvement in NDCG metric and 52% improvement in the MAP metric.</small></p></div><p>When comparing with the matrix factorization baseline, the improvement at different ranks (k) roughly ranges between <em>5-14</em>% for NDCG and <em>10-13</em>% for MAP.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-04-25-beyond-matrix-factorization-blog/performance-against-matrix-factorization.png" alt="Relative percentage improvement of hybrid approach vs. the matrix factorization baseline. We see positive improvements overall. 
At k=100 and user_type=head we see an improvement of 5% in the NDCG metric and 9.8% improvement in the MAP metric." /><p class="subtle-text"><small>Relative percentage improvement of hybrid approach vs. the matrix factorization baseline. We see positive improvements overall. At k=100 and user_type=head we see an improvement of 5% in the NDCG metric and 9.8% improvement in the MAP metric.</small></p></div><h2 id="content-vs-collaborative-signals-for-head--tail-users">Content vs Collaborative signals for head &amp; tail users</h2><p>For a hybrid model to work effectively, it should use both <strong>content and collaborative signals to achieve the best of both worlds</strong>. For head users with a good number of collaborative signals it should rely more on these signals whereas for tail users it should rely more on content-based features. We wanted to validate whether this was indeed happening in our hybrid model.</p><p>To perform this analysis, we picked <strong>representative features for content and collaborative signals</strong>. Review text based similarity and matrix factorization score were the top features in the model and it made sense to pick these as representative features. We use <a href="https://christophm.github.io/interpretable-ml-book/pdp.html">Partial Dependence plots</a> (PDPs) against these features which shows the average prediction on the entire dataset when a feature is set to a particular value.</p><p>First, we plot PDP for head vs. tail users against feature percentiles of review text similarity. The plot below shows percentiles of the review text similarity feature (content similarity) on the x-axis and the average of prediction along with spread on y-axis. 
We see that tail users have a stronger relationship against this feature which indicates that the model relies more on the content-based feature for tail users.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-04-25-beyond-matrix-factorization-blog/pdp-text-similarity-head-vs-tail.png" alt="Partial dependence plot (PDPs) with content similarity percentile on x-axis and average prediction on the y-axis. We plot the PDPs for head vs. tail users separately. The plot shows a stronger relation for tail users." /><p class="subtle-text"><small>Partial dependence plot (PDPs) with content similarity percentile on x-axis and average prediction on the y-axis. We plot the PDPs for head vs. tail users separately. The plot shows a stronger relation for tail users.</small></p></div><p>Since matrix factorization scores are available only for head users, we plot PDP against collaborative vs. content for only head users. The plot below shows percentiles of content or collaborative features on the x-axis and average and spread of predictions on the y-axis.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-04-25-beyond-matrix-factorization-blog/pdp-content-vs-collaborative-for-head.png" alt="Partial dependence plot (PDPs) against feature percentiles. We overload the x-axis with percentiles from two features and plot two PDPs in the same plot - text similarity feature (content) and matrix factorization score (collaborative). The confidence spread shows a stronger relation for the collaborative feature as the spread narrows at higher percentiles." /><p class="subtle-text"><small>Partial dependence plot (PDPs) against feature percentiles. We overload the x-axis with percentiles from two features and plot two PDPs in the same plot - text similarity feature (content) and matrix factorization score (collaborative). 
The confidence spread shows a stronger relation for the collaborative feature as the spread narrows at higher percentiles.</small></p></div><p>We see that both the content and collaborative features are strongly related to the user business relevance prediction, which means that both features are used effectively in the hybrid model. The collaborative feature has a stronger relationship as the prediction spread narrows at higher percentiles. This suggests that head users have a detailed enough browsing history that lets us learn user specific preferences, for example that a user is vegetarian but doesn’t particularly like Thai food.</p><p>The above plots confirm our initial thoughts of how <strong>content and collaborative features can play different roles for different user types</strong> in the hybrid model.</p><ul><li><strong><em>Write down your objective:</em></strong> Recommendation is a vast space and there are a lot of approaches one could take to improve it. Our initial exploration phase had a lot of uncertainty. However, writing down our specific goals to “Provide model-based recommendations for tail users” and “Enable support for more content-based features” gave us the focus we needed to improve our models and made it easier to get buy-in from the product team.</li>
<li><strong><em>Set up model training infrastructure early:</em></strong> In the beginning, it was hard to debug and iterate with several copies of code scattered across ad hoc notebooks. Once we built out the first version of the training pipeline to include feature ETLs, sampling and label strategies, it was easy to iterate on each of these components separately.</li>
<li><strong><em>Think about evaluation early:</em></strong> We set up the baselines based on matrix factorization and business popularity very early in our model development. This made it easy to compare results against these baselines and iterate on the modeling and training phases until we beat them.</li>
<li><strong><em>Use subjective evaluation in conjunction:</em></strong> In addition to objective metrics, it is important to look at individual recommendations, feature importances and PDPs to make a better judgment. At one point, we had an issue with negative sampling where all negative samples came from matrix factorization top-k which made the model learn a negative relationship with respect to this feature. It’s hard to debug these issues without the help of model debugging tools.</li>
</ul><p>Switching to a hybrid approach was a major change in our user to business recommendation system. In this blog, we documented our journey in developing this new approach and are glad to see big improvements. We plan to run several additional A/B experiments for push notification and email notification campaigns to confirm that these improvements translate to better user experiences. Given the current infrastructure, we feel more confident to try more complex models based on neural networks.</p><p>If you are inspired by recommender systems, please check out the <a href="https://www.yelp.careers/us/en">careers page</a>!</p><p>This blog was a team effort. I would like to thank Blake Larkin, Megan Li, Kayla Lee, Ting Yang, Thavidu Ranatunga, Eric Hernandez, Jonathan Budning, Kyle Chua, Steven Chu and Sanket Sharma for their review and suggestions. Special thanks to Parthasarathy Gopavarapu for working on generating text-based embeddings.</p><div class="island job-posting"><h3>Become an ML Engineer at Yelp</h3><p>Want to build state of the art machine learning systems at Yelp? Apply to become a Machine Learning Engineer today.</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/e9a3e447-7271-431d-b8d3-29168c9c01ef/Software-Engineer-Machine-Learning-Remote?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2022/04/beyond-matrix-factorization-using-hybrid-features-for-user-business-recommendations.html</link>
      <guid>https://engineeringblog.yelp.com/2022/04/beyond-matrix-factorization-using-hybrid-features-for-user-business-recommendations.html</guid>
      <pubDate>Mon, 25 Apr 2022 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Kafka on PaaSTA: Running Kafka on Kubernetes at Yelp (Part 2 - Migration)]]></title>
      <description><![CDATA[<p>In a <a href="https://engineeringblog.yelp.com/2021/12/kafka-on-paasta-part-one.html">previous post</a> we detailed the architecture and motivation for developing our new <a href="https://engineeringblog.yelp.com/2015/11/introducing-paasta-an-open-platform-as-a-service.html">PaaSTA</a>-based deployment model. We’d now like to share our strategy for seamlessly migrating our existing Kafka clusters from <a href="https://aws.amazon.com/pm/ec2/">EC2</a> to our <a href="https://kubernetes.io/">Kubernetes</a>-based internal compute platform. To help facilitate the migration, we built tooling which interfaced with various components of our cluster architecture to ensure that the process was automated and did not impair clients’ ability to read or write Kafka records.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-03-03-kafka-on-paasta-part-two/ec2_to_paasta.png" alt="Migrating Kafka on EC2 to Kafka on PaaSTA" /><p class="subtle-text"><small>Migrating Kafka on EC2 to Kafka on PaaSTA</small></p></div><h2 id="background">Background</h2><p>In the status quo implementation, EC2-backed Kafka brokers within a cluster were associated with an <a href="https://docs.aws.amazon.com/autoscaling/ec2/userguide/AutoScalingGroup.html">auto scaling group</a> (ASG). Attached to each ASG was an <a href="https://aws.amazon.com/elasticloadbalancing/">Elastic Load Balancer</a> (ELB) which facilitated all connections to the cluster and acted as an entrypoint. Several auxiliary services and jobs also accompanied each cluster, but most of these were already deployed on PaaSTA. However, some important management systems ran directly on Kafka servers as cron jobs. Of particular importance for this redesign were the <a href="https://github.com/Yelp/kafka-utils/blob/master/kafka_utils/kafka_cluster_manager/cmds/rebalance.py">cluster rebalance algorithm</a> and the topic auto partitioning algorithm. 
The rebalance algorithm attempts to evenly distribute partitions and leaders across the brokers of the cluster, while the auto partitioning algorithm automatically sets topic partition counts based on throughput metrics. Since we were already planning on incorporating Cruise Control in our architecture, now was a good time to migrate to a new rebalancing algorithm.</p><p>Thus, the three critical components we focused on replacing during this migration were the cluster entrypoint, the cluster balancing algorithm, and the topic auto partitioning algorithm. We didn’t need to look far for a replacement to the ELB since PaaSTA natively provides load balancing capabilities through Yelp’s service mesh, which makes it simple to advertise the Kafka on Kubernetes containers which compose a cluster. In the status quo EC2 scenario we also ran a custom rebalance algorithm on Kafka hosts, but this was ultimately replaced by Cruise Control (see <a href="https://engineeringblog.yelp.com/2021/12/kafka-on-paasta-part-one.html">part 1</a> for more details on this service) which exposed comparable functionality. Finally, our <a href="https://puppet.com/">Puppet</a>-based cron job running a topic auto partitioning script was replaced with a similar <a href="https://github.com/Yelp/Tron">Tron</a> job running on PaaSTA. Below is a table providing an overview of the different components across the deployment approaches.</p><table><thead><tr><th>Component</th>
<th>EC2</th>
<th>PaaSTA</th>
</tr></thead><tbody><tr><td>Cluster Entrypoint</td>
<td><a href="https://aws.com">ELB</a></td>
<td>Yelp’s service mesh</td>
</tr><tr><td>Cluster Balance</td>
<td><a href="https://github.com/Yelp/kafka-utils/blob/master/kafka_utils/kafka_cluster_manager/cmds/rebalance.py">rebalance algorithm in kafka-utils</a></td>
<td><a href="https://github.com/linkedin/cruise-control">Cruise Control</a></td>
</tr><tr><td>Topic Auto Partitioning</td>
<td>cron job (Puppet-based)</td>
<td><a href="https://github.com/Yelp/Tron">Tron</a> job</td>
</tr></tbody></table><figure class="code"><figcaption class="c1">Table of Components Used by Each Deployment Approach</figcaption></figure><p>Since we would not be migrating all of our clusters simultaneously, we wanted to avoid the need to make significant changes to our Kafka cluster discovery configuration files. For additional context, at Yelp we use a set of <code class="language-plaintext highlighter-rouge">kafka_discovery</code> files (generated by Puppet) which contain information about each cluster’s bootstrap servers, <a href="https://zookeeper.apache.org/">ZooKeeper</a> chroot, and other metadata. Many of our internal systems (such as <a href="https://github.com/Yelp/schematizer">Schematizer</a> and <a href="https://engineeringblog.yelp.com/2020/01/streams-and-monk-how-yelp-approaches-kafka-in-2020.html">Monk</a>) rely on the information in these files. This migration strategy entailed updating only the broker_list to point to the service mesh entrypoint, thereby retaining compatibility with our existing tooling. We did take this migration as an opportunity to improve the propagation method by removing Puppet as the source of truth and instead opted to use srv-configs (the canonical place for configurations used by services). An example discovery file is shown below:</p><div class="language-plaintext highlighter-rouge highlight"><pre>&gt;&gt; cat /kafka_discovery/example-cluster.yaml
---
clusters:
  uswest1-devc:
        broker_list:
        - kafka-example-cluster-elb-uswest1devc.&lt;omitted&gt;.&lt;omitted&gt;.com:9092
        - kafka-example-cluster-elb-uswest1devc.&lt;omitted&gt;.&lt;omitted&gt;.com:9092
        zookeeper: xx.xx.xx.xxx:2181,xx.xx.xx.xxx:2181,xx.xx.xx.xxx:2181/kafka-example-cluster-uswest1-devc
local_config:
  cluster: uswest1-devc
  ...
</pre></div><h2 id="migration-strategy-overview">Migration Strategy Overview</h2><p>At a high level the goal of the migration was to seamlessly switch from using EC2-compatible components to using PaaSTA-compatible components without incurring any downtime for existing producer and consumer clients. As such, we needed to ensure that all the new components were in place <em>before</em> migrating any data from EC2-based brokers to PaaSTA based-brokers. We also wanted to minimize the amount of engineering time required for the migrations, so we implemented some tools to help automate the process. Finally, we needed to ensure that this process was thoroughly tested and rollback-safe.</p><p>The first step of the migration process was to set up a PaaSTA-based load balancer for each of our Kafka clusters, which could also be used to advertise EC2-based brokers. This exposed two distinct methods of connecting to the Kafka cluster: the existing ELB and the new service mesh proxy which would be used for the PaaSTA-based brokers during and after the migration. This entailed updating the aforementioned <code class="language-plaintext highlighter-rouge">kafka_discovery</code> files to include the alternate connection method, and we also devised a new way to propagate these files with a cron job rather than rely on Puppet. As alluded to in the prior post, reducing our reliance on Puppet helped us halve the time to deploy a new Kafka cluster since we could alter and distribute these configuration files much more quickly. After this was done we also invalidated any related caches to ensure that no clients were using the outdated cluster discovery information. 
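A client of these discovery files might resolve its bootstrap servers along these lines. This is an illustrative sketch assuming PyYAML; the hostnames are placeholders, not real Yelp endpoints:

```python
import yaml  # PyYAML

# Trimmed-down stand-in for a kafka_discovery file like the one shown above.
DISCOVERY = """\
clusters:
  uswest1-devc:
    broker_list:
    - kafka-example-cluster-elb-uswest1devc.example.com:9092
    zookeeper: zk1:2181,zk2:2181/kafka-example-cluster-uswest1-devc
local_config:
  cluster: uswest1-devc
"""

config = yaml.safe_load(DISCOVERY)
cluster = config["local_config"]["cluster"]
brokers = config["clusters"][cluster]["broker_list"]
bootstrap = ",".join(brokers)  # bootstrap-servers string for a Kafka client
```

Because clients only read `broker_list`, repointing that field at the service mesh entrypoint migrates them without any code changes on their side.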
Below is a set of figures illustrating this process during the migration:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-03-03-kafka-on-paasta-part-two/migration_example.gif" alt="Cluster Connection Migration" /><p class="subtle-text"><small>Cluster Connection Migration</small></p></div><p>Next, we deployed a dedicated instance of Cruise Control for the cluster, with <a href="https://docs.cloudera.com/cdp-private-cloud-base/7.1.6/cctrl-overview/topics/cctrl-self-healing.html">self-healing</a> <strong>disabled</strong>. We didn’t want multiple rebalance algorithms to run simultaneously, and since the self-healing algorithm is able to rebalance the cluster, we prevented Cruise Control from automatically moving topic partitions. After this we created a PaaSTA instance for the cluster, except we explicitly disabled the Kafka Kubernetes <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/operator/">operator’s</a> use of Cruise Control. For an EC2 cluster with <em>N</em> brokers we then added an additional <em>N</em> PaaSTA-based brokers, effectively doubling the cluster size during the migration.</p><p>After the new PaaSTA brokers were online and healthy, the cluster had an equal number of EC2 brokers and PaaSTA brokers. We also enabled metrics reporting by creating the <a href="https://github.com/linkedin/cruise-control/blob/fb13240bc5759b30720339c27fdc3a04b8544c23/config/cruisecontrol.properties#L49-L50">__CruiseControlMetrics</a> topic and setting up the appropriate configs prior to each migration. To retain control over when partitions would be moved, we disabled our status quo automated rebalance algorithm. At this point we were ready to start moving data away from the EC2 brokers and leveraged Cruise Control’s API to remove them. Note that this API only moves partitions away from the specified brokers and does not actually decommission the hosts. 
We continued to <a href="https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API_RecordLifecycleActionHeartbeat.html">send heartbeats for EC2 lifecycle actions</a> throughout the migration procedure since the autoscaling group associated with the EC2 brokers would persist until the end of the migration process. Below is a figure illustrating the state of each component throughout the migration:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-03-03-kafka-on-paasta-part-two/migrate_rebalance.gif" alt="Migrating from Conditional Rebalance Script to Cruise Control" /><p class="subtle-text"><small>Migrating from Conditional Rebalance Script to Cruise Control</small></p></div><p>Rather than manually issue broker removal requests, we built a rudimentary migration helper service to check the cluster state, repeatedly issue requests to the Cruise Control REST API, and remove EC2 brokers one by one. After Cruise Control finished moving all partition data away from the EC2 brokers and onto the PaaSTA brokers, we were ready to terminate the EC2 brokers. This was accomplished by shrinking the size of the ASG from <em>N</em> to 0 and by removing references to the old EC2 ELBs in our configuration files. Since we use <a href="https://www.terraform.io/">Terraform</a> to manage AWS resources, the rollback procedure was as simple as a <code class="language-plaintext highlighter-rouge">git revert</code> to recreate the resources. After the EC2 brokers had been decommissioned, we removed the instance of our decommission helper service and enabled self-healing in the cluster’s Cruise Control instance. This was now safe to do since the cluster was composed entirely of PaaSTA-based brokers. At this point the cluster migration was complete, and the remaining work entailed cleaning up any miscellaneous AWS resources (autoscaling SQS queues, ASGs, ELBs, etc.) 
after deeming it safe to do so.</p><h2 id="risks-rollbacks-and-darklaunches">Risks, Rollbacks, and Darklaunches</h2><p>While we strove to optimize safety over migration speed, there were naturally still some risks and drawbacks associated with our approach. One consideration was the temporary cost increase due to doubling the size of each cluster. The alternative to this was to iteratively add one PaaSTA broker, perform data migration away from one EC2 broker, decommission one EC2 broker, and repeat. Since this approach confines the data movement to one broker’s replica set at a time, this approach would have extended the total duration of the migration procedure. Ultimately we decided that we favored migration speed, so the up-front cost of having twice as many brokers was a cost that we were willing to pay. Additionally, we estimated that the benefits associated with having the cluster on PaaSTA would outweigh these initial costs in the long run. Another tradeoff was that doubling the size of the cluster would also result in very large cluster sizes for some of our high traffic clusters. Those clusters required additional attention during the migration process, and this engineering time-cost was also an initial investment that we were willing to make for the sake of shorter migrations.</p><p>In case of a catastrophic issue during the migration, we also needed to devise a rollback procedure. Sequentially reversing the order of the migration procedure at any stage was sufficient to roll back the changes (this time using Cruise Control’s <code class="language-plaintext highlighter-rouge">add_broker</code> API rather than the <code class="language-plaintext highlighter-rouge">remove_broker</code> API after removing any pending reassignment plans). The primary risk associated with this is that both the migration and the rollback procedure are heavily reliant on Cruise Control being in a healthy state. 
To mitigate this risk we assessed the resource requirements of these instances on test clusters and then overprovisioned the hardware resources for the non-test Cruise Control instances. We also ensured that there was adequate monitoring and alerting on the health of these instances. Finally, we provisioned backup instances which would serve as a replacement if the primary instance became unhealthy.</p><p>While the plan seemed sound in theory, we needed to test it on real clusters and thoroughly document any anomalies. To do this we first used <a href="https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=27846330">Kafka MirrorMaker</a> to clone an existing cluster and then performed a darklaunch migration in its entirety in a non-production environment before repeating the darklaunch migration in a production environment. Once we had established sufficient confidence and documentation, we performed real migrations of all of our Kafka clusters in development and staging environments before performing any production migrations.</p><h2 id="challenges-and-learnings">Challenges and Learnings</h2><p>As previously alluded to, the major risk with the plan was that Cruise Control needed to be healthy in order to proceed with a migration or rollback. We did encounter some instability in some of our non-prod migrations wherein a Cruise Control instance became unhealthy due to offline partitions in a Kafka cluster which temporarily experienced broker instability. Since Cruise Control’s algorithms and internal cluster model rely on being able to read from (and write to) a set of metrics topics, communication between Cruise Control and each Kafka cluster must be maintained. Offline partitions can thus prevent Cruise Control from operating properly, so in those cases the priority is to first triage and fix the issue in Kafka. 
Additionally, Cruise Control exposes configuration values for tuning various aspects of its internal metrics algorithm, and we found that it was sometimes helpful to reduce the lookback window and number of required data points. Doing so helped Cruise Control regenerate its internal model more quickly in cases where Kafka brokers encountered offline partitions.</p><p>Since we were migrating individual clusters, beginning with clusters in our development environment, we were able to gain insights into the performance characteristics of a Kafka cluster when it was running on PaaSTA/Kubernetes compared to when it was running on EC2. Much like with our instance selection criteria when running on bare EC2 instances, we were able to set up Kafka pools with differing instance types according to resource requirements (e.g. a standard pool and a large pool, each containing different instance types).</p><p>Another approach we initially considered for our migration procedure was to set up a fresh PaaSTA-based cluster with <em>N</em> brokers and then use Kafka MirrorMaker to “clone” an existing EC2 cluster’s data onto that new cluster. We also considered adjusting the strategy such that we would add one PaaSTA broker, remove one EC2 broker, and repeat <em>N</em> times. However, this would have entailed updating our operator’s reconcile logic for the purpose of the migration, and we would have needed to manually ensure that each broker pair was in the same availability zone. It would have also introduced a lengthy data copying step which we did not feel was acceptable for large clusters. After some further testing of procedures in our development environment, we ultimately settled on the procedure described here.</p><h2 id="acknowledgements">Acknowledgements</h2><p>Many thanks to Mohammad Haseeb, Brian Sang, and Flavien Raynaud for contributing to the design and implementation of this work. 
I would also like to thank Blake Larkin, Catlyn Kong, Eric Hernandez, Landon Sterk, Mohammad Haseeb, Riya Charaya, and Ryan Irwin for their valuable comments and suggestions. Finally, this work would not have been realized without the help of everyone who performed cluster migrations, so I am grateful to Mohammad Haseeb, Jamie Hewland, Zhaoyang Huang, Georgios Kousouris, Halil Cetiner, Oliver Bennett, Amr Ibrahim, and Alina Radu for all of their contributions.</p><div class="island job-posting"><h3>Principal Platform Software Engineer (Data Streams) at Yelp</h3><p>Want to build next-generation streaming data infrastructure?</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/a04be5e0-7421-48c7-8a4a-9c02b9c758cd/Principal-Platform-Software-Engineer-Data-Streams-Remote?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2022/03/kafka-on-paasta-part-two.html</link>
      <guid>https://engineeringblog.yelp.com/2022/03/kafka-on-paasta-part-two.html</guid>
      <pubDate>Thu, 03 Mar 2022 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Server Side Rendering at Scale]]></title>
      <description><![CDATA[<p>At Yelp, we use Server Side Rendering (SSR) to improve the performance of our React-based frontend pages. After a string of production incidents in early 2021, we realized our existing SSR system was failing to scale as we migrated more pages from Python-based templates to React. Throughout the rest of the year, we worked to re-architect our SSR system in a way that increased stability, reduced costs, and improved observability for feature teams.</p><h2 id="what-is-ssr">What Is SSR?</h2><p>Server Side Rendering is a technique used to improve the performance of JavaScript templating systems (such as React). Rather than waiting for the client to download a JavaScript bundle and render the page based on its contents, we render the page’s HTML on the server side and attach dynamic hooks on the client side once it’s been downloaded. This approach trades increased transfer size for increased rendering speeds, as our servers are typically faster than a client machine. In practice, we find that it significantly improves our <a href="https://web.dev/lcp/" target="_blank">LCP</a> timings.</p><p>We prepare components for SSR by bundling them with an entrypoint function and any other dependencies into a self-contained .js file. The entrypoint then uses <a href="https://reactjs.org/docs/react-dom-server.html" target="_blank">ReactDOMServer</a>, which accepts component props and produces rendered HTML. These SSR bundles are uploaded to S3 as part of our continuous integration process.</p><p>Our old SSR system would download and initialize the latest version of every SSR bundle at startup so that it’d be ready to render any page without waiting on S3 in the critical path. Then, depending on the incoming request, an appropriate entrypoint function would be selected and called. 
This preloading approach posed a number of issues for us:</p><ul><li>Downloading and initializing every bundle significantly increased service startup time, which made it difficult to quickly react to scaling events.</li>
<li>Having the service manage all bundles created a massive memory requirement. Every time we scaled horizontally and spun up a new service instance, we’d have to allocate memory equal to the sum of every bundle’s source code and runtime usage. Serving all bundles from the same instance also made it difficult to measure the performance characteristics of a single bundle.</li>
<li>If a new version of a bundle was uploaded in between service restarts, the service wouldn’t have a copy of it. We solved this by dynamically downloading missing bundles as needed, and used an LRU cache to ensure we weren’t holding too many dynamic bundles in memory at the same time.</li>
</ul><p>The old system was based on Airbnb’s <a href="https://github.com/airbnb/hypernova" target="_blank">Hypernova</a>. Airbnb has written their own <a href="https://medium.com/airbnb-engineering/operationalizing-node-js-for-server-side-rendering-c5ba718acfc9" target="_blank">blog post</a> about the issues with Hypernova, but the core issue is that rendering components blocks the event loop and can cause several Node APIs to break in unexpected ways. One key issue we encountered is that blocking the event loop will break Node’s HTTP request timeout functionality, which significantly exacerbated request latencies when the system was already overloaded. Any SSR system must be designed to minimize the impact of blocking the event loop due to rendering.</p><p>These issues came to a head in early 2021 as the number of SSR bundles at Yelp continued to increase:</p><ul><li>Startup times became so slow that Kubernetes began marking instances as unhealthy and automatically restarting them, preventing them from ever becoming healthy.</li>
<li>The service’s massive heap size led to significant garbage collection issues. By the end of the old system’s lifetime, we were allocating nearly 12GB of old heap space for it. In one experiment, we determined that we were unable to serve more than 50 requests per second due to time lost to garbage collection.</li>
</ul><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-02-22-server-side-rendering-at-scale/latency.png" alt="Request Latency" /><p class="subtle-text"><small>Request Latency</small></p></div><ul><li>Thrashing the dynamic bundle cache due to frequent bundle eviction and re-initialization created a large CPU burden that began affecting other services running on the same host.</li>
</ul><p>All of these issues degraded Yelp’s frontend performance and led to several incidents.</p><p>After dealing with these incidents, we set out to re-architect our SSR system. We chose stability, observability, and simplicity as our design goals. The new system should function and scale without much manual intervention. It should be easy to observe not only for infra teams, but for bundle-owning feature teams as well. The design of the new system should be easy for future developers to understand.</p><p>We also chose a few specific, functional goals:</p><ul><li>Minimize the impact of blocking the event loop so that features like request timeouts work correctly.</li>
<li>Shard service instances by bundle, so that each bundle has its own unique resource allocation. This reduces our overall resource footprint and makes bundle-specific performance easier to observe.</li>
<li>Be able to fast-fail requests we don’t anticipate being able to serve quickly. If we know it’ll take a long time to render a request, the system should immediately fall back to client-side rendering rather than waiting for SSR to time out first. This provides the fastest possible UX to our end users.</li>
</ul><h2 id="language-choice">Language Choice</h2><p>We evaluated several languages when it came time to implement the SSR Service (SSRS), including Python and Rust. It would have been ideal from an internal ecosystem perspective to use Python; however, we found that the state of V8 bindings for Python was not production ready and would require a significant investment to use for SSR.</p><p>Next, we evaluated Rust, which has <a href="https://github.com/denoland/rusty_v8" target="_blank">high quality V8 bindings</a> that are already used in popular production-ready projects like <a href="https://github.com/denoland/deno" target="_blank">Deno</a>. However, all of our SSR bundles rely on the Node runtime API, which is not part of bare V8; thus, we’d have to reimplement significant portions of it to support SSR. This, in addition to a general lack of support for Rust in Yelp’s developer ecosystem, prevented us from using it.</p><p>In the end, we decided to rewrite SSRS in Node because Node provides a <a href="https://nodejs.org/api/vm.html" target="_blank">V8 VM API</a> that allows developers to run JS in sandboxed V8 contexts, has high quality support in the Yelp developer ecosystem, and would allow us to reuse code from other internal Node services to reduce implementation work.</p><p>SSRS consists of a main thread and many worker threads. Node worker threads are different from OS threads in that each thread has its own event loop and memory cannot be trivially shared between threads.</p><p>When the main thread receives an HTTP request, it executes the following steps:</p><ol><li>Check if the request should be fast-failed based on a “timeout factor.” Currently, this factor includes the average rendering run time and current queue size, but could be expanded upon to incorporate more metrics, like CPU load and throughput.</li>
<li>Push the request to the rendering worker pool queue.</li>
</ol><p>When a worker thread receives a request, it executes the following steps:</p><ol><li>Performs server side rendering. This blocks the event loop, but is still allowable since the worker only handles one request at a time. Nothing else should be using the event loop while this CPU-bound work happens.</li>
<li>Returns the rendered HTML to the main thread.</li>
</ol><p>When the main thread receives a response from a worker thread, it returns the rendered HTML back to the client.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2022-02-22-server-side-rendering-at-scale/architecture.png" alt="SSRS Architecture" /><p class="subtle-text"><small>SSRS Architecture</small></p></div><p>This approach provides us with two important guarantees that help us meet our requirements:</p><ul><li>The event loop is never blocked in the main web server thread.</li>
<li>The event loop is never needed while it’s blocked in a worker thread.</li>
</ul><p>We used <a href="https://github.com/piscinajs/piscina" target="_blank">Piscina</a>, a third-party library that provides the functionality described above. It manages thread pools with support for task queueing, task cancellation, and many other useful features. <a href="https://www.fastify.io/" target="_blank">Fastify</a> was chosen to power the main thread web server because it’s both highly performant and developer-friendly.</p><p>Fastify Server:</p><div class="language-javascript highlighter-rouge highlight"><pre>const workerPool = new Piscina({...});
app.post('/batch', opts, async (request, reply) =&gt; {
       if (
           Math.min(avgRunTime.movingAverage(), RENDER_TIMEOUT_MSECS) * (workerPool.queueSize + 1) &gt;
           RENDER_TIMEOUT_MSECS
       ) {
           // Request is not expected to complete in time.
           throw app.httpErrors.tooManyRequests();
       }
       try {
           const start = performance.now();
           currentPendingTasks += 1;
           const resp = await workerPool.run(...);
           const stop = performance.now();
           const runTime = resp.duration;
           const waitTime = stop - start - runTime;
           avgRunTime.push(Date.now(), runTime);
           reply.send({
               results: resp,
           });
       } catch (e) {
           // Error handling code
       } finally {
           currentPendingTasks -= 1;
       }
   });
</pre></div><h2 id="autoscaling-for-horizontal-scaling">Autoscaling for Horizontal Scaling</h2><p>SSRS is built on PaaSTA, which provides <a href="https://paasta.readthedocs.io/en/latest/autoscaling.html" target="_blank">autoscaling mechanisms</a> out of the box. We decided to build a custom autoscaling signal that ingests the utilization of the worker pool:</p><p><code class="language-plaintext highlighter-rouge">Math.min(currentPendingTasks, WORKER_COUNT) / WORKER_COUNT;</code></p><p>This value is compared against our target utilization (setpoint) over a moving time window to make horizontal scaling adjustments. We found that this signal helps us keep per-worker load in a healthier, more accurately provisioned state than basic container CPU usage scaling does, ensuring that all requests are served in a reasonable amount of time without overloading workers or overscaling the service.</p><h2 id="autotuning-for-vertical-scaling">Autotuning for Vertical Scaling</h2><p>Yelp is composed of many pages with different traffic loads; as such, the SSRS shards that support these pages have vastly different resource requirements. Rather than statically defining resources for each SSRS shard, we took advantage of dynamic resource autotuning to automatically adjust container resources like CPUs and memory of shards over time.</p><p>These two scaling mechanisms ensure each shard has the instances and resources it needs, regardless of how little or how much traffic it receives. The biggest benefit is running SSRS efficiently across a diverse set of pages while remaining cost effective.</p><p>Rewriting SSRS with Piscina and Fastify allowed us to avoid the blocking event loop issue that our previous implementation suffered from. Combined with a sharded approach and better scaling signals, this allowed us to squeeze out more performance while reducing cloud compute costs. Some of the highlights include:</p><ul><li>An average reduction of 125ms p99 when server side rendering a bundle.</li>
<li>Improved service startup times from minutes in the old system to seconds by reducing the number of bundles initialized on boot.</li>
<li>Reduced cloud compute costs to one-third of the previous system by using a custom scaling factor and tuning resources more efficiently per-shard.</li>
<li>Increased observability since each shard is now responsible for rendering one bundle only, allowing teams to more quickly understand where things are going wrong.</li>
<li>Created a more extensible system allowing for future improvements like CPU profiling and bundle source map support.</li>
</ul><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="http://www.yelp.com/careers?job_id=3358a10e-b1af-4a5a-bd0e-4aa6bab35c93?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2022/02/server-side-rendering-at-scale.html</link>
      <guid>https://engineeringblog.yelp.com/2022/02/server-side-rendering-at-scale.html</guid>
      <pubDate>Tue, 22 Feb 2022 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Developing a New Native Ads Dashboard Using Server-Driven UI]]></title>
      <description><![CDATA[<p>Updating the ads experience for Yelp Advertisers by creating a new Native Ads Dashboard using Server-Driven UI.</p><p>The Yelp Ads Dashboard is a tool that advertisers can use to update their ad settings and keep track of how their ad is performing. In 2020, we revamped the Ads Dashboard web experience to provide greater visibility into an ad’s performance and better access to control and customize options from a single page.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-10-15-developing-a-new-native-ads-dashboard-using-server-driven-ui/ads-dashboard-screenshot.png" alt="Ads Dashboard on Desktop" /><p class="subtle-text"><small>Ads Dashboard on Desktop</small></p></div><p>In order to ensure consistency across platforms from both a visual and feature standpoint, we decided to update our Ads Dashboard experience on mobile to continue to provide advertisers with an exceptional experience.</p><p>When we started planning this project, we agreed on four specific objectives:</p><ul><li>Feature consistency with the web: we wanted our customers (business owners) to be able to do everything on mobile that they could do on the web, with no additional steps.</li><li>Visual consistency with the web: we wanted to ensure that the visual experience was similar to the web so that it felt familiar and Yelpy.</li><li>A fast and easy way to add new components or edit existing ones.</li><li>No duplicated logic, especially anything already written for the web.</li></ul><p>The Ads Dashboard would live on the Yelp for Business App, an iOS and Android app for business owners on Yelp. This app is developed natively, which means there are separate codebases for iOS and Android. So whenever we make a change to the app, we need to update both the iOS and Android versions of it and push them out separately.</p><p>We knew we still wanted to develop the Ads Dashboard natively, but noticed that our current processes would not meet our requirements. 
Having separate codebases meant:</p><ul><li>Adding new components or updating existing ones would require a lot of time and coordination between iOS and Android engineers.</li><li>Logic, both for achieving visual consistency and for getting data to power components, would have to be duplicated in two different places.</li></ul><p>Luckily, we weren’t the first ones to encounter these problems, and a solution was already in the works: the <a href="https://engineeringblog.yelp.com/2021/11/building-a-server-driven-foundation-for-mobile-app-development.html">Biz Native Foundation</a>.</p><p>Biz Native Foundation (BNF) is a Server-Driven UI framework currently being developed at Yelp for our Biz App. The goal of BNF is to accelerate app development for the Biz App by consolidating the business logic and screen configurations in the backend, instead of having them exist separately within the iOS and Android apps.</p><p>With BNF, our backend service for mobile apps is able to send a screen configuration to the apps, informing them of which components to render, as well as which data and properties to prepopulate those components with. The mobile apps are configured to parse the screen configuration sent to them and understand what components they need to render.</p><p>However, there were a few obstacles here that we needed to tackle.</p><p>Firstly, BNF was an entirely new framework. We were going to be one of the first teams at Yelp to build out an entire page using BNF. This meant that we had no common components built out for us. <strong>Not only did we have to build these from scratch, we also had to set the standard for how to do so for future projects, and make sure that everything we built was reusable and extensible.</strong></p><p>We compiled a list of common and specialized components we’d need, including things like Buttons, Headers, and Charts. 
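A server-driven screen configuration of this kind might look something like the following sketch; the component types, property keys, and renderer below are all invented for illustration and do not reflect BNF's actual schema.

```javascript
// Hypothetical server-driven screen configuration: an ordered list of
// components plus the properties each should render with. Every field
// name here is invented for illustration.
const screenConfig = {
  screen: 'ads_dashboard',
  components: [
    { type: 'Header', props: { title: 'Your Yelp Ad' } },
    { type: 'Chart', props: { metric: 'ad_clicks', source: 'graphql' } },
    { type: 'Button', props: { label: 'Edit Ad' } },
  ],
};

// A client-side renderer walks the list and maps each type to a native view.
const rendered = screenConfig.components.map((c) => c.type);
console.log(rendered.join(' > ')); // → Header > Chart > Button
```

Because the configuration is data, changing the screen layout only requires a backend change, with no app release on either platform.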
We started to anticipate which parts of these components we’d want to have control over when building new features and ensure they were customizable.</p><p>Soon enough, we had a library of components we could use. Once the pieces were built, putting them together was as easy as making a single code change to the backend, and the magic of Server-Driven UI became clear.</p><p>Making changes to the screen layout, as well as components and what they looked like, was fast. We were able to test different configurations and easily play around with the screen. On top of that, a lot of the complex logic that drives the component structure and layout lived in a single place and was easy to update without redeploying the iOS and Android apps.</p><p>The second problem was a little more subtle. We knew that some of the components, like Charts, would need to display some data. However, retrieving that data from the backend on page load was expensive and could slow down the load time. Additionally, since BNF is essentially a way for us to configure the screen, we didn’t want to burden it with the responsibility of providing data that was agnostic to the UI.</p><p>Instead, we wanted a way for each component to fetch additional data as needed after it was loaded on the screen. Enter GraphQL.</p><p>GraphQL is a query language that makes it super easy and straightforward to fetch data. At Yelp, we’ve been using GraphQL to power a lot of data-driven components like Charts and Graphs. In fact, our components on the Ads Dashboard on the web are currently making GraphQL queries to fetch data for Ad Performance.</p><p>We realized that GraphQL was the way to go for our components on mobile as well. Not only was it fast, it also parallelized a lot of data fetching. Beyond that, since we’d already written GraphQL queries to fetch data on the web, it was easy to use the same queries for mobile!</p><p>We had our solutions, and things were coming together. 
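The shared GraphQL fetch described above can be sketched as a standard GraphQL-over-HTTP request; the query fields, variable, and endpoint here are hypothetical, not Yelp's actual schema.

```javascript
// Hypothetical GraphQL query a Chart component might issue after it is
// mounted. Field names, the variable, and the endpoint are invented.
const query = `
  query AdPerformance($businessId: ID!) {
    adPerformance(businessId: $businessId) {
      impressions
      clicks
    }
  }`;

// GraphQL over HTTP is a POST whose JSON body carries the query and its
// variables, so the same query string can be shared by web and mobile.
const body = JSON.stringify({ query, variables: { businessId: 'abc123' } });
// e.g. fetch('https://api.example.com/graphql', { method: 'POST', body, ... })
const parsed = JSON.parse(body);
console.log(parsed.variables.businessId); // → abc123
```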
The flow was simple: our backend service sent a list of components and their properties to mobile clients, and some data-driven components used GraphQL to fetch any additional data. The mobile client did not need to perform any additional tasks. We performed a number of tests to ensure the Ads Dashboard was robust and easy to update with new features.</p><p>We launched the first version of the Ads Dashboard for mobile in Q1 2021. Since then, we’ve been adding new features and components to provide even more valuable information to advertisers. Looking back, the Biz Native Foundation was the right choice as it exceeded our expectations and allowed us to iterate faster than ever on a mobile app.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="http://www.yelp.com/careers?job_id=3358a10e-b1af-4a5a-bd0e-4aa6bab35c93?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2022/02/developing-a-new-native-ads-dashboard-using-server-driven-ui.html</link>
      <guid>https://engineeringblog.yelp.com/2022/02/developing-a-new-native-ads-dashboard-using-server-driven-ui.html</guid>
      <pubDate>Tue, 15 Feb 2022 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Kafka on PaaSTA: Running Kafka on Kubernetes at Yelp (Part 1 - Architecture)]]></title>
      <description><![CDATA[<p>Yelp’s <a href="https://kafka.apache.org/">Kafka</a> infrastructure ingests tens of billions of messages each day to facilitate data driven decisions and power business-critical pipelines and services. We have recently made some improvements to our Kafka deployment architecture by running some of our clusters on <a href="https://engineeringblog.yelp.com/2015/11/introducing-paasta-an-open-platform-as-a-service.html">PaaSTA</a>, Yelp’s own Platform as a Service. Our <a href="https://kubernetes.io/">Kubernetes</a> (k8s) based deployment leverages a custom Kubernetes <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/operator/">operator</a> for Kafka, as well as <a href="https://github.com/linkedin/cruise-control">Cruise Control</a> for lifecycle management.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-12-15-kafka-on-paasta-part-one/stack.png" alt="Kafka on PaaSTA on Kubernetes" /><p class="subtle-text"><small>Kafka on PaaSTA on Kubernetes</small></p></div><h2 id="architectural-motivations-and-improvements">Architectural Motivations and Improvements</h2><p>In the past, all of our Kafka clusters ran on dedicated <a href="https://aws.amazon.com/pm/ec2/">EC2</a> instances on AWS. Kafka was deployed directly on these hosts and configuration management was highly reliant on our centralized <a href="https://puppet.com/">Puppet</a> repository. The deployment model was somewhat cumbersome and creating a new cluster took over two hours on average. We set out to develop a new deployment model with the following goals in mind:</p><ul><li>Reduce the dependency on slow Puppet runs.</li>
<li>Promote adoption of PaaSTA internally and leverage its CLI tools to improve productivity.</li>
<li>Improve maintainability of our lifecycle management system.</li>
<li>Simplify the process of performing OS host level upgrades and Kafka version upgrades.</li>
<li>Streamline the creation of new Kafka clusters (aligned with how we deploy services).</li>
<li>Expedite broker decommissions and simplify the recovery process when hosts fail. Having the ability to re-attach EBS volumes also allows us to avoid unnecessarily consuming network resources, which helps save money.</li>
</ul><p>Yelp had previously developed practices for running <a href="https://kubernetes.io/docs/tutorials/stateful-application/">stateful applications</a> on Kubernetes (e.g. <a href="https://engineeringblog.yelp.com/2020/11/orchestrating-cassandra-on-kubernetes-with-operators.html">Cassandra on PaaSTA</a> and <a href="https://engineeringblog.yelp.com/2020/10/flink-on-paasta.html">Flink on PaaSTA</a>), so PaaSTA was a natural choice for this use case.</p><p>The new deployment architecture leverages PaaSTA pools (groups of hosts) for its underlying infrastructure. Kafka broker <a href="https://kubernetes.io/docs/concepts/workloads/pods/">pods</a> are scheduled on Kubernetes <a href="https://kubernetes.io/docs/concepts/architecture/nodes/">nodes</a> in these pools, and the broker pods have detachable <a href="https://aws.amazon.com/ebs/">EBS</a> volumes. Two key components of the new architecture are the Kafka operator and Cruise Control, both of which we will describe in more detail later. We deploy instances of our in-house Kafka Kubernetes operator and various sidecar services on PaaSTA, and one instance of Cruise Control is also deployed on PaaSTA for each Kafka cluster.</p><p>Two crucial distinctions between the new architecture and the old architecture are that Kafka now runs within a <a href="https://www.docker.com/">Docker</a> container, and our configuration management approach no longer relies on Puppet. Configuration management now follows the standard PaaSTA-based solution, in which <a href="https://www.jenkins.io/">Jenkins</a> propagates YAML file changes whenever they are committed to our service config repository. As a result of this architectural overhaul, we’re now able to leverage existing PaaSTA CLI tooling to see the status of clusters, read logs, and restart clusters. 
Another major benefit is that we’re now able to provision new Kafka clusters by providing the requisite configuration (see below), and this approach has allowed us to <em>halve the time taken</em> to deploy a new Kafka cluster from scratch.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-12-15-kafka-on-paasta-part-one/paasta-tooling.png" alt="PaaSTA Tooling Example" /><p class="subtle-text"><small>PaaSTA Tooling Example</small></p></div><figure class="code"><figure class="highlight"><pre class="language-yaml" data-lang="yaml">example-test-prod:
  deploy_group: prod.everything
  pool: kafka
  brokers: 15
  cpus: 5.7  # CPU unit reservation breakdown: (5.7 (kafka) + 0.1 (hacheck) + 0.1 (sensu)) + 0.1 (kiam) = 6.0 (as an example, consider that our pool is comprised of m5.2xlarge instances)
  mem: 26Gi
  data: 910Gi
  storage_class: gp2
  cluster_type: example
  cluster_name: test-prod
  use_cruise_control: true
  cruise_control_port: 12345
  service_name: kafka-2-4-1
  zookeeper:
    cluster_name: test-prod
    chroot: kafka-example-test-prod
    cluster_type: kafka_example_test
  config:
    unclean.leader.election.enable: "false"
    reserved.broker.max.id: "2113929216"
    request.timeout.ms: "300001"
    replica.fetch.max.bytes: "10485760"
    offsets.topic.segment.bytes: "104857600"
    offsets.retention.minutes: "10080"
    offsets.load.buffer.size: "15728640"
    num.replica.fetchers: "3"
    num.network.threads: "5"
    num.io.threads: "5"
    min.insync.replicas: "2"
    message.max.bytes: "1000000"
    log.segment.bytes: "268435456"
    log.roll.jitter.hours: "1"
    log.roll.hours: "22"
    log.retention.hours: "24"
    log.message.timestamp.type: "LogAppendTime"
    log.message.format.version: "2.4-IV1"
    log.cleaner.enable: "true"
    log.cleaner.threads: "3"
    log.cleaner.dedupe.buffer.size: "536870912"
    inter.broker.protocol.version: "2.4-IV1"
    group.max.session.timeout.ms: "300000"
    delete.topic.enable: "true"
    default.replication.factor: "3"
    connections.max.idle.ms: "3600000"
    confluent.support.metrics.enable: "false"
    auto.create.topics.enable: "false"
    transactional.id.expiration.ms: "86400000"</pre></figure><figcaption class="c1">Example configuration file for a cluster with 15 brokers running Kafka version 2.4.1</figcaption></figure><h2 id="the-new-architecture-in-detail">The New Architecture in Detail</h2><p>One primary component of the new architecture is the Kafka Kubernetes operator which helps us manage the state of the Kafka cluster. While we still rely on external <a href="https://zookeeper.apache.org/">ZooKeeper</a> clusters to maintain cluster metadata, message data is still persisted to the disks of Kafka brokers. Since Kafka consumers rely on persistent storage to be able to retrieve this data, Kafka is considered a stateful application in the context of Kubernetes. Kubernetes natively exposes abstractions for managing stateful applications (e.g. <a href="https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/">StatefulSets</a>), but Kubernetes has no notion of Kafka-specific constructs by default. As such, we needed additional functionality beyond that of the standard Kubernetes API to maintain our instances. In the parlance of Kubernetes, an <em>operator</em> is a custom controller which allows us to expose this application-specific functionality.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-12-15-kafka-on-paasta-part-one/operator-overview.png" alt="Kafka Operator Overview" /><p class="subtle-text"><small>Kafka Operator Overview</small></p></div><p>The operator is in charge of establishing when Kubernetes needs to perform an action on the cluster. It has a reconcile loop in which it observes the state of custom cluster resources and reconciles any discrepancies by interacting with the Kubernetes API and by calling APIs exposed by another key architectural component: Cruise Control.</p><p><a href="https://github.com/linkedin/cruise-control">Cruise Control</a> is an open-source Kafka cluster management system developed by LinkedIn. 
Its goal is to reduce the overhead associated with maintaining large Kafka clusters. Each Kafka cluster has its own dedicated instance of Cruise Control, and each cluster’s operator interacts with its Cruise Control instance to perform lifecycle management operations such as checking the health of the cluster, rebalancing topic partitions and adding/removing brokers.</p><p>The paradigm used by Cruise Control is in many ways similar to the one used by the operator. Cruise Control monitors the state of the Kafka cluster, generates an internal model, scans for anomalous goal violations, and attempts to resolve any observed anomalies. It exposes APIs for various administrative tasks and the aforementioned lifecycle management operations. These APIs serve as a replacement for our prior ad hoc lifecycle management implementations which we used for EC2-backed brokers to perform conditional rebalance operations or interact with AWS resources like <a href="https://aws.amazon.com/sns">SNS</a> and <a href="https://aws.amazon.com/sqs/">SQS</a>. Consolidating these into one service has helped to simplify our lifecycle management stack.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-12-15-kafka-on-paasta-part-one/cluster-architecture.png" alt="Cluster Architecture" /><p class="subtle-text"><small>Cluster Architecture</small></p></div><p>Putting these components together, we arrive at a cluster architecture in which we define a Custom Resource Definition (CRD) through our internal config management system and couple it with a custom Kafka Docker image. The Kafka Kubernetes operator uses the config, CRD, and the Docker image in its interaction with the Kubernetes API to generate a KafkaCluster Custom Resource on a Kubernetes master. This allows us to schedule Kafka pods on Kubernetes nodes, and the operator oversees and maintains the health of the cluster through both the Kubernetes API and the APIs exposed by the Cruise Control service. 
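One pass of the operator's reconcile loop can be sketched roughly as follows for a broker scale-down; the Cruise Control client object and its method name are stand-ins invented for illustration, not the real operator or Cruise Control APIs.

```javascript
// Simplified sketch of one reconcile pass for a broker scale-down.
// `cruiseControl` stands in for Cruise Control's REST API; the method
// name is invented for illustration.
async function reconcile(desiredBrokers, cluster, cruiseControl) {
  if (cluster.brokers > desiredBrokers) {
    // Ask Cruise Control to drain the highest-numbered broker.
    const task = await cruiseControl.removeBroker(cluster.brokers - 1);
    if (task.status === 'Completed') {
      // Partitions have been moved away, so the pod can be deleted.
      cluster.brokers -= 1;
    }
    // If the task is still running, the next pass re-checks its status.
  }
  return cluster;
}

// Stub that reports the removal as complete on the first check.
const cc = { removeBroker: async () => ({ status: 'Completed' }) };
reconcile(14, { brokers: 15 }, cc).then((c) => console.log(c.brokers)); // → 14
```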
Humans can observe the cluster and interact with it through the Cruise Control UI or PaaSTA CLI tools.</p><p>Finally, we’d like to illustrate the overall flow of operations with an example scenario. Consider the case of scaling down the size of the cluster by removing a broker. A developer updates the cluster’s config and decrements the broker count, which in turn updates the Kafka cluster’s CRD. As part of the reconcile loop the operator recognizes that the desired cluster state differs from the actual state represented in the StatefulSet, so it asks Cruise Control to remove a broker. Information about the removal task is returned by the Cruise Control API, and the operator annotates the decommissioning pod with metadata about this task. While Cruise Control performs the process of moving partitions away from the broker to be decommissioned, the operator routinely checks the status of the decommission by issuing requests to Cruise Control. Once the task is marked as completed, the operator removes the pod, and the cluster’s actual state is reconciled with its spec.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-12-15-kafka-on-paasta-part-one/scaling-example.png" alt="Scale Down Scenario" /><p class="subtle-text"><small>Scale Down Scenario</small></p></div><h2 id="what-comes-next">What comes next?</h2><p>After designing this architecture we built tooling and constructed a process for seamlessly migrating Kafka clusters from EC2 to PaaSTA. As of this post we have migrated many of our clusters to PaaSTA, and we’ve deployed new clusters using the architecture detailed here. We’re also continuing to tune our hardware selection to accommodate different attributes of our clusters. 
Stay tuned for another installment in this series where we will share our migration process!</p><h2 id="acknowledgements">Acknowledgements</h2><p>Many thanks to Mohammad Haseeb for contributing to the architecture and implementation of this work, as well as for providing the architecture figures. I would also like to thank Brian Sang and Flavien Raynaud for their many contributions to this project. Finally, I’d like to thank Blake Larkin, Catlyn Kong, Eric Hernandez, Landon Sterk, Mohammad Haseeb, Riya Charaya, and Ryan Irwin for their insightful review comments and guidance in writing this post.</p><div class="island job-posting"><h3>Principal Platform Software Engineer (Data Streams) at Yelp</h3><p>Want to build next-generation streaming data infrastructure?</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/a04be5e0-7421-48c7-8a4a-9c02b9c758cd/Principal-Platform-Software-Engineer-Data-Streams-Remote?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/12/kafka-on-paasta-part-one.html</link>
      <guid>https://engineeringblog.yelp.com/2021/12/kafka-on-paasta-part-one.html</guid>
      <pubDate>Wed, 15 Dec 2021 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Building a unified setup flow to better onboard business users]]></title>
      <description><![CDATA[<p>At Yelp we are always striving to optimize our user experience so we can help guide our customers to success. We aim to streamline the onboarding process for business owners by centralizing customer products into a single page.</p><h2 id="the-challenge">The Challenge</h2><p>Yelp offers an array of <a href="https://business.yelp.com/products/business-page/">free</a> and <a href="https://business.yelp.com/products/business-page/upgrades/">paid</a> products that help local businesses connect with consumers. To set up these products on their Yelp page, business owners previously had to navigate through multiple tabs, which negatively impacted product setup rates (roughly 55%) and lowered overall user engagement.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-12-08-building-a-unified-setup-flow-to-better-onboard-business-users/boa-home-page.png" alt="Previously, the only way businesses could set up Yelp advertising products was through navigating multiple tabs." /><p class="subtle-text"><small>Previously, the only way businesses could set up Yelp advertising products was through navigating multiple tabs.</small></p></div><h2 id="the-setup-flow">The Setup Flow</h2><p>To make it easier for business owners to set up their business page and run their advertising campaigns on Yelp, <strong><a href="https://blog.yelp.com/news/yelp-releases-new-yelp-for-business-features-enabling-more-effective-advertising-and-adding-control-and-value-for-business-owners/">we built a new unified setup flow</a> dedicated to ushering them through the setup process.</strong></p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-12-08-building-a-unified-setup-flow-to-better-onboard-business-users/setup-flow.png" alt="What a business owner sees under the new centralized flow." 
/><p class="subtle-text"><small>What a business owner sees under the new centralized flow.</small></p></div><h2 id="so-how-was-it-built">So, how was it built?</h2><p>Ideally, we could’ve just imported all of the setup components into a new single page application. However, these components had all been built differently, with no consistent architecture. So, rather than rebuild all the components the same way, we designed a new system that could accept these different components despite their varying structures.</p><h3 id="mvp-component-architecture">MVP Component Architecture</h3><p>When designing the setup flow we wanted to focus on scalability while also maintaining a reasonable project scope. To balance these two priorities we created a plug-and-play schema that each setup component was required to follow in order to be imported into our page. The component for each step must:</p><ol><li>Read and write all its own data.</li>
<li>Require only basic properties such as business ID, CSRF tokens, or the locale of the request.</li>
<li>Accept a couple of callback functions that would communicate with our single page application to denote when to save or skip over the current step in the flow.</li>
</ol><p>Once the setup step abides by our requirements we can plug it into our page skeleton. By conforming to this layout, we can easily add new steps to our setup flow as other product teams build new features or update existing ones.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-12-08-building-a-unified-setup-flow-to-better-onboard-business-users/setup-flow-skeleton.png" alt="Example of the Page Skeleton with a Setup Step Imported" /><p class="subtle-text"><small>Example of the Page Skeleton with a Setup Step Imported</small></p></div><h3 id="data-fetching">Data Fetching</h3><p>At Yelp, we have historically used <a href="https://developer.mozilla.org/en-US/docs/Web/Guide/AJAX/Getting_Started">AJAX</a> to fetch data in our frontend components. However, for this project we relied heavily on <a href="https://graphql.org/">GraphQL</a> to fetch all the data we needed. GraphQL is a query language that gives clients the power to ask for the exact data they need and nothing more. It also provides a high level of data stewardship that helps developers build robust data models and avoid having to write manual parsing code on the frontend. The smooth developer experience of building with GraphQL, combined with the scope creep that comes from maintaining many AJAX endpoints, made this an easy decision when designing the data fetching for this new system.</p><p>Not only did this save us from hooking up our lightweight single page application to clunky frontend services, it also resulted in substantial performance gains. Upon rendering, GraphQL is able to batch together multiple data fetches to make only one request to the server.</p><p>Additionally, we cache all the GraphQL calls and the data they return.
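To make this concrete, a single query in this style might look like the following sketch; the type and field names are invented for illustration, not Yelp’s actual schema:

```graphql
# Each setup step declares exactly the fields it needs; the runtime batches
# these selections into one request, and both the query and its response
# are cached on the client.
query SetupFlowData($businessId: ID!) {
  business(id: $businessId) {
    name
    hours {
      day
      open
      close
    }
  }
}
```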
This increases performance because any re-requested data can be found in the cache and doesn’t have to hit the backend server.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-12-08-building-a-unified-setup-flow-to-better-onboard-business-users/data-fetching.png" alt="The flow of data in our MVP component architecture." /><p class="subtle-text"><small>The flow of data in our MVP component architecture.</small></p></div><h3 id="v2-component-architecture">V2 Component Architecture</h3><p>In order for a setup step component to communicate with the page skeleton, our MVP component architecture relied on callback functions.</p><p>For example, when a user saved their newly updated business hours, the setup step component used a callback function called <strong>onSuccessfulSave()</strong> to inform the page skeleton. When called, the page skeleton marked the current step as complete and moved on to the next step. However, using callbacks was limiting because we had to add a new function for every additional piece of information the page wanted to know about the plugged-in component. We quickly realized that this system was not scalable.</p><p>To solve this problem, we have begun working on a V2 of the setup flow that shares a <a href="https://reactjs.org/docs/context.html#contextprovider">context provider</a> between the plugged-in component and the page skeleton. This provides efficient &amp; clean communication between the setup flow and the state of each step, e.g. whether it’s saving, loading, or has run into a network error. This new version allows the flow to communicate more information to the user about each plugged-in component, which will greatly improve the user experience.</p><h2 id="results">Results</h2><p>After launching our MVP and getting early feedback from A/B testing, we rolled out this new flow to 100% of the businesses that go through our claim process.
The setup flow has increased product setup rates by an average of 8% across all the steps with some products seeing a significant boost. For example, our <a href="https://blog.yelp.com/news/yelp-connect-a-new-voice-for-restaurants-to-reach-locals/">Yelp Connect</a> product saw a 35% increase in its set up rate!</p><p>As we continue to improve this system, our focus is on making the setup process efficient in order to help businesses grow and thrive on our platform.</p><h2 id="acknowledgements">Acknowledgements</h2><p>This project was a group effort so shoutout to everyone on the Biz Guidance Team: Zoher Zoomkawalla, Arun Bharadwaj, Taras Anatsko, Brenda Kaing, Abdul Lateef Haamid, Heidi Makein, Sophia Chen, Dorothy Cruz Perdomo, and Leon Rudyak. We also had a lot of cross team support so big thank you to everyone else who helped build this new flow!</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="http://www.yelp.com/careers?job_id=3358a10e-b1af-4a5a-bd0e-4aa6bab35c93?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/12/building-a-unified-setup-flow-to-better-onboard-business-users.html</link>
      <guid>https://engineeringblog.yelp.com/2021/12/building-a-unified-setup-flow-to-better-onboard-business-users.html</guid>
      <pubDate>Wed, 08 Dec 2021 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Building a server-driven foundation for mobile app development]]></title>
      <description><![CDATA[<p>Yelp has many teams of mobile developers who collectively maintain two different mobile apps on iOS and Android: Yelp (hereinafter “Consumer App”) and Yelp for Business (hereinafter “Biz App”). We’re always looking for ways to ship features more quickly and consistently on all these platforms! We adopt vendor and open-source libraries when possible, and we develop our own shared libraries when necessary. While many teams were already independently adopting server-driven UI (SDUI) to build their features faster and cheaper, we felt something was missing – a foundation that tied all our libraries together into a shared, server-driven, end-to-end solution for mobile features.</p><p>In this blog post, we’ll cover the Biz Native Foundation (BNF), which provides a foundation for building, testing, deploying, and monitoring server-driven features in our Biz App. At the end, we’ll share future plans for extending this foundation to our Consumer App.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/building-a-server-driven-foundation-for-mobile-app-development/biz_native_foundation.png" alt="Biz Native Foundation Diagram" class="c1" /></div><h2 id="what-is-the-yelp-for-business-mobile-app">What is the Yelp for Business mobile app?</h2><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/building-a-server-driven-foundation-for-mobile-app-development/yelp_for_business_1024x1229.png" alt="Yelp for Business mobile app screenshots" class="c1" /></div><p>Launched in <a href="https://blog.yelp.com/2014/12/our-gift-to-business-owners-a-yelp-app-just-for-you">December 2014</a> for both iOS and Android, our <a href="https://business.yelp.com/tools/business-mobile-app/">Yelp for Business</a> mobile app enables businesses to manage their presence on Yelp and connect with customers from their phone.</p><p>The app has several core screens, or tabs, with highly personalized content for each business and app user. 
For example, a restaurant that opened during COVID-19 will have different needs than a firmly established plumber, and a business owner will have different needs than a manager or employee. The core screens link to secondary screens for updating business information, adding photos, responding to reviews, and finding new customers.</p><p>Developing both iOS and Android versions of a complex, personalized app has been a major challenge. The level of effort to ship a new feature can be high, and the time-to-market can range from a minimum of one week (we release our apps weekly) to several months or quarters. Once released, a feature must be supported in a range of app versions while undergoing continuous maintenance and improvements with each new release.</p><p>Server-driven UI (SDUI) was an obvious way to address these challenges, and many product teams had already adopted SDUI or were planning to adopt SDUI in 2019 when we began developing the BNF to standardize and simplify mobile development. The COVID-19 pandemic accelerated our efforts as we <a href="https://engineeringblog.yelp.com/2020/06/how-businesses-have-reacted-to-covid-19-using-yelp-features.html">added features</a> to help businesses navigate an extremely dynamic, challenging time. We realized we needed to make a significant investment in order to adopt SDUI across our entire app.</p><h2 id="business-and-technical-requirements">Business and Technical Requirements</h2><p>We defined a handful of important business and technical goals for our foundation:</p><ol><li>Ship Biz App features more quickly and consistently on iOS and Android</li>
<li>Reduce the level-of-effort required to build, test, deploy, and monitor new Biz App features</li>
<li>Support dynamic, highly-personalized content</li>
<li>Give our marketing and product teams more direct control over the content in the Biz App</li>
</ol><h2 id="alternatives-considered">Alternatives Considered</h2><p>Before we began building our own foundation, we reviewed a couple alternatives:</p><h3 id="webviews">Webviews</h3><p>The Biz App was already using webviews to share content with our Yelp for Business web app (<a href="https://biz.yelp.com">biz.yelp.com</a>). That said, we’d been slowly migrating away from webviews for the past two years for several reasons:</p><ul><li>Webviews require careful handshaking between native and web apps</li>
<li>Most mobile app engineers don’t have experience debugging web apps, and most front-end engineers don’t have experience debugging mobile apps</li>
<li>Native screens offer superior user experience (UX) over webviews, e.g. they are faster and more tightly integrated with the platform</li>
</ul><h3 id="react-native">React Native</h3><p><a href="https://reactnative.dev/">React Native</a> would allow us to ship mobile app features more quickly and consistently, and our front-end developers could contribute to our mobile app more easily. React Native would be faster than webviews and more tightly integrated with the platform. However, React Native had some significant downsides for our existing Biz App and developer community:</p><ul><li>We didn’t already use React Native at Yelp, and most of our mobile developers didn’t have professional React Native experience</li>
<li>We couldn’t reuse our existing code or native libraries without extensive bridging, which feels counter to building a foundation</li>
</ul><p>Once we decided to build our own foundation, we established some design principles to guide our efforts.</p><h3 id="adopt-best-practices">Adopt Best Practices</h3><p>We would adopt Yelp-specific or industry-standard best practices when possible. Yelp already has consistent vendor, open-source, and internal libraries for mobile development. We use the latest features in <a href="https://developer.apple.com/documentation/uikit">UIKit</a> (iOS) and <a href="https://material.io/develop/android">Material Design</a> / <a href="https://developer.android.com/jetpack/androidx">Jetpack</a> (Android). On Android, we use our open-sourced <a href="https://engineeringblog.yelp.com/2019/05/introducing-bento.html">Bento</a> framework to build modularized UIs. On iOS, we have a similar internal framework. We wanted our foundation to build on these existing solutions rather than replace them.</p><h3 id="support-server-driven-ui-sdui">Support Server-Driven UI (SDUI)</h3><p>We would give our backend more control over screen content through server-driven UI. This would enable us to make changes more quickly and consistently on all clients. It would also enable dynamic, personalized content and give our marketing and product teams more direct control. Fortunately, SDUI wasn’t new to Yelp or the Biz App, where several product teams had already adopted SDUI for their features. We would learn from these efforts to create a shared SDUI framework for our foundation.</p><h3 id="enable-customization">Enable Customization</h3><p>We would enable customization. Though we wanted to encourage reuse and consistency, we didn’t want to restrict product teams from writing custom code where it makes sense. 
Otherwise, they would simply build their features without the foundation.</p><h3 id="create-a-supporting-toolchain">Create a Supporting Toolchain</h3><p>We would create tools that simplify or automate common tasks, such as debugging, testing, logging, monitoring, and documentation.</p><h2 id="core-concepts">Core Concepts</h2><p>The BNF has only four core concepts to keep the system simple and intuitive. Since mobile screens are the heart of every application, the BNF provides a <strong>generic screen</strong> that hosts <strong>generic components</strong>. When the user interacts with generic components, we trigger <strong>generic actions</strong> to update the UI or application state, using <strong>generic properties</strong> to provide a way for generic components to observe application state without strong coupling.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/building-a-server-driven-foundation-for-mobile-app-development/generic_architecture.png" alt="Generic Architecture Diagram" class="c1" /></div><p>We’ll go through these concepts and show how they all work together to support mobile feature development.</p><h3 id="generic-screen">Generic Screen</h3><p>A generic screen is a flexible template that can support any screen in the Biz App. Before the BNF, adding a new screen required boilerplate code, such as a custom view controller (iOS) or activity/fragment (Android). Fortunately, mobile screens are constrained by the geometry of mobile devices, so we created one highly configurable screen.</p><p>A generic screen consists of a number of sections containing one or more generic components.
Each section represents a part of the screen, such as the top/bottom navigation bar or scroll view.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/building-a-server-driven-foundation-for-mobile-app-development/generic_screen.png" alt="Generic Architecture Diagram" class="c3" /></div><p>A generic screen must be configured to display content. The BNF supports both remotely and locally configured screens.</p><h4 id="remotely-configuring-a-generic-screen-sdui">Remotely configuring a generic screen (SDUI)</h4><p>A remotely configured generic screen uses an endpoint on our REST API to load a JSON screen configuration resource:</p><div class="language-plaintext highlighter-rouge highlight"><pre>/ui/{business_id}/screens/{name}/configuration/v1
</pre></div><p>The endpoint has path arguments that specify the target business ID and the logical name of the screen, e.g. home.</p><p>We use <a href="https://swagger.io/specification/v2/">Swagger 2.0</a> to document our REST APIs and auto-generate client networking libraries. Let’s look at some definitions for our screen configuration.</p><p>The screen configuration object (<code class="language-plaintext highlighter-rouge">ScreenConfigurationV1</code>) has properties for each section on the screen, e.g. <code class="language-plaintext highlighter-rouge">components</code> is the main scroll view. We version the screen configuration object and the screen configuration endpoint whenever we add new properties to this object, such as a new section.</p><div class="language-yaml highlighter-rouge highlight"><pre>ScreenConfigurationV1:
  properties:
    header:
      $ref: '#/definitions/GenericComponent'
    components:
      type: array
      items:
        $ref: '#/definitions/GenericComponent'
    sticky_bottom_components:
      type: array
      items:
        $ref: '#/definitions/GenericComponent'
    data:
      $ref: '#/definitions/GenericScreenData'
  required:
  - components
  - data
  type: object
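# For illustration only, a response instance matching this schema might look
# like the following (IDs, types, and values are invented):
#   components:
#   - id: learn-more-button
#     type: generic_button_v1
#   data:
#     id_to_component_data:
#       learn-more-button:
#         generic_button_v1:
#           text: Learn more on our blog
#     id_to_action_data: {}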
</pre></div><p>Each section can be configured with one or more generic components. The <code class="language-plaintext highlighter-rouge">GenericComponent</code> object only contains an ID (<code class="language-plaintext highlighter-rouge">learn-more-button</code>) and a type (<code class="language-plaintext highlighter-rouge">generic_button_v1</code>).</p><div class="language-yaml highlighter-rouge highlight"><pre>GenericComponent:
  properties:
    id:
      type: string
    type:
      type: string
  required:
  - id
  - type
  type: object
</pre></div><p>The data for each component is stored in a separate <code class="language-plaintext highlighter-rouge">GenericScreenData</code> object that maps a component ID to a <code class="language-plaintext highlighter-rouge">GenericComponentData</code> object, our best approximation of a Swagger union type that works with our Swagger codegen pipeline, which automatically generates networking code for iOS and Android clients. <code class="language-plaintext highlighter-rouge">GenericActionData</code> plays a similar role for generic actions.</p><div class="language-yaml highlighter-rouge highlight"><pre>GenericScreenData:
  properties:
    id_to_component_data:
      description: Map component ID to its GenericComponentData
      $ref: '#/definitions/IdToGenericComponentData'
    id_to_action_data:
      description: Map action ID to its GenericActionData
      $ref: '#/definitions/IdToGenericActionData'
  required:
  - id_to_component_data
  - id_to_action_data
  type: object
GenericComponentData:
  properties:
    generic_button_v1:
      $ref: '#/definitions/GenericButtonDataV1'
    generic_text_v1:
      $ref: '#/definitions/GenericTextDataV1'
    ...
  type: object
GenericActionData:
  properties:
    generic_open_url_v1:
      $ref: '#/definitions/GenericOpenUrlDataV1'
    generic_close_screen_v1:
      $ref: '#/definitions/GenericCloseScreenV1'
    ...
  type: object
</pre></div><p>There are some benefits to storing configuration data separately from references:</p><ul><li>We can reuse the same data across multiple references, reducing the size of the screen configuration</li>
<li>We can debug the screen configuration more easily with all configuration data in a flat map</li>
</ul><h4 id="locally-configuring-a-generic-screen">Locally configuring a generic screen</h4><p>A locally configured generic screen uses either a Kotlin or Swift domain-specific language (DSL).</p><div class="language-kotlin highlighter-rouge highlight"><pre>screenConfiguration {
    header {
        navBar(title = "Welcome!")
    }
    components {
        text("Yelp is working on some cool things!", style = HEADER1)
        button(
            "Learn more on our blog",
            tappedActions = actions {
               openUrl("https://engineeringblog.yelp.com")
            }
        )
    }
}
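// One possible shape for the DSL entry point, assuming a builder-based
// implementation (these names are illustrative, not Yelp's actual API):
//   fun screenConfiguration(block: ScreenConfigurationBuilder.() -&gt; Unit) =
//       ScreenConfigurationBuilder().apply(block).build()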
</pre></div><p>Though locally configured screens can’t be updated without a client release, they still satisfy many of our requirements, such as shipping features more quickly and reducing the level-of-effort. Not every screen has dynamic, personalized content that benefits from being server-driven, and some screens are simply hard to make server-driven.</p><h3 id="generic-components">Generic Components</h3><p>A generic component is a basic building block for a generic screen. The BNF supports a rich, extensible library of components.</p><p>Every component has a unique ID and type. We use a naming convention to distinguish generic, reusable component types (<code class="language-plaintext highlighter-rouge">generic_button_v1</code>) from components that are customized for one feature (<code class="language-plaintext highlighter-rouge">feature_ad_preview_v1</code>). However, the BNF doesn’t handle generic or feature-specific component types differently, so we refer to all components as generic components.</p><h4 id="configuring-components">Configuring components</h4><p>Generic components must be configured with data. In remote screen configurations, each component type has an associated data object. When adding new features to the component, we always version the component type and data object.</p><div class="language-yaml highlighter-rouge highlight"><pre>definitions:
  GenericButtonDataV1:
    properties:
      text:
        type: string
      style:
        type: string
        enum:
        - primary
        - secondary
        - tertiary
      size:
        type: string
        enum:
        - standard
        - large
        - small
      viewed_actions:
        description: Actions to fire when the app user views the button
        type: array
        items:
          $ref: '#/definitions/GenericAction'
      tapped_actions:
        description: Actions to fire when the app user taps the button
        type: array
        items:
          $ref: '#/definitions/GenericAction'
    required:
    - text
    - style
    - size
    - tapped_actions
    type: object
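# For illustration, a data instance for this component might look like the
# following (values invented; each tapped_actions entry references an action
# by its ID and type):
#   text: Learn more on our blog
#   style: primary
#   size: standard
#   tapped_actions:
#   - id: open-blog-url
#     type: generic_open_url_v1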
</pre></div><p>In a local screen configuration, the generic component can be configured with our DSL:</p><div class="language-kotlin highlighter-rouge highlight"><pre>button(
   text = "Learn more on our blog",
   style = PRIMARY,
   tappedActions = actions {
      openUrl("https://engineeringblog.yelp.com")
  }
)
</pre></div><p>Both produce the same result on the client:</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/building-a-server-driven-foundation-for-mobile-app-development/generic_button.png" alt="Generic button screenshot" class="c1" /></div><h4 id="composing-components">Composing components</h4><p>The BNF has several ways to build larger components from smaller pieces. First, many mobile features can be broken into a vertical stack of simpler components, such as buttons, text, icons, and images. Second, many features can be built by composing components with a container component.</p><p>For example, the Biz App has cards to promote the products and services Yelp offers to businesses. The promotional cards are built from a stack of simpler components and a bordered container (<code class="language-plaintext highlighter-rouge">generic_bordered_container_v1</code>), which contains a feature-specific component for each product (<code class="language-plaintext highlighter-rouge">feature_call_to_action_v1</code>).</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/building-a-server-driven-foundation-for-mobile-app-development/composing_components.png" alt="Composing generic components into a promotional card" class="c4" /></div><p>iOS and Android provide mechanisms to recycle views when scrolling, so using vertical stacks of simpler, reusable components improves scroll performance.</p><p>Initially, we were worried that containers would impact scroll performance and memory consumption, especially with high levels of nesting. But we kept finding designs that benefited from containers. In practice, we don’t nest more than one or two levels. 
On Android, containers are nested <a href="https://developer.android.com/reference/androidx/recyclerview/widget/RecyclerView">RecyclerViews</a> that share a common <a href="https://developer.android.com/reference/androidx/recyclerview/widget/RecyclerView.RecycledViewPool">RecycledViewPool</a>, allowing re-use of simpler components such as text, buttons, and images.</p><h4 id="rendering-components-on-clients">Rendering components on clients</h4><p>On the client, components are rendered with a factory associated with the component type. The same factory handles multiple versions of the same component. We typically have one implementation of each component (<code class="language-plaintext highlighter-rouge">GenericButtonComponent</code>) on the client, and the factory maps the server-driven component data to an internal configuration.</p><div class="language-kotlin highlighter-rouge highlight"><pre>class GenericButtonComponentFactory: GenericComponentFactory {
    // Used by the BNF infrastructure to build a catalog of
    // available &amp; deprecated types
    override val availableTypes = listOf(V1, V2)
    override val deprecatedTypes = listOf(V1)
    override fun create(
        component: GenericComponent,
        data: GenericComponentData
    ) = when (component.type) {
        V1 -&gt; createV1(component.id, data.generic_button_v1)
        V2 -&gt; createV2(component.id, data.generic_button_v2)
        else -&gt; throw IllegalStateException("Unexpected component ${component.type}")
    }
    fun createV1(id: String, data: GenericButtonDataV1): GenericButtonComponent {
        // Convert GenericButtonDataV1 to an internal state
        // Construct &amp; return the GenericButtonComponent
    }
    fun createV2(id: String, data: GenericButtonDataV2): GenericButtonComponent {
        // Convert GenericButtonDataV2 to an internal state
        // Construct &amp; return the GenericButtonComponent
    }
    companion object {
        const val V1 = "generic_button_v1"
        const val V2 = "generic_button_v2"
    }
}
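// The screen renderer can then resolve a component through a registry of
// factories (illustrative only, not Yelp's actual API):
//   val factory = factories.factoryFor(component.type)  // e.g. "generic_button_v1"
//   val rendered = factory.create(component, data)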
</pre></div><h3 id="generic-actions">Generic Actions</h3><p>A generic action is a side effect that occurs when the user interacts with a screen or component. A generic screen or component can trigger actions under any number of conditions, such as when the user views the screen or taps the component.</p><p>Like generic components, every generic action has a unique ID (<code class="language-plaintext highlighter-rouge">open-blog-url</code>) and type (<code class="language-plaintext highlighter-rouge">generic_open_url_v1</code>), and we use naming conventions to distinguish between generic and feature-specific actions (<code class="language-plaintext highlighter-rouge">feature_close_business_v1</code>).</p><p>As with generic components, the BNF was designed to support a rich, extensible library of actions. Here’s a sampling of actions:</p><table><thead><tr><th>Generic Action</th>
<th>Description</th>
</tr></thead><tbody><tr><td>generic_open_url_v1</td>
<td>Opens a deep link, which supports “https”, “tel”, “yelp”, and “yelp-biz” schemes</td>
</tr><tr><td>generic_close_screen_v1</td>
<td>Closes the current screen and opens an optional URL to navigate to the next screen</td>
</tr><tr><td>generic_show_screen_v1</td>
<td>Opens another screen using a nested screen configuration</td>
</tr><tr><td>generic_reconfigure_screen_v1</td>
<td>Reconfigures the current screen with a new screen configuration</td>
</tr><tr><td>generic_update_property_v1</td>
<td>Updates the value of a generic property, which represents a piece of application state</td>
</tr><tr><td>generic_scroll_to_component_v1</td>
<td>Scrolls the screen to a specified component</td>
</tr><tr><td>feature_close_business_v1</td>
<td>Marks the current business as closed, which has a lot of feature-specific side-effects</td>
</tr></tbody></table><h4 id="configuring-actions">Configuring actions</h4><p>In a remote screen configuration, each action type has a corresponding data model:</p><div class="language-yaml highlighter-rouge highlight"><pre>GenericOpenUrlDataV1:
  properties:
    url:
      description: Link to be opened when the action is triggered
      type: string
  required:
  - url
  type: object
</pre></div><p>In a local screen configuration, actions can be configured with the DSL in an actions block:</p><div class="language-kotlin highlighter-rouge highlight"><pre>button(
    text = "Learn more on our blog",
    style = PRIMARY,
    tappedActions = actions {
        openUrl("https://engineeringblog.yelp.com")
    }
)
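
// For illustration only: the actions block is assumed to accept several
// actions in sequence, and closeScreen() is a hypothetical DSL helper
// mirroring generic_close_screen_v1.
button(
    text = "Done",
    style = PRIMARY,
    tappedActions = actions {
        openUrl("https://engineeringblog.yelp.com")
        closeScreen()
    }
)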
</pre></div><h4 id="handling-actions">Handling actions</h4><p>We use an event-based architecture on both iOS and Android to handle user interactions. Generic actions are events, which are either Swift structs or Kotlin data classes. For example, on Android, we have an <code class="language-plaintext highlighter-rouge">OpenUrlEvent</code> to model <code class="language-plaintext highlighter-rouge">generic_open_url_v1</code> in remote screen configurations.</p><div class="language-kotlin highlighter-rouge highlight"><pre>data class OpenUrlEvent(val url: String): GenericScreenEvent()
</pre></div><p>Android uses a Model-View-Intent (MVI) architecture where components publish events (intents) to a shared event bus. When the user taps a component, the component will publish its <code class="language-plaintext highlighter-rouge">tappedActions</code>.</p><div class="language-kotlin highlighter-rouge highlight"><pre>class GenericButtonComponentViewHolder :
    GenericComponentViewHolder&lt;GenericButtonComponentState&gt;(
        R.layout.view_generic_button_component
    )
{
    lateinit var tappedActions: List&lt;GenericScreenEvent&gt;
    private val button by clickView&lt;GenericButton&gt;(R.id.button) {
        eventBus.sendEvents(tappedActions)
    }
    override fun bind(state: GenericButtonComponentState) {
        button.configure(state)
        tappedActions = state.tappedActions
    }
}
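
// A minimal sketch of the event bus contract the view holder relies on —
// not Yelp's actual implementation, just the shape implied above:
class SimpleEventBus {
    private val handlers = mutableListOf&lt;(GenericScreenEvent) -&gt; Unit&gt;()

    // Fan each published event out to every registered intent handler
    fun sendEvents(events: List&lt;GenericScreenEvent&gt;) {
        events.forEach { event -&gt; handlers.forEach { it(event) } }
    }

    fun register(handler: (GenericScreenEvent) -&gt; Unit) {
        handlers += handler
    }
}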
</pre></div><p>The event will be delivered to a matching intent handler that knows how to process the user’s intent and update the UI state.</p><div class="language-kotlin highlighter-rouge highlight"><pre>class NavigationIntentHandler: GenericScreenIntentHandler() {
    @Event(OpenUrlEvent::class)
    fun handleOpenUrl(event: OpenUrlEvent) {
        with(event.url) {
            when {
                startsWith("tel:") -&gt; openTelLink(this)
                startsWith("https:") -&gt; openSecureHttpLink(this)
                startsWith("http:") -&gt; openUnsecureHttpLink(this)
                startsWith("yelp-biz:") -&gt; openCustomLink(this)
                else -&gt; reportUnsupportedLinkError(this)
            }
        }
    }
}
</pre></div><h3 id="generic-properties">Generic Properties</h3><p>Most UIs are dynamic; they need to respond to user interactions and changes in application state. For example, businesses can exchange messages with their customers, and we want to show the number of unread messages as a badge component.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/building-a-server-driven-foundation-for-mobile-app-development/generic_property.png" alt="A navigation component with a badge showing the number of unread messages" class="c1" /></div><h4 id="modeling-generic-properties">Modeling generic properties</h4><p>We represent a generic property using a dot-separated hierarchical path and an associated data type:</p><div class="language-plaintext highlighter-rouge highlight"><pre>businesses.{business_id}.inbox.messages.unread.count&lt;integer&gt;
</pre></div><p>A generic property can have path parameters that provide additional context. For example, each business has a separate inbox, so the <code class="language-plaintext highlighter-rouge">{business_id}</code> parameter corresponds to the unique business ID.</p><p>To a generic component, a generic property is just a strongly-typed variable that it can read or write. The generic component doesn’t know the meaning of the data (the number of unread messages) or how the data is stored or updated.</p><p>In a remote screen configuration, we use the <code class="language-plaintext highlighter-rouge">GenericProperty</code> object to model properties.</p><div class="language-yaml highlighter-rouge highlight"><pre>GenericProperty:
    properties:
      path:
        description: A dot-separated hierarchical path for the property
        type: string
      type:
        description: Represents the property type
        type: string
    required:
    - path
    - type
    type: object
</pre></div><h4 id="supporting-generic-properties">Supporting generic properties</h4><p>Each generic property has a generic property manager that handles reads and writes.</p><p>On Android, we resolve a generic property into an <a href="http://reactivex.io/RxJava/3.x/javadoc/io/reactivex/rxjava3/core/Observable.html">RxJava Observable</a> backed by a <a href="http://reactivex.io/RxJava/3.x/javadoc/io/reactivex/rxjava3/subjects/BehaviorSubject.html">BehaviorSubject</a>, which remembers the latest value. A generic component subscribes to the <code class="language-plaintext highlighter-rouge">Observable</code> to receive new values and update its view.</p><div class="language-kotlin highlighter-rouge highlight"><pre>class BusinessInboxPropertyManager: GenericPropertyManager&lt;Int&gt; {
    val inboxPropertyDefinition =
        GenericPropertyDefinition(
            "businesses.{business_id}.inbox.messages.unread.count"
        )
    override val properties = listOf(inboxPropertyDefinition)
    private val subjectMap = mutableMapOf&lt;String, BehaviorSubject&lt;Int&gt;&gt;()
    override fun get(path: String): Observable&lt;Int&gt; {
        return getOrCreateSubject(path).hide()
    }
    override fun set(path: String, value: Int) {
        getOrCreateSubject(path).onNext(value)
    }
    private fun getOrCreateSubject(path: String): BehaviorSubject&lt;Int&gt; {
        return subjectMap[path] ?: BehaviorSubject.create&lt;Int&gt;().also {
            subjectMap[path] = it
        }
    }
}
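
// Usage sketch (business ID 42 is illustrative): because each path is
// backed by a BehaviorSubject, a value written before subscription is
// replayed to the subscriber immediately.
val manager = BusinessInboxPropertyManager()
manager.set("businesses.42.inbox.messages.unread.count", 3)
manager.get("businesses.42.inbox.messages.unread.count")
    .subscribe { count -&gt; println("unread = $count") } // prints unread = 3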
</pre></div><h4 id="building-dynamic-components-with-generic-properties">Building dynamic components with generic properties</h4><p>The BNF supports a <code class="language-plaintext highlighter-rouge">generic_badge_v1</code> component that represents a basic badge with a dynamic count using a generic property.</p><div class="language-yaml highlighter-rouge highlight"><pre>GenericBadgeDataV1:
    properties:
      dynamic_count:
        $ref: '#/definitions/GenericProperty'
    required:
    - dynamic_count
    type: object
</pre></div><p>On Android, we map the generic property to an <code class="language-plaintext highlighter-rouge">Observable&lt;Int&gt;</code> in the component’s MVI state.</p><div class="language-kotlin highlighter-rouge highlight"><pre>// The BadgeComponent’s MVI state stores an Observable
data class BadgeComponentState(
   val dynamicCount: Observable&lt;Int&gt;,
   @ColorRes val color: Int = R.color.red
)
// The BadgeComponentFactory resolves a generic property into
// the Observable required by the MVI state using the
// GenericProperties registry.
fun createBadgeComponentState(data: GenericBadgeDataV1)
   = BadgeComponentState(
         dynamicCount = GenericProperties.get(data.dynamicCount.path)
     )
</pre></div><p>The component subscribes to the <code class="language-plaintext highlighter-rouge">Observable&lt;Int&gt;</code> and updates the badge to reflect the current count.</p><div class="language-kotlin highlighter-rouge highlight"><pre>// The BadgeComponent subscribes to the Observable
state.dynamicCount
  .doOnSubscribe {
      // Keep the badge invisible until we have the first count
      badgeView.isVisible = false
  }
  .doOnNext {
      // Update the value of the badge!
      badgeView.value = it
      // Don’t show the badge unless there’s a non-zero count
      badgeView.isVisible = (it &gt; 0)
  }
  .doOnError {
      // If there’s an error, hide the badge
      badgeView.isVisible = false
  }
  .subscribe()
  .autodispose()
</pre></div><p>We’re still experimenting with generic properties and refining the use cases. We believe they are a necessary concept to unlock dynamic, server-driven UIs.</p><h2 id="current-use-cases">Current Use Cases</h2><p>We’re using the BNF to power the Home, Yelp Ads, Business Info, and More tabs. These tabs are remotely configured screens because they host dynamic, personalized content. The Yelp Ads tab hosts the <a href="https://blog.yelp.com/2021/05/yelp-releases-new-yelp-for-business-features-enabling-more-effective-advertising-and-adding-control-and-value-for-business-owners">Ads Dashboard</a> screen, which was the first screen built entirely from scratch using the BNF. We’ll share more about this in a future blog post; stay tuned!</p><div class="c2"><img src="https://blog.yelp.com/wp-content/uploads/2021/05/mobile-ads-dash-PR.gif" alt="Ads Dashboard screenshot" class="c3" /></div><p>We’re also using the BNF to power several in-product marketing screens. These screens are usually remotely configured to give our marketing and product teams more direct control, but sometimes we build them locally first using our screen configuration DSL.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/building-a-server-driven-foundation-for-mobile-app-development/in_product_marketing.png" alt="Dynamic in-product marketing screens" class="c4" /></div><p>Finally, we’re using the BNF to build debug screens to prototype new designs or test individual generic components, actions, or properties. 
These screens are locally configured with our screen configuration DSL.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/building-a-server-driven-foundation-for-mobile-app-development/debug.png" alt="Debug screens built with our DSL" class="c3" /></div><h2 id="future-directions">Future Directions</h2><h3 id="building-a-better-backend">Building a Better Backend</h3><p>SDUI pushes more of the business and presentation logic into the backend. Most backend engineers aren’t familiar with mobile app development or building mobile UIs. Consequently, they need better infrastructure for making, testing, and deploying their changes. We also need better tooling for our marketing and product teams to make changes.</p><h3 id="adopting-swiftui-and-jetpack-compose">Adopting SwiftUI and Jetpack Compose</h3><p>One of our design principles is “Adopting Best Practices.” We’ve therefore watched the evolution of <a href="https://developer.apple.com/xcode/swiftui/">SwiftUI</a> and <a href="https://developer.android.com/jetpack/compose">Jetpack Compose</a> with great interest. Both frameworks support building composable, dynamic UIs with a simple declarative syntax. We hope to adopt these new frameworks in the near future.</p><h3 id="adopting-graphql">Adopting GraphQL</h3><p>Yelp is currently migrating our web and mobile apps from individual REST APIs to a unified <a href="https://graphql.org/">GraphQL</a> schema. We’re planning to migrate the BNF to GraphQL, which offers better support than REST for making changes without breaking backwards compatibility. Mobile clients must write explicit GraphQL queries that describe the types and fields they support. With our REST API, we are frequently creating new versions of entire objects (<code class="language-plaintext highlighter-rouge">GenericButtonDataV7</code>) or APIs just to add one field safely. 
With GraphQL, we can evolve our schema incrementally.</p><h3 id="building-a-yelp-native-foundation">Building a Yelp Native Foundation</h3><p>Our Consumer App and Biz App handle separate sides of the same transaction – connecting consumers to great local businesses. In many cases, building a new feature requires changes in both apps. For example, when the Biz App added features for businesses to provide <a href="https://blog.yelp.com/2020/12/covid-related-updates-for-your-yelp-page#Edit-your-COVID-19-Advisory-Alert">COVID-related updates</a> to consumers, the Consumer App added corresponding features for consumers to see those updates.</p><p>When we started the BNF in 2019, product teams working on the Consumer App were also starting a shared server-driven foundation for similar reasons. Unfortunately, the Biz App and Consumer App had different REST APIs and separate Git repositories. We made the practical decision to share ideas and techniques but not code. Now we’re slowly moving towards a common Yelp Native Foundation by migrating to a unified GraphQL schema and adopting monorepos.</p><p>We’re very excited about the future of SDUI at Yelp and in the industry as a whole. Many companies, such as <a href="https://medium.com/airbnb-engineering/a-deep-dive-into-airbnbs-server-driven-ui-system-842244c5f5">Airbnb</a> and <a href="https://doordash.engineering/2021/08/24/improving-development-velocity-with-generic-server-driven-ui-components/">Doordash</a>, have recently published the details of their own shared, server-driven foundations, and there are open-source efforts, such as <a href="https://github.com/ZupIT/beagle">Beagle</a>. We’ve noticed many similarities between our work and these projects, which suggests there are some natural design patterns for implementing SDUI. We hope this blog post contributes to the growing SDUI community. 
Keep an eye on this blog for updates on our progress!</p><div class="island job-posting"><h3>Become a Mobile Software Engineer at Yelp</h3><p>Want to help us grow our mobile foundation on iOS?</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/50448189-a770-4214-8f7c-407798d7707f?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/11/building-a-server-driven-foundation-for-mobile-app-development.html</link>
      <guid>https://engineeringblog.yelp.com/2021/11/building-a-server-driven-foundation-for-mobile-app-development.html</guid>
      <pubDate>Tue, 30 Nov 2021 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Awesome Women in Engineering Hosts its First Virtual Summit]]></title>
      <description><![CDATA[<div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-10-25-awesome-women-in-engineering-hosts-its-first-virtual-summit/summit_poster.png" alt="" /></div><p>Yelp’s employee resource group for women in engineering, <a href="https://www.yelp.com/engineering/awe">Awesome Women in Engineering (AWE)</a>, recently held its first virtual summit! The summit was designed for women and allies at Yelp to learn, network, and have fun. AWE started in 2013 with a mission to build a strong community for women and allies at Yelp by facilitating professional career-building activities, networking, leadership, and mentorship opportunities. As a resource group, we provide support and organize activities targeted towards professional growth for women engineers, helping them to maximize their potential at Yelp and beyond. We are excited to share the different activities that helped make this a successful event.</p><h2 id="everything-was-perfect-working-at-a-company-which-supports-events-hosted-by-women-and-with-many-women-as-speakers-is-amazing---thais-a-software-engineer">“Everything was perfect. Working at a company which supports events hosted by women, and with many women as speakers is amazing!” - Thais A., Software Engineer</h2><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-10-25-awesome-women-in-engineering-hosts-its-first-virtual-summit/MeetOurSpeakers.png" alt="Our speakers for the summit" /><p class="subtle-text"><small>Our speakers for the summit</small></p></div><p>We’d previously hosted a <a href="https://engineeringblog.yelp.com/2019/10/first-awe-summit-sf.html">similar summit</a> in our San Francisco office, but this summit was 100% virtual as we’ve since moved to a more distributed work environment. This enabled us to have events at times accessible to our distributed teams either in Europe or North America. 
We hosted several events ranging from technical talks to networking sessions to workshops, giving women and allies the opportunity to share their experiences and learn from the experiences of others.</p><h2 id="i-got-to-meet-awesome-women-that-i-dont-interact-with-often---maoreen-m-technical-sourcer">“I got to meet awesome women that I don’t interact with often” - Maoreen M., Technical Sourcer</h2><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-10-25-awesome-women-in-engineering-hosts-its-first-virtual-summit/keynote_screenshot.png" alt="Miriam leading the keynote speech" /><p class="subtle-text"><small>Miriam leading the keynote speech</small></p></div><p>A highlight of this summit was the keynote speech given by Miriam Warren, Yelp’s Chief Diversity Officer. Miriam spoke about her journey at Yelp, building and empowering communities, demystifying networking, and knowing your story. It was also fascinating to hear about her journey joining nonprofit boards and the ways these experiences helped her grow her career and learn from people in other industries.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-10-25-awesome-women-in-engineering-hosts-its-first-virtual-summit/career_panel.png" alt="A panel discussion about career growth" /><p class="subtle-text"><small>A panel discussion about career growth</small></p></div><p>Many other members of AWE also gave talks. Some of those talks were focused on technical learning. For example, we heard about statistical thinking, the math used in our ads algorithms, and measuring product success. Some talks were centered more on the role of diversity in our work, such as creating an accessible product, reducing biases in algorithms, and diversity in recruiting and hiring. 
Other talks were geared towards career growth where we heard from women in various roles about their journeys.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-10-25-awesome-women-in-engineering-hosts-its-first-virtual-summit/gather_town.png" alt="Our virtual networking event was hosted on gather.town" /><p class="subtle-text"><small>Our virtual networking event was hosted on gather.town</small></p></div><p>The summit also incorporated interactive events. We hosted two ally skills workshops, which took participants through real-world scenarios and consisted of group discussions about how to act as an ally in each situation. There was also a technical workshop that covered the basics of machine learning followed by an interactive session where everyone built a basic model. Lastly, we had a virtual networking session where participants were able to meet new people and get to know each other through icebreaker questions.</p><p>The summit was an amazing opportunity for women and allies to build deeper connections, learn from each others’ experiences, and feel empowered to always be our most authentic selves. We’re proud to have done this event in a distributed environment and plan to look back at what worked and what didn’t for participants so we can do it again in the future, while continuing to inspire women through AWE’s many other initiatives.</p><p>Acknowledgements: Dorothy Jung, Chie Shu, Trisha Walsh, and Grace Yuan</p><div class="island job-posting"><h3>Interested in joining the awesome women in engineering and product at Yelp?</h3><p>We're hiring! Check out our Careers page for more open positions.</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/10/awesome-women-in-engineering-hosts-its-first-virtual-summit.html</link>
      <guid>https://engineeringblog.yelp.com/2021/10/awesome-women-in-engineering-hosts-its-first-virtual-summit.html</guid>
      <pubDate>Mon, 25 Oct 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Nrtsearch: Yelp’s Fast, Scalable and Cost Effective Search Engine]]></title>
      <description><![CDATA[<p><a href="https://engineeringblog.yelp.com/2021/09/nrtsearch-yelps-fast-scalable-and-cost-effective-search-engine.html">Nrtsearch: Yelp’s Fast, Scalable and Cost Effective Search Engine</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/09/nrtsearch-yelps-fast-scalable-and-cost-effective-search-engine.html</link>
      <guid>https://engineeringblog.yelp.com/2021/09/nrtsearch-yelps-fast-scalable-and-cost-effective-search-engine.html</guid>
      <pubDate>Tue, 21 Sep 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Engineering Career Series: Building a thriving engineering team]]></title>
      <description><![CDATA[<p>This post brings our Engineering Career Series to an end. I hope you’ve enjoyed reading it as much as we’ve enjoyed sharing Yelp’s philosophy on building engineering careers in a thoughtful, equitable, and enjoyable way.</p><p>As the series has shown, building a thriving engineering team requires ongoing investment in people and in processes. It requires you to recognize and acknowledge your successes and failures, and continue to iterate and improve. There are no quick fixes and the job is never truly done, but the rewards of improving are huge, for the individuals and for the success of your company as a whole.</p><p>What we’ve tried to share with you during this series is not that we’re perfect and that we have all the answers. Instead we wanted to give you some idea of what the journey has been like to get where we are now, and to be open about some of the challenges along the way that you may also encounter in your engineering career – whether you’re an engineer, a technical leader, or a manager.</p><p>If there’s one thing I’d like you to take away from the series, it’s that this is <em>worth the effort</em>. There are concrete steps you can take as leaders that will change your engineering culture for the better, and there are contributions that anyone involved in engineering can make that will make people’s careers (and lives) happier, fairer, and more successful.</p><p>At Yelp, we’re committed to giving the resources to everyone involved to keep making these efforts, to continuously improve our engineering culture and the experience of everyone who works here. 
We’d love to welcome anyone else who is as passionate about creating a diverse and inclusive engineering team to <a href="http://www.yelp.com/careers">join us</a>, or simply to get in touch and share your experiences.</p><p>If you’ve not read the rest of the series, here’s a quick recap:</p><h3 id="hiring-a-diverse-team-reducing-bias-in-engineering-interviews"><a href="https://engineeringblog.yelp.com/2021/04/engineering-career-series-hiring-a-diverse-team-by-reducing-bias.html">Hiring a diverse team: reducing bias in engineering interviews</a></h3><p>How Yelp has approached hiring over the years, and the major lessons we learned in how to reduce bias.</p><h3 id="using-structured-interviews-to-improve-equity"><a href="https://engineeringblog.yelp.com/2021/05/engineering-career-series-using-structured-interviews-to-improve-equity.html">Using structured interviews to improve equity</a></h3><p>A key change to our interview process improved equity of outcomes considerably.</p><h3 id="how-we-onboard-engineers-across-the-world-at-yelp"><a href="https://engineeringblog.yelp.com/2021/05/engineering-career-series-how-we-onboard-engineers-across-the-world-at-yelp.html">How we onboard engineers across the world at Yelp</a></h3><p>Once you’ve hired someone amazing, you need to set them up for success on day one.</p><h3 id="career-paths-for-engineers-at-yelp"><a href="https://engineeringblog.yelp.com/2021/06/engineering-career-series-career-paths-for-engineers-at-yelp.html">Career paths for engineers at Yelp</a></h3><p>How we designed and redesigned our framework for career growth and levelling, and how that shift increased fairness and equity.</p><h3 id="technical-leadership-at-yelp"><a href="https://engineeringblog.yelp.com/2021/06/engineering-career-series-technical-leadership-at-yelp.html">Technical leadership at Yelp</a></h3><p>Why we approach technical leadership as a role you can choose to take on at Yelp, rather than just a level within our career path framework.</p><h3 
id="how-yelp-approaches-engineering-management"><a href="https://engineeringblog.yelp.com/2021/07/engineering-career-series-how-we-think-about-engineering-management.html">How Yelp approaches engineering management</a></h3><p>What “success” looks like for managers at Yelp, how we hire them, what we ask them to do and to value, and how we’ve built this into the career path for managers.</p><h3 id="ensuring-pay-equity--career-progression-in-yelp-engineering"><a href="https://engineeringblog.yelp.com/2021/07/engineering-career-series-ensuring-pay-equity-and-career-progression-in-yelp-engineering.html">Ensuring pay equity &amp; career progression in Yelp Engineering</a></h3><p>What we learnt from committing to publishing our analysis of pay equity and career progression to all of engineering annually, no matter what the results.</p><h3 id="fostering-inclusion--belonging-within-yelp-engineering"><a href="https://engineeringblog.yelp.com/2021/07/engineering-career-series-fostering-inclusion-and-belonging-within-yelp-engineering.html">Fostering inclusion &amp; belonging within Yelp Engineering</a></h3><p>Improving inclusion and belonging requires you to provide for teams and groups in many different ways. We designed systems and processes that give people the support they need in the time, place and manner they need it.</p><div class="post-gray-box">This post is part of a series covering how we're building a happy, diverse, and inclusive engineering team at Yelp, including details on how we approached the various challenges along the way, what we've tried, and what worked and didn't.<p>Read the other posts in the series:</p><ul><li><a title="Engineering Career Series: Building a happy, diverse, and inclusive engineering team" href="https://engineeringblog.yelp.com/2021/04/engineering-career-series-building-a-happy-diverse-and-inclusive-engineering-team.html">Building a happy, diverse, and inclusive engineering team</a></li>
<li><a title="Engineering Career Series: Hiring a diverse team by reducing bias" href="https://engineeringblog.yelp.com/2021/04/engineering-career-series-hiring-a-diverse-team-by-reducing-bias.html">Hiring a diverse team by reducing bias</a></li>
<li><a title="Engineering Career Series: Using structured interviews to improve equity" href="https://engineeringblog.yelp.com/2021/05/engineering-career-series-using-structured-interviews-to-improve-equity.html">Using structured interviews to improve equity</a></li>
<li><a title="Engineering Career Series: How we onboard engineers across the world at Yelp" href="https://engineeringblog.yelp.com/2021/05/engineering-career-series-how-we-onboard-engineers-across-the-world-at-yelp.html">How we onboard engineers across the world at Yelp</a></li>
<li><a title="Engineering Career Series: Career paths for engineers at Yelp" href="https://engineeringblog.yelp.com/2021/06/engineering-career-series-career-paths-for-engineers-at-yelp.html">Career paths for engineers at Yelp</a></li>
<li><a title="Engineering Career Series: Technical Leadership at Yelp" href="https://engineeringblog.yelp.com/2021/06/engineering-career-series-technical-leadership-at-yelp.html">Technical Leadership at Yelp</a></li>
<li><a title="Engineering Career Series: How we think about engineering management" href="https://engineeringblog.yelp.com/2021/07/engineering-career-series-how-we-think-about-engineering-management.html">How we think about engineering management</a></li>
<li><a title="Engineering Career Series: Ensuring Pay Equity &amp; Career Progression in Yelp Engineering" href="https://engineeringblog.yelp.com/2021/07/engineering-career-series-ensuring-pay-equity-and-career-progression-in-yelp-engineering.html">Ensuring Pay Equity &amp; Career Progression in Yelp Engineering</a></li>
<li><a title="Engineering Career Series: Fostering inclusion &amp; belonging within Yelp Engineering" href="https://engineeringblog.yelp.com/2021/07/engineering-career-series-fostering-inclusion-and-belonging-within-yelp-engineering.html">Fostering inclusion &amp; belonging within Yelp Engineering</a></li>
<li><a title="Engineering Career Series: Building a thriving engineering team" href="https://engineeringblog.yelp.com/2021/08/engineering-career-series-building-a-thriving-engineering-team.html">Building a thriving engineering team</a></li>
</ul></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/08/engineering-career-series-building-a-thriving-engineering-team.html</link>
      <guid>https://engineeringblog.yelp.com/2021/08/engineering-career-series-building-a-thriving-engineering-team.html</guid>
      <pubDate>Thu, 12 Aug 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Engineering Career Series: Fostering inclusion & belonging within Yelp Engineering]]></title>
      <description><![CDATA[<p><a href="https://engineeringblog.yelp.com/2021/04/engineering-career-series-hiring-a-diverse-team-by-reducing-bias.html">Recruiting</a>, <a href="https://engineeringblog.yelp.com/2021/05/engineering-career-series-using-structured-interviews-to-improve-equity.html">hiring</a>, and <a href="https://engineeringblog.yelp.com/2021/05/engineering-career-series-how-we-onboard-engineers-across-the-world-at-yelp.html">onboarding</a> new employees in Engineering at Yelp is a multi-team, cross-functional effort as we have laid out in our Career Series blog posts. But once people are here, how do we retain them? While <a href="https://engineeringblog.yelp.com/2021/06/engineering-career-series-career-paths-for-engineers-at-yelp.html">career advancement</a>, <a href="https://engineeringblog.yelp.com/2021/06/engineering-career-series-technical-leadership-at-yelp.html">technical leadership</a>, and <a href="https://engineeringblog.yelp.com/2021/07/engineering-career-series-ensuring-pay-equity-and-career-progression-in-yelp-engineering.html">pay equity</a> are all important components to building a happy engineering team, we believe fostering inclusion and belonging is also a fundamental component in supporting, and thus retaining, people. While this is an area that’s received a lot of recent attention in the tech industry, we’ve prioritized inclusion and belonging for many years because we want all of our colleagues to feel like an integral part of our team and share their unique perspectives.</p><p>In this post, we’ll discuss some of the building blocks that make up our inclusion and belonging programs, many of which were developed in partnership with Yelp’s Culture team.</p><p><strong>Employee Resource Groups</strong></p><p>One of the ways we support belonging is through Yelp Employee Resource Groups (YERGs), which are groups of employees that come together to support each other and other employees by way of community, programming, and events. 
The groups can be formed around shared social identities, characteristics, or life experiences. Yelp has <a href="https://www.yelp.careers/us/en/culture-at-yelp">many YERGs</a> including YelpCares (community, non-profit volunteering), YelpParents, Women at Yelp (WAY), VetConnect, and Yelp Asian Pacific Islanders (YAPI). Three of our YERGs were started by members of our Engineering team: Awesome Women in Engineering (AWE), ColorCoded, and Neurodiversity &amp; Mental Health.</p><p>Each YERG is led by several employees who facilitate programming and support the group. We also use an executive sponsorship model for all of our YERGs, where a senior leader provides mentorship, guidance, and connections across departments; removes any blockers the group may face as they run their programming; and works with the leads to champion and promote the group company-wide.</p><p><strong>Awesome Women in Engineering (AWE)</strong></p><p>AWE started as a social group in April 2013 before employee resource groups came into existence at Yelp. The founding leaders of AWE organized several activities like networking lunches, book clubs, and public speaking workshops, and coordinated with Yelp’s Recruiting team to send AWE members to represent Yelp at external events (e.g., the <a href="https://ghc.anitab.org/about/">Grace Hopper Conference</a>). The next phase was to build a stronger community of women engineers at Yelp.</p><p>As a resource group, AWE provides support for and organizes activities targeted towards professional growth for women engineers and allies, helping maximize their potential at Yelp and beyond. 
AWE has grown considerably these last eight years and offers programs focused on being champions for women in Engineering, public speaking, internal and external networking, allyship, mentorship, and hosting internal events.</p><p>AWE and our other YERGs provide avenues for engineers to take on leadership opportunities by coordinating an event, facilitating a discussion about a book, or becoming a program lead. YERGs allow engineers to work on these skills in a safe and supportive environment with a focus on growth instead of perfection.</p><p>As a result of our remote work environment over the last year, AWE has transitioned to hosting its events virtually. This has allowed employees across time zones and countries to join the group and participate in events they could not have attended previously. As we continue supporting engineers working in multiple time zones, we intend to continue making programming available virtually.</p><p><strong>ColorCoded</strong></p><p>Back in 2016, a few Yelp engineers in San Francisco started ColorCoded as a social group with the goal of supporting engineers of color at Yelp. Over the last five years, ColorCoded has grown to become one of Yelp’s employee resource groups, cultivating a community of engineers of color and their allies at Yelp. The group’s executive sponsor, employee leadership team, and members work in partnership to provide professional development and leadership activities, networking events, and community engagement opportunities.</p><p>Before the COVID-19 pandemic, ColorCoded organized various in-person activities in San Francisco, such as résumé workshops with Bay Area nonprofits, employee panel discussions, lunch book discussions, and more. With the onset of the pandemic, transition to remote work, and the Black Lives Matter movement in 2020, ColorCoded shifted programming to better meet the needs of our community members and expanded our reach to include more members from other Yelp offices. 
Five programs were established: Community Check-Ins, Race Matters, Virtual Happy Hours, Ally Skills Workshops, and Community Voices. Race Matters is a monthly discussion series where Yelp employees learn and discuss the historical context of racism and how racism affects Black, Indigenous, and People of Color (BIPOC) communities in the United States, and we’re hoping to expand this programming to cover the historical context of other countries where we have employees in the future. Community Check-Ins are another monthly discussion series where members gather together and discuss current events.</p><p>At times, ColorCoded also partners with other employee resource groups, such as Awesome Women in Engineering (AWE) and Yelp Asian Pacific Islanders (YAPI), to put on events together.</p><p><strong>Neurodiversity and Mental Health</strong></p><p>Neurodiversity is a movement championing the premise that autism and other conditions like attention-deficit/hyperactivity disorder, dyslexia, anxiety, post-traumatic stress disorder, dyscalculia, and apraxia are normal variations of the human brain and thought process. As natural variations, these differences should be celebrated and supported.</p><p>This recently created YERG is made up of employees who are neurodiverse, have diagnosed or undiagnosed mental health conditions, care about their mental health, and/or are allies to these individuals. The group works to create a more inclusive environment for neurodiverse individuals and individuals with mental health conditions. Though starting within Engineering, the group now has representation from departments across Yelp.</p><p>Our most successful event to date was an open roundtable discussion towards the beginning of the pandemic. The adjustment to regional lockdowns brought an additional focus on mental health and how best to support each other. In the roundtable event, we welcomed employees to discuss how they were dealing with the transition. 
We are currently planning a panel with a few speakers to share their experiences at Yelp, incorporating neurodiversity in our existing diversity training, working on new training for managers, and raising awareness about existing tools Yelp provides to employees to foster wellness.</p><p><strong>Work-life balance</strong></p><p>Historically, Yelp Engineering leaders have championed work-life balance and have long valued the well-being of their teams. This is reflected in our <a href="https://engineeringblog.yelp.com/2021/06/engineering-career-series-career-paths-for-engineers-at-yelp.html">career leveling rubric</a>, with a dimension dedicated to sustaining and improving the well-being of our colleagues, as well as an expectation of our <a href="https://engineeringblog.yelp.com/2021/07/engineering-career-series-how-we-think-about-engineering-management.html">engineering managers</a>. In response to the pandemic, we implemented new policies to best support work-life balance for our employees in Engineering.</p><p>The first focuses on offering flexibility around when you work. Employees living in different time zones with different schedules shouldn’t need to fully align all of their working hours. Within Engineering, we’ve implemented a flexible working policy that introduces the concept of “core hours,” observed from 11am to 3pm in one’s local time, with the balance of the day’s hours worked earlier or later. However, even these core hours are flexible and can be adjusted to accommodate unique needs of individuals and teams, such as a parent needing to pick up their child from daycare over lunch. 
This practice offers some form of predictability for collaborating teammates and other teams to know when they can expect colleagues to be available while still giving employees the autonomy to set a schedule that works best for them.</p><p>Another new policy we implemented is the option for most full-time employees in Engineering to work 80% of a full-time workload for 80% of their full-time pay, providing engineers another opportunity to adapt their work schedule to suit their current life priorities and preferences.</p><p><strong>Distributed workforce</strong></p><p>The COVID-19 pandemic showed us that we can function as a company with nearly all of our employees working remotely. In some cases, people have reported being more productive without the usual in-office distractions and noise. We also know that for some, especially parents or other caregivers, being home and removing commutes has allowed them to continue to provide care and work full-time.</p><p>Even when offices reopen, <a href="https://blog.yelp.com/2021/05/returning-to-yelp-offices-in-2021">Yelp is giving employees a choice to continue working as a distributed remote workforce</a>, unless their role specifically requires otherwise. A new relocation policy offers clear guidance around relocating to new locations within one’s country or between the countries in which Yelp operates (Canada, Germany, UK, and USA).</p><p><strong>Putting it all together</strong></p><p>Through YERGs and the policies mentioned above, we are making the space and providing the opportunities for folks to bring their full authentic selves to Yelp and have the flexibility to work in a way that works best for them. We are proud of our investments in hiring great people in Engineering and supporting their sense of inclusion and belonging once they have joined us. That said, our work isn’t done. We will continue to evolve and incorporate a multi-faceted approach to inclusion and belonging. 
We will continue to offer training in diversity, equity &amp; inclusion, promote and support YERGs, and find ways, like flexible working arrangements, to support engineers in doing their best work.</p><p><strong>Next up</strong></p><p>Yelp CTO, Sam Eaton, will wrap up our Engineering Career Series. If you’d like to join an organization passionate about inclusion and belonging (or any of the other topics we’ve covered), <a href="https://www.yelp.careers/us/en/c/engineering-jobs">we’re hiring</a>!</p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/07/engineering-career-series-fostering-inclusion-and-belonging-within-yelp-engineering.html</link>
      <guid>https://engineeringblog.yelp.com/2021/07/engineering-career-series-fostering-inclusion-and-belonging-within-yelp-engineering.html</guid>
      <pubDate>Thu, 29 Jul 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Engineering Career Series: Ensuring Pay Equity & Career Progression in Yelp Engineering]]></title>
      <description><![CDATA[<p>At Yelp, we care deeply about ensuring all employees are compensated fairly for their contributions, regardless of their gender, race, and ethnicity. Within Yelp Engineering, we work hard to achieve equal pay for equal work through a combination of tactics:</p><ul><li>Well-defined career levels and corresponding pay bands</li>
<li>A systematic levels calibration process across teams</li>
<li>Transparency of our outcomes with the entire Engineering team</li>
</ul><p>In a <a href="https://engineeringblog.yelp.com/2021/06/engineering-career-series-career-paths-for-engineers-at-yelp.html">previous blog post</a>, we described how we think about career progression and leveling. Each level within Engineering and Engineering Management has an associated merit band, equity band, and cash bonus target based on location. We use our leveling framework to help guide managers to place their employees at a position within those bands. For example, an engineer recently promoted to IC3 would likely fall towards the lower end of the IC3 level framework and pay band.</p><p>In order to ensure managers interpret and apply our career leveling framework consistently, we run calibration conversations on a quarterly basis. Calibration conversations are discussions among management peer groups about performance expectations of individuals on their team. These calibration conversations contribute towards more equitable pay by making sure expectations are consistent across teams.</p><p>Our frameworks and processes would be meaningless if we didn’t closely analyze our compensation data to ensure they are actually working. Within our Engineering org, we have committed to conducting a pay equity analysis annually and sharing the results internally with the entire Engineering team. We’re pleased to share some highlights from our latest analysis below.</p><p>As we look into our data, a few things immediately come to mind. First, the data is a snapshot of a point in time and is not entirely complete as we don’t have demographic data for all employees. Our analysis includes our full-time, individual contributor Engineering employees who have voluntarily provided their race, ethnicity and/or gender information, which is about three-quarters of this population. Second, we don’t expect pay to be identical for all people within a level, as we mentioned above when we talked about how we think about pay. 
Small pay gaps are to be expected due to a number of factors that are unrelated to race, ethnicity, or gender – for example, performance, impact, and growth within level.</p><p>Third, we show gender in terms of women and men because we do not yet have enough data to represent a more nuanced view of gender identity, but we are working on improving our data to represent this more completely in the future. With respect to race and ethnicity, we have combined Black, Hispanic, Native Hawaiian/Other Pacific Islander, and American Indian or Alaska Native employees into an under-represented minority (URM) group in the data due to small sample size.</p><p>So how do we do the analysis? It’s important to understand <em>compa-ratio</em>. Our salary bands were developed by our compensation team and leaders, guided by our pay philosophy and the competitive market landscape. Compa-ratio is computed by dividing each individual’s salary by the middle of the salary band for that role and level (the 50th percentile). We use the median compa-ratio: in the charts below, the red line represents the median compa-ratio of the population at that level, and the gray bars represent quartiles.</p><p>Without further ado, let’s show you some data! First, we’ll start with gender. As you can see from the chart below, the median compa-ratio for men is 100% while women’s is 101%, which means, on average, men and women are paid nearly identically. Men do outnumber women in Engineering, and women’s distribution sits slightly higher.</p><div class="c1"><img src="https://engineeringblog.yelp.com/images/posts/2021-07-15-engineering-career-series-ensuring-pay-equity-and-career-progression-in-yelp-engineering/pay-equity-1.jpg" alt="image" /></div><p>Next, let’s talk about race and ethnicity. 
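The compa-ratio calculation described above can be sketched in a few lines. Note that the salaries and band midpoints below are invented for illustration only; they are not Yelp data.

```python
import numpy as np

# Hypothetical salaries for four employees (invented numbers, not Yelp data).
salaries = np.array([118_000, 125_000, 131_000, 140_000])

# The 50th percentile of the salary band for each person's role and level
# (also invented for this sketch).
band_midpoints = np.array([120_000, 120_000, 135_000, 135_000])

# Compa-ratio: each individual's salary divided by their band's midpoint.
compa_ratios = salaries / band_midpoints

# The "red line" statistic: the median compa-ratio of the population.
median_compa_ratio = np.median(compa_ratios)

# The quartiles that the "gray bars" summarize.
quartiles = np.percentile(compa_ratios, [25, 50, 75])
```

With these invented numbers the median compa-ratio lands just above 1.0, i.e. slightly above the middle of the bands.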
As you’ll see in the chart below, all groups have compa-ratio medians within 2% of each other, and the distributions of employees around the compa-ratio medians appear relatively equal.</p><div class="c1"><img src="https://engineeringblog.yelp.com/images/posts/2021-07-15-engineering-career-series-ensuring-pay-equity-and-career-progression-in-yelp-engineering/pay-equity-2.jpg" alt="image" /></div><p>This is just a high-level snapshot of the analysis we do at Yelp. We also dig deep into level progression to ensure employees progress across our levels at similar rates regardless of gender, race, or ethnicity. As an example, the chart below represents gender by level and tenure. It shows that progression through levels occurs at similar rates for men and women. At 2 years of tenure, the majority of our employees sit at IC1, IC2, or IC3. By 5 years, the majority of employees are at IC3, IC4, IC5, or IC6.</p><div class="c1"><img src="https://engineeringblog.yelp.com/images/posts/2021-07-15-engineering-career-series-ensuring-pay-equity-and-career-progression-in-yelp-engineering/pay-equity-3.jpg" alt="image" /></div><p>As we cut the data, we run into smaller sample sizes that can result in disproportionate differentials. Whenever we find outliers, our leadership team looks at pay and level information on a case-by-case basis to ensure the outlier is due to legitimate, nondiscriminatory reasons like scope of impact, and takes action where needed through out-of-cycle level or pay adjustments to achieve equity and fairness. We’ve learned through this analysis that our framework and methodology have resulted in equitable pay.</p><p>We are always trying to improve our pay equity analysis. We continue to iterate on how we look at total compensation to ensure equitable pay that attracts, motivates, and retains Engineering talent. We also have an opportunity to better report on gender identity in a non-binary way. 
We share our pay equity data with our employees annually and typically review the data twice a year (although with the challenges we faced during the pandemic, we prioritized investments in resources to assist our employees in 2020 and skipped the analysis that year). This ongoing analysis coupled with transparency and communication not only builds trust in our leaders and processes, but also keeps us accountable for our pay practices.</p><p><strong>Up next: Fostering Belonging and Inclusion at Yelp</strong></p><p>Continuing the conversation of gender identity, race, and ethnicity at Yelp, Trisha Walsh, Tenzin Kunsal, and Ian Fijolek will talk about our Employee Resource Groups and efforts made to promote a healthy work/life balance as well as mental health. If you’d like to join a team passionate about pay equity and inclusion, <a href="https://www.yelp.careers/us/en/c/engineering-jobs">we’re hiring</a>!</p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/07/engineering-career-series-ensuring-pay-equity-and-career-progression-in-yelp-engineering.html</link>
      <guid>https://engineeringblog.yelp.com/2021/07/engineering-career-series-ensuring-pay-equity-and-career-progression-in-yelp-engineering.html</guid>
      <pubDate>Thu, 15 Jul 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Why Yelp's hiring strategy in Canada no longer includes offices]]></title>
      <description><![CDATA[<p>When Yelp first started building engineering and product teams in Canada in 2019, our plan was to create a workforce based out of our Toronto office. Over the past year as we adapted to being an entirely remote workforce we realized, like many companies, that people don’t need to work in offices to be collaborative and successful. In fact, through remote work surveys sent to our employees, we found that most people are happier and more productive when they have the option to work remotely.</p><p>We’re now hiring engineering and product roles as fully remote in Canada, as well as in all of our locations across North America and Europe. We plan to <a href="https://blog.yelp.com/2021/05/returning-to-yelp-offices-in-2021">open our offices</a> worldwide this year, allowing employees to decide how many days per week, if any, they’d like to work from an office. As we continue to grow while working remotely, we’ve remained focused on how to best support employees. In addition to our <a href="https://www.yelp.careers/us/en/benefits-at-yelp-in-canada">standard benefits</a>, we’re offering a $100 monthly reimbursement and a one-time payment of $450 to support the costs of working from home.</p><p><strong>Growing as a distributed workforce</strong></p><p>The freedom to work from anywhere within the <a href="https://www.yelp.careers/us/en">locations we hire in</a> — including Ontario, British Columbia, Quebec, and Alberta — has allowed us to reach a wider pool of individuals from a broader variety of backgrounds. 
Since about half of our global technical hires will be based in Canada this year, we’re excited to bring our engineering and product opportunities to local communities and welcome more employees with diverse experiences.</p><p>As Yelp’s technical teams become increasingly distributed, we’re being intentional about creating a culture where everyone can maintain a healthy work-life balance and have equal opportunities for impact, growth, and success. We’re taking a close look at our communication styles and creating best practices for collaborating across time zones. We’re also enabling people to make connections both inside and outside of their own organizations, as well as continuing to provide valuable mentorship opportunities. For example, we host social events for our new hires, provide a dedicated mentor matching program, and encourage participation in <a href="https://engineeringblog.yelp.com/2017/02/open-sourcing-yelp-beans.html">Yelp Beans</a> — an internal tool we use to help employees meet colleagues within the company.</p><p>Employees in all Yelp locations have the support of our many Employee Resource Groups (ERGs) to help make meaningful connections. These include ERGs focused on our employees in the engineering and product space, such as <a href="https://www.yelp.com/engineering/awe">Awesome Women in Engineering</a>, Women in Product, and ColorCoded, just to name a few. Our goal is to enable all employees to bring their authentic selves to work and to be successful, regardless of their location or background.</p><p><strong>Bringing together diverse cultures to build something greater</strong></p><p>Since Yelp began seeking technical talent in Canada, our goal has been to create a workforce that reflects the demographics of the Canadian population. By increasing the locations people can choose to work from, we’re able to create an even more diverse organization that brings new expertise to help us solve increasingly complex challenges. 
We’re focused on proactively growing and cultivating an employee community based on a variety of backgrounds, talents, and perspectives. To achieve our goals, our technical talent team partners closely with our engineering and product teams to ensure we’re <a href="https://engineeringblog.yelp.com/2021/04/engineering-career-series-building-a-happy-diverse-and-inclusive-engineering-team.html">building happy, diverse, and inclusive teams</a>, <a href="https://engineeringblog.yelp.com/2021/04/engineering-career-series-hiring-a-diverse-team-by-reducing-bias.html">hiring a diverse team by reducing bias</a>, and using structured <a href="https://engineeringblog.yelp.com/2021/05/engineering-career-series-using-structured-interviews-to-improve-equity.html">interviews</a> and <a href="https://engineeringblog.yelp.com/2021/06/engineering-career-series-career-paths-for-engineers-at-yelp.html">promotion practices</a> to improve equity.</p><p>In our technical hiring, we’ve set a goal to exceed Canada’s national average with regards to the representation of women and underrepresented minorities in the tech community. Between Q4 2020 and Q2 2021, we’ve consistently met these goals in our technical recruiting efforts, including hiring a higher percentage of underrepresented minorities than is represented across the entire country of Canada, <a href="https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/page.cfm?Lang=E&amp;Geo1=PR&amp;Code1=01&amp;Geo2=PR&amp;Code2=01&amp;Data=Count&amp;SearchText=canada&amp;SearchType=Begins&amp;SearchPR=01&amp;B1=All&amp;TABID=1">according to their 2016 census</a>. Our technical recruiting team, itself a group of individuals representing a variety of backgrounds, is passionate about increasing the representation of underrepresented groups in tech — not only because it’s a proven smart business move, but also because it’s morally the right thing to do.</p><p><strong>Sound like a fit? 
We’d like to get to know you.</strong></p><p>Yelp is looking for Product Managers, Software Engineers, Engineering Managers, Data Scientists, Business System Analysts, Product Designers, and more to join our growing team in Canada. If you’re looking to work at a company that values <a href="https://blog.yelp.com/category/diversity-and-inclusion">diversity, inclusion, belonging</a> and work-life balance, we’d love to hear from you!</p><p>Check out our <a href="https://www.yelp.careers/us/en/search-results">careers site</a> to see our current opportunities.</p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/07/why-yelps-hiring-strategy-in-canada-no-longer-includes-offices.html</link>
      <guid>https://engineeringblog.yelp.com/2021/07/why-yelps-hiring-strategy-in-canada-no-longer-includes-offices.html</guid>
      <pubDate>Mon, 12 Jul 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Analyzing Experiments with Changing Cohort Allocations]]></title>
      <description><![CDATA[<p>Have you ever run an A/B test and needed to change cohort allocations in the middle of the experiment? If so, you might have observed some surprising results when analyzing your metrics. Changing cohort allocation can make experiment analysis tricky and even lead to false conclusions if one is not careful. In this blog post, we show what can go wrong and offer solutions.</p><p>At Yelp, we are constantly iterating on our products to make them more useful and engaging for our customers. In order to ensure that the Yelp experience is constantly improving, we run A/B tests prior to launching a new version of a product. We analyze metrics for the new version versus the previous version, and ship the new version if we see a substantial improvement.</p><p>In our A/B tests, we randomly choose what product version a given user will see — the new one (for users in the test cohort) or the current one (for those in the status quo cohort). In order to make the new version’s release as safe as possible, we often gradually ramp up the amount of traffic allocated to the test cohort. For example, we might start with the test cohort at 10%. During this period, we would look for bugs and monitor metrics to make sure there are no precipitous drops. If things look good, we would ramp our test cohort allocation up, perhaps going to 20% first before ultimately increasing to 50%.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-07-06-analyzing-experiments-with-changing-cohort-allocations/cohort_ramp_up.png" alt="Example cohort allocation changes throughout an experiment" /><p class="subtle-text"><small>Example cohort allocation changes throughout an experiment</small></p></div><p>In this situation, we have multiple runs of the experiment with different cohort allocations in each run. This blog post will show how to properly analyze data from all runs of the experiment. 
We will discuss a common pitfall and show a way to avoid it. We will then frame this problem in the language of causal inference. This opens up numerous causal inference-based approaches (we survey a couple) that can yield further insight into our experiments.</p><p>Comparing metrics between cohorts can get tricky if cohort allocations change over time. In this section, we show an example where failing to account for the changing cohort allocation can cause one to get misleading results.</p><p>For concreteness, suppose that we are trying to improve our home and local services experience, with a view towards getting more users to <a href="https://blog.yelp.com/2016/04/yelp-request-a-quote">request a quote</a> for their home projects on Yelp. The metric we are trying to optimize in this example is the conversion rate — what fraction of users visiting home and local services pages decides to actually request a quote.</p><p>We run an A/B test to ensure that the new experience improves conversion versus the status quo. We have two runs, one each in the winter and the spring; in the second run, we increase the fraction of traffic allocated to the test cohort from 10% to 50%. The cohort allocations and true per-cohort conversion rates in each experiment run are as in the table below.</p><table><thead><tr><th>Time period</th>
<th>Experiment Run</th>
<th>Cohort</th>
<th>% of traffic assigned to cohort</th>
<th>Conversion Rate</th>
</tr></thead><tbody><tr><td>Winter</td>
<td>1</td>
<td>Status Quo</td>
<td>90%</td>
<td>0.15</td>
</tr><tr><td>Winter</td>
<td>1</td>
<td>Test</td>
<td>10%</td>
<td>0.15</td>
</tr><tr><td>Spring</td>
<td>2</td>
<td>Status Quo</td>
<td>50%</td>
<td>0.30</td>
</tr><tr><td>Spring</td>
<td>2</td>
<td>Test</td>
<td>50%</td>
<td>0.30</td>
</tr></tbody></table><p>Notice that in this example, the conversion rate is higher in the spring. This can happen, for example, if home improvement projects are more popular in the spring than the winter, causing a higher fraction of visitors to use the Request a Quote feature. Importantly, there is no conversion rate difference between the two cohorts.</p><p>We will now simulate a dataset that one might obtain when running this experiment and show that if we fail to account for the changing cohort allocation, we will be misled to believe that the test cohort has a higher conversion rate.</p><p>In our simulated dataset, we will have ten thousand samples for each experiment run. A given sample will include information about the experiment run, cohort, and whether a conversion occurred. The cohort is randomly assigned according to the experiment run’s cohort allocation. The conversion event is sampled according to the true conversion rate in the given experiment run and cohort.</p><div class="highlighter-rouge highlight"><pre>import numpy as np
import pandas as pd
def simulate_data_for_experiment_run(
    total_num_samples: int,
    experiment_run: int,
    p_test: float,
    status_quo_conversion_rate: float,
    test_conversion_rate: float
):
    experiment_data = []
    for _ in range(total_num_samples):
        cohort = np.random.choice(
            ["status_quo", "test"],
            p=[1 - p_test, p_test]
        )
        if cohort == "status_quo":
            conversion_rate = status_quo_conversion_rate
        else:
            conversion_rate = test_conversion_rate
        # 1 if there is a conversion; 0 if there isn't
        conversion = np.random.binomial(n=1, p=conversion_rate)
        experiment_data.append(
            {
                'experiment_run': experiment_run,
                'cohort': cohort,
                'conversion': conversion
            }
        )
    return pd.DataFrame.from_records(experiment_data)
</pre></div><div class="highlighter-rouge highlight"><pre>experiment_data = pd.concat(
    [
        simulate_data_for_experiment_run(
            total_num_samples=10000,
            experiment_run=1,
            p_test=0.1,
            status_quo_conversion_rate=0.15,
            test_conversion_rate=0.15,
        ),
        simulate_data_for_experiment_run(
            total_num_samples=10000,
            experiment_run=2,
            p_test=0.5,
            status_quo_conversion_rate=0.30,
            test_conversion_rate=0.30,
        ),
    ],
    axis=0,
)
</pre></div><p>The most straightforward way one might try to estimate the per-cohort conversion rate is to take the mean of the conversion column for all samples in each cohort. Effectively, this gives the number of conversions per cohort divided by the total number of samples in the cohort.</p><div class="highlighter-rouge highlight"><pre>def get_conversion_rate_for_cohort(
    experiment_data: pd.DataFrame,
    cohort: str
):
    experiment_data_for_cohort = experiment_data[experiment_data.cohort == cohort]
    return experiment_data_for_cohort.conversion.mean()
</pre></div><div class="highlighter-rouge highlight"><pre>get_conversion_rate_for_cohort(experiment_data, "test")
0.2768492470627172
get_conversion_rate_for_cohort(experiment_data, "status_quo")
0.20140431324783262
</pre></div><p>We observe a substantial difference between the conversion rate estimates for the status quo and test cohorts.</p><p>To get some intuition about whether this difference is statistically significant, let us create five thousand simulated datasets with the same parameters (cohort allocations and conversion rates). For each dataset, we will estimate conversion for the two cohorts, and look at the estimates’ distribution. The table below reports the mean and quantiles of the five thousand estimates of the status quo and test conversion rates. The table shows that the distributions of the estimated conversion rates for the two cohorts are very different, suggesting that the difference we observed is indeed statistically significant.</p><table><thead><tr><th> </th>
<th>Mean</th>
<th>2.5th percentile</th>
<th>50th percentile</th>
<th>97.5th percentile</th>
</tr></thead><tbody><tr><td>Status Quo</td>
<td>0.204</td>
<td>0.197</td>
<td>0.204</td>
<td>0.210</td>
</tr><tr><td>Test</td>
<td>0.275</td>
<td>0.246</td>
<td>0.275</td>
<td>0.287</td>
</tr></tbody></table><p>Recall, however, that there is no conversion rate difference between the cohorts: for both experiment runs, the test and status quo cohorts have equal conversion rates. Thus, if we use the simple approach described above to analyze the experiment, we would be misled and think that the test cohort outperforms the status quo.</p><p>What is going on? What we are seeing is a consequence of the fact that the average conversion rates and cohort allocations both change between experiment runs. For the test cohort, the majority of samples come from the higher-conversion period of the second experiment run. The opposite is true for the status quo cohort. So, the calculated conversion rate is higher for the test cohort than for the status quo cohort.</p><p>(In fact, it is not hard to adapt this example so that the test cohort has a lower conversion rate than the status quo cohort in each experiment run, but a higher calculated conversion rate overall. In this case, we might be misled and release the test experience, despite the fact that it harms conversion. This phenomenon is an example of <a href="https://en.wikipedia.org/wiki/Simpson%27s_paradox">Simpson’s Paradox</a>.)</p><p>How do we correctly compare the conversion for the experiment cohorts? We first observe that the average conversion rate over the entire dataset is 0.225, the average of the conversion rates in each experimental run. (There are ten thousand samples in each run, so we can take a simple average. If the number of samples per run were different, we would instead calculate the overall conversion rate using a weighted average; the weights would be the number of samples in each experiment run.) 
Since the two cohorts have the same conversion rate, the method used for estimating it should arrive at this number for both (up to some statistical noise).</p><p>The previous method reported a higher conversion rate for the test cohort because it had disproportionately many samples from the second experiment run. To correct for this imbalance, let us instead try to calculate per-cohort conversion rates separately for each experiment run, and then combine them with a weighted average. This approach is implemented below:</p><div class="highlighter-rouge highlight"><pre>def per_experiment_run_conversion_rate_estimator_for_cohort(
    data: pd.DataFrame,
    cohort: str,
    experiment_runs: List[int],
):
    data_for_cohort = data[data.cohort == cohort]
    conversion_rates = []
    total_num_samples = []
    for experiment_run in experiment_runs:
        conversion_rates.append(
            data_for_cohort[
                data_for_cohort.experiment_run == experiment_run
            ].conversion.mean()
        )
        total_num_samples.append(
            (data.experiment_run == experiment_run).sum()
        )
    return np.average(conversion_rates, weights=total_num_samples)
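</pre></div><p>As a sanity check, the same per-run correction can also be written as a pandas groupby. On a noise-free toy dataset, constructed here for illustration with the running example’s allocations and rates (the construction is ours, not from the original analysis code), both cohorts recover the true overall conversion rate of 0.225:</p><div class="highlighter-rouge highlight"><pre>import numpy as np
import pandas as pd

# Noise-free toy data: each run/cohort gets exactly size * rate conversions.
frames = []
for run, p_test, rate in [(1, 0.1, 0.15), (2, 0.5, 0.30)]:
    for cohort, share in [("status_quo", 1 - p_test), ("test", p_test)]:
        size = int(10000 * share)
        k = int(size * rate)
        frames.append(pd.DataFrame({
            "experiment_run": run,
            "cohort": cohort,
            "conversion": [1] * k + [0] * (size - k),
        }))
toy_data = pd.concat(frames, ignore_index=True)

# Per-run, per-cohort means, then a weighted average over runs,
# weighted by each run's total sample count.
rates = toy_data.groupby(["experiment_run", "cohort"]).conversion.mean().unstack()
weights = toy_data.groupby("experiment_run").size()
round(np.average(rates["test"], weights=weights), 4)
0.225
round(np.average(rates["status_quo"], weights=weights), 4)
0.225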
</pre></div><p>In the table below, we report statistics about conversion rate estimates on five thousand simulated datasets:</p><table><thead><tr><th> </th>
<th>Mean</th>
<th>2.5th percentile</th>
<th>50th percentile</th>
<th>97.5th percentile</th>
</tr></thead><tbody><tr><td>Status Quo</td>
<td>0.225</td>
<td>0.218</td>
<td>0.225</td>
<td>0.232</td>
</tr><tr><td>Test</td>
<td>0.225</td>
<td>0.213</td>
<td>0.225</td>
<td>0.238</td>
</tr></tbody></table><p>We see that the estimates for the test and status quo conversion rates are close to the true value on average, and are close to each other.</p><p>In the rest of this blog post, we will provide a more theoretical justification for why this method, and another one based on regression, are appropriate for analyzing experiments where cohort allocations change over time. This will involve interpreting our problem in the language of causal inference.</p><h2 id="connection-with-causal-inference">Connection with causal inference</h2><p>The issues we faced when analyzing experiments with changing cohort sizes have a connection with causal inference. In this section, we will explore this connection, which will help us gain a better understanding of methods used to correctly calculate conversion rate (including the per-experiment run computation in the previous section).</p><h2 id="what-are-we-trying-to-measure">What are we trying to measure?</h2><p>We are trying to measure the causal effect on conversion from being in the test (versus the status quo) cohort (also known as the treatment effect). To do this, we imagine taking all the samples in our dataset. What fraction would convert if all of them were in the test cohort (call this Y<sub>T</sub>)? What fraction would convert if all were in the status quo cohort (call this Y<sub>SQ</sub>)? The difference between the two is the average treatment effect for the dataset.</p><p>Unfortunately, it is impossible to directly measure the average treatment effect as described above. Any given sample is in one cohort but not both, so it is impossible to know that sample’s outcome if it were in the other cohort. The calculation relies on some counterfactual data, e.g., for a sample in the status quo cohort, would it have converted had it been in the test cohort?
This is known as the <a href="https://en.wikipedia.org/wiki/Rubin_causal_model#The_fundamental_problem_of_causal_inference">fundamental problem of causal inference</a>.</p><p>However, we can use our samples to estimate the average treatment effect.</p><h2 id="estimating-average-treatment-effect">Estimating average treatment effect</h2><p>The first attempt to estimate the average treatment effect was computing the average conversion rate per cohort. We computed the probability of conversion given that the cohort was test or status quo, and subtracted the two. We found that being in the test cohort was correlated with higher conversion. This correlation does not imply causation, however. The reason that being in the test cohort is correlated with conversion is that, given our cohort allocations, a user being in the test cohort means that they are more likely to be in the higher-conversion second experiment run.</p><p>Said another way, the experiment run is a confounding variable that produces a non-causal association between cohort and conversion. This is known as <a href="https://catalogofbias.org/biases/confounding/">confounder bias</a>. To properly estimate the causal effect of being in the test cohort, we have to control for the confounder. There are a number of standard ways of doing this in the causal inference literature (e.g. Section 3.2 of [1]).</p><h3 id="separate-conversion-rate-calculations-per-confounder-value">Separate conversion rate calculations per confounder value</h3><p>This approach tries to correct for confounder bias by computing per-cohort conversion rates separately for each value of the confounder (experiment run). To get the overall conversion rate for each cohort, we take a weighted average of the conversion rates per experiment run, with weights being the relative prevalence of each confounder value in the dataset. (See, for example, <a href="http://bayes.cs.ucla.edu/BOOK-2K/ch3-3.pdf">Equation 3.21</a> in [2].) 
This is equivalent to weighting by the total number of samples (test and status quo) in each experiment run. This gives estimates for Y<sub>T</sub> and Y<sub>SQ</sub>, and we can subtract them to get an estimate for the average treatment effect.</p><p>We did precisely this when we tried to properly calculate conversion rate per cohort (using the <code class="highlighter-rouge">per_experiment_run_conversion_rate_estimator_for_cohort</code> function). This approach makes sense because, for a given experiment run, the per-cohort calculation gives us an estimate of the conversion rate for that experiment run if all samples were in the given cohort (this relies on the fact that users are assigned at random to the status quo or test cohort). Then, the weighted average step gives us an estimate of what the conversion rate would be over the entire dataset (all experiment runs).</p><h3 id="including-confounding-variable-in-a-regression">Including the confounding variable in a regression</h3><p>Another approach for controlling for the confounder is to build a regression model for the outcome variable (conversion) as a function of the treatment variable (cohort, specifically a dummy variable encoding whether the sample is in the test cohort). If we simply regress conversion on the test cohort dummy variable, we will see a positive regression coefficient, which may lead us to conclude there is a positive treatment effect. However, in our running example (where the treatment effect is zero), there will be a positive coefficient just because being in the test cohort is correlated with conversion, which happens due to the presence of the confounder.</p><p>To fix this, we include the confounder as a predictor variable alongside the cohort. This will separate the conversion effects due to the confounder from those due to being in the test cohort.
The coefficient of the cohort variable will give us the average treatment effect.</p><p>Both the cohort and the experiment run are categorical variables, and we will encode them using dummy variables. For each categorical variable, we need one fewer dummy variable than the number of different values the variable can take. For our data with two cohorts and two experiment runs, the code below will create dummies for whether the user is in the test cohort and for whether they are in the second experiment run.</p><div class="highlighter-rouge highlight"><pre>import statsmodels.formula.api as smf
smf.ols(
    formula="conversion ~ C(cohort, Treatment('status_quo')) + C(experiment_run)",
    data=experiment_data
).fit().summary()
</pre></div><p>This code uses the <a href="https://www.statsmodels.org/v0.12.0/example_formulas.html">formula</a> API in <a href="https://www.statsmodels.org/v0.12.0/index.html">statsmodels</a>. It stipulates that conversion is a linear function of cohort and experiment run. The <a href="https://patsy.readthedocs.io/en/v0.5.1/categorical-coding.html">C(·) notation</a> encodes these variables as dummy variables.</p><p>The results are:</p><table><thead><tr><th>Variable</th>
<th>Coefficient</th>
<th>Standard Error</th>
</tr></thead><tbody><tr><td>Intercept</td>
<td>0.1507</td>
<td>0.004</td>
</tr><tr><td>User is in test cohort</td>
<td>0.0073</td>
<td>0.007</td>
</tr><tr><td>Second experiment run</td>
<td>0.1427</td>
<td>0.006</td>
</tr></tbody></table><p>The intercept term is approximately equal to the baseline conversion rate (in the first experiment run and status quo cohort), namely 0.15.</p><p>We see a close to zero effect from being in the test cohort; the coefficient is almost equal to its standard error. On the other hand, we see an approximately 0.15 effect from being in the second experiment run. Indeed, samples in that experiment run have a conversion of 0.3, which is 0.15 higher than the conversion rate in the first experiment run.</p><h2 id="example-with-non-zero-treatment-effect">Example with non-zero treatment effect</h2><p>We modified our running example such that the test cohort conversion rate was 0.05 higher in each experiment run than the status quo conversion rate, and tested out our two methods for computing average treatment effect.</p><table><thead><tr><th>Time period</th>
<th>Experiment Run</th>
<th>Cohort</th>
<th>% of traffic assigned to cohort</th>
<th>Conversion Rate</th>
</tr></thead><tbody><tr><td>Winter</td>
<td>1</td>
<td>Status Quo</td>
<td>90%</td>
<td>0.15</td>
</tr><tr><td>Winter</td>
<td>1</td>
<td>Test</td>
<td>10%</td>
<td>0.20</td>
</tr><tr><td>Spring</td>
<td>2</td>
<td>Status Quo</td>
<td>50%</td>
<td>0.30</td>
</tr><tr><td>Spring</td>
<td>2</td>
<td>Test</td>
<td>50%</td>
<td>0.35</td>
</tr></tbody></table><p>The overall conversion rates are 0.225 for the status quo cohort and 0.275 for the test cohort.</p><h3 id="separate-conversion-rate-per-confounder-value">Separate conversion rate per confounder value</h3><p>Running <code class="highlighter-rouge">per_experiment_run_conversion_rate_estimator_for_cohort</code> gives conversion rate estimates that are close to the actual values (0.225 and 0.275 for the status quo and test cohorts respectively).</p><h3 id="regression">Regression</h3><p>The regression gives the following coefficients:</p><table><thead><tr><th>Variable</th>
<th>Coefficient</th>
<th>Standard Error</th>
</tr></thead><tbody><tr><td>Intercept</td>
<td>0.1506</td>
<td>0.004</td>
</tr><tr><td>User is in test cohort</td>
<td>0.0567</td>
<td>0.007</td>
</tr><tr><td>Second experiment run</td>
<td>0.1431</td>
<td>0.007</td>
</tr></tbody></table><p>As before, the coefficient for the variable encoding whether the user is in the test cohort approximates the true average treatment effect (0.05). The coefficient for the variable encoding the second experiment run is approximately 0.15, once again as expected — in the second experiment run, conversions are that amount higher.</p><h2 id="simulation-study">Simulation study</h2><p>To better understand the two methods for estimating average treatment effects and the advantages of each, we ran a simulation study. In this study, we produced a large number of datasets with the same parameters and looked at the distribution of average treatment effects estimated by the two methods.</p><p>We will take a look at a particular example:</p><ul><li>Two experiment runs with 1000 samples per run (ten times lower than in the previous datasets; this helps better illustrate the statistical noise in our estimates)</li>
<li>Test cohort allocation is 10% and 50% in the two runs respectively</li>
<li>The status quo conversion rates are 0.15 and 0.30 in the two experiment runs respectively</li>
<li>The test cohort conversion rates are 0.20 and 0.35 (0.05 higher than the status quo conversion rates)</li>
</ul><p>We produced a total of 5000 datasets, and hence estimated the treatment effect 5000 times for each method.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2021-07-06-analyzing-experiments-with-changing-cohort-allocations/power_analysis_histograms.png" alt="Histograms of Estimators" /></p><p>The orange graphs in the figure above are histograms of the estimated average treatment effect for the separate conversion rate estimation method (top) and the regression method (bottom). Both distributions have means close to 0.05, the true average treatment effect, and have very similar shapes. The graphs in blue are the estimated average treatment effects for datasets that are the same as above, but where the status quo and test cohorts have the same conversion rate in each experiment run. These distributions have means of close to 0 as expected, since the true treatment effect is 0.</p><p>We have run a number of simulation studies, and have found that the two methods for estimating average treatment effect perform similarly. Overall, we believe that the most important thing is not the precise method one uses, but that one is aware of confounder bias, and takes steps to correct for it.</p><p>Nevertheless, it is good to keep the regression method in one’s tool chest because it can be easier to use in many instances. For one, software packages such as <code class="highlighter-rouge">statsmodels</code> automatically compute standard errors for regression estimates. Additionally, with regression, it is fairly straightforward to analyze more complicated experiments, such as when there are multiple confounders. (One example is if cohort allocations within experiment runs were different for different geographical regions; in this case, geographical region would be an additional confounding variable.)</p><p>Analyzing experiments where cohort allocations change over time can get a little complicated. 
Simply looking at the outcome variable for samples in the status quo and test cohorts can produce misleading results; instead, techniques that control for the confounding variable are needed. We hope that this blog post has raised awareness of this issue and provided some solutions.</p><h2 id="acknowledgements">Acknowledgements</h2><ul><li>Billy Barbaro for originally making me aware of the issue discussed in this post.</li>
<li>Alex Hsu and Shichao Ma for useful discussions and suggestions, which ultimately helped frame this causal inference interpretation of the problem.</li>
<li>Blake Larkin and Eric Liu for carefully reading over this post and giving editorial suggestions.</li>
</ul><h2 id="references">References</h2><ol><li>Joshua D. Angrist and Jörn-Steffen Pischke. <em>Mostly Harmless Econometrics</em>. Princeton University Press, 2008.</li>
<li>Judea Pearl. “Controlling Confounding Bias.” In <em>Causality</em>. Cambridge University Press, 2009 <a href="http://bayes.cs.ucla.edu/BOOK-2K/ch3-3.pdf">http://bayes.cs.ucla.edu/BOOK-2K/ch3-3.pdf</a></li>
<li>Adam Kelleher. “A Technical Primer on Causality.” <a href="https://medium.com/@akelleh/a-technical-primer-on-causality-181db2575e41#.o1ztizosj">https://medium.com/@akelleh/a-technical-primer-on-causality-181db2575e41#.o1ztizosj</a></li>
</ol><div class="island job-posting"><h3>Become an Applied Scientist at Yelp</h3><p>Want to impact our product with statistical modeling and experimentation improvements?</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/cc5ce7e2-26e9-4290-8847-c082632df9e8/Applied-Scientist-Remote?lever-source=engineering_blog" target="_blank">View Job</a></div>]]></description>
      <link>https://engineeringblog.yelp.com/2021/07/analyzing-experiments-with-changing-cohort-allocations.html</link>
      <guid>https://engineeringblog.yelp.com/2021/07/analyzing-experiments-with-changing-cohort-allocations.html</guid>
      <pubDate>Tue, 06 Jul 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Engineering Career Series: How we think about engineering management]]></title>
      <description><![CDATA[<p>In our last post we talked about technical leadership, one of the growth paths available to software engineers at Yelp. In this post we’d like to share more about engineering management, which is another path that some software engineers choose after some time in the industry. We’ll start with an explanation of what engineering management is (and isn’t), discuss our approach to management, and talk about what makes it different from engineering. We’ll also discuss how people get started on a management path at Yelp, and what we do to help our management team grow in their roles.</p><h2 id="whats-an-engineering-manager">What’s an engineering manager?</h2><p>At Yelp, every engineering manager is accountable for the overall health, execution, and vision of their team. Managers safeguard Yelp’s culture and <a href="https://www.yelp.careers/us/en">values</a>, ensuring that it’s a great place to work. We expect managers to make good decisions that are best for the company and to put the interests of the team ahead of themselves. Sometimes this means handing off an interesting project to another team that is better equipped, or finding a role on another team at Yelp for a senior engineer who’s ready for a new challenge.</p><p>Each of us on the management team is a technologist with a background in doing the work, whether that be software engineering, information technology, machine learning, or something else. This background knowledge gives us the ability to understand what our teams do on a day-to-day basis, to empathize with the challenges they face, and to entrust our team with most of the decision-making necessary to build and operate our product and its infrastructure.</p><h3 id="what-does-it-mean-to-be-accountable-for-the-health-of-a-team">What does it mean to be accountable for the health of a team?</h3><p>Managers are responsible for the motivation, well-being, and career growth of their teammates. 
A manager’s first job is to build a trusting relationship with each person on the team, usually through weekly one-on-one meetings and quarterly career planning discussions. If someone is feeling excited or proud, the manager is there for a high five. If someone is stressed or upset, or not taking care of themselves, the manager is there to listen and support, first helping them to pinpoint what they are feeling.</p><p>We rely on managers to connect the right opportunities with the right people, and you can’t do that unless you know someone’s interests, aspirations, and concerns. Someone might want to get better at public speaking, so you find a project for them in a role that involves lots of presentations to other teams. Or they might have deep social anxiety, so you make sure they <em>don’t</em> have to present, or you can find alternate opportunities for them to build confidence and communication skills. As a teammate advances in their career, it’s important to work with them to find opportunities that will engage their desire for personal growth and learning. Managers need to establish a foundation of trust and open communication if they’re going to understand what each team member loves about their work.</p><p>Managers spend a lot of time listening and paying attention, especially in our one-on-ones with our team. We want to be there for people when they are worried, frustrated, or stuck. We don’t aim to solve all of their problems, but we will offer our perspective and feedback, ask questions, and connect with others who can help them. We also want to be there to celebrate wins alongside them, and to make sure their growth and achievements are recognized.</p><h3 id="what-is-a-managers-role-in-a-teams-execution">What is a manager’s role in a team’s execution?</h3><p>Managers are responsible for ensuring the team has processes, norms, and guardrails that allow the team to operate effectively and everyone to do their best work.
Each team will have its own personalities and preferences, but teams always need to be inclusive to be effective. For example, the manager of a Scrum team might run sprint planning themselves, or they might delegate that responsibility to engineers, but in either case they need to ensure that every team member feels involved and is an active contributor. Managers are constantly on the lookout for ways to help their team improve execution, and the best managers enable their teams to do this effectively.</p><p>People do their best work when they feel a sense of agency and autonomy over <em>what</em> the work is and <em>how</em> it gets done. This often requires managers to delegate much of the day-to-day technical work, like actually building and shipping software. Although all managers need to be prepared to roll up our sleeves and help the team during an emergency, it’s important that we avoid trying to do the same day-to-day work as the team. Writing code is fun – some of us really miss it – but if a manager is regularly doing pull requests, it raises uncomfortable questions: Is it safe to leave critical feedback on their work? Does the manager not trust the team? Is it a sign the team is understaffed? Is there nothing else the manager could do that would help the team perform better?</p><p>It’s no different with higher-level technical decision-making: it’s almost always better if the engineering team can handle challenging decisions on its own, instead of relying on their manager to make the call. Should we move from one database platform to another? Is it time for us to rewrite that ancient module that nobody understands anymore? Is our time best spent trying to make the app faster, or in rewriting our app framework so more teams can work in parallel? 
When these questions come up, the manager’s job is not to <em>decide</em> but to put structure around the team’s decision-making, using our own experience to guide the discussion and keep things moving forward.</p><p>As former engineers, we might love wrestling with technical problems, but as managers we often need to set aside our own interests in writing and pushing code to better support everyone else on the team. We do, however, rely heavily on our technical backgrounds to guide conversations, validate the team’s direction and investments, support technical growth of our teammates, and ask the right questions along the way.</p><h3 id="is-vision-just-a-fancy-word-for-fancy-presentations">Is “vision” just a fancy word for fancy presentations?</h3><p>Managers connect the dots between business value and engineering projects. Many teams have product managers to identify business opportunities and study what delivers the most value to our users. Meanwhile, engineers are most familiar with the product’s current technical capabilities and weaknesses, as well as which systems are incurring technical debt and cannot be easily extended or reused. Engineering managers facilitate an ongoing conversation that aligns the next set of technical investments with business value, whether it’s iterating on a feature or system in its current state, or taking on a bigger effort to refactor or rearchitect.</p><p>One of the key ways a manager supports their team is by planning ahead. Engineering managers need to see at least a few months into the future, beyond the current backlog, and ensure the team is prepared for what’s coming up. In addition to working closely with other Engineering teams, we collaborate with teams in Product Operations, Sales, and Customer Success to understand the business’s priorities and to help make sensible trade-offs. 
We try to strike a healthy balance between incremental improvements (where returns on investment might be clear, but limited) and big bets (where the uncertainty is higher, but for a bigger potential payoff). If we are going to make a big bet, we help the team to break it down into milestones to reduce risk.</p><p>In parallel, each Engineering team keeps track of its own prioritized backlog of technical investments and engineering opportunities; the team’s manager needs to make sure there is enough time and budget for the team to make meaningful progress on that backlog. Many teams will allocate time in each planning cycle to address maintenance issues and small refactors. Larger technical investments are motivated by patterns in bugs and failures, as well as developer velocity. Often, iterating on a system over years will lead to difficulty in supporting it due to the accumulation of complexity, drift of business goals, and increases in both the volume of traffic to the system and the number of engineers who interact with the system. In recent years, we’ve tracked our largest technical investments at an engineering-wide level to ensure that teams know they are priorities.</p><p>It’s important for every engineering manager to understand the bigger picture so that they can share context with the team on where things are going and how they fit together. Every Engineering team has more possible work than it can ever complete, so it’s critical for the manager to facilitate the conversation about investment levels in various areas.</p><h3 id="bringing-it-all-together">Bringing it all together</h3><p>Health, execution, and vision are interconnected. A healthy team with a good work-life balance requires clear, consistent processes for triaging issues and making commitments.
A team that consistently makes high-quality decisions in a collaborative fashion is only possible if the team believes their manager when they say, “This decision is yours to make.” A team where every engineer feels motivated and challenged requires a manager who is thinking ahead and anticipating what comes next.</p><p>One example that touches on all three of these areas is incident response. Over the years, we’ve become much better at dealing with emergency incidents, putting in place protocols that ensure that engineers communicate and support each other through the incident, then discuss and write a (blameless) postmortem with follow-up action items (which can include longer-term engineering investments). Following an incident, managers check in with engineers who put in personal time to handle it and offer them paid time off to recover. As a management team we’ve prioritized introspection by our teams and time set aside for continuous improvement.</p><h2 id="how-does-someone-become-a-manager">How does someone become a manager?</h2><p>Sometimes an engineer will express interest in management and explore it with their leadership team; other times, we’ll see potential in someone and encourage them to consider it. We know that there is a lot of variation in effective leadership styles, and in some cases it has taken years of coaching and encouraging an engineer for them to give management a try. To ensure that we’re being open-minded about who can be a manager, we continue to develop leadership training for all engineers, not just ones who have self-selected into the management path, and we are asking all managers to have deeper conversations with their level IC3 reports on career development options, since this is where we typically see branching between the management and technical leadership career paths.</p><p>In any case, we only consider engineers who’ve already demonstrated some of the skills required to be a good manager.
That could be mentoring other engineers on the team and helping train new hires. It could be leading projects and keeping track of team initiatives. We want to be sure that anyone under consideration for management understands what the role means at Yelp and has shown themselves to be a role model for Yelp’s values.</p><p>Management roles are not unlimited; they become available through team growth, reorganizations, and (sometimes) departures. This means that stepping into a manager role often involves changing teams. We think this is healthy; it helps new managers avoid the trap of trying to manage the team <em>and</em> remain one of the key technical contributors. If you’re learning to manage people for the first time, you have to be able to focus on that new set of skills and challenges. That is easier if it’s a new set of systems than what you worked on as an engineer.</p><p>Through some trial and error, we developed a training program we call “proto-manager” to help individuals try out the management role without making it seem irreversible. We wanted to give them exposure to the role and a vote of confidence from leadership, but still allow for the option to say, “Not for me. I’d rather keep writing code.”</p><p>As a proto-manager, the engineer takes over one-on-ones with the engineers on the team and accountability for the team’s execution; compensation planning and performance review are still handled by the team’s manager. Over the next few months, the proto-manager will get regular feedback from their team and their manager, and will track their progress against the expectations laid out in the first level of our Engineering Manager leveling system.</p><p>Proto-managers enroll in a training program we’ve developed that details our philosophy, approach, and toolset for managers. They also gain access to all of the resources, documents, and meetings that their EMs have, so they can learn as quickly and effectively as possible. 
After a few months, the proto-manager and their manager decide whether to move forward and make it official.</p><h2 id="how-does-someone-grow-as-a-manager">How does someone grow as a manager?</h2><p>Mountains of books about management get published every year, but most of the growth of a manager comes from doing the job. It’s unlikely you’ll suddenly become adept at having crucial conversations with someone who is on the path to burning out just by reading a book or attending a training. It takes experience and practice to unwind the reasons that a complex project with a strong team still isn’t running according to plan. However, we have several habits and programs to support new managers as they learn on the job.</p><p>New managers gain a lot by getting advice from their peers. Directors hold weekly meetings with small groups of peer managers, with a focus on knowledge sharing and support within the group. Every attendee contributes to the agenda with books, articles, or videos that they found useful or insightful, but the majority of discussion often centers around what we call “people stories.” Managers can bring interesting or challenging situations to the group and the other attendees will listen, ask questions, and offer suggestions. Many of these stories are about coaching, giving feedback to, or finding opportunities for a particular individual. There is absolute confidentiality within these meetings; the goal is for a manager to get advice on how to help their team and grow professionally. 
We’ve found that these meetings are most effective when they are kept small because we want them to create a sense of safety and trust, and for each manager to feel like they can seek advice from a trusted circle of peers.</p><p>We also run a monthly manager meeting with all engineering managers at Yelp, which is an information sharing meeting that covers topics like updates to our leveling system and compensation planning, updates from around the Engineering team, and talks given by managers (for example, about allyship). This also provides a regular forum to ask questions of senior managers.</p><p>Over the years we’ve done several versions of management training, creating cohorts of new managers (those new to management or new to Yelp) and scheduling a series of sessions focused on topics like decision-making, coaching, and career development for their reports. As we’ve progressed, we’ve worked to adapt this to align more with Yelp’s overall management and leadership training, refresh the content, and ensure that it scales to work across worldwide engineering offices and in a distributed team environment.</p><p>Finally, we implemented a manager mentorship program early on, finding that managers derived a lot of benefits from meeting with other managers. Many new managers find themselves with mentors both inside and outside of their current group.</p><h2 id="how-do-we-track-manager-career-development">How do we track manager career development?</h2><p>Our manager leveling system mirrors the <a href="https://engineeringblog.yelp.com/2021/06/engineering-career-series-career-paths-for-engineers-at-yelp.html">engineering leveling system</a>, sharing a common set of five dimensions: Technical Skill, Ownership, Business Insight, Continuous Improvement, and Leadership. We expect most managers to be well-rounded across these different areas. Advancement in these areas is generally tied to having accountability for increased scope. 
Senior managers tend to deal with more ambiguity, think more about how technology can deliver business value, mentor and manage more senior reports, and manage budgets and hiring plans. They solve problems by putting processes and structures in place, pursuing opportunities to improve in a changing business landscape, and steering the growth and restructuring of our teams.</p><p>While some of these areas are self-explanatory, we have several subdimensions in Continuous Improvement and Leadership that we’d like to highlight: Mentorship, Well-Being, and Community. These reflect our values as a team, ensuring that managers are looking for opportunities to grow others, support their teams, and build strong relationships. At higher levels, we expect managers to build and scale programs that sustain Yelp Engineering.</p><h2 id="do-i-need-to-become-a-manager-to-keep-growing">Do I need to become a manager to keep growing?</h2><p>Absolutely not! We understand that a management career is a separate path from becoming an excellent software engineer, and we strive for both of these career paths to be full of growth opportunities. Nobody should ever feel forced or compelled to step into management to advance their career, because they’ll wind up in a role they don’t enjoy while their team won’t get the support they need.</p><p>While we’ve supported engineers who have been interested in transitioning into other functions like Product Management and Data Science, many engineers choose to stay on the technical career path at Yelp. 
For more information, check out two other blog posts in this series: <a href="https://engineeringblog.yelp.com/2021/06/engineering-career-series-career-paths-for-engineers-at-yelp.html">Career Paths for Engineers</a> and <a href="https://engineeringblog.yelp.com/2021/06/engineering-career-series-technical-leadership-at-yelp.html">Technical Leadership</a>.</p><h2 id="building-yelps-management-team">Building Yelp’s Management Team</h2><p>Between the two of us, we have 19 years of management experience at Yelp. We’ve hired, mentored, and managed a lot of managers at Yelp, many of whom are in their first management roles. Over the years we’ve helped to define and articulate our management culture. It’s been incredibly rewarding to build the team and support it. While we know the management path is not for everyone, it brings together a lot of challenges in helping a team work effectively to define and achieve success together. Many things can go wrong when people come together to build software, and managers can help a team to overcome all sorts of challenges and celebrate both individual and team success along the way.</p><p>We’ve scaled our management culture to a team of more than 150 managers. In an earlier blog post we talked about our career growth framework, which has been a useful tool to standardize career conversations and compensation scales. In our next post we’ll discuss how we measure and ensure pay equity and fairness in career progression across Yelp Engineering.</p><p>If this all sounds good to you, and you’re excited to continue developing your management career, <a href="https://www.yelp.careers/us/en/c/engineering-jobs">we’re hiring</a>!</p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/07/engineering-career-series-how-we-think-about-engineering-management.html</link>
      <guid>https://engineeringblog.yelp.com/2021/07/engineering-career-series-how-we-think-about-engineering-management.html</guid>
      <pubDate>Thu, 01 Jul 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Engineering Career Series: Technical Leadership at Yelp]]></title>
      <description><![CDATA[<p>Hi there!</p><p>In this post we’re discussing technical leadership, a topic that is paramount to any engineering organization, but is also hard to define. Even observing whether your team, organization, or company has good technical leadership can be a challenge. You might be thinking right now, “Am I a good technical leader?”</p><p>To help describe how Yelp thinks about technical leadership, we have two of our Group Tech Leads (a.k.a. GTL, more on what this is later) writing this post. They are both seasoned Yelpers who have held a number of technical leadership roles — they were even willing test subjects for Yelp’s early experiments in defining such roles.</p><p>Jason Sleight has been with Yelp for six years in various machine learning (ML) oriented roles, and is currently the GTL for Yelp’s <a href="https://engineeringblog.yelp.com/2020/07/ML-platform-overview.html">ML platform</a>, which consists of a collection of centralized systems for ad hoc computing, data ETL, and training/serving ML models.</p><p>Josh Walstrom has been with Yelp for seven years in various backend, iOS, and Android roles and is currently the GTL for our <a href="https://business.yelp.com/tools/business-mobile-app/">“Yelp for Business”</a> mobile apps, which enable business owners to manage their presence on Yelp.</p><h2 id="what-is-technical-leadership">What is technical leadership?</h2><p>First things first, technical leadership is not a single concept, but rather an encapsulation of several distinct functions. For the sake of simplicity, we’re going to bucket them into a few categories. 
We call someone a Tech Lead (TL) when they focus on these functions.</p><h3 id="own-technical-direction-within-an-area-of-responsibility-aor">Own technical direction within an Area of Responsibility (AoR)</h3><p>Yelp is a highly collaborative environment where product, design, management, and technical leaders share goals and direction, but each focuses on different aspects based on their expertise. For example, a product manager’s (PM) expertise is market fit, and a TL’s expertise is technical execution. A PM might focus on whether a team should increase weekly active users via increasing app downloads versus improving retention of existing users, while a TL might focus on whether we should prioritize componentizing UI elements to enable app-pitch interstitials versus improving caching to reduce page load times.</p><p>We like our TLs to focus their attention within an AoR. By having a defined AoR, we can ensure TLs are involved in all the necessary planning, decisions, etc. for that system. In some cases, an AoR is a direct mirror of the team’s mandate (e.g., managing MySQL deployments); in others, an AoR is a sub- or cross-section of an important initiative (e.g., migrating to a new service discovery tech stack). In any event, AoRs are long-lived concepts with multi-quarter or even multi-year roadmaps, and it is the TL’s responsibility to champion that process.</p><h3 id="ensure-the-technical-success-of-their-aor">Ensure the technical success of their AoR</h3><p>Once you have a technical direction established, you need to execute a sequence of projects towards those goals. Ensuring success can take many forms and often varies depending on the lifecycle phase of an AoR. In a new AoR, TLs are often very hands-on, creating proofs of concept and prototypes. As systems grow, the TL might lead a project with a few other engineers to bring the system to production quality and release it for early adopters to try out. 
And finally, as the system matures with broader adoption, the TL needs to step back and facilitate team processes for triaging issues, implementing new features, and other maintenance tasks.</p><h3 id="provide-technical-mentorship-to-engineers-working-in-their-aor">Provide technical mentorship to engineers working in their AoR</h3><p>Engineers love to make progress, and that includes in their personal skills. While engineering managers (EMs) are ultimately accountable for engineers’ continued growth, TLs are closer to the day-to-day technical contributions of engineers in their AoRs and best positioned to give them feedback on how to improve and grow. On a micro level, this includes giving feedback on code robustness, efficiency, maintainability, etc. On a macro level, this includes exposing engineers to new technologies, creating training materials for the AoR’s systems, and helping engineers connect with relevant stakeholders.</p><p>At Yelp, we associate career progression with increased impact (see our recent blog on <a href="https://engineeringblog.yelp.com/2021/06/engineering-career-series-career-paths-for-engineers-at-yelp.html">career leveling</a>). TLs act as force multipliers and explicitly budget time to spend on coordination of efforts, championing their AoR, and mentorship. Clearly these are characteristics that lend themselves towards high impact, and consequently our TLs tend to be relatively advanced in Yelp’s career levels.</p><p>However, there are additional ways to be impactful. Tasks like debugging a logging system, refactoring a complex data model, and optimizing page load times require in-depth technical expertise. A TL might not be the best positioned for these tasks; instead an engineer that spends more time deep in the code has the right expertise. You can view this as a depth versus breadth distinction, and a healthy organization needs both types of skill sets. 
Having TL as a career progression step would funnel engineers into a breadth-first mindset to the detriment of deep technical experts.</p><p>In the past, this distinction was somewhat ambiguous at Yelp, and being a TL was occasionally viewed as a career progression step by engineers. To combat this incorrect perception, we’ve recently refreshed our TL program to make it explicit that TL is a role, as well as to re-establish dedicated support networks for our TLs like training programs and senior-level mentorship.</p><p>The TL role involves more than just guiding technical work and mentoring engineers in their AoR.</p><h3 id="tls-are-engineer-advocates">TLs are engineer advocates</h3><p>TLs identify and remove roadblocks for engineers in their AoR. By nature, engineers excel at finding workarounds or tolerating solvable problems because they want to ship features. Some roadblocks are fairly obvious, such as an unstable deployment pipeline. Other roadblocks can be more subtle, such as not having access to the best tools or training.</p><p>When the roadblocks are within their AoR, TLs work with their EMs and PMs to schedule time for proper solutions even if that means deferring some work on the product roadmap. When the roadblocks are outside their AoR, TLs ask their fellow TLs for help.</p><h3 id="tls-are-allies-for-their-stakeholders">TLs are allies for their stakeholders</h3><p>TLs work closely with PMs and EMs in their AoR. TLs participate actively in the early planning process for new features, advising on technical feasibility, level of effort, and potential risks. TLs ensure PMs have the data they need to optimize their product strategy, and TLs ensure EMs can complete new features on schedule with high quality.</p><p>Many AoRs have stakeholders outside Yelp Engineering, such as Sales &amp; Marketing, Business Operations, and Finance. 
TLs work hard to develop empathy for the needs of these stakeholders.</p><h3 id="tls-are-communication-hubs">TLs are communication hubs</h3><p>TLs simplify communication by coordinating the flow of information around their AoR. This aspect of the role means TLs spend less time building and more time reading, writing, listening, or talking.</p><p>TLs cultivate a sufficiently broad context about their AoR through a generous reading list of emails, Slack messages, product &amp; technical specifications, JIRA tickets, and GitHub PRs. TLs also use external sources to develop context. They read blog posts, attend technical conferences, and participate in open-source communities.</p><p>TLs meet regularly with other TLs to exchange context, creating a collaborative community of technical leaders focused on solving “big picture” problems. They also meet with EMs, PMs, and external vendors. In short, being a TL means you’ll have more meetings. Not as many as an EM or PM but more than a typical engineer.</p><p>GTLs own the technical work of a group of overlapping or related AoRs, each with its own TL. Effectively, GTLs create second level AoRs that cut across clean organizational and technical boundaries to address critical business needs and drive company-wide technical initiatives.</p><p>We introduced the GTL role because some hard, cross-cutting problems seemed impossible to solve without a dedicated owner. In the beginning, the role GTL didn’t have a formal application process, and the expectations weren’t clear, except that a GTL was a TL with a much larger and more ambiguous scope. Though we’re still figuring some things out, the role of the GTL has matured considerably over the past few years. 
There’s now a formal application process and a clearer set of expectations for GTLs beyond those we’ve covered for TLs in general.</p><h3 id="gtls-are-experts-in-the-their-fields">GTLs are experts in their fields</h3><p>GTLs stay informed on industry best practices, future trends, and potential risks. They understand and influence Yelp’s business strategy. Using their expertise and knowledge, GTLs guide their groups in making good technical decisions that best support Yelp’s business strategy.</p><h3 id="gtls-keep-a-holistic-view-on-engineering-health-and-success">GTLs keep a holistic view of engineering health and success</h3><p>As a community, GTLs have awareness across most of Yelp Engineering, and they work together to support the overall health and success of the engineering organization.</p><p>GTLs monitor recent incidents/retrospectives for trends and consistent issues that need more focused attention (particularly help from other GTLs). GTLs look for valuable cross-group projects that may otherwise be missed, evaluate potential solutions, and make recommendations for next steps. In many cases, these projects require careful planning and long-term investments spanning years, not just months or quarters.</p><h3 id="gtls-foster-technical-leadership">GTLs foster technical leadership</h3><p>Holding the most widely-scoped technical role at Yelp, GTLs are positioned to develop technical leadership across engineering, not just within their AoR. This involves both training new TLs and ensuring that existing TLs are set up for success.</p><p>Each week, GTLs hold office hours, alternating between North American- and European-friendly time slots. Anyone, not just TLs, can attend these office hours to ask questions, solicit feedback on technical proposals, or just listen to what’s being discussed. 
GTLs also participate in asynchronous discussions on Slack.</p><h3 id="gtls-are-exemplars-of-yelps-culture">GTLs are exemplars of Yelp’s culture</h3><p>Finally, as highly visible leaders, GTLs set positive examples of our <a href="https://www.yelp.careers/us/en">Yelp values</a>. They “play well with others” by handling disagreements constructively and demonstrating how to solve problems through consensus rather than authority. They are “tenacious” and “unboring” by finding creative solutions to Yelp’s most difficult, far-reaching problems. They are “authentic” by communicating openly and honestly, and they “protect the source” by making reliable products and services that help connect Yelp’s consumers with great local businesses.</p><h2 id="up-next-how-yelp-approaches-engineering-management">Up next: How Yelp approaches engineering management</h2><p>We hope you’re enjoying this blog series and find the peek into Yelp’s engineering culture meaningful! Next up we’ll discuss Yelp’s approach to engineering management, how we measure managers’ success, and provide a glimpse at their responsibilities and values.</p><p>Finally, if you’ve been reading these posts and think Yelp sounds like a great place to work (it is!), then head over to our Careers site – <a href="https://www.yelp.careers/us/en/c/engineering-jobs">we’re hiring</a>!</p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/06/engineering-career-series-technical-leadership-at-yelp.html</link>
      <guid>https://engineeringblog.yelp.com/2021/06/engineering-career-series-technical-leadership-at-yelp.html</guid>
      <pubDate>Thu, 17 Jun 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Modernizing Business Data Indexing]]></title>
      <description><![CDATA[<p>On the Yelp app and website, there are many occasions where we need to show detailed business information. We refer to this process as Data Hydration: filling out a “dry” business with compelling, rich data. Whether on the home screen, search results page, or business details page, there is a large set of properties we may show about any given business, everything from name and address to photos, <a href="https://blog.yelp.com/2019/03/yelp-announces-verified-licenses-bringing-peace-of-mind-to-booking-a-professional">Verified Licenses</a>, insights, and more. These properties are stored in a variety of different databases, and their display is subject to a significant amount of filtering and transformation logic. All of this creates challenges for scaling and performance.</p><p>One technique we rely on heavily is the use of materialized views. Using this technique, we gather the data from the various sources and apply the transformation logic offline, storing the result in a single key-value store for rapid fetching. For many years, the indexing process for this system was our home-grown ElasticIndexer, which has become outdated and doesn’t take advantage of recent advances in Yelp’s backend data processing infrastructure. This post tells the story of our migration from the legacy system to an improved ElasticIndexer 2, meeting several challenges in the process and ultimately delivering a host of advantages.</p><p>Let’s take a closer look at our materialized view and the role it plays in our Data Hydration system. As a motivating example, consider the delivery property. 
This shows up in the UI when a restaurant offers delivery through the Yelp platform.</p><div class="c1"><img src="https://engineeringblog.yelp.com/images/posts/2021-06-07-modernizing-business-data-indexing/oil-vinegar.png" alt="A restaurant with delivery" /></div><p>For various reasons, the form of a business’s delivery-related data stored in our database is not the same as that served to clients such as the website or app. For one, the database schema is relatively static to accommodate data from years ago, while the client applications are constantly changing. Also, the database form is optimized for data modeling, while the form sent to clients is optimized for speedy processing. Thus, transformation logic needs to be applied to the data fetched from the database before being sent to the clients.</p><p>The central challenge of maintaining a materialized view of this property is to react to changes in the underlying data store to update the view with the transformed property. This all must happen in real-time to avoid serving stale data. This becomes especially complicated when a property depends on multiple database tables, which is true for many properties including delivery availability.</p><p>For many years, we used ElasticIndexer to index the materialized view for our Data Hydration platform. ElasticIndexer listens to table change logs (implemented as a separate MySQL table) in the underlying databases, and, in response to changes, will issue database queries and run the transformation logic, ultimately writing the result to the Cassandra materialized view. As a performance and scaling measure, the change logs only contain the primary key of the row being changed, so re-fetching the row from the database is required for any non-trivial transformation. In cases where the business ID is not the primary key of the database table, a domain-specific language (DSL) was used to establish a mapping between a given row and the relevant business ID. 
This process is illustrated below.</p><div class="c1"><img src="https://engineeringblog.yelp.com/images/posts/2021-06-07-modernizing-business-data-indexing/ei1.jpg" alt="Elastic Indexer 1" /></div><p>While this system has generally served us well, there are several downsides to this approach. First, the need to re-issue queries to the database unnecessarily increases the load on the database and introduces race conditions. Database deletes are not supported, as the row would be gone when the indexer would query it. Rewinding the materialized view to an arbitrary point is not possible. Specifying relationships between the different tables was awkward in the special-purpose DSL. Having properties based on the current time was hacky to implement. And parallelizing the process was difficult given the implementation of the change log.</p><p>There must be a better way…</p><p>As stream processing tools and systems such as <a href="https://flink.apache.org/">Flink</a> become more mature and popular, we have decided to create our next generation Data Hydration indexing system based on these new technologies. MySQL is the authoritative source of truth for most applications at Yelp. We stream real-time changes in MySQL to Kafka topics using <a href="https://github.com/Yelp/mysql_streamer">MySQLStreamer</a>, which is a database change data capture (CDC) and publishing system. Once this data is available in Kafka data pipelines, we have a variety of handy customized stream processing tools based on <a href="https://flink.apache.org/">Flink</a> to do most of the necessary data transformation on business properties before storing them in materialized views in Cassandra:</p><ul><li>StreamSQL: A Flink Application for performing queries on one or more Kafka data streams using syntax supported by Apache Calcite.</li>
<li><a href="https://engineeringblog.yelp.com/2018/12/joinery-a-tale-of-unwindowed-joins.html">Joinery</a>: A Flink service built at Yelp, performing un-windowed joins across keyed data streams. Each join output is in the form of a data stream.</li>
<li>Aggregator: A Flink-based service that aggregates Data Pipeline messages. Think of it as the GROUP BY SQL statement over streams.</li>
<li>Apache Beam: An open-source unified programming model that allows users to write pipelines in a set of different languages (Java, Python, Go, etc.) and to execute those pipelines on a set of different backends (Flink, Spark, etc.).</li>
<li><a href="https://engineeringblog.yelp.com/2019/12/cassandra-source-connector-part-1.html">Cassandra Sink</a>: A Flink-based data pipeline data connector for Apache Cassandra. It is responsible for reliably loading data pipeline messages into Cassandra in real time.</li>
<li>Timespan Updater: A Flink-based tool to schedule data transformation tasks based on date and time conditions.</li>
</ul><p>There are several cases where data transformation requires complex logic that the above tools alone cannot implement. To support such cases, we define the logic in a stand-alone service that Beam jobs can call to retrieve transformed data. The following figure illustrates a high-level overview of our new indexing topology:</p><div class="c1"><img src="https://engineeringblog.yelp.com/images/posts/2021-06-07-modernizing-business-data-indexing/ei2.jpg" alt="Elastic Indexer 2" /></div><p>The new mechanism reduces database load dramatically, as most of the transformation is done inside Flink applications. With this system, the source of data changes can be any data stream, and we are no longer limited to getting changes only from MySQL. Backfilling data, whether to add new properties or to recover from failures, is relatively easy: we change data pipeline schemas and reset or rewind the input streams’ offsets. All of the data pipeline tools at Yelp support delete operations, which makes it very easy to delete business properties from materialized views. This ensures that we don’t store stale data in Cassandra. Since both Kafka and Flink are built for distributed environments, they provide first-class parallelization, which can be used to increase indexing throughput, especially when backfilling data for all businesses.</p><p>One of the main challenges we faced during the migration was porting complex business logic from stand-alone services (Python or Java) into Flink applications, even though we had Flink applications covering many specific use cases. Some of these logic migrations required complex streaming topologies that were hard to maintain and monitor.</p><p>The legacy indexer’s logic was in multiple microservices. Not only was this logic used by the legacy indexer, but also by other applications and clients. That’s why we couldn’t simply move the logic to the data pipeline. 
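</p><p>To make the pattern concrete, here is a minimal, hypothetical sketch in plain Python (not real Flink or Beam code; every name is invented for illustration) of the core indexing step: consume a CDC change event that already carries the full row, apply the transformation logic, and upsert the result into a materialized view keyed by business ID.</p>

```python
from dataclasses import dataclass

@dataclass
class ChangeEvent:
    """A CDC message from the data pipeline (hypothetical shape).

    Unlike the legacy change log, it carries the full row, so no
    database re-query is needed and deletes can be represented too.
    """
    table: str
    business_id: int
    row: dict

def transform_delivery(row: dict) -> dict:
    # Toy stand-in for the transformation logic; in complex cases the
    # real system calls out to a stand-alone service instead.
    return {"delivery_available": bool(row.get("offers_delivery"))}

def index_event(view: dict, event: ChangeEvent) -> None:
    # Upsert the transformed property into the materialized view,
    # keyed by business ID (a stand-in for the Cassandra sink).
    view.setdefault(event.business_id, {}).update(transform_delivery(event.row))

# A tiny simulated stream of change events.
view: dict = {}
for ev in [
    ChangeEvent("delivery", 1, {"offers_delivery": 1}),
    ChangeEvent("delivery", 2, {"offers_delivery": 0}),
]:
    index_event(view, ev)

print(view)  # {1: {'delivery_available': True}, 2: {'delivery_available': False}}
```

<p>In the real pipeline the event source is Kafka, the transformation runs inside Flink or Beam applications, and the sink is Cassandra rather than an in-memory dict.</p><p>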
We would have had to create duplicate logic in our Flink applications to keep other parts of Yelp’s microservice ecosystem working smoothly. This could easily lead to discrepancies in application logic between microservices and Flink applications, especially when new complex logic that is hard to build in our generic Flink applications is added to a microservice. This is why we kept some of the logic in microservices and called them from Beam jobs whenever needed.</p><p>One of the biggest requirements for this project was to switch to the new system without causing any downtime for downstream services. We achieved this goal with a multi-step launch process, rolling out one property at a time. We ran the legacy and new indexers in parallel so that both Cassandra clusters had the same data. The next step was to verify that the data in the new cluster matched that of the old one. Because of the large amount of data and the real-time indexing aspect of both indexers, we couldn’t simply do a direct one-to-one comparison between records in each cluster by querying them directly from Cassandra. Instead, we modified the consumer service of this data to pull data for a small percentage of requests from the new Cassandra cluster in the background, while it was serving users with the data from the old cluster. Then we logged both old and new data. After collecting enough data samples, we ran a sanity-check script to verify that the new data was correct. Only after this step did we have enough confidence to switch the consumer service to read data from the new cluster.</p><p>Fantastic! We now have a proper monitoring setup for our data ingestion system, which gives us granular information about and control over each component. Maintenance has become a lot easier. 
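</p><p>As a rough illustration of the dark-read verification described above (all names hypothetical, plain Python rather than our actual services), the consumer-side comparison might look like:</p>

```python
import random

def serve_with_dark_read(key, old_store, new_store, mismatch_log,
                         sample_rate=0.01, rng=random.random):
    # Always serve users from the old cluster during verification.
    old_value = old_store.get(key)
    # For a small sample of requests, also read the new cluster in the
    # background and log any discrepancy for the sanity-check script.
    if rng() < sample_rate:
        new_value = new_store.get(key)
        if new_value != old_value:
            mismatch_log.append((key, old_value, new_value))
    return old_value

# Identical stores produce no mismatches, even at 100% sampling.
old = {"biz:1": {"delivery": True}}
new = {"biz:1": {"delivery": True}}
log = []
assert serve_with_dark_read("biz:1", old, new, log, sample_rate=1.0) == {"delivery": True}
assert log == []
```

<p>Only once the mismatch log stays empty across enough samples does the consumer switch to reading from the new cluster.</p><p>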
We can now scale the indexer for each property up or down according to its load without affecting indexing jobs for other properties.</p><p>We also have a proper dead-letter queue that can be used to backfill properties for businesses that fail for various reasons. With this tool, we know the exact count of failing records whenever failures occur.</p><p>Many people were involved in this project, but special thanks to Yujin Zhang, Weiheng Qiu, Catlyn Kong, Julian Kudszus, Charles Tan, Toby Cole, and Fatima Zahra Elfilali, who helped with the design and implementation.</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>Are you interested in using streaming infrastructure to help solve tough engineering problems?</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/b5d226cd-6ea1-4d12-b875-725b331202b7?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/06/modernizing-business-data-indexing.html</link>
      <guid>https://engineeringblog.yelp.com/2021/06/modernizing-business-data-indexing.html</guid>
      <pubDate>Mon, 07 Jun 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Engineering Career Series: Career paths for engineers at Yelp]]></title>
      <description><![CDATA[<p>About 5 years after joining Yelp, I was managing several teams in our <a href="https://www.yelp.careers/us/en/yelp-jobs-in-germany">Hamburg, Germany office</a> and asked my manager, a director at the time, what were the expectations for an engineering manager versus a director. While the conversation was helpful to me at that moment, the gist was basically “we haven’t written that down.” As you can imagine, it’s hard to know both where you stand and how to grow if that’s <a href="https://engineeringblog.yelp.com/2021/04/engineering-career-series-building-a-happy-diverse-and-inclusive-engineering-team.html">not captured anywhere for you to read</a>.</p><h2 id="where-we-started">Where we started</h2><p>For many years, the career track for engineers at Yelp was not documented. People still advanced in their careers; we just didn’t have written, consistent guidelines on how. For example, some engineers took on Tech Lead responsibilities, but it wasn’t always clear whether that was a temporary role or a level. In early 2016, we introduced our first directors within engineering management, but there was no engineering-wide documentation on what led to one title vs. another.</p><h2 id="how-did-we-get-here">How did we get here?</h2><p>With over 500 engineers by this time in early 2016, you’re probably wondering what led us to this point. There are even 10-person startups with leveling systems in place. A couple of concerns made us hesitant to roll out anything:</p><ul><li>Many of our engineering leaders came from organizations with toxic leveling systems, characterized by contentious career conversations between managers and engineers that, instead of focusing on growth, involved stressful political games and quid pro quo schmoozing around annual performance reviews and promotion periods.</li>
<li>We associated leveling with titles in our minds, and we wanted to avoid the latter. We saw at other organizations how titles led to folks earlier in their careers having their input disregarded purely due to their title. Many of Yelp’s greatest accomplishments come from interns having an equal seat at the table, and that’s an aspect of our culture we were keen to retain.</li>
</ul><p>That said, we all agreed the status quo was no longer working. Folks who joined, and even long-timers, weren’t sure what was expected of them since expectations were not explicitly captured anywhere. Similarly, it wasn’t clear why their compensation changed or didn’t over time. And, as we continued to grow, interviewers weren’t consistent in what they expected of candidates, so we needed a small group of calibrators to vet candidates. That process also wasn’t scaling and was <a href="https://engineeringblog.yelp.com/2021/04/engineering-career-series-hiring-a-diverse-team-by-reducing-bias.html">susceptible to implicit bias</a>.</p><h2 id="first-attempt">First attempt</h2><p>With that in mind, we set out on our first attempt to capture expectations of software engineers and tech leads. We produced two documents along with self-assessment sheets that engineers could use to assess themselves on a scale of 1 to 5 from “no experience” to “very confident.”</p><h3 id="what-worked">What worked</h3><p>This was a good exercise in turning the implicit into the explicit, and it prompted all of engineering leadership to think about and write down what we’re looking for in engineers and what we value as an organization. It was also a good starting point for interviewers to calibrate on what to look for in candidates.</p><h3 id="what-didnt-work">What didn’t work</h3><p>For starters, it was a single set of expectations for all engineers. For example, one of our expectations, “be a stabilizing force within your team, technically, emotionally, and culturally,” means something very different for someone who just joined Yelp in their first job than for someone with 15+ years of experience. Since these expectations were one-size-fits-all, they weren’t linked to compensation, which left that part still ambiguous for folks.
Finally, the self-assessment was entirely voluntary, and not everyone completed it.</p><h2 id="second-attempt">Second attempt</h2><p>We felt it was important to address the one-size-fits-all aspect of the expectations. We reviewed a number of blog posts and articles from our peers that helped us break out our expectations into a two-dimensional grid (shout-outs to <a href="https://labs.spotify.com/2016/02/08/technical-career-path/">Spotify</a>, <a href="http://dresscode.renttherunway.com/blog/ladder">Rent the Runway</a>, <a href="http://engineering.chartbeat.com/2015/06/05/engineering-ladders/">Chartbeat</a>, and <a href="http://joelonsoftware.com/articles/ladder.html">Fog Creek</a> for sharing their journeys and results).</p><p>We established 5 milestones to indicate the scale of impact an engineer was having:</p><ul><li><strong>Self:</strong> You’re focused on what you can personally deliver.</li>
<li><strong>Team:</strong> You have a significant impact on your whole team.</li>
<li><strong>Group:</strong> Your contributions are recognized and sought out by engineers across several teams or your tech community.</li>
<li><strong>Company:</strong> Your work is impactful across the entire company.</li>
<li><strong>Industry:</strong> You drive changes that advance Yelp’s interests across the industry.</li>
</ul><p>And we assessed those against 5 dimensions that captured what we valued as an engineering organization:</p><ul><li><strong>Technical Skill:</strong> Your depth of knowledge and expertise in your specific domain or current position.</li>
<li><strong>Ownership:</strong> You take responsibility for your actions as well as those of your team, and you hold others to the same standards. You deliver projects with tangible results consistently and in a timely fashion.</li>
<li><strong>Business Insight:</strong> You understand how projects and decisions benefit Yelp as a company. You design &amp; build solutions to deliver long-term value while also being flexible to accommodate rapid change.</li>
<li><strong>Continuous Improvement:</strong> You continuously learn and grow, and you invest time in mentoring others. You never accept the status quo for yourself, your peers, or the org as a whole.</li>
<li><strong>Leadership:</strong> You communicate clearly and effectively. You optimize for the group’s success and advocate Yelp’s values of inclusivity and support. You strengthen your team by championing Yelp internally &amp; externally.</li>
</ul><p>This time, we also asked that all engineers have an assessment conversation with their manager to see where they stood in the milestones instead of making it optional.</p><h3 id="what-worked-1">What worked</h3><p>This addressed the main pain point of our first attempt by providing a graduated scale that outlined different responsibilities for someone just starting out in their career versus for someone who had been at Yelp for several years already. We also established a clear expectation around progression by requiring that all engineers move from Self to Team in every dimension within two years. And we made this exercise mandatory at rollout, which helped us collect data across the organization.</p><h3 id="what-didnt-work-1">What didn’t work</h3><p>Post-rollout, many engineers weren’t motivated to keep having the conversation with their manager. Since the framework still wasn’t tied to their compensation, it wasn’t clear how this benefited them. It also wasn’t granular enough for the makeup of our teams at the time. Most of engineering sat at either Self or Team, with more senior engineers approaching Group. The Company and Industry milestones, while aspirational, felt out of reach. Lastly, our framework didn’t roll up to a single level, which made it hard for recruiters to explain to candidates who were accustomed to level terminology from other companies.</p><h2 id="third-attempt">Third attempt</h2><p>So with all that, in 2018, we embarked on a third attempt. We still didn’t want to introduce titles for the same reasons we avoided them initially, so our levels would be kept private and not be associated with titles.</p><p>Our third framework addressed some of the key lessons from our first two attempts:</p><ul><li>We dropped the top Industry milestone as being too aspirational, and we added more steps between the remaining milestones. We now had six levels, IC1 through IC6, for the four steps in our second attempt. 
The IC1-IC6 terminology also matches what other companies use.</li>
<li>Most importantly, we did what the previous frameworks didn’t with this third attempt: we tied it to compensation. Each level has an associated merit band, equity band, and cash bonus target based on role and location.</li>
<li>Finally, as <a href="https://engineeringblog.yelp.com/2021/05/engineering-career-series-using-structured-interviews-to-improve-equity.html">Kent and Grace wrote about previously</a>, we revamped our interview process to be based on our dimensions and began using levels and associated compensation bands when hiring.</li>
</ul><p>This framework also came with a web app, based on <a href="https://github.com/Medium/snowflake">Medium’s Snowflake</a>, to navigate the dimensions and levels.</p><figure><p style="text-align: middle;"><img src="https://engineeringblog.yelp.com/images/posts/2021-06-03-engineering-career-series-career-paths-for-engineers-at-yelp-snowflake.jpg" alt="Yelp Product &amp; Engineering Levels Web App" /></p>
<figcaption><small>Yelp Product &amp; Engineering Levels Web App</small>
</figcaption></figure><h3 id="what-worked-2">What worked</h3><p>Everyone in engineering uses the leveling system; it’s not optional. Job offers include a level, levels are recorded as part of each employee’s profile, and compensation adjustments are based on progress within or across levels. Seeing our success, other departments throughout Yelp adopted our model, and we now have levels using the same format for our Product, Business Systems Analyst, IT, and Engineering Management roles, with other teams adopting it every quarter. All of these various leveling frameworks are visible to everyone at Yelp, so people know what to expect if they’re considering a role change.</p><h3 id="what-we-missed">What we missed</h3><p>I didn’t call this section “third time’s the charm” because, let’s be honest, we didn’t get everything right even with this third attempt.</p><p>After the initial rollout, managers were mostly left on their own to calibrate. Senior levels, IC5+, required calibration discussions with the Chief Technology Officer, but the process was ad hoc for levels below that. Transparency around leveling was also inconsistent, with some managers having a shared document with their engineers, and others keeping the whole process opaque.</p><p>In recent quarters, we’ve been addressing these pain points by rolling out an organization-wide calibration process among manager groups that happens every quarter before promotions are submitted. In addition, managers now maintain a historical record of an engineer’s progression that is shared with the engineer and allows them to contribute to the leveling process with their own data points.</p><h2 id="there-is-no-done">There is no “done”</h2><p>As an organization grows, its leveling framework and processes will start to break down. What was once widely understood will become foreign to newer members of the team. 
For that reason, we’re constantly reviewing and iterating on what we have.</p><p>While employees now have a tool at their disposal to understand the expectations once they join, hiring managers still have to verbally walk through all this with potential new hires, making it challenging for them to clearly understand what’s expected of a certain level. To address this, we’ll be publishing our engineer and engineering manager leveling frameworks in the coming months. We believe everyone should know what they’re signing up for and what career growth looks like at the company they’re joining.</p><h2 id="up-next-technical-leadership-at-yelp">Up next: Technical leadership at Yelp</h2><p>One area that overlaps our career framework is technical leadership. Unlike some of our peers, we view technical leadership as a role, not a level. In our next post, Jason Sleight and Josh Walstrom, two of our Group Technical Leads, will walk us through why we approach it that way and how we’ve worked to build a collaborative, cross-pollinating community of technical leaders who work together regularly to solve some of Yelp’s biggest problems.</p><p>If you also dream of leveling frameworks and are passionate about fostering an environment for career growth, <a href="https://www.yelp.careers/us/en/c/engineering-jobs">we’re hiring</a>!</p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/06/engineering-career-series-career-paths-for-engineers-at-yelp.html</link>
      <guid>https://engineeringblog.yelp.com/2021/06/engineering-career-series-career-paths-for-engineers-at-yelp.html</guid>
      <pubDate>Thu, 03 Jun 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Engineering Career Series: How we onboard engineers across the world at Yelp]]></title>
      <description><![CDATA[<p>Like most companies, Yelp has had to make substantial changes to the way we onboard new team members over the past year. Yelpers have always been naturally good at fostering a welcoming and supportive atmosphere for new employees. Translating this into a welcoming and supportive <em>virtual</em> atmosphere hasn’t happened organically. As we grow distributed teams across the United States, Canada, and Europe, the new ways in which we prepare, welcome, train, and support our employees have become, and will continue to be, important for the advancement of Yelp’s Engineering &amp; Product organizations.</p><p>Going into 2020, we knew we were already outgrowing a number of our programs, and we were presented with even more challenges as we made the shift to remote work, and then permanently distributed teams. This post aims to share some of the ways we’ve adapted to these changes, and the lessons we’ve learned along the way.</p><h2 id="welcoming-new-hires">Welcoming New Hires</h2><p>Showing up to an office on your first day at a new job can be nerve-racking. Pre-COVID, we did our best to put new hires at ease. They were connected with their mentor as soon as they arrived, their desk was set up with the right tech and a welcome kit, and they were invited to a team lunch. When we made the shift to remote work and distributed hiring, we knew we had to figure out how to replicate this welcoming environment virtually.</p><p>Through partnerships with our IT and Workplaces departments, we now ensure that the all-important first-day welcome kit is shipped with new hires’ equipment. Once folks have had the chance to log in, they begin their day with a check-in with their new manager and mentor. These chats are casual and help give every new hire the lay of the land for their first week. Managers also arrange for new hires to connect with other members of their team and anyone they might work with on a regular basis. 
Previously, these introductions would’ve happened informally around the office and in meetings, but we now intentionally schedule these as one-on-one conversations so that new hires don’t miss out on making important connections.</p><p>We also make sure a new hire’s entire team can gather for a virtual lunch, coffee, or watercooler chat on their first day. This gives everyone the chance to welcome their new teammate in a fun atmosphere, and helps new hires start putting faces to names.</p><h2 id="tackling-remote-onboarding-challenges">Tackling Remote Onboarding Challenges</h2><p>While we were able to replicate some of the social aspects of being in an office, we were also hit with a whole new set of logistical challenges. “What time do I show up?” turned into “When can I expect a call from my manager?” Managers had to figure out how to connect with new hires in different time zones, or even in different countries. And simply having someone log in to a new computer at the office became an adventure in international shipping logistics.</p><p>After hitting a few speed bumps, we worked with teams across Yelp to set up new systems for onboarding new hires remotely. We send a regular cadence of reminders to managers, mentors, and new hires beginning two weeks before someone is expected to start. For managers and mentors, our onboarding team shares checklists to help them prepare, and most have developed complementary team-specific lists in an effort to maintain consistency across onboarding experiences. For the new hire, we share materials ahead of time, like an up-to-date technical primer that outlines the tools they’ll be using when they start. This provides folks with the foundational information they need to hit the ground running on their first day.</p><h2 id="revamping-new-hire-orientation">Revamping New Hire Orientation</h2><p>Historically, Yelp held a one-hour orientation session for new hires on their first day. 
This session, called “Space Camp,” was in-person and typically hosted by a leader from within Engineering. Space Camp was something we knew we were outgrowing. We tried to pack a lot of information into just 60 minutes, which overwhelmed a lot of new hires. And because we invited all Engineering and Product roles to this session, we had to keep the content broad, which meant the session wasn’t actually helpful for many of the folks who attended.</p><p>We started looking at an overhaul of Space Camp in January 2020. But when COVID hit, we had to pivot. We knew we needed programming that would scale across time zones, be relevant for folks in a wide variety of roles, and provide everyone with a sense of belonging in their first few weeks. In close partnership with our People Operations team, we eventually landed on a blended approach that includes:</p><ul><li><strong>Virtual instructor-led orientation.</strong> Led by our People Operations team, this happens on a new hire’s first day, and covers everything from our company values to benefits.</li>
<li><strong>On-demand e-courses.</strong> People Operations maintains a collection of resources for all Yelp employees, while our Technical Talent team maintains resources more specific to Engineering &amp; Product. This includes a collection that dives into our Engineering team structure, culture, and goals for the year, as well as a collection of ramp-up materials for software engineers.</li>
<li><strong>Virtual meet &amp; greets.</strong> All new hires are invited to a virtual meet and greet with our Internal Community Manager within their first 30 days. These chats are fun and informal, and give new hires the opportunity to learn about unique aspects of Yelp, such as Yelp Employee Resource Groups.</li>
</ul><p>In 2021, we’re looking to expand our virtual instructor-led and culture offerings. Currently in the works are virtual whiteboarding sessions called Build Me a Yelp, which are interactive introductions to Yelp’s infrastructure that give new hires the chance to ask questions. We’re also aiming to ensure all Engineering &amp; Product new hires have the chance to connect with the Vice President of their organization in their first 90 days.</p><h2 id="strengthening-mentorship--local-buddy-programs">Strengthening Mentorship &amp; Local Buddy Programs</h2><p>Strong mentorship is a crucial part of setting new hires up for success. We assign all new hires a mentor weeks before they start. This way mentors have plenty of time to prepare for their new teammate. Our mentors review and update team documentation, arrange for regular one-on-ones with their mentee, facilitate connections with other teams, set expectations, and provide feedback.</p><p>We combined the lessons we’ve learned over time with a few new guardrails once we started making the shift towards distributed teams. We now:</p><ul><li>Make sure all mentors complete a “Mentorship 101” course. More on this below!</li>
<li>Match mentors and mentees that share the same core working hours.</li>
<li>Recommend that all mentors have at least one year of experience on the team so they’ll have a good understanding of both Yelp and team-specific norms.</li>
<li>No longer require that mentors be more technically experienced than their mentee. We view the job of a mentor to be showing the new hire what life at Yelp is like and how to be successful on the team, and we provide other technical resources to help fill in any potential gaps in knowledge.</li>
</ul><p>We also recognize that remote mentorship is just…different. There may not even be a mentor on the team whose working hours overlap with the mentee’s! In these instances, we encourage managers to find a “local buddy.” Buddies are coworkers in the same time zone who can be a resource for questions and support if the new hire’s mentor and other teammates are offline.</p><p>We’ve also worked to find the positives when a mentor typically starts their day later than their mentee due to a time zone difference. Providing the new hire with items to complete on their own, such as onboarding courses or small code fixes, gives them a chance to feel productive and collect questions before they connect with their mentor each day. But once their mentor is online, we encourage everyone to default to over-communicating. This includes sending messages to indicate availability to field questions, utilizing icons and status updates, and making sure working hours are publicly displayed on calendars. Some mentors have even experimented with keeping a call open on Slack or Google Meet throughout the day to simulate the ability to simply turn <a href="https://www.youtube.com/watch?v=2EwViQxSJJQ">to the left</a> and ask a quick question.</p><h2 id="training-for-mentors">Training for Mentors</h2><p>We know that “<a href="https://engineeringblog.yelp.com/2021/04/engineering-career-series-building-a-happy-diverse-and-inclusive-engineering-team.html">no process” is another name for “bias</a>,” so as we ramped up hiring going into 2021, we established a formal training program for all mentors. Training helps to ensure all mentors are providing a consistent experience for their mentees.</p><p>The first step for any new mentor is to complete an on-demand e-course focused on the fundamentals, like “why be a mentor?” and expectations for mentorship at Yelp. 
We regularly refresh this content to ensure the information is up-to-date and remains useful as the company (and the world around us!) changes.</p><p>We also create spaces for mentors to connect and discuss any questions, issues, or lessons learned. All mentors are invited to participate in live quarterly discussion sessions facilitated by experienced mentors and managers where they can bring questions to discuss with the group and compare approaches to common scenarios. If mentors have questions they’d like to discuss on the spot, they can turn to an internal channel to pose them to everyone who’s ever been a mentor.</p><p>Beyond providing fundamental training and spaces to connect, we’re exploring additional workshops to level up specific skills and help to build out a strong and lasting mentorship culture. In the previous section, we mentioned that providing feedback to their mentee is a part of a mentor’s responsibilities. We know this is a significant determining factor in how quickly the new hire can ramp up. So, we recently started offering a live training session on giving and receiving feedback to all mentors. We have plans to continue supplementing this workshop over time.</p><h2 id="ongoing-learning--mentorship-programs">Ongoing Learning &amp; Mentorship Programs</h2><p>In addition to our new-hire and mentorship programs, we’re working to establish ongoing learning opportunities for everyone on our team, regardless of their tenure or role.</p><p>Like onboarding, this is an area where we’ve worked closely with our People Operations team. In 2020, we launched the Leadership Essentials and Development (LEAD) program for new managers. This program covers management basics like effective one-on-ones, coaching, and feedback, providing managers with the foundation they need to help their teams grow. 
In 2021, our People Operations team is expanding on this program by offering learning opportunities to support senior leaders.</p><p>We’re also working with teams across Engineering &amp; Product to develop learning content that’s tailored to a wide range of audiences. Below is a little taste of what we have going on in this space:</p><ul><li>We’ve launched an internal podcast series. Hosted by leaders on our Product team, the series covers best practices and ways to upskill.</li>
<li>We’re working with a group of Engineering Managers to develop resources on understanding, supporting, and working effectively with neurodiverse individuals.</li>
<li>We’re hosting virtual workshops focused on agile skills development.</li>
</ul><p>Finally, now that we’ve made a permanent shift towards distributed teams, we’ve established working groups to help ease this transition. We know this can be a tricky thing to nail, and we’re doing our best to get it right. These groups are focused on improving distributed mentorship and belonging programs, as well as establishing organization-wide policies for things like asynchronous standups, sprint planning, and roadmapping. It’s our hope that we’ll be able to provide everyone with the skills and resources they need to do their best work, no matter where they’re based.</p><h2 id="up-next-career-paths-for-engineers-at-yelp">Up next: Career Paths for Engineers at Yelp</h2><p>While it’s crucial to provide an informative and inclusive onboarding experience, and follow that up with continued opportunities for learning, we know creating the space for learning isn’t enough on its own. We also need to provide everyone with the same structured framework for growing a career at Yelp. In our next post, we’ll dive into our career paths framework, including the history behind our current leveling system, and how we view career growth as an ongoing conversation.</p><p>Lastly, if you’re finding these posts interesting and Yelp sounds like the kind of company culture that you’d like to be a part of… <a href="https://www.yelp.careers/us/en/c/engineering-jobs">we’re hiring!</a></p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/05/engineering-career-series-how-we-onboard-engineers-across-the-world-at-yelp.html</link>
      <guid>https://engineeringblog.yelp.com/2021/05/engineering-career-series-how-we-onboard-engineers-across-the-world-at-yelp.html</guid>
      <pubDate>Thu, 20 May 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Moderating Promotional Spam and Inappropriate Content in Photos at Scale at Yelp]]></title>
      <description><![CDATA[<p>The <a href="https://trust.yelp.com/">trust</a> of our community of consumers and business owners is Yelp’s top priority. We take significant measures to maintain this trust, preserving the integrity and quality of the content on our site through our state-of-the-art <a href="https://trust.yelp.com/recommendation-software/">review recommendation algorithms</a>. Though popular, review text is only one of many types of user-generated content at Yelp. Photos are also a key piece of content, and they are increasingly becoming an attack vector for spam, inappropriate material, and other unwanted behavior. In this blog post we show how we built a scalable photo moderation workflow that leverages <a href="https://engineeringblog.yelp.com/2016/07/billions-of-messages-a-day-yelps-real-time-data-pipeline.html">Yelp’s in-house real-time data streaming and processing pipeline</a>, simple heuristics, and deep learning models to deal with hundreds of thousands of photo uploads per day.</p><p>Yelp’s mission is to connect people with great local businesses. Local businesses are often small and might not have the resources to quickly identify and flag the content generated on their pages, especially if it is disruptive or deceptive, which can erode trust for both the business and its customers. Trust is deeply embedded in two Yelp values:</p><ul><li>Protect the source: community and consumers come first.</li>
<li>Authenticity: tell the truth. Content found on Yelp should be reliable and accurate.</li>
</ul><p>Yelp takes pride in its mission and values, and we constantly strive to develop and improve our systems to protect business owners and users.</p><p>With that in mind, we addressed two types of photo content spam: promotional and inappropriate.</p><p><strong>Promotional spam</strong> is an unwanted commercial message of extremely low value that tries to disguise itself as business owner content and often leads to the user being scammed (e.g. by showing a fake customer support number). We consider this a type of <em>deceptive spam</em> because it erodes the trust users place in our platform.</p><figure><p style="text-align: middle;"><img src="https://engineeringblog.yelp.com/images/posts/2021-05-12-moderating-promotional-spam-and-inappropriate-content-in-photos-at-scale-at-yelp/promotional-spam-1.png" alt="Promotional Spam Example 1" class="c1" /><img src="https://engineeringblog.yelp.com/images/posts/2021-05-12-moderating-promotional-spam-and-inappropriate-content-in-photos-at-scale-at-yelp/promotional-spam-2.jpg" alt="Promotional Spam Example 2" class="c1" /></p>
<figcaption><small>Examples of promotional spam.</small>
</figcaption></figure><p><strong>Inappropriate spam</strong> is content that can be interpreted as offensive or unsuitable in the specific context where it appears. Context is especially relevant for this type of spam, as inappropriate content covers a broad range of situations where the classification can be fairly ambiguous depending on where the content appears or which content policy applies (<a href="https://www.yelp.com/guidelines">Yelp Content Guidelines</a>). We consider this a type of <em>disruptive spam</em> because it can be abusive and offensive if not outright disturbing. Examples of this type of spam are suggestive or explicit nudity (e.g., revealing clothes, sexual activity), violence (e.g., weapons, offensive gestures, hate symbols), drugs/tobacco/alcohol, etc.</p><p>Users and business owners upload hundreds of thousands of photos every single day.</p><p>At this scale, the infrastructure and the resources required for real-time classification are a considerable challenge due to the tight response-time constraints required to maintain a good user experience. Additionally, processing photos using neural networks requires expensive GPU instances. Real-time classification is also not an ideal choice in an adversarial space because it provides immediate feedback to an attacker trying to circumvent or reverse engineer our systems. Having an indeterminate delay between content upload and moderation significantly increases the time cost for an attacker to reverse engineer the system. Conversely, unwanted content should be moderated as quickly as possible to protect our users; and since spam tends to be generated in waves, failing to remove it swiftly would likely leave large swathes of unsafe content on the platform.</p><p>There are also challenges specifically related to the machine learning (ML) algorithms used to process image data. 
Promotional and inappropriate spam is fairly rare on Yelp, which creates extremely unbalanced data and makes training and evaluating ML algorithms a lot more challenging. While we can use smart sampling techniques to produce balanced datasets for training purposes, evaluation in production is heavily skewed toward minimizing false positives, which in turn limits the recall of spammy content. Another concern we need to address is the context of a photo, especially for inappropriate content (e.g. a photo of a lingerie model is perfectly fine in a lingerie shop but not on a restaurant business page). Finally, the adversarial space requires the ability to react quickly to evolving threats and to constantly keep our models up to date.</p><figure><p style="text-align: middle;"><img src="https://engineeringblog.yelp.com/images/posts/2021-05-12-moderating-promotional-spam-and-inappropriate-content-in-photos-at-scale-at-yelp/content-label-distribution.png" alt="Content Label Distribution" class="c2" /></p>
<figcaption><small>Distribution of content types, the far right bar is "good" content.</small>
</figcaption></figure><p>As we mentioned above, any moderation of user-generated content has to work in an adversarial space. Hence, we decided not to use any out-of-the-box or third-party solutions, which we considered vulnerable to reverse engineering: because they are publicly available, attackers can experiment with them and learn to bypass them before attacking Yelp. In this case, rolling out our own custom system turns security through obscurity to our advantage by buying us time against attackers, which in turn allows us to remain ahead of the game.</p><p>We also mentioned the ML challenge of dealing with class imbalance. In our solution we focused on precision while maintaining good recall. Precision and recall trade off against each other, but we prefer a “do no harm” approach where we minimize false positives, which would lead to removing valid content. This is incredibly important for businesses that have little content on their pages, for which removing a valid photo would have a non-negligible effect. A high-precision solution also minimizes manual work for our content moderation team. This helps us deal with Yelp’s continuous growth, since manual moderation does not scale well, and it reduces exposure to inappropriate content, which can be psychologically taxing and potentially a liability.</p><p>Finally, while designing the system, we tried to leverage existing Yelp technologies and systems as much as possible to minimize engineering development cost and maintenance burden.</p><p>After considering the challenges in infrastructure, ML, and the adversarial space, we settled on a <strong>multi-stage, multi-model approach</strong> with two stages and different models for each stage and type of spam. 
The first stage identifies the subset of photos most likely to contain spam; the models in this stage are tuned to maximize spam <a href="https://en.wikipedia.org/wiki/Precision_and_recall">recall</a> while filtering out most of the safe photos. Essentially, this step changes the label distribution of the data fed into the second stage: it significantly reduces the <a href="https://en.wiktionary.org/wiki/ham_e-mail">ham</a>/spam class imbalance and removes many potential false positives (since a large subset of photos never reaches the second stage, the final set of false positives is limited to those generated by the second stage, which may or may not intersect with those from the first stage). The second stage is where the actual classification of the content happens; the models in this stage are tuned for <a href="https://en.wikipedia.org/wiki/Precision_and_recall">precision</a>, because we aimed to send only a small amount of content to the manual moderation queue and wanted to keep false positives to a minimum. We also run a set of heuristics alongside the ML models; these speed up the whole pipeline and are quickly tunable, so we can react promptly to a new threat our models cannot yet handle, which buys us time to update the models while keeping users protected. Finally, we created a Review Then Publish (RTP) moderation workflow UI where images identified as spam are hidden from users and sent to our <a href="https://trust.yelp.com/content-moderation/">content moderation team</a> for manual review. 
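</p><p>The two-stage flow described above can be sketched in a few lines; the model scores, thresholds, and action names below are hypothetical stand-ins for illustration, not Yelp’s actual code:</p>

```python
# Sketch of a two-stage spam cascade: a cheap high-recall filter followed
# by a high-precision classifier. All models and thresholds are invented.

def stage_one_score(photo):
    # Tuned for recall: almost all spam passes, most safe photos drop out.
    return 0.9 if "text_overlay" in photo["features"] else 0.05

def stage_two_score(photo):
    # Tuned for precision: only runs on the subset surviving stage one.
    return 0.95 if "promo_url" in photo["features"] else 0.2

def moderate(photo, recall_threshold=0.1, precision_threshold=0.9):
    if stage_one_score(photo) < recall_threshold:
        return "publish"  # the bulk of photos exits here cheaply
    if stage_two_score(photo) >= precision_threshold:
        # Review Then Publish: hide and send to the moderation queue.
        return "hide_and_queue_for_review"
    return "publish"

safe = {"features": ["food"]}
spam = {"features": ["text_overlay", "promo_url"]}
print(moderate(safe), moderate(spam))  # publish hide_and_queue_for_review
```

<p>Because most uploads exit at the first stage, the expensive second-stage models only ever see a small, spam-enriched slice of traffic. 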
Yelp’s content moderation team can then decide either to restore a photo if it is a false positive or to keep it hidden if it is malicious.</p><p>In the next sections we will dive into the details of what this solution looks like for each type of spam.</p><div class="c4"><img src="https://engineeringblog.yelp.com/images/posts/2021-05-12-moderating-promotional-spam-and-inappropriate-content-in-photos-at-scale-at-yelp/multi-stage-multi-model.png" alt="Multi-stage multi-model ML pipeline" class="c3" /></div><p>Most promotional spam is characterized by fairly simple graphics containing blocks of text that deliver the spam message. Therefore, the image-spam identification models used in the first stage try to identify photos containing text or logos; these models are mostly heuristic-based and very resource-efficient. In the second stage, we extract the text from the photos using a deep learning neural network. The spam classification is then performed on the text content, leveraging a <a href="https://en.wikipedia.org/wiki/Regular_expression">regular expression</a> and <a href="https://en.wikipedia.org/wiki/Natural_language_processing">NLP</a> service. The fast path provided by the regular expressions efficiently catches the most egregious cases and lets us react quickly to content that is not yet being captured by the NLP models.</p><div class="c4"><img src="https://engineeringblog.yelp.com/images/posts/2021-05-12-moderating-promotional-spam-and-inappropriate-content-in-photos-at-scale-at-yelp/promotional-spam-pipeline.png" alt="Promotional Spam ML Pipeline" class="c3" /></div><h2 id="inappropriate-spam">Inappropriate Spam</h2><p>Inappropriate spam is much more complex than promotional spam because it covers a broad range of content. Its classification is also heavily dependent on the context where it appears. 
In order to maximize recall, the first stage comprises two models: a thin <a href="https://en.wikipedia.org/wiki/Residual_neural_network">ResNet</a> trained on a binary classification task to identify inappropriate content in photos based on Yelp’s policies, and a deep <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network">CNN</a> model trained on a binary classification task to identify photos containing people. This second model was added specifically to maximize recall, since many instances of inappropriate content involve people. The second stage uses a deep learning model trained on a multi-label classification task, where the output is a set of labels with associated confidence scores. The model is then calibrated for precision based on those confidence scores and a set of context heuristics (e.g. the business category) that take into account where the content is being displayed.</p><div class="c4"><img src="https://engineeringblog.yelp.com/images/posts/2021-05-12-moderating-promotional-spam-and-inappropriate-content-in-photos-at-scale-at-yelp/inappropriate-spam-pipeline.png" alt="Inappropriate Spam ML Pipeline" class="c5" /></div><h2 id="dealing-with-spam-waves-and-adversarial-actors">Dealing with Spam Waves and Adversarial Actors</h2><p>So far we have covered mostly the ML aspects of the system and only briefly mentioned how heuristics let us quickly adapt to the changing threats coming from adversarial actors. Spam often hits websites in waves of very similar content generated by fake accounts piloted by bots. Hence, we have a workflow and a couple of infrastructure improvements specifically designed to address this. Photos flagged as spam are tracked by a fuzzy matching service. If a user tries to upload an image that matches a previous spam sample, it is automatically discarded. 
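</p><p>To make the fuzzy matching idea concrete, here is a toy difference-hash (dHash) comparison in plain Python; the hashing scheme, tiny image grids, and distance threshold are ours for illustration, not the actual service:</p>

```python
# Near-duplicate detection sketch: hash each image so that visually
# similar uploads produce similar bit patterns, then compare hashes by
# Hamming distance.

def dhash(pixels):
    # pixels: rows of grayscale values; each bit records whether a pixel
    # is brighter than its right-hand neighbor.
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

def hamming(a, b):
    return bin(a ^ b).count("1")

def is_known_spam(candidate_hash, spam_hashes, max_distance=1):
    # Small distance => the upload closely resembles flagged spam.
    # (Real dHashes are 64-bit; our toy hash is tiny, hence the tight bound.)
    return any(hamming(candidate_hash, h) <= max_distance for h in spam_hashes)

original = [[10, 20, 15], [30, 5, 40]]   # toy grayscale "image"
rehosted = [[11, 21, 15], [29, 6, 41]]   # slightly altered re-upload
spam_hashes = {dhash(original)}
print(is_known_spam(dhash(rehosted), spam_hashes))  # True
```

<p>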
On the other hand, if there is no similar spam match, the image goes through the pipelines mentioned above and may end up in the content moderation team’s queue. While awaiting moderation the images are hidden from users so that no one is exposed to potentially unsafe content. The content moderation team can also act on entire user profiles instead of just a single piece of user content. For example, if a user is found to be generating spam, their profile is closed and all associated content is removed. This noticeably improves spam recall, because we only need to catch one image from a spam bot profile to remove all of the unwanted content it generated. Finally, the traditional user-reporting channel remains in place, providing feedback that lets us monitor the effectiveness of our systems.</p><div class="c4"><img src="https://engineeringblog.yelp.com/images/posts/2021-05-12-moderating-promotional-spam-and-inappropriate-content-in-photos-at-scale-at-yelp/fuzzy-matching.png" alt="Fuzzy Matching Service" class="c6" /></div><p>In this blog post we covered some of the solutions Yelp developed to process hundreds of thousands of photos per day using a two-stage processing pipeline powered by state-of-the-art ML models. We also implemented an RTP moderation workflow so that problematic content is hidden from users until moderation happens. Finally, the system provides us with the flexibility to quickly respond to adversarial actors, fake accounts, and spam waves.</p><p><a href="http://trust.yelp.com/">Trust &amp; Safety</a> is taken very seriously at Yelp and we are proud of the work we do to protect our users and business owners. 
As a result, <a href="https://blog.yelp.com/2019/10/study-shows-97-of-people-buy-from-local-businesses-they-discover-on-yelp">Yelp is one of the most trusted review platforms on the web</a>.</p><div class="c4"><img src="https://engineeringblog.yelp.com/images/posts/2021-05-12-moderating-promotional-spam-and-inappropriate-content-in-photos-at-scale-at-yelp/whole-workflow.png" alt="The full moderation workflow" class="c5" /></div><ul><li>Thanks to Jeraz Cooper for mentoring, countless code reviews, and enabling the photo support in the moderation UI.</li>
<li>Thanks to Jonathan Wang for the insights on the inappropriate spam model.</li>
<li>Thanks to Pravinth Vethanayagam and Nadia Birouty for consulting on system design and people and logo classifiers.</li>
</ul><div class="island job-posting"><h3>Join Yelp</h3><p>Want to help us make Yelp a safer place?</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/e9a3e447-7271-431d-b8d3-29168c9c01ef?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/05/moderating-promotional-spam-and-inappropriate-content-in-photos-at-scale-at-yelp.html</link>
      <guid>https://engineeringblog.yelp.com/2021/05/moderating-promotional-spam-and-inappropriate-content-in-photos-at-scale-at-yelp.html</guid>
      <pubDate>Wed, 12 May 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Engineering Career Series: Using structured interviews to improve equity]]></title>
      <description><![CDATA[<p>For years, Yelp continued to use an interview process that was created when we were a 50-200 person Engineering organization, with only a handful of interviewers:</p><ul><li>Each interviewer wrote their own interview questions</li>
<li>A few senior leaders gave overall hire/no hire decisions for every panel</li>
<li>Interviewers received ad hoc feedback from senior leaders when it seemed like they were too tough or too easy in their interviews</li>
</ul><p>A few things went well:</p><ul><li>there was a strong sense of personal responsibility for both leaders and interviewers</li>
<li>turnaround time for offer approvals was quick</li>
<li>and <a href="https://blog.yelp.com/2021/04/yelp-values-employee-panel-theres-always-room-for-a-smile">Yelp values</a> could be preserved by senior leaders.</li>
</ul><p>As the Engineering organization grew to more than 500 employees, the interviewer pool also grew, from tens to hundreds. This created a few challenges. It became harder to enforce similar standards across interviewers. It became increasingly difficult to tell whether a candidate was strong or if an interview question was correctly calibrated. This lack of structure made it difficult to confidently and consistently identify strong candidates. It also made it difficult to identify whether there were patterns of bias in our interview process. Faced with these challenges, we asked, <strong>“How do we continue to hire diverse, amazing talent as we scale our Engineering organization?”</strong></p><h2 id="creating-structured-interviews">Creating Structured Interviews</h2><p>A group of folks across Technical Talent and Engineering banded together to answer this question. Since others had gone down this path before us, we began with a review of prior work by <a href="https://medium.engineering/mediums-engineering-interview-process-b8d6b67927c4">Medium</a>, <a href="https://www.quora.com/What-is-the-engineering-interview-process-like-at-Stripe">Quora</a>, and <a href="https://sensu.io/blog/interviewing-engineers-at-sensu-e4fc35cd601f">Sensu</a>. Those references, along with our own internal review, led to the creation of a structured interview process that reflected what we felt it took to succeed at Yelp. As a first step, we focused on standardizing questions across all of our open roles to four key question types:</p><ul><li>Problem Solving</li>
<li>System Design</li>
<li>Ownership, Tenacity, and Curiosity</li>
<li>Playing Well with Others</li>
</ul><p>The first two interview types focus on the candidate’s technical skills, and the latter two focus on non-technical skills and how aligned the candidate is with Yelp’s values. For the technical portions, we wanted to evaluate the candidate’s skill with technical tasks that would be common in the role they’re applying for, rather than their ability to memorize algorithms or easy-to-search-for trivia. To create these questions, we asked engineers across the organization to take a problem their team recently solved and create an example on a smaller scale. We strongly believe that using real-life problems to evaluate skills captures what is needed to actually succeed at Yelp and helps us give more opportunities to people with different backgrounds to be successful in our hiring pipeline.</p><p>To evaluate answers to these questions, we standardized criteria tied to the dimensions (Technical Skill, Ownership, Business Insight, Continuous Improvement, and Leadership) that we use internally for leveling engineers. This further aligned internal and external expectations of candidates and employees.</p><p>Moving to structured interviews allowed us to take the first step to both collect and analyze interview data in a meaningful way. We went from having no comparable feedback to thousands of technical and behavioral data points in a consistent format. This not only gives us the opportunity to monitor the health and size of our pipelines, but it also enables us to identify potential problems or biases at every stage of the interview process. When observing a gap or difference in dropoff rates, we are better able to drill down to specific question sets or interviewers and determine what solutions to implement to directly mitigate bias.</p><h3 id="first-try-what-we-learned">First try: what we learned</h3><p>After introducing structured interviews, we soon identified a difference in pass rates across genders in the initial round of technical interviews. 
Upon closer inspection, we found instances where candidate performance was identical when measuring how many components of a coding question were completed. However, men were progressing to the next stage of the interview process at a higher rate than women. We were able to quickly reduce this gap by replacing individual interviewers’ judgment on a candidate’s performance with standardized pass/fail criteria, which ensured that all qualified candidates moved forward. This was the first of several successful modifications, which have collectively reduced the pass-rate gap between genders. Making corrections to the early steps of the interview process has made a huge impact on gender diversity at every subsequent stage. This ultimately increased the likelihood of more women making it to the final offer stage. With better pipeline observability, we’ve been able to more effectively hire diverse talent by mitigating these biases and reducing false negatives.</p><h3 id="second-try-defining-evaluation-criteria">Second try: defining evaluation criteria</h3><p>While we were now able to both pinpoint and remedy where drop-offs were occurring in our interview process, our approach to reducing bias was still reactive. Interpretation of candidate performance varied amongst interviewers, even with the measures we had in place. We recognized that having structured interview questions wasn’t enough, and we needed explicit evaluation criteria for all of our interviews.</p><p>To address this, we introduced points-based evaluation criteria into our structured interviews. In this initiative, we further clarified what signals we wanted interviewers to look for and capture. Points are awarded for expected candidate behaviors based on a rubric. Interviewers are required to provide an explanation for when and why points are deducted. 
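</p><p>As a toy sketch of how such a rubric might be scored and aggregated; the behaviors, weights, and pass bar below are invented for illustration and are not Yelp’s actual criteria:</p>

```python
# Points-based evaluation sketch: award points for expected behaviors,
# then aggregate across the panel instead of relying on per-interviewer
# hire/no-hire judgment.

RUBRIC = {
    "decomposed_the_problem": 2,
    "wrote_working_solution": 3,
    "handled_edge_cases": 2,
    "communicated_tradeoffs": 1,
}
PASS_BAR = 6  # hypothetical average score required to advance

def score_interview(observed_behaviors):
    # Deductions (behaviors not observed) require a written explanation.
    return sum(points for behavior, points in RUBRIC.items()
               if behavior in observed_behaviors)

def outcome(panel_scores):
    return "advance" if sum(panel_scores) / len(panel_scores) >= PASS_BAR else "hold"

panel = [
    score_interview({"decomposed_the_problem", "wrote_working_solution",
                     "handled_edge_cases"}),   # 7 points
    score_interview(set(RUBRIC)),              # 8 points, full marks
]
print(outcome(panel))  # advance
```

<p>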
This scoring framework can then be aggregated and converted to hiring and initial leveling decisions to maintain consistency across the larger organization. A key benefit of this framework is that interviewers can systematically measure candidate performance during the interview, while the onus on the interviewer to decide the final interview outcome, and therefore the opportunity for unconscious (or even conscious) bias, is reduced.</p><h2 id="how-were-evolving">How we’re evolving</h2><p>If there’s anything we’ve learned from this journey, it’s that improving interviews is an ongoing process of review and adaptation. At Yelp, we’ve made this a shared priority between Technical Talent and Engineering. Our teams work closely with one another and have a dedicated task force with several subgroups composed of folks from both groups that meet on a weekly cadence to put this commitment into action. While we still have a lot on our roadmap, here are some key lessons that we have learned so far:</p><ul><li>Making interview improvements requires a real partnership. It may seem obvious to say this, but if you’re going to improve engineering interviews, you’re going to need subject matter experts from both engineering and recruiting to capture all the nuances that are often overlooked.</li>
<li>Interviewer bias still exists in your hiring process even with a standardized process and structure. A good practice to combat this is to make sure that the group working on interview processes is reflective of the demographics of your organization, or what you’d like your organization to be. Make sure women and underrepresented minorities are involved.</li>
<li>A distributed workforce means different geographies with different cultural considerations and different employment norms, so include engineers representative of all your geographies when standardizing. Our initial task force failed to include folks from our European teams and, thus, some of our interview questions were geared towards Bay Area tech culture.</li>
<li>Collecting feedback is imperative towards making progress, so make sure you create feedback loops from all stakeholders: recruiters, recruiting coordinators, interviewers, and hiring managers. Candidates are stakeholders, too, so make sure to have a process to get feedback on their interview experience.</li>
<li>Standardization allows for easier review and change, whether that is the pipeline, the interview questions, interview evaluation, or training: the list goes on. We’re still in the midst of rolling out our points-based evaluation criteria for structured interviews, and we’re able to move a lot faster because we don’t need to reinvent the wheel!</li>
</ul><h2 id="up-next-how-we-onboard-engineers-across-the-world-at-yelp">Up next: How we onboard engineers across the world at Yelp</h2><p>Equally important to bringing in diverse talent is everything that happens onwards from the moment a candidate becomes an official Yelper. In the next post in this series, we’ll take a closer look at the thought and logistics that we go through to set folks up for success, how we’ve streamlined our onboarding process for distributed teams in the virtual world, and the opportunities for continuous learning we provide to our employees through training and mentorship programs.</p><p>If you’re finding these posts interesting and Yelp sounds like the kind of company culture that you’d like to be a part of… <a href="https://www.yelp.careers/us/en/c/engineering-jobs">we’re hiring</a>!</p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/05/engineering-career-series-using-structured-interviews-to-improve-equity.html</link>
      <guid>https://engineeringblog.yelp.com/2021/05/engineering-career-series-using-structured-interviews-to-improve-equity.html</guid>
      <pubDate>Thu, 06 May 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[One year later: building Trust Levels during COVID]]></title>
      <description><![CDATA[<p>From its devastating toll on local economies to its impact on the little things like handshakes and hugs, the COVID-19 pandemic seemed to leave nothing unchanged. Local businesses were especially impacted and forced to make big changes, many overhauling their operations overnight in order to adapt to the new normal.</p><p>Businesses turned to Yelp to communicate operational changes brought on by the pandemic. They kept their communities in the know by updating the <a href="https://www.protocol.com/manuals/small-business-recovery/yelp-pivot-small-businesses-coronavirus">COVID-19 section</a> on their business pages, which was launched at the beginning of the pandemic. They indicated new health and safety precautions, such as wearing masks and enforcing social distancing. They updated their hours, scaled back sit-down dining, pivoted to support takeout and delivery, and even introduced <a href="https://blog.yelp.com/2020/06/helping-local-businesses-reopen-during-covid-19-with-new-products-and-features">virtual service offerings</a> to remain accessible to their communities.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/2021-04-29-one-year-later-building-trust-levels-during-covid/covid-19-section.png" alt="COVID-19 section on Yelp" class="c1" /></div><p>Given the surge in businesses updating their Yelp pages with COVID-19 related changes, we knew it would be important to measure how confident we are that a given piece of business information is still accurate. To address this, Yelp’s Semantic Business Information team built a new internal system called Trust Levels. 
In this blog post, we will define Trust Levels, take a look at each part of the new system, and end with an example that ties all the pieces together.</p><h2 id="defining-trust-levels">Defining Trust Levels</h2><p>At Yelp, we call our business information “business properties.” Business properties include anything that we can describe about a business, such as the business address, whether the business is women-owned, whether it can repair appliances, etc. The numerous business properties found on each Yelp page can share special insights about businesses from retail and restaurants to home and local services.</p><p>Business owners can usually set the business properties on their business page. Consumers are also able to contribute information about a business; this can be collected through our survey questions answered by people who have checked in or visited that business. Our User Operations team reviews changes to ensure quality and accuracy, and can modify the information as well. However, determining how confident we can be about any given piece of information became especially important as businesses repeatedly had to update how they operate due to changing local government policies.</p><p>In order to define our confidence levels, we first created a unified vocabulary that could be used across engineering and Product teams, to avoid each team creating its own definition of trust. We created Trust Level labels from Level 1 (L1), which means we are highly confident that the data is both accurate and current, to Level 4 (L4), which means we do not have strong or recent signals to determine accuracy. These levels, which we designed to be simple and easy to refer to, can then be used by various teams without needing to do their own calculations. 
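</p><p>A minimal sketch of how this shared vocabulary might look to internal consumers (the names and helper below are hypothetical, not Yelp’s actual API):</p>

```python
# Trust Levels as a shared enum: L1 = highly confident the value is
# accurate and current, L4 = no strong or recent signals of accuracy.
from enum import IntEnum

class TrustLevel(IntEnum):
    L1 = 1
    L2 = 2
    L3 = 3
    L4 = 4

def display_worthy(properties, max_level=TrustLevel.L1):
    # e.g. a front-end team rendering only the most trusted values.
    return {name: value for name, (value, level) in properties.items()
            if level <= max_level}

props = {
    "hours": ("9am-5pm", TrustLevel.L1),
    "offers_delivery": (True, TrustLevel.L3),
}
print(display_worthy(props))  # {'hours': '9am-5pm'}
```

<p>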
For example, if a front-end team wants to only display information of the highest confidence level, they can do so by fetching the information and Trust Level from the backend and only displaying it if the Trust Level is L1.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/2021-04-29-one-year-later-building-trust-levels-during-covid/trust-levels-table.png" alt="The different Trust Levels" class="c1" /></div><h2 id="calculation">Calculation</h2><p>Once we defined this shared vocabulary, we set out to calculate a Trust Level for each of the tens of millions of business property values on our platform. To start, we utilized one of our existing systems that tracks historical business data. The system logs all business changes to a dedicated Kafka stream for offline use cases. Each record contains a source type (business owner, external partner, etc.), source ID (which particular source provided the data), source flow (which feature or callsite the update came through), and timestamp. All of these fields are essential indicators when it comes to determining how confident we are that a given business property is correct.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/2021-04-29-one-year-later-building-trust-levels-during-covid/offers-delivery-bar-chart.png" alt="Changes to the 'Offers Delivery' property during 2020" class="c3" /></div><p>We also realized our property ingestion APIs could be improved to capture another important signal around data freshness. A lot of incoming updates we receive are “non-updating updates” — those with values that match what we already have on file. Previously, most of our ingestion flows discarded these as redundant, so we modified them to instead emit logs to a new, dedicated stream for verification events. 
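</p><p>Routing these redundant updates into a verification stream might look roughly like this; the stream names and producer stub are hypothetical stand-ins for the real Kafka setup:</p>

```python
# Sketch: instead of discarding "non-updating updates", log them as
# verification events that signal data freshness.
import time

def publish(stream, event):
    # Stand-in for a Kafka producer.
    print(stream, event["kind"])

def ingest_property(current_value, incoming_value, source):
    event = {"source": source, "timestamp": time.time()}
    if incoming_value == current_value:
        event["kind"] = "verification"   # previously thrown away as redundant
        publish("property_verifications", event)
    else:
        event["kind"] = "update"
        event["new_value"] = incoming_value
        publish("property_updates", event)
    return event

ingest_property("9am-5pm", "9am-5pm", source="external_partner")  # verification
ingest_property("9am-5pm", "10am-6pm", source="business_owner")   # update
```

<p>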
Not all verifications are equivalent, so we made sure to include the same source-related fields described above in the Kafka stream with each event, preserving context about the verification that might be useful to us later.</p><p>Equipped with historical updates and verifications, we wrote a Spark ETL job to periodically pull these logs from S3, join them on business_id and business property, and then execute a series of rules to decide which Trust Level to assign to that pair. While we won’t detail the actual algorithm here, signals of recency and source type ended up being the biggest determiners of a given business property’s Trust Level.</p><h2 id="storage">Storage</h2><p>After calculating each Trust Level value, we needed a place to store them. Trust Levels are data describing properties, so it made sense to store these values alongside other business property metadata. A metadata table was considered multiple times in the past as we constantly fielded questions about when a property value was created, what time it was last updated, what source type updated the value, or from what flow the value was updated. 
Instead of running ad hoc queries and pulling together information from multiple datastores, we centralized the metadata in a new table to make it easier to access and eventually expose Trust Levels.</p><p>We called this table business_property_metadata and gave it the following schema:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-04-29-one-year-later-building-trust-levels-during-covid/metadata-schema.png" alt="" /></div><p>Here’s an example row of the business_property_metadata table:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-04-29-one-year-later-building-trust-levels-during-covid/metadata-row-example.png" alt="" /></div><p>We chose Cassandra over MySQL as the underlying datastore for this new table, and while our rationale for this could be a blog post of its own, here are the main reasons.</p><p>We knew the table would hold tens of millions of rows, and we could safely assume clients of the data would be accessing it using the primary key (business_id, business_property_name). Cassandra provides good read and excellent write performance for data at this scale when rows are always queried on this key, which Cassandra uses in part as a partition key to distribute a table’s data to different nodes.</p><p>MySQL, which is used extensively throughout Yelp, offers different benefits that were less important for this particular use case. We don’t anticipate needing efficient joins of this metadata with other data entities, nor do we foresee the need for strict transaction mechanisms or strong consistency guarantees around these fields. Cassandra’s eventual consistency semantics are enough for this type of business information.</p><p>As a final note on storage, our metadata table is easily extendable. We have already included a column, provenance, that captures different fields around the data’s source in case our downstream consumers need access to that information. 
In the future, we will be able to add more types of metadata to the table as use cases arise.</p><h2 id="serving-trust-levels-online-and-offline">Serving Trust Levels (Online and Offline)</h2><p>The final step in enabling our Trust Levels was ensuring that the data was accessible for various teams to use. To do this, we created new dedicated REST API endpoints for querying and writing to our metadata table. We also backfilled our metadata table with historical data that we already had and calculated Trust Levels for those properties. We then migrated existing calls around business properties to our new API endpoints in order to write live updates to our metadata table. Now, with our metadata table filled with values, internal clients can access Trust Levels and other metadata through our online APIs.</p><p>For offline access, we already had existing data streams of our business property data published to <a href="https://engineeringblog.yelp.com/2019/12/cassandra-source-connector-part-1.html">Yelp’s Data Pipeline</a> and used by teams such as Search and Ads. We needed to make sure that the new metadata information was included alongside the property data in our data pipeline, while also ensuring that the data was easy to consume for our downstream clients.</p><p>In order to accomplish this, we first aggregated our data from the new metadata table along with other currently consumed tables, using a Yelp stream processing service called Flink Aggregator. The aggregator transforms the data stream to be similarly keyed by business_id, since metadata uses a different primary key (business_id, business_property_name). We then combine these streams using <a href="https://engineeringblog.yelp.com/2018/12/joinery-a-tale-of-unwindowed-joins.html">Joinery</a> to produce one data stream that shows the entire current value of that business including metadata. 
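</p><p>The re-keying step can be illustrated in plain Python; this is only a sketch of the idea, not actual Flink or Joinery code:</p>

```python
# Collapse (business_id, business_property_name)-keyed metadata records
# into one record per business_id, so the stream joins cleanly with
# other per-business streams.
from collections import defaultdict

metadata_records = [
    {"business_id": 1, "property": "hours", "trust_level": "L1"},
    {"business_id": 1, "property": "offers_delivery", "trust_level": "L3"},
    {"business_id": 2, "property": "hours", "trust_level": "L2"},
]

def rekey_by_business(records):
    per_business = defaultdict(dict)
    for rec in records:
        per_business[rec["business_id"]][rec["property"]] = rec["trust_level"]
    return dict(per_business)

print(rekey_by_business(metadata_records))
# {1: {'hours': 'L1', 'offers_delivery': 'L3'}, 2: {'hours': 'L2'}}
```

<p>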
This allows our downstream clients to utilize the same data stream with only slight modifications on their side to read the metadata — including Trust Levels — as well.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/2021-04-29-one-year-later-building-trust-levels-during-covid/online-offline-diagram.png" alt="The pipeline for serving Trust Levels" class="c1" /></div><h2 id="example-business-hours">Example: Business Hours</h2><p>To conclude, let’s connect all the steps described above by walking through an example for a business property that was updated a lot during COVID: Business Hours. Assume there is a business where the business owner last updated their hours two weeks ago, followed by a verification event from a data partner submitting the same hours one week ago. The following diagram illustrates the entire Trust Levels flow for this particular business property.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2021-04-29-one-year-later-building-trust-levels-during-covid/example-business-hours-diagram.png" alt="" /></div><p>Anyone at Yelp can now use this authoritative confidence label however they need it. A front-end team could use it to power new UI components indicating recently updated hours. A Search engineer could experiment with incorporating it as a feature in a ranking model. A data scientist could analyze if accurate business hours data is correlated with higher user engagement. 
Whatever it is, the Trust Levels data is ready for them, and becomes another tool we use to build helpful features for consumers and business owners during these unprecedented times.</p><h2 id="acknowledgements">Acknowledgements</h2><ul><li>We would like to thank Devaj Mitra, Surashree Kulkarni, Abhishek Agarwal, Pravinth Vethanayagam, Jeffrey Butterfield (author), Maria Christoforaki, Parthasarathy Gopavarapu, our Semantic Business Information team, our Database Reliability Engineering Team, and our Data Streaming teams who all helped make Trust Levels a reality.</li>
<li>Thanks to Venkatesan Padmanabhan and Joshua Flank for technical reviewing and editing of this post.</li>
</ul><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/04/one-year-later-building-trust-levels-during-covid.html</link>
      <guid>https://engineeringblog.yelp.com/2021/04/one-year-later-building-trust-levels-during-covid.html</guid>
      <pubDate>Thu, 29 Apr 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Engineering Career Series: Hiring a diverse team by reducing bias]]></title>
      <description><![CDATA[<p>Compared to where we started, Yelp’s technical organization has made a lot of headway over the years when it comes to diverse hiring. While our approach to this work continues to evolve, we’ve made significant progress in improving the diversity of our organizations by, among other things, reducing gender and ethnicity bias in our interview process. We’re here to share some of what we’ve learned to help others in their own efforts.</p><p>If you’ve come looking for the secret formula to emulate our success, I can’t help you there, unfortunately. Anyone offering otherwise is probably selling you something. And, to be sure, you’re going to need to buy some things along the way. But, if that newest iteration of Bias Blaster 9000 sounds too good to be true, that’s because it is. There are no easy fixes here.</p><p>In the 9 years I’ve been with Yelp, we have taken several major strides to evolve our hiring processes and strategy that got us to where we are today. What I’m covering in this blog is arguably the most critical of the changes we’ve made: tracking every bit of data possible and running regular analyses to better diagnose the bias in our engineering interviews.</p><h2 id="a-data-oriented-approach">A Data Oriented Approach</h2><p>Today, we’re monitoring every stage of every candidate’s interview process, as well as several data points about the candidates themselves. We track how many candidates apply organically versus how many are sourced and how each group performs on the first round interview. We monitor the offer rates by gender identity and how each group is converting to offer acceptance. We’re able to determine down to the level of individual questions in our interview process whether they are being passed at equal rates by all people being interviewed.</p><p>To know these things, we’ve become deliberate about tracking data and analyzing it. 
We automatically publish daily updates to a host of dashboards that monitor the health of our pipelines. We report weekly on the state of our hiring pipelines so that we can make adjustments as needed. We don’t make changes to our interview process without first knowing that we can measure the effects. With these procedures in place, we are truly able to systematically identify and address problems.</p><p>We’ve come a long way from where we started. It sounds somewhat absurd in hindsight, but early on during my tenure at Yelp, we didn’t even know how many people we needed to hire. We just knew that we needed more engineers, and that we needed them last month. We monitored how many people we were hiring per month alongside our offer-to-accept conversion rate. Sort of. As long as we remembered to track them, but it wasn’t a big deal if we forgot either. There was a lot of room for improvement.</p><h2 id="an-opportunity-to-start-fresh">An Opportunity To Start Fresh</h2><p>Just prior to the onset of the pandemic, our recruiting team was presented with a new opportunity as Yelp Engineering decided to expand its footprint to Toronto. As the pandemic unfolded, our plans pivoted from focusing on Toronto to remote hiring in Canada at large. This was our first opportunity to enter a new talent market properly with the knowledge we’d gained over the previous years.</p><p>And it seems to be working: In Q1 2021, 19% of our engineering hires in Canada identified as Black or Latinx (together, underrepresented minorities or URM), and we saw even more impressive gains in leadership positions, too.</p><h2 id="start-now">Start Now</h2><p>I’ve often regretted our inability to make quicker decisions for lack of data. It takes time to build up a sizable enough data set to understand your processes and detect the bias in them. 
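As a purely illustrative sketch (not Yelp’s actual tooling), a standard two-proportion z-test can tell you whether an observed pass-rate gap between two groups is larger than chance alone would explain:</p>

```python
import math

# Illustrative two-proportion z-test: is the gap between two groups'
# interview pass rates larger than random variation would explain?
def two_proportion_z(passes_a, total_a, passes_b, total_b):
    p_a = passes_a / total_a
    p_b = passes_b / total_b
    pooled = (passes_a + passes_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    return (p_a - p_b) / se

# A 12-point gap on 100 candidates per group: |z| < 1.96, so it is not
# yet significant at the usual 5% level -- more data is needed.
z = two_proportion_z(40, 100, 52, 100)
```

<p>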
In Yelp’s case, depending on the current rate of hiring, we’re typically able to understand the state of affairs with statistical significance after a month or two of data collection. There are of course variables that impact this. For example, top of funnel stages, such as first round interviews, produce more data.</p><p>Nine years ago, it would have taken us significantly longer to produce useful data. This is especially true at the later stages of the pipeline, when the number of candidates are reduced, and for demographics that are typically underrepresented in tech, because there weren’t enough in the pipeline to make statistically significant conclusions about. If you’re just getting started, the sooner you’re tracking recruiting data, the sooner you’ll be making meaningful changes to your processes.</p><h2 id="essential-data-points">Essential Data Points</h2><p>If we were starting fresh today, there are three data points I’d want to make sure we started collecting immediately.</p><ol><li><strong>Proceed/Did not proceed rates at every stage of the interview process</strong> - This one might go without saying, but it’s the foundation everything else is built on. Start from the point of contact all the way through to offer acceptance. Everything else is useless without an understanding of the proceed/did not proceed rates at each interview stage.</li>
<li><strong>Candidate source</strong> - Knowing where your candidates are coming from generates a number of insights. Do applicants from career fairs or job boards get more offers? Most people jump to wanting to find the most successful sources, but it’s equally valuable to know your least successful sources. Candidates from certain sources falling out of your pipeline at a disproportionate rate can be very telling. We’ve seen this manifest as non-traditional CS educations, such as bootcamps, being rejected at disproportionate rates. This indicated that we needed to make it explicit in our interview evaluation criteria that we’re unconcerned with an applicant’s educational background, and changing these criteria has been highly effective at making sure candidates with a wide range of education backgrounds proceed equally through the pipeline.</li>
<li><strong>Candidate demographics</strong> - Being able to analyze your pipeline by the demographics of the candidates is extremely helpful. For instance, it’s well known that there is a gender disparity in the tech space. Tying gender or ethnicity back to the previous two data points allows for powerful insights into which interview stages are problematic. As an example, we were able to identify early on in our data that women were less likely than men to attempt the code test, which is the first step to our interview process. A surprisingly effective intervention here was to ask all candidates a second time to participate, which is a good reminder that you don’t always need to reinvent the wheel to make change.</li>
</ol><p>Point 3 comes with two <strong>very</strong> important caveats.</p><ol><li>Collecting this data is subject to different legal requirements depending on your location. Consult legal experts before moving forward.</li>
<li><strong>No one responsible for making hiring decisions can have access to this data.</strong> The trackers that hold this data are managed by our operations people and access is granted only to the sourcers and recruiters tracking the data.</li>
</ol><h2 id="assess-your-systems">Assess Your Systems</h2><p>Don’t let perfect be the enemy of good when tracking your data. Teams can be overwhelmed with the possibilities of what to track and how to go about it. A good place to start is by getting a grasp on what your existing systems can provide. You likely have some sort of applicant tracking system (ATS) that can provide some types of pipeline metrics. Learn what your system can do for you and how it does it.</p><p>It’s likely you’ll have to supplement your ATS with custom-made solutions, as there’s going to be data that your ATS is unable to provide. Don’t be too good for spreadsheets. I know, I know, there has to be a better way. There’s always a better way. Getting the data matters more than how you’re getting it. If spreadsheets allow you to track your data while you find a more permanent off-the-shelf solution or your teammates in engineering build you something, do it. We’ve relied on spreadsheets for years. Even though we’ve incorporated tools such as Tableau, spreadsheets have remained an important part of our system.</p><h2 id="proceduralize">Proceduralize</h2><p>Good, reliable data depends on maintaining consistent data collection practices. Depending on your systems, some of this will be automatic. At Yelp, we track a sizable amount of data manually and use our ATS for everything it can accurately, automatically track. For everything else, we rely on our recruiters and sourcers to manually track data in spreadsheets.</p><p>Each recruiter and sourcer has a centrally managed tracker that they use to track their candidates from start to finish. There is no room for interpretation about what data to collect and how to collect it. Every tracker is exactly the same and every team member tracks the same data.</p><p>Maintaining and analysing your data should also be the explicit responsibility of someone on your team. 
For us, things really took off when we created a full-time operations role within the recruiting organization. Taking this work seriously requires constant maintenance that can’t be done in “spare” time.</p><h2 id="pitfalls">Pitfalls</h2><p>This approach is not foolproof. There are mistakes to be made, and we’ve made our fair share of them. Chief among them is drawing conclusions from data that is not statistically significant. We’re often dealing with fairly small data sets, and it can be very tempting to make changes based on perceived patterns. In these cases, patience is key.</p><p>If something looks off, definitely pay attention, but try not to jump to conclusions. Rolling a change back hurts and also messes up your painstakingly collected data. It ends up being more damaging in the long run to make changes based on data that hasn’t reached significance.</p><h2 id="structured-interviews">Structured Interviews</h2><p>Part 2 of this post will go into detail on how we’ve built a structured interview process and acted on the data that we’ve collected. Layering structured interviews on top of our data collection and analysis practices has allowed us to make fine-grained tweaks to the interview process that would be otherwise impossible. Our insights have led us to a points-based system in our latest iteration of structured interviews that will further our goal of more equitably scoring interview performance.</p><p>Lastly, if you’re finding these posts interesting and Yelp sounds like the kind of company culture that you’d like to be a part of… <a href="http://www.yelp.com/careers">we’re hiring!</a></p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/04/engineering-career-series-hiring-a-diverse-team-by-reducing-bias.html</link>
      <guid>https://engineeringblog.yelp.com/2021/04/engineering-career-series-hiring-a-diverse-team-by-reducing-bias.html</guid>
      <pubDate>Thu, 22 Apr 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Engineering Career Series: Building a happy, diverse, and inclusive engineering team]]></title>
      <description><![CDATA[<p>I considered writing this as a clickbaity listicle: “7 secrets of engineering team management - you won’t believe number three!” Unfortunately that’s impossible, because it’s a much harder topic, and anyway, number three is: “many years of ongoing investment in building the right team culture, making a lot of mistakes, and learning from them.” Less catchy, but much more what this series is going to try and cover…</p><p>I’ve been at Yelp for eight years now, and I’ve been leading engineering teams for almost 25 years in both the UK and the US, at a wide variety of companies, at different scales and stages of their development, and in very different parts of the technology industry.</p><p>Over that time, with the assistance of many of my colleagues and mentors, I’ve developed a set of principles that guide my approach to management and building engineering team cultures. When I joined Yelp, I found a company with <a href="https://www.yelp.careers/us/en/about-us">values that aligned very well</a> with these principles, at all levels of the company, so I’ve been lucky to be able to try and apply them thoroughly in practice here.</p><h2 id="teamwork-matters-more-than-individual-brilliance">Teamwork matters more than individual brilliance</h2><p>People are indeed individually brilliant, and everyone has unique life experience and talents to contribute to their work. However, it takes teams to build something at the scale of Yelp. Building a culture that values empathy and teamwork pays dividends. 
A corollary of this is that a strong <a href="https://www.gsb.stanford.edu/faculty-research/books/no-asshole-rule-building-civilized-workplace-surviving-one-isnt">“no assholes” rule</a> is vital.</p><h2 id="diversity-leads-to-success-but-only-if-theres-equity-inclusion-and-belonging">Diversity leads to success, but only if there’s equity, inclusion, and belonging</h2><p>There’s plenty of evidence that <a href="https://hbr.org/2016/11/why-diverse-teams-are-smarter">diversity makes teams more effective</a>, but that doesn’t mean that just hiring a diverse team automatically leads to success. To really succeed, you have to build a company culture where you genuinely deliver an inclusive and equitable experience for everyone. Building and cultivating a culture where everyone can thrive and feel like they belong requires you to constantly examine what you’re doing as a company and what the real impact of it is on your teams. That includes listening to people’s lived experiences and constantly trying to improve.</p><h2 id="distributed-teams-help-diversity">Distributed teams help diversity</h2><p>It’s a lot easier to build truly diverse teams if you’re not limited to having to hire people near the places you have offices. We’d already been hiring in multiple countries for some years at Yelp. Re-examining remote work and distributed teams during the pandemic has highlighted both the scale of the opportunity to really build teams that “meet people where they live,” but also the challenges in building successful distributed teams, abandoning the idea of a “head office” and creating a culture where everyone has an equal opportunity to succeed.</p><h2 id="no-process-is-another-name-for-bias">“No process” is another name for “bias”</h2><p>It’s really easy to have no process in small organizations - which is where you start by default, and the flexibility of not having a process offers lots of advantages at first. 
The thing is, you never really have no process; you just have a process that you’ve never written down and examined critically. And processes that you’ve never examined critically generally hide a world of unexamined unfairness, even with the best of intentions. You need to articulate and examine what these implicit processes are and make them more explicit, to eliminate that unfairness and the biased outcomes it produces. That means looking in depth at how you hire, how career advancement works, how you compensate people, how you think about technical leadership, and many other more innocuous-seeming things where you encode unintentional biases into the system and culture of your company, influencing the likelihood that different people thrive or fail.</p><h2 id="you-have-to-walk-the-walk">You have to walk the walk</h2><p>It’s no use just <em>saying</em> you want to be better at building diverse, inclusive, happy teams. You need to actually change things, measure the results of your changes, look at that data, and then try and improve things again. This continuous iteration driven by data is vital: you must be really transparent and accountable about what you’re doing, and its successes and failures. This directly relates to the previous principle about process, but fundamentally underpins every effort to improve here. And yes, it’s hard. And you will fall on your face sometimes, publicly. And it will hurt. And you need to get up again and keep trying, because that’s the only way things will improve.</p><p>Rather than just hearing from me on how we’ve approached trying to live up to some of these principles at Yelp, we have a series of blog posts over the coming months to further explain. These blog posts will go into detail on the how as well as the why, and share some of what we’ve tried, what worked and what didn’t, in an attempt to give back to the many people whose ideas and learning we’ve built on over the years. 
Over the next few months we’ll cover:</p><h2 id="hiring-a-diverse-team-reducing-bias-in-engineering-interviews">Hiring a diverse team: reducing bias in engineering interviews</h2><p>How Yelp has approached hiring over the years, and the major lessons we learned. Once we started to standardise our approach to interviewing, we were able to analyse the data to find out if we were actually living up to our good intentions.</p><h2 id="how-we-onboard-engineers-across-the-world-at-yelp">How we onboard engineers across the world at Yelp</h2><p>Once you’ve hired someone amazing, you need to set them up for success on day one. The initial onboarding is vital, but is only part of the process. We’ve found that it’s critical to have a strong mentorship program for new hires, and that means choosing the right people to mentor and train them well. Mentorship doesn’t just stop at onboarding either, so we run an ongoing training and career development program to make sure people from diverse backgrounds can all succeed at Yelp.</p><h2 id="career-paths-for-engineers-at-yelp">Career paths for engineers at Yelp</h2><p>Yelp previously had a completely flat “no levels” individual contributor career framework for Engineering. 
We’ll cover how we designed and redesigned our framework for career growth and levelling to move away from that, and discuss how that shift increased fairness and equity.</p><h2 id="technical-leadership-at-yelp">Technical leadership at Yelp</h2><p>Why we approach technical leadership as a role you can choose to take on at Yelp, rather than just a level within our career levelling framework, and how we’ve tried to build a collaborative, cross-pollinating community of technical leaders who work together regularly to solve “big picture” problems, rather than just being experts in their own fields.</p><h2 id="how-yelp-approaches-engineering-management">How Yelp approaches engineering management</h2><p>What “success” looks like for managers at Yelp, what we ask managers to do and to value, how we’ve built this into the career path for managers, and how we hire and onboard them.</p><h2 id="ensuring-pay-equity--career-progression-in-yelp-engineering">Ensuring pay equity &amp; career progression in Yelp Engineering</h2><p>“Walking the walk” meant actually examining in detail how we compensated people and how they progressed in their career, and whether that was actually fair and equitable across all demographics at Yelp. 
And then publishing the outcomes to the whole Engineering team and committing to do so annually, whatever the results were.</p><h2 id="fostering-inclusion--belonging-within-yelp-engineering">Fostering inclusion &amp; belonging within Yelp Engineering</h2><p>Improving inclusion and belonging requires you to provide for teams and groups in many different ways, like supporting Employee Resource Groups to encourage communities to socialise, collaborate, and empower themselves, providing flexible working practices to suit people with different needs, abilities, and lifestyles, as well as designing systems and processes that give people the support they need in the time and place and manner they need it.</p><p>I hope you’ll find this series informative and helpful. I welcome the opportunity to share our triumphs and setbacks with you, and look forward to the feedback on what we’re doing well, and what we still need to learn to do better.</p><p>And last but not least, if this sounds like the kind of company culture that you’d like to be a part of, and you’d like to help make it better… <a href="http://www.yelp.com/careers">we’re hiring</a>!</p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/04/engineering-career-series-building-a-happy-diverse-and-inclusive-engineering-team.html</link>
      <guid>https://engineeringblog.yelp.com/2021/04/engineering-career-series-building-a-happy-diverse-and-inclusive-engineering-team.html</guid>
      <pubDate>Thu, 08 Apr 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Powering Messaging Enabledness with Yelp's Data Infrastructure]]></title>
      <description><![CDATA[<p>In addition to helping people find great places to eat, Yelp connects people with great local professionals to help them accomplish tasks like making their next big move, fixing that leaky faucet, or repairing a broken phone screen. Instead of spending time calling several businesses, users can utilize Yelp’s <a href="https://blog.yelp.com/2020/08/yelp-reinvents-the-hiring-experience-for-home-and-local-services">Request a Quote</a> feature to reach out to several businesses at once, receive cost estimates from those businesses, and ultimately hire the right local professional for the job. This post focuses on how <a href="https://engineeringblog.yelp.com/2016/07/billions-of-messages-a-day-yelps-real-time-data-pipeline.html">Yelp’s Data Pipeline</a> is used to efficiently compute which businesses are eligible for the feature, and also introduces Yelp’s Data Lake, which we use to track historical values of the feature for offline metrics and analytics.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/2021-04-05-powering-messaging-enabledness-with-yelps-data-infrastructure/raq-image.png" alt="Request a Quote Widget" class="c1" /></div><p>While most businesses can be reached via the phone number listed on their Yelp business page, only a subset of businesses are eligible for Yelp’s messaging feature (at least for now!). We refer to the ability for a business to receive messages from users as “messaging enabledness” (or sometimes, just enabledness). It is determined by checking several different conditions about the business. For instance, the business owner must opt-in to the feature and they must have a valid email address so they can be notified about new messages, among other things.</p><p>Computing messaging enabledness is tricky since checking all the criteria requires joining and fetching values from several different SQL tables. 
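Conceptually, the final check reduces to requiring every criterion to hold at once; a toy sketch (the helper and field names here are illustrative, not Yelp’s actual data model):</p>

```python
# Toy sketch: a business is messaging enabled only if every criterion
# holds. The criteria and field names are illustrative, not Yelp's
# actual data model.
def compute_enabledness(business):
    return all([
        business.get("owner_opted_in", False),  # owner turned the feature on
        bool(business.get("owner_email")),      # valid email for notifications
        not business.get("is_closed", False),   # business is still operating
    ])

biz = {"owner_opted_in": True, "owner_email": "owner@example.com"}
```

<p>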
For some features, like deciding whether or not to display the “Request a Quote” button on a business’s page, it’s essential to correctly identify a business’s enabledness, even if it takes extra time to perform all those joins. For other applications of the data, such as analysis or indexing, we can tolerate the risk of a stale value in order to speed things up, so a cached mapping of an identifier for the business (business_id) to its messaging enabledness is stored in its own SQL table. This is kept up to date by a batch which runs periodically to recompute the value for all businesses.</p><p>In addition to storing the current state of enabledness, Yelp is also interested in persisting a historical record of messaging enabledness for businesses. This allows the company to measure the health of the Request a Quote feature in addition to being an invaluable source of information when investigating any pesky bugs that might pop up.</p><p>There are millions of businesses listed on Yelp, so storing this history in a SQL table is not efficient in terms of storage cost or query time. Another option was to store a nightly snapshot of the table, but that would have resulted in duplicated information day to day, would have been more difficult to query, and wouldn’t have captured multiple changes to the same business in a single day. What we really want to store is a change log of the table.</p><p>Remember that this data is stored in a SQL table. If you’ve been following along with the Data Pipeline posts on the Yelp engineering blog, you’ll know that Yelp has developed a tool called the <a href="https://engineeringblog.yelp.com/2016/08/streaming-mysql-tables-in-real-time-to-kafka.html">Replication Handler</a> which publishes a message to our Kafka infrastructure for every update to a SQL database. By connecting this tool to the table caching businesses’ messaging enabledness, a full history of changes can be written to a Kafka stream. 
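Conceptually, that change log is just an append-only record of mutations; a toy illustration (the field names are ours, not the actual replication message schema):</p>

```python
import time

# Toy illustration of a change log: one append-only record per mutation,
# instead of a full nightly copy of the table. Field names are ours, not
# the actual replication message schema.
def log_change(log, business_id, old_value, new_value):
    log.append({
        "business_id": business_id,
        "old": old_value,
        "new": new_value,
        "ts": time.time(),
    })

history = []
log_change(history, 1234, False, True)
log_change(history, 1234, True, False)  # both same-day flips are preserved
```

<p>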
Now if only we had a way to store this stream…</p><h2 id="yelps-data-lake">Yelp’s Data Lake</h2><p>Yelp’s Data Lake is our solution for storing schematized data at scale. Our Data Lake is built on top of the Apache Parquet format, Amazon S3, and AWS Glue. This S3-based architecture allows us to cheaply store data, making it possible for us to keep records over a long period of time. Our Data Lake implementation also integrates with our in-house schema management system, <a href="https://engineeringblog.yelp.com/2016/08/more-than-just-a-schema-store.html">Schematizer</a>.</p><p>Data from Kafka can easily flow into our Data Lake through our Data Lake Sink Connector. The connector provides a fully-managed way of moving data to the Data Lake, without engineers having to worry about any underlying infrastructure. All engineers need to do is specify which data they want in the Data Lake, either through our datapipe CLI tool or through our Pipeline Studio web UI.</p><div class="highlighter-rouge highlight"><pre>$ datapipe datalake add-connection --namespace main --source message_enabledness
Data connection created successfully
Connection #9876
  Namespace:         main
  Source:            message_enabledness
  Destination:       datalake
</pre></div><p>Once in the Data Lake, data can power a wide variety of analytic systems. Data can be read with Amazon Athena or from Spark jobs. Using Redshift Spectrum, we also allow analytics from Redshift, where Data Lake data can be joined with data we put in Redshift using our <a href="https://engineeringblog.yelp.com/2016/10/redshift-connector.html">Redshift Sink Connector</a>. Redshift Spectrum can also be used to power Tableau dashboards based on Data Lake data.</p><p>We previously mentioned that the messaging enabledness table was updated periodically. Even though changes to the table are being persisted to the Data Lake, with this approach we’re not able to identify the time the change happened and have no way to tell why the value changed (i.e. which of the criteria triggered this change).</p><p>In order to catch these changes in real time, each time an update happens that might affect a business’s enabledness, an asynchronous task can be submitted to recompute the value and store it in the table along with the reason for the change. The code looks something like this:</p><div class="highlighter-rouge highlight"><pre>def update_value(business_id, new_value, reason):
    update_value_for_business(business_id, new_value)
    update_enabledness_async(business_id, reason)

def update_enabledness_async(business_id, reason):
    current_enabledness = get_enabledness_from_cache(business_id)
    updated_enabledness = compute_enabledness(business_id)
    if current_enabledness != updated_enabledness:
        set_enabledness(business_id, updated_enabledness, reason)
</pre></div><p>While this works, you might be able to spot a shortcoming: anytime an engineer adds a new way to update a value that might change a business’s enabledness, they also need to remember to call the <code class="highlighter-rouge">update_enabledness_async</code> method. Even though this approach might capture all the changes when it is first written, as the code evolves over time a single mistake can cause the data stored in the table to be inaccurate.</p><div class="highlighter-rouge highlight"><pre>def update_value_v2(business_id, new_value):
    update_value_for_business(business_id, new_value)
    # Something is missing here...
</pre></div><h2 id="reacting-to-the-changes">Reacting to the Changes</h2><p>Looking more closely at the system above, there is something peculiar about the update call. Recomputing enabledness isn’t really an operation that should happen after one of the criteria was updated. Instead it happens as a result of the value being changed. Rather than depending on engineers and code reviewers to remember that the <code class="highlighter-rouge">update_enabledness_async</code> function must be triggered manually, what if we could build a system that triggered this update as a result of the change?</p><p>Each of the criteria for enabledness is stored in a SQL table, and as we discussed earlier in the post, the Replication Handler can be used to publish changes to those tables to Yelp’s Data Pipeline! Consumers of those topics (specifically, <a href="https://engineeringblog.yelp.com/2016/08/paastorm-a-streaming-processor.html">Paastorm spolts</a>) can be set up to call the <code class="highlighter-rouge">update_enabledness_async</code> task on any relevant change!</p><p><img src="https://engineeringblog.yelp.com/images/posts/2021-04-05-powering-messaging-enabledness-with-yelps-data-infrastructure/system-diagram.png" alt="System Diagram" /></p><p>This post introduces the Yelp Data Lake and demonstrates how the Data Lake Sink Connector makes it easy to track the historical value of a business’s messaging enabledness. It also shows how the Streaming Infrastructure you’ve read about in previous blog posts (or at least the ones you’re about to read right after you finish this one!) is used to solve real engineering problems at Yelp, allowing systems to react to data changes without the need to write custom components or complex logic.</p><ul><li>Thanks to Mohammad Mohtasham, Vipul Singh, Francesco Di Chiara, and Stuart Elston who assisted at various stages of design and implementation of this project.</li>
<li>Thanks to Blake Larkin, William O’Connor, and Ryan Irwin for technical review and editing.</li>
</ul><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>Are you interested in using streaming infrastructure to help solve tough engineering problems?</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/b5d226cd-6ea1-4d12-b875-725b331202b7?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/04/powering-messaging-enabledness-with-yelps-data-infrastructure.html</link>
      <guid>https://engineeringblog.yelp.com/2021/04/powering-messaging-enabledness-with-yelps-data-infrastructure.html</guid>
      <pubDate>Mon, 05 Apr 2021 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Passwordless Login: Reengaging Business Owners with Less Friction]]></title>
      <description><![CDATA[<p>As various teams at Yelp were focused on developing features to help businesses adapt to COVID-19, some teams were looking ahead and developing features that would help businesses in the later stages of the pandemic and after it.<br /></p><p>Early on in the pandemic, we saw some businesses pause advertising on Yelp as government regulations required many businesses to temporarily close or limit their operations. However, businesses quickly adjusted to the local regulations, while implementing health and safety precautions to keep their staff and customers safe. Through this adjustment, we wanted to ensure it was easy for them to restart advertising right where they left off.</p><p>Our data revealed that as of April 2017, most business owners used password-based logins to sign into their business owner account. However, if they forgot their password, it could be a frustrating experience for them to continue into the app.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/2021-02-23-passwordless-login/passwordless-login-image-1.png" class="c1" alt="image" /></div><p>A typical Reset Password flow looked like:</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/2021-02-23-passwordless-login/passwordless-login-image-2.png" class="c3" alt="image" /></div><p>After receiving their Reset Password link, they were presented with:</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/2021-02-23-passwordless-login/passwordless-login-image-3.png" class="c4" alt="image" /></div><p>Depending on what the users entered into the two input boxes, they could receive the following error messages:</p><ul><li>Please enter a password</li>
<li>Oops, the passwords you entered don’t match!</li>
<li>Please choose a password of at least 6 characters</li>
<li>This password is insecure, please try a different one.</li>
<li>This password has been used in the past year. Please enter a different password.</li>
<li>… and more…</li>
</ul><p>Our data showed that we sent one of these errors about 7,500 times a day.<br /></p><p>To resolve this, our solution was to remove the need to authenticate a business owner with a password by creating a passwordless login. Yelp sends a unique link (called a Magic Link): a short-lived link (valid from one hour up to three days) that provides automatic login functionality. A Magic Link will automatically open the Yelp for Business app, verify a business owner’s credentials, log them in, and then optionally redirect them to anywhere in the app of our choosing. Magic Links are one-time use and time sensitive, so the links will eventually expire if they aren’t used.</p><p>To unlock the full capabilities of this feature, we also appended each Magic Link with a redirect link that takes the user to a specific page after a successful automatic login. The redirect link can be any deeplink that we already support. Particularly for this initiative, we redirected our users to our One Click Restart screen, which allowed business owners to restart their ads with Yelp.</p><p>We have implemented this logic on Android, iOS, and the web. Even if business owners do not have the Yelp for Business app installed on their device, they can still take advantage of this feature.<br /></p><p>With this solution, we are able to provide a seamless user journey for business owners to the end goal, securely and with only one click. Technically, it was a creative and innovative solution.</p><p>Figure 1 shows the original status quo before we implemented Magic Links, and Figure 2 shows the sequence of steps after implementing Magic Links.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/2021-02-23-passwordless-login/passwordless-login-image-4.png" class="c1" alt="image" /></div><p>At Yelp, UrlCatcherActivities are Activities that are responsible for handling deeplinks. We have deeplinks that are prefixed with https://biz.yelp.com or yelp-biz://. 
Given two deeplinks that are exactly the same with the exception of the host, the app could reroute them differently. Since Magic Links could be sent via device notifications or emails, we needed to support both URI hosts.</p><p>The MagicLinkUrlCatcherActivity was responsible for intercepting all Magic Links via the Android Manifest and acting on them. It validated the Magic Link and provided feedback for both successful and unsuccessful validations.</p><p>Our Magic Link schemas looked like this: https://biz.yelp.com/login/passwordless/?return_url=https://biz.yelp.com/ads/i2kK8NtpmtuKf84NYm0d3A/</p><p>An invalid Magic Link could consist of an expired, malformed, or missing MAGICLINKTOKEN.</p><p>On successful validation, we logged the user into the app. The return_url is an optional parameter. If it was present and also a valid deeplink that we supported, we forwarded the redirect url embedded in the Magic Link to the downstream UrlCatcherActivities. From that point on, the app behaved as status quo. If the return_url was not specified, we redirected to the home screen.</p><p>On unsuccessful validation, the activity was responsible for redirecting the user to the Log In screen so that users could enter their credentials manually. If successful, we redirected the user to the embedded link within the Magic Link.</p><p>When the project first began, we had (naively) thought that this project would be simple (refer to Figure 1). Our initial strategy was to write the Magic Link logic in both of the UrlCatcherActivities, but like most projects, the more we worked on it, the more we realized that there were a lot of edge cases we had to handle. Accounting for each edge case for each UrlCatcherActivity would duplicate code and double our blast radius. On top of that, each requirement change, no matter the size, would have to be duplicated. 
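<p>As a rough illustration, the validation flow described above can be sketched in Python. This is a hypothetical, simplified server-side model (the token store and the <code class="highlighter-rouge">issue_magic_link</code> / <code class="highlighter-rouge">validate_magic_link</code> names are ours, not Yelp’s actual implementation); it captures the expired/malformed/missing-token checks, one-time use, and forwarding of the <code class="highlighter-rouge">return_url</code> with a fallback to the home screen:</p>

```python
import secrets
import time
from urllib.parse import urlparse

# Hypothetical in-memory token store; a real system would persist
# tokens server-side with their expiry and used flags.
ISSUED_TOKENS = {}

# Only forward redirects to hosts we control.
ALLOWED_HOSTS = {"biz.yelp.com"}

def issue_magic_link(return_url=None, ttl_seconds=3600):
    """Create a one-time, time-limited login token and return a link."""
    token = secrets.token_urlsafe(32)
    ISSUED_TOKENS[token] = {
        "expires_at": time.time() + ttl_seconds,
        "used": False,
        "return_url": return_url,
    }
    return f"https://biz.yelp.com/login/passwordless/{token}"

def validate_magic_link(token):
    """Return the post-login redirect URL, or None if the token is
    missing, malformed, expired, or already used."""
    record = ISSUED_TOKENS.get(token)
    if record is None or record["used"] or time.time() > record["expires_at"]:
        return None
    record["used"] = True  # one-time use
    url = record["return_url"]
    if url and urlparse(url).hostname in ALLOWED_HOSTS:
        return url  # forward the embedded deeplink downstream
    return "https://biz.yelp.com/"  # no (or untrusted) return_url: home screen
```

<p>Restricting forwarded redirects to hosts we control keeps the login link from doubling as an open redirect.</p>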
We quickly realized that we should refactor all the code into one place sooner rather than later.</p><p>The Magic Link high-level logic (illustrated by the diagram below) was refactored into the MagicLinkUrlCatcherActivity.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/2021-02-23-passwordless-login/passwordless-login-image-5.png" class="c4" alt="image" /></div><p>Each UrlCatcherActivity already contained business logic that required some time to understand. By moving all Magic Link-related logic into the MagicLinkUrlCatcherActivity, we only passed the optional redirect url into the downstream logic.</p><p>We didn’t need to test this end-to-end. We only had to concentrate on four main areas:</p><h3 id="input-validation">Input validation</h3><p>We validated that all deeplinks into the app were successfully triaged by the AndroidManifest to go to either one of the downstream UrlCatcherActivities or the MagicLinkUrlCatcherActivity.</p><h3 id="magic-link-validation">Magic Link validation</h3><p>We tested that the MagicLinkUrlCatcherActivity was able to handle successful and unsuccessful validation of the Magic Link.</p><h3 id="redirect-links-are-passed-to-the-correct-catcher">Redirect links are passed to the correct Catcher</h3><p>We wrote tests to ensure that the embedded redirect links within the Magic Link would get passed to the correct downstream UrlCatcherActivities, and we also verified the behavior when no redirect link was passed.</p><h3 id="analytics">Analytics</h3><p>We verified that the correct analytics were fired at specific points in the code so that Yelp could track usage and other metrics of interest.</p><p>Using Magic Links together with deeplinks reduced friction for our business owners to log into their accounts and resume advertising. By making passwords obsolete on login, we reduced user churn resulting from abandoned password resets. 
We hope this feature will help our business owners get the word out about their business and better communicate and engage with their customers.</p><p>Shoutout goes to Karlo Pagtakhan and Khushboo Puneet for working on this with me! Also, thank you to Blake Larkin, Eric Hernandez, Rajan Roy, Joshua Walstrom, Patrick Fitzgerald, and Mark Brady for technical review and editing.</p><div class="island job-posting"><h3>Become an Android Engineer at Yelp!</h3><p>We're working on cool interesting problems everyday! Come join our Android team!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/d13b9fe9-c523-4407-9432-7783d2848fca/Software-Engineer-Android-Remote?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/03/passwordless-login.html</link>
      <guid>https://engineeringblog.yelp.com/2021/03/passwordless-login.html</guid>
      <pubDate>Mon, 01 Mar 2021 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Boosting user conversion with UX performance wins]]></title>
<description><![CDATA[<p>Everyone loves graphs going up and to the right, unless they reflect your page load timings. This blog post is about curtailing rising page load times. <a href="https://biz.yelp.com">Yelp for Business Owners</a> allows business owners to manage their listing, respond to reviews, edit their business information, upload business photos, and more. Business owners can also purchase Yelp Ads and profile products to target local audiences and enhance their business’s presence on Yelp. In this blog post, you’ll learn about the ways we improved the UX performance of our ads purchase flow by dramatically reducing the load times. You’ll be able to apply the same tactics to your own flow and hopefully achieve results similar to ours:</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/boosting-user-conversion-with-ux-performance-wins/fcp-reduction.png" class="c1" alt="image" /></div><p>Our core ads purchase flow is a single-page React application powered by Python-based backend services and GraphQL. Over the past couple of years, it has grown from a four-step process to a <a href="https://blog.yelp.com/2020/09/getting-started-with-yelp-ads">seven-step process</a> with new features to provide better ad campaign controls. However, as we added more features, performance suffered. Our page-load P75 timings increased from 3 seconds to 6 seconds for desktop users. This slowdown was even more pronounced for our mobile users due to tighter constraints on network speed and reliability.</p><p>It's a <a href="https://web.dev/why-speed-matters/">known fact</a> that faster-loading pages directly benefit user conversion. We wanted to measure how faster performance would affect the bottom line, so we ran a lightweight experiment to measure the relationship between performance and user conversion. 
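<p>For concreteness, “P75” is the 75th percentile of the page load samples: the time under which 75% of page loads complete. A minimal nearest-rank sketch (our own helper, not Yelp’s actual metrics pipeline):</p>

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample >= p% of the data."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * p // 100))  # ceil(n * p / 100), 1-indexed
    return ordered[rank - 1]

# Page load times in seconds for one day of (made-up) traffic:
load_times = [2.7, 2.9, 3.1, 3.4, 3.8, 4.2, 5.0, 6.1]
p75 = percentile(load_times, 75)  # -> 4.2
```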
We made some backend optimizations to reduce page load timings by one second, and immediately observed a 12% relative increase in conversion rate. This early win gave us confidence in our future investments along with full buy-in and support from our product team.</p><p>The first step in our performance effort was to set up a framework that would standardize the metrics and logging across all our flows. We decided to target two specific metrics:</p><h3 id="first-contentful-paint-fcp">First Contentful Paint (FCP)</h3><p>FCP measures the time from sending the page load request until the browser renders any image or text. It is widely accepted as a <a href="https://web.dev/first-contentful-paint/">key metric</a> in the industry to measure your web page’s performance. Targeting FCP was critical because it is the first hint to the user that their page is starting to load. During our experimentation, we found that a user was much more likely to leave our site during a page load than after they saw any content, even if they only saw a loading spinner. Since a page load event depends on multiple systems (such as web browser, routing layers, authentication proxies, etc.), we further broke down our FCP into the following units to help categorize our efforts:</p><ol><li>
<p>Redirect time: How long the browser spent following redirects (HTTP 3xx responses).</p>
</li>
<li>
<p>Request time: How long the request-response cycle took for the main request inside Yelp servers.</p>
</li>
<li>
<p>Rendering time: How long it took for the browser to render the first contentful paint after receiving the initial response.</p>
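<p>These three units can be derived from the browser’s Navigation Timing and Paint Timing marks. A rough sketch of the decomposition (the exact attribution here is our assumption, not the precise logging Yelp uses):</p>

```python
def fcp_breakdown(marks):
    """Split an FCP measurement into the three units above.

    `marks` maps timing mark names to milliseconds since navigation
    start: redirectStart, redirectEnd, requestStart, responseStart,
    and firstContentfulPaint.
    """
    return {
        # time spent following redirects before reaching the final URL
        "redirect": marks["redirectEnd"] - marks["redirectStart"],
        # main request leaving the browser until the first response byte
        # (network plus server time)
        "request": marks["responseStart"] - marks["requestStart"],
        # downloading, parsing, and rendering until the first paint
        "rendering": marks["firstContentfulPaint"] - marks["responseStart"],
    }
```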
</li>
</ol><p>The image below shows the breakdown of our timings in the units discussed above.</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/boosting-user-conversion-with-ux-performance-wins/fcp-breakdown.png" class="c1" alt="image" /></div><h3 id="time-to-interactive-tti">Time to Interactive (TTI)</h3><p>This metric measures the time spent by the browser to fully load a page. It captures any client-side rendering logic and async data fetching required to render the complete user experience. At Yelp, we call this metric <em>Yelp Page Complete</em> (YPC). It is critical to capture TTI since many of our applications render a shimmer or a page shell after the initial page load, and then the respective components fetch their data. TTI helps capture the entire user experience timings.</p><p>We have several other similar flows on the biz site with their own data fetching strategy. To make the integration convenient across all of them, we created a shared JavaScript package to consolidate all of the logic related to logging, polyfills, batching/throttling of logging-related AJAX calls, etc. In the end, the integration only required adding a couple of lines to start logging all the performance metrics.</p><p>We relied on several tools that were critical to our effort and are worth mentioning here:</p><h3 id="zipkin">Zipkin</h3><p><a href="https://zipkin.io/">OpenZipkin</a> is an open-source distributed tracing system set up at Yelp. It helped identify bottlenecks during the request lifecycle inside Yelp servers. Our request travels through multiple services, and this tool was indispensable in identifying potential optimizations on the backend. 
Here is a sample Zipkin trace:</p><div class="image-caption"><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/boosting-user-conversion-with-ux-performance-wins/zipkin-trace.png" class="c1" alt="image" /></div><p class="subtle-text"><small>Source: https://zipkin.io/</small></p></div><h3 id="webpack-bundle-analyzer">Webpack Bundle Analyzer</h3><p><a href="https://github.com/webpack-contrib/webpack-bundle-analyzer">Webpack Bundle Analyzer</a> helped us visualize our JavaScript bundles’ content with an interactive zoomable treemap. This tool was crucial to identify optimizations in our frontend assets that we discuss later. Below is a sample treemap interaction from the plugin’s Github repository:</p><div class="image-caption"><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/boosting-user-conversion-with-ux-performance-wins/webpack-treemap.gif" class="c1" alt="image" /></div><p class="subtle-text"><small>Source: https://github.com/webpack-contrib/webpack-bundle-analyzer</small></p></div><h3 id="splunk">Splunk</h3><p>We ingested all of our performance metrics in Redshift database tables and visualized them as <a href="https://www.splunk.com/en_us">Splunk</a> dashboards. These helped us track our progress in real-time while deploying changes. Below is an example dashboard:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/boosting-user-conversion-with-ux-performance-wins/splunk-dash.png" alt="" /></div><h3 id="chrome-devtools">Chrome DevTools</h3><p>Chrome’s tooling provided terrific insights into our frontend performance issues. We specifically relied on the <a href="https://developers.google.com/web/tools/chrome-devtools/evaluate-performance/timeline-tool">flame charts</a> under the Performance tab to identify where the browser’s main thread was blocking and how much time was being spent in our assets loading, parsing, and evaluation. 
Google <a href="https://developers.google.com/web/tools/lighthouse">Lighthouse</a> also provided actionable opportunities and diagnostic information.</p><p>After learning from the gathered metrics, we planned on tackling the performance issue on all fronts: backend, frontend, and infrastructure. Below are a few things that are worth sharing:</p><h2 id="frontend-optimizations">Frontend Optimizations</h2><p>Our JavaScript bundles had been growing slowly due to continuous feature additions over the past couple of years. Yelp’s in-house tooling already enforced general best practices such as gzip compression, bundle minification, dead code elimination, etc. So most of the issues were part of our application setup. After analyzing our bundle using the tooling above, we employed the various techniques listed below to reduce our gzipped bundle size from 576 KB to 312 KB, an almost 50% reduction!</p><ol><li>
<p><strong>Code Splitting</strong>: Serving code for all the seven pages of our purchase flow during the initial page load was undoubtedly wasteful. We opted to use <a href="https://loadable-components.com/">loadable components</a> to create separate chunks for different steps that would load on demand. This chunking helped reduce our bundle size by 15%. However, loading these assets on demand added a small delay on every page load, so we wrote a helper function to preload all the chunks using the useful <a href="https://developer.mozilla.org/en-US/docs/Web/API/Window/requestIdleCallback">requestIdleCallback</a> function to avoid any UX behavior changes.</p>
</li>
<li>
<p><strong>Tree Shaking</strong>: Yelp’s recent default Webpack settings enable dead code elimination. Looking into our bundle treemaps, we realized that tree shaking wasn’t working for some of our older packages because they were still using older build settings. So, a hunt began to figure out all such packages, and we ended up further reducing our bundle size by 30% by just upgrading their build.</p>
</li>
<li>
<p><strong>Replacing Packages with a Heavy Footprint</strong>: We identified a few packages being used infrequently in our code that occupied an unreasonable portion of our bundle. The primary example was <a href="https://momentjs.com/">moment.js</a>, which was used only twice but occupied 5% of the bundle. We were able to replace it with <a href="https://date-fns.org/">date-fns</a>, which is tree-shakeable. Fun fact: the moment.js project status itself now recommends using alternatives.</p>
</li>
<li>
<p><strong>Deduplicating Packages</strong>: We use Yarn for our dependency management, and (before Yarn V2) it didn’t deduplicate the packages with overlapping ranges. For our large apps, deduplication had a noticeable impact on our bundle sizes. Yarn V2 helped solve this problem for us.</p>
</li>
<li>
<p><strong>Reducing Component Re-rendering:</strong> <a href="https://reactjs.org/blog/2018/09/10/introducing-the-react-profiler.html">React profiler</a> identified that specific core page components such as the navigation bar were re-rendering wastefully during the page load. This re-rendering blocked the main thread and delayed FCP. We resolved this by adding <a href="https://reactjs.org/docs/react-api.html#reactmemo">memoization</a> on top of these components.</p>
</li>
</ol><h2 id="server-side-optimizations">Server-side Optimizations</h2><p>Yelp’s growing service architecture presented some interesting roadblocks. As the request traveled through multiple services (including a monolith), its lifecycle was complicated. For example, the page-load request went through 3 services and depended upon up to 5 downstream services for fetching data. The efforts listed below helped us bring down our request timings:</p><ol><li>
<p><strong>Removing Proxy Layers</strong>: All biz site requests were proxied through Yelp’s monolith because it handled authentication and authorization for logged-in business owners. This proxy was expensive. Earlier this year, we packaged up the authentication and authorization business logic into a reusable Python package. This optimization entailed integrating with that package, setting our service up to accept traffic directly from our routing layer, and rolling it out carefully via <a href="https://martinfowler.com/bliki/DarkLaunching.html">dark-launching</a>. It helped us save 250ms from our request time while also getting rid of legacy code.</p>
</li>
<li>
<p><strong>Parallelizing Network Calls:</strong> We rely on several downstream services for fetching data during page load. Zipkin helped us uncover that we had laid out some of our network calls in a blocking manner that slowed down the entire request. At Yelp, we use Futures built with <a href="https://github.com/Yelp/bravado">Bravado</a>, which allows us to send network requests concurrently. We rewrote the request code to fire off all the network requests at the top of the business logic and avoided starting any new network request later in the code. It helped us shave 300ms off our request timings. Since this issue can regress, we documented best practices for this behavior to help prevent regressions in the future.</p>
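<p>The pattern generalizes beyond Bravado: fire every request up front, then resolve the futures only when the results are needed. A sketch using the standard library’s <code class="highlighter-rouge">concurrent.futures</code> as a stand-in for Bravado’s futures (the <code class="highlighter-rouge">fetch</code> service call is hypothetical):</p>

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(service):
    """Stand-in for a downstream network call (hypothetical service)."""
    time.sleep(0.05)  # simulated network latency
    return f"{service}-data"

def handle_request_blocking(services):
    # Anti-pattern: each call blocks on the previous one finishing.
    return [fetch(s) for s in services]

def handle_request_parallel(services):
    # Fire off all requests at the top of the business logic...
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(fetch, s) for s in services]
        # ...and only block when the results are actually needed.
        return [f.result() for f in futures]
```

<p>With three downstream calls, the blocking version pays three round trips in sequence, while the parallel version pays roughly one.</p>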
</li>
<li>
<p><strong>Eliminating Redirects</strong>: Legacy pages, old flows, third-party blog posts, etc., contributed to redirects before the user landed on the final URL/page. These redirects added a few seconds in some cases for our mobile traffic. We documented all of the redirects using the <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Referer">HTTP Referer</a> header and tackled them accordingly.</p>
<div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/boosting-user-conversion-with-ux-performance-wins/redirects-table.png" class="c1" alt="image" /></div>
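<p>One simple way to prioritize which redirects to eliminate is to tally the <code class="highlighter-rouge">Referer</code> values that precede 3xx responses in access logs. A sketch over a hypothetical log shape (pairs of referer and status code; not Yelp’s actual log format):</p>

```python
from collections import Counter

def top_redirect_sources(log_records):
    """Rank referring URLs by how often they precede a redirect.

    `log_records` is an iterable of (referer, status_code) pairs
    pulled from access logs (hypothetical shape).
    """
    return Counter(
        referer
        for referer, status in log_records
        if 300 <= status < 400 and referer
    ).most_common()

records = [
    ("https://biz.yelp.com/old-flow", 302),
    ("https://biz.yelp.com/old-flow", 302),
    ("https://example-blog.com/post", 301),
    ("https://biz.yelp.com/ads", 200),  # not a redirect
]
```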
</li>
<li>
<p><strong>Server-side Rendering</strong>: Before this effort, our flow was rendered entirely client-side, i.e., we didn’t send any HTML in the request’s response. We only sent the JavaScript bundle and relied entirely on the browser and React app to generate HTML for serving the page’s content. We identified that this adversely affected our FCP, especially on mobile clients with limited CPU and memory. We already had a (React) component rendering service based on <a href="https://github.com/airbnb/hypernova">Hypernova</a> set up at Yelp. We integrated with that service and started rendering the first page’s markup from the server. We immediately saw significant benefits for all the clients. We transferred the rendering load to the server, as evident in the graphs below, but the rendering time took a steep drop and the net impact was lower FCP time. Also, long gone was our loading shimmer!</p>
<div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/boosting-user-conversion-with-ux-performance-wins/server-side-rendering.png" class="c1" alt="image" /></div>
</li>
<li>
<p><strong>Pre-warming Cache:</strong> We have a few computationally expensive tasks in our requests, such as building a category tree object by reading configurations from disk. We cached these objects in memory, but we identified that our higher latency P90 requests still suffered because they would always get a cache miss. We created an internal endpoint whose sole responsibility was to warm up all the caches and create expensive cacheable objects. We used a <a href="https://uwsgi-docs.readthedocs.io/en/latest/PythonDecorators.html#uwsgidecorators.postfork">uWSGI hook</a> that would be called every time a worker was created to make a call to this internal endpoint. It helped bring our P95s down by almost 2 seconds across all clients.</p>
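<p>The caching half of this idea can be sketched in plain Python (names are ours; the real setup caches far more than a toy category tree). Under uWSGI, <code class="highlighter-rouge">warm_caches</code> would be triggered from the post-fork hook so that every new worker pays the build cost before serving traffic:</p>

```python
import functools

@functools.lru_cache(maxsize=1)
def build_category_tree():
    """Expensive object normally built by reading configuration from
    disk; simulated here with a literal."""
    return {"restaurants": ["pizza", "sushi"], "services": ["plumbing"]}

# Every expensive cacheable computation registers itself here.
WARMUP_TASKS = [build_category_tree]

def warm_caches():
    """Run each expensive computation once so the first real request
    never pays the cache-miss cost. In production this would run in a
    uWSGI post-fork hook via an internal warmup endpoint."""
    for task in WARMUP_TASKS:
        task()

warm_caches()  # called once when a worker starts
```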
</li>
<li>
<p><strong>Vertical Scaling:</strong> Last but not least, we also tried deploying our service and its dependent services on highly performant <a href="https://aws.amazon.com/ec2/instance-types/z1d/">z1d.6xlarge EC2 instances</a>. We saw marginal improvements (up to 100ms) on page load timings, but some of the other computationally expensive AJAX APIs saw more significant gains. For example, our POST endpoint responsible for purchasing the products got 20% faster, leading to fewer timeouts.</p>
</li>
</ol><p>After four months of focused effort with a dedicated engineering team, we achieved results that made this investment worthwhile. It was not just a win for our conversion metrics, but also for our customers, who now experienced substantially faster loading pages.</p><p>The key results we achieved for our ads purchase flow:</p><ul><li>We reduced our P75 FCPs from 3.25s to 1.80s, a 45% improvement.</li>
<li>We reduced our P75 YPCs from 4.31s to 3.21s, a 25% improvement.</li>
<li>We saw up to 15% lift in our conversion rate.</li>
</ul><p>Below are a couple of graphs that show our progress over time:</p><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/boosting-user-conversion-with-ux-performance-wins/p75-fcp.png" class="c1" alt="image" /></div><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/boosting-user-conversion-with-ux-performance-wins/ypc-improvements.png" class="c1" alt="image" /></div><h2 id="acknowledgements">Acknowledgements</h2><ul><li>Shoutout to my teammates on this project: Thibault Ravera, Bobby Roeder, Frank She, Austin Tai, Yang Wang and Matt Wen.</li>
<li>Shoutout to Dennis Coldwell, Aaron Gurin, Blake Larkin and Alex Levy for technical review and editing.</li>
</ul><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/a6cfee89-2dd0-4451-bf52-746b9547dfb7/Software-Engineer-Full-Stack-Engineer-Remote?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/01/boosting-user-conversion-with-ux-performance-wins.html</link>
      <guid>https://engineeringblog.yelp.com/2021/01/boosting-user-conversion-with-ux-performance-wins.html</guid>
      <pubDate>Wed, 27 Jan 2021 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Whose Code is it Anyway?]]></title>
      <description><![CDATA[<p><a href="https://engineeringblog.yelp.com/2021/01/whose-code-is-it-anyway.html">Read “Whose Code is it Anyway?” on the Yelp Engineering Blog.</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2021/01/whose-code-is-it-anyway.html</link>
      <guid>https://engineeringblog.yelp.com/2021/01/whose-code-is-it-anyway.html</guid>
      <pubDate>Wed, 13 Jan 2021 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Now You See Me: How NICE and PDQ plots Uncover Model Behaviors Hidden by Partial Dependence Plots]]></title>
      <description><![CDATA[<p>Many machine learning (ML) practitioners use <a href="https://scikit-learn.org/stable/modules/partial_dependence.html">partial dependence plots</a> (PDP) to gain insights into model behaviors. But have you run into situations where PDPs average two groups with different behaviors and produce curves applicable to none? Are you longing for tools that help you understand detailed model behavior in a visually manageable way? Look no further! We are thrilled to share with you our newest model interpretation tools: the Nearby Individual Conditional Expectation plot and its companion, the Partial Dependence at Quantiles plot. They highlight local behaviors and hint at how much we may trust such readings.</p><h2 id="a-not-nice-world">A not NICE world</h2><p>At Yelp, we have ML models for personalized user and business owner <a href="https://engineeringblog.yelp.com/2018/05/scaling-collaborative-filtering-with-pyspark.html">recommendations</a>, the <a href="https://engineeringblog.yelp.com/2020/02/accelerating-retention-experiments-with-partially-observed-data.html">retention</a> of advertisers, <a href="https://engineeringblog.yelp.com/2019/12/architecting-wait-time-estimations.html">wait time prediction</a> in <a href="https://restaurants.yelp.com/products/waitlist-table-management-software/">Waitlist</a>, <a href="https://engineeringblog.yelp.com/2020/01/modernizing-ads-targeting-machine-learning-pipeline.html">ads targeting</a>, <a href="https://engineeringblog.yelp.com/2014/12/learning-to-rank-for-business-matching.html">business matching</a>, etc. Although the prediction quality is always one of the key priorities for any ML model, we also care deeply about the interpretability of the model. As ML practitioners, we often use model interpretation tools to do sanity checks on how a model is generalizing from the features. 
More importantly, exposing the “why” behind a model’s behavior to its consumers, who often are not ML practitioners, can give them confidence in its accuracy and generalizability, or lead them to deeper applications and better business decisions.</p><p>Since most of our models are complex in order to achieve better prediction quality, they can also be harder to decipher. One common question in model understanding is, “How do changes in a feature’s values relate to changes in the prediction?” Previously, we used the popular <a href="https://scikit-learn.org/stable/modules/partial_dependence.html">PDP</a> and the <a href="https://ieeexplore.ieee.org/abstract/document/5949423">sensitivity plot</a> to take snapshots from the model that are easy for a human to understand. A PDP shows how predictions change, on average, when varying a single feature<sup><a href="https://engineeringblog.yelp.com#footnote1">1</a></sup> over its values (e.g., min to max) and holding the other features constant. PDPs can answer questions like, “What would users’ wait time be respectively if the local temperature were 30°F, 50°F, and 70°F?” In contrast, a sensitivity plot varies a single feature relatively (e.g., -15% to +15%) while holding the other features constant. Sensitivity plots can answer questions like, “If the weather had been 10% warmer for these days (temperatures on these days in general are different), how would wait time estimates have changed?”</p><p>However, these tools are not without some limitations. First of all, both PDP and sensitivity plots operate at an aggregated level, meaning that we average all the data points to achieve one single curve. This aggregation may hide differences in various subpopulations. For example, when creating either a sensitivity plot or a PDP, we could imagine the prediction goes up in half of the population and goes down in the other half when we increase a feature. 
When we average the two halves together, we may falsely conclude that the feature has no marginal contribution to the prediction. Secondly, when drawing PDPs or sensitivity plots over sparse data regions, both plots become untrustworthy. For example, if we only have a few restaurants open at 0°F, then it’s usually unwise to generalize from a PDP for wait times around such low temperatures.</p><p>To address these concerns, we came up with two new tools: the Nearby Individual Conditional Expectation (NICE) plot and its companion the Partial Dependence at Quantiles (PDQ) plot. Instead of the aggregate effect, the NICE plot individually draws changes in predictions due to local perturbations on top of the scatter plot between feature values and corresponding predictions. The PDQ plot helps to summarize the heterogeneity in the NICE plot by aggregating partial dependence at different quantiles of predictions. In practice, we often need to review the PDQ plot when we have difficulties in figuring out the general patterns in a NICE plot.</p><h2 id="what-is-the-nice-plot">What is the NICE plot?</h2><p>NICE plots examine the <a href="https://www.tandfonline.com/doi/abs/10.1080/10618600.2014.907095">Individual Conditional Expectation</a> in the neighborhood of the original feature values. Below is an example plot from one feature in one of our retention models.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-12-17-now-you-see-me-how-nice-and-pdq-plots-uncover-model-behaviors-hidden-by-partial-dependence-plots/fig1-nice.png" alt="Note: This graph contains 1000 data points and each blue line consists of 7 points." /><p class="subtle-text"><small>Note: This graph contains 1000 data points and each blue line consists of 7 points.</small></p></div><p>We made this NICE plot using the following algorithm:</p><ol><li>Select a random sample of data points (if your dataset is large).</li>
<li>Make a scatter plot of feature values and model predictions (the black dots).</li>
<li>Make nearby perturbations about each feature value (e.g., <code class="highlighter-rouge">lower_bound</code> = 0.9 * <code class="highlighter-rouge">feature_value</code> and <code class="highlighter-rouge">upper_bound</code> = 1.1 * <code class="highlighter-rouge">feature_value</code>) and evenly sample N points within the bounds (we recommend N to be odd so the original feature value is included).</li>
<li>Record their corresponding perturbed predictions.</li>
<li>Draw lines between the N points and corresponding predictions on the scatter plot (the blue lines).</li>
</ol><p>A NICE plot foremost shows the bivariate distribution between feature values and their corresponding predictions. Therefore, it is straightforward to observe the sparsity between the two. In the above example, the model rarely gives a low prediction when the feature is smaller than 4 (exhibited by the white space on the bottom left corner) and it rarely gives a high prediction when the feature value is roughly smaller than 1 (illustrated by the white triangle-like shape on the top left corner).</p><p>More importantly, this plot only examines marginal effects in the neighborhood of each observed data point, which helps to show heterogeneous effects and may hint at interaction effects. In the above graph, the marginal effect goes up and then goes down when the feature value is in the range of 0 to 1. Starting from 1, the effect is positive and large in magnitude until the feature value reaches 2. In the range 2 to 4, we observe some heterogeneous effects: some lines are downward sloping while others are flat, and the flat ones are observed more often when the prediction gets larger. When the feature value is greater than 6, all the NICE lines are flat throughout the region.</p><p>On the other hand, the information in the PDP and the sensitivity plot of the same feature lacks many details.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-12-17-now-you-see-me-how-nice-and-pdq-plots-uncover-model-behaviors-hidden-by-partial-dependence-plots/fig2-pdp-sens.png" alt="Note: the y-axes in these figures have narrower ranges than the previous NICE plot because of aggregation." 
/><p class="subtle-text"><small>Note: the y-axes in these figures have narrower ranges than the previous NICE plot because of aggregation.</small></p></div><p>The PDP (left) correctly captures the most significant inverted V-shape structure when the feature is smaller than 4 and the flat shape afterwards, but it loses some subtleties contained in the V-shape. The sensitivity plot (right) is misleading. From it, you may conclude that tweaking the feature would yield a single-peaked relationship with the peak at -20% of the feature value, which is true only in aggregate. From the NICE plot, we can see this relationship fails to hold for most, if not all, individual data points: the marginal effects are flat when the feature values are greater than 4, and have the “wrong” shape for samples in the valley near one.</p><p>When trying to apply the above algorithm to binary or categorical features, we cannot make nearby perturbations and have to examine the change from one value to another. Below is one example from our system.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-12-17-now-you-see-me-how-nice-and-pdq-plots-uncover-model-behaviors-hidden-by-partial-dependence-plots/fig3-nice.png" alt="Note: we apply jitter to the feature values to make the density easier to see." /><p class="subtle-text"><small>Note: we apply jitter to the feature values to make the density easier to see.</small></p></div><p>As one can see, the NICE plot is still useful to demonstrate heterogeneous effects. In the above figure, the lines at the top of the figure (corresponding to high prediction values) are flatter. But when the predictions get smaller, the lines get steeper. 
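The NICE construction described above (scatter observed points, perturb each one's feature value by a small relative amount, and re-predict) can be sketched as a small helper. This is a minimal illustration under assumed interfaces (`predict_fn`, a plain NumPy array for `X`), not Yelp's actual tooling:

```python
import numpy as np

def nice_curves(predict_fn, X, feature_idx, n_perturb=7, rel=0.1):
    """Compute NICE-plot line segments for each row of X.

    predict_fn maps an (n, d) array to n predictions; X is an (n, d) array.
    Returns a list of (grid, predictions) pairs, one per data point, ready
    to be drawn as lines on top of the feature-vs-prediction scatter.
    """
    curves = []
    for row in X:
        v = row[feature_idx]
        # an odd n_perturb keeps the original feature value on the grid
        grid = np.linspace((1 - rel) * v, (1 + rel) * v, n_perturb)
        perturbed = np.tile(row, (n_perturb, 1))  # copies, so mutation is safe
        perturbed[:, feature_idx] = grid
        curves.append((grid, predict_fn(perturbed)))
    return curves
```

Plotting each returned pair as a short line segment over the scatter of observed (feature value, prediction) points reproduces the structure of the figures above.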
For comparison, below is the PDP of the same feature:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-12-17-now-you-see-me-how-nice-and-pdq-plots-uncover-model-behaviors-hidden-by-partial-dependence-plots/fig4-pdp.png" alt="Note: the y-axis in this figure has a narrower range than the previous NICE plot because of aggregation." /><p class="subtle-text"><small>Note: the y-axis in this figure has a narrower range than the previous NICE plot because of aggregation.</small></p></div><p>Clearly, the PDP manages to capture the aggregated trend, but misses the differences in marginal effects when the predicted values are different. Using this plot one cannot see the heterogeneity in these marginal effects.</p><p>The structures contained in a NICE plot, however, may be both a blessing and a curse. When the model has complex interaction effects, it is hard for a human to decipher all the subtleties from the numerous dots and lines in a NICE plot. To mitigate this issue, we developed a companion tool: the PDQ plot.</p><h2 id="what-is-the-pdq-plot">What is the PDQ plot?</h2><p>A PDQ plot is a variation of the conventional PDP. It stands on the middle ground between the fully local NICE plot and fully global PDP. It plots the partial dependence conditional on some pre-specified quantiles of the predicted values, which helps to simplify the heterogeneity and emphasizes the major structures in a NICE plot. Here is the PDQ of the first NICE plot in this article:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-12-17-now-you-see-me-how-nice-and-pdq-plots-uncover-model-behaviors-hidden-by-partial-dependence-plots/fig5-pdq.png" alt="" /></div><p>We made this PDQ plot using the following algorithm:</p><ol><li>Select the quantile values to be drawn. Our default values are 0.05, 0.25, 0.5, 0.75, and 0.95.</li>
<li>For each quantile, find data points whose predictions are close to the desired quantile (e.g., the desired quantile ± 0.001).<sup><a href="https://engineeringblog.yelp.com#footnote2">2</a></sup></li>
<li>Again for each quantile, generate and plot partial dependencies using only those samples.<sup><a href="https://engineeringblog.yelp.com#footnote3">3</a></sup></li>
</ol><p>From the plot, we can easily identify a sharp inverted V-shape structure when the predicted value is small, but this non-monotonic effect gradually flattens out as the prediction increases. When the prediction is sufficiently large (starting from the 0.75 quantile), we do not see a significant drop after the initial rise.</p><p>In practice, one can use the corresponding PDQ plot to help make sense of the NICE plot. For example, by just inspecting the NICE plot, it may be unclear to some readers that the non-monotonic effect gradually flattens as the prediction increases. Indeed, a lot is going on in a small region. But after observing the PDQ plot, one can go back and re-examine the NICE plot.</p><p>If PDQ plots can represent the information contained in NICE plots in a concise fashion, why don’t we solely rely on them? Firstly, PDQ plots still need to aggregate some data. Therefore, it is difficult, if not impossible, to differentiate a mix shift from an inherent behavior change of the model by just examining a PDQ plot. For example, you might think that the gradually flattened V-shape structure is because the negative marginal effects are less steep when the predictions are higher, which can be ruled out with the help of the corresponding NICE plot.</p><p>Moreover, PDQ plots have a data sparsity issue. We cannot observe the bivariate distribution in PDQ plots. Therefore, in some regions we do not have many, if any, data points. The following two figures constitute a good example.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-12-17-now-you-see-me-how-nice-and-pdq-plots-uncover-model-behaviors-hidden-by-partial-dependence-plots/fig6-nice-pdq.png" alt="" /></div><p>From the PDQ plot, it is very tempting to conclude that for q=0.95 the effect gradually flattens out once the feature value is greater than 50. 
However, we can see that there are almost no data points with a feature value greater than 50 and a very high prediction. Therefore, it is probably unjustified to assume such a relationship exists in that region.</p><p>Finally, why do PDQ plots work in practice? We have repeatedly observed that the patterns in NICE plots can be roughly grouped by the predicted values. This is probably because samples that produce similar predictions are similar for the purpose of a specific prediction task. Therefore, these samples are more likely to share a common marginal effect.</p><h2 id="conclusion">Conclusion</h2><p>A NICE plot is an individual conditional expectation plot restricted to feature values near the observed ones. It shows how the model would behave if we perturb a feature near its observed values while keeping all other features fixed. Reading a NICE plot can also tell us how much we can trust such behaviors because the plot contains information about data sparsity.</p><p>The PDQ plot helps to summarize the heterogeneity in the NICE plot by grouping partial dependence at different quantiles. We typically consult the corresponding PDQ plot when we have difficulties in figuring out the general patterns in a NICE plot. PDQ works because, in a given prediction task, data points with similar predictions behave more similarly to one another than points with dissimilar predictions do.</p><h2 id="acknowledgements">Acknowledgements</h2><p>The original idea of the NICE plot belongs to Jeffrey Seifried. 
Blake Larkin, Nelson Lee, Eric Liu, Jeffrey Seifried, Vishnu Purushothaman Sreenivasan, and Ning Xu (ordered alphabetically) helped read through earlier versions and made helpful comments.</p><h3 id="notes">Notes</h3><p><a name="footnote1" id="footnote1">1</a>: We can vary multiple features and draw a multivariate PDP, but the interpretation gets very difficult past two features!</p><p><a name="footnote2" id="footnote2">2</a>: To reduce noise, we do not just select a handful of data points exactly at the pre-defined quantile. In general, samples with different predictions may behave differently in terms of their marginal effects, and you don’t want to be fooled by a tiny sample.</p><p><a name="footnote3" id="footnote3">3</a>: You can check <a href="https://github.com/scikit-learn/scikit-learn/blob/master/examples/inspection/plot_partial_dependence.py">scikit-learn’s implementation</a> if you need help computing partial dependencies.</p><div class="island job-posting"><h3>Become an Applied Scientist at Yelp!</h3><p>Are you intrigued by data? Uncover insights and carry out ideas through statistical and predictive models.</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/d0c0d643-2e39-4eb5-81a6-7e56b517f777/Applied-Scientist?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/12/now-you-see-me-how-nice-and-pdq-plots-uncover-model-behaviors-hidden-by-partial-dependence-plots.html</link>
      <guid>https://engineeringblog.yelp.com/2020/12/now-you-see-me-how-nice-and-pdq-plots-uncover-model-behaviors-hidden-by-partial-dependence-plots.html</guid>
      <pubDate>Thu, 17 Dec 2020 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Orchestrating Cassandra on Kubernetes with Operators]]></title>
      <description><![CDATA[<p><a href="https://engineeringblog.yelp.com/2020/11/orchestrating-cassandra-on-kubernetes-with-operators.html">Orchestrating Cassandra on Kubernetes with Operators</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/11/orchestrating-cassandra-on-kubernetes-with-operators.html</link>
      <guid>https://engineeringblog.yelp.com/2020/11/orchestrating-cassandra-on-kubernetes-with-operators.html</guid>
      <pubDate>Mon, 16 Nov 2020 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Tales of a Mobile Developer on Consumer Growth]]></title>
      <description><![CDATA[<p><a href="https://engineeringblog.yelp.com/2020/11/tales-of-a-mobile-developer-on-consumer-growth.html">Tales of a Mobile Developer on Consumer Growth</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/11/tales-of-a-mobile-developer-on-consumer-growth.html</link>
      <guid>https://engineeringblog.yelp.com/2020/11/tales-of-a-mobile-developer-on-consumer-growth.html</guid>
      <pubDate>Fri, 13 Nov 2020 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Minimizing read-write MySQL downtime]]></title>
      <description><![CDATA[<p>The relational database of choice at Yelp is MySQL and it powers much of the Yelp app and yelp.com. MySQL does not include a native high-availability solution for the replacement of a primary server, which is a single point of failure. This is a tradeoff of its dedication to ensuring consistency. Replacing a primary server is sometimes necessary due to planned or unplanned events, like an operating system upgrade, a database crash, or hardware failure. This requires pausing data modifications to the database while the server is restarted or replaced and can mean minutes of downtime. Pausing data modifications means that our users can’t perform actions like writing reviews or messaging a home service professional, so this downtime must be kept as short as possible. This post details how Yelp has integrated open-source tools to provide advanced MySQL failure detection and execute automated recoveries to minimize the downtime of our read-write MySQL traffic.</p><h2 id="characteristics-of-mysql-infrastructure-at-yelp">Characteristics of MySQL infrastructure at Yelp</h2><p>Our MySQL infrastructure is made up of:</p><ul><li>Hundreds of thousands of queries per second from HTTP services and batch workloads (lots of low latency, user facing web traffic!)</li>
<li>Applications connect to MySQL servers through a layer 7 proxy, open-source ProxySQL</li>
<li>MySQL clusters have a single primary and use asynchronous replication. Most deployments span geographically sparse data centers (we love scaling with MySQL replicas!)</li>
<li>ZooKeeper based service discovery system, used for applications to discover proxies and proxies to discover MySQL databases</li>
<li>Open-source Orchestrator deployed to multiple datacenters in raft consensus mode for high availability and failure detection of MySQL servers</li>
</ul><p>MySQL primary replacements are performed due to MySQL crashes, hardware failure and maintenance (hardware, operating system, MySQL upgrades). For unplanned failures, Orchestrator detects the failure and initiates the recovery procedure. For planned server upgrades, an on-call engineer can invoke Orchestrator’s primary replacement procedure.</p><p>We are able to minimize MySQL downtime when replacing a MySQL primary because:</p><ul><li>MySQL clients (applications) remain connected to a proxy tier</li>
<li>Orchestrator detects failure within seconds, then initiates MySQL specific recoveries and elects a new primary server</li>
<li>the new primary server indicates to the service discovery system that it is the primary for a set of databases</li>
<li>the proxy tier watches for the update to the service discovery system and adds the identity of the new primary server to its configuration</li>
</ul><p>When the proxy tier has discovered the new primary server, the replacement is complete and applications are again able to write data to the database.</p><p>This procedure is completed in seconds!</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-11-09-minimizing-read-write-mysql-downtime/2020-11-09-minimizing-mysql-downtime-diagram.png" alt="" /></div><p>A closer look at how everything fits together:</p><ul><li>Individual components store and consume data in ZooKeeper, storing their own identities (IP addresses) and reading the identities of other components</li>
<li>Applications establish connections to ProxySQL and issue queries</li>
<li>ProxySQL maintains a connection pool to each MySQL server, and proxies client connections to connections in its pool</li>
<li>Orchestrator maintains a connection pool to each MySQL server, constantly performing health checks and is ready to initiate a failure recovery when necessary</li>
</ul><h2 id="proxysql-as-a-highly-available-proxy-layer">ProxySQL as a highly available proxy layer</h2><p>ProxySQL is a high performance, high availability, protocol aware proxy for MySQL. We love ProxySQL because it limits the number of MySQL connections to our MySQL servers and it permits us to replace MySQL servers without requiring applications to re-establish their database connections.</p><h3 id="deployment">Deployment</h3><p>We deploy ProxySQL using AWS Auto-scaling groups and AWS EC2. We configure these servers to run ProxySQL after powering on, using Puppet, and since they are relatively stateless we are able to add or replace ProxySQL capacity very quickly, in less than 10 minutes.</p><h3 id="configuring-proxysql-to-route-to-mysql-backends">Configuring ProxySQL to route to MySQL backends</h3><p>We use ProxySQL’s hostgroup functionality to group MySQL servers into tuples of (MySQL schema, MySQL role), where MySQL schema is one of our vertical shards to isolate workloads and MySQL role is one of {primary, replica, reporting replica} to isolate read/write, read only, and non-user facing read traffic respectively. A single MySQL user maps uniquely to a hostgroup, which means that an application only needs to present a username and password to ProxySQL to be routed and load balanced to the proper database and database role.</p><p>Each ProxySQL server must be configured with the set of available MySQL servers and continue to stay up to date as MySQL capacity is added, replaced, or when hosts transition between MySQL roles and therefore hostgroups. On a several minute interval, a script runs on each ProxySQL server to read the available MySQL servers and their roles from our ZooKeeper based service discovery system and load them into ProxySQL’s configuration as hostgroups. 
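A minimal sketch of that periodic ZooKeeper-to-ProxySQL sync, with hypothetical names for the interfaces (`set_hostgroup`, the shape of the discovered mapping) — the real script at Yelp does considerably more:

```python
def sync_hostgroups(discovered, proxysql):
    """Load MySQL servers from service discovery into ProxySQL hostgroups.

    discovered: mapping of (schema, role) -> list of MySQL host addresses,
    as read from ZooKeeper. Refuses to act on suspicious data.
    """
    if not discovered:
        # a service discovery outage must not trigger a mass-removal of backends
        raise RuntimeError("service discovery returned no servers; refusing to sync")
    for (schema, role), hosts in discovered.items():
        if role == "primary" and len(hosts) != 1:
            # exactly one primary per cluster guards against split-brain
            raise RuntimeError(
                f"{schema}: expected exactly one primary, saw {len(hosts)}"
            )
    for (schema, role), hosts in discovered.items():
        # idempotent: re-applying an unchanged mapping is a no-op
        proxysql.set_hostgroup(schema, role, hosts)
```

The two guard clauses mirror the verifications the post describes: refusing a mass-removal when discovery looks empty, and refusing to act when more than one server claims the primary role.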
This script is idempotent and also contains important verification functionality, such as preventing a mass-removal of MySQL servers if an outage of the service discovery system is detected or ensuring that only one server exists in the “primary” hostgroup for each cluster. The latter verification method is a key component of ensuring that our primary failover system is safe in the face of network partitions.</p><h3 id="applications-connecting-to-proxysql">Applications connecting to ProxySQL</h3><p>Just as MySQL servers register into service discovery so that they can be discovered by ProxySQL servers, ProxySQL servers register into the same system so that applications are able to discover and connect to them. Applications read the identity of ProxySQL servers from service discovery and supply a username and password deployed with the application to initiate their MySQL connections.</p><h2 id="service-discovery">Service Discovery</h2><p>At Yelp, the data plane of our service discovery system consists of a daemon on each server that performs HTTP or TCP healthchecks on a service, and if the service is healthy, stores information including the IP address and port of the service in ZooKeeper. If a service fails to respond successfully to its healthcheck, this daemon will remove the state of the failing service instance. A separate daemon is responsible for reading the state in ZooKeeper and proxying requests through the service mesh.</p><h3 id="mysql-registration-and-healthcheck">MySQL registration and healthcheck</h3><p>MySQL servers are grouped by (MySQL schema, MySQL role) where MySQL role is a value in {primary, replica, reporting replica}. Both the MySQL schema and MySQL role values are represented as files on disk of each MySQL server. 
These files are understood by the process that performs health checks and are used to represent the (MySQL schema, MySQL role) groupings in ZooKeeper.</p><p>Our health check for the MySQL replica services is more thorough than only verifying that the MySQL port is open since these servers are running stateful workloads that require significant configuration. Before a MySQL replica is deemed to be healthy, it must pass all of the monitoring checks defined using our monitoring framework. To accommodate this, an HTTP service is deployed on each MySQL server to provide an HTTP health check endpoint which verifies that the server has passed all of its monitoring checks before the MySQL process is considered healthy. Some examples of these monitoring checks are:</p><ul><li>The server restored from backup successfully</li>
<li>The server is replicating and is caught up to real time</li>
<li>The server is “warmed” by streaming a MySQL buffer pool from another server in the cluster and loading it into its own buffer pool</li>
</ul><h3 id="proxysql-healthcheck">ProxySQL healthcheck</h3><p>Because ProxySQL servers are lightweight and almost completely stateless, a ProxySQL server is considered healthy as long as it is listening for TCP connections on the defined ProxySQL port. After the ProxySQL process is launched and begins listening for TCP connections, it passes its health check and is discoverable by applications.</p><h2 id="orchestrator-driven-failure-recovery">Orchestrator driven Failure Recovery</h2><p>Orchestrator is an open source MySQL high availability and replication management tool that provides failure detection and automated recovery of MySQL servers. We deploy Orchestrator using its distributed raft mode in order to have the service be highly available and to provide improved failure detection of MySQL servers. Orchestrator’s failure recovery features solve the single point of failure presented with a single primary MySQL configuration mentioned earlier in this post.</p><p>Upon detecting a failure of a MySQL server, the multiple orchestrator instances running in raft mode will seek consensus of the identified failure, and if a quorum of instances agree, a recovery will proceed.</p><p>If the failed server is a replica and is a replication source for other replicas, Orchestrator will ensure that these replicas are re-configured to replicate from a healthy replication source. If the failed server is a primary, Orchestrator will proceed to set the failed primary to read-only mode (MySQL variable @@read_only=1), identify a candidate to be promoted to primary, re-configure replicas of the former primary to replicate from the candidate primary, and set the candidate primary to read-write mode (@@read_only=0). 
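The recovery sequence just described can be sketched as follows. The orchestration interface here (`set_read_only`, `pick_candidate`, `repoint_replication`, `set_read_write`) is hypothetical, not Orchestrator's actual API, and the real recovery handles many more edge cases:

```python
def recover_failed_primary(orchestrator, cluster):
    """Sketch of the primary-replacement flow for a single-primary cluster."""
    failed = cluster.primary
    # fence off the old primary first so no new writes can land on it
    orchestrator.set_read_only(failed)          # i.e., @@read_only = 1
    candidate = orchestrator.pick_candidate(cluster.replicas)
    for replica in cluster.replicas:
        if replica != candidate:
            # repoint the remaining replicas at the promoted server
            orchestrator.repoint_replication(replica, source=candidate)
    # only once replication is re-established does the candidate accept writes
    orchestrator.set_read_write(candidate)      # i.e., @@read_only = 0
    return candidate
```

The ordering matters: setting the old primary read-only before promoting the candidate is what keeps a partially failed recovery from leaving two writable primaries.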
Orchestrator handles the MySQL specific changes for replacing a primary server and allows definitions of “failover hooks” to run custom-defined commands during different phases of the recovery process.</p><h3 id="primary-failover-hooks">Primary Failover Hooks</h3><p>Orchestrator performs the MySQL specific part of the failover, but there are still other changes required, such as updating the on-disk file that represents a server’s MySQL role for the service discovery system. An HTTP service exists on each MySQL server in order to support this, and failover hooks are configured to send an HTTP request to both the former and newly promoted primaries to update their MySQL role. After this hook executes, the service discovery daemon will notice that the MySQL role of the promoted primary has changed and will update the identity of the primary server in ZooKeeper.</p><p>As mentioned earlier, each ProxySQL server runs a script on a several-minute interval which reads the MySQL service discovery state in ZooKeeper and ingests this data into ProxySQL’s configuration. In order to reduce the recovery time after a primary failover, a separate process runs on ProxySQL servers to watch the identities of MySQL primaries in ZooKeeper and to initiate the previous process immediately when a change is noticed.</p><h2 id="perspective-of-a-mysql-client-during-a-primary-failover">Perspective of a MySQL client during a primary failover</h2><p>After Orchestrator issues <code class="highlighter-rouge">set @@read_only=1</code> on the former primary, clients will see INSERT/UPDATE/DELETE queries fail. These failures will persist until ProxySQL has updated its hostgroup configuration to replace the failed primary with the promoted one. Neither applications nor ProxySQL need to create new TCP connections – clients remain connected to the same ProxySQL server, and each ProxySQL server already has an existing pool of connections to the promoted primary because it previously existed as a replica. 
After modifying its hostgroup configuration, a ProxySQL server is able to route MySQL traffic to the new primary.</p><h2 id="special-cases--network-partitioning-and-avoiding-split-brain">Special cases: network partitioning and avoiding split-brain</h2><p>This failure recovery system is carefully designed to make the right decision in failure scenarios caused by a network partition. A partial or incorrect failure recovery due to a network partition has the potential to leave the system with multiple primary hosts, each believing it is the primary, resulting in a divergence of the dataset known as “split-brain”. It is very difficult to repair a split-brain scenario, so we have several components in this system to help prevent it.</p><p>One mechanism to prevent the possibility of split-brain is validation in the logic which transforms the service discovery data in ZooKeeper into ProxySQL’s hostgroup configurations. If there is more than one primary registered in ZooKeeper, the script will refuse to make changes to the hostgroup configurations and will emit an alert to page an on-call responder who can inspect and appropriately remediate the situation.</p><p>We also set Orchestrator’s PreventCrossDataCenterMasterFailover value to true so that Orchestrator will never elect a new MySQL primary in a separate datacenter. 
We use this setting because we would not want to change the datacenter of a MySQL cluster’s primary without considerable planning and because it reduces the surface area of potential network partition scenarios that could result in split-brain.</p><h2 id="conclusions">Conclusions</h2><p>Thanks to these systems, we are able to quickly recover from MySQL failures and maximize the availability of Yelp for our users, ensuring a smooth user experience.</p><div class="island job-posting"><h3>Become a Database Reliability Engineer at Yelp</h3><p>Want to help make our databases even more reliable?</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/d88f19a8-f38a-4ceb-917d-d9d5a8ba0cc6/Senior-Software-Engineer-Database-Reliability-Engineering-NoSQL?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/11/minimizing-read-write-mysql-downtime.html</link>
      <guid>https://engineeringblog.yelp.com/2020/11/minimizing-read-write-mysql-downtime.html</guid>
      <pubDate>Mon, 09 Nov 2020 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Introducing Folium: Enabling Reproducible Notebooks at Yelp]]></title>
      <description><![CDATA[<p>Jupyter notebooks are a key tool that powers data work at Yelp. They allow us to do ad hoc development interactively and analyze data with visualization support. As a result, we rely on Jupyter to build models, create features, run Spark jobs for big data analysis, etc. Since notebooks play a crucial role in our business processes, it is really important for us to ensure the notebook output is reproducible. In this blog post, we’ll introduce our notebook archive and sharing service called Folium and its key integrations with our Jupyterhub that enable notebook reproducibility and improve ML engineering developer velocity.</p><h2 id="folium-for-notebook-archiving--sharing">Folium for Notebook Archiving &amp; Sharing</h2><p>There are a few ways to archive and share notebooks (e.g., exporting to HTML, saving .ipynb files in GitHub, using shared network drives). There are also some other higher-level frameworks for notebook archiving, but these frameworks lacked integration with Jupyterhub, searchability, and the additional customizations presented in this post.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-10-21-introducing-folium-enabling-reproducible-notebooks-at-yelp/fig-1-folium-and-jupyterhub.png" alt="Figure 1. Folium and Jupyterhub" /><p class="subtle-text"><small>Figure 1. Folium and Jupyterhub</small></p></div><p>Folium is a basic front-end service that also has APIs that interact with our Jupyterhub. These APIs enable uploading a notebook after developing it. While uploading a notebook, the user is prompted for tags (e.g., project name, ticket) and a description fetched from the notebook automatically. The front-end service provides the ability to search for notebooks by user, tag, or the documentation within the notebooks. It also renders the notebooks in the webpage, including the different notebook versions (more on this later!) 
and extracts a table of contents by extracting markdown in the notebook.</p><p>The functionality described above laid the basic foundation of notebook archiving and sharing, but we built several additional features that we want to share on helping with reproducibility of notebooks:</p><ul><li>The notebook running environment is logged so that we can easily reproduce the output.</li>
<li>Versions of the same notebooks are grouped together to easily compare their differences.</li>
<li>The shared notebooks can be directly imported into a Jupyter server so that people can easily reproduce or improve on the existing notebooks.</li>
<li>Variables can be adjusted and existing notebooks rerun directly from Folium, without going to Jupyterhub.</li>
<li>A tagging system allows searching for and grouping related notebooks.</li>
</ul><p>We will talk about each function in more detail in the following sections.</p><h2 id="logged-notebook-running-environment">Logged Notebook Running Environment</h2><p>We have a Jupyterlab extension installed on our Jupyterhub that takes care of import/export functionality to Folium. When exporting to Folium, the extension gathers the running environment from the current notebook server, so that the key information is also logged into the notebook’s metadata. Currently we log which Docker image and kernel are being used, so that when re-running the notebook we are able to choose the correct working environment. We also log the memory and CPU/GPU resources used so that users can pick the correct amount of resources to re-run the notebook. Different tasks have different needs: some require more computational power, while others require more memory. Without knowing how resources are used by existing notebooks, we would likely hit out-of-memory issues when rerunning them.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-10-21-introducing-folium-enabling-reproducible-notebooks-at-yelp/fig-2-basic-notebook-information.png" alt="Figure 2. Basic Notebook Information" /><p class="subtle-text"><small>Figure 2. Basic Notebook Information</small></p></div><h2 id="import-notebook-from-folium-to-jupyterhub">Import Notebook from Folium to Jupyterhub</h2><p>The same Jupyterlab extension mentioned above also allows us to import notebooks directly from Folium via its APIs. People can search and preview all the available Folium notebooks, and directly import them into Jupyterhub. We regularly use this function for collaboration and for improving on old models.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-10-21-introducing-folium-enabling-reproducible-notebooks-at-yelp/fig-3-search-folium-notebook-archive-and-import-within-jupyterhub.png" alt="Figure 3. 
Search Folium’s notebook archive and import within Jupyterhub" /><p class="subtle-text"><small>Figure 3. Search Folium’s notebook archive and import within Jupyterhub</small></p></div><h2 id="grouping-of-different-versions-of-notebooks">Grouping of Different Versions of Notebooks</h2><p>Often an analysis is valuable enough that it needs to be repeated, which means a user will upload multiple similar notebooks. When this happens, we group the similar notebooks together on the same page, so we can directly compare the results of different versions. We also use this feature to provide tutorials, where the question and the answer sit on the same page for people to learn by themselves. In addition, we link the related code review to the notebook in Folium, so that people can easily refer to the feedback on it.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-10-21-introducing-folium-enabling-reproducible-notebooks-at-yelp/fig-4-different-versions-of-notebooks-linked-together-and-related-links-are-also-highlighted.png" alt="Figure 4. Different versions of notebooks linked together and related links are also highlighted." /><p class="subtle-text"><small>Figure 4. Different versions of notebooks linked together and related links are also highlighted.</small></p></div><h2 id="parametrized-notebooks">Parametrized Notebooks</h2><p>Besides importing a notebook into Jupyterhub, users can also rerun a notebook with different parameters directly in Folium. This helps us reuse notebooks and quickly get results for similar analyses.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-10-21-introducing-folium-enabling-reproducible-notebooks-at-yelp/fig-5-substitute-variables-and-rerun-notebooks-from-folium.png" alt="Figure 5. 
Substitute variables and rerun notebooks from Folium" /><p class="subtle-text"><small>Figure 5. Substitute variables and rerun notebooks from Folium</small></p></div><p>Search is also key to reusing notebooks. Without search integration, we would end up with lots of similar notebooks being recreated. This is exactly the issue we saw before improving the tagging and search system on Folium: people constantly recreated the same notebooks because they were unaware that similar notebooks could easily be imported and reused. We fixed the issue by automatically fetching the markdown from notebooks to generate descriptions that help users search for specific notebooks. Free-form tagging is also supported and widely used by teams to tag the notebooks they own or to group notebooks related to specific projects.</p><p>The Folium web service has a simple search results page (SERP) with filtering by tag and user. The search API supporting the SERP is also leveraged for searching in the sidebar from Jupyter, as shown in Figure 3.</p><h2 id="future-work">Future Work</h2><p>Folium not only helps us share code, but also helps us reuse existing notebooks to accelerate our daily work! On the roadmap, we are looking to continuously improve it by providing the ability to review notebooks, including a view of diffs and commenting. We are also adding more ways to get re-run notebooks delivered, including the option of emailed reports.</p><h2 id="acknowledgements">Acknowledgements</h2><p>Thanks to the Core ML team for building and continuously improving our Jupyter and Folium infrastructure, and thanks to Blake Larkin, Ayush Sharma, Shuting Xi, and Jason Sleight for editing the blog post.</p><div class="island job-posting"><h3>Become an ML Platform Engineer at Yelp</h3><p>Interested in designing, building, and deploying ML infrastructure systems? 
Apply to become an ML Platform Engineer today.</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/53b90eff-b187-483b-969c-847cb332fb6d/ML-Platform-Engineer-Remote?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/10/introducing-folium-enabling-reproducible-notebooks-at-yelp.html</link>
      <guid>https://engineeringblog.yelp.com/2020/10/introducing-folium-enabling-reproducible-notebooks-at-yelp.html</guid>
      <pubDate>Wed, 21 Oct 2020 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Flink on PaaSTA: Yelp’s new stream processing platform runs on Kubernetes]]></title>
      <description><![CDATA[<p>At Yelp we process terabytes of streaming data a day using <a href="https://flink.apache.org/">Apache Flink</a> to power a wide range of applications: ETL pipelines, push notifications, bot filtering, sessionization and more. We run hundreds and hundreds of Flink jobs, and without the right degree of automation, routine operations like deployments, restarts, and <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/state/savepoints.html">savepoints</a> would take thousands of hours of developers’ time. The latest addition to our toolshed is a new stream processing platform built on top of <a href="https://engineeringblog.yelp.com/2015/11/introducing-paasta-an-open-platform-as-a-service.html">PaaSTA</a>, Yelp’s Platform As A Service. Sitting at its core, a <a href="https://kubernetes.io/">Kubernetes</a> <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/operator/">operator</a> automatically watches over the deployment and the lifecycle of our fleet of Flink clusters.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-10-14-flink-on-paasta/logos.png" alt="Flink on PaaSTA on Kubernetes" /><p class="subtle-text"><small>Flink on PaaSTA on Kubernetes</small></p></div><h2 id="life-before-kubernetes">Life before Kubernetes</h2><p>Before the introduction of Kubernetes at Yelp, Flink workloads ran on dedicated AWS <a href="https://aws.amazon.com/emr/">ElasticMapReduce</a> (EMR) clusters, which come with both Flink and <a href="https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html">YARN</a> pre-installed. 
In order to make EMR instances work well with the rest of the Yelp ecosystem, our previous stream processing platform Cascade used to run a chunk of Yelp’s <a href="https://puppet.com/docs/pe/2019.8/pe_user_guide.html">Puppet</a> monolith in a <a href="https://www.docker.com/">Docker</a> container to apply configurations and to start the common set of daemons running on almost all Yelp’s hosts.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-10-14-flink-on-paasta/cascade.png" alt="Architecture of Cascade" /><p class="subtle-text"><small>Architecture of Cascade</small></p></div><p>Cascade also introduced a per-cluster controller component, which we call the Flink Supervisor, in charge of the Flink jobs’ life cycle (starting, stopping, savepointing) and monitoring.</p><p>While this system served us well for years, our developers were experiencing a handful of limitations:</p><ul><li>Spinning up a new Flink cluster took around 30 minutes</li>
<li>We needed trained human operators to manually deploy new versions or scale up resources for each cluster</li>
<li>We could not upgrade to newer versions of Flink until AWS supported them</li>
<li>Running Puppet in Docker and maintaining an infrastructure very different from the rest of Yelp’s was complex and time-consuming</li>
</ul><p>When Kubernetes started to gain more and more momentum both outside and inside the company, we decided that it was time for a change.</p><p><a href="https://engineeringblog.yelp.com/2015/11/introducing-paasta-an-open-platform-as-a-service.html">PaaSTA</a> is Yelp’s Platform As A Service and runs all Yelp’s web services and a few other stateless workloads like batch jobs. Originally developed on top of <a href="http://mesos.apache.org/">Apache Mesos</a>, PaaSTA is now being migrated to Kubernetes. This opened up the opportunity to support more complex workloads thanks to Kubernetes’ powerful primitives. Flink was the first in line and <a href="http://cassandra.apache.org/">Cassandra</a> is coming up in the very near future (be on the lookout for a new blog post!), both of them developed in tight collaboration with our Compute Infrastructure team.</p><p>Instead of “just” running Flink on top of Kubernetes using something off-the-shelf, we went down the road of developing a full-fledged platform that would make the experience of running Flink workloads as similar as possible to running any other service at Yelp. We did so to greatly reduce the knowledge necessary for a user to operate Flink clusters and to make our infrastructure very homogeneous with the rest of Yelp’s ecosystem.</p><p>With Flink on PaaSTA, provisioning a cluster is as easy as writing a YAML configuration file. New code deployments happen automatically via <a href="https://www.jenkins.io/">Jenkins</a> as soon as changes are committed to git. 
The commands provided by PaaSTA for starting, stopping, reading logs or monitoring a web service work exactly the same for any Flink cluster.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-10-14-flink-on-paasta/command.png" alt="paasta status command output" /><p class="subtle-text"><small>paasta status command output</small></p></div><p>In addition to UX improvements, we managed to reduce the average time to spin up a Flink cluster from 30 minutes to under 2 minutes, and we are now free to hop on the latest version of Flink on our own schedule.</p><h2 id="peeking-inside-the-hood">Peeking under the hood</h2><p>At the core of Flink on PaaSTA sits our custom Kubernetes <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/operator/">operator</a>, watching over the state of Flink clusters running on Kubernetes and making sure that they always match what is described in the configuration defined by the users.</p><p>Our PaaSTA glue translates this configuration into Kubernetes <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/">Custom Resources</a>, which the operator reads and updates with information taken from the Flink clusters, like the job list and status. 
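</p><p>To make the operator pattern concrete, here is a highly simplified sketch of the reconciliation idea at its core. This is illustrative only - not Yelp’s actual operator - and the resource fields are hypothetical:</p>

```python
# Illustrative sketch of a Kubernetes operator's reconcile step (hypothetical
# fields, not Yelp's actual operator). The operator compares the desired state
# from the custom resource with what is actually running and emits actions.
def reconcile(desired, observed):
    actions = []
    if observed is None:
        # Nothing running yet: create the whole cluster.
        actions.append(("create_jobmanager", desired["jobmanager"]))
        actions.append(("create_taskmanagers", desired["taskmanagers"]))
        return actions
    if observed["taskmanagers"] != desired["taskmanagers"]:
        actions.append(("scale_taskmanagers", desired["taskmanagers"]))
    if observed["image"] != desired["image"]:
        # A new Docker image means a new deployment: savepoint first, then redeploy.
        actions.append(("trigger_savepoint", None))
        actions.append(("redeploy", desired["image"]))
    return actions
```

<p>A real operator runs this comparison in a loop, reacting to every change in the Custom Resources.</p><p>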
These resources are also used by the PaaSTA commands to fetch what to show to the users and to interact with the operator for operations like start and stop.</p><p>The operator knows how to map the high-level definition of a Flink cluster resource into the right Kubernetes primitives like <a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/">Deployment</a> for scheduling the <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.11/concepts/flink-architecture.html#taskmanagers">TaskManagers</a>, <a href="https://kubernetes.io/docs/concepts/services-networking/service/">Service</a> to make the <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.11/concepts/flink-architecture.html#jobmanager">JobManager</a> discoverable by the other components in the cluster or <a href="https://kubernetes.io/docs/concepts/services-networking/ingress/">Ingress</a> to make the Flink web dashboard accessible by our users. The operator together with Jenkins schedules these components in Docker containers which allow us to customize the Flink installation and to select our Flink version of choice for each application.</p><p>You may find it surprising to see in the diagram below that our legacy Supervisor component still has a place in our new platform. At Yelp we like to approach all our projects with a practical spirit, infrastructure migrations included. While everything the Supervisor is doing could be worked into the operator, we decided to keep it around to reduce the development time by re-using existing features. 
Even more importantly, minimizing the scope of changes also helped to make the migration from Cascade to PaaSTA as easy as possible for our existing users.</p><p>For example, we deploy the Supervisor as a Kubernetes <a href="https://kubernetes.io/docs/concepts/workloads/controllers/job/">Job</a> to leverage its logic for triggering <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/state/savepoints.html">savepoints</a> of all the Flink jobs running on a cluster just before the operator shuts it down.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-10-14-flink-on-paasta/cluster.png" alt="Components of a Flink PaaSTA cluster" /><p class="subtle-text"><small>Components of a Flink PaaSTA cluster</small></p></div><p>If you’d love to hear more about the details, we encourage you to check out our <a href="https://youtu.be/hL5nNAMx8Bk">talk at Flink Forward</a>.</p><h2 id="what-now">What now?</h2><p>Freeing us from the need to manage hundreds of Flink clusters, Flink on PaaSTA unlocked a new world of possibilities for our users and our Stream Processing team.</p><p>On the infrastructure side, we are now close to adding <a href="https://beam.apache.org/">Apache Beam</a> support to Flink on PaaSTA in order to make Python stream processing a first-class citizen at Yelp. We are also working on implementing auto scaling and per-job cost reporting for Flink clusters.</p><p>On the UX side, we are developing tools to allow our users to define complex pipelines of streaming components with a single configuration file. 
We are also busy building features to shape our online machine learning platform.</p><p>Stay tuned if you want to hear about all the above and more!</p><div class="island job-posting"><h3>Data Streams Platform Engineer at Yelp</h3><p>Want to build next-generation streaming data infrastructure?</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/b5ccf6d4-d3c2-49cf-9692-9a9497ed4467/Senior-Platform-Engineer-Data-Streams?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/10/flink-on-paasta.html</link>
      <guid>https://engineeringblog.yelp.com/2020/10/flink-on-paasta.html</guid>
      <pubDate>Wed, 14 Oct 2020 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[The Dream Query: How we scope projects with GraphQL]]></title>
      <description><![CDATA[<p>At Yelp, new web pages and app screens are powered by GraphQL for fetching data.</p><p>This blog post describes the <strong>Dream Query</strong> – a pattern our feature teams use when refactoring or creating new pages.</p><p><em>(<a href="https://engineeringblog.yelp.com/2020/04/open-sourcing-dataloader-codegen.html">Check out our previous blog post</a> to see how we dynamically codegen DataLoaders to implement the server layer!)</em></p><h2 id="scoping-a-new-feature-with-graphql">Scoping a new feature with GraphQL</h2><p>Let’s jump in with an example!</p><p>Imagine your team is tasked with creating the new version of the “Header component” for the website (we’ll use the Yelp.com website in our example). You may receive a design mock that looks like this:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-10-07-dream-query/mock.png" alt="Mock Header Component" /><p class="subtle-text"><small>Mock Header Component</small></p></div><p>Your mission (should you choose to accept it, of course): turn this into code.</p><p>Alongside the usual planning activities such as OKR docs and estimated timelines, we’ve found it particularly helpful to write out a theoretical GraphQL query that could power the page or component - aka the “Dream Query”.</p><h2 id="writing-a-dream-query">Writing a Dream Query</h2><p>The first step is to identify what dynamic data we need to display. In the case of the Header component above, we can see the UI showing:</p><ul><li>Number of unread inbox messages</li>
<li>User’s profile photo</li>
</ul><p>Therefore we might write something like this:</p><div class="language-graphql highlighter-rouge highlight"><pre>query {
  loggedInUser {
    profilePhoto(size: "small") {
      src
    }
    inbox {
      unreadMessageCount
    }
  }
}</pre></div><p>The idea of the Dream Query is to let developers “just write” the query they wish they could write to power the page - with as low a barrier to entry as possible. In other words, imagine you’re writing the UI code, and you magically have everything already available to you. What query would you write?</p><p>Here are a few points to keep in mind:</p><ol><li><strong>Try to use real types</strong> that already exist in the schema. (Use GraphiQL’s docs tab to search.)</li>
<li><strong>It’s ok if you don’t get this perfect!</strong> Large schemas can contain hundreds of types, and it’s easy to miss stuff. (This will ideally be caught in review.)</li>
<li><strong>It’s ok to query for types that don’t exist yet.</strong> (That’s kind of the point here!)</li>
<li><strong>It’s ok if your team is totally new to GraphQL.</strong> Don’t worry if you aren’t super confident in the syntax yet - this will at least provide a great starting point for reviewers.</li>
<li><strong>Write the Dream Query before the real application code</strong> is written - ideally as part of the scoping or planning phase. This cuts down on the overall iteration cycle, since we aren’t writing any real code to implement resolver methods yet.</li>
</ol><h2 id="review">Review</h2><p>Once written, share the Dream Query widely. We do this in a Google Doc, so folks can comment line by line.</p><p>The goals here are to:</p><ul><li><strong>Refine the query</strong> such that it meets our schema design guidelines. (At Yelp, we have a community-driven schema review group specifically set up for this.)</li>
<li><strong>Find other teams who may be stakeholders</strong> in the types being created.</li>
<li><strong>Understand the time investment</strong> needed for the backend portion of the project (i.e. for creating new GraphQL resolvers).</li>
</ul><h2 id="graphql-faker">graphql-faker</h2><p>During review, it’s important not to block feature development. We want to be able to parallelize the backend and frontend work.</p><p>We’ve found <a href="https://github.com/APIs-guru/graphql-faker">graphql-faker</a> to be really helpful. It’s a super nifty tool for mocking up a schema, such that you can make “real” queries and iterate on the Dream Query in a live GraphQL playground.</p><p>This also lets client developers hook up the graphql-faker endpoint inside their application - meaning we can use a Dream Query in development to build the view layer while the schema is still in review.</p><h2 id="incrementally-using-the-dream-query">Incrementally using the Dream Query</h2><p>When writing new large pages, or incrementally refactoring a non-GraphQL page to use GraphQL, we may want to roll things out incrementally. Perhaps not all the resolvers can be implemented straight away.</p><p>We’ve found it helpful to paste in the whole Dream Query into the app:</p><div class="language-jsx highlighter-rouge highlight"><pre>const GET_HEADER_DATA = gql`
  query GetHeaderData {
    loggedInUser {
      city
      displayName
      # TODO: Uncomment and use when each field is supported
      # profilePhoto(size: "small") {
      #   src
      # }
      # inbox {
      #   unreadMessageCount
      # }
      # yearsElite
    }
  }
`;
function MyPage() {
  const { data, loading, error } = useQuery(GET_HEADER_DATA);
  if (error) throw error;
  if (loading) return null;
  const { displayName, city } = data.loggedInUser;
  return &lt;Header displayName={displayName} city={city} /&gt;;
}
</pre></div><p>Tickets can be created to uncomment specific fields. This provides a way to break up, parallelize, and track how much work is left to complete the migration. When a type becomes available in the schema, we can uncomment the lines and use it in production.</p><h2 id="why-a-dream-query-and-not-a-dream-schema">Why a dream query and not a dream schema?</h2><p>Schema proposals are great to see too! We recommend starting with the query first, since this maps directly to the interface the product will be using. It also allows those unfamiliar with the schema to quickly get an understanding of the shape of data the client is requesting, without having to go through all the options that the existing/proposed schema allows for.</p><h2 id="takeaways">Takeaways</h2><p>The Dream Query</p><ul><li>is used as a way to communicate what data the component or page needs, and what new backend work needs to be done</li>
<li>forces us as developers to think critically about what types we’re adding and encourages reuse of existing schema</li>
<li>provides an opportunity for schema reviewers to catch issues early, before iteration cycles are spent committing to suboptimal schema design</li>
<li>provides a way to chunk up migrations to GraphQL</li>
</ul><p>Mark Larah, Software Engineer (<a href="https://twitter.com/mark_larah">@mark_larah</a>)</p><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/b5d226cd-6ea1-4d12-b875-725b331202b7/Software-Engineer-Application-Backend-remote?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/10/dream-query.html</link>
      <guid>https://engineeringblog.yelp.com/2020/10/dream-query.html</guid>
      <pubDate>Wed, 07 Oct 2020 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Improving the performance of the Prometheus JMX Exporter]]></title>
      <description><![CDATA[<p>At Yelp, usage of <a href="https://prometheus.io/">Prometheus</a>, the open-source monitoring system and time series database, is blossoming. Yelp is initially focusing on onboarding infrastructure services to be monitored via Prometheus, one such service being <a href="https://kafka.apache.org/">Apache Kafka</a>. This blog post discusses some of the performance issues we initially encountered while monitoring Kafka with Prometheus, and how we solved them by contributing back to the Prometheus community.</p><h3 id="kafka-at-yelp-primer">Kafka at Yelp primer</h3><p>Kafka is an integral part of Yelp’s infrastructure; clusters vary in size and often contain several thousand topics. Kafka exposes a lot of metrics that can be collected, most of which are crucial for understanding the state of a cluster or broker during incidents, or for gauging its overall health. By default, Kafka reports metrics as JMX (<a href="https://en.wikipedia.org/wiki/Java_Management_Extensions">Java Management Extensions</a>) MBeans.</p><h3 id="prometheus-metrics-primer">Prometheus metrics primer</h3><p>One of the ways to export metrics in Prometheus is via <a href="https://prometheus.io/docs/instrumenting/exporters/">exporters</a>. Exporters expose metrics from services in a <a href="https://prometheus.io/docs/instrumenting/exposition_formats/">format</a> that Prometheus understands. Prometheus shards are then able to collect metrics exposed by these exporters.</p><p>The Prometheus community officially maintains the <a href="https://github.com/prometheus/jmx_exporter">JMX Exporter</a>, an exporter that can be configured to expose JMX MBeans from virtually any JVM-based process as Prometheus metrics. 
As mentioned above, Kafka is one such process.</p><hr /><p>In order to make Kafka metrics available in Prometheus, we decided to deploy the JMX Exporter alongside Kafka.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-10-02-improving-the-performance-of-the-prometheus-jmx-exporter/architecture.png" alt="Figure: Architecture of Prometheus metric collection for a 3-broker Kafka cluster" /><p class="subtle-text"><small>Figure: Architecture of Prometheus metric collection for a 3-broker Kafka cluster</small></p></div><p>When we initially deployed the JMX Exporter to some of the clusters, we noticed collection time could be as high as 70 seconds (from a broker’s perspective). We tried running the exporter as a Java agent and tweaking the configuration to collect only metrics that were interesting to us, but this did not improve the speed.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-10-02-improving-the-performance-of-the-prometheus-jmx-exporter/collection-time.png" alt="Figure: Collection time (in seconds) of a single Kafka broker with no prior code change." /><p class="subtle-text"><small>Figure: Collection time (in seconds) of a single Kafka broker with no prior code change.</small></p></div><p>This meant that metrics usable by automated alerting or engineers would have, at best, one datapoint per time series every 70 seconds. This would have made monitoring an infrastructure supporting real-time use cases difficult: spikes in incoming traffic, garbage collection pauses, and the like would be harder to spot.</p><p>We dug into the JMX Exporter codebase and realised some operations were repeated at every collection, sometimes hundreds of thousands of times per collection. For Kafka, some metrics are available at topic-partition granularity; if a Kafka cluster contains thousands of topic-partitions, thousands of metrics are exposed. 
One of the operations that seemed the most costly was <a href="https://github.com/prometheus/jmx_exporter/blob/ce04b7dca8615d724d8f447fa25c44ae1c29238b/collector/src/main/java/io/prometheus/jmx/JmxCollector.java#L375">matching MBean names against a configured set of regular expressions</a>, which then computes Prometheus sample <a href="https://github.com/prometheus/jmx_exporter/blob/ce04b7dca8615d724d8f447fa25c44ae1c29238b/collector/src/main/java/io/prometheus/jmx/JmxCollector.java#L408">name</a> and <a href="https://github.com/prometheus/jmx_exporter/blob/ce04b7dca8615d724d8f447fa25c44ae1c29238b/collector/src/main/java/io/prometheus/jmx/JmxCollector.java#L421">labels</a>.</p><p>The set of regular expressions is immutable over the lifespan of the exporter and between configuration reloads. This means that if an MBean name matches one of the regular expressions (or does not match any) during the first metric collection, it will match it for all collections until the configuration is changed or reloaded. The result of matching MBean names against the set of regular expressions can hence be cached and the time-consuming task of matching regular expressions (and computing sample name and labels) skipped during further collections.</p><p>After introducing this cache, heavy computations are made only once throughout the lifespan of the exporter. The initial collection does the heavy work of caching and takes a significant amount of time to complete; however, subsequent collections take very little time. Collections that used to take 70 seconds now take around 3 seconds. This allows us to have more fine-grained dashboards and alerting.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-10-02-improving-the-performance-of-the-prometheus-jmx-exporter/collection-time-updated.png" alt="Figure: Collection time (in seconds) before and after enabling rules caching. Red line shows the number of MBeans in the cache." 
/><p class="subtle-text"><small>Figure: Collection time (in seconds) before and after enabling rules caching. Red line shows the number of MBeans in the cache.</small></p></div><p>This change is now available in the upstream <a href="https://github.com/prometheus/jmx_exporter/pull/518">jmx_exporter</a>, and can be toggled on/off depending on the use case.</p><h3 id="looking-further">Looking Further</h3><p>As mentioned in the introduction, the usage of Prometheus at Yelp is growing and many systems and teams rely on it for monitoring, dashboards and automated alerting. The changes to the JMX exporter are only a small part of a large initiative driven by our <a href="https://www.yelp.careers/us/en/job/5fb956a7-4777-48d2-bc5e-ef49b5a2e300/Site-Reliability-Engineer">Production Engineering team</a>; watch this space for more insights into this journey!</p><h3 id="acknowledgements">Acknowledgements</h3><p>Thanks to Brian Brazil for code reviews and best practices.</p><div class="island job-posting"><h3>Site Reliability Engineering at Yelp</h3><p>Want to build and manage scalable, self-healing, globally-distributed systems?</p><a class="ybtn ybtn-primary" href="https://www.yelp.careers/us/en/job/5fb956a7-4777-48d2-bc5e-ef49b5a2e300/Site-Reliability-Engineer?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/10/improving-the-performance-of-the-prometheus-jmx-exporter.html</link>
      <guid>https://engineeringblog.yelp.com/2020/10/improving-the-performance-of-the-prometheus-jmx-exporter.html</guid>
      <pubDate>Fri, 02 Oct 2020 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Introducing Yelp's Machine Learning Platform]]></title>
      <description><![CDATA[<p>Understanding data is a vital part of Yelp’s success. To connect our consumers with great local businesses, we make millions of recommendations every day for a variety of tasks like:</p><ul><li>Finding you immediate quotes for a plumber to fix your leaky sink</li>
<li>Helping you discover which restaurants are open for delivery right now</li>
<li>Identifying the most popular dishes for you to try at those restaurants</li>
<li>Inferring possible service offerings so business owners can confidently and accurately represent their business on Yelp</li>
</ul><p>In the early days of Yelp circa 2004, engineers painstakingly designed heuristic rules to power recommendations like these, but turned to machine learning (ML) techniques as the product matured and our consumer base grew. Today there are hundreds of ML models powering Yelp in various forms, and ML adoption continues to accelerate.</p><p>As our ML adoption has grown, our ML infrastructure has grown with it. Today, we’re announcing our ML Platform, a robust, full-featured collection of systems for training and serving ML models built upon open source software. In this initial blog post, we will be focusing on the motivations and high-level design. We have a series of blog posts lined up to discuss the technical details of each component in greater depth, so check back regularly!</p><h2 id="yelps-ml-journey">Yelp’s ML Journey</h2><p>Yelp’s first ML models were concentrated within a few teams, each of which created custom training and serving infrastructure. These systems were tailored towards the challenges of their own domains, and cross-pollination of ideas was infrequent. Owning an ML model was a heavy investment in terms of both modeling and infrastructure maintenance.</p><p>Over several years, each system was gradually extended by its team’s engineers to address increasingly complex scope and tighter service level objectives (SLOs). The operational burden of maintaining these systems took a heavy toll, and drew ML engineers’ focus away from modeling iterations or product applications.</p><p>A few years ago, Yelp created a Core ML team to consolidate our ML infrastructure under centrally supported tooling and best practices. The benefits:</p><ol><li>Centrally managed systems for ML workflows would enable ML developers to focus on the product and ML aspects of their project without getting bogged down by infrastructure.</li>
<li>By staffing our Core ML team with infrastructure engineers, we could provide new cutting edge capabilities that ML engineers might lack expertise to create or maintain.</li>
<li>By consolidating systems we could increase system efficiency to provide a more robust platform, with tighter SLOs and lower costs.</li>
</ol><p>Consolidating systems for a topic as broad as ML is daunting, so we began by deconstructing ML systems into three main themes and developed solutions within each: interactive computing, data ETL, and model training/serving. The approach has worked well, and allowed teams to migrate portions of their workflows on to Core ML tooling while leaving other specialized aspects of their domain on legacy systems as needed.</p><p>In this blogpost, I’ll discuss how we architected our model training and serving systems into a single, unified model platform.</p><h2 id="yelps-ml-platform-goals">Yelp’s ML Platform Goals</h2><p>At a high level, we have a few primary goals for our ML Platform:</p><ul><li>Opinionated APIs with pre-built implementations for the common cases.</li>
<li>Correctness and robustness by default.</li>
<li>Leverage open source software.</li>
</ul><h3 id="opinionated-apis">Opinionated APIs</h3><p>Many of Yelp’s ML challenges fall into a limited set of common cases, and for these we want our ML Platform to enforce Yelp’s collective best practices. Considerations like metadata logging, model versioning, reproducibility, etc. are easy to overlook but invaluable for long-term model maintenance. Instead of requiring developers to slog through all of these details, we want our ML Platform to abstract and apply best practices by default.</p><p>Beyond codifying our ML workflows, opinionated APIs also enable us to streamline model deployment systems. By focusing developers into narrower approaches, we can support automated model serving systems that allow developers to productionize their model via a couple of clicks on a web UI.</p><h3 id="correctness-and-robustness-by-default">Correctness and robustness by default</h3><p>One of the most common pain points of Yelp’s historical ML workflows was system verification. Ideally, the exact same code used to train a model should be used to make predictions with the model. Unfortunately, this is often easier said than done – especially in a diverse, large-scale, distributed production environment like Yelp’s. We usually train our models in Python but might deploy the models via Java, Scala, Python, inside databases, etc.</p><p>Even the tiniest inconsistencies can make huge differences for production models. E.g., we encountered an issue where 64-bit floats were unintentionally used by an XGBoost booster for predictions (XGBoost only uses 32-bit floats). The slight floating-point differences when numerically encoding an important categorical variable resulted in the model giving approximately random predictions for 35% of instances!</p><p>Tolerating sparse vector representations, missing values, nulls, and NaNs also requires special consideration, especially when different libraries and languages have differing expectations for client-side pre-processing on these issues. 
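The 64-bit/32-bit mismatch above is easy to reproduce in a few lines. The sketch below (not Yelp’s actual encoding code) round-trips a value through a 32-bit representation and shows how a tree-split comparison can flip depending on which precision the caller used:

```python
import struct

def to_f32(x):
    """Round-trip a Python float (64-bit) through a 32-bit representation."""
    return struct.unpack("f", struct.pack("f", x))[0]

# A split threshold as a 32-bit engine stores it, and the same feature
# value accidentally left at 64-bit precision by the caller.
threshold = to_f32(0.1)   # 0.10000000149011612 when widened back to 64 bits
value_f64 = 0.1

# The comparison flips depending on the value's precision: the 64-bit 0.1
# sits just below the 32-bit threshold, while the correctly converted
# value is exactly equal to it.
goes_left_f64 = value_f64 < threshold          # True
goes_left_f32 = to_f32(value_f64) < threshold  # False
```

A datapoint that should land exactly on a split boundary thus takes different branches depending on the caller’s precision, which is how a seemingly tiny inconsistency turns into near-random predictions.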
E.g., some libraries treat zero as missing whereas others have a special designation. It is extremely complicated for developers to think through these implementation details, let alone recognize if a mistake has occurred.</p><p>When designing our ML Platform, we’ve adopted a test-driven development mindset. All of our code has a full suite of end-to-end integration tests, and we run actual Yelp production models and datasets through our tests to ensure the models give exactly the same results across our entire ecosystem. Beyond ensuring correctness, this also ensures our ML Platform is robust enough to handle messy production data.</p><h3 id="leverage-open-source-solutions">Leverage Open Source Solutions</h3><p>ML is currently experiencing a renaissance of open source technology. Libraries like Scikit-learn, XGBoost, TensorFlow, and Spark have existed for years and continue to provide the foundational ML capabilities. But newer additions like Kubeflow, MLeap, MLflow, TensorFlow Extended, etc. have reinvented what an ML system should entail and provide ML systems with much-needed software engineering best practices.</p><p>For Yelp’s ML Platform, we recognized that any in-house solution we might construct would be quickly surpassed by the ever-increasing capabilities of these open source projects. Instead, we selected the open source libraries best aligned with our needs and constructed thin wrappers around them to allow easier integrations with our legacy code. In cases where open source tools lack capabilities we need, we’re contributing solutions back upstream.</p><h2 id="ml-platform-technological-overview">ML Platform Technological Overview</h2><p>In future blog posts, we’ll be discussing these systems in greater detail, so check back soon. 
For now, I’ll just give a brief overview of the key tech choices and a model’s life cycle within these systems.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-07-01-ML-Platform-Overview/model_platform_overview.png" alt="" /></div><h3 id="mlflow-and-mleap">MLflow and MLeap</h3><p>After evaluating a variety of options, we decided on <a href="https://mlflow.org/">MLflow</a> and <a href="https://mleap-docs.combust.ml/">MLeap</a> as the skeleton of our platform.</p><p>MLflow’s goal is to make managing ML lifecycles simpler, and contains various subcomponents each aimed at different aspects of ML workflows. For our ML Platform, we especially focused on the MLflow Tracking capabilities. We automatically log parameters and metrics to our tracking server, and then developers use MLflow’s web UI to inspect their models’ performance, compare different model versions, etc.</p><p>MLeap is a serialization format and execution engine, and provides two advantages for our ML Platform. Firstly, MLeap comes out of the box with support for Yelp’s most commonly used ML libraries: Spark, XGBoost, Scikit-learn, and Tensorflow – and additionally can be extended for custom transformers to support edge cases. Secondly, MLeap is fully portable, and can run inside any JVM-based system including Spark, Flink, ElasticSearch, or microservices. 
Taken together, MLeap provides a single solution for our model serving needs like robustness/correctness guarantees and push-button deployment.</p><h3 id="typical-code-flow-in-our-ml-platform">Typical Code Flow in our ML Platform</h3><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-07-01-ML-Platform-Overview/model_training_flow.png" alt="Offline Code Flow for Training a Model in our ML Platform" /><p class="subtle-text"><small>Offline Code Flow for Training a Model in our ML Platform</small></p></div><p>Developers begin by constructing a training dataset, and then define a pipeline for encoding and modeling their data. Since Yelp models typically utilize large datasets, Spark is our preferred computational engine. Developers specify a Spark ML Pipeline for preprocessing, encoding, modeling, and postprocessing their data. Developers then use our provided APIs to fit and serialize their pipeline. Behind the scenes, these functions automatically interact with the appropriate MLflow and MLeap APIs to log and bundle the pipeline and its metadata.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-07-01-ML-Platform-Overview/model_serving_flow.png" alt="Online Code Flow for Serving a Model in our ML Platform" /><p class="subtle-text"><small>Online Code Flow for Serving a Model in our ML Platform</small></p></div><p>To serve models, we constructed a thin wrapper around MLeap that is responsible for fetching bundles from MLflow, loading the bundle into MLeap, and mapping requests into MLeap’s APIs. We created several deployment options for this wrapper, which allows developers to execute their model as a REST microservice, Flink stream processing application, or hosted directly inside Elasticsearch for ranking applications. 
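In spirit, that serving wrapper behaves like the sketch below. All names and the registry lookup are illustrative stand-ins, not MLeap’s or MLflow’s real APIs: resolve a configured model id to a loaded model once, then map each incoming request into the model’s predict call.

```python
class ModelServer:
    """Illustrative sketch of a thin model-serving wrapper. The registry
    dict stands in for fetching a bundle from a tracking server and
    deserializing it into an execution engine."""

    def __init__(self, model_id, registry):
        self.model = registry[model_id]  # stands in for fetch + load

    def predict(self, request):
        # Order the request's named features the way the model expects.
        features = [request[name] for name in self.model["feature_names"]]
        return self.model["predict_fn"](features)

# Toy "bundle": a linear scorer over two features.
registry = {
    "run-42": {
        "feature_names": ["review_count", "rating"],
        "predict_fn": lambda f: 0.3 * f[0] + 0.7 * f[1],
    }
}
server = ModelServer("run-42", registry)
score = server.predict({"rating": 4.0, "review_count": 10.0})  # ≈ 5.8
```

Because the wrapper only needs a model id and a feature dict, the same loading-and-dispatch logic can be reused across very different hosts (a REST service, a Flink job, or an Elasticsearch plugin).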
In each deployment option, developers simply configure the MLflow id for the models they want to host, and then can start sending requests!</p><h2 id="whats-next">What’s Next?</h2><p>We’ve been rolling out our ML Platform incrementally, and observing enthusiastic adoption by our ML practitioners. The ML Platform is full featured, but there are some improvements we have on our roadmap.</p><p>First up is expanding the set of pre-built models and transformers. Both MLflow and MLeap are general purpose and allow full customization, but doing so is sometimes an involved process. Rather than requiring developers to learn the internals of MLflow and MLeap, we’re planning to extend our pre-built implementations to cover more of Yelp’s specialized use cases.</p><p>We’d also like to integrate our model serving systems with Yelp’s A/B experimentation tools. Hosting multiple model versions on a single server is available now, but currently relies on clients to specify which version they want to use in each request. However, we could further abstract this detail and have the serving infrastructure connect directly to the experimentation cohorting logic.</p><p>Building on the above, we would like to have the actual observed events feed back into the system via Yelp’s real-time streaming infrastructure. By joining the observed events with the predicted events, we can monitor ML performance (for different experiment cohorts) in real-time. This enables several exciting properties like automated alerts for model degradation, real-time model selection via reinforcement learning techniques, etc.</p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/07/ML-platform-overview.html</link>
      <guid>https://engineeringblog.yelp.com/2020/07/ML-platform-overview.html</guid>
      <pubDate>Wed, 01 Jul 2020 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[How businesses have reacted to COVID-19 using Yelp features]]></title>
      <description><![CDATA[<p>Yelp periodically releases an open, all-purpose dataset for learning. The dataset is a subset of our businesses, reviews, and user data to inform government policy, academic research, and business strategy, among other uses. It has provided opportunities including teaching students about databases, helping others study natural language processing, sampling production data while learning to create mobile apps, and discovering compelling <a href="https://www.yelp.com/dataset/challenge/winners">research findings</a>. <a href="https://www.yelp.com/dataset">Our most recent dataset</a> was published in March 2020.</p><p>Businesses everywhere are adapting to the effects of the Coronavirus and have been using Yelp <a href="https://blog.yelp.com/2020/05/supporting-local-businesses-and-the-yelp-community-with-new-products-and-features">features</a> to stay connected with their customers. To this end, we’re releasing an addendum dataset including the following components, as of June 10, 2020:</p><ul><li><a href="https://blog.yelp.com/2020/03/new-page-features-to-communicate-covid-19-response">COVID-19-related business highlights</a></li>
<li>Restaurants with delivery/takeout enabled</li>
<li>Restaurants partnered with Grubhub</li>
<li>Businesses with <a href="https://biz.yelp.com/support/call_to_action">Call to Action</a> buttons enabled</li>
<li>Businesses that still have <a href="https://blog.yelp.com/2016/04/yelp-request-a-quote">Request A Quote</a> enabled</li>
<li>Businesses that created a <a href="https://blog.yelp.com/2020/04/coronavirus-alert-banner-examples-for-business-pages">custom page banner</a> during COVID-19</li>
<li>Temporary closures</li>
<li><a href="https://blog.yelp.com/2020/05/supporting-local-businesses-and-the-yelp-community-with-new-products-and-features">Virtual Services offered</a></li>
</ul><p>We hope researchers, academics, and any interested parties will utilize this new data, along with our <a href="https://www.yelpeconomicaverage.com/yelp-coronavirus-economic-impact-report.html">most recent economic impact report</a>, to investigate and further understand the broad-ranging effects of the coronavirus pandemic. Download the new Yelp dataset <a href="https://www.yelp.com/dataset/download">here</a>.</p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/06/how-businesses-have-reacted-to-covid-19-using-yelp-features.html</link>
      <guid>https://engineeringblog.yelp.com/2020/06/how-businesses-have-reacted-to-covid-19-using-yelp-features.html</guid>
      <pubDate>Mon, 15 Jun 2020 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[dataloader-codegen: Autogenerate DataLoaders for your GraphQL Server!]]></title>
      <description><![CDATA[<p><a href="https://engineeringblog.yelp.com/2020/04/open-sourcing-dataloader-codegen.html">Read the full post on the Yelp Engineering Blog</a>.</p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/04/open-sourcing-dataloader-codegen.html</link>
      <guid>https://engineeringblog.yelp.com/2020/04/open-sourcing-dataloader-codegen.html</guid>
      <pubDate>Wed, 08 Apr 2020 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[An Ever Evolving Company Requires an Ever Evolving Communication Plan]]></title>
      <description><![CDATA[<p>It’s 2014 and your teams are divided by platform, something like: Web, Mobile Web, Android, and iOS.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/an-ever-evolving-company-requires-an-ever-evolving-communication-plan/Original_Org.png" alt="" /></div><p>In order to launch features, product managers jump from platform to platform and teams move fast. Really fast. Lines of code in each repository increase to the point where you now name them “monoliths.” A few engineers maintain these monoliths when they need to, but no one is solely dedicated to the task. Engineers are distributed by platform, so communication about when to maintain the monoliths is easy, but this presents another problem.</p><p><strong>Can you continue to ship code efficiently if you depend entirely on these monoliths?</strong> It turns out that as you increase the number of developers and the size of the code base, the number of rollbacks and unscheduled mobile point releases also increases. At first you notice only a few rollbacks, but as your team grows, you start to expect that every push will result in a rollback. This is not the typical “up and to the right” graph that companies look for.</p><p><strong>Since rollbacks sound like a blocker, we’ve come up with an alternative: microservices. Then another: product teams.</strong></p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/an-ever-evolving-company-requires-an-ever-evolving-communication-plan/Updated_Org.png" alt="" /></div><p>Now the company can scale both infrastructure and team organization. Product teams have a common set of infrastructure; the button used on the Growth team is the same button used on the Contributions team. Function-based (core) teams spin up. They work on the parts that individual maintainers worked on in the days of the monolith. They’re dedicated to making sure that, in the long term, we’re coding sustainably. 
Communication becomes harder. In fact, communication complexity continues to increase. Core teams used to do all the changes needed for maintenance/infrastructure upgrades, but the organization has gotten so large they need to rely on product teams to do the bulk of the work. Core teams generate a list of maintenance items that product teams need to work on, but product teams have to concentrate on adding new products.</p><p><strong>How do we prioritize work?</strong> Before we prioritize work, we need to identify who’s responsible for what. To tackle this problem, Core teams create tooling. Ownership becomes more defined with added metadata to “entities,” an abstract term used to describe things like code and alerting. All this ownership becomes shareable via the ownership service, and, we can now track migrations across the engineering organization with a tool called “migration-status.” We start by defining migrations from a “core team” perspective, but also have migrations from other infrastructure teams. Now that product teams are multi-disciplinary, we start to bombard them with an increasing number of messages to upgrade/migrate their infrastructure. Communication complexity increases and efficiency decreases.</p><p>We start thinking of a way to tie together priorities from multiple teams. We need a solution that has a global view and seeks to control communication complexity. Just like how a notification platform for your users needs to figure out the right messages to send, we need a tool to surface the right reminders to the right teams. So, which messages are sent to which users?</p><p><strong>Over the next few blog posts, we’ll walk you through what the Engineering Effectiveness Metrics (EE Metrics) Platform is and how we use it to reduce communication complexity.</strong> The first blog post will dive into our “Ownership” service. We’ll be talking about what it is, how we use it, and the value that it brings to our engineering organization. 
The second post will cover how we use the EE Metrics tool to increase awareness of developer velocity and code quality and to improve prioritization of critical migrations for product teams. We do all of these things while maintaining a safe space for teams and individuals.</p><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/03/an-ever-evolving-company-requires-an-ever-evolving-communication-plan.html</link>
      <guid>https://engineeringblog.yelp.com/2020/03/an-ever-evolving-company-requires-an-ever-evolving-communication-plan.html</guid>
      <pubDate>Fri, 06 Mar 2020 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Supporting Spark as a First-Class Citizen in Yelp’s Computing Platform]]></title>
      <description><![CDATA[<p>Yelp extensively utilizes distributed batch processing for a diverse set of problems and workflows. Some examples include:</p><ul><li>Computation over Yelp’s review corpus to identify restaurants that have great views</li>
<li>Training ML models to predict personalized business collections for individual users</li>
<li>Analytics to extract the most in-demand service offerings for Request a Quote projects</li>
<li>On-demand workloads to investigate surges in bot traffic so we can quickly react to keep Yelp safe</li>
</ul><p>Over the past two years, Yelp engineering has undertaken a series of projects to consolidate our batch processing technologies and standardize on Apache Spark. These projects aimed to simultaneously accelerate individual developer workflows by providing easier access to powerful APIs for distributed computing, while also making our systems more robust, performant, and cost efficient on a macro scale.</p><h2 id="background">Background</h2><p>Throughout Yelp’s history, our batch processing framework of choice was MapReduce, executed via Amazon Elastic MapReduce (AWS EMR). We even constructed our own open source framework, <a href="https://github.com/Yelp/mrjob">mrjob</a>, which abstracts the details of the underlying MapReduce execution infrastructure away from developers. This way they could focus on the application-specific portions of their workflow instead, like defining their Map and Reduce steps. This framework has served us well over the years, and every day our production environment executes hundreds of mrjobs.</p><p>Over time though, Yelp developers were increasingly drawn towards Apache Spark. The foremost advantage of Spark is in-memory computing, but additional advantages include a more expressive API and large library of open source extensions for specialized workloads. This API flexibility makes it easier for developers to write their distributed processing workloads and results in higher-quality code that is both more performant and easier to maintain. However, without a well-supported backend, provisioning Spark resources was an intensive process that made deploying Spark jobs to production a challenge and all but eliminated Spark from contention for ad hoc workflows en masse.</p><p>Better support for Spark seemed like a promising direction, so our first step was to add Spark support into Yelp’s mrjob package (seen in mrjob v0.5.7). 
This enabled developers to write Spark code using the familiar mrjob framework and execute their Spark jobs on AWS EMR. Results from early adopters were encouraging, with one Yelp engineering team going so far as to convert over 30 of their legacy MapReduce mrjob batches into Spark mrjob batches, resulting in an aggregated 80% runtime speedup and 50% cost savings! Clearly, Spark was a direction that could add substantial value to Yelp’s distributed computing platform.</p><p>Running Spark mrjob batches on AWS EMR was viable for many production batch use cases, but also demonstrated a few problems. Firstly, it was painful to connect to the rest of Yelp’s infrastructure, and consequently workloads had to operate in isolation (e.g., they couldn’t make requests to other Yelp services). Secondly, it was painful to use for ad hoc workloads since it required launching an AWS EMR cluster on demand, which could take up to 30 minutes between provisioning and bootstrapping.</p><p>Integrating Spark as a first-class citizen in Yelp’s computing platform as a service, <a href="https://github.com/Yelp/paasta">PaaSTA</a>, enabled us to ease these pain points while also inheriting all of PaaSTA’s capabilities.</p><h2 id="spark-on-paasta">Spark on PaaSTA</h2><p>PaaSTA is Yelp’s de facto platform for running services and containerized batches. At its core it’s (currently) built on <a href="http://mesos.apache.org/">Apache Mesos</a>. Spark has native support for Mesos, but was a new framework for PaaSTA, which had previously only executed <a href="https://mesosphere.github.io/marathon/">Marathon</a> for long-running services, and Yelp’s in-house batch scheduling system, <a href="https://github.com/Yelp/Tron">Tron</a>, for containerized batches. 
To set up Spark as a framework in PaaSTA, we needed to select several configuration settings and design the interfaces Spark on PaaSTA would expose to Yelp developers.</p><p>We elected to run the Spark driver as a Mesos framework in a Docker container using the Spark client deploy mode. Since losing the driver is catastrophic for Spark clusters, we constrained Spark drivers to only run on a dedicated Auto Scaling Group of on-demand EC2 instances. On the other hand, Spark’s resilient data model provides automatic recovery from executor loss, allowing us to run Spark executors in Docker containers on a cluster of EC2 spot instances. For simplicity, we configured Spark on PaaSTA such that the Spark driver and executors used the same Docker image pulled from our internal Docker registries.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-03-02-spark-on-paasta/spark_diagram.png" alt="" /></div><p>Yelp has two primary use cases for Spark: offline batches and ad hoc interactive computing. To serve these needs, we created two APIs for Spark on PaaSTA:</p><ul><li>A command line interface that developers use to schedule Spark batches. Behind the scenes, this interface injects the necessary Spark configuration constraints to connect to PaaSTA’s Mesos masters, provision executors on PaaSTA’s Mesos Agents, pull images from the appropriate Docker registry, and create a SparkSession object.</li>
<li>A Python package that developers invoke from arbitrary Python code (e.g., Jupyter notebooks). Much like our command line interface, this package injects the necessary Spark configuration constraints and then returns the resulting SparkSession object.</li>
</ul><p>Both of these APIs allow developers to have full control over how to configure their Spark cluster and can provide overrides for any of Spark’s configuration settings (e.g., executor memory, max cores, driver results size, etc.). PaaSTA will then use those values to provision and configure a Spark cluster as requested.</p><p>Beyond Spark configuration settings, Yelp developers also have the ability to specify the Docker image that Spark uses. This enables developers to easily include custom code (e.g., from their production service) in their Spark workflows. To reduce developer overhead, we’ve constructed an internal debian package that developers can install into their Docker image to automatically include many Spark extensions that are valuable for Yelp workflows, like hadoop-aws, spark-avro, etc.</p><h2 id="isolating-spark-jobs-with-a-dedicated-mesos-pool">Isolating Spark Jobs with a Dedicated Mesos Pool</h2><p>Initially, we provisioned Spark frameworks on the same Mesos pools as Yelp’s other Mesos frameworks. However, we quickly recognized that Spark workloads have drastically different characteristics than the long-running Marathon services our Mesos pools were configured to support. Two differences in particular convinced us to create a dedicated Mesos pool for Spark jobs.</p><p>Firstly, Spark workflows are stateful. Yelp heavily uses AWS EC2 spot instances to drive cost savings for our computing platforms, which means an instance can be reclaimed by AWS at any time. Moreover, the PaaSTA cluster autoscaler dynamically scales the Mesos cluster to maintain a desired utilization, and can kill service instances on underutilized Mesos agents without warning. Since Yelp’s Marathon services are stateless and frequently have multiple concurrent instances, these abrupt disruptions are mostly inconsequential. 
While Spark can recover from losing an executor, losses can cause cached RDDs to be recomputed, thereby increasing load on upstream datastores and degrading developer experience. With a dedicated pool, we can use AWS instance types with lower reclamation rates, and a specially tailored Spark autoscaler (discussed in the next section) can minimize the probability of executor loss caused by PaaSTA.</p><p>Secondly, Spark workflows are more memory-intensive than service workloads. While one of Spark’s primary advantages over MapReduce is in-memory computing, that advantage is only realized if the Spark cluster has sufficient memory to hold the necessary data. At Yelp, it’s common for our developers to request terabytes of memory in aggregate for a single Spark cluster alongside several hundred CPUs. That memory-to-CPU ratio is substantially different from stateless Marathon services, which are typically CPU-bound with low memory footprints. With a dedicated pool, we’re able to populate Spark frameworks’ Mesos agents with AWS instances that have higher memory capacity and SSD drives in order to deliver a more cost-effective system with higher resource utilization.</p><h2 id="autoscaling-spark">Autoscaling Spark</h2><p>Yelp has been autoscaling our PaaSTA clusters for several years, reducing infrastructure costs by only running as many servers as necessary. We generally use a fairly standard reactive autoscaling algorithm that attempts to keep the most utilized resource (e.g., CPUs or memory) at a desired level (around 80%). For example, if 90 out of 100 CPUs in the cluster were in use, it would add another ~12 CPUs to the cluster to bring CPU utilization back to 80%. If, later on, only 80 of the now 112 CPUs are utilized, it will downscale the cluster back to 100 CPUs. 
This approach works well for most workloads we run at Yelp, the majority of which are long-running services whose load varies gradually throughout the day in proportion to web traffic.</p><p>Spark workloads, however, do not have gradually varying needs. Instead, the typical workload makes a large, sudden request for hundreds or thousands of CPUs, abruptly returning these resources a few hours (or even minutes) later when the workload completes. This causes sudden load spikes on the cluster, which is problematic for a reactive autoscaling approach for two reasons.</p><p>Firstly, a reactive autoscaling approach can only trigger scaling actions when the cluster is already over or under utilized. This is not problematic for gradually shifting workloads since the extra load usually fits into the cluster headroom (i.e., the 20% of CPUs the autoscaler keeps unallocated) while additional capacity is added. However, large Spark jobs can easily exceed the cluster headroom, preventing the workload and anything else on the cluster from obtaining additional resources until more machines are provisioned.</p><p>Secondly, relying on resource utilization obscures the true quantity of resources needed. If a cluster with 100 CPUs is at 100% utilization and our desired utilization is 80%, then the aforementioned reactive autoscaling strategy will provision 25 additional CPUs regardless of how many CPUs are needed. While it’s possible that 25 CPUs will be sufficient, if the Spark workload requested a thousand CPUs then it will take 11 autoscaling cycles (!) to reach the desired capacity, impeding workloads and causing developer frustration. By relying only on current utilization, we have no way to distinguish between these two cases. 
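The cycle-count arithmetic above can be sketched as follows (an illustration, not Clusterman or PaaSTA code): while the workload keeps the cluster saturated, each reactive cycle can only grow capacity by a factor of 1/0.8 = 1.25.

```python
def cycles_to_reach(needed_cpus: float, capacity: float = 100.0,
                    desired_util: float = 0.80) -> int:
    """Reactive autoscaling cycles needed while the workload saturates the cluster."""
    cycles = 0
    while capacity < needed_cpus:
        capacity /= desired_util  # each cycle: capacity grows 1.25x
        cycles += 1
    return cycles

# A request for 1,000 CPUs against a 100-CPU cluster takes 11 cycles.
```
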
See below for an example in which it took our reactive algorithm four cycles and almost one and a half hours to scale the cluster from 100 to 500 CPUs for a Spark job.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-03-02-spark-on-paasta/reactive_scaleup.png" alt="" /></div><p>To solve these problems, we turned to Clusterman, our modular cluster autoscaler, which makes it simple to write custom autoscaling code for specific pools of machines. (You can check out our <a href="https://engineeringblog.yelp.com/2019/02/autoscaling-mesos-clusters-with-clusterman.html">blogpost</a> to learn more about it!) First, we extended the APIs that developers use to start Spark on PaaSTA jobs to send the Spark workflow’s resource needs to Clusterman. We then created a custom Clusterman signal for Spark that looks at these reported resource needs and compares them to the list of Spark frameworks currently registered with our Mesos clusters. If the framework associated with a given resource request is still running or we’re within a several minute grace period, that resource request is included in Clusterman’s allocation target. Because Clusterman knows the full resource requirements of each job as soon as it starts, we can make sure that Spark on PaaSTA jobs wait as little as possible for resources, regardless of quantity. The graph below shows our new approach performing the same task as the previous one in 15 minutes instead of one and a half hours!</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-03-02-spark-on-paasta/clusterman_scaleup.png" alt="" /></div><h2 id="spark-on-paasta-results">Spark on PaaSTA Results</h2><p>Over the past two years, we’ve seen accelerating adoption of Spark on PaaSTA among Yelp developers. Roughly 80% (and climbing) of all scheduled batches are now running Spark on PaaSTA instead of legacy mrjob on AWS EMR! 
In addition, Yelp developers create hundreds of Spark clusters every day for their ad hoc workloads.</p><p>Aside from improving job performance and developer experience, moving to Spark on PaaSTA has also resulted in meaningful cost savings. As mentioned earlier, our legacy mrjob package runs on AWS EMR. Since EMR is a managed platform on top of EC2, AWS bills for EMR by taking the underlying EC2 cost and adding a premium. In essence, you can think of EMR as having a usage tax, with the EMR tax rate equal to the EMR premium divided by the EC2 cost. Figure 2 shows the EMR tax rate for different configurations of M5 and R5 instances. In many cases, the EMR tax is a substantial portion of overall EMR costs, and since Yelp uses spot instances heavily, our aggregate savings from moving Spark jobs from EMR to PaaSTA are over 30%.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-03-02-spark-on-paasta/emr_tax_rate.png" alt="" /></div><p>Given the success of early Spark adoption, an organizational goal for Yelp in 2019 was to migrate all batch processing workloads to Spark on PaaSTA. As you might expect, migrating hundreds of legacy batches (many of which have been running without intervention for years) was a daunting endeavor. Rather than going through batches one by one, we instead migrated legacy MapReduce mrjobs en masse via an mrjob extension that wraps MapReduce code and executes it via Spark on PaaSTA. While rewriting MapReduce jobs to fully utilize Spark capabilities results in peak performance, we’ve observed significant wins just from running the existing Map and Reduce steps in Spark instead of MapReduce.</p><h2 id="conclusions">Conclusions</h2><p>Looking back on our journey, adding Spark support to our computing platform has gone fairly smoothly. Nevertheless, there are still a few things we want to improve.</p><p>The first is system efficiency. 
By default, Mesos uses round robin task placement, which spreads Spark executors across many Mesos agents and results in most agents containing executors for many Spark frameworks. This causes problems for cluster downsizing and can yield low cluster utilization. Instead, we would prefer to pack executors onto fewer Mesos agents. We are currently experimenting with a patch to Spark’s Mesos scheduler that greedily packs executors onto hosts. We also plan to investigate Spark’s Dynamic Resource Allocation mode as a further improvement to cluster efficiency.</p><p>The second is stability. We’ve made a deliberate decision to run Spark on spot EC2 instances to benefit from their lower cost, but this choice also means that executor loss is possible. While Spark can recover from this, these events can lead to substantial recomputations that disrupt developer workflows; in the worst cases, a job gets stuck in a perpetual crash-recomputation loop that has to be manually terminated. These issues are magnified as our Spark clusters continue to grow: clusters with thousands of CPUs and terabytes of memory are likely to experience at least one executor loss. Some solutions we plan to explore include aggressive checkpointing and/or selectively utilizing on-demand EC2 instances.</p><p>Finally, we’re in the process of converting our Spark deployment from Mesos to Kubernetes, with the primary advantage being that Kubernetes provides additional control layers for us to tune cluster stability, responsiveness, and efficiency. 
These changes are being made as part of PaaSTA itself, meaning that we can change the backend infrastructure without developers needing to alter their Spark usage!</p><p>We’re continuing to invest in Spark as a premier computing engine at Yelp, so stay tuned for further updates!</p><h2 id="acknowledgments">Acknowledgments</h2><p>Special thanks to everyone on the Core ML and Compute Infrastructure teams for their tireless contributions to bring Spark to all of Yelp!</p><div class="island job-posting"><h3>Become a Distributed Systems Engineer at Yelp</h3><p>Interested in designing, building, and deploying core infrastructure systems? Apply to become a Distributed Systems Engineer today.</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/a368cc58-18d4-4d0a-a58e-44b9da767322?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/03/spark-on-paasta.html</link>
      <guid>https://engineeringblog.yelp.com/2020/03/spark-on-paasta.html</guid>
      <pubDate>Mon, 02 Mar 2020 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Accelerating Retention Experiments with Partially Observed Data]]></title>
      <description><![CDATA[<p>Here at Yelp, we generate business wins and a better platform by running A/B tests to measure the revenue impact of different user and business experience interventions. Accurately estimating key revenue indicators, such as the probability a customer retains at least \(n\)-days (\(n\)-day retention) or the expected dollar amount a customer spends over their first \(n\) days (\(n\)-day spend) is core to this experimentation process.</p><p>Historically at Yelp, \(n\)-day customer or user retention was typically estimated as the proportion of customers/users we observed for more than \(n\) days who retained more than \(n\) days. Similarly, \(n\)-day spend was estimated as the average amount spent over the first \(n\) days since experiment cohorting by businesses we have observed for at least \(n\) days.</p><p>Recently, we transitioned to using two alternative statistical estimators for these metrics: the Kaplan-Meier estimator and the mean cumulative function estimator. These new approaches consider censored data, i.e. partially observed data, like how long a currently subscribed advertiser will retain as a customer. Accordingly, they offer several benefits over the previous approaches, including higher statistical power, lower estimate error, and more robustness against within-experiment seasonality.</p><p>By performing Monte-Carlo simulations [<a href="https://engineeringblog.yelp.com#1">1</a>, Chapter 24], we determined that using these estimators allowed us to read A/B experiment metrics a fixed number of days earlier after cohorting ends without any drop in statistical power. This amounted to a 12% to 16% reduction in overall required cohorting and observation time, via a 25% to 50% reduction in the time used to observe how people respond to the A/B experiences. 
Altogether, this improved our ability to iterate on our product.</p><p>The value of a Yelp customer can be quantified along two primary, informative dimensions: how long a user / business remains active / subscribed in our system (known as retention), and the total dollar amount they generate over their lifetime (known as cumulative spend). When we experiment on different user / business experiences, we make a point of estimating the effect of these changes on retention and spend metrics before we make a final ship decision.</p><p>As a proxy, dollars might sometimes be replaced with less noisy units like ad clicks, page views, etc., but for the purposes of this blog post we will focus on altering a business experience and analyzing \(n\)-day retention and spend. The conclusions carry over equally well to experimentation settings that deal either with users or with similarly defined proxy metrics.</p><p>The diagram below illustrates the typical lifecycle of an A/B test focused on retention and spend.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2020-02-19-accelerating-retention-experiments-with-partially-observed-data/life_of_ab_test_diagram.png" alt="The life of a Yelp A/B test." /></p><p>Depending on the type of experience change and the time window used for measurement, the cohorting and observing phases can in many instances take the most time of the whole experimentation pipeline. As such, the acceleration of the observing phase detailed here can provide improvements in our ability to iterate on our product.</p><p>One possible retention measure is “what percentage of those in cohort \(C\) who subscribed to product \(P\) at any point during the experiment went on to retain for more than \(n\) days,” where \(C\), \(P\), and \(n\) are parameters the experimenter can adjust. 
Yelp previously computed this \(n\)-day retention measure within each experiment cohort as follows: of the customers who started purchasing product \(P\) during the experiment who we have observed for at least \(n\) days since their initial purchase, report the proportion who are still subscribed at day \(n\).</p><p>Our measure for dollars spent is analogous: we measure \(n\)-day spend by considering “on average, how many dollars do those in cohort \(C\) spend on products \(P_1, P_2, \ldots, P_k\) over their first \(n\) days after being cohorted.” Here, \(C\), the product basket \(P_1, P_2, \ldots, P_k\), and \(n\) are freely adjustable as above.</p><p>Standard error estimates, which rely on the Central Limit Theorem [<a href="https://engineeringblog.yelp.com#1">1</a>, Theorem 6.16], are always reported for both of these metric estimators as well. These estimators are unbiased and consistent (as the number of people observed for at least \(n\) days grows), assuming that the retention of customers is independent of the relative time they enter the experiment.</p><p>There are a number of potential concerns that can arise from using the estimators mentioned above.</p><p><strong>Variance:</strong> One of the larger concerns is that these estimators can have large variance if only a small number of customers in a cohort have been fully observed during the experiment period. Let’s suppose we want to cohort individuals into an experiment for 70 days and are interested in estimating the 60-day retention of each cohort in the experiment. At day 75, the uncensored estimators above only access the users who arrived during the first 15 days of experiment cohorting, even though we have been cohorting users for 5 times as long. Therefore, unless the sample size is very large or the underlying retention/spend distribution is very concentrated, our estimator will have large variance at this point. 
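For concreteness, the uncensored (status quo) retention estimator can be sketched as below; the data encoding is illustrative rather than Yelp’s actual schema:

```python
def status_quo_retention(observed_days, churn_day, n):
    """Uncensored n-day retention estimate.

    observed_days[i]: days we have observed customer i since their purchase.
    churn_day[i]: day on which customer i churned, or None if still subscribed.
    Only customers observed for at least n days enter the sample.
    """
    sample = [c for o, c in zip(observed_days, churn_day) if o >= n]
    return sum(1 for c in sample if c is None or c > n) / len(sample)

# Three customers, 60-day retention: only the first two are observed for at
# least 60 days, and one of them churned on day 30, so the estimate is 1/2.
```
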
This forces us to wait more days after the experiment ends to get a genuine metric read or detect a bona-fide difference in retention or spend of the cohorts.</p><p><strong>Seasonality:</strong> In addition to the variance issue, the retention or spend characteristics of individuals cohorted into an experiment may vary with the experiment runtime. In this situation, the uncensored estimators will in general be biased away from the true population mean retention and spend over the experiment window until after the observation phase is fully completed. For example, if we run a revenue experiment that starts right before Christmas, the 60-day retention estimate 75 days after the hypothetical experiment above was started would be heavily biased towards HVAC contractors and retail stores instead of the large fraction of restaurants that were closed over the holidays. It is feasible that we could declare a difference between the two populations at day 75 and make a conclusion about the experiment results, even though the underlying estimates are biased and reflect only a subset of the population.</p><p>Our solution to a more effective retention estimate, which can mitigate the problems mentioned above, is based on the so-called Kaplan-Meier estimator of the “survival curve.” The survival curve \(S(t)\) is a function of time \(t\) which returns the probability that someone would retain for at least \(t\) days after subscribing. Accordingly, if we had access to the population survival curve, \(S(n)\) would return the proportion of businesses in our population who would retain at least \(n\) days, precisely our retention metric. The Kaplan-Meier estimator is a nonparametric estimator of the whole survival curve. 
Evaluating the estimated curve at time \(t = n\) days gives an estimate of the desired retention metric:</p><p><img src="https://engineeringblog.yelp.com/images/posts/2020-02-19-accelerating-retention-experiments-with-partially-observed-data/km_curve_example.png" alt="An example Kaplan-Meier estimate of a survival curve." /></p><p>For a coarser discretization of time than typically used, this Kaplan-Meier estimate of \(n\)-day retention first writes the \(n\)-day retention as \[S(n) = \prod_{t=1}^n \mathrm{Pr}(\text{remains subscribed through day }t |\text{ subscribed for first }t - 1\text{ days}).\] At this point, each multiplicand is estimated as \(h_t\): among the people we have observed for at least \(t\) days who were subscribed at the end of day \(t - 1\), the fraction who stayed subscribed through the end of day \(t\). The full estimate is then \[S(n) = \prod_{t=1}^n h_t.\] This reduces to the status quo version, which does not incorporate censored data, if we instead compute each \(h_t\) using only the individuals observed for at least \(n\) days, rather than the larger sample afforded by using everyone observed for at least \(t \le n\) days. Better utilization of available information can increase the precision of our estimates, increase statistical power to detect differences in cohort retentions, and mitigate sensitivity to time-dependent retention characteristics over the course of the experiment.</p><p>This estimator is a consistent estimator of the whole survival curve (computed by varying \(n\)) as both the number of individuals and the length of time we observe each of them increase [<a href="https://engineeringblog.yelp.com#2">2</a>]. 
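A minimal sketch of this discretized Kaplan-Meier computation (the data encoding is illustrative):

```python
def km_retention(duration, churned, n):
    """Discretized Kaplan-Meier estimate of n-day retention.

    duration[i]: days customer i was under observation; if churned[i] is True
    they churned on day duration[i], otherwise they are censored (still
    subscribed when we last saw them).
    """
    s = 1.0
    for t in range(1, n + 1):
        at_risk = sum(1 for d in duration if d >= t)  # observed through day t
        if at_risk == 0:
            break  # no information past this point
        churns = sum(1 for d, c in zip(duration, churned) if c and d == t)
        s *= (at_risk - churns) / at_risk  # the factor h_t
    return s
```

Unlike the status quo estimator, censored customers still contribute to every \(h_t\) up to the day we lost sight of them.
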
It is not, in general, unbiased [<a href="https://engineeringblog.yelp.com#3">3</a>], and is also affected by seasonality in the same way that the status quo estimator is affected, although in simulations seasonality had less of an effect on the estimates than with the status quo approach.</p><p>In the cumulative spend setting, we employed the mean cumulative function estimator detailed in [<a href="https://engineeringblog.yelp.com#4">4</a>]. This mean cumulative function estimator writes the total spend of a business through day \(n\) after cohorting as the sum of the spend on the first day after cohorting, the spend on the second day after cohorting, etc., all the way through the spend on the \(n\)-th day after cohorting. Each day-\(t\) spend is then estimated as the average day-\(t\) spend of people we have seen for at least \(t\) days. Since the sample size used to estimate day-1 spend is usually much greater than that used to estimate day-60 spend, this estimator can achieve greater power than a status quo estimator that restricts the sample in each day-\(t\) spend estimate to only the people observed for all \(n\) days.</p><p>This estimator is unbiased, consistent as the number of businesses seen through day \(t\) increases, and has mean squared error no greater than that of the status quo estimator. In the presence of seasonality, this estimator will be biased in the same way the status quo estimator will be, but we observe in practice that it typically has lower mean squared error despite this fact.</p><p>The variance of our estimate of expected spend over the \(n\) days following cohorting can be written as follows. 
Mathematically, if \(s_t\) is the random variable giving the distribution over dollars spent by a business throughout their \(t\)-th day after being cohorted, then the variance of this \(n\)-day spend estimate is \[\sum_{t=1}^{n}\frac{\mathrm{Var}(s_t)}{m_t} + \sum_{t\neq t'} \frac{\mathrm{Cov}(s_t, s_{t'})}{\max(m_t, m_{t'})},\] where \(m_t=|\{i:\text{individual }i\text{ observed through day }t\}|\). This can be estimated in practice by plugging in empirical unbiased estimates of \(\mathrm{Var}(s_t)\) and \(\mathrm{Cov}(s_t,s_{t'})\). The covariance terms are summed for all \(t\neq t'\) which are both no more than \(n\). This result is similar to the one presented in [<a href="https://engineeringblog.yelp.com#4">4</a>] but differs in our level of discretization.</p><p>In order to realize any acceleration in experimentation under the new estimators, we had to create a policy where experimenters would compute their A/B test metrics using these new estimators earlier than they would under the old, uncensored approaches, all while maintaining comparable statistical power. Because of a relatively limited number of historical A/B tests with which to evaluate this speed-up empirically, we decided to rely on Monte-Carlo simulation to determine the speed-up to prescribe in practice. Although we ended up going with a simpler policy of reading metrics a fixed number of days earlier, such Monte-Carlo simulation of the speedup could be computed in a bespoke way for each proposed A/B test. This would, in some situations, achieve a much greater speed-up than available under the uniform policy we ended up using, at the expense of complexity.</p><h2 id="the-simulation-framework">The Simulation Framework</h2><p>All of our simulation data are generated according to the following probabilistic model:</p><p>An experiment is defined as a collection of initial subscription times \(t \sim \mathrm{Uniform}(0,T \text{ days})\) which arrive uniformly between 0 days and \(T\) days. 
\(T\) was a pre-set constant that was set to be \(K\) days for spend simulations and \(K+10\) days for retention simulations; these are similar enough (and well within the range of typical experiment fluctuation) that the results should be interpreted identically. Also note that this is the continuous uniform distribution: people can arrive half-way or three-quarters of the way through any given day. Every individual has some underlying mean retention time \(\mu(t)\) which is typically a constant in every scenario except Simulation 3 where \(\mu(t) = T' + b (2t/T - 1)\) to simulate within-experiment seasonality of revenue characteristics. Given the mean retention time \(\mu(t)\), the retention time \(R\) of a subscriber is exponentially distributed with mean \(\mu(t)\), which results in a subscription from time \(t\) to time \(t + R \sim t + \mathrm{Exp}(\text{mean}=\mu(t))\). Moreover, in all but the last spend simulation, the amount that someone spends in a day is precisely a constant times the fraction of a day they were an active subscriber. We don’t include non-subscribers in the spend simulation here; non-subscribers are emulated in the stress test later. For a target sample size \(m\) to collect during the simulated experiment, we independently sample \(m\) such subscriptions to create the experimental data.</p><p>For every simulated experiment, we then wish to estimate the \(n\)-day retention/spend at time \(K + T_r\) where the read time \(T_r = 0 \text{ days}, 1 \text{ day}, \ldots,\) etc. since the experiment finished. When measuring retention and spend at time \(T_r\) since the experiment finished, we do not have access to any events (e.g. 
a subscriber churning) at any time later than \(T_r\).</p><p>All experiment scenarios and results are averaged over 1000 independent trials in the retention simulations and 1500 independent trials in the spend simulations.</p><p>In this simulation, we generated experimental cohorts according to the above data model under various amounts of cohorted subscribing customers and mean retention times. These retention and sample size characteristics were chosen to run the gamut of experimental data we would expect to see in practice. Then, we matched the cohorts with the same sample size pairwise in order to compute the probability that we could detect (with a \(z\)-test) the bona-fide difference in retention / spend between the two hypothetical A/B experiences the different cohorts would receive. We estimated the \(n\)-day retention probability and \(n\)-day spend at day \(0, 1, \ldots , n\) after the experiment cohorting ended using both the status quo estimator and the Kaplan-Meier / mean cumulative function approaches. We stopped estimating spend and retention at day \(n\) after the end of the experiment because all the data are guaranteed to be uncensored at this point, and accordingly the status quo and proposed estimators coincide exactly.</p><p>In all scenarios of interest, the tests based on the uncensored approaches have lower statistical power (a lower probability of detecting the bona-fide retention difference) than the Kaplan-Meier / mean cumulative function based tests where the two are statistically comparable. This is particularly noticeable for moderate sample sizes and moderate differences: in one simulated scenario representative of reality, the status quo based test detects the difference less than half of the time on the day the experiment ends, while the Kaplan-Meier approach succeeds over 80% of the time. 
In two-thirds of the scenarios tested, the Kaplan-Meier approach’s success rate on the day the experiment ends is at least 5 percentage points higher than the status quo approach’s, and in the majority of those cases the difference is over 10 percentage points.</p><p>Looking at the simulation results differently, this can be quantified in terms of accelerating the number of days we need to achieve the same statistical power (within a 1% or similarly small relative tolerance) we would achieve if we computed the status quo estimators at day \(n\) after cohorting ends (the typical time we historically have read retention / spend experiment metrics). The speed-ups we observed for the mean cumulative function (relative to the total time used for cohorting and waiting to read retention and spend) for the various scenarios considered are presented below. The results for retention with the Kaplan-Meier estimator are similar and are not shown here. Note that the intervals of relative speed-ups are not confidence intervals — they are point estimates — but reflect the fact that the total time used to cohort and wait for retention historically has not been fixed and instead varies within a range of \(L\) to \(U\) days. If \(k\) is the number of days earlier we read our metrics, the reported interval is simply \(k / U\) to \(k / L\).</p><table><thead><tr><th class="c1"><strong>Relative Speed-up</strong></th>
<th class="c1"><strong>0.1% Power Tolerance</strong></th>
<th class="c1"><strong>1% Power Tolerance</strong></th>
<th class="c1"><strong>2% Power Tolerance</strong></th>
</tr></thead><tbody><tr><td class="c2"><strong>Mean</strong></td>
<td class="c2">20-27%</td>
<td class="c2">25-33%</td>
<td class="c2">28-38%</td>
</tr><tr><td class="c2"><strong>25th Percentile</strong></td>
<td class="c2">8-11%</td>
<td class="c2">13-18%</td>
<td class="c2">17-22%</td>
</tr><tr><td class="c2"><strong>50th Percentile</strong></td>
<td class="c2">14-19%</td>
<td class="c2">18-24%</td>
<td class="c2">23-30%</td>
</tr><tr><td class="c2"><strong>75th Percentile</strong></td>
<td class="c2">32-42%</td>
<td class="c2">43-58%</td>
<td class="c2">46-61%</td>
</tr></tbody></table><p>To incorporate these estimators across all of Yelp’s experiment analysis, we dictated that experimenters should read their experiment metrics with the speed-up corresponding to the 50th percentile speed-up we observed in these simulations, under a 0.1% power tolerance as compared to the previous status quo approach. In doing so, under the assumption that our simulations were as representative as we believe, about half of experiment settings would see power no more than 0.1% lower than the status quo approach, and almost all would see power no more than 2% lower than the status quo approach. In light of the marked increase in our ability to iterate on Yelp’s products, this felt like a more-than-fair trade to make. Since in many circumstances the speed-up can be much greater than 12-16% over status quo, bespoke recommendations can and will be made in situations when rapid experimentation is extremely important to Yelp’s bottom line.</p><h2 id="stress-test-1-robustness-against-seasonality">Stress Test 1: Robustness against Seasonality</h2><p>In order to check that our simulations don’t break down in real world scenarios, we ran a number of stress tests that injected more extreme versions of reality into our data generating model, checking that the results largely mirrored what we see with the original data model. We only considered retention in this simulation, and not cumulative spend.</p><p>In the first of these stress tests, we consider the case where the average subscriber retention time in a cohort is fixed at some number of days, but where the average retention of an individual varies with respect to when they initially make a purchase during the experiment. Fixing some day-zero bias \(b\), the average retention of an individual is linearly interpolated between \(C + b\) and \(C - b\) over the duration of the experiment in a way such that the population average stays the same. 
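A sketch of that interpolation (illustrative, with the direction as described here):

```python
def seasonal_mean_retention(t: float, T: float, C: float, b: float) -> float:
    """Mean retention time for a customer cohorted at time t in [0, T]:
    linearly interpolated from C + b at t = 0 down to C - b at t = T, so
    that uniform arrivals keep the population average at C."""
    return C + b * (1 - 2 * t / T)

# C = 60 days, b = 10, T = 70: cohorts start at a 70-day mean retention,
# end at 50 days, and average 60 days over the experiment.
```
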
For biases chosen from a predefined set of candidates, we track the bias of the \(n\)-day retention probability estimates made by the status quo and Kaplan-Meier approaches as we re-calculate the metrics after the experiment ends.</p><p>The Kaplan-Meier retention estimator has uniformly lower bias than the status quo approach where they are statistically comparable. Indeed, in situations with positive day-zero bias, the bias of the Kaplan-Meier estimator is on the order of 50% of the bias of the status quo approach. Moreover, the bias of the Kaplan-Meier estimator decreases super-linearly with respect to how long we wait to make the measurement, while the bias of the status quo estimator decreases linearly. Here, linearly means that the error is a line with a negative slope; this differs from the convergence-rate sense of “linear decrease,” which denotes a geometric decay in error. This result increases our confidence that the new estimators won’t return worse results in situations that have within-experiment seasonality.</p><p>In the second stress test, we modified the data generating model so that the amount a person spends each day is not a uniform constant multiple of whether or not they are subscribed, but instead a constant multiple of whether or not they are subscribed that varies across individuals according to some heavy-tailed and bi-modal distribution reflective of actual spend distributions in Yelp products. Bi-modality emulates the inclusion of non-spenders and those who subscribe to much cheaper products in the experiment, while the heavy tail simply reflects the distribution over purchase amounts for people who do subscribe to a variable-cost product like advertisements.</p><p>In short, the distribution over speed-ups seen in the table above is largely the same with this new noise added, although the speed-ups are slightly reduced. 
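One way to sketch such a bi-modal, heavy-tailed spend multiplier (all parameters are illustrative, not Yelp’s actual spend mix):

```python
import random

def sample_spend_multiplier(rng: random.Random) -> float:
    """Per-individual daily-spend multiplier: a point mass for cheap
    fixed-price products plus a lognormal tail for variable-cost products
    like ads. A mass at zero would emulate non-spenders."""
    if rng.random() < 0.4:
        return 5.0                       # cheap, fixed-price product
    return rng.lognormvariate(3.0, 1.0)  # heavy-tailed variable spend
```
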
Since the reduction is quite small, as seen in the following table, we can be more confident that our simplified data generating process used in the initial power simulations does reflect reality. Nevertheless, it seems prudent to revise our expectations stated earlier about the properties of our “read metrics \(n\) days earlier” policy: about half of experiment settings would see power no more than 1% lower than the status quo approach, and almost all would see power no more than 2% lower than the status quo approach.</p><table><thead><tr><th class="c1"><strong>Relative Speed-up</strong></th>
<th class="c1"><strong>0.1% Power Tolerance</strong></th>
<th class="c1"><strong>1% Power Tolerance</strong></th>
<th class="c1"><strong>2% Power Tolerance</strong></th>
</tr></thead><tbody><tr><td class="c2"><strong>Mean</strong></td>
<td class="c2">19-26%</td>
<td class="c2">22-30%</td>
<td class="c2">25-34%</td>
</tr><tr><td class="c2"><strong>25th Percentile</strong></td>
<td class="c2">0-0%</td>
<td class="c2">13-17%</td>
<td class="c2">15-21%</td>
</tr><tr><td class="c2"><strong>50th Percentile</strong></td>
<td class="c2">14-18%</td>
<td class="c2">16-22%</td>
<td class="c2">19-25%</td>
</tr><tr><td class="c2"><strong>75th Percentile</strong></td>
<td class="c2">38-51%</td>
<td class="c2">38-51%</td>
<td class="c2">38-51%</td>
</tr></tbody></table><p>The Kaplan-Meier and mean cumulative function estimators are simple-to-use tools that can return reduced-variance estimates of \(n\)-day retention and cumulative spend. In simulations, these estimators afford a speed-up in non-engineering experiment runtime of 12-16% over uncensored approaches. Combining this computational evidence with real-world experimentation has increased Yelp’s ability to iterate on our product and operations more efficiently.</p><p>I would like to thank Anish Balaji, Yinghong Lan, and Jenny Yu for crucial advice and discussion needed to implement the changes to experimentation described here across Yelp. In addition, I genuinely appreciate all the comments from Blake Larkin, Yinghong Lan, Jenny Yu, Woojin Kim, Daniel Yao, Vishnu Purushothaman Sreenivasan, and Jeffrey Seifried that helped refine this blog post from its initial draft into its current form.</p><ol><li>Wasserman, L. “All of statistics: a concise course in statistical inference.” Springer-Verlag New York, 2004.</li>
<li>Bitouzé, D., B. Laurent, and P. Massart. “A Dvoretzky–Kiefer–Wolfowitz type inequality for the Kaplan–Meier estimator.” In Annales de l’Institut Henri Poincare (B) Probability and Statistics, vol. 35, no. 6, pp. 735-763. 1999.</li>
<li>Luo, D., and S. Saunders. “Bias and mean-square error for the Kaplan-Meier and Nelson-Aalen estimators.” In Journal of Nonparametric Statistics, vol. 3, no. 1, pp. 37-51, 1993.</li>
<li>Nelson, W. “Confidence Limits for Recurrence Data – Applied to Cost or Number of Product Repairs.” In Technometrics, vol. 37, no. 2, pp. 147-157, 1995.</li>
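</ol><p>As a concrete companion to the estimator discussed above, here is a minimal pure-Python Kaplan-Meier estimate of \(n\)-day retention from right-censored data. This is an illustrative sketch only, not Yelp’s implementation:</p>

```python
# Minimal Kaplan-Meier estimator for right-censored retention data.
# An illustrative sketch -- not Yelp's implementation.
# Each observation is (time, observed): observed=True means the event
# (e.g. churn) happened at `time`; False means censoring at `time`.

def kaplan_meier(observations, t):
    """Return S(t): the estimated probability of surviving past time t."""
    event_times = sorted({time for time, observed in observations if observed})
    survival = 1.0
    for time in event_times:
        if time > t:
            break
        at_risk = sum(1 for obs_t, _ in observations if obs_t >= time)
        died = sum(1 for obs_t, observed in observations
                   if observed and obs_t == time)
        survival *= 1.0 - died / at_risk
    return survival

# Five users: churn at day 2 and day 5; three still active (censored).
data = [(2, True), (3, False), (5, True), (6, False), (7, False)]
retention_7d = kaplan_meier(data, 7)  # S(7) = (1 - 1/5) * (1 - 1/3) = 8/15
```

<p>Note how the censored users still contribute to the at-risk counts for as long as they are observed, which is exactly where the variance reduction over discarding partially observed users comes from.</p><ol>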
</ol><div class="island job-posting"><h3>Become an Applied Scientist at Yelp</h3><p>Want to impact our product with statistical modeling and experimentation improvements?</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/bc9dbcab-f8c8-475b-8637-4dc3becb790c?description=Applied-Scientist_Engineering_San-Francisco-CA?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/02/accelerating-retention-experiments-with-partially-observed-data.html</link>
      <guid>https://engineeringblog.yelp.com/2020/02/accelerating-retention-experiments-with-partially-observed-data.html</guid>
      <pubDate>Thu, 20 Feb 2020 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Yelp Takes on Grace Hopper 2019!]]></title>
      <description><![CDATA[<p>Last October we sent a group of Yelpers to the 2019 Grace Hopper Celebration! Here are a few takeaways and reflections from some of our attendees.</p><h2 id="who-attended">Who attended?</h2><ul><li>Surashree K., software engineer on Semantic Business Information</li>
<li>Clara M., product design lead on Content</li>
<li>Anna F., machine learning engineer on Semantic Business Information</li>
<li>Nikunja G., software engineer on Infrastructure Security</li>
<li>Catlyn K., software engineer on Stream Processing</li>
</ul><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/grace-hopper-2019/GHC_group_photo.jpg" alt="" /></div><h2 id="what-was-your-favorite-session">What was your favorite session?</h2><p><strong>Surashree</strong>: Honestly, it’s hard to choose, but the one that stuck with me was the talk by Jackie Tsay and Matthew Dierker on Google’s Smart Compose, the Gmail feature that helps people write emails faster by auto-completing sentences. It was interesting to learn about inherent biases the earlier versions of the model had, and the engineering decisions that went into combatting those. The speakers also talked about some of the feedback they received; one that was especially moving was from a non-native English speaker who was happy to have a feature that would make writing emails in English easier.</p><p><strong>Clara</strong>: Definitely the talk on AI Meets Creativity by Dr. Pinar Yanardag. It was fascinating how she analyzed the way AI algorithms can actually inspire the creative process. She shared an example of an algorithm that analyzed dress patterns and generated a pattern that a fashion designer went on to create. She also shared examples of how this could work for graffiti art, pizza recipes, and even making perfume. The talk was not only visually compelling but also made me think a lot more about how AI can actually help boost creativity rather than stifle it.</p><p><strong>Anna</strong>: One of my favorite sessions was FarmBeats, Microsoft’s AI and IoT system for agriculture by Zerina Kapetanovic. She first described the challenges in setting up “smart” data-driven agriculture, including low rural internet connectivity, electricity access, and the high cost of sensors. She then walked us through creative solutions for each problem, ranging from clever uses of solar panels to dangling smartphones from balloons to approximate drone footage. 
It was inspiring to see how this collection of workarounds and approximations came together into a coherent and precise solution.</p><p>A recurring theme throughout many of the sessions I attended was how AI can enhance human capabilities by putting better tools in more people’s hands. AI enables us to get great results even in uncontrolled situations or where precision hardware isn’t available. The FarmBeats talk, described above, demonstrates this in the field of agriculture. I’m excited to see what specialist tools AI will make commonplace in the future.</p><p><strong>Nikunja</strong>: I work in Security, and to see a good representation of women in this field was a welcome change. One of my favorite sessions was the interactive security game put together by three engineers working at OneMedical. The game challenged you to secure a fictional organization by prioritizing security projects within an ever-changing threat landscape. The session was highly interactive, informational, and most of all, extremely fun! I never anticipated that such a session could be presented at a conference like Grace Hopper, and I brought back some major takeaways to share with my team at Yelp.</p><h2 id="what-was-the-best-career-advice-you-received">What was the best career advice you received?</h2><p><strong>Surashree</strong>: My one key takeaway from the conference was the importance of standing up for yourself and others. One of the talks by the CEO of AnitaB.org, Brenda Wilkerson, and COO Jacqueline Copeland, highlighted the still pertinent issue of the gender pay gap in tech and how it isn’t enough for companies to simply hire more diverse people; they also need to create an environment where all groups feel supported. 
One of our mottos here at Yelp is “Play well with others,” and this talk reminded me that the confidence we have in our daily lives comes from a certain level of privilege that we have to recognize.</p><p><strong>Catlyn</strong>: Don’t be ashamed to ask for more. According to one of the execs from Uber, women ask for less than their worth during the hiring process, resulting in a skewed sense of self-evaluation. We also tend to refrain from responsibility unless we’re certain we’ll be the perfect candidate for the role. But no one’s perfect and it’s okay to figure things out along the way. You’ll never be 100% ready, so why not just seize the opportunity and enjoy the challenge!</p><p><strong>Surashree</strong>: Be prepared to be totally overwhelmed! There are some things you can do to make your life easy during those three days—download the app, check your schedule every day, carry a bottle of water, talk to as many people as possible, and attend as many talks as you can. But really, the sheer size of the conference and the activity around you will be hard to take in at first. Our recruiting team does a wonderful job of organizing, so our job as attendees is really to just make the most of the GHC experience.</p><p><strong>Nikunja</strong>: I feel that however much you prepare, in the end you’ll still feel overwhelmed and unprepared, so my number one suggestion is to go with the flow once you’re there. Having said that, it’s absolutely essential to do some prep before going, like organizing your schedule (pre-registering for sessions and having a good balance of booth duty and conference talks). Also, try attending a mix of sessions! GHC is unique in that it has so many different tracks in one place, so take advantage of it.</p><p><strong>Anna</strong>: Take travel time into account when signing up for sessions! 
The conference center is half a mile long, and you don’t want to miss a session or show up out of breath due to poor planning.</p><h2 id="what-was-your-most-memorable-moment">What was your most memorable moment?</h2><p><strong>Surashree</strong>: My most memorable moment came right at the beginning of the conference: the keynote by Aicha Evans, the CEO of Zoox. An immigrant from Senegal, she went from a domestic life to being the CEO of a company that’s building the next generation of autonomous cars. Her story was soft, gritty, and inspiring—all at once. Her question, “Whose genius are you going to ignite?” highlighted the importance of mentorship and giving back, something I believe we do very well here at Yelp.</p><p><strong>Clara</strong>: The closing keynotes with the DJ playing! It was such a fun atmosphere and a great way to close out the conference by hearing from so many amazing women doing innovative things in their industry.</p><p><strong>Nikunja</strong>: To be honest, this is a tough one, as the whole experience was very memorable in itself. However, there’s one that takes the cake. The conference has several award categories; one of them is the Student of Vision award, which was given to Jhilika Kumar, an undergrad student at Georgia Tech. At such a young age, Jhilika is the founder of AxisAbility, an organization she started to improve the lives of differently abled people. Her passion for this cause stems from personal experience. Her brother faced so many challenges in his youth, and she wants to help him and others like him lead a better life. Her video, speech, and determination left so many people inspired, and showed us that no matter how young you are, you can create change.</p><h2 id="why-should-one-go-to-ghc">Why should one go to GHC?</h2><p><strong>Clara</strong>: Honestly, at first I was skeptical about going to GHC as a product designer since I always thought it was a conference for software engineers. 
However, once I was there I realized how impactful it is for any woman working in the tech industry. I not only learned about new technologies, but also got the chance to be inspired by and network with other women in my industry. I was also surprised by how many people came by the booth looking to talk specifically with me since they knew Yelp had sent a product designer, which apparently not many other companies had. In general, it’s a great opportunity for everyone—product designers included!</p><p><strong>Catlyn</strong>: It was a great experience to be surrounded by so many brilliant women engineers who either are or once were facing the same career challenges that I am right now. I felt enlightened and empowered attending the talks and chatting with others from companies all over the world. I would strongly encourage everyone, especially those early on in their career, to attend some of the workshops to help you find out what kind of path you want to pave going forward and how you can get there.</p><p><strong>Surashree</strong>: Knowing what I know now, the biggest reason to go to GHC for me is to hear the stories from the lives of other female engineers. Working at Yelp, in the harmonious and safe environment that we have, it can be easy to overlook that not everyone has had the same advantages as I have, and not everyone’s experiences in tech have been the same. There are women who’ve had to deal with difficult situations, perhaps a toxic work culture or misogyny in some form, and have come out stronger and brave enough to talk about it at conferences like this. 
So I’d say, go to GHC to learn about other people’s experiences and gain new perspectives!</p><div class="island job-posting"><h3>Become a Web Developer at Yelp Toronto!</h3><p>Join our Engineering team and help millions of people connect with local businesses on Yelp.</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/a6cfee89-2dd0-4451-bf52-746b9547dfb7?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/02/grace-hopper-yelp-2019.html</link>
      <guid>https://engineeringblog.yelp.com/2020/02/grace-hopper-yelp-2019.html</guid>
      <pubDate>Wed, 12 Feb 2020 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Open-Sourcing Varanus and Rusty Jetpack]]></title>
      <description><![CDATA[<p><em>The <strong>monitor lizards</strong> are large lizards in the genus <strong><a href="https://en.wikipedia.org/wiki/Monitor_lizard">Varanus</a>.</strong></em></p><p>Some time ago, our Android app got into a loop of sending data, due to some unlikely interactions between several different systems, which briefly overwhelmed our servers before we were able to turn it off. Fortunately, key code was behind an experiment. Otherwise, apps could have continued misbehaving for days, as there is no guarantee users would immediately update the app. It took an unusual combination of circumstances for this to happen, but this kind of problem seems to be a pervasive concern across the industry, and there are few tools to prevent it.</p><p>Furthermore, even at the best of times, mobile data can be hard to manage. Every now and then an article comes up about how a widely used app has eaten up users’ data. Unfortunately, there aren’t very many good tools for tracking how much data is sent, or what exactly is responsible for sending too much data in the first place.</p><p>Also, because updates are optional, all the code you’ve ever written is out there, somewhere (and we’ve had an Android app for almost as long as Android has existed!). If something goes wrong, you may not be able to push a fix to enough people, and with millions of users, all sorts of strange things can happen. While it’s unlikely that something goes catastrophically wrong and you can’t get enough people to update, it’s not impossible.</p><h2 id="what-does-varanus-do">What Does Varanus Do?</h2><p>In building out Varanus, we had two main goals:</p><ol><li>Always be able to turn off unwanted data on the client, no matter what.</li>
<li>Observe how much traffic is generally being sent so we can spot if something weird happens.</li>
</ol><p>We also had three constraints:</p><ol><li>It should be exceptionally simple and hard to break.</li>
<li>It should work without anyone having to do anything.</li>
<li>It can be dropped into our different apps with minimal effort.</li>
</ol><p>Also, since it seemed that a lot of people were concerned with this problem but lacked the resources to spend time fixing it, we saw this as an opportunity to contribute something useful to the community.</p><h2 id="how-does-it-work">How Does It Work?</h2><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-02-06-open-sourcing-varanus-and-rusty-jetpack/varanus-diagram.png" alt="All traffic on the app automatically passes through Varanus" /><p class="subtle-text"><small>All traffic on the app automatically passes through Varanus</small></p></div><p>Basically, since all network traffic passes through Varanus, Android developers don’t even have to think about it for it to work. It counts the number of bytes and requests, and bins them by arbitrary categories of traffic that can be specified programmatically. An error message from the server (or <a href="https://en.wikipedia.org/wiki/Content_delivery_network">CDN</a>) then tells the app to hold off on sending more traffic for a bit—one message says to stop sending all traffic, and the other says to stop a specific category of traffic.</p><p>The code is entirely client-side, and no new backend infrastructure is needed (as long as you have a way of sending custom HTTP error codes from your server if necessary). Also, no coordination between devices is required, and turning off traffic is simple: all you need is a runbook.</p><p>Varanus is built around OkHttp interceptors, but with a bit of extra work, there’s no reason other clients couldn’t be supported. It’s also written entirely in Kotlin (like all new code at Yelp).</p><h2 id="where-can-i-find-more-details">Where Can I Find More Details?</h2><p>Take a look at the <a href="https://github.com/Yelp/android-varanus/blob/master/README.md">README</a>, or the code itself. 
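</p><p>To make the mechanism concrete, here’s a minimal, language-agnostic sketch in Python of the client-side counting and shutoff logic described above. Varanus itself is Kotlin built on OkHttp interceptors, and the status codes and names below are invented for illustration, not Varanus’s actual API:</p>

```python
import time

# Hypothetical custom HTTP status codes a server/CDN might use to signal
# "hold off on traffic" -- Varanus's real signaling is documented in its README.
STOP_ALL_TRAFFIC = 598
STOP_CATEGORY = 599

class TrafficMonitor:
    """Counts bytes/requests per category and honors server-directed shutoffs."""
    def __init__(self, hold_off_seconds=60.0):
        self.bytes_sent = {}
        self.requests_sent = {}
        self.category_blocked_until = {}  # category -> timestamp
        self.all_blocked_until = 0.0
        self.hold_off_seconds = hold_off_seconds

    def allow(self, category, now=None):
        """Should this request be sent at all?"""
        now = time.time() if now is None else now
        if now < self.all_blocked_until:
            return False
        return now >= self.category_blocked_until.get(category, 0.0)

    def record(self, category, num_bytes):
        """Bin outgoing traffic by category."""
        self.bytes_sent[category] = self.bytes_sent.get(category, 0) + num_bytes
        self.requests_sent[category] = self.requests_sent.get(category, 0) + 1

    def on_response(self, category, status, now=None):
        """React to a server/CDN shutoff message."""
        now = time.time() if now is None else now
        if status == STOP_ALL_TRAFFIC:
            self.all_blocked_until = now + self.hold_off_seconds
        elif status == STOP_CATEGORY:
            self.category_blocked_until[category] = now + self.hold_off_seconds

monitor = TrafficMonitor()
monitor.record("analytics", num_bytes=2048)
monitor.on_response("analytics", STOP_CATEGORY, now=100.0)
monitor.allow("analytics", now=110.0)  # False: this category is held off
monitor.allow("search", now=110.0)     # True: other traffic is unaffected
```

<p>A real interceptor would wrap every outgoing request with the equivalent of <code>allow</code>/<code>record</code> and inspect each response with the equivalent of <code>on_response</code>, which is exactly what the OkHttp interceptor arrangement gives Varanus for free.</p><p>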
We have a sample app that explains how it should be used.</p><p>In preparation for targeting Android 10, Yelp’s Android apps were migrated to use AndroidX libraries. Unfortunately, with the size of our apps’ codebases, the provided migration tool in Android Studio didn’t work for us. Rusty Jetpack was then born as a <a href="https://engineeringblog.yelp.com/2018/11/all-about-yelp-hackathon.html">Hackathon</a> project to help ensure seamless adoption across many developers with little downtime.</p><h2 id="what-does-rusty-jetpack-do">What Does Rusty Jetpack Do?</h2><p>The tool migrates all files in a git repository to use the new AndroidX package namespaces. This includes imports, fully qualified references, ProGuard declarations, and warnings about Gradle packages that need to be changed. While this does mean the code won’t compile immediately after using the tool, most of the mundane work is taken care of. And best of all, it achieves all of this in under one second for our largest repository!</p><p>Rusty Jetpack is critical to preventing downtime during migrations with rapidly changing codebases. A migration can easily be kept up to date with the latest changes by re-running the tool; once pushed, it can be distributed to developers for quick adoption without major disruption.</p><p>To learn more, <a href="https://github.com/Yelp/rusty_jetpack">check out the repository here</a>!</p><div class="island job-posting"><h3>Become an Android Software Engineer at Yelp</h3><p>Want to help us make even better tools for our Android engineers?</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/2c6736d6-7c8e-4f57-8912-15a71815eef0?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/02/open-sourcing-varanus-and-rusty-jetpack.html</link>
      <guid>https://engineeringblog.yelp.com/2020/02/open-sourcing-varanus-and-rusty-jetpack.html</guid>
      <pubDate>Thu, 06 Feb 2020 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Modernizing Ads Targeting Machine Learning Pipeline]]></title>
      <description><![CDATA[<p>Yelp’s mission is to connect users with great local businesses. As part of that mission, we provide local businesses with an <a href="https://biz.yelp.com/support/advertising">ads product</a> to help them better reach out to users. This product strives to showcase the most relevant ads to the user without taking away from their overall search experience on Yelp. In this blog post, we’ll walk through the architecture of how this is made possible by using one of the largest machine learning systems at Yelp: <strong>Ads Targeting System</strong>.</p><p>The Ads Targeting System is a machine learning (ML) system designed to serve only the most relevant ads to users based on their intentions and context on Yelp. There are two primary types of ML models in the ads targeting domain: Click Through Rate (CTR) prediction, and Objective Targeting (OT). Both help determine the likelihood of downstream actions, such as calling a business after clicking on an ad.</p><p>In this post, we’ll primarily focus on architecting ML systems at scale rather than on algorithmic details or feature engineering. For more info on the algorithmic side of our CTR prediction model, check out one of our previous <a href="https://engineeringblog.yelp.com/2018/01/growing-cache-friendly-trees-part2.html">posts</a> where we discuss optimizations made to the XGBoost prediction library.</p><p>Below is a simplified version of the Ads Targeting and Delivery System:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-01-30-modernizing-ads-targeting-machine-learning-pipeline/overview_ads_targeting_system.png" alt="" /></div><p>The <strong>Ad Delivery</strong> service is a low-latency online service written in Java that processes incoming ad requests. 
It generates features for the incoming request using the <strong>Ad Feature Generation</strong> library, loads the model from the <strong>Ad Model Store</strong>, generates CTR predictions, and then ranks ads accordingly.</p><p>The <strong>Ad Targeting service</strong> is a batch processing service written with <a href="https://github.com/Yelp/mrjob">mrjob</a>, a Python MapReduce library open-sourced by Yelp. This is the service that we’ll discuss and redesign in this blog post. Its main features include processing logs using the same Ad Feature Generation library, training ML models, and storing them in the <strong>Ad Model Store</strong>. mrjob also comes with a feature that allows you to call Java code from Python to carry out MapReduce operations (as can be seen <a href="https://github.com/Yelp/mrjob/blob/master/mrjob/step.py#L421">here</a>). Using the same Feature Generation library ensures that all feature computation, both online and offline, remains consistent.</p><p>This <a href="https://engineeringblog.yelp.com/2018/01/building-a-distributed-ml-pipeline-part1.html">blog post</a> on CTR prediction illustrates how the Ads Targeting Machine Learning Pipeline used to look:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-01-30-modernizing-ads-targeting-machine-learning-pipeline/old_pipeline_stages.png" alt="" /></div><p>We processed Ad Event JSON logs, downsampled them in Spark, and extracted features from the set of logs with Hadoop MR jobs. We then proceeded to model training with XGBoost and model evaluation with Hadoop MR jobs using AWS EMR as the compute infrastructure. This pipeline served us well and helped us iterate in an ad hoc fashion to create newer and better ad targeting models. That being said, we did face several issues as the system matured due to the following:</p><ul><li>As all stages of the pipeline were closely coupled, failure in any of the intermediate steps required restarting the pipeline</li>
<li>This close coupling also meant that changing the feature generation logic or sampling strategy required running the entire pipeline</li>
<li>The pipeline was closely tied to certain EMR instance types and AMI images, which blocked upgrades to newer versions of Java and kept us from trying newer EMR instance types (e.g., upgrades in other Java dependencies and online Java services wouldn’t work with the current Ad Targeting service, making it impossible to retrain a model or add a new feature)</li>
<li>As our system matured and we started adding more models to our ads targeting system, the cost of training grew</li>
</ul><p>To solve the above issues, we decided to re-architect the Ad Targeting Service and its interaction with the other main components of the Ad Targeting and Delivery Systems. Keeping an eye on the big picture and setting goals is very important when re-designing a system as large as this. For us, that meant focusing on:</p><ul><li>Making it easy to retrain existing models</li>
<li>Making feature generation cheaper, easier, and faster</li>
<li>Leveraging Yelp’s internal Spark tooling and infrastructure (rather than relying on EMR)</li>
<li>Improving monitoring and alerting, and providing easy promotion of models in production</li>
</ul><p>We decided to use Spark as the underlying engine for this ML system as it allowed us to leverage our own in-house Spark on Mesos infrastructure. This infrastructure provides us with a quick and cheap way to spin up clusters and get started with writing big data workflows. Moreover, moving away from Hadoop map-reduce jobs on EMR increased speed and cut costs. This, coupled with the availability of PySpark (the official Python API for Spark), made the decision even easier, since most of our code and infrastructure is built with Python.</p><p>Armed with better infrastructure and tooling around Spark and its natural fit to our big data ML use-case, we decided to rewrite the Ads Targeting Service in PySpark. The new service now contains the same stages as before: <code class="highlighter-rouge">Sampling -&gt; Feature generation -&gt; Training -&gt; Evaluation</code>, just with all the stages computed with PySpark.</p><h2 id="overview-of-modernized-architecture-based-on-spark">Overview of Modernized Architecture Based on Spark</h2><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-01-30-modernizing-ads-targeting-machine-learning-pipeline/new_pipeline_stages.png" alt="" /></div><p>Above is the current machine learning pipeline powered by the updated Ad Targeting Service. Three significant changes were made here:</p><h3 id="use-spark-as-the-compute-infrastructure">Use Spark as the Compute Infrastructure</h3><p>Spark batches were more efficient both in terms of time and cost. Moving to Spark allowed us to leverage the existing infrastructure at Yelp that enabled us to write ETL jobs and carry out distributed machine learning with XGBoost. 
This was a very cost-effective move since now we only pay for spot EC2 compute resources (and not for the EMR stack on top of it!).</p><h3 id="decouple-the-ml-pipeline-into-stages">Decouple the ML Pipeline into Stages</h3><p>We created batches that process logs, perform sampling, and generate features; they are scheduled to run daily and checkpoint results to S3. Decoupling these batches gave us flexibility: we can now build different feature generation strategies on top of the same sampling output, whereas in the older architecture each new feature generation strategy required re-computing the sampling output. It also made the system more robust: since failures in later stages (say, training or evaluation) didn’t disrupt the whole pipeline, engineering and operating costs were reduced.</p><h3 id="automated-monitoring-and-alerting">Automated Monitoring and Alerting</h3><p>We leveraged Yelp’s modernized Data Landscape (<a href="https://engineeringblog.yelp.com/2016/07/billions-of-messages-a-day-yelps-real-time-data-pipeline.html">1</a>, <a href="https://engineeringblog.yelp.com/2016/08/more-than-just-a-schema-store.html">2</a>, <a href="https://engineeringblog.yelp.com/2016/11/open-sourcing-yelps-data-pipeline.html">3</a>) and built our monitoring capabilities on top of this infrastructure. Instead of manually running Jupyter notebooks to monitor metrics, we computed these in batches, loaded them into our <strong>AWS Redshift</strong> data warehouse, and created <strong>Splunk</strong> dashboards on top of them. This made it really easy for PMs and engineers to make model promotion/deployment decisions.</p><h2 id="feature-generation-with-java-and-pyspark">Feature Generation with Java and PySpark</h2><p>The online Ad Delivery Java service and the offline Ad Targeting Python service share the same Ad Feature Generation Java library. The question then arises: how do we leverage PySpark to generate features? 
<em>Hint: What language is Spark written in?</em> Let’s unpack this!</p><h3 id="dataflow-in-pyspark">DataFlow in PySpark:</h3><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2020-01-30-modernizing-ads-targeting-machine-learning-pipeline/pyspark_dataflow.png" alt="" /></div><p>Spark is written in Scala (a JVM language), and PySpark is a Python wrapper on top of it. PySpark relies on <a href="https://www.py4j.org/">Py4J</a> to execute Python code that can call on objects that reside in the JVM. To do that, Py4J uses a <a href="https://www.py4j.org/py4j_java_gateway.html">gateway</a> between the JVM and the Python interpreter, and PySpark sets it up for you with SparkContext. This SparkContext has access to the JVM and all packages and classes known to the JVM. You can see where this is heading…</p><p>To carry out distributed feature generation via PySpark, all we had to do was add our feature generation JAR to the Spark JVM and use SparkContext to refer to these classes. Since Yelp executes Spark within Docker, we added the JARs to our service’s Docker images, then loaded the image in Spark drivers and executors. We then had a feature generation Java library accessible via PySpark! The diagram above, taken from the <a href="https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals">PySpark Wiki</a>, illustrates the above design. As you can see, Py4J essentially carries out all data communication with the JVM.</p><h3 id="implementation">Implementation:</h3><p>You can imagine a simple example of a Java class with a method that prints “Hello World!” that is then called from PySpark to get the printed string: “Hello World!”. This implementation was illustrated in <a href="https://aseigneurin.github.io/2016/09/01/spark-calling-scala-code-from-pyspark.html">this blog</a> (in Scala, but the same principle applies for Java), so we won’t get into it here. 
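</p><p>Before diving into the JVM plumbing, here is a toy, pure-Python analogue of the parameterized-mapper pattern we use (all class and field names below are invented for illustration, not Yelp’s actual classes): a flat-map function object is constructed with a list of mappers and applied to every record in a partition.</p>

```python
# Toy, JVM-free analogue of the parameterized-mapper pattern.
# All names are illustrative -- not Yelp's actual classes.

class ExtractField:
    """Mapper that pulls a single field out of a raw JSON-like log record."""
    def __init__(self, field):
        self.field = field

    def __call__(self, record):
        return record.get(self.field)

class FlatMapWithMappers:
    """Analogue of a Java class implementing FlatMapFunction: it is
    constructed with parameterized mappers (which a zero-argument-constructor
    Java UDF cannot be) and applied to each record in a partition."""
    def __init__(self, mappers):
        self.mappers = mappers

    def __call__(self, partition):
        for record in partition:
            yield [mapper(record) for mapper in self.mappers]

ad_logs = [{"ad_id": 1, "clicked": 1}, {"ad_id": 2, "clicked": 0}]
fn = FlatMapWithMappers([ExtractField("ad_id"), ExtractField("clicked")])
rows = list(fn(iter(ad_logs)))  # one feature row per input record
# rows == [[1, 1], [2, 0]]
```

<p>In the real pipeline the analogous object lives on the JVM and is driven through Py4J, but the shape of the computation is the same.</p><p>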
Instead, we’ll demonstrate how to apply this principle to our use-case.</p><p>Say we have some JSON logs containing information about ads that can be read in PySpark as a PythonRDD. Now, we want to extract/transform features from these logs using our Java library. One way of doing this is via a Java UDF (as is illustrated <a href="https://dzone.com/articles/pyspark-java-udf-integration-1">here</a>). However, there’s a limitation to this approach: it requires Java classes to have a zero-argument constructor. This can be seen in the official <a href="https://github.com/apache/spark/blob/be4faafee43d7b8810cf19deacd22e91b19ccfc6/sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala#L685">Spark UDF Registration</a> code. Since we wanted the ability to parameterize our classes for our use case, this approach didn’t work for us.</p><p>Hence, we went with the following: first, we created a Java class that implements the <a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/function/FlatMapFunction.html"><code class="highlighter-rouge">FlatMapFunction</code></a> interface. This allowed us to generate any number of output rows per row of the input RDD by passing an object of this class to Spark’s <a href="https://spark.apache.org/docs/latest/api/python/pyspark.html?#pyspark.RDD.mapPartitions"><code class="highlighter-rouge">mapPartitions</code></a> function. The Java class also gets a list of Java mappers that we want to apply to the input RDD to generate the output fields. One of these Java mappers calls into the feature generation library to extract the transformed features. The library itself essentially consists of simple Java classes that can extract features by applying simple transforms or business logic on raw JSON logs.</p><p>Now that we have all the Java classes ready, we can do the following on the Python side:</p><div class="language-python highlighter-rouge highlight"><pre>
from pyspark.mllib.common import _java2py
from pyspark.mllib.common import _py2java
# Step 1: First convert the PythonRDD object into a java RDD object.
java_rdd_object = _py2java(python_rdd.ctx, python_rdd)
# Step 2: Get the Java class that implements the FlatMapFunction interface, initialize it,
# and pass some mappers to it to apply to the Java RDD
java_flat_map_function_object = flat_map_function_package.ClassWithFlatMapFunctionInterface(
    initParamA,
    initParamB,
    [
      MapperA(arg_a),
      MapperB(arg_b),
      MapperForFeatures(),
      MapperForLabels()
    ]
)
# NOTE: As one can see above, we can parameterize our mappers as opposed to JavaUDF functions
# Step 3: Call mapPartitions on that Java object (effectively calling Java code)
# and get the output as a Java RDD instance.
mapped_java_rdd_object = java_rdd_object.mapPartitions(
        java_flat_map_function_object
    )
# The above mapped_java_rdd_object now contains the results of all 4 mappers applied
# Step 4: Convert the Java RDD object back into a Python RDD object.
mapped_python_rdd = _java2py(python_rdd.ctx, mapped_java_rdd_object)
</pre></div><p>Voila! Now we have a PythonRDD of features that was generated via the Java feature generation code.</p><h2 id="model-training-with-distributed-xgboost-on-spark">Model Training with Distributed XGBoost on Spark</h2><p>We use <a href="https://www.mlflow.org/docs/latest/tracking.html">MLFlow-tracking</a> to track and log our model training runs. This provides us with a lot of visibility into our model training metrics, an easy way of logging and visualizing hyperparameters, and even sharing the model-training reports. Another cool feature of MLFlow-tracking is the ability to query the model-training runs and retrieve the best models based on metrics. We leverage this feature to automate our evaluation and monitoring pipelines.</p><p>To train our ads targeting models, we heavily rely on <a href="https://xgboost.readthedocs.io/en/latest/">XGBoost</a>. However, distributed training with XGBoost on Spark took some work to accomplish. Since the official library (version &lt;= 0.9) doesn’t provide a Python/PySpark interface, we wrote our own wrapper on top of <a href="https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html">XGBoost4J-Spark</a>. We also implemented a SparseVectorAssembler instead of using the <a href="http://spark.apache.org/docs/latest/ml-features#vectorassembler">VectorAssembler</a> provided by Spark, since the default implementation doesn’t integrate well with XGBoost on Spark (<a href="https://github.com/dmlc/xgboost/pull/4805">issues</a> dealing with missing values). Another limitation of XGBoost on Spark is that it’s not as fault-tolerant as native Spark algorithms. This becomes an issue when trying to use AWS Spot instances for model training: when spot instances become unavailable, the training job dies. 
Thus, we created a separate pool of on-demand resources to carry out large-scale distributed training with XGBoost on Spark.</p><h2 id="automated-retraining-monitoring-and-alerting">Automated Retraining, Monitoring, and Alerting</h2><p>We use <a href="https://github.com/Yelp/Tron">Yelp’s Tron</a> scheduler to schedule our batch processing jobs. The entire new pipeline is scheduled via Tron: log-processing and feature-generation batches run daily, and model training runs every few days. Through A/B experiments, we’ve observed that simply retraining models on newer data leads to a <strong>~1% improvement</strong> in our primary metric, which compounds over time.</p><p>While scheduling model retraining is simple, deployment, monitoring, and alerting are harder. To give developers and PMs the confidence to go into production with newly trained models, we built a solid monitoring infrastructure that provides the following:</p><ul><li>Daily model evaluation that replays traffic for all models in production and models yet to be deployed in production (this helps us capture model drift and decay)</li>
<li>Live Splunk dashboards of business, online model, and offline model evaluation metrics</li>
<li>Scoring verification systems that verify that online and offline scores match and ensure that features don’t drift between online and offline modes</li>
</ul><p>Having a good monitoring infrastructure improves developer velocity in deploying newly trained models. It’s analogous to having a good CI/CD infrastructure for code deployment.</p><p>With this newly designed service and regular retraining-deployment cycle, we’ve seen a vast improvement in our model metrics, which has translated into better business metrics such as click-through rate, sell-through rate, and lower cost per click for our advertisers.</p><p>This means that we’re not only serving more relevant and useful ads to our users, but have also reduced the cost for our advertisers to serve ads, making Yelp a more cost-effective platform for their business.</p><h2 id="conclusion">Conclusion</h2><ul><li>Designing large ML systems is hard due to additional complexities introduced by data and models, but it’s especially important when it’s a big part of your product</li>
<li>Sometimes ML systems need to evolve (from Hadoop MR to Spark); ML engineers shouldn’t shy away from this just because it’s infrastructure and not modeling</li>
<li>Decouple system components and checkpoint data often so that each component can be worked on and improved independently</li>
<li>Create infrastructure that makes training, evaluating, and monitoring models easy and automated, and that instills confidence in developers to deploy newly trained models</li>
<li>Retraining models with newer data can provide good gains with almost zero effort, so take advantage of it!</li>
</ul><h2 id="acknowledgements">Acknowledgements</h2><p>A huge thanks to engineers from the Applied ML, Core ML, and Ads Platform teams, without whom such a broad cross-team collaborative effort wouldn’t have been possible. Credit to the contributors: Chris Farrell, Jason Sleight, Aditya Mukherjee, Abhy Vytheeswaran, Vincent Kubala.</p><div class="island job-posting"><h3>Become a Machine Learning Engineer at Yelp</h3><p>Want to build state of the art machine learning systems at Yelp? Apply to become a Machine Learning Engineer today.</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/f674cef9-b635-4f25-8dd9-66663494392a?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/01/modernizing-ads-targeting-machine-learning-pipeline.html</link>
      <guid>https://engineeringblog.yelp.com/2020/01/modernizing-ads-targeting-machine-learning-pipeline.html</guid>
      <pubDate>Thu, 30 Jan 2020 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Streams and Monk – How Yelp is Approaching Kafka in 2020]]></title>
      <description><![CDATA[<p><a href="https://engineeringblog.yelp.com/2020/01/streams-and-monk-how-yelp-approaches-kafka-in-2020.html">Streams and Monk – How Yelp is Approaching Kafka in 2020</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/01/streams-and-monk-how-yelp-approaches-kafka-in-2020.html</link>
      <guid>https://engineeringblog.yelp.com/2020/01/streams-and-monk-how-yelp-approaches-kafka-in-2020.html</guid>
      <pubDate>Wed, 22 Jan 2020 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Automated IDOR Discovery through Stateful Swagger Fuzzing]]></title>
      <description><![CDATA[<p>Scaling security coverage in a growing company is hard. The only way to do this effectively is to empower front-line developers to easily discover, triage, and fix vulnerabilities before they make it to production servers.</p><p>Today, we’re excited to announce that we’ll be open-sourcing <a href="https://github.com/Yelp/fuzz-lightyear">fuzz-lightyear</a>: a testing framework we’ve developed to identify <a href="https://blog.detectify.com/2016/05/25/owasp-top-10-insecure-direct-object-reference-4/">Insecure Direct Object Reference (IDOR) vulnerabilities</a> through stateful <a href="https://www.wired.com/2016/06/hacker-lexicon-fuzzing/">Swagger fuzzing</a>, tailored to support an enterprise, microservice architecture. This integrates with our Continuous Integration (CI) pipeline to provide consistent, automatic test coverage as web applications evolve.</p><h2 id="the-problem">The Problem</h2><p>As a class of vulnerabilities, IDOR is arguably one of the most difficult to systematically defend against in an enterprise codebase. Its ease of exploitation, combined with its potential for impact, makes it a high-risk vulnerability that we want to minimize as much as possible.</p><p>In the security industry, there are two main approaches to defending against threats. First, try to <strong>prevent</strong> them from happening. If this isn’t possible, make sure you can <strong>detect</strong> them for fast remediation.</p><p>The problem with IDOR is that it’s difficult to do either one.</p><h3 id="hard-to-prevent">Hard to Prevent</h3><p>The main problem with preventing IDOR vulnerabilities is that there’s no system that can be easily implemented to mitigate it. For <a href="https://www.acunetix.com/websitesecurity/cross-site-scripting/">Cross Site Scripting (XSS)</a> attacks, you can leverage an effective templating system.
For <a href="https://portswigger.net/web-security/sql-injection">SQL Injection attacks</a>, you can use parameterized queries. For IDOR, a common industry recommendation is to leverage a mapping (e.g., random string) to make it harder to enumerate values as an attacker. However, practically speaking, this is not as easy as it seems.</p><p>Maintaining a mapping leads to two categories of caveats:</p><ol><li>
<p>Cache Management</p>
<p>Let’s assume you have an endpoint that’s currently vulnerable to IDOR attacks: <code class="highlighter-rouge">/resource/1</code>. Now, you want to implement a mapping that masks this ID in the URL with a random string: <code class="highlighter-rouge">/resource/abcdef</code>, where <code class="highlighter-rouge">abcdef</code> maps to 1.</p>
<p>In this contrived example, you may be tempted to deprecate the old endpoint and just use the new one. However, this may break browser caches, user bookmarks, and pages indexed by search engines. Imagine taking an unexpected SEO hit when trying to roll out your IDOR-prevention system!</p>
<p>The alternative is to redirect traffic from the old endpoint to the new one and let it bake in production for an extended period of time. However, for as long as the redirect is in place, you would still be susceptible to IDOR vulnerabilities. Furthermore, the mapping is publicly harvestable during this period, so someone may store it and use it later to perform the same attacks, just with harder-to-enumerate values.</p>
</li>
<li>
<p>Handling Internal References</p>
<p>ID references are littered throughout many different internal systems: various logs, Kafka messages, and database entries to name a few. When you transition from one reference method to another, how do you make sure that none of these systems break?</p>
<p>One good approach is to use the mapped string only for public-facing assets and its numeric counterpart for internal references. However, how do you enforce this? There will always be new data ingresses, and the problem can devolve into a whack-a-mole approach: either translate IDs at each new ingress or handle both types of IDs downstream.</p>
</li>
</ol><p>Another common industry recommendation is to merely perform access control checks before manipulating resources. While this is easier to do, it’s more suitable for spot-fixing, as it’s a painfully manual process to enforce via code audits. Furthermore, it requires <strong>all</strong> developers to know when and where to implement these access control checks. For example, if you put it at the ORM level, you may need to consider legitimate administrative cases for when you need to “bypass” these checks. If you put it at your view layer (assuming MVC layout), you may find yourself duplicating code everywhere.</p><p>How can you ensure all developers are actively thinking about this attack vector, <em>and</em> know how to mitigate it?</p><h3 id="slow-to-detect">Slow to Detect</h3><p>Detection strategies for this class of vulnerabilities are also somewhat lackluster. While manual code audits are effective, they don’t scale and are often expensive. Off-the-shelf static code analyzers prove more noisy than they’re worth, and a complicated taint analysis model would be required due to the various number of places that access control checks can be done.</p><p>Traditional API fuzzing may seem like another valid option, but this is not the case. The issue with traditional fuzzing is that it seeks to break an application with the assumption that failures allude to vulnerabilities. However, this is not necessarily true. As a security team, we care less about errors that attackers may or may not receive. Rather, we want to identify when a malicious action <strong>succeeds</strong>, which will be completely ignored by traditional fuzzing.</p><h2 id="the-solution">The Solution</h2><p>In February 2019, Microsoft released a <a href="https://www.microsoft.com/en-us/research/uploads/prod/2019/02/paper2.pdf">research paper</a> that describes how stateful Swagger fuzzing was able to detect common vulnerabilities in REST APIs, including IDOR vulnerabilities. 
The premise of this strategy is as follows:</p><ol><li>
<p>Have a user session execute a sequence of requests.</p>
</li>
<li>
<p>For the same sequence of requests, have an attacker’s session execute them. This is to ensure that the user and the attacker are able to reach the same state.</p>
</li>
<li>
<p>For the last request in the sequence, have the attacker’s session execute the user’s request. If this is successful, a potential vulnerability is found.</p>
</li>
</ol><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-11-07-fuzz-lightyear/diagram.png" alt="Detecting IDOR in a hypothetical sequence of requests" /><p class="subtle-text"><small>Detecting IDOR in a hypothetical sequence of requests</small></p></div><h3 id="stateful-fuzzing-vs-traditional-fuzzing">Stateful Fuzzing vs. Traditional Fuzzing</h3><p>Generally speaking, the art of using fuzzing requests to find vulnerabilities relies on one core assumption: <strong>applications should be able to handle any input thrown at them</strong>. This means that when the application breaks due to “malformed” input, it’s indicative of a potential exploit and warrants further investigation.</p><p>The issue with this approach is that as a security team, we care less about whether an application breaks for a specific user and more about successful requests in situations where they should have failed.</p><p><a href="https://swagger.io">Swagger</a>, as a standardized API specification, is fantastic for programmatically defining the rules of engagement for the fuzzing engine. Furthermore, by making it stateful, we can simulate user behavior through proper API requests/responses which keep state between each response. This state can then be used to fuzz future request parameters so that a single request sequence is able to accurately simulate a user’s session, enabling <a href="https://principlesofchaos.org/">chaos engineering testing</a>.</p><p>Finally, user session testing allows for carefully crafted scenarios to assert various security properties of a given API. In this case, we leveraged this to check whether users are able to access private resources that don’t belong to them.</p><p>The simplicity of this concept was profound. It provided a means to scale IDOR detection in an automated fashion through integration with our CI pipeline. 
However, while our solution was inspired by Microsoft’s research, we encountered several issues when adapting it to our ecosystem.</p><h2 id="issues">Issues</h2><h3 id="infrastructure-dependencies">Infrastructure Dependencies</h3><p>With a microservice architecture, services often have dependencies on other services. This means that in order to fuzz a given service, we would need to spin up its dependent services along with any other nested dependent services. To address this, we leveraged <a href="https://docs.docker.com/compose/">Docker Compose</a> to spin up a sandbox environment so we could perform acceptance testing with the service.</p><p>Acceptance testing is the practice of treating your service as a blackbox and testing whether the entire system as a whole behaves as expected. Through a microservice lens, this differs from integration tests (that mock out external dependencies), as acceptance tests spin up sandboxed instances for more realistic end-to-end testing. Since fuzz-lightyear identifies potential IDOR vulnerabilities by analyzing successful requests, it complements this framework nicely. Running tests in sandboxed instances also prevents leaving after-effects on staging or production databases so we don’t pollute our data with fuzzed, random input. Acceptance tests are typically integrated into CI/CD pipelines but can also be run locally by developers.</p><p>One popular tool we use at Yelp to facilitate running acceptance tests is Docker Compose. This allows developers to define service dependencies in one single YAML file and enables them to start/stop them easily. By leveraging this tooling, we gain two advantages. First, we empower developers by seamlessly integrating into their established development/testing workflow. 
Second, it integrates effortlessly with our existing CI pipeline to provide continuous coverage, testing for IDOR vulnerabilities in a freshly generated sandboxed environment containing all the new changes.</p><h3 id="incomplete-resource-lifecycle">Incomplete Resource Lifecycle</h3><p>A fundamental assumption in the original research paper is that the tested application supports all CRUD (Create, Retrieve, Update, Delete) methods. This allows for stateful fuzzing, as any resource can be created and manipulated within the application’s API.</p><p>However, this is not the case at Yelp. Often, services only provide interfaces to retrieve and update resources directly corresponding to that service, but rely on other services to create such resources. This means that stateful fuzzing would not be effective, since there’s no way to test the retrieval of a resource that we didn’t create within the request sequence.</p><p>For example, service A has an endpoint X which takes a <code class="highlighter-rouge">business_id</code> as an input, but service A itself doesn’t have the ability to create businesses. By itself, the stateful fuzzing algorithm would never be able to test endpoint X since we have no way of generating a business!</p><p>We can’t just tack on another service’s API to the request sequence generation process, since this would expand the search space of the algorithm too much. Therefore, our solution is to provide developers the ability to define factory fixtures that can be used while fuzzing. This is what a fixture looks like:</p><div class="highlight"><pre>@fuzz_lightyear.register_factory('userID')
def create_biz_user_id():
  return do_some_magic_to_create_business()</pre></div>
To address these issues, we designed a testing framework that allows developers to easily configure dynamic tests and integrate them smoothly into our CI pipeline. In doing so, we can achieve continuous, automated IDOR coverage, as well as empower developers to be able to address these issues independently.</p><p>Curious to check it out? View more details on fuzz-lightyear on our <a href="https://github.com/Yelp/fuzz-lightyear">Github</a> page.</p><h2 id="contributors">Contributors</h2><p>I would like to credit the following people (in alphabetical order) for their hard work in building this system and in continuing to bolster Yelp’s security.</p><ul><li><a href="https://www.linkedin.com/in/aaronloo">Aaron Loo</a></li>
<li><a href="https://www.linkedin.com/in/joeysclee">Joey Lee</a></li>
<li><a href="https://github.com/OiCMudkips">Victor Zhou</a></li>
</ul><div class="island job-posting"><h3>Security Engineering at Yelp</h3><p>Want to transform industry-leading ideas into actionable, scalable solutions to help keep the Yelps secure? Apply to join!</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/30bfc49d-efdd-4543-9748-d95bef5692ae?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2020/01/automated-idor-discovery-through-stateful-swagger-fuzzing.html</link>
      <guid>https://engineeringblog.yelp.com/2020/01/automated-idor-discovery-through-stateful-swagger-fuzzing.html</guid>
      <pubDate>Thu, 09 Jan 2020 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Streaming Cassandra into Kafka in (Near) Real-Time: Part 2]]></title>
      <description><![CDATA[<p>The <a href="https://engineeringblog.yelp.com/2019/12/cassandra-source-connector-part-1.html">first half</a> of this post covered the requirements and design choices of the Cassandra Source Connector and dove into the details of the CDC Publisher. As described, the CDC Publisher processes Cassandra CDC data and publishes it as loosely ordered PartitionUpdate objects into Kafka as intermediate keyed streams. The intermediate streams then serve as input for the DP Materializer.</p><h2 id="data-pipeline-materializer">Data Pipeline Materializer</h2><p>The DP Materializer ingests the serialized PartitionUpdate objects published by the CDC Publisher, transforms them into fully formed Data Pipeline messages, and publishes them into the Data Pipeline.</p><p>The DP Materializer is built on top of Apache Flink, a stream processing framework. Flink has been used in production at Yelp for a few years now across various streaming applications. It provides an inherent state backend in the form of RocksDB, which is essential for guaranteeing in-order CDC publishing. In addition, Flink’s checkpoint and savepoint capabilities provide extremely powerful fault tolerance.</p><p>The application has two main phases:</p><ul><li>Schema Inference (or the “bootstrap phase”)</li>
<li>ETL (or the “transform phase”)</li>
</ul><h3 id="schema-inference">Schema Inference</h3><p>During the bootstrap phase, the avro schema necessary for publishing to the Data Pipeline is derived from the Cassandra table schema. The process begins by building the Cassandra table metadata objects (<em>CFMetaData</em>) used by the Cassandra library. Loading this metadata is required to use library functionality to act on the serialized Cassandra data from the CDC Publisher stream. The metadata object contains information on the table primary key, column types, and all other properties specified in a table CREATE statement. This schema representation is processed to produce an avro schema where each Cassandra column is represented by an equivalent avro type.</p><p>As the DP Materializer is deployed outside of the Cassandra cluster, it cannot load the table metadata from files on the local node (like the CDC Publisher). Instead, it uses the Cassandra client to connect to Cassandra and derive the CFMetaData from the schema of the table being streamed. This is done in the following steps:</p><ol><li>Once connected to a cluster, the create table and type (for UDTs) statements are retrieved.</li>
<li>Cassandra’s query processor is used to parse the retrieved create statements into the table metadata objects.</li>
<li>Information about columns previously dropped from the table is retrieved and added to the metadata built in the previous step. Loading the dropped column information is required to read table data created prior to the column being dropped.</li>
</ol><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-18-csource-part-2/loading-cassandra-metadata.jpeg" alt="Loading Table Metadata from Cassandra" /><p class="subtle-text"><small>Loading Table Metadata from Cassandra</small></p></div><p>Once the metadata is loaded, the DP Materializer builds the avro schema from the metadata. A couple of key things happen in this derivation phase:</p><ol><li>The table’s partition key and clustering key(s) are mapped as the primary keys of the avro schema.</li>
<li>All other columns in the table (except the partition and clustering keys) are created as nullable. In the event of schema changes in the table, this guarantees that the corresponding avro schemas are always compatible to their previous versions (except when re-adding a column with a different type, which in itself <a href="https://issues.apache.org/jira/browse/CASSANDRA-14843">can</a> <a href="https://issues.apache.org/jira/browse/CASSANDRA-14948">cause</a> <a href="https://issues.apache.org/jira/browse/CASSANDRA-14913">issues</a>).</li>
</ol><p>Schema generation currently supports nearly all valid Cassandra column types (except when prohibited by Avro), including collections, tuples, UDTs, and nesting thereof.</p><h4 id="schema-change-detection">Schema Change Detection</h4><p>As the above schema inference is part of the bootstrap phase, the DP Materializer needs the ability to detect Cassandra schema changes online and update the output Avro schema automatically. To achieve this, it implements Cassandra’s schema change listener interface, provided by the Cassandra client, to detect when a change is made to the schema of the tracked table. Once detected, the corresponding Cassandra metadata is updated and the avro schema is rebuilt from the updated metadata.</p><h3 id="etl-or-consume-transform-and-publish">ETL (or Consume, Transform, and Publish)</h3><p>This phase of the DP Materializer is where the serialized PartitionUpdate objects from the CDC Publisher are consumed, processed, and transformed into Data Pipeline messages for publishing into the Pipeline. The consumer and publisher are provided out-of-the-box by Flink, so this section primarily focuses on the transformer portion of the DP Materializer.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-18-csource-part-2/dp-materializer-high-level.jpeg" alt="Data Pipeline Materializer" /><p class="subtle-text"><small>Data Pipeline Materializer</small></p></div><h4 id="state-architecture">State Architecture</h4><p>The transformer is backed by Flink’s RocksDB state. This state is abstracted as a collection of map objects, with each map corresponding to a partition key from the Cassandra table. Each map object has, as its keys, the clustering keys from that partition in Cassandra. A PartitionUpdate, containing at most one row, is stored as the value for its corresponding clustering key in the map. 
For tables which do not have defined clustering keys, each map contains a single entry with a null key.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-18-csource-part-2/state-architecture.jpeg" alt="State Structure" /><p class="subtle-text"><small>State Structure</small></p></div><p>State loading and memory management is handled internally by Flink. In addition, Flink’s stream keying mechanism guarantees that all updates for a partition key will be routed to the same worker and processed against the same map object persistently across application restarts.</p><p>Note that the PartitionUpdate objects from the CDC Publisher can be both duplicated multiple times and out-of-order (by writetime). In addition, oftentimes a PartitionUpdate may not contain the full content of a Cassandra row.</p><h4 id="the-transformer">The Transformer</h4><p>The central piece of the application is the transformer, which:</p><ul><li>Processes the Cassandra CDC data into a complete row (with preimage) for the given avro primary key (Cassandra partition key + clustering key[s]) for publishing to the Data Pipeline.</li>
<li>Produces final output message with the appropriate Data Pipeline message type.</li>
</ul><p>The transformer uses the row (PartitionUpdate) saved in the map objects in the state, along with the incoming PartitionUpdate objects from the CDC Publisher to generate the complete row content, the previous row content (in the case of UPDATE, DELETE messages), and the type of the output message.</p><p>This is achieved by deserializing the input PartitionUpdate and merging it with the saved PartitionUpdate. This is done using the same PartitionUpdate merge functionality Cassandra uses to combine data from SSTables during reads. The merge API takes in two PartitionUpdate objects, one from the Flink state and the other from the CDC Publisher’s output stream. This produces a merged PartitionUpdate which is used to build an avro record with the schema derived during the bootstrap phase. If the previous row value is needed, it is derived from the saved PartitionUpdate in the Flink state. In the end, the state is updated with the merged PartitionUpdate.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-18-csource-part-2/merge-partition-update.jpeg" alt="Determining Row States" /><p class="subtle-text"><small>Determining Row States</small></p></div><p>This process handles duplicate and out-of-order PartitionUpdate objects. The use of Cassandra’s merge functionality results in the same “last write wins” conflict resolution as a Cassandra read. To avoid publishing duplicate messages, it is verified that the input PartitionUpdate changes the row state. This is done by computing the md5 digests of the saved and merged PartitionUpdate objects. If the digests are the same, the PartitionUpdate is ignored.</p><p>The merge, update state, and publish logic can be summarized below:</p><ul><li>The incoming PartitionUpdate is merged with the saved PartitionUpdate (if it exists) and the corresponding Data Pipeline message is determined:
<ul><li>If the merged PartitionUpdate contains live (non-tombstoned) data and the saved does not, a CREATE message is published.</li>
<li>If both the merged and saved PartitionUpdate objects contain live data, an UPDATE message is published if the md5 digests of the objects are different.</li>
<li>If the merged PartitionUpdate contains tombstoned data but the saved one contains live data, a DELETE message is published.</li>
</ul></li>
<li>If the md5 digests of the saved and merged PartitionUpdate objects are different, then the merged PartitionUpdate is saved in the state.</li>
</ul><p>Thus, at the end of the transform phase, a message with the appropriate Data Pipeline message type and the full row content is ready to be published into the Data Pipeline.</p><h2 id="supporting-backfills">Supporting Backfills</h2><h3 id="bootstrapping-a-stream">Bootstrapping a Stream</h3><p>A limited amount of CDC logs can be <a href="http://cassandra.apache.org/doc/latest/operating/cdc.html#warnings">stored on a Cassandra node</a>. Thus, when a table is set up to be streamed by the connector, only the data available in the CDC directory at the time (and going forward) will be processed. However, to maintain the stream-table duality, all of the existing data in the Cassandra table needs to be replayed into the stream.</p><p>To achieve this, the backfill bootstrap process reads through the data stored on disk as SSTables. To ensure that the set of SSTable files are not modified by compaction during the backfill, the table’s snapshot is taken and the SSTables are processed off of that snapshot. The Cassandra SSTable reader returns the scanned data as a series of PartitionUpdate objects. The CDC Publisher processes these PartitionUpdate objects in the same way as commit log segments and publishes them into Kafka, where they’re subsequently transformed into Data Pipeline messages by DP Materializer.</p><p>This process is followed whenever a Cassandra table is first set up to be tracked by the connector. This is also done if there’s a need to rebuild the state in the DP Materializer.</p><h3 id="rebuilding-a-stream">Rebuilding a Stream</h3><p>If a tracked table’s output stream becomes corrupted or is deleted (unlikely but possible), the stream can be rebuilt by replaying the stored state of the DP Materializer. 
As all of the serialized PartitionUpdate objects are stored in the state, there’s no need to republish data from the SSTables.</p><h2 id="limitations-and-future-work">Limitations and Future Work</h2><h3 id="partition-level-operations">Partition Level Operations</h3><p>The current system design processes each row change independently. A single input message to the DP Materializer will emit at most one message into the Data Pipeline. Changes at a partition level that affect the value of multiple rows are not currently supported. These include:</p><ul><li>Full partition deletion (only when using clustering)</li>
<li>Ranged tombstones</li>
<li>Static columns</li>
</ul><p>There is, however, a potential path to supporting these operations. The DP Materializer stores all rows in a single Cassandra partition as entries of the same map object during processing. It is conceivable to also store the partition-level state separately. When this state changes, the DP Materializer could iterate through the entire map (Cassandra partition) and produce Data Pipeline messages for all affected rows.</p><h3 id="ttl">TTL</h3><p>TTL’ed data is currently not supported by the connector. TTL values are ignored and data is considered live based on its writetime.</p><h3 id="dropping-tombstones">Dropping Tombstones</h3><p>There’s no support for dropping tombstones from the DP Materializer’s Flink state. They will remain there indefinitely unless overridden with new data. It may be possible to drop old tombstones when updating row state, similar to the gc_grace_seconds parameter on tables. However, this would not help for rows that are never updated. In addition, great care would need to be taken to ensure that backfilling or repairing a table does not create zombie data in the output stream.</p><h3 id="publishing-latency">Publishing Latency</h3><p>As mentioned earlier, commit log segments must be full and no longer referenced by memtables before being made available for processing by Cassandra. In spite of the CDC log filler implementation, some latency is introduced in publishing to the Data Pipeline. This limitation should be overcome in Cassandra 4, which introduces the capability to read live commit log segments and will thus ensure that the publishing latency is as close to real time as possible.</p><h2 id="learnings">Learnings</h2><p>The Cassandra Source Connector has been running in production at Yelp since Q4 2018. 
It supports multiple use cases, which have helped surface some quirks in its design choices:</p><h3 id="avro-as-a-serialization-format">Avro as a Serialization Format</h3><p>The maximum number of cells (rows * columns) allowed by Cassandra in a single partition is two billion. This means that a row could potentially have two billion columns. However, Avro serialization and deserialization become a bottleneck once the number of columns grows into the hundreds, and cannot keep up with the potential maximum number of columns. Horizontal scaling might be needed for consumers depending on the throughput requirements and size (in number of columns) of the Cassandra table being streamed.</p><p>In addition, a few Cassandra data types (such as DECIMAL) don’t have intuitive Avro data type equivalents. In such cases, either the columns cannot be supported or custom Avro data types have to be defined.</p><h3 id="flink-state-size">Flink State Size</h3><p>As every single row from the table is stored as a serialized PartitionUpdate in the state, the state can grow huge for large tables. The state size becomes a bottleneck for code pushes and maintenance, as the state has to be reloaded on every deployment and restart of the application. Additional work is required to minimize the time spent saving and loading state for huge tables.</p><h2 id="tldr">TL;DR?</h2><p>Yelp presented the Cassandra Source Connector at Datastax Accelerate 2019. You can watch it <a href="https://www.youtube.com/watch?v=p2GLvYActRw">here</a>.</p><div class="post-gray-box">This post is part of a series covering Yelp's real-time streaming data infrastructure. 
Our series explores in depth how we stream MySQL and Cassandra data in real time, how we automatically track &amp; migrate schemas, how we process and transform streams, and finally how we connect all of this into data stores like Redshift, Salesforce, and Elasticsearch.<p>Read the posts in the series:</p><ul><li><a title="Billions of Messages a Day - Yelp's Real-time Data Pipeline" href="https://engineeringblog.yelp.com/2016/07/billions-of-messages-a-day-yelps-real-time-data-pipeline.html">Billions of Messages a Day - Yelp's Real-time Data Pipeline</a></li>
<li><a title="Streaming MySQL tables in real-time to Kafka" href="https://engineeringblog.yelp.com/2016/08/streaming-mysql-tables-in-real-time-to-kafka.html">Streaming MySQL tables in real-time to Kafka</a></li>
<li><a title="More Than Just a Schema Store" href="https://engineeringblog.yelp.com/2016/08/more-than-just-a-schema-store.html">More Than Just a Schema Store</a></li>
<li><a title="PaaStorm: A Streaming Processor" href="https://engineeringblog.yelp.com/2016/08/paastorm-a-streaming-processor.html">PaaStorm: A Streaming Processor</a></li>
<li><a title="Data Pipeline: Salesforce Connector" href="https://engineeringblog.yelp.com/2016/09/data-pipeline-salesforce-connector.html">Data Pipeline: Salesforce Connector</a></li>
<li><a title="Streaming Messages from Kafka into Redshift in near Real-Time" href="https://engineeringblog.yelp.com/2016/10/redshift-connector.html">Streaming Messages from Kafka into Redshift in near Real-Time</a></li>
<li><a title="Open-Sourcing Yelp's Data Pipeline" href="https://engineeringblog.yelp.com/2016/11/open-sourcing-yelps-data-pipeline.html">Open-Sourcing Yelp's Data Pipeline</a></li>
<li><a title="Making 30x Performance Improvements on Yelp’s MySQLStreamer" href="https://engineeringblog.yelp.com/2018/02/making-30x-performance-improvements-on-yelps-mysqlstreamer.html">Making 30x Performance Improvements on Yelp’s MySQLStreamer</a></li>
<li><a title="Black-Box Auditing: Verifying End-to-End Replication Integrity between MySQL and Redshift" href="https://engineeringblog.yelp.com/2018/04/black-box-auditing.html">Black-Box Auditing: Verifying End-to-End Replication Integrity between MySQL and Redshift</a></li>
<li><a title="Fast Order Search Using Yelp’s Data Pipeline and Elasticsearch" href="https://engineeringblog.yelp.com/2018/06/fast-order-search.html">Fast Order Search Using Yelp’s Data Pipeline and Elasticsearch</a></li>
<li><a title="Joinery: A Tale of Un-Windowed Joins" href="https://engineeringblog.yelp.com/2018/12/joinery-a-tale-of-unwindowed-joins.html">Joinery: A Tale of Un-Windowed Joins</a></li>
<li><a title="Streaming Cassandra into Kafka in (Near) Real-Time: Part 1" href="https://engineeringblog.yelp.com/2019/12/cassandra-source-connector-part-1.html">Streaming Cassandra into Kafka in (Near) Real-Time: Part 1</a></li>
<li><a title="Streaming Cassandra into Kafka in (Near) Real-Time: Part 2" href="https://engineeringblog.yelp.com/2019/12/cassandra-source-connector-part-2.html">Streaming Cassandra into Kafka in (Near) Real-Time: Part 2</a></li>
</ul></div><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>Interested in solving problems like these? Apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/2cfdf523-06dd-41d9-b025-3db1b45f0548?description=Software-Engineer-Data-Production-Backend_Engineering_London-UK?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2019/12/cassandra-source-connector-part-2.html</link>
      <guid>https://engineeringblog.yelp.com/2019/12/cassandra-source-connector-part-2.html</guid>
      <pubDate>Wed, 18 Dec 2019 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Architecting Restaurant Wait Time Predictions]]></title>
      <description><![CDATA[<p>Is there a restaurant you’ve always wanted to check out, but haven’t been able to because they don’t take reservations and the lines are out the door?</p><p>Here at Yelp, we’re trying to solve problems just like these and delight consumers with streamlined dining experiences. Yelp Waitlist is part of the Yelp Restaurants product suite, and its mission is to take the mystery out of everyday dining experiences, enabling you to get in line at your favorite restaurant through just the tap of a button.</p><p>For diners, in addition to joining an online waitlist, Yelp Waitlist provides live wait times and queue updates. For restaurants, it facilitates table management and reduces stress and chaos by the door by allowing guests to sign up remotely. The flow is simple: diners see the current wait times at a Waitlist restaurant and virtually get in line right from the Yelp app.</p><p>If you want to know more about the product, check out <a href="https://blog.yelp.com/2019/09/yelp-waitlist-new-predictive-wait-time-and-notify-me-features">this related</a> post!</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-11-architecting-wait-time-estimations/shizen_gil_flow.png" alt="" /></div><p>Wait estimates are modeled as a machine learning problem. When you request to be seated at a restaurant through Waitlist, a machine learning model is alerted behind the scenes to generate a prediction. The ability of this model to provide reasonable wait estimates is what makes the online waitlist possible, so you have some bit of AI to thank the next time you enter a line from the comfort of your home.</p><p>The prediction endpoint is part of a larger system that enables the generation of the estimated time. 
This blog post describes the Waitlist machine learning system that connects hungry diners to their tasty food.</p><h3 id="the-system">The System</h3><p>As you can imagine, the system needs to stay as up to date as possible with the state of the restaurant (e.g., how many people are currently in line) and the many other contextual factors required to make an estimate as accurate as possible. For example, you cannot expect the wait time to extend beyond the closing time of the restaurant. Additionally, the system must meet strict latency requirements while serving a high volume of queries per second.</p><p>The system can be broken down into three components:</p><ol><li>The offline training pipeline where model iteration, data-wrangling, and ETLs happen.</li>
<li>Online serving which tracks the current state of the restaurant and responds to requests.</li>
<li>Analytics providing model performance reports and analyses.</li>
</ol><p>We chose XGBoost as the model to generate wait estimates. Offline training happens via Spark, while an <a href="https://engineeringblog.yelp.com/2018/01/growing-cache-friendly-trees-part2.html">optimized XGBoost Java library</a> that helps us meet latency requirements is used for online serving.</p><p>We faced two main challenges while architecting the machine learning system:</p><ol><li>Serving users live predictions from a Spark ML model through an online service written in Python.</li>
<li>The cold start problem when adding new businesses to the product.</li>
</ol><p>Most of the system was initially designed to overcome the first challenge. We slowly added components to enable training and prediction with more features once we felt confident in the system’s ability to work seamlessly on its own. The second challenge was addressed by the use of XGBoost, which can make predictions with partial feature-sets. Though these predictions may not be very accurate at first, retraining helps improve them over time.</p><p>Below is a simplified view of the system:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-11-architecting-wait-time-estimations/wait_list_system.png" alt="" /></div><h3 id="the-various-components-in-the-above-diagram-are">The various components in the above diagram are:</h3><ul><li><strong>Data Warehouse</strong>: Source of data for training, backed by Redshift.</li>
<li><strong>Offline Service</strong>: Service responsible for training the model. This is written in Python and uses Spark for model training due to the quantity of data involved (tens of millions of instances after sanitization).</li>
<li><strong>Feature ETLs</strong>: Spark-based ETLs for generating additional features. These are non-time-sensitive features which are shared both online and offline.</li>
<li><strong>Model Server</strong>: In-house Java service which stores the trained model and is optimized for high-throughput traffic.</li>
<li><strong>Online Stores</strong>: Available features generated from Spark-ETL, as well as up-to-date restaurant data. This encompasses:
<ul><li>Cassandra for storing results from Spark-ETLs</li>
<li>MySQL for storing the restaurant’s state</li>
</ul></li>
<li><strong>Online Service</strong>: Service responsible for generating predictions in real time and making calls to the online stores and model server to do so. This service is written in Python.</li>
</ul><p>As hinted above, we rely heavily on Spark for building models, as well as for deriving additional features. It’s important to note, however, that the online service does not make use of Spark, which can result in different data access and manipulation patterns before data is fed into the model to make a prediction.</p><p>A lot of care goes into ensuring that the set of features we compute offline matches those we compute online. A theoretical example of a mismatched online/offline feature would be different orderings for one-hot encoded feature columns, which, despite having identical raw data, can result in different feature vectors.</p><p>Figure 2 (below) breaks down the model development, evaluation, and launch pipelines:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-11-architecting-wait-time-estimations/pipelines.png" alt="" /></div><h3 id="model-development-pipeline">Model development pipeline:</h3><p>At this stage, the model flows from human intuition/ideation to reality. This encompasses:</p><ul><li>Feature-extraction ETLs</li>
<li>Feature-set blueprints: Feature definitions intended to enforce online/offline consistency (e.g., what subset of features this particular feature-set contains, its data types, etc.)</li>
</ul><h3 id="evaluation-pipeline">Evaluation pipeline:</h3><p>This ensures that the newly trained model obtains an acceptable performance with regard to business metrics. This pipeline is a combination of automation and human decision making. For example, a metric could track the percentage of diners who waited more than five minutes beyond their quoted estimate.</p><p>The steps for evaluation include:</p><ul><li>Running an evaluation batch for the freshly trained model.</li>
<li>Comparing performance against previous benchmarks.</li>
<li>Evaluating if the new model is a viable candidate for experimentation/release. (Unfortunately not all candidates are viable; this can be attributed to the probabilistic nature of machine learning projects.)</li>
<li>The models that pass this stage promise superior performance compared to the status quo model.</li>
</ul><h3 id="experimentation-model-launch-pipeline">Experimentation/Model-Launch pipeline:</h3><p>At this stage, we’re convinced of the model’s promise and want to experiment with it in the real world. To maintain confidence that the model will operate in production as it did in offline evaluation, we promote the model to “dark-launch” mode.</p><p>To do this, we need to be able to reproduce the feature-set in the online service. This means:</p><ul><li>Each incoming user request contains a partial feature-set.</li>
<li>The rest of the features are pulled from online data stores.</li>
<li>The feature-set is guaranteed to maintain the same format as the training data (thanks to feature-blueprints).</li>
</ul><p>Once we have the ability to make predictions from the online service, we can proceed to the dark-launch phase. Here, we:</p><ul><li>Surface our candidate model as a ghost/dark model.</li>
<li>Enable the model to see live incoming requests and produce estimates for these requests (without surfacing them to the user).</li>
<li>Use the event logs generated from the experiment launch to measure the performance of all models across all samples.</li>
</ul><p>We’ve seen several benefits from dark-launching our model:</p><ul><li>Comparing performance across different cohorts of businesses without affecting estimates.</li>
<li>Weeding out any differences between the online and offline model pipelines. Since each performs its own set of computations, there’s plenty of scope for mistakes; dark-launching lets us verify that the offline pipeline and the dark-launched model give identical prediction estimates for the same candidate.</li>
<li>Checking the latency of the new model and ensuring we don’t violate any SLOs.</li>
</ul><p>At any given time, we can have several models launched live, several dark-launched, and several under development.</p><p>We can typically verify within a few days whether the dark-launched model is working as expected; if not, we can begin to investigate any discrepancies. If the results are as expected, we can slowly start rolling out the new model. This slow rollout is intended to capture feedback loops that we’re not exposed to during dark-launch.</p><h3 id="whats-a-feedback-loop">What’s a Feedback Loop?</h3><p>Whenever we surface an estimate to the user, we set an expectation of the time they’ll be seated, thereby affecting when the user shows up to the restaurant. If, for instance, this causes the user to arrive at the restaurant after their table is actually available, they may have a longer overall wait time (our label) than if we’d given them a shorter estimate. These instances are tracked in our logs and we try our best to reduce such inaccuracies. The feedback loop here happens when our label data is influenced by our prediction.</p><p>Factors like this add sensitivity to our system, which underscores the importance of providing accurate wait estimations.</p><h3 id="measuring-success">Measuring Success</h3><p>Within this problem area are a variety of metrics we can track, and choosing the right ones is always a challenge. We need to cater to the needs of not only the users (by ensuring they wait only as long as expected), but also the restaurant and its staff (not sending enough people to occupy empty tables vs. sending too many people at the same time, which puts pressure on the hosts).</p><p>We can observe a few of these metrics using data streamed into our logs. A few others can be gauged through user-feedback surveys (which are themselves prone to bias), and for whatever cannot be observed, we hypothesize. 
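To make the log-derived flavor of metric concrete, here is a minimal sketch of how the example mentioned earlier (the percentage of diners who waited more than five minutes beyond their quoted estimate) might be computed. All field names are hypothetical; this is not Yelp's actual log schema or tooling.

```python
from dataclasses import dataclass

@dataclass
class WaitRecord:
    """One seated-party log record. Field names are illustrative only."""
    quoted_minutes: float  # wait time quoted to the diner when they joined the line
    actual_minutes: float  # wait time actually observed before seating

def pct_over_quote(records, slack_minutes=5.0):
    """Fraction of diners who waited more than `slack_minutes` past their quote."""
    if not records:
        return 0.0
    over = sum(
        1 for r in records if r.actual_minutes > r.quoted_minutes + slack_minutes
    )
    return over / len(records)

records = [
    WaitRecord(quoted_minutes=20, actual_minutes=22),  # within the 5-minute slack
    WaitRecord(quoted_minutes=15, actual_minutes=25),  # 10 minutes over quote
    WaitRecord(quoted_minutes=30, actual_minutes=28),  # seated early
    WaitRecord(quoted_minutes=10, actual_minutes=40),  # 30 minutes over quote
]
```

A metric like this is easy to compare across live, dark-launched, and candidate models, since all of them produce quotes for the same logged requests.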
We’re constantly trying to collect as much data as possible to improve the coverage of each quantitative and qualitative metric.</p><p>Measuring success is not trivial, especially given that the set of restaurants we serve is constantly growing and providing more opportunities to observe new behavioral patterns. With each model that we build and deploy, we learn a little more about our system, helping us better measure success. So far this strategy has worked well for us.</p><h3 id="conclusion">Conclusion</h3><p>Wait-time estimation is a unique problem we could only begin to address because of the state-of-the-art tooling and support from the wonderful people at Yelp! We continue to make updates to the algorithms and migrate our system to use more efficient tooling to make our estimates as accurate as possible so that you - our customer - don’t have to wait longer than you need at your favorite restaurant.</p><h3 id="acknowledgements">Acknowledgements</h3><p>Huge thanks to my indispensable team for all their contributions: Chris Farrell, Steve Thomas, Steve Blass, Aditi Ganpule, Saeed Mahani, Kaushik Dutt, and Sanket Sharma.</p><div class="island job-posting"><h3>Become a Software Engineer at Yelp</h3><p>Passionate about solving problems with Machine Learning?</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/f674cef9-b635-4f25-8dd9-66663494392a?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2019/12/architecting-wait-time-estimations.html</link>
      <guid>https://engineeringblog.yelp.com/2019/12/architecting-wait-time-estimations.html</guid>
      <pubDate>Thu, 12 Dec 2019 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Streaming Cassandra into Kafka in (Near) Real-Time: Part 1]]></title>
      <description><![CDATA[<p>At Yelp, we use Cassandra to power a variety of use cases. As of the date of publication, there are 25 Cassandra clusters running in production, with deployments of varying sizes. The data stored in these clusters is often required as-is or in a transformed state by other use cases, such as analytics, indexing, etc. (for which Cassandra is not the most appropriate data store).</p><p>As seen in previous posts from our Data Pipeline series, Yelp has developed a robust connector ecosystem around its data stores to stream data both into and out of the Data Pipeline. This two-part post will dive into the Cassandra Source Connector, the application used for streaming data from Cassandra into the Data Pipeline.</p><h2 id="data-pipeline-recap">Data Pipeline Recap</h2><p>Yelp’s Data Pipeline is an abstraction on top of Apache Kafka (explained in <a href="https://engineeringblog.yelp.com/2016/07/billions-of-messages-a-day-yelps-real-time-data-pipeline.html">this blog post</a>) and is backed by a schema registry called <a href="https://engineeringblog.yelp.com/2016/08/more-than-just-a-schema-store.html">Schematizer</a>. It currently serves as the backbone of hundreds of use cases at Yelp, ranging from analytics and experimentation to notifications, ranking, and search indexing.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-04-csource-part-1/data-pipeline.jpeg" alt="The Data Pipeline Ecosystem at Yelp" /><p class="subtle-text"><small>The Data Pipeline Ecosystem at Yelp</small></p></div><p>Here’s a quick recap of the Data Pipeline:</p><ul><li>Data published into the Data Pipeline must be schematized. In essence, data cannot be published if it doesn’t have a predefined schema.</li>
<li>For data backed by data stores, the corresponding streams in the Data Pipeline must conform to the <a href="https://docs.confluent.io/3.1.1/streams/concepts.html#duality-of-streams-and-tables">stream-table duality</a>.</li>
<li>Every message in the Data Pipeline must contain the full content of an equivalent row in the data store. In addition, UPDATE and DELETE messages must also contain the previous snapshot of the equivalent row before the change.</li>
</ul><h2 id="challenges-with-streaming-data-from-cassandra">Challenges With Streaming Data From Cassandra</h2><p>Due to the nature of how Cassandra works, meeting the aforementioned Data Pipeline requirements can present some challenges.</p><h3 id="achieving-ordering-of-writes">Achieving Ordering of Writes</h3><p>Cassandra uses multiple replicas of data for availability. However, there’s no actual concept of a global replication stream. Each write is independently replicated, with all nodes eligible to coordinate. As a result, concurrent writes may be processed in different orders on different replicas. Cassandra uses several mechanisms (hinted handoffs, repairs, last write wins) to ensure that data is eventually consistent. Although the replicas eventually agree on the final value of the data, this does not resolve the differences in write order. Thus, the Cassandra Source Connector needs to provide write ordering guarantees similar to those of Cassandra itself.</p><h3 id="obtaining-complete-row-content">Obtaining Complete Row Content</h3><p>There’s no requirement for Cassandra writes to contain all table columns. Even if this were the case, the current state of the row would depend on both the data in the write and all previously written data that shadows it. Thus, the write data alone is not sufficient to determine the new row state.</p><h3 id="obtaining-previous-row-content">Obtaining Previous Row Content</h3><p>As is the case when determining the new row value, knowledge of the row state prior to a given mutation is required. This prior row state represents the accumulation of all previous writes.</p><h3 id="distributed-data-ownership">Distributed Data Ownership</h3><p>The ownership of data in Cassandra is distributed between the nodes in each datacenter. There’s no special “master”; all nodes are able to coordinate writes. 
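The last-write-wins behavior described above can be illustrated with a toy merge over per-column write timestamps. This is a deliberately simplified model (no tombstones, TTLs, or clustering keys), not Cassandra's implementation:

```python
def lww_merge(state, write):
    """Merge a (possibly partial) write into row state via last-write-wins.

    `state` and `write` map column name -> (value, timestamp). A toy model
    of timestamp-based reconciliation, not Cassandra's actual code.
    """
    merged = dict(state)
    for column, (value, ts) in write.items():
        # Keep whichever cell carries the higher write timestamp.
        if column not in merged or ts >= merged[column][1]:
            merged[column] = (value, ts)
    return merged

# Two concurrent writes to the same row, with distinct timestamps.
w1 = {"name": ("Alice", 100)}
w2 = {"name": ("Bob", 101), "city": ("SF", 101)}

# Replicas may apply them in different orders...
replica_a = lww_merge(lww_merge({}, w1), w2)
replica_b = lww_merge(lww_merge({}, w2), w1)
# ...yet both converge on the same final row state.
```

Note that although both replicas end at the same state, each observed a different intermediate state (one briefly held "Alice"), which is exactly the write-order discrepancy a change-capture consumer has to contend with.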
Thus, processing these writes to a cluster involves combining information from multiple nodes.</p><h2 id="possible-approaches">Possible Approaches</h2><p>Several approaches were considered when designing the Cassandra Source Connector. <a href="https://wecode.wepay.com/posts/streaming-cassandra-at-wepay-part-1">This post</a> by WePay gives a solid description of the primary streaming options available along with the pros and cons of each, including:</p><ul><li>Writing to both Cassandra and Kafka (“Double Writing”)</li>
<li>Writing directly to Kafka and using a Cassandra Sink to load the data in Cassandra (“Kafka as Event Source”)</li>
<li>Processing the commit log exposed by Cassandra’s Change Data Capture or CDC (“Parsing Commit Logs”)</li>
</ul><p>The use of Kafka Connect’s <a href="https://docs.lenses.io/connectors/source/cassandra.html">Cassandra Source</a> was also investigated. This connector streams data from a Cassandra table into Kafka using either “bulk” or “incremental” update modes. Both modes function by periodically polling the table for data. Bulk mode performs a full table scan, publishing the entire result, while incremental mode queries the rows written since the last sampling. Both modes have their disadvantages:</p><ul><li>Bulk mode table scans are very expensive on large tables, and each scan publishes a lot of duplicate data.</li>
<li>Incremental mode is only viable for a certain type of workload. The writes must be append-only with monotonically increasing columns (such as timestamps) as part of the primary key. Additionally, polling for this data can cause extra cluster load.</li>
</ul><p>Ultimately, a solution based on processing Cassandra CDC made the most sense for the connector.</p><p>Cassandra’s distributed deployment characteristics, coupled with the need to both achieve an ordering of writes and meet Data Pipeline semantics, made creating a single application quite challenging. Thus, the Cassandra Source Connector was built as two separate components, each addressing a subset of these issues:</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-04-csource-part-1/csource-high-level.jpeg" alt="Cassandra Source Connector at a High Level" /><p class="subtle-text"><small>Cassandra Source Connector at a High Level</small></p></div><p><strong>CDC Publisher</strong>: A service running locally on Cassandra nodes that uses CDC to publish raw Cassandra writes into intermediate Kafka streams. These streams serve as unified commit logs, removing the aspect of distributed data ownership and defining an order of events to process.</p><p><strong>Data Pipeline Materializer</strong> (<strong>DP Materializer</strong>): An application running on Apache Flink which processes raw Cassandra writes produced by the CDC Publisher and publishes them as Data Pipeline messages.</p><h2 id="cdc-publisher">CDC Publisher</h2><p>The CDC Publisher produces all writes made in Cassandra tables as serialized partition updates into table-specific Kafka streams.</p><h3 id="processing-cassandra-writes-with-cdc">Processing Cassandra Writes with CDC</h3><p>The <a href="http://cassandra.apache.org/doc/latest/operating/cdc.html">Change Data Capture (CDC)</a> capability introduced in version 3.8 of Cassandra is used by the CDC Publisher to process writes.</p><p>Normally (with CDC disabled), writes are stored by Cassandra in the following manner:</p><ul><li>Client writes are persisted to memtables and the commit log by every node</li>
<li>Memtables are periodically flushed to SSTables on disk</li>
</ul><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-04-csource-part-1/cassandra-write-path.jpeg" alt="Cassandra Write Path" /><p class="subtle-text"><small>Cassandra Write Path</small></p></div><p>The commit log is composed of a series of fixed-size files (32MB by default) called “commit log segments”. Once the memtables are flushed to SSTables, these segments are discarded by Cassandra.</p><p>If CDC is enabled, all Cassandra commit log segment files containing writes to a tracked table are flagged. When the files are no longer referenced by corresponding memtables, they’re moved into a separate directory (instead of being discarded).</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-04-csource-part-1/cassandra-write-path-with-cdc.jpeg" alt="Cassandra Write Path with CDC" /><p class="subtle-text"><small>Cassandra Write Path with CDC</small></p></div><p>There are several challenges with using the current implementation of Cassandra’s CDC:</p><ul><li>Per-node processing: As each node stores only a portion of the complete table data, CDC must be processed on multiple nodes.</li>
<li>Replication: The same write is stored on each data replica, resulting in duplicate processing.</li>
<li>Partial data: Commit log segments only contain the information from incoming writes and do not have the full view of the corresponding rows.</li>
<li>CDC does not contain schema information about the tables.</li>
<li>CDC directory size limit: If the CDC directory gets too large in size, the node will reject new table writes.</li>
<li>Poorly bounded latency: Commit log segments must be full and no longer referenced by memtables before being made available for processing. For clusters with low write rates, the commit log segments can take a while to fill up, affecting latency.</li>
</ul><p>Despite these drawbacks, CDC was used because it is the solution developed by the Cassandra open source community for processing committed data. This also means that any future improvements to the CDC implementation can be leveraged by upgrading Cassandra versions.</p><h3 id="wrangling-cdc">Wrangling CDC</h3><h4 id="deployment">Deployment</h4><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-04-csource-part-1/region-deployment.jpeg" alt="CDC Datacenter Deployment" /><p class="subtle-text"><small>CDC Datacenter Deployment</small></p></div><p>To ensure that processing CDC doesn’t cause any performance issues on the actual cluster, a virtual Cassandra datacenter is created, which is logically separate from the standard region-specific datacenters. The CDC Publisher is deployed only on the nodes of this datacenter. As all writes go to data replicas in all datacenters, this is sufficient to ensure coverage of all table changes. Additionally, nodes in this datacenter can be provisioned differently as they don’t serve live client read requests.</p><h4 id="bounding-latency">Bounding Latency</h4><p>As mentioned earlier, one of the issues with using CDC is that the latency (defined as the time between the write to Cassandra and the data being made available for processing) is poorly bounded. CDC only allows processing of commit log files that are no longer needed, meaning they should be full and not referenced by an existing memtable. To introduce predictable latency bounds to the connector, the following approaches were adopted:</p><h6 id="removing-memtable-references">Removing Memtable References</h6><p>Memtables are periodically flushed by Cassandra to SSTables when they get too large. However, a table with a low write rate will rarely be flushed, thus delaying CDC processing for the whole cluster. 
To ensure this does not happen, an explicit flush of all memtables is triggered at periodic intervals (typically 5-10 minutes) for nodes in the CDC datacenter. This ensures that a full commit log segment will only wait, at most, one flush interval before it can be processed. As only the CDC datacenter nodes are flushed, there’s no impact on client read performance in the other datacenters.</p><h6 id="filling-segments">Filling Segments</h6><p>Commit log segment sizes are fixed. If the tracked table has a slow write rate, it may be a while before a segment completely fills up. This fill-up time is bounded by a separate process (outside the CDC Publisher) that writes to a “filler” table at a predictable rate. This table is replicated only within the CDC datacenter, where it is fully replicated to all nodes. To limit any performance impact, fewer large writes (~100K) are performed, only a single key is written to, and the data is aggressively TTL’ed.</p><h3 id="processing-cdc">Processing CDC</h3><p>To aid with the processing of CDC commit log segments, the Cassandra library provides a handler interface for applications to implement. This interface allows processing of a stream of all mutations (writes) present in a commit log segment. The <em>Mutation</em> class is the Java object Cassandra uses to represent data, namely:</p><ul><li>A <em>Mutation</em> contains <em>PartitionUpdate</em> objects for multiple tables</li>
<li>A <em>PartitionUpdate</em> contains <em>Row</em> objects for a single partition key value</li>
<li>A <em>Row</em> contains data for a single clustering key value</li>
</ul><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-04-csource-part-1/mutations.jpeg" alt="Structure of a Cassandra Mutation" /><p class="subtle-text"><small>Structure of a Cassandra Mutation</small></p></div><p>The primary function of the CDC Publisher is to break these mutations up into individual PartitionUpdate objects. If a PartitionUpdate contains multiple rows, these are further broken down into a series of updates with single rows. Thus, each update contains data only for a single Cassandra primary key.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-04-csource-part-1/mutation-breakdown.jpeg" alt="Breakdown of a Mutation into Individual Row Objects" /><p class="subtle-text"><small>Breakdown of a Mutation into Individual Row Objects</small></p></div><p>Each of the resulting PartitionUpdate objects is serialized for publishing to Kafka streams. Serializers provided by the Cassandra library are used for serialization before publishing.</p><h3 id="publishing-to-kafka">Publishing to Kafka</h3><p>The PartitionUpdate payloads are used to build messages to publish to the intermediate Kafka stream. Each message includes:</p><ul><li>The serialized PartitionUpdate</li>
<li>The Cassandra messaging version used for serialization</li>
<li>Metadata for auditing (host, file, position, etc.)</li>
</ul><p>The messages are then published to table-specific Kafka streams. A stream can have multiple partitions for scalable publishing, in which case messages are routed to Kafka partitions based on the Cassandra partition key. Thus, all writes for a single partition key will end up in the same topic-partition.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-12-04-csource-part-1/cdc-publisher-pkey-partitioning.jpeg" alt="Publishing CDC to a Multi-Partition Kafka Topic" /><p class="subtle-text"><small>Publishing CDC to a Multi-Partition Kafka Topic</small></p></div><h4 id="intermediate-kafka-streams">Intermediate Kafka Streams</h4><p>The resulting Kafka streams contain all writes to the tracked Cassandra tables. As all updates to a primary key reside in the same topic partition, this sets an ordering of writes for each key.</p><p>While there’s no guarantee events will be in writetime order, there’s also no guarantee that writes will commit to a Cassandra replica in writetime order. Additionally, there will be a duplicate write copy for each data replica. Even though this is the case, the intermediate streams act as unified commit logs for the tables. They provide an order of events per key that can be deterministically processed into the ordered stream of row updates needed for publishing to the Data Pipeline.</p><h4 id="stream-consistency">Stream Consistency</h4><p>Given that the connector uses the Cassandra write path, the resulting Kafka stream can be no more consistent than the underlying datastore. As writes are published from each replica in their local commit order, the processed stream should initially be no less consistent than reading from a single replica. As data from additional replicas is processed, the stream becomes eventually consistent.
When all replicas have published updates, the consistency will be equivalent to a read covering all CDC datacenter nodes.</p><p>How quickly this eventual consistency is reached is determined by the write consistency level used by the Cassandra clients. If an update must show up in the stream immediately, a high consistency level (e.g., EACH_QUORUM) must be used to ensure commits to nodes in the CDC datacenter. If a lower/local consistency is used for writes, the PartitionUpdate may not appear in the output stream (in the worst case) until the next table repair. Note that this is in line with the guarantees given to clients reading Cassandra directly.</p><h2 id="whats-next">What’s Next?</h2><p>At this point, the intermediate Kafka streams contain Cassandra PartitionUpdate objects, partitioned by key and loosely ordered. These objects must now be deserialized, converted into ordered Data Pipeline messages, and published into the pipeline. This is done through the DP Materializer.</p><p>The DP Materializer will be covered in the second half of this two-part post. Stay tuned!</p><div class="post-gray-box">This post is part of a series covering Yelp's real-time streaming data infrastructure. Our series explores in-depth how we stream MySQL and Cassandra data in real time, how we automatically track &amp; migrate schemas, how we process and transform streams, and finally how we connect all of this into datastores like Redshift, Salesforce, and Elasticsearch.<p>Read the posts in the series:</p><ul><li><a title="Billions of Messages a Day - Yelp's Real-time Data Pipeline" href="https://engineeringblog.yelp.com/2016/07/billions-of-messages-a-day-yelps-real-time-data-pipeline.html">Billions of Messages a Day - Yelp's Real-time Data Pipeline</a></li>
<li><a title="Streaming MySQL tables in real-time to Kafka" href="https://engineeringblog.yelp.com/2016/08/streaming-mysql-tables-in-real-time-to-kafka.html">Streaming MySQL tables in real-time to Kafka</a></li>
<li><a title="More Than Just a Schema Store" href="https://engineeringblog.yelp.com/2016/08/more-than-just-a-schema-store.html">More Than Just a Schema Store</a></li>
<li><a title="PaaStorm: A Streaming Processor" href="https://engineeringblog.yelp.com/2016/08/paastorm-a-streaming-processor.html">PaaStorm: A Streaming Processor</a></li>
<li><a title="Data Pipeline: Salesforce Connector" href="https://engineeringblog.yelp.com/2016/09/data-pipeline-salesforce-connector.html">Data Pipeline: Salesforce Connector</a></li>
<li><a title="Streaming Messages from Kafka into Redshift in near Real-Time" href="https://engineeringblog.yelp.com/2016/10/redshift-connector.html">Streaming Messages from Kafka into Redshift in near Real-Time</a></li>
<li><a title="Open-Sourcing Yelp's Data Pipeline" href="https://engineeringblog.yelp.com/2016/11/open-sourcing-yelps-data-pipeline.html">Open-Sourcing Yelp's Data Pipeline</a></li>
<li><a title="Making 30x Performance Improvements on Yelp’s MySQLStreamer" href="https://engineeringblog.yelp.com/2018/02/making-30x-performance-improvements-on-yelps-mysqlstreamer.html">Making 30x Performance Improvements on Yelp’s MySQLStreamer</a></li>
<li><a title="Black-Box Auditing: Verifying End-to-End Replication Integrity between MySQL and Redshift" href="https://engineeringblog.yelp.com/2018/04/black-box-auditing.html">Black-Box Auditing: Verifying End-to-End Replication Integrity between MySQL and Redshift</a></li>
<li><a title="Fast Order Search Using Yelp’s Data Pipeline and Elasticsearch" href="https://engineeringblog.yelp.com/2018/06/fast-order-search.html">Fast Order Search Using Yelp’s Data Pipeline and Elasticsearch</a></li>
<li><a title="Joinery: A Tale of Un-Windowed Joins" href="https://engineeringblog.yelp.com/2018/12/joinery-a-tale-of-unwindowed-joins.html">Joinery: A Tale of Un-Windowed Joins</a></li>
<li><a title="Streaming Cassandra into Kafka in (Near) Real-Time: Part 1" href="https://engineeringblog.yelp.com/2019/12/cassandra-source-connector-part-1.html">Streaming Cassandra into Kafka in (Near) Real-Time: Part 1</a></li>
</ul></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2019/12/cassandra-source-connector-part-1.html</link>
      <guid>https://engineeringblog.yelp.com/2019/12/cassandra-source-connector-part-1.html</guid>
      <pubDate>Thu, 05 Dec 2019 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Organizing and Securing Third-Party CDN Assets at Yelp]]></title>
      <description><![CDATA[<p>At Yelp, we use a <a href="http://engineeringblog.yelp.com/2015/03/using-services-to-break-down-monoliths.html">service-oriented architecture</a> to serve our web pages. This consists of a lot of frontend services, each of which is responsible for serving different pages (e.g., the search page or a business listing page).</p><p>In these frontend services, we use a couple of third-party JavaScript/CSS assets (<a href="https://reactjs.org">React</a>, <a href="https://babeljs.io/docs/en/babel-polyfill">Babel polyfill</a>, etc.) to render our web pages. We chose to serve such assets using a third-party Content Delivery Network (CDN) for better performance.</p><p>In the past, if a frontend service needed to use a third-party JavaScript/CSS asset, engineers had to hard-code its CDN URL. For example:</p><div class="language-html highlighter-rouge highlight"><pre>&lt;script
  src="https://cdnjs.cloudflare.com/ajax/libs/jquery/1.8.3/jquery.min.js"
&gt;&lt;/script&gt;
</pre></div><p>With hundreds of engineers working at Yelp, it was difficult to ensure the following (for each third-party asset):</p><ul><li><code class="highlighter-rouge">&lt;script&gt;</code> or <code class="highlighter-rouge">&lt;link&gt;</code> tags had a subresource integrity checksum via the <code class="highlighter-rouge">integrity</code> attribute <em>(see the section on <a href="https://engineeringblog.yelp.com#subresource-integrity-checksums">Subresource integrity checksums</a> below)</em></li>
<li>URLs used the HTTPS protocol</li>
<li>Only public CDN providers (approved by our security team) were used</li>
<li>Engineers could update to the latest versions easily</li>
</ul><p>Here at Yelp, we’ve built our frontend services using a Python service stack, with <a href="https://trypyramid.com">Pyramid</a> as our web framework and <a href="https://uwsgi-docs.readthedocs.io/en/latest">uWSGI</a> as our web server.</p><p>We created a shared Python package, <code class="highlighter-rouge">cdn_assets</code>, for storing the URLs and subresource integrity checksums of our third-party JavaScript/CSS assets.</p><p>For each asset, we simply used a Python dictionary with the asset’s semantic version as the key. For example:</p><div class="language-py highlighter-rouge highlight"><pre># React (facebook.github.io/react)
CDN_SCRIPT_REACT = {
    '16.8.6': CDNAsset.construct_asset(
        cdn=CDNDomain.CDNJS,
        library='react',
        version='16.8.6',
        filename='umd/react.production.min',
        filename_unminified='umd/react.development',
        extension='js',
        integrity='sha384-qn+ML/QkkJxqn4LLs1zjaKxlTg2Bl/6yU/xBTJAgxkmNGc6kMZyeskAG0a7eJBR1',
        integrity_unminified='sha384-u6DTDagyAFm2JKvgGBO8jWd9YzrDzg6FuBPKWkKIg0/GVA6HM9UkSxH2rzxEJ5GF',
    ),
    '16.8.5': CDNAsset.construct_asset(
        # … similar properties for this version
    ),
    # … more versions…
}
# Babel Polyfill (babeljs.io/docs/usage/polyfill)
CDN_SCRIPT_BABEL_POLYFILL = {
    '6.23.0': CDNAsset.construct_asset(
        cdn=CDNDomain.CDNJS,
        library='babel-polyfill',
        version='6.23.0',
        filename='polyfill.min',
        filename_unminified='polyfill',
        extension='js',
        integrity='sha384-FbHUaR69a828hqWjPw4PFllFj1bvveKOTWORGkyosCw720HXy/56+2hSuQDaogMb',
        integrity_unminified='sha384-4L0QKU4TUZXBNNRtCIbt9G73L2fXYHnzgCjL65qwFxsXPvuAf1aB6D3X+LIflqu3',
    ),
    # … more versions…
}
# … more assets…
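# Illustrative helper (not part of the original package): with plain
# "X.Y.Z" version strings as dict keys, the newest entry can be selected
# numerically rather than lexicographically.
def latest_version(asset_versions):
    return max(asset_versions, key=lambda v: tuple(int(x) for x in v.split('.')))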
</pre></div><h2 id="usage">Usage</h2><p>Here’s a Python code snippet which shows how the asset is included in our <a href="https://github.com/Yelp/yelp_cheetah">Yelp-Cheetah</a> templates:</p><div class="language-py highlighter-rouge highlight"><pre>CDN_SCRIPT_REACT['16.8.6'].generate_script_tag(minified=True)
# returns &lt;script src="https://cdnjs.cloudflare.com/ajax/libs/react/16.8.6/umd/react.production.min.js" integrity="sha384-qn+ML/QkkJxqn4LLs1zjaKxlTg2Bl/6yU/xBTJAgxkmNGc6kMZyeskAG0a7eJBR1" crossorigin="anonymous"&gt;&lt;/script&gt;
</pre></div><h2 id="scaffolding-infrastructure">Scaffolding Infrastructure</h2><p>To facilitate ease of use and maintenance, we developed some scaffolding infrastructure to:</p><ul><li>Define public CDN providers (e.g., <a href="https://cdnjs.com/about">Cloudflare CDNJS</a>, <a href="https://developers.google.com/speed/libraries">Google CDN</a>, etc.)</li>
<li>Render minified scripts &amp; styles in the production environment and unminified scripts &amp; styles in the development environment</li>
<li>Create a helpful <code class="highlighter-rouge">generate_script_tag</code> method, which allows consumers of this package to easily generate an HTML <code class="highlighter-rouge">&lt;script&gt;</code> tag with the correct subresource integrity SHA <em>(see the section on <a href="https://engineeringblog.yelp.com#comparing-cryptographic-hash-functions">Comparing cryptographic hash functions</a> below)</em></li>
</ul><p>We made it easy for engineers to add a new version by creating a <a href="https://www.gnu.org/software/make"><code class="highlighter-rouge">make</code></a> target to calculate the integrity checksum, like so:</p><div class="language-sh highlighter-rouge highlight"><pre># Usage: make sri-hash --urls="URL1[ URL2 ... URLn]"
$ make sri-hash --urls="https://cdnjs.cloudflare.com/ajax/libs/react/16.8.6/umd/react.production.min.js"
sha384-qn+ML/QkkJxqn4LLs1zjaKxlTg2Bl/6yU/xBTJAgxkmNGc6kMZyeskAG0a7eJBR1
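
# Without the make target, the same checksum can be produced with standard
# tools (assumes curl and openssl are installed); prefix the output with
# "sha384-" to form the integrity value:
$ curl -s https://cdnjs.cloudflare.com/ajax/libs/react/16.8.6/umd/react.production.min.js \
    | openssl dgst -sha384 -binary | openssl base64 -A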
</pre></div><h2 id="testing">Testing</h2><p>We wrote tests which iterate over all versions of all assets to ensure that:</p><ul><li>URLs point to a valid asset on the CDN</li>
<li>Integrity SHA checksums are correct</li>
<li>URLs begin with <code class="highlighter-rouge">https://</code> and end with <code class="highlighter-rouge">.js</code> or <code class="highlighter-rouge">.css</code></li>
</ul><p>Here’s a snippet from one of our test files:</p><div class="language-py highlighter-rouge highlight"><pre>import base64
import hashlib

import pytest
import requests

# `all_cdn_scripts` is a Pytest fixture; it’s not shown in this snippet.
# (URLValidator, used below, comes from a URL-validation library such as Django’s.)
@pytest.mark.parametrize('script', all_cdn_scripts)
def test_integrity_hashes_match(script):
    # Test that the unminified URL doesn’t error and has the right integrity hash.
    resp = requests.get(script.url_unminified)
    resp.raise_for_status()
    assert (
        'sha384-{}'.format(base64.b64encode(hashlib.sha384(resp.content).digest()).decode('utf8')) ==
        script.integrity_unminified
    )
    # Test that the minified URL doesn’t error and has the right integrity hash.
    resp = requests.get(script.url)
    resp.raise_for_status()
    assert (
        'sha384-{}'.format(base64.b64encode(hashlib.sha384(resp.content).digest()).decode('utf8')) ==
        script.integrity
    )

def test_sha384_for_all_checksums(all_cdn_scripts):
    SHA384_CHECKSUM_LENGTH = 64
    for cdn_script in all_cdn_scripts:
        assert cdn_script.integrity.startswith('sha384-')
        assert cdn_script.integrity_unminified.startswith('sha384-')
        checksum = cdn_script.integrity.replace('sha384-', '')
        assert len(checksum) == SHA384_CHECKSUM_LENGTH
        checksum = cdn_script.integrity_unminified.replace('sha384-', '')
        assert len(checksum) == SHA384_CHECKSUM_LENGTH

def test_valid_https_urls(all_cdn_scripts):
    https_url_validator = URLValidator(schemes=['https'], message='HTTPS URL validation failed')
    for cdn_script in all_cdn_scripts:
        https_url_validator(cdn_script.url)

def test_valid_script_files(all_cdn_scripts):
    for cdn_script in all_cdn_scripts:
        assert cdn_script.url.endswith('.js')

def test_minified_and_unminified_urls(all_cdn_scripts):
    for cdn_script in all_cdn_scripts:
        assert cdn_script.url.endswith('.min.js')
        assert not cdn_script.url_unminified.endswith('.min.js')
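
# Hypothetical additional test (not in the original post) covering the
# "approved public CDN providers" requirement from the checklist above;
# the approved host set here is only an example.
def test_approved_cdn_providers(all_cdn_scripts):
    from urllib.parse import urlparse
    approved_hosts = {'cdnjs.cloudflare.com', 'ajax.googleapis.com'}
    for cdn_script in all_cdn_scripts:
        assert urlparse(cdn_script.url).netloc in approved_hosts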
</pre></div><p>Yelp serves tens of millions of users every month. Ensuring that these users are protected should an attacker gain control of the CDN we’re using is of prime importance. That’s where subresource integrity checksums come into the picture.</p><h2 id="subresource-integrity-checksums">Subresource Integrity Checksums</h2><p>The <a href="https://developer.mozilla.org/docs/Web">web docs on Mozilla Developer Network</a> define <a href="https://developer.mozilla.org/docs/Web/Security/Subresource_Integrity">Subresource Integrity</a> as:</p><blockquote>
<p>A security feature that enables browsers to verify that resources they fetch (for example, from a CDN) are delivered without unexpected manipulation. It works by allowing you to provide a cryptographic hash that a fetched resource must match.</p>
</blockquote><p>Support for subresource integrity checksum verification is achieved by adding an <a href="https://developer.mozilla.org/en-US/docs/Web/HTML/Element/script#attr-integrity"><code class="highlighter-rouge">integrity</code></a> attribute on the <code class="highlighter-rouge">&lt;script&gt;</code> or <code class="highlighter-rouge">&lt;link&gt;</code> tags. For example:</p><div class="language-html highlighter-rouge highlight"><pre>&lt;script
  src="https://cdnjs.cloudflare.com/ajax/libs/react/16.8.6/umd/react.production.min.js"
  integrity="sha384-qn+ML/QkkJxqn4LLs1zjaKxlTg2Bl/6yU/xBTJAgxkmNGc6kMZyeskAG0a7eJBR1"
  crossorigin="anonymous"
&gt;&lt;/script&gt;
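
&lt;!-- The integrity attribute works the same way on stylesheets loaded via a
     link tag; the href and hash below are placeholders, not real values: --&gt;
&lt;link
  rel="stylesheet"
  href="https://cdnjs.cloudflare.com/ajax/libs/EXAMPLE/1.0.0/example.min.css"
  integrity="sha384-EXAMPLE"
  crossorigin="anonymous"
/&gt;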
</pre></div><p>The web browser will calculate a hash from the contents of the <code class="highlighter-rouge">&lt;script&gt;</code> or <code class="highlighter-rouge">&lt;link&gt;</code> tag. It will then compare this hash with the <code class="highlighter-rouge">integrity</code> attribute’s value. If they don’t match, the browser will stop the <code class="highlighter-rouge">&lt;script&gt;</code> or <code class="highlighter-rouge">&lt;link&gt;</code> tag from executing.</p><p>As per the <a href="https://www.w3.org/TR/SRI/#cryptographic-hash-functions">Subresource Integrity (SRI) specification</a>:</p><blockquote>
<p>Conformant user agents must support the SHA-256, SHA-384 and SHA-512 cryptographic hash functions for use as part of a request’s integrity metadata and may support additional hash functions.</p>
</blockquote><p>Although both SHA-256 and SHA-512 are supported, we recommend using the SHA-384 cryptographic hash function for the integrity attribute. This is largely because SHA-384 is <a href="https://en.wikipedia.org/wiki/SHA-2#cite_note-9">less susceptible</a> to <a href="https://en.wikipedia.org/wiki/Length_extension_attack">length extension attacks</a>. (See <a href="https://github.com/w3c/webappsec/issues/477">github.com/w3c/webappsec — SRI: upgrade examples to sha384?</a> and <a href="https://github.com/mozilla/srihash.org/issues/155">github.com/mozilla/srihash.org — Why SHA384?</a> for further information.)</p><h2 id="always-using-https-for-loading-cdn-assets">Always Using HTTPS for Loading CDN Assets</h2><p>At Yelp, we’ve migrated web traffic to be served exclusively using <a href="https://en.wikipedia.org/wiki/HTTPS">HTTPS</a> and <a href="https://en.wikipedia.org/wiki/HTTP_Strict_Transport_Security">HSTS</a>. If you’re interested in learning more, check out these excellent blog posts by my colleagues: <a href="https://engineeringblog.yelp.com/2016/09/great-https-migration.html">The Great HTTPS Migration</a> and <a href="https://engineeringblog.yelp.com/2017/09/the-road-to-hsts.html">The Road To HSTS</a>.</p><h3 id="protocol-relative-urls">Protocol Relative URLs</h3><p>It’s recommended to use HTTPS while serving CDN assets instead of protocol-relative URLs. Quoting the article <a href="https://www.paulirish.com/2010/the-protocol-relative-url">“The Protocol-relative URL”</a> by <a href="https://www.paulirish.com">Paul Irish</a>:</p><blockquote>
<p>Now that SSL is <a href="https://www.eff.org/encrypt-the-web-report">encouraged for everyone</a> and <a href="https://istlsfastyet.com">doesn’t have performance concerns</a>, this technique is now an anti-pattern. If the asset you need is available on SSL, then always use the https:// asset. Allowing the snippet to request over HTTP opens the door for attacks like the <a href="http://www.netresec.com/?page=Blog&amp;month=2015-03&amp;post=China%27s-Man-on-the-Side-Attack-on-GitHub">recent Github Man-on-the-side attack</a>. It’s always safe to request HTTPS assets even if your site is on HTTP, however the reverse is not true. More guidance and details in <a href="https://github.com/konklone/cdns-to-https#conclusion-cdns-should-redirect-to-https">Eric Mills’ guide to CDNs &amp; HTTPS</a> and <a href="https://www.digitalgov.gov/2015/08/14/secure-central-hosting-for-the-digital-analytics-program">digitalgov.gov’s writeup on secure analytics hosting</a>.</p>
</blockquote><p>The work described in this blog post has been carried out and supported by numerous members of the Engineering Team here at Yelp. Particular credit goes to engineers on our Core Web Infrastructure (Webcore) team.</p><div class="island job-posting"><h3>Become a Software Engineer at Yelp</h3><p>Want to help us make even better tools for our full stack engineers?</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/bd07a618-9b6f-4920-91c6-99280f1b268d?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2019/11/organizing-and-securing-third-party-cdn-assets-at-yelp.html</link>
      <guid>https://engineeringblog.yelp.com/2019/11/organizing-and-securing-third-party-cdn-assets-at-yelp.html</guid>
      <pubDate>Wed, 20 Nov 2019 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Remember Clusterman? Now It's Open-Source, and Supports Kubernetes Too!]]></title>
      <description><![CDATA[<p>Earlier this year, I wrote a <a href="https://engineeringblog.yelp.com/2019/02/autoscaling-mesos-clusters-with-clusterman.html">blog post</a> showing off some cool features of our in-house compute cluster autoscaler, Clusterman (our Cluster Manager). This time, I’m back with two announcements that I’m really excited about! Firstly, in the last few months, we’ve added another supported backend to Clusterman; so not only can it scale Mesos clusters, it can also scale Kubernetes clusters. Second, Clusterman is now open-source on <a href="https://github.com/Yelp/clusterman">GitHub</a> so that you, too, can benefit from advanced autoscaling techniques for your compute clusters. If you prefer to just read the code, you can head there now to find some examples and documentation on how to use it; and if you’d like to know a bit more about the new features and why we’ve built them, read on!</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/open-source-clusterman/clusterman_logo.png" alt="" /></div><h2 id="going-from-mesos-to-kubernetes">Going from Mesos to Kubernetes</h2><p>Over the last five years, we’ve <a href="https://www.youtube.com/watch?v=tXbLMRhLQQE">talked</a> (and <a href="https://engineeringblog.yelp.com/2015/11/introducing-paasta-an-open-platform-as-a-service.html">written</a>) a lot about our compute stack at Yelp; we’ve gone from our monolithic <code class="highlighter-rouge">yelp_main</code> repo to a fully-distributed, service-oriented architecture running in the cloud on top of Apache Mesos and our in-house platform-as-a-service, <a href="https://github.com/Yelp/paasta">PaaSTA</a>. And, truthfully, without that move, we wouldn’t have been able to grow to the scale that we are now. 
We’ve been hard at work this year preparing our infrastructure for even more growth, and realized that the best way to achieve this is to move away from Mesos and onto Kubernetes.</p><p>Kubernetes allows us to run workloads (Flink, Cassandra, Spark, and Kafka, among others) that were once difficult to manage under Mesos (due to local state requirements). We strongly believe that managing these workloads under a common platform (PaaSTA) will boost our infrastructure engineers’ output by an order of magnitude (can you imagine spinning up a new Cassandra cluster with just a few lines of YAML? We can!).</p><p>In addition, we’re migrating all of our existing microservices and batch workloads onto Kubernetes. This was a point of discussion at Yelp, but we eventually settled on this approach as both a way to reduce the overhead of maintaining two competing schedulers (Mesos and Kubernetes), and to take advantage of the fast-moving Kubernetes ecosystem. Thanks to the abstractions that PaaSTA provides, we’ve been able to do this migration seamlessly! Our feature developers don’t know their service is running on top of an entirely different compute platform.</p><p>Of course, to make this migration possible, we need to build support for Kubernetes into all our tooling around our compute clusters, including our very important autoscaler, Clusterman. Due to Clusterman’s modular design, this was easy! We simply defined a new connector class that conforms to the interface the autoscaler expects. This connector knows how to talk to the Kubernetes API server to retrieve metrics and statistics about the state of the Kubernetes cluster it’s scaling. These metrics are then saved in our metrics data store and fed to the signals and autoscaling engine to determine how to add or remove compute resources.</p><h2 id="why-clusterman--why-now">Why Clusterman?
Why Now?</h2><p>We’re big proponents of open-source software at Yelp; we benefit from the efforts of many other open-source projects and release what we can back into the community. Ever since Clusterman’s inception, we’ve had the dream of open-sourcing it, and now that it has support for Kubernetes, there’s no better time to do so!</p><p>Whenever a project like this is released, the first question people ask is, “Why should I use your product instead of this other, established one?” Two such products are the <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-fleet-automatic-scaling.html">AWS Auto Scaling for Spot Fleet</a> and the <a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler">Kubernetes Cluster Autoscaler</a>. So let’s compare and contrast Clusterman with them:</p><table><thead><tr><th class="c1">Clusterman</th>
<th class="c1">Auto Scaling for Spot Fleet</th>
<th class="c1">Kubernetes Cluster Autoscaler</th>
</tr></thead><tbody><tr><td class="c2"><em>Supports any type of cloud resource (ASGs, spot fleets, etc)</em></td>
<td class="c2">Only for Spot Fleets</td>
<td class="c2">Only supports homogeneous cloud resources (all compute resources must be identical)</td>
</tr><tr><td class="c2"><em>Pluggable signal architecture</em></td>
<td class="c2">Three different scaling choices: target tracking, step functions, or time-based</td>
<td class="c2">Scales the cluster when pods are waiting to be scheduled</td>
</tr><tr><td class="c2"><em>Can proactively autoscale to account for delays in node bootstrapping time</em></td>
<td class="c2">No proactive scaling</td>
<td class="c2">Waits for nodes to join the cluster before continuing</td>
</tr><tr><td class="c2">Basic Kubernetes support</td>
<td class="c2">No knowledge of Kubernetes</td>
<td class="c2"><em>Supports advanced features like node and pod affinity</em></td>
</tr><tr><td class="c2"><em>Can simulate autoscaling decisions on production data</em></td>
<td class="c2">No simulator</td>
<td class="c2">No simulator</td>
</tr><tr><td class="c2"><em>Extensible (open-source)</em></td>
<td class="c2">Closed-source API</td>
<td class="c2"><em>Extensible (open-source)</em></td>
</tr></tbody></table><p>A few highlights we’d like to call out: firstly, note that Clusterman is the only autoscaler that can support a mixture of cloud resources (Spot Fleets, Auto-Scaling Groups, etc.); it can even handle such a mixture within a single cluster! This allows for a very flexible infrastructure design.</p><p>Moreover, Clusterman’s pluggable signal architecture lets you write any type of scaling signal you can imagine (and write in code). At Yelp, we generally believe that the Kubernetes Cluster Autoscaler approach (scale up when pods are waiting) is right for “most use cases,” but having the flexibility to create more complex autoscaling behavior is really important to us. One example of how we’ve benefitted from this capability is Jolt, an internal tool for running unit and integration tests. The Jolt cluster runs millions of tests every day, and has a very predictable workload; thus, we wrote a custom signal that allows us to scale up and down before pods get queued up in the “waiting” state, which saves our developers a ton of time running tests! To put it another way, the Kubernetes Cluster Autoscaler is reactive, but Clusterman has enough flexibility to be proactive and scale up before resources are required.</p><p>To be fair, not everyone needs the ability to make complex autoscaling decisions; many users will be just fine using something like the AWS Spot Fleet Autoscaler or Kubernetes Cluster Autoscaler. Fortunately for these users, Clusterman can be easily swapped in as needed. For example, it can be configured to read all of the same node labels that the Kubernetes Cluster Autoscaler does, and behave appropriately. Also note that the Kubernetes Cluster Autoscaler does support some Kubernetes features that Clusterman doesn’t (yet) know about, like pod affinity and anti-affinity.
But we’re constantly adding new features to Clusterman, and of course, pull requests are always welcome!</p><h2 id="want-to-know-more">Want to Know More?</h2><p>If you’re as excited as we are about this release, we encourage you to head over to our <a href="https://github.com/Yelp/clusterman">GitHub</a> and check it out! Give it a star if you like it, and if you have any questions about getting Clusterman set up in your environment, feel free to open an issue or send us an email! Also, we’d love to hear any success stories you have about autoscaling with Clusterman, or Kubernetes in general; you can reach us on Twitter (<a href="https://twitter.com/YelpEngineering">@YelpEngineering</a>) or on Facebook (<a href="https://www.facebook.com/pg/yelpengineers/photos/">@yelpengineers</a>).</p><hr /><p>David is going to be at KubeCon 2019 and will happily talk your ear off about Clusterman and Kubernetes; ping him on <a href="https://twitter.com/drmorr0">Twitter</a> or find him in the hallway track.</p><hr /><div class="island job-posting"><h3>Become an Infrastructure Engineer at Yelp</h3><p>Want to work on exciting projects like Clusterman? Apply here!</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/7f3e2412-3736-473e-95ff-5d11a9190080?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2019/11/open-source-clusterman.html</link>
      <guid>https://engineeringblog.yelp.com/2019/11/open-source-clusterman.html</guid>
      <pubDate>Mon, 11 Nov 2019 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Inside TensorFlow]]></title>
      <description><![CDATA[<p>It’s probably not surprising that Yelp utilizes deep neural networks in its quest to connect people with great local businesses. One example is the selection of photos you see in the Yelp app and website, where neural networks try to identify the best quality photos for the business displayed. A crucial component of our deep learning stack is <a href="https://www.tensorflow.org/">TensorFlow</a> (TF). In the process of deploying TF to production, we’ve learned a few things that may not be commonly known in the Data Science community.</p><p>TensorFlow’s success stems not only from its popularity within the machine learning domain, but also from its design. It’s very well-written and has been extensively tested and documented (you can read the documentation offline by simply cloning its <a href="https://github.com/tensorflow/docs">repository</a>). You don’t have to be a machine learning expert to enjoy reading it, and even experienced software engineers can learn a thing or two from it.</p><h2 id="building-tensorflow">Building TensorFlow</h2><p>You can start using TF without the extra build steps by installing the Python package from <a href="https://pypi.org/project/tensorflow/">pypi.org</a>. Doing it this way is straightforward, but also means you won’t have access to any optimization features. Here’s an example of what this can look like in practice:</p><div class="language-bash highlighter-rouge highlight"><pre>$ python3 -c 'import tensorflow as tf; tf.Session().list_devices()' 2&gt;&amp;1 | grep -oE 'Your CPU .*'
Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
</pre></div><p>If you want to hack TF (the second part of this post explains how), then in order to test your changes, you’ll have to build the package yourself. So, assuming you’re interested in building TF for your own requirements, or perhaps with your own code changes, here’s a compilation of hints on how to make it a relatively painless experience. <em>Note: this is not a step-by-step recipe; obvious points (like “copy the <a href="http://github.com/tensorflow/tensorflow">sources</a>”, and “read the <a href="https://www.tensorflow.org/install/source">documentation</a>”) are not included!</em></p><p>We recommend building TensorFlow inside containers like <a href="https://docs.docker.com">Docker</a> or <a href="https://podman.io">Podman</a>. The TF project uses Docker for both continuous integration and <a href="https://hub.docker.com/r/tensorflow/tensorflow">official images</a>. You’ll find Dockerfiles and documentation for the latter in the <code class="highlighter-rouge">tensorflow/tools/dockerfiles</code> directory. However, it is the Continuous Integration (CI) setup that is of more interest in the context of building TF, so make sure to read <code class="highlighter-rouge">tensorflow/tools/ci_build/README.md</code> and check out other files in this directory. Using containers to build TF makes it easier to consistently install all required packages and helps ensure the builds are reproducible (a critical requirement of CI).</p><p>A major required package for building TF is the <a href="https://bazel.build">Bazel Build system</a> (it’s possible, but not recommended, to use make instead of Bazel; for instructions see <code class="highlighter-rouge">tensorflow/contrib/make/README.md</code>). In addition to Bazel, other TF dependencies can be found inside the <code class="highlighter-rouge">configure.py</code> script (in the project root directory). 
TF also depends on a number of Python packages, all of which are listed inside the <code class="highlighter-rouge">tensorflow/tools/pip_package/setup.py</code> file (look for <code class="highlighter-rouge">REQUIRED_PACKAGES</code>). Important among those is NumPy, which may require you to install an extra package in the operating system, such as the <code class="highlighter-rouge">libatlas3-base</code> package for Ubuntu users. Additionally, if you want to build TF for GPU, you’ll need either CUDA with cuDNN (for NVIDIA) or ROCm (for AMD, which we have not tried) installed inside your container. The simplest way to ensure that all CUDA dependencies are present is to use the <a href="https://hub.docker.com/r/nvidia/cuda">official nvidia images</a> as your container base, as demonstrated in the <code class="highlighter-rouge">tensorflow/tools/ci_build/Dockerfile.gpu</code> file.</p><p>You’ll need to execute <code class="highlighter-rouge">configure.py</code> before the actual build. The script will ask many questions, such as “Please specify which C compiler should be used.” For a scripted build, the answer to all questions can be automated with “<code class="highlighter-rouge">yes |</code>” (as demonstrated in <code class="highlighter-rouge">tensorflow/tools/ci_build/builds/configured</code>). Also, if you read the <code class="highlighter-rouge">configure.py</code> source, you’ll quickly discover that individual questions can be suppressed with environment variables, such as <code class="highlighter-rouge">HOST_C_COMPILER</code>. Among these, a very useful variable is <code class="highlighter-rouge">CC_OPT_FLAGS</code>, which by default contains “<code class="highlighter-rouge">-march=native -Wno-sign-compare</code>”. If you want to use the resulting package on a CPU model different from the one where you run your build, you should replace “native” with a <a href="https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#x86-Options">more appropriate value</a>. 
The output of <code class="highlighter-rouge">configure.py</code> is the <code class="highlighter-rouge">.tf_configure.bazelrc</code> file, which you may want to look into.</p><p>After the initial configuration step, you’ll need to run “<code class="highlighter-rouge">bazel build</code>” with options to build TF binaries (but not its Python wheel - yet!). The selection of <a href="https://www.tensorflow.org/install/source#build_the_pip_package">Bazel options</a> can be a little tricky, but the script <code class="highlighter-rouge">tensorflow/tools/ci_build/ci_build.sh</code> may give you some ideas. The build typically takes between 30–60 minutes (or longer when CUDA is enabled) on 40 CPUs - it is quite a large project! After this step is completed, you still need to build the Python wheels. As explained in the documentation, this step is actually performed by the “<code class="highlighter-rouge">build_pip_package</code>” binary you’ve just built!</p><p>Here’s an example of what the above steps may look like in a Dockerfile:</p><div class="language-dockerfile highlighter-rouge highlight"><pre>RUN curl -L https://github.com/bazelbuild/bazel/releases/download/${BAZEL_VER}/bazel-${BAZEL_VER}-installer-linux-x86_64.sh --output bazel.sh &amp;&amp;
    bash bazel.sh --prefix=/opt/bazel &amp;&amp;
    rm bazel.sh
ENV PATH ${PATH}:/opt/bazel/bin
RUN curl -L https://github.com/tensorflow/tensorflow/archive/${VERSION}.tar.gz | tar xz --strip-components=1
ENV TF_NEED_CUDA 0
ENV CC_OPT_FLAGS -mtune=intel -march=haswell -Wno-sign-compare
RUN tensorflow/tools/ci_build/builds/configured CPU
RUN cat .tf_configure.bazelrc
RUN bazel build --config=opt  //tensorflow/tools/pip_package:build_pip_package
RUN bazel-bin/tensorflow/tools/pip_package/build_pip_package /tensorflow
</pre></div><p>This of course implies that you’ll want to actually build TF with a “<code class="highlighter-rouge">docker build</code>”. This may seem counterintuitive at first (running Bazel in the context of “<code class="highlighter-rouge">docker run</code>” will be a more natural choice to some, and in fact will be required for the incremental build), but is actually quite useful as it lets you re-run the build very quickly if no changes have been made, and you don’t have to worry about the build directory. Just remember to “<code class="highlighter-rouge">docker run</code>” with the <code class="highlighter-rouge">--user</code> option to copy your Python wheels out of the container image afterwards.</p><h2 id="tensorflow-project-structure">TensorFlow project structure</h2><p>There are two important top-level directories in the TF project: <code class="highlighter-rouge">tensorflow</code> and <code class="highlighter-rouge">third_party</code>. The latter contains TF dependencies (which you may want to check out). While the list is rather extensive and some third-party libraries can alternatively be brought in as system dependencies (you may see them inside <code class="highlighter-rouge">third_party/systemlibs/syslibs_configure.bzl</code>), our focus is going to be on the <code class="highlighter-rouge">tensorflow</code> directory. It may not be immediately apparent, but most of the TF functionality is, at the lowest level, implemented in C++. This is what the <code class="highlighter-rouge">tensorflow/core</code> directory is for. Next, this low-level functionality is exported as a public API to various programming languages inside directories named after each language. Most TF users are familiar with the Python API inside the <code class="highlighter-rouge">tensorflow/python</code> directory, but there are also subdirectories for C, C++, Java and Go. 
Knowing your way around the Python subdirectory can help you find useful pieces of information without the need to seek external documentation. For example, to find the constants used by selu activation, you can look in <code class="highlighter-rouge">tensorflow/python/keras/activations.py</code>. Another useful Python subdirectory is <code class="highlighter-rouge">debug</code>. If you’ve ever wondered what the computation graph of your deep learning model looks like, then file <code class="highlighter-rouge">tensorflow/python/debug/README.md</code> is a good start. There are also some very useful tools inside the (you guessed it!) <code class="highlighter-rouge">tensorflow/python/tools</code> directory.</p><p>Some C++ functions are imported by Python with the <a href="http://www.swig.org/tutorial.html">SWIG</a> file <code class="highlighter-rouge">tensorflow/python/tensorflow.i</code>, which in turn includes <code class="highlighter-rouge">*.i</code> files in various subdirectories. As you’ll see, most of these files have an accompanying <code class="highlighter-rouge">*.cc</code> with implementation, which in turn include headers from the <code class="highlighter-rouge">tensorflow/core</code> directory (and also from the <code class="highlighter-rouge">tensorflow/c</code> public API directory). However, SWIG is only used for low-level functions, and TF focuses mostly on high-level operations. These are coded and registered in the <code class="highlighter-rouge">tensorflow/core</code> directory as so-called “ops” (look for <code class="highlighter-rouge">REGISTER_OP</code> macro; the majority of ops are inside the <code class="highlighter-rouge">ops</code> subdirectory). Ops are imported by language APIs using their name. 
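<p>As a rough illustration of how a registered op name maps to its Python wrapper, here is a sketch of the CamelCase-to-snake_case conversion. TensorFlow’s real converter lives in its code-generation tooling and handles more edge cases (such as runs of capitals), so treat this as an approximation:</p>

```python
import re

def op_to_python_name(op_name):
    """Approximate the Python wrapper name for a registered op,
    e.g. ApplyGradientDescent -> apply_gradient_descent."""
    # Insert an underscore at every lowercase/digit -> uppercase boundary,
    # then lowercase the whole name.
    with_underscores = re.sub(r'([a-z0-9])([A-Z])', r'\1_\2', op_name)
    return with_underscores.lower()

print(op_to_python_name('ApplyGradientDescent'))  # apply_gradient_descent
```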
Note that in Python, the spelling of each op is changed, replacing CamelCase with snake_case (for example, <code class="highlighter-rouge">ApplyGradientDescent</code> from <code class="highlighter-rouge">tensorflow/core/ops/training_ops.cc</code> is imported inside <code class="highlighter-rouge">tensorflow/python/training/gradient_descent.py</code> as <code class="highlighter-rouge">apply_gradient_descent</code>). Other language APIs refer to ops using the original CamelCase names.</p><p>The C++ implementation of each op is coded in the so-called “kernel” (there can be separate kernels for CPU and GPU as demonstrated in <code class="highlighter-rouge">tensorflow/core/kernels/fact_op.cc</code>), which is then mapped to an op with a <code class="highlighter-rouge">REGISTER_KERNEL_BUILDER</code> macro. Most kernels reside inside the <code class="highlighter-rouge">tensorflow/core/kernels</code> directory. For example, <code class="highlighter-rouge">ApplyGradientDescent</code> is implemented in <code class="highlighter-rouge">tensorflow/core/kernels/training_ops.cc</code>. Unit tests for kernels are written in Python and reside either inside the <code class="highlighter-rouge">tensorflow/python/kernel_tests</code> directory or next to their Python API wrapper, in “*_test.py” files. For example, unit tests for <code class="highlighter-rouge">ApplyGradientDescent</code> are coded in <code class="highlighter-rouge">tensorflow/python/training/training_ops_test.py</code>.</p><p>A complete list of ops is available in two locations: the <code class="highlighter-rouge">tensorflow/core/api_def</code> directory and the <code class="highlighter-rouge">tensorflow/core/ops/ops.pbtxt</code> file. As you can see, TF defines a considerable number of ops which explains the large size of its binary. When building TF, you can minimize its size by enabling only selected ops. 
This is documented inside the <code class="highlighter-rouge">tensorflow/core/framework/selective_registration.h</code> file (note, this is an experimental feature). Interestingly, you don’t need to maintain a fork of TF if you want to add your own custom ops. Instead, TF’s design allows for an external project to extend TF with new functionality. This is demonstrated in the <a href="https://github.com/tensorflow/addons/">TensorFlow Addons project</a>.</p><p>Finally, you may want to check the content of the <code class="highlighter-rouge">tensorflow/core/platform</code> directory. There, you can find files not specific to TensorFlow, but rather low-level operating-system or network-protocol functionality. Files shared by all platforms reside in this directory, but there are also several platform-specific subdirectories. For example, if you’re troubleshooting an S3-related issue, there’s an “<code class="highlighter-rouge">S3</code>” subdirectory to help you. This code is very well-written and potentially useful outside of the TF project (but please do check the license first!). For a high-level overview of the TensorFlow architecture, we recommend you check the official <a href="https://www.tensorflow.org/guide/extend/architecture">documentation</a>.</p><p>We hope you’ll find this collection of hints useful when playing with TensorFlow or deploying it in your machine learning workflow!</p><h3 id="note">Note</h3><p><em>Neither Yelp nor the author of this post is affiliated with Google or TensorFlow authors.</em></p><div class="island job-posting"><h3>Become a Machine Learning Engineer at Yelp</h3><p>Want to build state of the art machine learning systems at Yelp? 
Apply to become a Machine Learning Engineer today.</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/f674cef9-b635-4f25-8dd9-66663494392a?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2019/11/inside-tensorflow.html</link>
      <guid>https://engineeringblog.yelp.com/2019/11/inside-tensorflow.html</guid>
      <pubDate>Fri, 08 Nov 2019 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Winning the Hackathon with Sourcegraph]]></title>
      <description><![CDATA[<p><em>Visualizing how code is used across the organization is a vital part of our engineers’ day-to-day workflow - and we have a *lot* of code to search through! This blog post details our journey of adopting Sourcegraph at Yelp to help our engineers maintain and dig through the tens of gigabytes of data in our git repos!</em></p><hr /><p>Here at Yelp, we maintain hundreds of internal services and libraries that power our website and mobile apps. Examples include our mission-critical “<em>emoji service</em>” which helps translate and localize emojis, as well as our “<em>homepage service</em>” which… you guessed it, serves our venerable homepage, yelp.com!</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-11-01-winning-the-hackathon-with-sourcegraph/yelp_homepage.jpg" alt="Yelp homepage" /><p class="subtle-text"><small>Yelp homepage</small></p></div><h3 id="dont-break-the-website">Don’t Break the Website</h3><p>Imagine you’re a developer tasked with implementing an exciting new feature. Perhaps you need to change the interface of the “<code class="highlighter-rouge">getBusinesses</code>” API endpoint to power a dedicated <em>Find Desserts Near Me</em> button on the homepage. “Piece of cake!” you say to yourself, as you add new parameters to alter the response of the shared resource. In order to not break <em>the rest</em> of the website though, you figure it’s best to see where other code is calling this endpoint so you can create a design that works for all use cases and doesn’t break existing call sites.</p><p>We have over 100,000 Python files alone to power Yelp - that’s a lot of code to search through! In order to figure out a safe rollout plan, we need to scan through all of our existing code to understand where and how the method is being called across multiple git repositories. So how can we do this?</p><p>Combined, our git repositories amount to tens of gigabytes of data. 
So cloning everything down locally whenever you want to perform a search is not a viable solution. Instead, we do this in the background as a scheduled process on a subset of our development machines, powered by <a href="https://github.com/asottile/all-repos">all-repos</a>. Some folks use this workflow, stringing together xargs and git grep, etc. into many homegrown bash scripts. A web interface (historically cgits and opengrok) is generally a more convenient go-to tool for browsing and searching code.</p><p>Tools like this are essential to our workflow. And since we’re always on the lookout for ways we can improve the developer experience at Yelp, we want the best-in-class tool for the job!</p><p>We first heard about <a href="https://about.sourcegraph.com/">Sourcegraph</a> at a React meetup hosted at Yelp. There was a discussion around how different companies view and search code, and Sourcegraph was introduced as an interesting-looking new search tool. One of the participants pulled up sourcegraph.com to demonstrate its capabilities. We tried a couple of searches using the repo and file regex filters and jumped around the codebase using the Jump to Definition feature. Coming from other tools and homegrown scripts, this was a huge step up in the developer experience! It stood out as a clear win on that front, and we decided to look into it some more and see how we could maybe bring Sourcegraph to Yelp.</p><p>We validated the idea to see if it was worth pursuing by first setting it up locally. Sourcegraph is conveniently distributed as a docker image, so we were able to get a proof-of-concept running quickly and share it out with a small group of people. The feedback was positive! 
After using it regularly for a few weeks, we felt that the code browsing experience had been improved significantly and we pushed on to try and roll it out to the rest of Yelp!</p><h2 id="productionizing-sourcegraph">Productionizing Sourcegraph</h2><p>At Yelp, we run a biannual <a href="https://engineeringblog.yelp.com/2018/11/all-about-yelp-hackathon.html">Hackathon</a> – an opportunity for engineers to “scratch their creative itch” on projects outside of their day-to-day work. It was during one of these Hackathons that we started to productionize Sourcegraph at Yelp - which meant graduating the Sourcegraph instance from running on a local machine to being deployed on our PaaS platform, <a href="https://engineeringblog.yelp.com/amp/2015/11/introducing-paasta-an-open-platform-as-a-service.html">PaaSTA</a>. By the end of the three days, we had Sourcegraph ready for the whole company to try out.</p><p>The feedback was great, and Sourcegraph was well received. We even won an award!</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-11-01-winning-the-hackathon-with-sourcegraph/award.jpg" alt="A coveted Hackathon trophy" /><p class="subtle-text"><small>A coveted Hackathon trophy</small></p></div><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-11-01-winning-the-hackathon-with-sourcegraph/demo.jpg" alt="Showing off Sourcegraph to Yelpers at the Hackathon “Science Fair”" /><p class="subtle-text"><small>Showing off Sourcegraph to Yelpers at the Hackathon “Science Fair”</small></p></div><p>Once Sourcegraph was up and running at Yelp, we had to decide whether we wanted to invest more in the product to get features such as Code Intelligence. To come to this decision, we surveyed developers on how they liked Sourcegraph compared to other code search/viewing tools we were using, and the results heavily favored Sourcegraph. 
<strong>70% of developers rated Sourcegraph as very good, and 51% of developers were already using Sourcegraph exclusively as their preferred code analysis tool.</strong> As a result of this feedback, we decided to make Sourcegraph the singular supported tool at Yelp for code search and viewing!</p><h2 id="shipping-code-faster-with-sourcegraph">Shipping Code Faster with Sourcegraph</h2><p>Sourcegraph empowers developers at Yelp to ship code faster and more reliably than ever before. <a href="https://docs.sourcegraph.com/user/code_intelligence">Code intelligence</a> features such as Go-to-Definition and Find References are heavily used and enable developers to understand the plethora of microservices and libraries in our code base. When making large changes, Sourcegraph is the way to discover how your code is being called throughout the rest of the code base. Sourcegraph has also been helpful for onboarding new hires and introducing them to the code base.</p><p>Sourcegraph has proven to be one of the most useful tools for making mass code migrations and deprecations. A quick search can help scope out the magnitude of the change and the difficulty of implementing it, while also providing an easy way to track the progress of long-running migrations and deprecations.</p><p>Sourcegraph’s GraphQL API has also proved to be useful for tooling we have built in-house. Developers at Yelp have used the Sourcegraph API to power services such as our internal npm registry and flaky test analysis engine, both of which heavily utilize source control metadata.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-11-01-winning-the-hackathon-with-sourcegraph/stats.jpg" alt="Daily active users of Sourcegraph at Yelp" /><p class="subtle-text"><small>Daily active users of Sourcegraph at Yelp</small></p></div><h2 id="future-work">Future Work</h2><p>We are evaluating running Sourcegraph as a clustered deployment. 
While we are currently able to serve all Sourcegraph usage on a single host, we are looking into running all of Sourcegraph’s different services individually. This would allow us to scale up more resource-intensive instances of Sourcegraph’s services. We are planning to put it on Kubernetes, an initiative that is underway for a lot of Yelp’s infrastructure.</p><h2 id="written-by">Written By</h2><ul><li>Mark Larah, Software Engineer (<a href="https://twitter.com/mark_larah">@mark_larah</a>)</li>
<li>Dennis Coldwell, Engineering Manager</li>
<li>Kevin Chen, Software Engineer</li>
</ul><div class="island job-posting"><h3>Become an Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp. If you're interested, apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/a0fc4d3d-1fd2-495b-94d4-cc2ed1d80cf3?description=Software-Engineer-New-Grad-Backend_College-Engineering-Product_San-Francisco-CA?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2019/11/winning-the-hackathon-with-sourcegraph.html</link>
      <guid>https://engineeringblog.yelp.com/2019/11/winning-the-hackathon-with-sourcegraph.html</guid>
      <pubDate>Fri, 01 Nov 2019 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Beyond Labels: Stories of Asian Pacific Islanders at Yelp]]></title>
      <description><![CDATA[<p>During <em>Asian Pacific American Heritage Month</em>, ColorCoded (a Yelp employee resource group) hosted a panel discussion called <strong>“Beyond Labels: Stories of Asian Pacific Islanders (API)* at Yelp.”</strong></p><p>We heard stories from five API Yelpers about their cultural backgrounds, identities, and thoughts on what it means to be an API in today’s world. Their stories helped us understand that identity is both multilayered and contextual, and that individuality goes beyond labels.</p><p><img src="https://engineeringblog.yelp.com/images/posts/2019-10-28-beyond-labels-stories-of-asian-pacific-islanders-at-yelp/api-blog-image.jpg" alt="Beyond Labels: Stories of Asian Pacific Islanders at Yelp" /></p><p>Read more from their unique perspectives below.</p><h4 id="tenzin-kunsal-events--partnerships-engineering-recruiting">Tenzin Kunsal, Events + Partnerships, Engineering Recruiting</h4><p>From a young age, I knew the concept of “home” was complicated. Like many refugees, my family called multiple countries home. My grandparents left my first home, Tibet, in the 1960s, after it was taken over by China. My second home, India, is where I was born and where I grew up, in a Tibetan refugee community. I was not automatically granted Indian citizenship, so for the first few years of my life, I was state-less, born without a country. That was until 1996, when Minneapolis became my third home. Soon after, I became an American citizen and finally officially “belonged” to a country. Growing up, this was all very confusing. I never felt like I fully fit in anywhere. It wasn’t until college that I started to accept the multifacetedness of my identity and that it’s okay to call multiple places “home.”</p><h4 id="nivedita-mittal-software-engineer-reader-experience">Nivedita Mittal, Software Engineer, Reader Experience</h4><p>I moved to the U.S. four years ago to get my Master’s in Computer Science. 
Since then, it’s been a journey of self-discovery. When I moved from Mumbai to Boston, I always said “I’m from Mumbai, India.” Then, after moving to San Francisco, it became “I’m from Boston.” Something that has always stuck with my identity is how my immigration status defined whether I “belonged.” Whether it’s finding a job that sponsors your H-1B visa, or filling out your green card, defining who you are and whether you belong in the first place is an ongoing insecurity. It didn’t help that during grad school, every conversation I had with other international students revolved around my visa situation. The same applied to recruiting conversations with companies—I would always get questions like, “Did you get your H-1B yet? Did they file your green card already?” Once this is all said and done, I wonder if I’ll finally find that sense of belonging, or whether it’ll still be a conscious thought in my head to remind people that I belong here.</p><h4 id="gabe-ramos-director-corpeng">Gabe Ramos, Director, CorpEng</h4><p>I identify as Filipino American, a person of color, and a Hapa. “Hapa” is a Hawaiian word that’s used to describe people who are part Asian and part Caucasian. Growing up in the Bay Area, I bounced around schools that had different ethnic make-ups. People often can’t tell what race I am. When I was in a predominantly Black and Latino school, classmates teased me for being “white.” When I was in a mostly white Palo Alto public school, classmates teased me for being “Japanese” because they didn’t know what race I was. I felt like I was between worlds because I didn’t pass for white yet often didn’t feel Filipino enough. Learning about different racial identities in college was pivotal for me. I have a liberal arts background, and my education really helped me learn about other Asian Americans’ experiences, the history of racial violence in the U.S., and anti-miscegenation laws. This helped me gain more of a sense of shared history. 
Most importantly, this empowered me to feel more ownership over my opinions of my own racial and cultural identity.</p><h4 id="julie-truong-software-engineer-restaurant-plan">Julie Truong, Software Engineer, Restaurant Plan</h4><p>From my last name, you may assume that I’m Vietnamese; I’m actually Chinese. My family immigrated from China to Vietnam (and later to the U.S.), and in order to blend in, my paternal grandfather changed our last name. My family is a mix of Chinese and Vietnamese cultures. At any given family gathering, you can hear English, Cantonese, and Vietnamese—all within the span of a couple minutes. I grew up in a primarily Latinx/Black/Samoan/Filipino neighborhood in the East Bay. When I was younger, I had an idea of what being a “cool Asian” entailed, and Chinese people weren’t necessarily portrayed in this light. So I actually wished I were Filipino, just like the cool kids in school. Now, as an adult living in the Bay Area, I feel I’m actually quite privileged. There’s a large Asian American population here, and I don’t have to think about my cultural identity very often. Interestingly, I find I have to think more about my gender and sexual orientation and how these parts of my identity show up in my personal and professional life.</p><h4 id="wing-yung-vice-president-engineering">Wing Yung, Vice President, Engineering</h4><p>I grew up near Arcadia, California, in a community with many other Asian Americans. Most of my classmates in public school were like me—our parents immigrated here, and we were born here. I can speak three dialects of Chinese (poorly): Mandarin (which I learned through lessons), Cantonese (which my parents speak at home because they grew up in Hong Kong), and Wenzhounese (my grandparents’ dialect). Throughout college I became more aware of my Asian identity, but didn’t seek out opportunities to explore it. Early on in my career at IBM, one of my managers sent me to an Asian leadership development program. 
In retrospect, it was one of the first times I became aware that leadership comes in many forms. I’m very much aware of the fact that I’m often the only (or one of the few) Asians in leadership settings. It’s important to me to be a role model for others so that they know there are paths to these roles.</p><h3 id="conclusion">Conclusion</h3><p>What ties all of these stories together is a sense of belonging that impelled us to redefine our identities on our own terms. Finding the right communities and support groups was critical for our journeys of self-discovery. The process of preparing for this panel was in itself extremely empowering, as it allowed us to dig deeper and reflect on what makes us who we are. Opportunities like these provide a platform to learn about others’ experiences and to realize how much representation influences our lives. It’s important to remind ourselves that sharing these stories makes us stronger and is an important part of cultivating community.</p><p>Want to be a part of the dialogue? Here are a few steps you can take right now!</p><ul><li>Join a resource group/meetup/support group that focuses on diversity and inclusion. We have <a href="https://www.yelp.com/careers/who-we-are">employee resource groups</a> here at Yelp, including ColorCoded, Diverseburst, and Awesome Women in Engineering (AWE).</li>
<li>For a more personal conversation, grab coffee with someone who identifies as an API to hear more about their journey.</li>
</ul><p>*In the context of this conversation, API stands for Asian Pacific Islanders—people with origins in Asia or the Pacific Islands.</p><div class="island job-posting"><h3>Engineering at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/3021acac-2237-4288-bb84-73e770fc2c90?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2019/10/beyond-labels-stories-of-asian-pacific-islanders-at-yelp.html</link>
      <guid>https://engineeringblog.yelp.com/2019/10/beyond-labels-stories-of-asian-pacific-islanders-at-yelp.html</guid>
      <pubDate>Mon, 28 Oct 2019 01:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Open sourcing spark-redshift-community]]></title>
      <description><![CDATA[<p>At Yelp, we are heavy users of both Spark and Redshift. We’re excited to announce <a href="https://github.com/spark-redshift-community/spark-redshift">spark-redshift-community</a>, a fork of <a href="https://databricks.com">databricks</a>’ original <a href="https://github.com/databricks/spark-redshift">spark-redshift</a> project.</p><p>spark-redshift is a Scala package that uses Amazon S3 to efficiently read and write data between AWS Redshift and Spark DataFrames. After the open source project was abandoned in 2017, the community struggled to keep dependencies updated and bugs fixed. The situation came to a head with the release of Spark 2.4, which was incompatible with the latest spark-redshift. Developers looking for a solution turned to online threads on sites like Stack Overflow and GitHub, but none of the answers offered even a simple workaround.</p><p>At Yelp, it was only a matter of time before we jumped into action. The inability to upgrade Spark from 2.3.3 to 2.4 meant that:</p><ul><li>We could not use highly sought-after features from Spark 2.4,</li>
<li>
<p>Our move to Kubernetes was at risk. To run our infrastructure on Kubernetes, we needed Spark 2.4:</p>
<blockquote>
<p>“Spark can run on clusters managed by <a href="https://kubernetes.io/">Kubernetes</a>. This feature makes use of native Kubernetes scheduler that has been added to Spark [2.4].” <sup id="fnref:1"><a href="https://engineeringblog.yelp.com#fn:1" class="footnote">1</a></sup></p>
</blockquote>
</li>
</ul><div class="c2"><img src="https://engineeringblog.yelp.com/images/posts/2019-10-25-open-sourcing-spark-redshift-community/nounprojbuildsoftware.png" class="c1" alt="image" /></div><p>The <a href="https://github.com/snowflakedb/spark-snowflake">spark-snowflake</a> open source project is a stable spark-redshift fork for Snowflake. We considered adapting spark-snowflake to work with Redshift, but the estimated effort was higher than forking and upgrading the original spark-redshift. At databricks’ suggestion, we did exactly that.</p><p>We focused on porting the functionality we use the most, like performant reads from Redshift. Given the timeline and workload, we had to make tradeoffs and support only a subset of features. While some made the cut (reading from Redshift, parsing of various data types, implementing an InMemoryS3AFileSystem for testing), others didn’t (Postgres driver support, AWS IAM Authentication, some SaveMode options). We have already seen great internal adoption, and several teams are now unblocked in their move to Spark 2.4.</p><p>Our plans for the future include supporting the project by focusing on the features we use the most, in the hope that the community will carry forward the features it finds useful. As its name suggests, <a href="https://github.com/spark-redshift-community/spark-redshift">spark-redshift-community</a> is a project for the community. Any support in the form of GitHub issues or pull requests is very welcome.</p><div class="footnotes"><ol><li id="fn:1">
<p><a href="https://spark.apache.org/docs/latest/running-on-kubernetes.html">https://spark.apache.org/docs/latest/running-on-kubernetes.html</a> <a href="https://engineeringblog.yelp.com#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol></div><div class="island job-posting"><h3>Become a Backend (Big Data) Engineer at Yelp</h3><p>We work on a lot of cool projects at Yelp, if you're interested apply!</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/461e8999-1bb8-4d37-9212-da7558ebdc21?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2019/10/open-sourcing-spark-redshift-community.html</link>
      <guid>https://engineeringblog.yelp.com/2019/10/open-sourcing-spark-redshift-community.html</guid>
      <pubDate>Fri, 25 Oct 2019 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Redesigning Yelp for Apple Watch with SwiftUI]]></title>
      <description><![CDATA[<p>At this year’s WWDC, Apple unveiled <a href="https://developer.apple.com/xcode/swiftui/">SwiftUI</a>, a framework that helps developers build declarative user interfaces. At Yelp, we were immediately excited about it and looked for a way to start adopting it. We decided that our Apple Watch application was the perfect candidate for modernization and set out to explore a redesign with this new framework.</p><p>One of the things we pride ourselves on at Yelp is the quality of our content. Yelp users have posted hundreds of millions of reviews and photos. As we set out to re-imagine the user interface for our Apple Watch app, we knew that our gorgeous photos should be the star.</p><p>Here is a side-by-side comparison of the old interface and the new one.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-10-21-redesigning-yelp-for-apple-watch-with-swiftui/image1.png" alt="Star ratings as of October 16, 2019" /><p class="subtle-text"><small>Star ratings as of October 16, 2019</small></p></div><p>As you can see, we’ve adopted an interface similar to the Audiobooks and Music apps, which put a strong emphasis on the thumbnail image. Users of the Apple Watch Series 5 will also see a compass showing the direction and distance to each business in their search results. We hope this will help users in their search for great local businesses near them.</p><div class="image-caption"><img src="https://engineeringblog.yelp.com/images/posts/2019-10-21-redesigning-yelp-for-apple-watch-with-swiftui/image2.gif" alt="Star ratings as of October 16, 2019" /><p class="subtle-text"><small>Star ratings as of October 16, 2019</small></p></div><p>In contrast with WatchKit, SwiftUI gives us much more freedom when building our user interface. It feels much more like developing for the iPhone, with the added constraint of designing for a small screen. 
One notable aspect of the search listings design is its simplicity in code. This scrollable card stack took less than 120 lines of code, animations included! The magic resides in the custom <a href="https://developer.apple.com/documentation/swiftui/viewmodifier">view modifiers</a> you can create and apply to your SwiftUI views. Let’s dive into a simplified example.</p><p>Here is a slightly simplified modifier that shifts the cards vertically and doesn’t handle any scaling down or rotation.</p><p>Given a cardOffset that represents the difference between the current index and the card’s index, we return a custom view modifier that offsets the view’s origin on the y-axis and modifies its opacity if it goes into the background. Our own implementation also adds a scale effect for the impression of depth, and a zRotation effect to give the animation more flavor when the cards are scrolled off-screen.</p><p>Now that we have view modifiers, let’s create the scrollable stack.</p><p>We create a <a href="https://developer.apple.com/documentation/swiftui/zstack">ZStack</a> that fills the remaining screen space left by the Spacer. We then compute the cardOffset needed to return the correct view modifier, and apply the modifiers to their respective cards.</p><p>SwiftUI smoothly interpolates the animation parameters for the offset and opacity whenever the modifier changes for a given card. This means the animation logic is handled for us whenever the current index is changed within an animation block. Since this code hooks into the digitalCrownRotation modifier and passes the animated binding that represents the current index, the animation is performed automatically when the crown is rotated. How convenient!</p><p>This redesign made us eager to see where Apple will take the framework, and what we’ll be able to build with it in the coming years. 
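The modifier and card stack described above can be sketched roughly as follows. This is a hypothetical reconstruction, not Yelp’s actual implementation: the type names (CardOffsetModifier, CardStack, CardView) and the offset, opacity, and sizing constants are all illustrative assumptions, and the real code also adds the scale and rotation effects mentioned above.

```swift
import SwiftUI

/// Shifts a card down the y-axis based on its distance from the current
/// index, and hides it once it scrolls into the background.
/// (Illustrative sketch; constants and names are assumptions.)
struct CardOffsetModifier: ViewModifier {
    /// Difference between this card's index and the currently selected index.
    let cardOffset: CGFloat

    func body(content: Content) -> some View {
        content
            .offset(y: cardOffset * 20)       // stack the cards vertically
            .opacity(cardOffset < 0 ? 0 : 1)  // fade cards scrolled into the background
    }
}

/// A single search-result card with placeholder content.
struct CardView: View {
    let title: String

    var body: some View {
        Text(title)
            .frame(maxWidth: .infinity, minHeight: 100)
            .background(Color.gray.opacity(0.3))
            .cornerRadius(12)
    }
}

/// A card stack scrolled with the Digital Crown.
struct CardStack: View {
    let cards: [String]
    @State private var currentIndex: CGFloat = 0

    var body: some View {
        VStack {
            Spacer()
            // The ZStack fills the space left by the Spacer; each card
            // receives the modifier computed from its offset to the
            // current index.
            ZStack {
                ForEach(cards.indices, id: \.self) { index in
                    CardView(title: cards[index])
                        .modifier(CardOffsetModifier(
                            cardOffset: CGFloat(index) - currentIndex))
                }
            }
            .focusable(true)
            // Passing an animated binding lets SwiftUI interpolate each
            // card's offset and opacity as the crown rotates.
            .digitalCrownRotation($currentIndex.animation(),
                                  from: 0,
                                  through: CGFloat(max(cards.count - 1, 0)),
                                  by: 1)
        }
    }
}
```

Passing `$currentIndex.animation()` as the crown binding is the detail that makes this work without explicit animation code: every crown tick mutates the index inside an animation transaction, and SwiftUI animates the resulting modifier changes on each card.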
We’re thrilled to launch the new Yelp for Apple Watch application today and hope you will love it as much as we do!</p><div class="island job-posting"><h3>Become an iOS Software Engineer at Yelp</h3><p>Want to build more great looking products? Come join us!</p><a class="ybtn ybtn-primary" href="https://www.yelp.com/careers/job-openings/d38ed5fc-bbfa-4f96-92fd-0d194b0433fb?lever-source=engineering_blog" target="_blank">View Job</a></div><p class="back-to-blog"><a href="https://engineeringblog.yelp.com/">Back to blog</a></p>]]></description>
      <link>https://engineeringblog.yelp.com/2019/10/redesigning-yelp-for-apple-watch-with-swiftui.html</link>
      <guid>https://engineeringblog.yelp.com/2019/10/redesigning-yelp-for-apple-watch-with-swiftui.html</guid>
      <pubDate>Mon, 21 Oct 2019 02:00:00 +0200</pubDate>
    </item>
  </channel>
</rss>
