Building for Multi-Petabyte Scale, Part 1

January 16, 2020

Categories: Engineering

Measure twice, cut once – understanding the requirements.

This is the first in a series of posts about the architecture and implementation principles we've followed when building Cribl Stream so that it can scale to process tens of petabytes per day at sub-millisecond latency. In this post, we'll discuss one of the most important aspects of designing any system: getting the requirements right. Many projects and products start from the wrong requirements and end up at the wrong destination. At Cribl, we endeavor to deeply understand the requirements of our customers and build a product that meets them. The follow-on posts will dive deeper into our scale up and scale out architecture and implementation decisions.

The screenshot below is from our out-of-the-box monitoring and shows a 300-node Cribl Stream deployment processing data at a rate of 117 trillion events per day, or about 20PB per day. The bumps in throughput around 22:14 and 22:18 are due to the cluster scaling out from 100 to 150 to 300 nodes.

Why do petabyte scale and sub-millisecond latency matter?

Cribl Stream is the first streams processing engine built for logs and metrics. The growth trajectory of these types of data has been exponential for quite some time. The rise in popularity of microservices architectures as well as the number of devices coming online has caused a dramatic increase in the number of endpoints emitting logs, metrics and traces (aka machine data). We are working with customers and prospects who are already at multi-Petabyte/day scale, and we believe customers are only prospecting a small fraction of the potentially valuable raw data.

For a machine data streams processing engine, being able to scale to process petabytes of data per day is simply table stakes. Handling such volumes is only economically feasible if done efficiently: thus the need to process tens to hundreds of thousands of events per second per CPU core, which we can also state as sub-millisecond per-event latency.
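To make the relationship between per-core throughput and per-event time concrete, here is a small back-of-the-envelope sketch. It only does arithmetic on figures quoted in this post (117 trillion events per day, and 10,000 to 100,000 events per second per core); the function names are illustrative and are not part of Cribl Stream. It treats "latency" as the average CPU time available per event on one core, which is an assumption about how the two statements relate.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_event_budget_us(events_per_sec_per_core: float) -> float:
    """Average CPU time available per event on one core, in microseconds."""
    return 1_000_000 / events_per_sec_per_core


def cores_needed(events_per_day: float, events_per_sec_per_core: float) -> float:
    """Cores required to sustain a daily event volume at a given per-core rate."""
    events_per_sec = events_per_day / SECONDS_PER_DAY
    return events_per_sec / events_per_sec_per_core


if __name__ == "__main__":
    daily_events = 117e12  # figure quoted in this post
    for rate in (10_000, 100_000):
        budget = per_event_budget_us(rate)
        cores = cores_needed(daily_events, rate)
        print(f"{rate:>7} events/s/core: {budget:.0f} us per event, "
              f"about {cores:,.0f} cores")
```

At 10,000 events per second per core the budget is 100 microseconds per event, and at 100,000 it is 10 microseconds, so both ends of the range sit well under a millisecond. A tenfold gain in per-core efficiency also cuts the core count for the same daily volume by a factor of ten.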

Seamless experience

To process data at such scale, with the required reliability and resiliency, the system must be distributed. However, there's also a requirement for the system to be able to scale way down. Users are likely to first experience our product on their laptop or in a dev/test environment. This way, they can easily try out the system and gain confidence and expertise before moving on to production.

We believe that a system's usability and scale are orthogonal problems: users shouldn't have to care that a system is distributed. From an engineering point of view, this means the UX for single-instance and distributed mode must be as similar as possible. Let's look at scaling first, since it introduces some terminology and fundamental concepts used to solve this problem, which we'll cover in more detail in the next post.

Scaling and manageability

There are two scaling dimensions to consider when designing a system for high resource efficiency:

Scale out – ability to add more hosts/nodes/instances to handle increased load
Scale up – ability to consume all the resources inside a host to handle increased load

Scale out generally receives most of the attention because of the theoretical ability to scale a system to infinity. However, as the number of nodes in a distributed system increases, so does the complexity of the control/management plane. We view scale up as just as important, because it has the unique potential to reduce the size of a distributed system by one to two orders of magnitude, a non-trivial reduction! We designed Cribl Stream to scale in both dimensions, with users being able to specify a resource cap per instance/node (consume N cores, or consume all but N cores).
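The per-node resource cap can be pictured with a short sketch. The sign convention below (a positive value means "consume N cores", zero or a negative value means "consume all but N cores") and the function name `resolve_worker_count` are assumptions made for illustration; this is not Cribl Stream's actual configuration API.

```python
import os
from typing import Optional


def resolve_worker_count(cap: int, total_cores: Optional[int] = None) -> int:
    """Translate a resource cap into a number of cores to use.

    Hypothetical convention, assumed for this sketch:
      cap > 0  -> consume exactly `cap` cores (never more than the host has)
      cap <= 0 -> consume all but abs(cap) cores
    The result is always at least 1 so the instance can still run.
    """
    total = total_cores if total_cores is not None else (os.cpu_count() or 1)
    wanted = min(cap, total) if cap > 0 else total + cap
    return max(1, wanted)


if __name__ == "__main__":
    print(resolve_worker_count(4, total_cores=16))   # consume 4 cores
    print(resolve_worker_count(-2, total_cores=16))  # consume all but 2 cores
```

Expressing the cap relative to the host ("all but N") lets the same setting work on a laptop and on a large server, which fits the requirement to scale way down as well as up.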

Any distributed system must provide a control plane that users utilize to configure, manage and monitor the deployment, all of which are tightly dependent on the distributed architecture. We made a number of key design decisions when scaling out:

1. We chose one of the simplest and most resilient distributed architectures: a shared-nothing architecture with a centralized master instance. Put simply, each worker node in the deployment acts completely independently, without knowledge of the existence of any nodes other than the master node.
2. All configuration settings and user changes are backed by version control, git in our case, with support for an optional remote repo to which changes are pushed.
3. The master node is out of the data path. It is responsible for holding the master copy of the configuration and for gathering monitoring information, which can also be sent to other systems from the worker nodes. Losing the master node only affects the configurability of the system, while worker nodes continue to process data with the last known config settings until the master comes back online. Recovering a master node is trivial when using a remote git repo for (2).
4. Workers can be grouped into logical management units, all accessed through the same master instance. This is necessary for large organizations with a global presence that have different requirements and regulations for parts of their data pipeline.
5. We must not sacrifice UX for scale. Management and usability of a distributed version of our system should be as similar as possible to the single-instance version. We've achieved this by giving users the exact UX that is available for a single instance, i.e., the experience is the same as if they were interacting directly with one worker node (more on how we achieved this in the next blog post).
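The design decisions above can be illustrated with a minimal in-memory sketch. It is not Cribl Stream's implementation: the classes `Master` and `Worker`, the `prefix` setting, and the integer version counter (standing in for a git commit) are all hypothetical. What it shows is the behavior described in the list: workers know only about the master, are grouped by name, and keep processing with their last known config when the master is unreachable.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


class MasterUnavailable(Exception):
    """Raised when a worker cannot reach the master."""


@dataclass
class Master:
    """Holds the authoritative config per worker group; never sees event data."""
    configs: Dict[str, Dict[str, str]] = field(default_factory=dict)
    versions: Dict[str, int] = field(default_factory=dict)
    online: bool = True

    def commit(self, group: str, config: Dict[str, str]) -> None:
        # Stand-in for a version-controlled change (git in the post).
        self.configs[group] = dict(config)
        self.versions[group] = self.versions.get(group, 0) + 1

    def fetch(self, group: str) -> Tuple[int, Dict[str, str]]:
        if not self.online:
            raise MasterUnavailable(group)
        return self.versions[group], dict(self.configs[group])


@dataclass
class Worker:
    """Shared-nothing worker: knows its group and the master, nothing else."""
    group: str
    version: int = 0
    config: Dict[str, str] = field(default_factory=dict)

    def sync(self, master: Master) -> bool:
        try:
            self.version, self.config = master.fetch(self.group)
            return True
        except MasterUnavailable:
            return False  # keep running with the last known config

    def process(self, event: str) -> str:
        # The data path depends only on local state, never on the master.
        return f"{self.config.get('prefix', '')}{event}"


if __name__ == "__main__":
    master = Master()
    master.commit("emea", {"prefix": "v1:"})
    worker = Worker(group="emea")
    worker.sync(master)
    master.online = False            # simulate losing the master
    worker.sync(master)              # fails quietly
    print(worker.process("login"))   # still processed with the v1 config
```

Because `process` reads only the worker's local copy of the config, losing the master affects configurability but not data flow, which is the property the third item in the list calls out.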

Coming Up Next

In this post we discussed why machine data streams processing engines should scale to process petabytes of raw data per day with maximal resource efficiency, and scale all the way down to work on a commodity laptop. Scaling is not a single-dimension problem, and we believe that providing users a seamless experience independent of the system's scale is crucial for adoption.

In the next posts we'll dive deeper into scale up and scale out, including many of the technical decisions and implementation details behind how we met the requirements discussed here. Read part 2 here.

One more thing, we’re hiring! If the problems above excite you, drop us a line at hello@cribl.io or better yet talk to us live by joining our Cribl Community.
