Talk to an Expert ›With Aggregations in Cribl LogStream 1.6, you can send your logs directly to Cribl and shape, extract, filter, drop, and now, aggregate! This new, powerful capability allows users to easily transform logs into metrics while running real-time, tumbling window aggregate functions on them.
In this post, I'd like to share the engineering problems and challenges we had to overcome to deliver a performant streaming aggregation system.
Cribl LogStream has two different types of streaming aggregation: cumulative and time-bucketed. We'll discuss the time-bucketed streaming aggregations, as they're much more interesting from an engineering perspective.

To quickly summarize cumulative streaming aggregations: we essentially keep a running total of all the aggregates to output, and we dump out the current value every `n` seconds, where `n` is the customer-specified time span. Pretty simple stuff there.
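To make the cumulative case concrete, here is a minimal sketch. The class and method names are invented for illustration and don't reflect Cribl's actual code; the idea is just running totals folded in as events arrive, with a timer dumping the current values every `n` seconds.

```typescript
// Hypothetical sketch of cumulative streaming aggregation (illustrative
// names, not Cribl's implementation).
type Event = { value: number };

class CumulativeAggregator {
  private count = 0;
  private sum = 0;

  // Fold each incoming event into the running totals.
  add(e: Event): void {
    this.count++;
    this.sum += e.value;
  }

  // Called every n seconds by a timer; emits the current running totals.
  snapshot(): { count: number; sum: number } {
    return { count: this.count, sum: this.sum };
  }
}
```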
Now to the main event…
As you will come to learn, one of my main principles of engineering is “you can’t optimize something that doesn’t exist.” We’ll follow that approach here while building out the time-bucketed streaming aggregations. First things first, let’s get something functional.
As the heading of this section suggests, right now we don't care about the properties of the streams of data coming into the aggregator. We just need to make sure that we put everything into the correct time bucket and flush out events.

The easiest way to do this is to keep a map of buckets that we can look up by their earliest event time. If a bucket doesn't exist, we just create a new one and put it in the map.

To flush, we very inefficiently walk all the entries in the map and check each bucket's latest event time to see which ones can or should be flushed. In the current implementation, we also only perform this check when we receive a new event. This leaves us with a huge problem: if a data stream is slow, events can end up sitting in the aggregator waiting to be flushed, with no new event coming along to trigger the flush in a timely manner.
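Here's a minimal sketch of that naive first version, with invented names (this is illustrative, not Cribl's actual code): a map keyed by each bucket's earliest event time, and a full walk over the map to find closed buckets.

```typescript
// Naive time-bucketed aggregator sketch (illustrative, not Cribl's code).
// Buckets live in a map keyed by their earliest event time.
type Event = { time: number; value: number };
type Bucket = { earliest: number; latest: number; count: number; sum: number };

class NaiveBucketedAggregator {
  private buckets = new Map<number, Bucket>();

  constructor(private windowSec: number) {}

  add(e: Event): void {
    // Align the event time down to its bucket's earliest time.
    const earliest = Math.floor(e.time / this.windowSec) * this.windowSec;
    let b = this.buckets.get(earliest);
    if (!b) {
      // Bucket doesn't exist yet: create it and put it in the map.
      b = { earliest, latest: earliest + this.windowSec, count: 0, sum: 0 };
      this.buckets.set(earliest, b);
    }
    b.count++;
    b.sum += e.value;
  }

  // Inefficient: walk every entry and flush buckets whose window has closed.
  flushClosed(now: number): Bucket[] {
    const out: Bucket[] = [];
    for (const [key, b] of this.buckets) {
      if (b.latest <= now) {
        out.push(b);
        this.buckets.delete(key);
      }
    }
    return out;
  }
}
```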
Since aggregations live in the context of a pipeline, and good software design tells us that all the pieces within a framework should be decoupled, we can't just call `setTimeout` with the time window in seconds. We have to work within the framework or update the framework.
Doing the right thing is usually the right thing to do, so we decided to update the framework at this point to enable primitive signals that can be sent to aggregations. This allows the aggregations to run within the context of the framework (i.e., no need to call `setTimeout` arbitrarily). The framework essentially heartbeats the aggregation function every second, and the aggregator will flush out any buckets that have a latest event time earlier than `now()`. This flushing mechanism also allows the framework to signal aggregations to perform a "final flush," where we dump out all of the time buckets that have been built up, so we don't lose any data upon config change or shutdown.
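The signal handling can be sketched roughly like this. The signal names and types are invented for illustration and are not Cribl's internal API; the point is that the framework drives flushing, rather than the aggregator managing its own timers.

```typescript
// Illustrative sketch of framework signals driving flushes; the names are
// invented and do not reflect Cribl's internal APIs.
type Signal = "heartbeat" | "final";
type Bucket = { latest: number; data: number[] };

class SignalDrivenAggregator {
  private buckets: Bucket[] = [];

  addBucket(b: Bucket): void {
    this.buckets.push(b);
  }

  onSignal(sig: Signal, now: number): Bucket[] {
    if (sig === "final") {
      // Config change or shutdown: dump every bucket so no data is lost.
      const all = this.buckets;
      this.buckets = [];
      return all;
    }
    // Heartbeat (sent ~every second): flush buckets whose latest event
    // time is already in the past.
    const flushed = this.buckets.filter((b) => b.latest < now);
    this.buckets = this.buckets.filter((b) => b.latest >= now);
    return flushed;
  }
}
```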
With periodic flushing built into the framework, we now have a "functional" time-bucketing solution; however, as mentioned in the previous step, we still perform inefficient walks over a map, and we still haven't addressed what to do with historic data streams (meaning time buckets that will always be earlier than the `now()` cursor).
Let’s stop and think about some properties of streams for a moment to determine exactly what we should do moving forward to address the issues mentioned in the previous section.
The most common data stream should be one made up of current data. This means all the data in the stream will have event times approximately equal to `now()`. This should be the hot path for our code and data structures.
The edge cases for data streams will be historic streams and/or slow streams. Historic streams are the opposite of our common case, where data has event times that are in the past. Slow streams are exactly what they sound like; they are streams where the events slowly trickle in over time.
This informs our design: optimize for new events with event times approximately equal to `now()`, and ensure that we can efficiently remove buckets once `now()` surpasses a bucket's latest event time boundary. We should then have a separate data structure that handles buckets for historic streams, since there really isn't any heuristic we can pick for how those will show up over time, whereas new data should always be around `now()` and strictly increasing in time.
These new data structures help us optimize for contemporary data; however, we still face the problem of historic streams. Once a time bucket's latest event time boundary is older than `now()`, should we just flush it out and render it to an event? We'd argue no. First, we need to handle the case of an event arriving late to a time bucket. If the latest time boundary of a bucket is `t=10` and an event with `event_time=9` arrives at `t=11`, we would still want that event placed in the bucket before the bucket is flushed out. This is why we decided to add a configurable lag tolerance to aggregations: we wait `n` seconds past a bucket's latest event time boundary for any late-arriving events. This gives us better aggregation accuracy, as events are more accurately grouped together.
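The lag-tolerance check can be sketched as a small predicate. Again, names are illustrative, not Cribl's actual code: a bucket stays writable for a tolerance period past its latest event time boundary, so a late event whose event time falls inside the window still lands in the right bucket.

```typescript
// Sketch of lag tolerance (illustrative names, not Cribl's code): a bucket
// remains writable for `lagToleranceSec` past its latest event time
// boundary, so late-arriving events can still land in the right bucket.
type Bucket = { earliest: number; latest: number };

function canAccept(
  b: Bucket,
  eventTime: number,
  now: number,
  lagToleranceSec: number
): boolean {
  // The event must belong to this bucket's time window...
  const inWindow = eventTime >= b.earliest && eventTime < b.latest;
  // ...and the bucket must not yet have aged past its lag tolerance.
  const stillOpen = now < b.latest + lagToleranceSec;
  return inWindow && stillOpen;
}
```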
Now that contemporary data is being handled correctly, we need to take a look at how to handle historic data. If data comes in at `t=0` while `now()=30` (with a bucket time window of `10s`), that bucket will have a latest event time boundary of `10`. This event will flush instantly, since the bucket falls behind `now()`. We could make the lag time huge, but that would delay all buckets by a lot. Our solution to this problem is to introduce a bucket idle time. This setting makes sure that we don't flush out a bucket unless no events have arrived for the idle time. In the example above, with an idle time of `1s`, we won't flush out the historic bucket until it has not received any events for `1s`; so as long as events are continuously being sent to this time bucket, we'll keep it open for writing. This keeps us from flushing out singular events, and we actually end up with much more accurate aggregations.
Now with these two settings, we introduce multiple stages of time buckets:
| Stage | Writable | Description |
|-------|----------|-------------|
| 1 | Yes | Bucket consists of contemporary data. Progresses to the next stage when its event time falls behind `now()`. |
| 2 | Yes | Bucket consists of historic data. Progresses to the next stage when no events arrive for the idle time limit. |
| 3 | No | Bucket is flushed out and rendered to an event. |
Here is a diagram showing the lifecycle of a time bucket within the aggregation system:
Anything dealing with time is going to be complex. The key to a successful time-based feature is to introduce and expose as little complexity to the customer as possible. We could have gone many different routes, with a plethora of configurations to handle all sorts of exceptional data streams, but that would have made aggregations a convoluted mess, impossible to fully understand for any customer (or even future Cribl engineers). The best solution is usually the simplest, not an over-engineered one. We examined what the main customer use case for aggregations would be, and we optimized the solution for that. Our minimalistic approach to configuration options empowers customers to better handle any exceptional data streams while lowering the barrier to entry for anyone getting set up for the first time.
Now that we have finished up our model for dealing with time, let’s take a deep dive into bucket data structures and optimizations in our post Engineering Deep Dive: Streaming Aggregations Part 2 – Memory Optimization.