The time is right for data teams to adopt a streaming-native perspective on analytics architecture, says Julia Brouillette, Senior Technologist at real-time analytics specialist Imply.
In 2022, we began to see data streaming finally getting the recognition it deserves. What was once thought of as a niche element of data engineering has now become the status quo. Every major cloud provider (Google Cloud Platform, Microsoft Azure and AWS) has launched its own streaming service, and more than 80% of Fortune 100 companies have adopted Apache Kafka.
Driving this shift is the growing need for reliability, quick delivery and the ability to support a wide range of both external and internal applications at scale. This continues to be exemplified by the growing number of use cases that depend on subsecond updates, which ups the ante for real-time data processing and dissemination.
There is no doubt that streaming technology will see continued growth in 2023.
As streaming moves toward ubiquity, there is another shift taking place, specifically in the way businesses use data. Events are being analyzed as they’re created to gather insights in real time. With the right tools, businesses can compare what’s happening now to what has happened previously, make time-sensitive decisions and triage problems as they occur.
With the increase in data streams comes a new set of requirements and use cases for real-time analytics. To fully unlock the power of streaming in 2023, data teams will need to look beyond the traditional batch-oriented stack and adopt a streaming-native perspective on analytics architecture.
Historically, business operations have lived in a batch-dominant world of data. The ultimate goal of data infrastructure was to capture data at a fixed moment in time and store it for eventual use. But in the evolution from mainframes that ran daily batch operations to today’s Internet-driven, always-on world, what was once ‘data at rest’ is being replaced by fast-moving data in motion. With streaming, information flows freely between applications and data systems within and between organizations.
While ‘data at rest’ is still around and continues to support a number of reporting use cases, reality isn’t fixed. To meet the need for seamless and authentic data experiences, the systems we build must be designed for data in motion.
As the popularity of streaming technology rises, so does a new way of thinking about data. Streaming platforms have become the central data hub for organizations, connecting every function and driving critical operations. Stream processors and event databases are evolving technologies that are purpose-built to support and handle data-in-motion systems.
As a real-time database, Apache Druid fits into the purpose-built category. It is designed to enable users to query events as they join the data stream at an immense scale, all while enabling subsecond queries on a mix of batch and stream data.
Many businesses are already using streaming platforms such as Amazon Kinesis and Kafka with Druid to build cutting-edge systems that make terabytes of streaming data accessible to people and applications in milliseconds. Reddit, Citrix and Expedia were some of the businesses highlighted at Current 2022, the annual streaming event organized by Confluent, for doing just that.
The ability to react to events as they are happening is the next step of data evolution, and for some, that next step is already here. Even so, we are only at the beginning of an upward curve where streaming and the technology built for it become the basis of everyone’s data architecture.
Now, when it comes to enabling scalable, subsecond analytics on streaming data, many developers and data innovators are wondering ‘what’s next?’
While at Current, we talked to hundreds of Kafka users who had that same question.
Even though streaming adoption is becoming more widespread, most companies still have only one or two use cases that they use a streaming platform to solve. Many people at Current spoke about how Kafka was effectively setting their data in motion, but when it came time to analyze or use those streams in a user-facing application, their ‘data in motion’ became ‘data in waiting’ because their analytics systems were designed for batch data rather than streaming data.
To remedy this, a new database was needed – enter Apache Druid.
With the ability to turn billions of streamed events into data that thousands of users can query immediately and simultaneously, Druid, in combination with streaming platforms like Kafka, can unlock a new set of use cases for developer-built analytics applications.
Take Reddit, for example. Reddit generates tens of gigabytes of events per hour just from ads on its platform. To enable advertisers to decide how to target their spending and understand their impact, Reddit needed to enable interactive queries across the last six months of data. It also needed to empower advertisers to see audience sizes and user groups in real time, adjusting based on interests and location, to find how many Reddit users fit into their target demographic. To do this, Reddit built a Druid-powered application that ingests data from Kafka and enables its ad partners to make real-time decisions that yield the best ROI on their campaigns.
Reddit chose Druid as the database layer of its application because of Druid’s close integration with Kafka and because Druid was designed to ingest and analyze streaming data. This is what sets Druid apart from analytics databases that are built for batch ingestion.
Batch ingestion takes chunks of stream data, puts them in a file, processes the file and then loads that file into the database. The problem with using a batch-based system to analyze streams is that it creates a bottleneck in your streaming pipeline. Druid, by contrast, provides connector-free integration with the top streaming platforms and handles the latency, consistency and scale requirements of high-performance stream analytics cost-effectively.
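To make the bottleneck concrete, here is a minimal sketch that compares how long an event waits before it can be queried under the two approaches. It is a toy model, not a Druid benchmark: the 15-minute batch window, the 60-second file load, the half-second streaming ingest delay and all function names are assumptions chosen purely for illustration.

```python
from statistics import mean


def batch_visible_at(arrival_s: float, window_s: float, load_s: float) -> float:
    """Time at which an event becomes queryable when events are collected
    into fixed windows, written to a file, and loaded after the window closes."""
    window_end = (int(arrival_s // window_s) + 1) * window_s
    return window_end + load_s


def stream_visible_at(arrival_s: float, ingest_s: float) -> float:
    """Time at which an event becomes queryable when it is ingested on arrival."""
    return arrival_s + ingest_s


def average_staleness(arrivals, visible_at) -> float:
    """Mean delay between an event arriving and becoming queryable."""
    return mean(visible_at(t) - t for t in arrivals)


# Assumed workload: one event every 10 seconds for an hour.
arrivals = [float(t) for t in range(0, 3600, 10)]

# Assumed batch pipeline: 15-minute windows plus a 60-second file load.
batch = average_staleness(
    arrivals, lambda t: batch_visible_at(t, window_s=900, load_s=60)
)
# Assumed streaming pipeline: each event is queryable half a second after arrival.
stream = average_staleness(arrivals, lambda t: stream_visible_at(t, ingest_s=0.5))

print(f"batch:  {batch:.1f} s average wait")   # 515.0 s
print(f"stream: {stream:.1f} s average wait")  # 0.5 s
```

Under these assumed numbers, the average event in the batch pipeline waits more than eight minutes before anyone can query it, however fast the rest of the pipeline is.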
Druid also comes with built-in indexing services that provide event-by-event ingestion, meaning that streaming data is ingested into memory and made instantly available for use. Coupled with exactly-once semantics, this capability keeps data fresh and consistent. If a failure occurs during streaming ingestion, Druid automatically resumes and still ingests every event exactly once, preventing data loss and duplicates.
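One general way to get exactly-once behavior after a failure is to commit the ingested rows and the stream position together, so that a restart resumes from the last committed position. The sketch below illustrates that idea in plain Python. It does not use Druid or Kafka APIs and is not a description of Druid’s internals; `ToyStore`, `ingest` and `fail_at` are hypothetical names invented for this example.

```python
class ToyStore:
    """Hypothetical store that commits rows and the stream offset together."""

    def __init__(self):
        self.rows = []
        self.committed_offset = 0  # next stream position to read

    def commit(self, rows, next_offset):
        # In a real system these two updates would be one atomic transaction.
        self.rows.extend(rows)
        self.committed_offset = next_offset


def ingest(stream, store, chunk_size=3, fail_at=None):
    """Ingest from the last committed offset. If fail_at falls inside a chunk,
    simulate a crash before that chunk is committed."""
    offset = store.committed_offset
    while offset < len(stream):
        chunk = stream[offset:offset + chunk_size]
        if fail_at is not None and offset <= fail_at < offset + len(chunk):
            raise RuntimeError("simulated crash before commit")
        store.commit(chunk, offset + len(chunk))
        offset = store.committed_offset


stream = [f"event-{i}" for i in range(10)]
store = ToyStore()

try:
    ingest(stream, store, fail_at=4)  # crashes while handling events 3 to 5
except RuntimeError:
    pass  # uncommitted work is lost, but the committed offset is still 3

ingest(stream, store)  # restart resumes from offset 3

print(store.rows == stream)  # True: nothing lost, nothing duplicated
```

Because the offset only advances when the rows are committed, the events that were in flight during the crash are read again on restart rather than skipped, and events that were already committed are never re-read.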
The biggest reason to use Druid for streaming data analytics is its near-infinite scalability.
Druid can easily scale to the largest and most complex ingestion jobs. When ingestion patterns vary, Druid avoids resource lag by scaling dynamically. It combines the query and ingestion performance of a shared-nothing architecture with the flexibility and non-stop reliability of a cloud data warehouse, so you can add computing power and scale out without downtime, and rebalancing is handled automatically.
While the idea of using a cloud data warehouse to serve both real-time and batch-oriented use cases might sound efficient, doing so ultimately defeats the purpose of a streaming pipeline.
As Databricks CEO Ali Ghodsi told the audience at Current: “The weakest link in the chain sort of dominates everything. If you have one step of your pipeline that is batch-processed and really slow, then it doesn’t matter how fast you are in the rest of the pipeline.”
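A quick back-of-the-envelope calculation shows why one slow step dominates. The stage names and latencies below are assumed figures for illustration only, not measurements from any real pipeline.

```python
# Assumed per-stage latencies, in seconds, for a hypothetical pipeline.
stage_latency_s = {
    "produce to stream": 0.005,
    "stream delivery": 0.010,
    "hourly batch load (average wait)": 1800.0,
    "query": 0.050,
}

# End-to-end latency is the sum of the stages.
total = sum(stage_latency_s.values())
# The slowest stage and the share of total latency it accounts for.
slowest = max(stage_latency_s, key=stage_latency_s.get)
share = stage_latency_s[slowest] / total

print(f"end-to-end latency: {total:.3f} s")
print(f"slowest stage: {slowest} ({share:.2%} of the total)")
```

With these assumed numbers, the single batch step accounts for more than 99.9% of end-to-end latency, so speeding up any of the other stages makes no visible difference.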
Looking into the future, we see more and more developers adopting this entirely new mindset toward analyzing, moving and sharing data. Because of streaming technology, new use cases and products are coming to fruition. We believe organizations that view streaming data as a force multiplier and adopt a streaming-native approach to analytics will set themselves up with a competitive advantage in 2023 and beyond.
Not only is Druid built for these new streaming use cases, but it is also built for a brand-new type of application – one that has characteristics from both the transactional database and analytics worlds.
Data teams have built mission-critical applications that enable operational visibility, customer-facing analytics, drill-down exploration and real-time decisioning using streaming data and Druid. This trend is now starting to take hold and the next wave of analytics applications is already being built.