The time is right for data teams to adopt a streaming-native perspective on analytics architecture, says Julia Brouillette, Senior Technologist at Imply.
In 2022, we began to see data streaming finally getting the recognition it deserves. What was once thought of as a niche element of data engineering has now become the status quo. Every major cloud provider (Google Cloud Platform, Microsoft Azure and AWS) has launched its own streaming service, and more than 80% of Fortune 100 companies have adopted Apache Kafka.
Driving this shift is the growing need for reliability, fast delivery and the ability to support a wide range of both external and internal applications at scale. This is exemplified by the growing number of use cases that depend on subsecond updates, which ups the ante for real-time data processing and dissemination.
Given this momentum, there is no doubt that streaming technology will see continued growth in 2023.
As streaming moves toward ubiquity, there is another shift taking place, specifically in the way businesses use data. Events are being analyzed as they’re created to gather insights in real time. With the right tools, businesses can instantly compare what’s happening now to what has happened previously, make time-sensitive decisions in an instant and triage problems as they occur.
With the increase of data streams comes a new set of requirements and use cases for real-time analytics. To fully unlock the power of streaming in 2023, data teams will need to look beyond the traditional batch-oriented stack and adopt a streaming-native perspective on analytics architecture.
When you look at business operations, historically we have lived in a batch-dominant world of data. The ultimate goal of data infrastructure was to capture data at a fixed moment in time and store it for eventual use. But in the evolution from mainframes running daily batch operations to today's Internet-driven, always-on world, what was once 'data at rest' has been replaced by fast-moving data in motion. With streaming, information flows freely between applications and data systems, both within and between organizations.
While ‘data at rest’ is still around and continues to support a number of reporting use cases, reality isn’t fixed. To meet the need for seamless and authentic data experiences, the systems we build must be designed for data in motion.
As the popularity of streaming technology rises, so does a new way of thinking about data. Streaming platforms have become the central data hub for organizations, connecting every function and driving critical operations. Stream processors and event databases are evolving technologies that are purpose-built to support and handle data-in-motion systems.
As a real-time database, Apache Druid fits into the purpose-built category. It is designed to let users query events as they join the data stream at immense scale, while supporting subsecond queries on a mix of batch and stream data.
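To make that concrete, Druid exposes a SQL interface for exactly this kind of time-bucketed, recent-window query. The sketch below builds a request payload for Druid's SQL API; the datasource name `clickstream` and the one-hour window are illustrative assumptions, not part of any specific deployment.

```python
import json

def build_druid_sql_request(datasource: str) -> dict:
    """Build a JSON payload for Druid's SQL API (POST /druid/v2/sql).

    Counts events per minute over the last hour -- the kind of
    time-bucketed aggregation Druid serves at subsecond latency,
    including events that arrived from the stream moments ago.
    """
    query = (
        "SELECT TIME_FLOOR(__time, 'PT1M') AS minute, COUNT(*) AS events "
        f'FROM "{datasource}" '
        "WHERE __time > CURRENT_TIMESTAMP - INTERVAL '1' HOUR "
        "GROUP BY 1 ORDER BY 1"
    )
    return {"query": query, "resultFormat": "object"}

payload = build_druid_sql_request("clickstream")
print(json.dumps(payload, indent=2))
```

In a live cluster this payload would be POSTed to a Druid broker; the same query covers both freshly streamed and historical batch-loaded rows, since Druid stores them in one datasource.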
Many businesses are already using streaming platforms like Amazon Kinesis and Kafka with Druid to build cutting-edge systems that make terabytes of streaming data accessible to people and applications in milliseconds. Reddit, Citrix and Expedia were some of the businesses highlighted for doing just that at Current 2022, the annual streaming event organized by Confluent.
The ability to react to events as they are happening is the next step of data evolution, and for some, that next step is already here. Even so, we are only at the beginning of an upward curve where streaming and the technology built for it become the basis of everyone’s data architecture.
Now, when it comes to enabling scalable, subsecond analytics on streaming data, many developers and data innovators are wondering ‘what’s next?’
While at Current, we talked to hundreds of Kafka users who had that same question.
Even though streaming adoption is becoming more widespread, most companies still have only one or two use cases for their streaming platform. Many people at Current spoke about how Kafka was effectively setting their data in motion, but when it came time to analyze or use those streams in a user-facing application, their 'data in motion' became 'data in waiting' because their analytics systems were designed for batch data rather than streaming data.
To remedy this, a new database was needed – enter Apache Druid.
With the ability to ingest billions of events from streams and make them immediately queryable by thousands of users simultaneously, Druid, in combination with streaming platforms like Kafka, can unlock a new set of use cases for developer-built analytics applications.
Take Reddit, for example. Reddit generates tens of gigabytes of events per hour just from ads present on its platform. To help advertisers decide how to target their spending and understand their impact, Reddit needed to enable interactive queries across the last six months of data. It also needed to let advertisers see audience sizes and user groups in real time, adjusting by interests and location, to find how many Reddit users fit a target demographic. To do this, Reddit built a Druid-powered application that ingests data from Kafka, enabling its ad partners to make real-time decisions that yield the best ROI on their campaigns.
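A pipeline like the one described above hinges on Druid's native Kafka ingestion, which is configured with a supervisor spec submitted to the cluster. The sketch below builds a deliberately simplified spec; the topic, broker address, datasource and column names are all illustrative assumptions, not Reddit's actual configuration.

```python
import json

# Simplified Druid Kafka supervisor spec: it tells Druid to consume a
# Kafka topic continuously and index events as they arrive, so they are
# queryable within moments of being produced. All names here (topic,
# brokers, datasource, dimensions) are hypothetical placeholders.
supervisor_spec = {
    "type": "kafka",
    "spec": {
        "ioConfig": {
            "type": "kafka",
            "topic": "ad-events",
            "consumerProperties": {"bootstrap.servers": "kafka:9092"},
            "useEarliestOffset": False,
        },
        "dataSchema": {
            "dataSource": "ad_events",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {
                "dimensions": ["campaign", "country", "interest"]
            },
            "granularitySpec": {"queryGranularity": "minute", "rollup": True},
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# In a live cluster, this JSON is POSTed to the Overlord's supervisor
# endpoint to start exactly-once ingestion from the topic.
print(json.dumps(supervisor_spec, indent=2))
```

Once the supervisor is running, the datasource holds both in-flight stream data and older, compacted segments, which is what lets a single query span six months of history and the last few seconds alike.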
Reddit chose Druid as the database layer of its application because of Druid's close integration with Kafka and because Druid was designed to ingest and analyze streaming data. That native stream ingestion is what sets Druid apart from analytics databases built for batch ingestion.