Bob Eve, Senior Director, Data Intelligence Evangelist, TIBCO Software, delves deeper into how your AI and ML algorithms are only as good as the data that fuels them.
As we enter the Exabyte era, the fact that the business world is awash with data is accepted from the boardroom to the shop floor.
The ready availability of AI and ML engines is giving businesses the opportunity to move more quickly and make better data-driven decisions, and this is now widely accepted well beyond the world of information technologists and data scientists.
Data combined with AI has the potential to deliver innovative new products, compelling customer experiences and optimised operations in markets ranging from health and financial services, to comms and media. It will, in effect, be at the heart of digitally transforming every business.
We know why this is happening. Within five years, every app and service will be AI-driven as consumers expect – in fact, begin to demand – personalised services matched to their experiences and desires.
Getting there, however, requires data. AI relies on the availability of vast and varied training datasets to accelerate time to value, and sourcing them is a significant challenge that has been seen as a barrier to success. This is because everyone knows that your AI/ML algorithms are only as good as the data that fuels them. And in today’s complex data landscape, data can be a big bottleneck.
Data, data, everywhere…
Data enters the business from all sides and in all imaginable formats – it is coming from clouds, from data warehouses, from streaming services, from social and from mobile. It is in traditional structured columnar formats, it is unstructured email, social posts, video and, increasingly, voice.
A quick look at the data sources at play reveals some of the complexities that we face: transaction systems, operational data, data warehouses, data marts, Big Data, packaged apps, RDBMS, Excel and external sources such as cloud data, web services, IoT data and mobile.
Once inside your organisation, it is likely that this data will be distributed. Can you make the time to gather all this data (from wherever it is hosted – possibly across multiple cloud platforms) and present it for analysis?
The realities of gathering, moving and feeding suitable data, in the necessary volumes, into AI/ML models are where the bottleneck exists. We need to think about the data itself. Value is being held back, so we need to consider how we can overcome this data bottleneck.
There are two sides to fuelling AI with data. The first is fuelling AI with data during development. Among the many requirements are addressing data quality and using the best techniques for mitigating bias. AI in development means addressing both the data-processing flow and the model-building flow, and dataset preparation requires agile algorithms.
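To make the data-quality and bias checks mentioned above a little more concrete, here is a minimal, illustrative sketch in Python. The record structure and field names (a list of dicts with an `approved` label) are hypothetical, and the checks shown – missing values and class imbalance – are only two of the many a real preparation pipeline would run:

```python
from collections import Counter

def dataset_quality_report(records, label_field):
    """Flag missing values and class imbalance in a training set.

    Illustrative only: 'records' is a list of dicts and
    'label_field' names the target column (both hypothetical).
    """
    # Count rows where any field is missing or empty.
    missing = sum(
        1 for r in records
        if any(v is None or v == "" for v in r.values())
    )
    # Tally the label distribution to surface imbalance.
    labels = Counter(r[label_field] for r in records if r.get(label_field))
    total = sum(labels.values())
    # A rough imbalance signal: the majority class's share of labelled rows.
    majority_share = max(labels.values()) / total if total else 0.0
    return {
        "rows": len(records),
        "rows_with_missing_values": missing,
        "label_counts": dict(labels),
        "majority_class_share": round(majority_share, 2),
    }

sample = [
    {"age": 34, "approved": "yes"},
    {"age": None, "approved": "yes"},
    {"age": 51, "approved": "no"},
    {"age": 29, "approved": "yes"},
]
report = dataset_quality_report(sample, "approved")
```

Running checks like these before training is exactly the kind of work that keeps biased or incomplete data from silently shaping a model.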
The other side of fuelling AI with data is in production. The challenges of fuelling AI applications with data in production to deliver useful analytics include streaming data and data at scale, where models must be continually refined and outputs continually improved.
We have arrived at a point where data abundance is not the issue; it is the benefit. But that means we need to think harder about how to sort through it all to uncover which data is valuable for fuelling our intelligent algorithms, and how to get quality data to those algorithms apace.
We need to think about getting it all into a single development environment where we can address what data is valuable and how much should be used to build intelligent algorithms. This requires applying integration technology that brings together these diverse, distributed data sources and makes this data readily available to our data science teams.
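As a toy illustration of that integration step – the feeds, formats and field names here are all assumptions – the goal is simply to land records arriving in different shapes in one uniform structure that a data science team can query:

```python
import csv
import io
import json

# Two hypothetical feeds: a CSV export from a data warehouse and a
# JSON payload from a web service.
csv_feed = "customer_id,region\nC1,EMEA\nC2,APAC\n"
json_feed = '[{"customer_id": "C3", "region": "AMER"}]'

def unify(csv_text, json_text):
    """Normalise both feeds into one list of dicts for analysis."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.extend(json.loads(json_text))
    return rows

unified = unify(csv_feed, json_feed)
```

Real integration platforms do far more (virtualisation, federation, governance), but the principle is the same: diverse, distributed sources presented as one consistent dataset.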
Once in production, this will require enabling agile ‘DataOps’ processes that help data science teams remove the data bottleneck and accelerate new algorithm time to solution. This requires collaboration between data science teams and IT operations, as these intelligent algorithms move from the lab to production.
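To make the idea of models being “continually refined” in production concrete, here is a minimal sketch – not any particular product’s API – of a toy model whose single parameter is updated incrementally as streaming records arrive, rather than being retrained from scratch:

```python
class RunningMeanModel:
    """Toy 'model' that predicts the running mean of a streamed metric,
    updating itself one record at a time, as a production stream would."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental mean update: no need to keep the full history.
        self.count += 1
        self.mean += (value - self.mean) / self.count

    def predict(self):
        return self.mean

model = RunningMeanModel()
for reading in [10.0, 12.0, 14.0]:  # stand-in for a live data stream
    model.update(reading)
```

The same pattern – small, continuous updates flowing from operations back into the model – is what DataOps processes are meant to make routine once algorithms leave the lab.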
We still spend 80% of our time gathering and managing data and just 20% of our efforts analysing it. In the world of AI and ML, it has been said that ‘equipping Machine Learning models with dependable, unbiased training datasets for reliable outcomes is unambiguously the most difficult aspect of deploying this transformative technology’.
To reap the full benefits of combining data with AI and ML, we must continue to develop new solutions, prioritise the challenge of sourcing the data needed to fuel training models, and find the means to overcome data bottlenecks.