Is AI simply a numbers game?

JAN 03, 2020 | Matt Hatton


The common consensus in the world of AI is that deep learning, i.e. the application of ML methods to large data sets in conjunction with neural networks, is the ultimate route towards further breakthroughs, particularly in pursuit of Artificial General Intelligence. This being the case, success in deep learning depends heavily on having an enormous training data set. As a rule, the bigger the set, the greater the accuracy. This naturally gives an overwhelming advantage to any organisation that has established an unassailable lead in terms of access to data. By virtue of their relative strength in their respective markets, Amazon, Google and Facebook have this, as do Alibaba and Tencent. Microsoft has managed, through an aggressive acquisition strategy, to buy itself a seat at the deep learning top table. IBM has taken a rather different approach.

Microsoft: diverse data sets for ML training

Whether intentionally or not, Microsoft has made some enormous bets on companies that have large, diverse data sets which can be used to train a wide range of deep learning applications. Skype (2011) gives an almost unparalleled set of audio data which can be used for training natural language processing. The potential learnings from LinkedIn (2016) are almost unlimited; it holds an unrivalled set of data on almost every aspect of business life. Until now the monetisation opportunities for LinkedIn have tended to focus on recruitment and retention, but the volume of business-related data being shared on the platform offers the opportunity to run sentiment analysis, targeted marketing, risk analysis, and numerous other applications. The other acquisition to single out is GitHub (2018), which supports the training of AIs on software development and process workflows. These purchases come on top of whatever Microsoft can gather from its ever-expanding business platforms, including the Office 365 suite.

IBM: capabilities rather than data sets

Contrast this approach with that of IBM. Over the last few years, it has acquired many vendors of capabilities that surround AI, such as healthcare imaging, data management, cloud storage and financial analytics, but only a few that might have big data sets that can be drawn on for deep learning. The Weather Company (acquired in 2015) is one. Others may have more niche, specialist data sets applicable to particular use cases, such as Promontory Financial Group (2016), whose data relates to regulatory compliance and risk management.

Data set size is key to applied AI

In the context of progressing to AGI by way of more speculative unsupervised learning, the larger general data sets appear to offer the greater utility. Although there is something inelegant about brute force volume of numbers for training, for now the greatest progress towards AGI seems to be associated with having bigger sets of data. However, in the context of applied AI for real-world enterprise use cases, which is what Transforma Insights cares about, both the Microsoft and the IBM approaches are valid. If it were simply about the size of the data set, CERN, with its hundreds of petabytes of data, would hold one of the most valuable assets in the AI world, but it’s hard to see how that can be commercialised. Microsoft’s assets are relatively large scale as well as being applicable for specific ML use cases. IBM, in contrast, has started from the application and worked backwards, picking up smaller niche players with significant data sets along the way.

Two things are for sure, however. Firstly, if we had to make a prediction for 2020, it’s that holders of rich data sets, either broad or niche, particularly those with unique sets, will be aggressively acquired by the big guys hoping to establish dominance in AI. Secondly, it is futile attempting to build a deep learning capability without the benefit of a top-tier set of learning data, either niche or general. The plethora of AI start-ups should take note.