Cloud-native data science is on the rise, as traditional Hadoop-centric big data infrastructure falls out of favor, according to a Wednesday report from data science platform Anaconda.
The company surveyed 4,218 data science students, professionals, academics, and software developers who use the platform. In terms of data sources, files reign supreme, with the majority of data scientists (89%) tapping CSV or other files, the report found.
In second place, 49% of data scientists surveyed said they use a SQL database like Oracle or MySQL, and in third, 25% use a REST API from another app like Twitter. Google Cloud’s data services just edged out traditional big data stores like HDFS/Hadoop/Spark, with each used by about 17% of respondents. Amazon Web Services (AWS) had about 16% of users taking advantage of its data offerings.
To scale out data science, practitioners are increasingly turning to Linux servers (34%) and Docker (19%), as opposed to Hadoop/Spark (15%). Kubernetes is also on the rise (5.8%), especially compared to Apache Mesos (0.85%).
“The survey shows that data science is undergoing a shift away from traditional big data (Hadoop/Spark) towards cloud-native technologies such as Docker containers, Kubernetes and API-driven applications,” Mathew Lodge, senior vice president of products and marketing at Anaconda, said in a press release.
Hadoop has dominated on-premises data infrastructure for the past decade, the report noted. However, it was introduced in 2005, and what counted as big data back then can now fit in a single server’s memory. Companies also have a number of alternatives to building a Hadoop data lake. Containers are growing in production and enterprise adoption as well, the report said.
It’s also interesting that Google Cloud Platform’s data services outranked AWS and Microsoft Azure, despite the fact that the platform comes in third behind those two in terms of enterprise adoption. As more companies move to the cloud and expand their use of data analytics, it’s possible that Google Cloud Platform could see more pickup due to its strengths in this area.
Building a slide deck, pitch, or presentation? Here are the big takeaways:
- The most popular data sources for data scientists are CSV files (89%), SQL databases (49%), and REST APIs (25%). — Anaconda, 2018
- The most popular tools for scaling out data science are Linux servers (34%), Docker (19%), and Hadoop/Spark (15%). — Anaconda, 2018