For years, enterprise technologists have watched both Apache Hadoop, with its many ecosystem tools, and the rise of Apache Spark. In Spark's early days it was viewed as something of a competitor to the rest of the framework. Now, thanks to the contributions of a wide community of open source developers and leaders, the consensus is becoming clear, especially among enterprise CTOs: Spark works best as part of a more comprehensive framework, and it fits well in the Hadoop ecosystem.
In August 2016 Sean Anderson posted some strategic context on how Apache Spark fits into the overall Apache Hadoop framework at the Cloudera Vision blog. His post, titled "Enhanced Streaming and Machine Learning with Apache Spark 2.0", was helpful in highlighting the rise of Apache Spark to the point where it is now the de facto processing engine in the Apache Hadoop ecosystem. There are still challenges for the Spark ecosystem to address, especially around use cases that must leverage streaming and complex data types while keeping data access simple. But all indications are that these challenges are being addressed.
Cloudera is helping the community address the remaining challenges through the One Platform Initiative, a project designed to unite development efforts and continually advance Spark's role in the Hadoop ecosystem. The project has five key thrusts: Streaming, Security, Management, Scale, and Cloud.
Sean highlighted a few driving use cases that make continued progress on these thrusts imperative. Perhaps the greatest are the many solutions around machine learning and artificial intelligence. There is also a growing need for speed, including speed in parsing streaming data for real-time decision making. These use cases apply to domains like operational efficiency in the Internet of Things and the Industrial Internet of Things, as well as precision medicine.
Success in these domains requires Spark to work with most of the other components of the Apache Hadoop ecosystem, providing reliable pipelines to collect, transport, process, and serve data, as well as to store, back up, and run traditional analysis over all data holdings.
Enterprises that seek to use Spark will want to do so in a way that ensures it works with everything else. This is perhaps the most compelling reason to get Spark as part of your Hadoop distribution rather than trying to bring it into your enterprise on its own. Leveraging it as part of a commercially supported distribution means it is pre-tested, integrated, and commercially supported.
Stand by for more news from this community. We will continue to track developments and report on them in our directory of Categorized Content, especially in the Big Data section.