With the release of Apache Spark version 2.0 out in preview, there has been a lot of buzz recently about the implications of this advanced technology. Nowhere was that more apparent than in San Francisco this week where Spark Summit West drew a sold-out crowd of 2,500 software developers and data scientists, according to host and Spark cloud service provider Databricks.
As you might be aware, Talend is a strong supporter of Apache Spark. In fact, when we launched Talend 6 last fall, we were the first integration platform to offer native support for Apache Spark and Spark Streaming. As a result, we are now able to deliver some of the fastest big data integration on the market today. Talend, along with the rest of the market, believes that faster is better, so we are certainly interested in the speed enhancements that Apache Spark 2.0 promises to deliver. As my colleague, Ashley Stirrup, discussed during his Cube interview at Spark Summit West, the value of additional data sources is only as great as your ability to combine and synthesize them (full video below). With Apache Spark 2.0, companies will be able to combine new or real-time data sources with existing or historical data sets more quickly and more holistically, and so get a clearer view of customer behavior, business performance, market trends, and more.
Additionally, we understand that Apache Spark 2.0 will bring a major overhaul to Spark Streaming. According to Databricks, streaming becomes much more accessible to users. By adopting a continuous processing model (a query over an infinite table), the developers of Apache Spark have enabled users of its SQL or DataFrame APIs to extend their analytic capabilities to unbounded streams. They have also worked on integrating machine learning, so that structured streaming has the right APIs for things like online training of a model. This allows developers to apply data to the model as it arrives and then, at each timestamp, retrieve the most up-to-date copy of the model.
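To make the "infinite table" idea concrete, here is a minimal conceptual sketch in plain Python. It is not Spark code and does not use any Spark API; the row schema, the `batch_query` function, and the `IncrementalQuery` class are all hypothetical names chosen for illustration. It only shows the core idea: a query written as if against a static table can be kept up to date as new rows are appended, and at every step the incremental result matches what the batch query would return over all rows seen so far.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical row schema for illustration: (user, amount)
Row = Tuple[str, int]


def batch_query(rows: Iterable[Row]) -> Dict[str, int]:
    """The 'static table' query: total amount per user over all rows."""
    totals: Counter = Counter()
    for user, amount in rows:
        totals[user] += amount
    return dict(totals)


class IncrementalQuery:
    """Maintains the same per-user totals as rows keep arriving."""

    def __init__(self) -> None:
        self._totals: Counter = Counter()

    def on_new_rows(self, rows: Iterable[Row]) -> Dict[str, int]:
        # Only the newly arrived rows are processed; earlier state is reused.
        for user, amount in rows:
            self._totals[user] += amount
        return dict(self._totals)


# Made-up data standing in for an unbounded stream, delivered in small batches.
micro_batches: List[List[Row]] = [
    [("ann", 10), ("bob", 5)],
    [("ann", 7)],
    [("cat", 3), ("bob", 1)],
]

table_so_far: List[Row] = []   # the conceptual, ever-growing table
query = IncrementalQuery()
results: List[Dict[str, int]] = []

for batch in micro_batches:
    table_so_far.extend(batch)
    result = query.on_new_rows(batch)
    # The incremental answer equals the batch answer over the table so far.
    assert result == batch_query(table_so_far)
    results.append(result)
    print(result)
```

The design point is that the person writing the query only has to think about `batch_query`; keeping the answer current as data arrives is the engine's job. Online model training follows the same pattern, with the running totals replaced by model parameters that are updated on each new batch.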
However, I believe the real gem in the upcoming release is better abstraction through the move to a single API and Spark SQL. While this will certainly streamline things for developers, it should also make it much cheaper for companies, in terms of both cost and required skills, to gain real-time actionable insights from big data. Similarly, because our Talend Data Fabric platform is fully integrated with Apache Spark and Spark Streaming, it allows companies to perform even the most complex data processing tasks with ease and speed. Our commitment to Spark is unquestionable, which is why we will continue to improve the performance of our platform to keep pace with the upgrade path Spark has embraced.
The swell of enthusiasm is just beginning, and we expect substantial continued innovation in the market as a result of Apache Spark 2.0 over the next 18 to 24 months. So fasten your seatbelts, everyone, because it's going to be a wild ride!