Apache Beam Your Way to Greater Data Agility

If you are Captain Kirk or Mr. Spock and you need to get somewhere in a hurry, you "beam" there; it's just what you do. If you are a company that wants to become more data driven, then, as surprising as it may sound, the answer could be beam as well: Apache® Beam™.

This week, the Apache Software Foundation announced that Apache Beam has become a top-level Apache project. Graduating to top-level status formalizes the project and signals that it has strong community support. For those of you not familiar with Beam, it's a unified programming model for batch and streaming data processing. Beam includes software development kits in Java and Python for defining data processing pipelines, as well as runners to execute those pipelines on a range of engines, such as Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. So, with Beam, you no longer have to worry about the actual runtime where your pipelines will be deployed and executed. We see this as massive for IT teams looking to keep up with both data technology innovation and the increasing pace of business. There's an introduction to Beam here, if you wish to learn more.
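To make that model a bit more concrete, here is a minimal sketch of a Beam pipeline written with the Java SDK, roughly in the spirit of the classic word-count example. The class name and file paths are placeholders of our own rather than anything from the announcement, and notice that the execution engine never appears in the pipeline code; it is chosen through the pipeline options.

    import java.util.Arrays;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.FlatMapElements;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class MinimalWordCount {
      public static void main(String[] args) {
        // The runner (Spark, Flink, Apex, Dataflow, ...) is picked via the options,
        // e.g. --runner=FlinkRunner; the pipeline definition below never changes.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply("ReadLines", TextIO.read().from("input.txt"))          // placeholder input path
         .apply("SplitWords", FlatMapElements
             .into(TypeDescriptors.strings())
             .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
         .apply("CountWords", Count.perElement())
         .apply("FormatResults", MapElements
             .into(TypeDescriptors.strings())
             .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
         .apply("WriteCounts", TextIO.write().to("word-counts"));      // placeholder output prefix

        p.run().waitUntilFinish();
      }
    }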

All organizations are racing to transform themselves into digital businesses and use data as the basis for growth and innovation ahead of the competition. The problem is, getting there is anything but straightforward, and the shot clock is on. Modern information needs demand far more complex data integration and management to support things like greater market responsiveness and real-time, personalized relationships with customers, partners, and suppliers. With the data needs of the business increasing exponentially, CIOs are being forced to make strategic technology bets even as the market continues its dramatic transformation. This can be a major issue, as a technology choice made today to fuel progress can easily become an anchor to advancement tomorrow.

Helping companies get over this hurdle has been a major focus for Talend and is why we designed our Data Integration solutions the way we did: as native code generators. Even ten years ago, when we first introduced Talend, we knew there had to be a better, faster, and cheaper way to manage data integration projects than hand-coding. Our head of sales in Europe, Francois Mero, recently wrote an entire piece detailing the advantages of code-generating tools over hand-coding, so I won't go into a lot of detail here. Net/net, ten years ago code generation already provided strong economies of skill and scale over hand-coding because it was quicker and more cost-effective; today, with the velocity of modern data use cases, it's an absolute no-brainer. It's really about creating greater agility through the portability and re-usability of projects. To illustrate: in 2014 and 2015, MapReduce was the standard, but by the end of 2016 Spark had emerged to replace it. Spark offered such significant advantages over its predecessor that making the switch as quickly as possible was itself a competitive advantage. Companies that had hand-coded their MapReduce projects had to recreate everything in Spark, costing them a tremendous amount of time and money. For companies using code-generating tools, the change was as simple as a couple of clicks of the mouse.

Enter Apache Beam. Here is what our CTO, Laurent Bride, said in the Apache Software Foundation announcement about Beam's graduation to a top-level project:

“The graduation of Apache Beam as a top-level project is a great achievement and, in the fast-paced Big Data world we live in, recognition of the importance of a unified, portable, and extensible abstraction framework to build complex batch and streaming data processing pipelines. Customers don’t like to be locked-in, so they will appreciate the runtime flexibility Apache Beam provides. With four mature runners already available and I’m sure more to come, Beam represents the future and will be a key element of Talend’s strategic technology stack moving forward.”

Talend has chosen to embrace Beam because we see it as a natural extension of our code-generating platform and a way to provide even greater agility to our customers. By updating the Beam runner for any new API changes (or adopting a brand-new framework like Flink or Apex), we get full-fidelity support across the product suite. In contrast, what do those still using custom code do when Spark changes its APIs? They have to rewrite large chunks of it, or the entire thing. Again, this isn't theoretical; Spark made big, disruptive changes to its APIs going from 1.6 to 2.0. It's a particularly tricky situation right now, since Spark 2.0 isn't yet ready for production use. What do you do? If you write to the version that works now, you know you're digging yourself into a huge hole going forward. Or perhaps you roll the dice, write to the new one, and hope it's ready for real production when you need to go live. Perhaps you guess right, and that's good for you; however, it's only going to keep happening, over and over again. And it's not just about Spark. If your streaming use cases are latency-sensitive enough that micro-batching won't cut it, then you'll want to look at Flink, Apex, or something else. Again, with a code-generation tool like Talend, that's a couple of clicks for the growing number of frameworks Beam supports, or at worst a new runner away from supporting something brand new. With hand-coding, it's a complete ditch and restart.
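To give a sense of what "a couple of clicks" translates to at the pipeline level, here is a rough sketch of how a Beam pipeline's execution engine is selected through its options rather than through the pipeline code. The class name is ours, and it assumes the matching runner dependency (for example, the Flink runner artifact) is on the classpath.

    import org.apache.beam.runners.flink.FlinkRunner;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class RunnerSelection {
      public static void main(String[] args) {
        // Option 1: let the command line decide, e.g.
        //   --runner=SparkRunner   to execute on Apache Spark
        //   --runner=FlinkRunner   to execute on Apache Flink
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();

        // Option 2: pin a runner programmatically; swapping engines means changing
        // this one line (and the runner dependency), not the pipeline itself.
        options.setRunner(FlinkRunner.class);

        // ... build the same pipeline as before with these options and run it ...
      }
    }

Either way, the transforms that make up the pipeline stay exactly the same; only the runner configuration moves.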

So, what say you, ready to get beamed up?
