We recently published a benchmark comparing Talend Big Data Platform to Informatica Big Data Edition, showing the performance benefits of our native Apache Spark approach over Informatica’s solution. Informatica responded with a rebuttal that mixes some good points with claims that are either misleading or completely false. (Privately, their lawyers also sent a letter to the group that performed the benchmark, demanding that they retract it.) I’d like to set the record straight. Let’s start with their more valid points:
· The benchmark used a “two year old version of Informatica”. This is mostly true. When we started the benchmark, we used the most recent version of Informatica (which they released in June 2014, so it was then 16 months old). Almost simultaneously with the publication of the benchmark, Informatica released their new version 10, which we haven’t benchmarked yet. In general, Informatica releases their products every 2-3 years, while we release twice per year, so it’s not surprising to see their product out of date relative to ours and the rest of the big data ecosystem. This is the normal state of things, except right around one of their release windows.
I’d also like to point out that since the benchmark was done, we have released a newer version of our platform, and in our internal benchmark with Spark 1.5 we’ve already seen a 16% speed improvement.
· The benchmark compares Informatica using MapReduce to Talend using Spark. True. Informatica’s latest available version at the time only supported Hive (which runs on top of MapReduce), so we used that.
· Our benchmark didn’t use TPC-DS. True. We compared the products using several common real-world digital marketing and e-commerce scenarios such as product recommendations and coupon-influence analysis. Interestingly, even though Informatica apparently used an industry-standard benchmark suite, they didn’t publish their full results and configurations, which the TPC consortium requires of anyone publishing a benchmark.
We think our scenarios reflect real-world integration work better than what TPC-DS tests. The TPC-DS benchmark is primarily focused on analytics use cases, with a smaller focus on data integration. The authors of the benchmark even wrote: “Although the emphasis is on information analysis, the benchmark recognizes the need to periodically refresh its data.”
· We only used 12M records. Somewhat true. The total benchmark actually processed 75 million records, but it’s true that many real-world scenarios will process more. That said, our performance differential actually improved dramatically as the data volumes tested increased.
Informatica’s post then went on to make a number of technical claims, many of which were misleading or simply false. I’ll cover a few that are particularly worth discussing. In their blog post, they talked about three key issues in comparing the products:
1. Performance. We agree this is critical and maintain we are faster. Nothing in their unpublished benchmark tells us otherwise.
2. Layer of abstraction. Informatica pointed out that this is key to protecting customers against future change in the fast-moving big data landscape, and we heartily agree. In fact, we provided exactly that when we launched Talend 6, allowing our customers to upgrade any of their existing MapReduce jobs to Spark and gain the 5x performance benefit with just one click. If you’re able to find anyone running the Informatica Big Data Edition, ask them what the upgrade experience to the brand new version 10 is like (this is likely to be a challenge, as there are so few of them in production). Unlike Talend’s fully compatible approach, Informatica doesn’t provide a clean abstraction layer, and requires a lengthy and awkward upgrade/conversion process to go from their own version 9.x to version 10. I can only say that they must have cut the upgrade feature because there wasn’t enough customer demand for it…
Here’s our UI to upgrade a MapReduce job to Spark to get the 5x performance improvement:
And by the way, if it makes sense for you to run the job in real-time rather than batch, that’s just two clicks away using either Storm or Spark Streaming:
3. Breadth of functionality. We agree. If you can find one of those elusive Informatica Big Data Edition customers, ask them how compatible it is with the classic PowerCenter. You may be surprised to find out that their Big Data Edition is actually a completely separate product from classic PowerCenter, with a different job designer, different server, different management, different metadata, basically different everything, and that it’s not compatible with PowerCenter. It supports only a very limited subset of the full PowerCenter functionality, so you need to figure out how to partition your jobs between full PowerCenter and the Big Data Edition, shuffling data back and forth along the way. This doc (helpfully written by an Informatica Sales Engineer) describes the missing functionality and where you’ll need to fall back on PowerCenter. Warning, it’s 15 feature-packed pages. At Talend, of course, we support everything on our Big Data Edition, since it’s a pure superset of our standard Data Integration with additional Big Data functionality. It’s the most popular version of our product, so it shouldn’t be surprising that it’s fully functional.
In addition to those three evaluation criteria proposed by Informatica, we’d suggest a few more:
1. Ease of deployment and management. How easy is it to deploy, manage, monitor, and upgrade your integration solution? Does it require putting something separate on each node in the Hadoop cluster, or does it natively leverage the full power of Hadoop and Spark without any additional management overhead? Think about upgrade scenarios as well – do you have to worry about specific version compatibilities of some legacy component that you had to install on every node? Check out this post, especially the last exchange: “No, you only need to install on the data nodes, but you do need to install on ALL the data nodes.” I don’t think it has changed in the version 10 edition but I might be wrong…
2. Cost. How much does it cost, and how do costs scale as you use more data and thus more Hadoop nodes? Are you paying for each Hadoop node twice, once to your Hadoop distributor and once to your data integration vendor? Or are you only paying your data integration vendor for the developers using the system?
3. Cloud compatibility. If you’re interested in moving to the cloud now or in the future, is your vendor’s approach compatible with that desired direction? Talend’s solution is 100% symmetrical between cloud and on-premises, so anything that you build on-premises can run without changes in the cloud. And as part of that, our native Spark solution can run in your own Spark cluster, or you can use an on-demand Amazon EMR cluster in Amazon Web Services, which we will spin up before the job and then spin down after the job completes. I have no idea how you’d use Blaze in an AWS/EMR scenario, since it requires a proprietary component installed on every node. If it’s even possible to use there, it certainly won’t be something you can dynamically spin up and spin down.
4. Future trajectory. Are you locked into one vendor’s proprietary runtime and upgrade trajectory, or are you leveraging the amazing amount of innovation and progress going into the open source Hadoop and Spark ecosystem? Which technology do you expect to progress faster? When I joined Talend over a year ago, we made a decision to go all in on Apache Spark. This has turned out to be a terrific strategic decision. The Spark project is the most active Apache project in the world, and Hadoop overall is progressing at a rate faster than anything I’ve ever seen in my professional career.
If you pause to think about it for a moment, it might seem like a surprising technology strategy for Informatica to create something like Blaze rather than leveraging Spark. But if you step away from their spurious claims and look at the problem from their point of view, you realize that it solves a very real problem. For them, that is, not for you.
Informatica’s problem is that they’ve always charged for their proprietary runtime, first with PowerCenter CPUs and now with Blaze Hadoop nodes. From a business model perspective, this is critical since most of their $1B revenue is tied to these runtime licenses. So the idea of leveraging someone else’s runtime – even an incredibly powerful and flexible one like Spark – is not just foreign to them but actively dangerous to their business model. They’ll do everything they can to keep you paying for runtime licenses as long as they can. This is a data tax, or the modern version of the old mainframe MIPS pricing model. Again, this is Informatica’s problem, not yours.
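To make the cost-scaling question concrete, here is a minimal sketch in Python comparing a per-node runtime license with a per-developer license as a cluster grows. Every number in it is a made-up placeholder, not a real price from Talend, Informatica, or anyone else, and the function names are purely illustrative. The only point is the shape of the two curves.

```python
def per_node_cost(nodes: int, price_per_node: float) -> float:
    """Annual license cost when every Hadoop node needs a runtime license."""
    return nodes * price_per_node


def per_developer_cost(developers: int, price_per_seat: float) -> float:
    """Annual license cost when only developer seats are licensed."""
    return developers * price_per_seat


# Hypothetical placeholder figures, chosen only for illustration.
PRICE_PER_NODE = 1_000.0   # assumed runtime license per node per year
PRICE_PER_SEAT = 5_000.0   # assumed developer seat per year
DEVELOPERS = 5             # assumed team size, constant as the cluster grows

for nodes in (10, 50, 200):
    node_based = per_node_cost(nodes, PRICE_PER_NODE)
    seat_based = per_developer_cost(DEVELOPERS, PRICE_PER_SEAT)
    print(f"{nodes:>4} nodes: per-node={node_based:>10,.0f}  per-developer={seat_based:>10,.0f}")
```

Under these assumed inputs, the per-node figure grows linearly with cluster size, while the per-developer figure stays flat unless the team itself grows. Plug in your own quotes to see where the lines cross for you.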
In summary, if you’re so committed to the Informatica stack that you are willing to put a legacy runtime on every Hadoop node, suffer the performance hit, toggle back and forth between their incompatible traditional and big data ETL products, and rule out a simple migration to the cloud, then Informatica has a good solution for you. If, on the other hand, you want a product that takes full advantage of native Spark and Hadoop performance and scale, is fully functional, improves at the breakneck speed of the Hadoop ecosystem, works seamlessly in the cloud when you’re ready for that, and doesn’t require a data tax, then I humbly suggest that you take a look at Talend. There’s a reason why we’re #1 in big data integration.