Cloudera vs AWS vs Azure vs Google Cloud: How to decide on the right big data platform?


Background

Big data concepts evolved to solve a specific problem: processing data that is diverse in nature, high in volume, and often streaming. Hadoop offered the first architectural solution for processing this kind of data on commodity hardware, as an alternative to high-cost HPC and appliance-based systems. Over the years, it addressed the challenges of its time: scalability, the compute and storage limits of a single machine, and comparatively slow network bandwidth. It solved the problem of processing huge data sets with the MapReduce architecture and the data locality principle.
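To make the MapReduce idea concrete, here is a minimal, single-process sketch of its three steps (map, shuffle, reduce) as a word count. It is only an illustration under stated assumptions: the function names are hypothetical, nothing here uses the Hadoop API, and data locality is not modelled. In a real cluster the map step would run on the nodes that already hold each block of the input.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Because each call to map_phase depends only on its own line, and each call to reduce_phase only on its own key, both steps can be spread across many machines. That independence is what lets the approach scale on commodity hardware.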

Cloud offerings: How did the game change?

With the advancement of computing technologies over the last few years, cloud resources have not only matured enough for consumers but have also come down in price. Thanks to faster and more efficient networks, compute and storage resources can now be separated, so data no longer has to sit on the machines that process it.

Businesses are adopting cloud services for agility and to avoid the complexity of managing infrastructure. Today, cloud providers like AWS, Azure and Google not only provide infrastructure and platforms but also offer integrated solutions as a service.

So many vendors and services create a dilemma for consultants, data engineers and architects: which provider should they prefer over the others? Many organizations are adopting a multi-cloud strategy, which can lead to the further challenge of cross-cloud integration.

Let’s dive in and see what each has to offer uniquely. We will also discuss a few considerations that should help point you in the right direction.

AWS – Amazon Web Services

In recent years, AWS has aggressively built out its ecosystem of big data, data science and AI services, advocating a cost-effective, pay-per-use model. It has also developed equivalent tools and technologies as services, such as Amazon EMR for Hadoop and Spark, DynamoDB, Glue and Kinesis, with the flexibility of separating compute and storage in the cloud.

AWS’s emphasis is on building the most innovative services and bringing the latest tech trends into its ecosystem, ready to be used effortlessly by ordinary consumers. Alongside its big data, data management, AI & ML and serverless solutions, one of the latest additions to the ecosystem is AWS Ground Station. It will provide ground antennas through AWS’s existing worldwide network of availability zones, as well as data processing services, to simplify data retrieval and processing for satellite companies and for others who consume satellite data. As you can imagine, AWS’s notion is to provide everything possible on earth as a service.

Enterprises with extensive Amazon cloud deployments will view S3 as an attractive option for AWS data lakes, but there are some limitations, especially around data transfers and analysis. Once users store data in S3, there are no transfer costs for analytics or data processing with apps that run within the same AWS region. However, enterprises must pay a premium when they transfer data to private infrastructure or other cloud platforms for analytics.
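The transfer-cost trade-off above can be sketched as a small calculation. This is only an illustration: the function name is hypothetical, and the per-GB price is a placeholder you would replace with the provider’s current published rate. It is not actual AWS pricing.

```python
def transfer_cost_usd(gigabytes, same_region, egress_price_per_gb):
    """Estimate the transfer charge for analysing data held in object storage.

    gigabytes: amount of data read by the analytics job.
    same_region: True when the job runs in the same region as the bucket,
        in which case no transfer charge applies (as described above).
    egress_price_per_gb: ASSUMED placeholder rate in USD; look up the real
        figure in the provider's price list before relying on the result.
    """
    if gigabytes < 0 or egress_price_per_gb < 0:
        raise ValueError("gigabytes and egress_price_per_gb must be non-negative")
    if same_region:
        return 0.0
    return gigabytes * egress_price_per_gb
```

For example, with a placeholder rate of 0.5 USD per GB, moving 1,000 GB to a private datacentre would be estimated at 500 USD, while the same job run in the bucket’s own region would incur no transfer charge. For a data lake that is analysed repeatedly, this difference adds up with every run.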

Adoption Strategy: IoT use cases, quick data pipelines and AI, research projects. Impressive documentation makes it easy for novice professionals to implement IaaS.

Cloudera

Cloudera is not a cloud vendor; it is a data platform that is cloud-service agnostic. There is no cloud vendor lock-in: you can opt for either AWS or Azure, and it should be available on other cloud platforms in the near future. Very soon the product should also be available as PaaS on the public cloud.

Cloudera was the first company to commercialize Hadoop and offered the first big data platform built on open source technologies. After its merger with competitor Hortonworks, it has become a more powerful product offering. The combined company’s roadmap is to offer an Enterprise Data Cloud with the tagline “Edge to AI”.

Unlike AWS, Cloudera’s vision is very data centric, and it emphasizes building an end-to-end data management and solution platform for enterprises. The decision to merge Cloudera and Hortonworks was very welcome and should add value for everyone. Engineers from both organizations can focus collectively on innovation. At the same time, the data engineering community can focus on learning a single platform rather than juggling Cloudera and Hortonworks.

I would like to quote Tom Reilly here from his keynote: “An Enterprise Data Cloud provides a public cloud-like experience everywhere for data anywhere. It offers the agility, elasticity, and ease-of-use of public cloud infrastructure across private cloud, hybrid-cloud, multi-cloud and all major public clouds. It enables multiple analytic functions, from real-time streaming at the edge to artificial intelligence, working together on the same data to support your most demanding use cases. An Enterprise Data Cloud is secure and governed, meeting strict data privacy, compliance, data migration and metadata management requirements across all environments. Lastly, it’s defined by openness, powered by open source software, open computing architectures and open data stores like Amazon S3 and Azure Data Lake Storage. Our open and portable approach delivers you the best of cloud infrastructure without the cloud lock-in.”

The new unified release will be called Cloudera Data Platform (CDP). This first release of the Enterprise Data Cloud will combine the best components of the existing Hortonworks and Cloudera platforms and will support both hybrid and multi-cloud deployments, with the flexibility to perform machine learning and analytics on data anywhere.

Cloudera’s product strategy is to learn from enterprise customers, understand their needs, and solve their problems by adding enterprise-grade features that bring value to the platform. Cloudera is continuously working to fill the gaps between its offerings and the needs of businesses and the tech community.

I would say Cloudera’s objective is to be a true “enterprise” data platform. Its key differentiators in the market compared to other data solution providers are:

  • Provide a single control plane to manage infrastructure, data, and analytic workloads across hybrid and multi-cloud environments
  • Include SDX shared services to migrate and safeguard data, ensure data privacy and governance, comply with regulations, audit lineage and secure metadata across all cloud environments. SDX offers a single-pane view of security, a shared catalog, central governance and lifecycle control to provide enterprise-grade services
  • Be 100 percent open source, supporting customers’ objectives to avoid vendor lock-in and accelerate innovation

Cloudera supports compliance with data security and protection standards such as PCI, HIPAA and GDPR. Another unique capability of the platform is on-premises deployment in your own datacentre, in case you don’t wish to go with a cloud provider.

Adoption Strategy: Multi-disciplinary analytics on multi-cloud workloads, on-premises deployment (which may be critical for businesses like banks), data security compliance.

Microsoft Azure

Microsoft is another cloud provider similar to Amazon AWS. Microsoft Azure also has re-engineered, innovative services for data and AI in its suite.

One can build a seamless streaming data pipeline using Event Hubs and Databricks, whose output can be analysed and visualised directly in Power BI. Alternatively, data can be stored in Azure PDW (Parallel Data Warehouse), which is engineered with PolyBase and integrates well with Hadoop.

Some of the new capabilities of MS SQL Server 2019: it supports stored procedures in Java and R, similar to what we call UDFs in Hadoop. With the PolyBase engine, MS SQL Server can connect to and query external data sources like MongoDB, Teradata, Oracle, SAP HANA, Cloudera, DB2 and Excel. SQL Server 2019 can also set up a big data cluster on Kubernetes. Microsoft is famous for offering intuitive products by creating nice abstractions and hiding the complexity under the hood. With MS SQL Server 2019, you can set up a big data cluster and use big data capabilities with your existing knowledge of MS SQL Server, even if you are a novice in big data technology.

Including Databricks in the Azure suite, rather than creating its own Spark platform, is a clever step from Microsoft, and I really appreciate it. Databricks is deeply integrated into the Azure cloud console for Spark-based data processing, and Cloudera should soon be added as well for data analytics workloads. Azure has some smart solution offerings; for example, the recently added AI workbench is a drag-and-drop interface with which data scientists can build solutions without writing any code.

Azure is also compliant with PCI, HIPAA and GDPR. The full list of its other compliance offerings and certifications is published by Microsoft.

Amazon runs other businesses as well, such as retail. Any business in a similar line might consider Amazon a competitor, and Microsoft Azure could be a great cloud option for such businesses.

As I see it, AWS and Microsoft offer similar capabilities; it is just a matter of which flavour you like. But Microsoft Azure’s offerings are more enterprise oriented, with better integration with Microsoft Active Directory, full DevOps support and seamless integrations.

Adoption Strategy: Integrated, true enterprise experience; DevOps; seamless integration and PaaS experience; support for C# and .NET skills.

GCP – Google Cloud Platform

Google is trying to position itself in the same way as other cloud vendors like AWS. It markets itself as a hybrid cloud company that can help with your digital transformation. Besides big organizations, SMBs that plan to expand their data technology stack and already run their business on Google products like Google Analytics and G Suite can consider the GCP offerings.

Google’s strategy is to partner with top open-source data management and analytics companies, integrate their products into Google Cloud Platform and offer them as managed services operated by those partners. The partners named so far are Confluent, DataStax, Elastic, InfluxData, MongoDB, Neo4j and Redis Labs, and the list will keep growing. What Google offers in all this is a seamless user experience and the ability to easily leverage these open-source technologies in Google’s cloud.

Adoption Strategy: Although Kubernetes, and TensorFlow in the AI space, are available on all major cloud service providers, both originated at Google, and you get a native, seamless and unified experience with them on Google Cloud.

Disclaimer: Cloud technology has a long way to go and continues to evolve. This point of view is based on the current offerings of the providers mentioned above. It might change in future as new products, features and services are added.

Rahul Singh
Big Data Solution Architect


Rahul,
Good analysis. In the next update, pls do add the market share of each vendor etc.

good luck

Krishna
