04/02/2019
Today the advantages of hosting your Hadoop cluster in a public cloud are clear: flexibility, scalability and cost reduction. There is no need to buy and maintain hardware on premises, with all the acquisition and maintenance costs that entails, and the cloud offers unprecedented scalability: we are no longer tied to a fixed number of machines and can grow or shrink the cluster as needed. If the cluster's workload rises or falls at specific times, its resources can be adjusted to the needs of each moment and, since we pay only for the resources actually used, the cost tracks real demand.
The three best-known platforms for hosting Hadoop clusters are Amazon Web Services (AWS), Microsoft Azure and Google Cloud Platform (GCP). All three are very similar in what they offer, but each has particularities that may tip the decision towards one provider or another.
The first characteristic to analyse is the number of machine types available on the platform. In general, the wider the catalogue, the more closely the instance resources can be matched to the workload's real needs. Here AWS is the provider that offers the most types, although in most cases this is not a decisive feature.
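As an illustration, the short sketch below uses boto3 (the AWS SDK for Python) to walk the EC2 instance-type catalogue and keep the types that fit a hypothetical worker profile. The region and the vCPU and memory thresholds are assumptions for the example, not recommendations.

```python
# Survey the EC2 instance-type catalogue and pick types that fit an
# assumed worker profile (>= 16 vCPUs and >= 64 GiB of RAM).
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # region is an example

paginator = ec2.get_paginator("describe_instance_types")
candidates = []
for page in paginator.paginate():
    for itype in page["InstanceTypes"]:
        vcpus = itype["VCpuInfo"]["DefaultVCpus"]
        mem_gib = itype["MemoryInfo"]["SizeInMiB"] / 1024
        if vcpus >= 16 and mem_gib >= 64:
            candidates.append((itype["InstanceType"], vcpus, mem_gib))

for name, vcpus, mem in sorted(candidates):
    print(f"{name}: {vcpus} vCPUs, {mem:.0f} GiB")
```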
Cost flexibility is another factor to take into account when selecting a provider. On AWS it is possible to reserve instances or to use Spot instances, which allow savings of up to 90% compared with on-demand pricing. Azure offers similar mechanisms through reservations, while GCP automatically applies sustained use discounts and also offers preemptible VMs.
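To get a feel for those savings, a hedged sketch like the following can query recent Spot prices with boto3 and compare them against an on-demand rate. The instance type, region and on-demand price below are illustrative assumptions, not current figures.

```python
# Compare recent EC2 Spot prices against an assumed on-demand rate.
from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # example region
ON_DEMAND_HOURLY = 0.192  # assumed on-demand $/h for the chosen type

resp = ec2.describe_spot_price_history(
    InstanceTypes=["m5.xlarge"],          # example instance type
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
)

for entry in resp["SpotPriceHistory"]:
    spot = float(entry["SpotPrice"])
    saving = 100 * (1 - spot / ON_DEMAND_HOURLY)
    print(f"{entry['AvailabilityZone']}: ${spot:.4f}/h "
          f"(~{saving:.0f}% below on-demand)")
```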
Finally, the feature that can tip the balance towards one platform is the managed Hadoop service each one provides. On AWS the managed service is called Elastic MapReduce (EMR), on Azure it is HDInsight and on GCP it is Dataproc. EMR and Dataproc use the Apache Hadoop core (EMR also supports the MapR distribution), while Azure's HDInsight is based on the Hortonworks distribution, so knowledge of Hortonworks already acquired, or acquired later, on other platforms carries over.
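As a minimal sketch of what the managed route looks like, the snippet below launches an EMR cluster with boto3. The cluster name, release label, instance sizes and S3 log bucket are all placeholder assumptions, and the default EMR IAM roles must already exist in the account.

```python
# Launch a small managed Hadoop cluster on AWS EMR.
import boto3

emr = boto3.client("emr", region_name="eu-west-1")  # example region

response = emr.run_job_flow(
    Name="example-hadoop-cluster",           # hypothetical name
    ReleaseLabel="emr-5.20.0",               # an EMR release from early 2019
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,                  # 1 master + 2 workers
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    LogUri="s3://example-bucket/emr-logs/",  # placeholder bucket
    JobFlowRole="EMR_EC2_DefaultRole",       # default EMR roles
    ServiceRole="EMR_DefaultRole",
)
print("Cluster id:", response["JobFlowId"])
```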
If our intention is to install our own distribution rather than use a managed service, any of the three platforms lets us provision virtual machines. This solution is more flexible than the managed service, since we can configure and adapt the platform to our needs (see the sketch after this paragraph). Beyond the characteristics of the Hadoop cluster itself, it is also worth evaluating the ecosystem around the chosen provider. In this respect the most prominent provider is AWS, both for the services it offers directly and for those offered by third parties thanks to its popularity.
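The following hedged sketch shows the "bring your own distribution" route on AWS: provision plain EC2 virtual machines with boto3, onto which Hadoop would then be installed manually. The AMI id, key pair and instance type are placeholders.

```python
# Provision plain EC2 VMs for a self-managed Hadoop installation.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # example region

resp = ec2.run_instances(
    ImageId="ami-00000000000000000",   # placeholder Linux AMI
    InstanceType="m5.xlarge",          # example node size
    MinCount=4,
    MaxCount=4,                        # e.g. 1 master + 3 workers
    KeyName="example-keypair",         # placeholder SSH key pair
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "role", "Value": "hadoop-node"}],
    }],
)
for inst in resp["Instances"]:
    print("Launched", inst["InstanceId"])
```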
One service offered by all three platforms that is relevant to a Hadoop cluster is data storage and archiving. On AWS we have S3 for data hosting, on Azure Blob Storage and on GCP Cloud Storage. Features and prices are on a par here, but the choice may depend on the Hadoop software to be used and how well it integrates with the cluster. Integration with NoSQL database services can also weigh on the selection, since NoSQL databases, much like object storage, can hold the cluster's data. Here too all three vendors offer comparable solutions: DynamoDB on AWS, Cosmos DB (formerly DocumentDB, with a MongoDB-compatible API) on Azure, and Bigtable and BigQuery on GCP.
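As a concrete example of the storage side, the sketch below stages a dataset in S3 with boto3 so that a Hadoop or Spark job can later read it through the s3a:// connector. The bucket, key and local file names are placeholders.

```python
# Stage a dataset in S3 for later consumption by a Hadoop/Spark job.
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")  # example region

bucket = "example-data-bucket"        # placeholder bucket
key = "raw/events/2019-02-04.csv"     # placeholder object key

# Upload a local file; a cluster job would then address it as
# s3a://example-data-bucket/raw/events/2019-02-04.csv
s3.upload_file("events.csv", bucket, key)

# Quick sanity check: list what sits under the raw/events/ prefix.
listing = s3.list_objects_v2(Bucket=bucket, Prefix="raw/events/")
for obj in listing.get("Contents", []):
    print(obj["Key"], obj["Size"], "bytes")
```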
After weighing up all the points above, one provider stands out from the rest: Amazon Web Services. It is one of the most popular platforms and offers the richest ecosystem of services, including several for Big Data processing: Kinesis for real-time data, Lambda for event-driven processing and AWS IoT for ingesting and processing the large volumes of data produced by IoT devices.
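To close with a taste of that ecosystem, here is a hedged sketch of pushing events into a Kinesis stream with boto3, the kind of real-time feed a downstream Hadoop or Spark job could consume. The stream name and the event payload are illustrative inventions.

```python
# Push a single event into an AWS Kinesis stream.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")  # example region

event = {"sensor_id": "plant-42", "kwh": 13.7}  # made-up IoT reading

resp = kinesis.put_record(
    StreamName="example-energy-stream",   # placeholder stream
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["sensor_id"],      # keeps one sensor's events ordered
)
print("Stored in shard:", resp["ShardId"])
```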