Dubbed Cloud Dataproc, the service is in beta, though Google’s beta products normally stay that way for years.
The service sits between managing the Spark data processing engine or Hadoop framework directly on virtual machines and a fully managed service like Cloud Dataflow.
Dataflow lets users orchestrate data pipelines on Google’s platform.
Dataproc users can create a Hadoop cluster in under 90 seconds, and Google will charge only 1 cent per virtual CPU per hour in the cluster. This is on top of the usual cost of running virtual machines and data storage, but you can add Google’s cheaper preemptible instances to your cluster to save a bit on compute costs. Billing is per-minute, with a 10-minute minimum.
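To get a rough feel for what that pricing means, here is a minimal back-of-the-envelope sketch in Python. It uses only the figures quoted above (1 cent per virtual CPU per hour, per-minute billing, 10-minute minimum). The function name `dataproc_fee` is made up for illustration, rounding partial minutes up is an assumption, and the result covers only the Dataproc surcharge, not the underlying virtual machine or storage costs.

```python
import math

# Figures quoted in the article; everything else here is illustrative.
RATE_PER_VCPU_HOUR = 0.01   # USD per virtual CPU per hour
MINIMUM_BILLED_MINUTES = 10  # 10-minute minimum charge


def dataproc_fee(vcpus: int, runtime_minutes: float) -> float:
    """Estimate the Dataproc surcharge in USD for one cluster run.

    Assumption: partial minutes are rounded up to the next whole minute.
    VM and storage costs are NOT included.
    """
    billed_minutes = max(MINIMUM_BILLED_MINUTES, math.ceil(runtime_minutes))
    return vcpus * RATE_PER_VCPU_HOUR * billed_minutes / 60


if __name__ == "__main__":
    # A 16-vCPU cluster running for 45 minutes.
    print(f"45-minute job: ${dataproc_fee(16, 45):.4f}")
    # A 4-minute job on the same cluster is billed as 10 minutes.
    print(f"4-minute job:  ${dataproc_fee(16, 4):.4f}")
```

On these assumptions the surcharge stays small next to the compute bill, so the cost of a short-lived cluster is driven mostly by the virtual machines themselves.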
Users can set up ad-hoc clusters when needed and, because the service is managed, Google will handle the administration for them.
It is compatible with all existing Hadoop-based products, and it should be a doddle to port existing workloads over to Google’s new service.
Some punters want total control over their data pipeline and processing architecture and are more likely to want to run and manage their own virtual machines. Even so, Dataproc users won’t have to make any real tradeoffs compared with setting up their own infrastructure.