What is Greenplum Database? This is on of PostgreSQL forks optimized for OLAP and analytics workloads. In my opinion the second life of GreenplumDB began in 2015 year. In this year Greenplum became the open source project. The current 6 version based on PostgreSQL 9.4, and the Greenplum Community is actively developing the 7 version, which should be compatible with PostgreSQL 13! So, this is really cool!
But what prompted me to write this article? The fact is that sometimes we are confronted with the opinion that Greenplum is just a “big sharded Postgres”. Yes, there really is some truth in this, but there are nuances…
Briefly about architecture
GreenplumDB is an analytical DBMS with Shared-nothing (Massive Parallel Processing) architecture. You can see the high-level Greenplum Database architecture bellow.
As you can see, Greenplum consists of a master and a segment hosts. There are some nodes on each segment host. All communication between the master and the segments takes place through an internal network called interconnect. I think many of you have already begun to understand why GreenplumDB is called an analytical database and is not recommended for OLTP systems. The fact is that in addition to network delays, there is also overhead for processing a query on the master and on each of the nodes. This can be critical for queries with a small execution time. In addition, there cannot be many such queries. You should consider these nuances if you suddenly decide to migrate your system from PostgreSQL to Greenplum DB.
But there is good news as well. GreenplumDB is compatible with almost all PostgreSQL software (ODBC, JDBC or libpq). In addition, SQL syntax is mostly compatible with PostgreSQL, including stored functions. But you must be careful when migrating code from PostgreSQL “as is”. The point is that it may not be optimal for GreenplumDB at all.
As the documentation says “In Greenplum Database, the default GPORCA optmizer co-exists with the legacy query optimizer.” This means that there are two query planners in GreenplumDB. The default is GPORCA, which is optimized for some operations, such as queries against partitioned tables. But for queries with a large number of joins, it is sometimes better to use a planner inherited from PostgreSQL.
Platform eXtension Framework
The Greenplum Platform Extension Framework (PXF) is a Java APi which provides parallel, high throughput data access and federated queries across heterogeneous data sources via built-in connectors that map a Greenplum Database external table definition to an external data source. It can exchange data in parallel with external systems. For example, HDFS, Hive, Oracle, MSSQL, Hbase, Kafka, ClickHouse, S3, SAP, MySQL etc.
Greenplum Database, like PostgreSQL, has data partitioning functionality. Describing all the possibilities for distributing data using partitioning will take a very long time. Let’s just mention some interesting possibilities. Partitions can be:
- on fast SSDs (for example, in heap storage),
- on slow HDDs (for example, in column-oriented storage with compression),
- on s3 or Hadoop,
- + push-down to an external DBMS;
A picture from a talk by Andreas Scherbaum at one of the PgConf conferences illustrates very well the partition capabilities of Greenplum DB.
Data Science and Procedural Languages
Greenplum Database has a large number of opportunities for work of data scientists. Some interesting theses for datascientists:
- functions in PL / Python: will be executed in parallel on all nodes;
- own aggregates: on the segment + on the master + final – you can write aggregates by Java, Python, C;
- PL/ Container extension – for each call of the PL-function, it starts PL/ Container;
- MADlib is a SQL wrapper around Data science functions. No need to write code by Python;
- can be installed GPU to segment server for Data science calculation acceleration.
In addition, there is a very important functionality with which you can differentiate the consumption of server resources. Using resource groups, you can:
- Set limits per user, group, session, query;
- Limited resources: CPU, RAM, cost, disk space, concurrency, container resources, hard-linking cores to users and etc.
Another important difference between GreenplumDB and PostgreSQL is the availability of several compression methods. The following methods are currently supported in 6.x version:
For each compression method, you can also specify a compression level from 1 to 9.
This article is not intended to describe all the features of GreenplumDB and differences from PostgreSQL. I just indicated some of the features and nuances that you should consider if you decide to migrate your database from PostgreSQL (or another DBMS) to GreenplumDB. PostgreSQL is an excellent database management system and works well on OLAP systems under certain conditions. But if you have tens or hundreds TB of data and want to migrate to GreenplumDB, I hope this article will be useful to you as a starting information. As you can see Greenplum DB is not “just big PostgreSQL”.