In this article I’ll share some of my experiences with Trino (ex. Presto) – high performance distributed query engine.
First some intro about the project Presto.
Couple of members from the Facebook infrastructure team created the project Presto to address problems they have with 300 Petabytes Hadoop Data Warehouse.
The main goal of the project is to have a tool that can ensure low latency interactive analytics over a massive, petabytes Hive Big Data Warehouse.
In 2018, project founders due to some disagreements, leave the FB.
The whole story you can read on the following link:
https://trino.io/blog/2020/12/27/announcing-trino.html
The reason why is this important is because as a result, now we have two separate projects.
First one is supported by the Facebook:
while the other project – PrestoSQL, which you can previously found on the link below:
is now rebranded as Trino, which is supported by all founders of the Presto project:
In this article I’ll use Trino, which has more active community and is supported by founders of that project, as you can see on the following picture.
Besides that, there are some commercial offerings, starting with Starburst, where the project founders now work:
If you are already on the Cloud, it might be interesting for Qubole, which offers Presto as a service. You can find out more on the link below:
Why you should seriously consider installing a Trino (Presto) at your place?
Here are some use cases that I’ve found common for every company I’ve been working so far:
1. Real-time Data Warehouse
Majority of Data Warehouses are still implemented by using the same, old ETL process, which is schedule once per day, with the goal to inject data from the last business day into the star schema tables, so business users can get their reports with data from the previous day.
But what if you want to have history data from the DWH and current data from the OLTP system?
Well, you can use various ways, but no one is efficient and fast, with significant impact.
If you have such case, you should consider Trino (ex. PrestoSQL).
2. AI (Artificial Intelligence), ML (Machine Learning) – data preparation
When developing AI / ML pipelines, the most of your time is spent on data preparation and data analysis, not only because of limitations of Spark, Pandas, Dask or R Studio, but also because of performance and scalability.
If you are working as a Data Scientist, distributed query engine such as Trino (PrestoSQL) is a way to go.
Once you try it, you’ll never go back.
3. Troubleshooting issues with data
If your job is DataOps engineer, System Integrator, Support service, Business analyst or any role where you need to check and troubleshoot data inconsistency among different systems (SQL, NoSQL, Kafka, Hive…), Trino (PrestoSQL) will make your life easier.
What makes Trino (Presto) apart from its competitors?
1. SQL as the query/programming language for any data source
Instead of learning different DSL languages for each data sources (such as MongoDb, ELK etc.), you probably already know SQL engine, so you don’t need to learn anything to get all the benefits from Trino (PrestoSQL).
2. Fast to implement data integration
Instead of building ETL chains to get data from two different data sources together, in Trino (PrestoSQL) you only need to add connector for those data sources, and in a matter of minutes you are ready to go.
3. Minimal impact on data source
In case you are using Oracle Gateway, or MS Anylysis Server, you are probably aware of many limitations they have.
One of them is Driving Site – which means that if you are connected to the Oracle database and want to join local table with either another Oracle Db or some other RDBMS (e.g. MS SQL Server), by default all data from the remote site will be fetched across the network to the database from where you launched SQL statement.
As you can see, this is very inefficient way to do the job, because you are moving a (lot) of data across the network to perform join, which will be sub-optimal since statistics for Cost Based Optimizer might be missing.
Instead of moving data across the network, Trino (PrestoSQL) approach is to move your code close to the data source.
Since Trino (PrestoSQL) will open single thread to fetch the data (no parallelism will be used unless you explicitly request it), the impact will be minimal (just reading the data).
Additionally, Trino (Presto) with each release comes with more tricks that CBO is capable to do, such as pushing predicate to eliminate as many rows as possible, to further lower down amount of data you need to move across the network.
Joins, which in old approach will take place in driving site (database that started the initial query), in Trino (PrestoSQL) all joins will take place on worker nodes that are governed by the controller node, avoiding any impact on the data sources.
4. Many data sources available
When you want to join data from different (heterogenous) data sources with old approach you are limited with SQL relational databases.
With Trino (PrestoSQL) you can join Mongo, Kafka, Oracle and Hive in the same SQL.
Some of the available connectors you can see on the following slide.
5. Programmatically using distributing query engine
Besides command line client or database management tools to write SQL code for Trino (PrestoSQL) such as DataGrip, DBeaver and likes, you can use available jdbc drivers to use query engine from your application.
6. Combining On-prem, Cloud and Hybrid data
Although I didn’t mention explicitly, it doesn’t matter where your data lies – is it on prem, on Cloud, or somewhere between.
In all cases the principle is the same.
On the next post, I’ll provide some examples of using this amazing technology.
Comments