If you’ve chatted with me in depth about BI/DW, it’s highly likely I’ve mentioned Presto – an open source project I’ve been following for the last few years.
Presto is a “SQL on anything” distributed SQL query engine which was developed by Facebook and publicly announced back in 2012. It’s not a database (even though the platform name “Prestodb” might allude to that); it’s a query engine built on MPP (Massively Parallel Processing), which at scale is blisteringly fast. The key to all of this is that Presto can run ANSI SQL, via its connectors, against an array of sources (potentially combining result sets across them), including HDFS, Amazon S3, Azure, Hive, Kafka, MongoDB, MySQL, Postgres, Redis, Redshift, SQL Server and many, many more.
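To make the “combining result sets” point concrete, here is the shape of a hypothetical federated query. The catalog, schema and table names below are all invented for illustration, but the catalog.schema.table addressing across connectors is how Presto stitches sources together:

```sql
-- Hypothetical example: join web logs sitting in Hive against customer
-- records in MySQL, in one ANSI SQL statement. All names are made up.
SELECT c.customer_name,
       count(*) AS page_views
FROM hive.weblogs.page_views AS v
JOIN mysql.crm.customers AS c
  ON v.customer_id = c.customer_id
GROUP BY c.customer_name
ORDER BY page_views DESC
LIMIT 10;
```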
Just to bolster the street cred’ of this platform, some of the biggest players in Silicon Valley are on board with Presto, including Facebook, Airbnb, NASDAQ, Teradata, MicroStrategy, Uber, Netflix, Groupon, Slack, Dropbox, LinkedIn, Atlassian and Twitter – and even AWS utilizes the Presto engine in its Athena product. Facebook has been running Presto over 300PB (yep… petabytes) of data, with over 1,000 users executing in excess of 30,000 queries against the platform on a daily basis.
Here’s a quick view on how the query engine architecture looks:
There’s plenty of information on the Internet, and I may well write another blog on the platform nitty-gritty in the future.
As Presto is an open source, Apache-licensed project, deployment options are currently limited to *nix and macOS platforms; I haven’t seen any appetite for Windows deployments. So with that in mind, today I am going to walk through deploying Presto on Linux. I’m running a pentest distro called Kali Linux (which I would not recommend for this), which is Debian-based, so mileage may vary depending on your distribution.
So first off we need to go and collect the tarball:
Next is to unpack it:
Then double check that it has unpacked (we can see the tarball and the extract here):
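For those following along in a terminal, the commands look something like this – the version number and download URL are assumptions, so check the Presto site for the current release:

```shell
# Version and URL are assumptions – grab the current release from the Presto site
wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.217/presto-server-0.217.tar.gz

# Unpack the tarball
tar -xvzf presto-server-0.217.tar.gz

# Double check the unpack – both the tarball and the extracted folder should be listed
ls -l
```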
So we will get into the extracted folder:
And create a folder called etc which will contain some configuration files:
We can validate that it’s there:
Now that we have the folder we can jump into it:
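In command form, those steps look like this (run from inside the extracted presto-server folder, whose name will depend on your version):

```shell
# Assumes we have already cd'd into the extracted presto-server folder
mkdir -p etc   # create the configuration folder
ls -l          # validate that it's there
cd etc         # and jump into it
```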
And we are going to use a visual editor (vi) to make our required config files:
Once you hit enter on the above, vi will open up and you want to press “i” to enter insert/edit mode; from there we add the configuration per below (note insert mode is indicated in the lower left):
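The first file to create is etc/node.properties, which identifies this node. A minimal version, following the standard Presto deployment layout, looks like the below – the environment name and data directory are assumptions to adjust for your setup, and node.id must be unique for every node in a cluster:

```properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data
```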
Once the configuration is in place, hit escape [esc], then type “:wq” and hit enter [enter] to save and quit vi:
We need to use vi in exactly the same manner to populate the remaining configuration files:
The settings for jvm.config, config.properties and log.properties are below:
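For reference, here are representative versions based on the standard single-node Presto deployment guide – the heap size, memory limits and port are assumptions, so tune them to your machine.

jvm.config holds the command line options for the Java VM that runs Presto:

```
-server
-Xmx8G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
```

config.properties tells this single node to act as both coordinator and worker, and sets the HTTP port that the web UI will answer on:

```properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080
```

log.properties sets the minimum log level for the server:

```properties
com.facebook.presto=INFO
```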
*Note that the config.properties configuration above is for a single-node instance where the coordinator and worker reside on the same machine. This is fine for testing and evaluation; however, reaping the rewards of the distributed engine in a production instance requires an alternate configuration – I will likely cover those options in another post.
Once those configuration files are completed we need to create a new subfolder in the etc folder called catalog, once again validating that it exists, then move into the new catalog folder and finally launch vi to create the final properties file:
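Those steps as commands (vi is shown as a comment, since it drops you into the editor):

```shell
# Run from inside the etc folder
mkdir -p catalog    # create the catalog subfolder
ls -l               # validate that it exists
cd catalog          # move into it
# ...then "vi jmx.properties" launches the editor to create the final properties file
```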
Below is the configuration for jmx.properties:
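It’s a one-liner that registers Presto’s built-in JMX connector:

```properties
connector.name=jmx
```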
Now that all the configuration is completed we need to back out of the catalog location to root, and then move into the Presto installation folder (note I have listed the files to make it easy to cut and paste the folder name):
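As commands, that navigation looks roughly like this – the folder name again depends on the version you downloaded:

```shell
cd ../..                 # back out of etc/catalog
ls -l                    # list the files so the folder name is easy to cut and paste
cd presto-server-0.217   # folder name assumes version 0.217
```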
Now we are ready to fire Presto up:
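Presto ships with a launcher script in its bin folder; start it as a background daemon, or use run if you want the logs in your console:

```shell
bin/launcher start   # start Presto as a daemon in the background
# or:
bin/launcher run     # run in the foreground, logging to the console
```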
And if you go over to the URL specified in config.properties you should see the running cluster (the uptime indicator should be in green – mine isn’t because I had stopped my cluster):
And last but not least to kill the cluster:
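Again via the launcher script:

```shell
bin/launcher stop    # stop the running server
```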
Congratulations, you’ve now built a (single node) Presto cluster!
In my next blog we are going to set up Superset, and either in that post or the next integrate it with Presto, so keep reading!
Thomas – MacGyver of code
Thomas blogs about big data, reporting platforms, data warehousing and the systems behind them.