A Primer on Computer Memory

While I've long known a computer runs by manipulating bits, I didn't have a good mental model of where those bits are stored and how they are accessed. I've had to build that understanding over the years, and it has helped greatly as I've optimized training pipelines for neural nets and developed data structures for high-performance computing. Talking to colleagues in the industry, I've found many of them are also uncertain about some foundational details. This article builds a mental model of computer memory, so that you can reason about why certain things are fast and others slow.

Presto Data Flow

Presto's speed comes from massively parallelizing queries. We've talked about how it plans queries to be parallelized; now let's talk about how it organizes query execution: clients, coordinators, workers, and the channels of communication between them.

Presto Joins

Joining two database tables is one of the harder operations to make performant, and joins are foundational to most analytical queries. Let's talk about how Presto performs joins, the choices it makes, and how to make your JOIN queries more efficient.

Presto Map Reduce

Viewed from afar, the query engine consumes one or more input streams of rows and produces a single output stream of rows. In this note, we focus on the basic case where a single input stream is converted to the output stream. This is conceptually similar to the Map-Reduce paradigm, where rows get filtered, transformed, exploded, or aggregated into new rows. For performance, Presto structures these steps to be as parallelizable as possible.
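To make the row-stream idea concrete, here is a minimal sketch in plain Python. It is not Presto code and does not use any Presto API; the function names and the sample `orders` rows are hypothetical, chosen only to show rows being filtered, exploded, and aggregated as one stream flows into the next.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator

# Hypothetical row type: a plain dict standing in for a table row.
Row = Dict[str, object]


def keep_completed(rows: Iterable[Row]) -> Iterator[Row]:
    # Filter: roughly what a WHERE status = 'completed' clause does.
    return (r for r in rows if r["status"] == "completed")


def explode_items(rows: Iterable[Row]) -> Iterator[Row]:
    # Explode: emit one output row per element of the "items" list.
    for r in rows:
        for item in r["items"]:
            yield {"user": r["user"], "item": item}


def count_by_user(rows: Iterable[Row]) -> Dict[str, int]:
    # Aggregate: roughly what COUNT(*) ... GROUP BY user does.
    counts: Dict[str, int] = defaultdict(int)
    for r in rows:
        counts[r["user"]] += 1
    return dict(counts)


# Made-up sample input stream.
orders = [
    {"user": "ann", "status": "completed", "items": ["a", "b"]},
    {"user": "bob", "status": "cancelled", "items": ["c"]},
    {"user": "ann", "status": "completed", "items": ["d"]},
    {"user": "bob", "status": "completed", "items": ["e"]},
]

result = count_by_user(explode_items(keep_completed(orders)))
print(result)  # {'ann': 3, 'bob': 1}
```

The filter and explode steps look at one row at a time, so in principle they can run independently on separate chunks of the input. The aggregation needs all rows for a given key brought together, which is the part that resembles the reduce phase.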

Presto Connectors

Presto is a SQL query engine, one that ultimately understands how to consume one or more input streams of rows and produce an output stream of rows. At its core, it doesn't understand things like datastores, disk IO, primary keys, and partitions. To be practically useful, it needs to be able to connect to datastores, which it does via connectors. A connector is specific to a particular datastore (say, MySQL, Hive, or Cassandra), and is the component that understands concepts such as disk IO and partitions.