SQL JOIN in Java

I often interview software engineers and ask a simple question about how JOIN works in SQL internally. They usually reply with some rambling about magic trees and optimal indexing. I used to think that a good programmer knows their tools; unfortunately, it is not always so. I still cannot grasp how a specialist can be comfortable operating their most basic tools without any picture of what they have under the hood.

Let’s go over SQL Joins 101, look at how they work and implement a couple of algorithms together.

Problem

If my interviewee claims to work a lot and closely with SQL, I usually ask the following question.

Here is an SQL script:
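Something along these lines (the table and column names here are just placeholders):

    -- An illustrative script; the table and column names are placeholders.
    SELECT u.name, d.name
    FROM users u
    JOIN departments d ON u.department_id = d.id;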

How would you write the same in Java, i.e. implement the following:
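Something like this sketch, where the Row and Joined types are illustrative stand-ins for real rows:

    import java.util.List;

    interface Joiner {
        // A row is a (key, value) pair; a joined row carries the key and both values.
        record Row(int key, String value) {}
        record Joined(int key, String leftValue, String rightValue) {}

        List<Joined> join(List<Row> left, List<Row> right);
    }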

I’m not asking for the code of an actual implementation, rather for an explanation of the algorithm behind it. My other question is: what would have to change in the signature and the implementation if we wanted to simulate working with indexes?

Let’s first understand why we should even figure out how JOIN works. Here is why:

  1. Knowing the theory is useful for educational purposes.
  2. If you distinguish between the different types of SQL joins, EXPLAIN QUERY PLAN will no longer look like a string of incomprehensible words to you. You will be able to examine the query plan, identify slow spots, and optimize the query by rewriting it or by adding hints.
  3. Query optimizers in the trendy new analytical tools for the Hadoop stack are either dumb (Cloudera Impala) or non-existent (Twitter Scalding, Spark RDD). In the latter case, the query has to be assembled manually from primitives.
  4. Finally, there is a risk that one day you’ll be interviewed by me or another nitpicker. Alright, this article is really not about interviews but about how joins operate.

Nested Loops Join

The most basic algorithm for joining two tables is by now taught in middle school and is very simple. For each element of the first table, run through all the elements of the second table; if the keys of two elements are equal, store the match in the resulting table. Two nested loops are enough to implement this, which is why the algorithm is called Nested Loops Join.
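A minimal sketch in Java, assuming rows are the simple (key, value) pairs from the signature above:

    import java.util.ArrayList;
    import java.util.List;

    class NestedLoopsJoin {
        record Row(int key, String value) {}
        record Joined(int key, String leftValue, String rightValue) {}

        static List<Joined> join(List<Row> left, List<Row> right) {
            List<Joined> result = new ArrayList<>();
            for (Row l : left) {                  // run through the first table
                for (Row r : right) {             // for each element, run through the second
                    if (l.key() == r.key()) {     // equal keys: store the match
                        result.add(new Joined(l.key(), l.value(), r.value()));
                    }
                }
            }
            return result;
        }
    }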

The main advantage of this method is that it is completely indifferent to the input data. The algorithm works for any two tables, does not require indexes or sorting data in memory, and is easy to implement. In practice, this means that it is enough to just run through the disk using two cursors and periodically push the matches into the socket.

However, a serious disadvantage of Nested Loops Join is its high time complexity of O(N*M), i.e. quadratic growth. For instance, joining two modest tables of 100,000 and 500,000 rows takes as many as 100,000 * 500,000 = 50,000,000,000 (50 billion) comparison operations. Queries with such a JOIN take indecently long to execute, and oftentimes they are the drag behind some home-grown CMS.

Modern RDBMS use Nested Loops Join only in the most hopeless cases when no optimization can be applied.

UPD. For a small number of rows, running an intelligent optimization is often more expensive than just doing Nested Loops, so plain Nested Loops works well for a certain class of systems.

Hash Join

If one of the tables is small enough to fit entirely into memory, then a hash map can be used to look up the keys in it. Let’s discuss Hash Join.

Check the sizes of both tables. Take the smaller one, read it completely, and load it into memory, building a HashMap from it. Now go back to the larger table and run through it with a cursor from the beginning. For each key, check whether it exists in the HashMap; if it does, store the match in the resulting table.
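A sketch under the same illustrative types, assuming unique keys and a smaller table that fits in memory:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class HashJoin {
        record Row(int key, String value) {}
        record Joined(int key, String leftValue, String rightValue) {}

        static List<Joined> join(List<Row> smaller, List<Row> larger) {
            // Build phase: read the smaller table completely into a HashMap.
            Map<Integer, String> hash = new HashMap<>();
            for (Row s : smaller) {
                hash.put(s.key(), s.value());
            }
            // Probe phase: run through the larger table with a cursor.
            List<Joined> result = new ArrayList<>();
            for (Row l : larger) {
                String match = hash.get(l.key());
                if (match != null) {              // the key exists in the map: store the match
                    result.add(new Joined(l.key(), match, l.value()));
                }
            }
            return result;
        }
    }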

The time complexity of Hash Join is linear, O(N+M), but it requires additional memory.

It is important to note that in ‘prehistoric times’ it was believed that the right table should be loaded into memory, while the left table should be iterated. Nowadays, good RDBMS have column cardinality statistics and determine the order of joins automatically. If for some reason cardinality is not available, then the right table is loaded into memory. This is important to remember when working with new, clunky tools like Cloudera Impala.

Merge Join

Let’s imagine that the data in both tables is already sorted, for instance in ascending order. This happens if the tables have indexes on the join key, or if the data was sorted at a previous stage of the query. As you might remember, two sorted tables can be merged into one sorted table in linear time; this is what the merge step of Merge Sort is based on. Here we have a similar task, but instead of merging the tables, we are going to look for matching elements.

Let’s put a cursor at the top of each table. If the cursors point to equal keys, store the match in the resulting table. If not, check which cursor points to the smaller key and move that cursor forward by one, catching up with the other.
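A sketch with the same assumptions plus sorted inputs (unique keys for now; duplicates are discussed below):

    import java.util.ArrayList;
    import java.util.List;

    class MergeJoin {
        record Row(int key, String value) {}
        record Joined(int key, String leftValue, String rightValue) {}

        static List<Joined> join(List<Row> left, List<Row> right) {
            List<Joined> result = new ArrayList<>();
            int i = 0, j = 0;                          // a cursor at the top of each table
            while (i < left.size() && j < right.size()) {
                Row l = left.get(i);
                Row r = right.get(j);
                if (l.key() == r.key()) {              // equal keys: store the match
                    result.add(new Joined(l.key(), l.value(), r.value()));
                    i++;
                    j++;
                } else if (l.key() < r.key()) {        // advance the cursor with the smaller key
                    i++;
                } else {
                    j++;
                }
            }
            return result;
        }
    }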

If the data is sorted, the time complexity of this algorithm is linear, O(M+N), and no additional memory is required. If not, the data has to be sorted first, which raises the time complexity to O(M log M + N log N) and adds memory requirements.

Outer Joins

You might notice that the code above only mimics INNER JOIN and expects all the keys in both tables to be unique, that is, to occur no more than once. I did this on purpose, for two reasons. First, it makes for a good demonstration of the algorithms: the code shows only the logic of the joins and nothing else. Second, I really wanted to go to sleep. :-) Let’s discuss what has to change in the code to support different types of joins and non-unique key values.

The first problem is non-unique, or duplicate, keys. Duplicate keys require producing the Cartesian product of all the values for each key: if a key occurs twice in the left table and three times in the right one, the result must contain all six combinations.

For Nested Loops Join, this works out of the box.

For Hash Join, replace the HashMap with a MultiHashMap, i.e. a map from a key to the list of its values.
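In plain Java, a map from a key to a list of values does the job; a sketch, again with illustrative types:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class MultiHashJoin {
        record Row(int key, String value) {}
        record Joined(int key, String leftValue, String rightValue) {}

        static List<Joined> join(List<Row> smaller, List<Row> larger) {
            // Build phase: each key now maps to the list of all its values.
            Map<Integer, List<String>> hash = new HashMap<>();
            for (Row s : smaller) {
                hash.computeIfAbsent(s.key(), k -> new ArrayList<>()).add(s.value());
            }
            // Probe phase: one joined row per stored value,
            // i.e. the Cartesian product for each key.
            List<Joined> result = new ArrayList<>();
            for (Row l : larger) {
                for (String match : hash.getOrDefault(l.key(), List.of())) {
                    result.add(new Joined(l.key(), match, l.value()));
                }
            }
            return result;
        }
    }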

For Merge Join, the situation is much more serious: we have to keep track of how many elements with the same key we encounter on each side.
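One way to sketch it: when the cursors meet equal keys, find the whole run of that key on each side and emit the Cartesian product of the two runs (the types are illustrative as before):

    import java.util.ArrayList;
    import java.util.List;

    class MergeJoinWithDuplicates {
        record Row(int key, String value) {}
        record Joined(int key, String leftValue, String rightValue) {}

        static List<Joined> join(List<Row> left, List<Row> right) {
            List<Joined> result = new ArrayList<>();
            int i = 0, j = 0;
            while (i < left.size() && j < right.size()) {
                int lk = left.get(i).key();
                int rk = right.get(j).key();
                if (lk < rk) {
                    i++;
                } else if (lk > rk) {
                    j++;
                } else {
                    // Find where the run of equal keys ends on each side.
                    int iEnd = i, jEnd = j;
                    while (iEnd < left.size() && left.get(iEnd).key() == lk) iEnd++;
                    while (jEnd < right.size() && right.get(jEnd).key() == rk) jEnd++;
                    // Emit the Cartesian product of the two runs.
                    for (int a = i; a < iEnd; a++) {
                        for (int b = j; b < jEnd; b++) {
                            result.add(new Joined(lk, left.get(a).value(), right.get(b).value()));
                        }
                    }
                    i = iEnd;
                    j = jEnd;
                }
            }
            return result;
        }
    }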

Duplicate keys increase the asymptotics to O(N*m + M*n), where n and m are the average of rows per key in the tables. The degenerate case when n = N and m = M produces CROSS JOIN.

Also, to support outer joins, you need to track the keys that have no match in the other table.

For Merge Join, the keys without a match are visible right away, for every direction of JOIN.

For Hash Join, you can immediately register that the left table has a key with no matches on the right. To register unmatched keys in the right table, you have to keep a separate ‘Matched’ flag for each element of the hash table. After the main join is completed, run through the hash table and add the keys without the flag to the resulting table.
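A sketch of a FULL OUTER Hash Join along these lines; the wrapper class with its ‘matched’ flag is my own illustration, keys on the right are assumed unique for brevity, and null marks the missing side:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class OuterHashJoin {
        record Row(int key, String value) {}
        record Joined(int key, String leftValue, String rightValue) {}

        // A hash table element: the right-hand value plus a 'matched' flag.
        static final class RightEntry {
            final String value;
            boolean matched = false;
            RightEntry(String value) { this.value = value; }
        }

        static List<Joined> join(List<Row> left, List<Row> right) {
            Map<Integer, RightEntry> hash = new HashMap<>();
            for (Row r : right) {
                hash.put(r.key(), new RightEntry(r.value()));
            }
            List<Joined> result = new ArrayList<>();
            for (Row l : left) {
                RightEntry e = hash.get(l.key());
                if (e != null) {
                    e.matched = true;
                    result.add(new Joined(l.key(), l.value(), e.value));
                } else {
                    // A left key with no match on the right is visible immediately.
                    result.add(new Joined(l.key(), l.value(), null));
                }
            }
            // Final pass: add the right-hand keys that were never matched.
            for (Map.Entry<Integer, RightEntry> me : hash.entrySet()) {
                if (!me.getValue().matched) {
                    result.add(new Joined(me.getKey(), null, me.getValue().value));
                }
            }
            return result;
        }
    }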

For Nested Loops Join, the situation is similar and so simple that I was even able to implement it:
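A plausible reconstruction of such an implementation: a FULL OUTER Nested Loops Join that also supports duplicate keys, with null marking the missing side:

    import java.util.ArrayList;
    import java.util.List;

    class OuterNestedLoopsJoin {
        record Row(int key, String value) {}
        record Joined(int key, String leftValue, String rightValue) {}

        static List<Joined> join(List<Row> left, List<Row> right) {
            List<Joined> result = new ArrayList<>();
            boolean[] rightMatched = new boolean[right.size()];
            for (Row l : left) {
                boolean leftMatched = false;
                for (int j = 0; j < right.size(); j++) {
                    Row r = right.get(j);
                    if (l.key() == r.key()) {
                        result.add(new Joined(l.key(), l.value(), r.value()));
                        leftMatched = true;
                        rightMatched[j] = true;
                    }
                }
                if (!leftMatched) {                    // a left row with no match
                    result.add(new Joined(l.key(), l.value(), null));
                }
            }
            for (int j = 0; j < right.size(); j++) {   // right rows with no match
                if (!rightMatched[j]) {
                    Row r = right.get(j);
                    result.add(new Joined(r.key(), null, r.value()));
                }
            }
            return result;
        }
    }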

Afterword. Author’s opinion

If you made it this far, I hope you found this article interesting. Now please forgive me a brief lecture.

I do believe that knowing an RDBMS only at the SQL level is absolutely not enough to consider yourself a professional software developer. A good specialist should know not only their own code, but also how their stack neighbors function: third-party systems such as databases, frameworks, network protocols, file systems, etc. Without this, a developer degenerates into a coder and becomes useless for seriously complicated tasks.

UPD. Despite the afterword, this article is about SQL joins.

Public Sector Wage Distribution and Inequality in San Francisco

By a curious twist of fate, I came across data on salaries, overtime pay, and bonuses for the public sector in the City by the Bay, San Francisco. The dataset covered public employees of all ranks, from the Mayor to the City Hall janitor.

Without losing a single minute, I dug into the numbers. There is nothing more exciting than looking at other people’s wages, especially when you can justify it as doing data science! 🙂

The dataset contained information not only about top executives’ wages but about regular employees as well, down to the lowest-level ones. This data shows fundamental patterns in income distribution that can easily be observed in real life. I would like to use this opportunity to invite all of you couch economists to explore the exciting world of impressive numbers and stingy statistics.

We will talk about average and median pay, the Gini index of income inequality, the rich-to-poor income ratio, the Matthew Effect that describes an ever-accumulating income gap, and career growth.
