Skip to main content

Hash Outer Join


As per the definition of hash outer join, the driving (or parent) table whose rows are being preserved is used to build the hash table, and the driven (or child) table is used to probe the hash table.

To perform this hash outer join, Oracle follows these steps (assume, child table has to be outer-joined with parent table):
1.     Oracle chooses parent table as the hash table (otherwise called as driving table). Oracle built the hash table in RAM after applying the hash function on the joining column(s) of the driving table.
2.     Oracle chooses child table as the probe table (otherwise called as driven table or probing table). It traverse through all the records of this probe table, applies the same hash function on the joining column(s) [column(s) used to join these two tables] and will hit the corresponding entry in the hash table.
3.     Oracle returns the output if a record from the driving table is already present in the same hash key and will flag this row as “MATCHED” in the parent hash table else no record will be returned. [ I used “MATCHED” flag term to explain this step in an easy way].
4.     After all the records of driven table is processed, Oracle will traverse through parent driving table once again and the rows not marked as “MATCHED” will be returned along with NULL values for child probing table’s columns mentioned in the query.

As you aware, in hash equi-join, deciding which table is going to be the hash table is the deciding factor. Generally, the smallest of the two tables will be considered as the driving table so that cost will be low. [cost will be low because the smallest of two tables will fit perfectly in RAM, occupy less amount of memory space in RAM and child table will completely be traversed in a single pass]. However, in this hash outer join, this decision making process becomes immaterial since Parent table will be chosen as the hash table by default irrespective of whether it is a small or big table. Reason is, joining condition determines which table is going to be the hash table in this hash outer join instead of cost which is the metric used to identify the hash table in the hash equi-join.

The logical activity diagram for this methodology will be like this,

Read all the required data from the parent table and built the hash table in RAM after applying hash function
Loop (for all the required records in the child table)
·         Apply the hash function for each record to get the address [hashed value] in RAM
·         Hit RAM in the corresponding address
·         If a record from the driving table is already present in the same hash key, flag this record of parent hash table as “MATCHED” and return the columns from both the tables mentioned in the query else go the next record in child table
End loop
Loop (for all the records in the parent hash table which are not flagged as “MATCHED”)
·         Return the columns from the hash table along with NULL values for child probing table’s columns mentioned in the query
End loop

LITTLE-KNOWN FACTS TO BE REMEMBERED:
·         /*+ USE_HASH(<<driving table>>) */ is the hint that can be used to impose this hash outer join.
·         This methodology is opted by Oracle only if both the tables are joined by outer join “(+)” operator.
·         In RAM, hash outer join will be carried out only in the allocated memory space which is nothing but HASH_AREA_SIZE component of PGA.
·         Only driving table will be inserted in the hash table in RAM and the driven table will never be written into the hash table in RAM. Hash function will be applied in the driven table only to locate the corresponding memory entry in the hash table to look for the matching entry from the driving table there.
·         This is very successful when one table is smaller and another table is bigger. In this case, it outplays the nested loop outer join.
·         If the driving table cannot be hashed in RAM in single pass (i.e., at one shot or one go), then a portion of the hash-table spills to disk(actually, TEMP tablespace will be used here to hold that spilled dataset from the driving table). When the hash table is probed by the driven table, the rows with join keys that match those parts of the in-memory hash table are joined immediately; the rest [of the driven table] are also written into TEMP tablespace and joined in the second pass. The bigger driving table is, the smaller the proportion of the hash table that can fit in RAM, the remaining data will be spilled into TEMP tablespace and have to go through the subsequent passes till it processes all the records spilled into TEMP tablespace from both these driving and driven table. This slows the Hash Join down considerably and also makes the join non-scalable in this scenario.
ADVANTAGE:
·         If the driving table is small enough to fit in RAM, outer join “(+)” operator is used in the query & the driven table is very big, then the hash outer join outplays the nested loop outer join.
·         Still nested loop outer join will be the fastest way to retrieve the first matching record [since hash outer join takes some time to build the hash table in RAM for the driving table as it applies hash function] but hash outer join will be the fastest way to get all the matching records quickly when compared to the nested loop outer join. In general, hash function (in hash outer join) will perform better and more efficiently.

DISADVANTAGE:
·         When this is opted for joining 2 big tables, TEMP tablespace will be used extensively as the spilled dataset from both driving and driven tables will be written into TEMP tablespace
·         The above scenario makes hash outer join as more non-scalable as it has to go through subsequent passes to join these spilled dataset (available in TEMP tablespace) from both these tables

HOW TO VERIFY:
How to verify whether Oracle follows hash outer join or not while executing the sql query. If a query follows this, then you will find similar execution plan like this,

--------------------------------------------------------------------------------------------------------
| Id  | Operation                                                     | Name                                             | Rows  |
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                                   |                                                          |            |
|   1 |    HASH JOIN (OUTER)                               |                                                          |            |
|   2 |       TABLE ACCESS (FULL)                          | <<parent table’s name>>     |            |
|   3 |       TABLE ACCESS (FULL)                          | <<child table’s name>>          |            |

In the explain plan, whenever it chooses this methodology, it displays the keyword “HASH JOIN (OUTER)” in the operation column. Whichever the table name that is getting displayed immediately after this word is nothing but the driving table (otherwise called as hash table) and that’s why you will find parent table’s name in that position followed by the driven table’s name (otherwise called as probe table).


EXAMPLE:

In order to understand how the mechanism of hash outer join is different from hash equi-join, I am taking the same example which I used there (Tip#20) with a few modifications and imposing the outer join.

Create 2 tables.
One table is store the information about all the employees and inserts 6 records.
Another table is store the information of the sales done by the employees and inserts 3 records.
The tables will look like this,

DATA TABLE:
EMP:
ROWID
Empid
Empname
AAAAA1
3825
CHARLES
AAAAA2
9827
FERGUSON
AAAAA3
2389
NADAL
AAAAA4
1784
ERIN
AAAAA5
4556
ROONEY
AAAAA6
8711
TOMMY

SALES:
ROWID
Empid
Sales_Amt
AAAAA7
9827
6000
AAAAA8
2389
8500
AAAAA9
5642
4000

Fire this query against these tables where the requirement is to display the employee’s name & sales_amt (if they have done) of all the employess available in EMP table,
Select a.empname, b.sales_amt
from emp a, sales b
where a.empid = b.empid(+);

since outer-join “(+)” operator is used , hash join will be used to join these 2 tables.

As per this query, emp is the parent table and bonus is the child table since the latter[sales.empid(+)] is outer-joined with the former. So, oracle will consider emp as the driving able and bonus as the driven table though the latter is smaller than the former. That’s the reason cost is not the deciding factor and join condition is the deciding factor in hash outer join.

Assume, Oracle can hold 10 records at a time in the allocated memory space(defined by HASH_AREA_SIZE global parameter) in RAM.
For easy understanding, pls assume the memory address will be from 0 to 9 to hold the records in the hash table.
For easy understanding, pls assume MOD function to the base 10 will be used as hash function here. In real world, the hash function would be more complex and hard to decode it.

EMPTY HASH TABLE:
=================

Hash Key        Data
    

0

1

2

3

4

5

6

7

8

9



As per the design of hash outer join, the following steps will be followed while executing this query,
1.     Since SALES is outer-joined with EMP, EMP is considered as the driving table and SALES is considered as the driven table
2.     For the first record in EMP, hash function is applied and it returns 5 as the hashed value (i.e.,hash(3825)èmod(3825/10)è5)
3.     So, this record will be inserted as the 6th record in hash table
4.     For the next record in EMP, hash function is applied and it returns 7 as the hashed value (i.e.,hash(9827)èmod(9827/10)è7)
5.     So, this record will be inserted as the 8th record in hash table
6.     For the next record in EMP, hash function is applied and it returns 9 as the hashed value (i.e.,hash(2389)èmod(2389/10)è9)
7.     So, this record will be inserted as the 10th record in hash table
8.     For the next record in EMP, hash function is applied and it returns 4 as the hashed value (i.e.,hash(1784)èmod(1784/10)è4)
9.     So, this record will be inserted as the 5th record in hash table
10.  For the next record in EMP, hash function is applied and it returns 6 as the hashed value (i.e.,hash(4556)èmod(4556/10)è6)
11.  So, this record will be inserted as the 7th record in hash table
12.  For the next record in EMP, hash function is applied and it returns 1 as the hashed value (i.e.,hash(8711)èmod(8711/10)è1)
13.  So, this record will be inserted as the 2nd record in hash table. So, the hash table will look like this at the end of this step,

Hash Key                        Data
    






                       empid                 empname                
0


1
8711
TOMMY
2


3


4
1784
ERIN
5
3825
CHARLES
6
4556
ROONEY
7
9827
FERGUSON
8


9
2389
NADAL

14.  For the first record in SALES, hash function is applied and it returns 7 as the hashed value (i.e.,hash(9827)èmod(9827/10)è7)
15.  So, it hits 8th record of hash table
16.  Since there is a record from EMP is available in 8th record of hash table, this record will be flagged as “MATCHED” and data will be returned in the output
17.  For the second record in SALES, hash function is applied and it returns 9 as the hashed value (i.e.,hash(2389)èmod(2389/10)è9)
18.  So, it hits 10th record of hash table
19.  Since there is a record from EMP is available in 10th record of hash table, this record will be flagged as “MATCHED” and data will be returned in the output
20.  For the third(last) record in SALES, hash function is applied and it returns 2 as the hashed value (i.e.,hash(5642)èmod(5642/10)è2)
21.  So, it hits 3rd record of hash table
22.  Since there is no record from EMP is available in 3rd record of hash table, no data will be returned in the output

                        Hash Key                         Data
    






Rounded Rectangle: Empid:5642 from SALES hits here                                       empid                 empname
0

1
8711
TOMMY
2


3

4
1784
ERIN
5
3825
CHARLES
6
4556
ROONEY
7
9827
FERGUSON       MATCHED
8

9
2389
NADAL              MATCHED
Rounded Rectangle: Empid:2389 from SALES hits hereRounded Rectangle: Empid:9827 from SALES hits here

23.  All the records from hash table is being processed once more those are not flagged as “MATCHED”, all those records (empids:8711,1784,3825,4556) from the hash table will be returned along with NULL for the child table’s columns mentioned [sales_amt column in this case] in the query.
24.  It exits.

So, the output will look like this,
Empname
Sales_amt
FERGUSON
6000
NADAL
8500
TOMMY

ERIN

CHARLES

ROONEY


If you close look at the output, you can see that the output is displayed in such a way that, matching records are getting displayed at first and then followed by the unmatching records available only in EMP table.

Now, explain table will look like this,
--------------------------------------------------------------------------------------------------------
| Id  | Operation                                                     | Name                                             | Rows  |
--------------------------------------------------------------------------------------------------------
|   4 | SELECT STATEMENT                                   |                                                          |  6          |
|   3 |    HASH JOIN (OUTER)                               |                                                          |  6          |
|   1 |          TABLE ACCESS (FULL)                       | EMP                                               |  6          |
|   2 |          TABLE ACCESS (FULL)                       | SALES                                             |  3          |

From the explain plan, we can say that both the tables are joined by hash outer join methodology, EMP is considered as the driving table [hash table] and SALES is considered as the driven table [probe table].

Comments

Popular posts from this blog

Listener refused the connection due to NetExcepti​on

I was testing some piece of code for calculation of new date on the basis of a given pattern and the specified date. I wrote a method to automate the test cases to generates those patterns and calculate the new date for each date of the specified date of the range of years. Since there were around 1 million pattern test cases are possible, so I want to insert this data in database for any future reference. After creating a pattern I was inserting data of the pattern and the calculation date along with the calculated date. It was working fine. I was prepare to hit the start button now, after testing different patterns individually. I hit the run button and it started its executions, but in the middle, I got this error. java.sql.SQLException: Listener refused the connection with the following error:ORA-12516, TNS:listener could not find available handler with matching protocol stack       at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:4...

B*Tree Reverse index

B*Tree Reverse index is one of the index types that is supported by oracle. It more or less behaves like b*tree unique & non-unique index but only difference is the way it constructs the tree. In this index, index value is stored in the reverse direction when compared to the values in the actual data table. This index is used in order to distribute the index data evenly across all the leaf index nodes if table data is skewed very high. Like b*tree index, this index doesn’t understand NULL. This B*Tree Reverse index can be created on a single column or more than one column. More than one B*Tree Reverse index can be created for the same data table. HOW TO VERIFY: How to verify whether the created index is reverse index or not? Fire this query, select index_type from user_indexes where index_name = 'DEPT_ID_INDX' ð   This query would return “ NORMAL/REV ” (it means it is B*Tree index but of “Reverse” type) LITTLE-KNOWN FACTS TO BE REMEMBERED: ·      ...

Hash Natural Join

To perform hash join, Oracle follows these steps: 1.      Oracle chooses the smallest of two tables as the hash table (otherwise called as driving table). Oracle built the hash table in RAM after applying the hash function on the joining column(s) of the driving table. 2.      Oracle chooses the other [big] table as the probe table (otherwise called as driven table or probing table). It traverse through all the records of this probe table, applies the same hash function on the joining column(s) [column(s) used to join these two tables] and will hit the corresponding entry in the hash table. 3.      Oracle returns the output if a record from driving table is already present in the same hash key, else no record will be returned. It may look like Nested loop join & Hash join have the same architecture since both these have the concept of driving & driven tables but they have entirely different design if you closely look a...