Skip to main content

Hash Outer Join


As per the definition of hash outer join, the driving (or parent) table whose rows are being preserved is used to build the hash table, and the driven (or child) table is used to probe the hash table.

To perform this hash outer join, Oracle follows these steps (assume, child table has to be outer-joined with parent table):
1.     Oracle chooses parent table as the hash table (otherwise called as driving table). Oracle built the hash table in RAM after applying the hash function on the joining column(s) of the driving table.
2.     Oracle chooses child table as the probe table (otherwise called as driven table or probing table). It traverse through all the records of this probe table, applies the same hash function on the joining column(s) [column(s) used to join these two tables] and will hit the corresponding entry in the hash table.
3.     Oracle returns the output if a record from the driving table is already present in the same hash key and will flag this row as “MATCHED” in the parent hash table else no record will be returned. [ I used “MATCHED” flag term to explain this step in an easy way].
4.     After all the records of driven table is processed, Oracle will traverse through parent driving table once again and the rows not marked as “MATCHED” will be returned along with NULL values for child probing table’s columns mentioned in the query.

As you aware, in hash equi-join, deciding which table is going to be the hash table is the deciding factor. Generally, the smallest of the two tables will be considered as the driving table so that cost will be low. [cost will be low because the smallest of two tables will fit perfectly in RAM, occupy less amount of memory space in RAM and child table will completely be traversed in a single pass]. However, in this hash outer join, this decision making process becomes immaterial since Parent table will be chosen as the hash table by default irrespective of whether it is a small or big table. Reason is, joining condition determines which table is going to be the hash table in this hash outer join instead of cost which is the metric used to identify the hash table in the hash equi-join.

The logical activity diagram for this methodology will be like this,

Read all the required data from the parent table and built the hash table in RAM after applying hash function
Loop (for all the required records in the child table)
·         Apply the hash function for each record to get the address [hashed value] in RAM
·         Hit RAM in the corresponding address
·         If a record from the driving table is already present in the same hash key, flag this record of parent hash table as “MATCHED” and return the columns from both the tables mentioned in the query else go the next record in child table
End loop
Loop (for all the records in the parent hash table which are not flagged as “MATCHED”)
·         Return the columns from the hash table along with NULL values for child probing table’s columns mentioned in the query
End loop

LITTLE-KNOWN FACTS TO BE REMEMBERED:
·         /*+ USE_HASH(<<driving table>>) */ is the hint that can be used to impose this hash outer join.
·         This methodology is opted by Oracle only if both the tables are joined by outer join “(+)” operator.
·         In RAM, hash outer join will be carried out only in the allocated memory space which is nothing but HASH_AREA_SIZE component of PGA.
·         Only driving table will be inserted in the hash table in RAM and the driven table will never be written into the hash table in RAM. Hash function will be applied in the driven table only to locate the corresponding memory entry in the hash table to look for the matching entry from the driving table there.
·         This is very successful when one table is smaller and another table is bigger. In this case, it outplays the nested loop outer join.
·         If the driving table cannot be hashed in RAM in single pass (i.e., at one shot or one go), then a portion of the hash-table spills to disk(actually, TEMP tablespace will be used here to hold that spilled dataset from the driving table). When the hash table is probed by the driven table, the rows with join keys that match those parts of the in-memory hash table are joined immediately; the rest [of the driven table] are also written into TEMP tablespace and joined in the second pass. The bigger driving table is, the smaller the proportion of the hash table that can fit in RAM, the remaining data will be spilled into TEMP tablespace and have to go through the subsequent passes till it processes all the records spilled into TEMP tablespace from both these driving and driven table. This slows the Hash Join down considerably and also makes the join non-scalable in this scenario.
ADVANTAGE:
·         If the driving table is small enough to fit in RAM, outer join “(+)” operator is used in the query & the driven table is very big, then the hash outer join outplays the nested loop outer join.
·         Still nested loop outer join will be the fastest way to retrieve the first matching record [since hash outer join takes some time to build the hash table in RAM for the driving table as it applies hash function] but hash outer join will be the fastest way to get all the matching records quickly when compared to the nested loop outer join. In general, hash function (in hash outer join) will perform better and more efficiently.

DISADVANTAGE:
·         When this is opted for joining 2 big tables, TEMP tablespace will be used extensively as the spilled dataset from both driving and driven tables will be written into TEMP tablespace
·         The above scenario makes hash outer join as more non-scalable as it has to go through subsequent passes to join these spilled dataset (available in TEMP tablespace) from both these tables

HOW TO VERIFY:
How to verify whether Oracle follows hash outer join or not while executing the sql query. If a query follows this, then you will find similar execution plan like this,

--------------------------------------------------------------------------------------------------------
| Id  | Operation                                                     | Name                                             | Rows  |
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                                   |                                                          |            |
|   1 |    HASH JOIN (OUTER)                               |                                                          |            |
|   2 |       TABLE ACCESS (FULL)                          | <<parent table’s name>>     |            |
|   3 |       TABLE ACCESS (FULL)                          | <<child table’s name>>          |            |

In the explain plan, whenever it chooses this methodology, it displays the keyword “HASH JOIN (OUTER)” in the operation column. Whichever the table name that is getting displayed immediately after this word is nothing but the driving table (otherwise called as hash table) and that’s why you will find parent table’s name in that position followed by the driven table’s name (otherwise called as probe table).


EXAMPLE:

In order to understand how the mechanism of hash outer join is different from hash equi-join, I am taking the same example which I used there (Tip#20) with a few modifications and imposing the outer join.

Create 2 tables.
One table is store the information about all the employees and inserts 6 records.
Another table is store the information of the sales done by the employees and inserts 3 records.
The tables will look like this,

DATA TABLE:
EMP:
ROWID
Empid
Empname
AAAAA1
3825
CHARLES
AAAAA2
9827
FERGUSON
AAAAA3
2389
NADAL
AAAAA4
1784
ERIN
AAAAA5
4556
ROONEY
AAAAA6
8711
TOMMY

SALES:
ROWID
Empid
Sales_Amt
AAAAA7
9827
6000
AAAAA8
2389
8500
AAAAA9
5642
4000

Fire this query against these tables where the requirement is to display the employee’s name & sales_amt (if they have done) of all the employess available in EMP table,
Select a.empname, b.sales_amt
from emp a, sales b
where a.empid = b.empid(+);

since outer-join “(+)” operator is used , hash join will be used to join these 2 tables.

As per this query, emp is the parent table and bonus is the child table since the latter[sales.empid(+)] is outer-joined with the former. So, oracle will consider emp as the driving able and bonus as the driven table though the latter is smaller than the former. That’s the reason cost is not the deciding factor and join condition is the deciding factor in hash outer join.

Assume, Oracle can hold 10 records at a time in the allocated memory space(defined by HASH_AREA_SIZE global parameter) in RAM.
For easy understanding, pls assume the memory address will be from 0 to 9 to hold the records in the hash table.
For easy understanding, pls assume MOD function to the base 10 will be used as hash function here. In real world, the hash function would be more complex and hard to decode it.

EMPTY HASH TABLE:
=================

Hash Key        Data
    

0

1

2

3

4

5

6

7

8

9



As per the design of hash outer join, the following steps will be followed while executing this query,
1.     Since SALES is outer-joined with EMP, EMP is considered as the driving table and SALES is considered as the driven table
2.     For the first record in EMP, hash function is applied and it returns 5 as the hashed value (i.e.,hash(3825)èmod(3825/10)è5)
3.     So, this record will be inserted as the 6th record in hash table
4.     For the next record in EMP, hash function is applied and it returns 7 as the hashed value (i.e.,hash(9827)èmod(9827/10)è7)
5.     So, this record will be inserted as the 8th record in hash table
6.     For the next record in EMP, hash function is applied and it returns 9 as the hashed value (i.e.,hash(2389)èmod(2389/10)è9)
7.     So, this record will be inserted as the 10th record in hash table
8.     For the next record in EMP, hash function is applied and it returns 4 as the hashed value (i.e.,hash(1784)èmod(1784/10)è4)
9.     So, this record will be inserted as the 5th record in hash table
10.  For the next record in EMP, hash function is applied and it returns 6 as the hashed value (i.e.,hash(4556)èmod(4556/10)è6)
11.  So, this record will be inserted as the 7th record in hash table
12.  For the next record in EMP, hash function is applied and it returns 1 as the hashed value (i.e.,hash(8711)èmod(8711/10)è1)
13.  So, this record will be inserted as the 2nd record in hash table. So, the hash table will look like this at the end of this step,

Hash Key                        Data
    






                       empid                 empname                
0


1
8711
TOMMY
2


3


4
1784
ERIN
5
3825
CHARLES
6
4556
ROONEY
7
9827
FERGUSON
8


9
2389
NADAL

14.  For the first record in SALES, hash function is applied and it returns 7 as the hashed value (i.e.,hash(9827)èmod(9827/10)è7)
15.  So, it hits 8th record of hash table
16.  Since there is a record from EMP is available in 8th record of hash table, this record will be flagged as “MATCHED” and data will be returned in the output
17.  For the second record in SALES, hash function is applied and it returns 9 as the hashed value (i.e.,hash(2389)èmod(2389/10)è9)
18.  So, it hits 10th record of hash table
19.  Since there is a record from EMP is available in 10th record of hash table, this record will be flagged as “MATCHED” and data will be returned in the output
20.  For the third(last) record in SALES, hash function is applied and it returns 2 as the hashed value (i.e.,hash(5642)èmod(5642/10)è2)
21.  So, it hits 3rd record of hash table
22.  Since there is no record from EMP is available in 3rd record of hash table, no data will be returned in the output

                        Hash Key                         Data
    






Rounded Rectangle: Empid:5642 from SALES hits here                                       empid                 empname
0

1
8711
TOMMY
2


3

4
1784
ERIN
5
3825
CHARLES
6
4556
ROONEY
7
9827
FERGUSON       MATCHED
8

9
2389
NADAL              MATCHED
Rounded Rectangle: Empid:2389 from SALES hits hereRounded Rectangle: Empid:9827 from SALES hits here

23.  All the records from hash table is being processed once more those are not flagged as “MATCHED”, all those records (empids:8711,1784,3825,4556) from the hash table will be returned along with NULL for the child table’s columns mentioned [sales_amt column in this case] in the query.
24.  It exits.

So, the output will look like this,
Empname
Sales_amt
FERGUSON
6000
NADAL
8500
TOMMY

ERIN

CHARLES

ROONEY


If you close look at the output, you can see that the output is displayed in such a way that, matching records are getting displayed at first and then followed by the unmatching records available only in EMP table.

Now, explain table will look like this,
--------------------------------------------------------------------------------------------------------
| Id  | Operation                                                     | Name                                             | Rows  |
--------------------------------------------------------------------------------------------------------
|   4 | SELECT STATEMENT                                   |                                                          |  6          |
|   3 |    HASH JOIN (OUTER)                               |                                                          |  6          |
|   1 |          TABLE ACCESS (FULL)                       | EMP                                               |  6          |
|   2 |          TABLE ACCESS (FULL)                       | SALES                                             |  3          |

From the explain plan, we can say that both the tables are joined by hash outer join methodology, EMP is considered as the driving table [hash table] and SALES is considered as the driven table [probe table].

Comments

Popular posts from this blog

B*Tree Reverse index

B*Tree Reverse index is one of the index types that is supported by oracle. It more or less behaves like b*tree unique & non-unique index but only difference is the way it constructs the tree. In this index, index value is stored in the reverse direction when compared to the values in the actual data table. This index is used in order to distribute the index data evenly across all the leaf index nodes if table data is skewed very high. Like b*tree index, this index doesn’t understand NULL. This B*Tree Reverse index can be created on a single column or more than one column. More than one B*Tree Reverse index can be created for the same data table. HOW TO VERIFY: How to verify whether the created index is reverse index or not? Fire this query, select index_type from user_indexes where index_name = 'DEPT_ID_INDX' ð   This query would return “ NORMAL/REV ” (it means it is B*Tree index but of “Reverse” type) LITTLE-KNOWN FACTS TO BE REMEMBERED: ·          If this B*Tree Reve

Nested Loop Outer Join

Before we go into this topic in detail, first we will understand what outer join is meant by. For an example, take 2 tables which have to be joined, say “Parent” table & “Child” table [I have taken these table names, only to discuss this topic in an easy way]. If Child table is outer joined with Parent table, then you can say, the matching records from both these tables will be retrieved and then the records which are available only in Parent table will also be retrieved though the corresponding matching records are not available in Child table. This is called outer join. [in natural or equi-join, only the matching records from both the tables will be retrieved.] To perform this outer join, 3 different join techniques have been provided by oracle. ·          Nested loop outer join ·          Hash outer join ·          Sort merge outer join In this, we will discuss about nested loop outer join. To perform this nested loop outer join, Oracle follows these steps (assume, child ta

Index Fast Full Scan

Index Fast Full Scan is one of the index access methods supported by Oracle. Index Fast Full Scan is otherwise called as Fast Full Index Scan also. Index Fast Full Scan can be imposed by Oracle Optimizer only in the certain scenario. The scenario where this index would be invoked is, Whenever the Oracle SQL optimizer detects that the query is serviceable without touching table rows, Oracle invokes this fast full index scan and quickly reads every block of the index without touching the table itself provided that query doesn’t contain any ORDER BY clause. The definition for this scan is more or less similar to the index full scan but the only difference is that former  will be invoked when ORDER BY clause is not mentioned in the query. It differs from the index full scan in another way where output won’t be in the sorted order since ORDER BY clause is not mentioned. If Oracle detects the query is serviceable without touching the table rows, it means, all the columns mentioned in th