Skip to main content

Hash Natural Join

To perform hash join, Oracle follows these steps:
1.     Oracle chooses the smallest of two tables as the hash table (otherwise called as driving table). Oracle built the hash table in RAM after applying the hash function on the joining column(s) of the driving table.
2.     Oracle chooses the other [big] table as the probe table (otherwise called as driven table or probing table). It traverse through all the records of this probe table, applies the same hash function on the joining column(s) [column(s) used to join these two tables] and will hit the corresponding entry in the hash table.
3.     Oracle returns the output if a record from driving table is already present in the same hash key, else no record will be returned.

It may look like Nested loop join & Hash join have the same architecture since both these have the concept of driving & driven tables but they have entirely different design if you closely look at how the tables are being joined. In nested loop join, for each record in driving [outer] table, the corresponding records from the driven [inner] table will be joined and the output will be returned. But in the hash join, the driving [hash] table will be built at first in RAM by applying the hash function, then the driven [probe] table will be traversed next. While traversing each record in the probe table, Oracle applies the hash function on the joining column(s) and then hit the corresponding entry in the hash table. If a record from the driving table is already there in the same hash key, then the output will be returned.

A brief explanation on what is meant by hash table, hash function, hash value & hash key is mentioned below.

Hash Table:
=========

Hash Key                        Data

    

1
<<actual data of a record whose joining columns’ hash value is equal to 1>>
2
<<actual data of a record whose joining columns’ hash value is equal to 2>>
3
<<actual data of a record whose joining columns’ hash value is equal to 3>>
..
..
..
..
N
<<actual data of a record whose joining columns’ hash value is equal to n>>


Hash table is one of the common data structures. Hash table is nothing but an array holds the data. In hash table, each record is uniquely identified by hash key which is nothing but the memory address [in ORACLE]. Hash function accepts an input value and returns an output value which is otherwise called as hash value (or hashed value).

From the above diagram, you can see that each record in hash table is uniquely identified by hash key(like 1,2,3,…,n).

In general, hash function is a set of formulas that accepts a number as an input and returns a number as an output [from mathematical perspective] where in output number is shorter and more compact than input number.

From Oracle perspective, hash function is a set of formulas that is applied to each value of joining column(s) of driving table, to get the address of RAM in which the row should be stored. Same hash function will be applied on each value of joining column(s) of driven table, to get the address of RAM in order to locate the data from the driving table if exists.

So, Oracle applies hash function on the joining column’s value to get the output which is otherwise called as hashed value. In Oracle, this returned hashed value is always been an address in RAM.

The logical activity diagram for this methodology will be like this,

Read all the required data from the driving table and built the hash table in RAM after applying hash function
Loop (for all the records in the driven table)
·         Apply the hash function for each record to get the address [hashed value] in RAM
·         Hit RAM in the corresponding address
·         If a record from the driving table is already present in the same hash key, then return the output else, don’t return the output
End loop

LITTLE-KNOWN FACTS TO BE REMEMBERED:
·         /*+ USE_HASH(<<driving table>>) */ is the hint that can be used to impose this hash natural join.
·         This methodology is opted by Oracle only if both the tables are joined by equal (=) operator.
·         In RAM, hash join will be carried out only in the allocated memory space which is nothing but HASH_AREA_SIZE component of PGA.
·         Only driving table will be inserted in the hash table in RAM and the driven table will never be written into the hash table in RAM. Hash function will be applied in the driven table only to locate the corresponding memory entry in the hash table to look for the matching entry from the driving table there.
·         This is very successful when one table is smaller and another table is bigger. In this case, it outplays both nested loop and sort merge joins.
·         If the driving table cannot be hashed in RAM in single pass (i.e., at one shot or one go), then a portion of the hash-table spills to disk(actually, TEMP tablespace will be used here to hold that spilled dataset from the driving table). When the hash table is probed by the driven table, the rows with join keys that match those parts of the in-memory hash table are joined immediately; the rest [of the driven table] are also written into TEMP tablespace and joined in the second pass. The bigger driving table is, the smaller the proportion of the hash table that can fit in RAM, the remaining data will be spilled into TEMP tablespace and have to go through the subsequent passes till it takes care of all the records spilled into TEMP tablespace from both these driving and driven table. This slows the Hash Join down considerably and also makes the join non-scalable in this scenario.
ADVANTAGE:
·         If the driving table is small enough to fit in RAM and equal (=) operator is used in the query, then hash join outplays both nested loop and sort merge joins.
·         Still nested loop will be the fastest way to retrieve the first matching record [since hash join takes some time to build the hash table in RAM for the driving table as it applies hash function] but hash join will be the fastest way to get all the matching records when compared to both nested loop and sort merge join. Hash function (in hash join) will perform more efficient and better than merging activity (in sort merge join).

DISADVANTAGE:
·         When this is opted for joining 2 big tables, TEMP tablespace will be used extensively as the spilled dataset from both driving and driven tables will be written into TEMP tablespace
·         The above scenario makes hash join as more non-scalable as it has to go through subsequent passes to join these spilled dataset (available in TEMP tablespace) from both these tables

HOW TO VERIFY:
How to verify whether Oracle follows hash natural join or not while executing the sql query. If a query follows this, then you will find similar execution plan like this,

--------------------------------------------------------------------------------------------------------
| Id  | Operation                                                     | Name                                             | Rows  |
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                                   |                                                          |            |
|   1 |    HASH JOIN                                                  |                                                          |            |
|   2 |       TABLE ACCESS (FULL)                          | <<small table’s name>>         |            |
|   3 |       TABLE ACCESS (FULL)                          | <<big table’s name>>             |            |

In the explain plan, whenever it chooses this methodology, it displays the keyword (HASH JOIN) in the operation column. Whichever the table name that is getting displayed immediately after this keyword is nothing but the driving table (otherwise called as hash table) and the next one is the driven table(otherwise called as probe table).

EXAMPLE:
Create 2 tables.
One table is store the information of the sales done by the employees for January Month and inserts 6 records.
Another table is store the information of the sales done by the employees for February Month and inserts 3 records.
The tables will look like this,

DATA TABLE:
EMP_JAN:
ROWID
Empid
empname
Sales_Amt
AAAAA1
3825
CHARLES
2500
AAAAA2
9827
FERGUSON
1000
AAAAA3
2389
NADAL
3000
AAAAA4
1784
ERIN
7000
AAAAA5
4556
ROONEY
5500
AAAAA6
8711
TOMMY
6500

EMP_FEB:
ROWID
Empid
empname
Sales_Amt
AAAAA7
9827
FERGUSON
6000
AAAAA8
2389
NADAL
8500
AAAAA9
5642
JOHN
4000

Fire this query against these tables where the requirement is to display all the employee name(s) along with their sales information whoever able to do the sales on both the months,
Select a.empname, a.sales_amt january_amt, b.sales_amt february_amt
from emp_jan a, emp_feb b
where a.empid = b.empid;

since one table is small and another table is big, equal (=) operator is used , hash join will be the best candidate for joining these 2 tables.

Assume, Oracle can hold 10 records at a time in the allocated memory space(defined by HASH_AREA_SIZE global parameter) in RAM.
For easy understanding, pls assume the memory address will be from 0 to 9 to hold the records in the hash table.
For easy understanding, pls assume MOD function for the base 10 will be used as hash function here. In real world, the hash function would be more complex and hard to decode it.

EMPTY HASH TABLE:
=================

Hash Key        Data
    

0

1

2

3

4

5

6

7

8

9



As per the design of hash join, the following steps will be followed while executing this query,
1.     Since EMP_FEB is the smallest of these two tables, EMP_FEB is considered as the driving table
2.     For the first record in EMP_FEB, hash function is applied and it returns 7 as the hashed value (i.e.,hash(9827)èmod(9827/10)è7)
3.     So, this record will be inserted as the 8th record in hash table
4.     For the second record in EMP_FEB, hash function is applied and it returns 9 as the hashed value (i.e.,hash(2389)èmod(2389/10)è9)
5.     So, this record will be inserted as the 10th record in hash table
6.     For the third record in EMP_FEB, hash function is applied and it returns 2 as the hashed value (i.e.,hash(5642)èmod(5642/10)è2)
7.     So, this record will be inserted as the 3rd record in hash table. So, the hash table will look like this at the end of this step,

Hash Key                                        Data
    






                       empid                 empname                 sales_amt
0



1



2
5642
JOHN
4000
3



4



5



6



7
9827
FERGUSON
6000
8



9
2389
NADAL
8500

8.     Since EMP_JAN is the biggest of these two tables, EMP_JAN is considered as the driven table
9.     For the first record in EMP_JAN, hash function is applied and it returns 5 as the hashed value (i.e.,hash(3825)èmod(3825/10)è5)
10.  So, it hits 6th record of hash table
11.  Since there is no record from EMP_FEB is available in 6th record of hash table, no data will be returned in the output
12.  For the second record in EMP_JAN, hash function is applied and it returns 7 as the hashed value (i.e.,hash(9827)èmod(9827/10)è7)
13.  So, it hits 8th record of hash table
14.  Since there is a record from EMP_FEB is available in 8th record of hash table, data will be returned in the output
15.  For the third record in EMP_JAN, hash function is applied and it returns 9 as the hashed value (i.e.,hash(2389)èmod(2389/10)è9)
16.  So, it hits 10th record of hash table
17.  Since there is a record from EMP_FEB is available in 10th record of hash table, data will be returned in the output
18.  For the fourth record in EMP_JAN, hash function is applied and it returns 4 as the hashed value (i.e.,hash(1784)èmod(1784/10)è4)
19.  So, it hits 5th record of hash table
20.  Since there is no record from EMP_FEB is available in 5th record of hash table, no data will be returned in the output
21.  For the fifth record in EMP_JAN, hash function is applied and it returns 6 as the hashed value (i.e.,hash(4556)èmod(4556/10)è6)
22.  So, it hits 7th record of hash table
23.  Since there is no record from EMP_FEB is available in 7th record of hash table, no data will be returned in the output
24.  For the sixth(last) record in EMP_JAN, hash function is applied and it returns 1 as the hashed value (i.e.,hash(8711)èmod(8711/10)è1)
25.  So, it hits 2nd record of hash table
26.  Since there is no record from EMP_FEB is available in 2nd record of hash table, no data will be returned in the output







Rounded Rectangle: Empid:8711 from EMP_JAN hits here
Rounded Rectangle: Empid:3825 from EMP_JAN hits here
Rounded Rectangle: Empid:9827 from EMP_JAN hits here


Hash Key                                        Data
    







Rounded Rectangle: Empid:2389 from EMP_JAN hits here





                       empid                 empname                 sales_amt
0


1



2
5642
JOHN
4000
3



4



5



6



7
9827
FERGUSON
6000
8



9
2389
NADAL
8500

So, the output will look like this,
Empname
January_amt
February_amt
FERGUSON
1000
6000
NADAL
3000
8500

If you close look at the output, you can see that the output is displayed in the traversing order of records from EMP_JAN because while probing the driven table, FERGUSON is processed first and then NADAL.

Now, explain table will look like this,
--------------------------------------------------------------------------------------------------------
| Id  | Operation                                                     | Name                                             | Rows  |
--------------------------------------------------------------------------------------------------------
|   4 | SELECT STATEMENT                                   |                                                          |  2          |
|   3 |    HASH JOIN                                                  |                                                          |  2          |
|   1 |          TABLE ACCESS (FULL)                       | EMP_FEB                                      |  3          |
|   2 |          TABLE ACCESS (FULL)                       | EMP_JAN                                     |  6          |

From the explain plan, we can say that both the tables are joined by hash natural join methodology, EMP_FEB is considered as the driving table [hash table] and EMP_JAN is considered as the driven table [probe table].

Comments

Popular posts from this blog

B*Tree Reverse index

B*Tree Reverse index is one of the index types that is supported by oracle. It more or less behaves like b*tree unique & non-unique index but only difference is the way it constructs the tree. In this index, index value is stored in the reverse direction when compared to the values in the actual data table. This index is used in order to distribute the index data evenly across all the leaf index nodes if table data is skewed very high. Like b*tree index, this index doesn’t understand NULL. This B*Tree Reverse index can be created on a single column or more than one column. More than one B*Tree Reverse index can be created for the same data table. HOW TO VERIFY: How to verify whether the created index is reverse index or not? Fire this query, select index_type from user_indexes where index_name = 'DEPT_ID_INDX' ð   This query would return “ NORMAL/REV ” (it means it is B*Tree index but of “Reverse” type) LITTLE-KNOWN FACTS TO BE REMEMBERED: ·          If this B*Tree Reve

Nested Loop Outer Join

Before we go into this topic in detail, first we will understand what outer join is meant by. For an example, take 2 tables which have to be joined, say “Parent” table & “Child” table [I have taken these table names, only to discuss this topic in an easy way]. If Child table is outer joined with Parent table, then you can say, the matching records from both these tables will be retrieved and then the records which are available only in Parent table will also be retrieved though the corresponding matching records are not available in Child table. This is called outer join. [in natural or equi-join, only the matching records from both the tables will be retrieved.] To perform this outer join, 3 different join techniques have been provided by oracle. ·          Nested loop outer join ·          Hash outer join ·          Sort merge outer join In this, we will discuss about nested loop outer join. To perform this nested loop outer join, Oracle follows these steps (assume, child ta

Index Fast Full Scan

Index Fast Full Scan is one of the index access methods supported by Oracle. Index Fast Full Scan is otherwise called as Fast Full Index Scan also. Index Fast Full Scan can be imposed by Oracle Optimizer only in the certain scenario. The scenario where this index would be invoked is, Whenever the Oracle SQL optimizer detects that the query is serviceable without touching table rows, Oracle invokes this fast full index scan and quickly reads every block of the index without touching the table itself provided that query doesn’t contain any ORDER BY clause. The definition for this scan is more or less similar to the index full scan but the only difference is that former  will be invoked when ORDER BY clause is not mentioned in the query. It differs from the index full scan in another way where output won’t be in the sorted order since ORDER BY clause is not mentioned. If Oracle detects the query is serviceable without touching the table rows, it means, all the columns mentioned in th