FetchMode Join vs SubSelect

Question

I have two tables, Employee and Department. These are the entity classes for both of them.

Department.java

    @Entity
    @Table(name = "DEPARTMENT")
    public class Department {

        @Id
        @Column(name = "DEPARTMENT_ID")
        @GeneratedValue(strategy = GenerationType.AUTO)
        private Integer departmentId;

        @Column(name = "DEPARTMENT_NAME")
        private String departmentName;

        @Column(name = "LOCATION")
        private String location;

        @OneToMany(cascade = CascadeType.ALL, mappedBy = "department", orphanRemoval = true)
        @Fetch(FetchMode.SUBSELECT)
        //@Fetch(FetchMode.JOIN)
        private List<Employee> employees = new ArrayList<>();
    }

Employee.java

    @Entity
    @Table(name = "EMPLOYEE")
    public class Employee {

        @Id
        @SequenceGenerator(name = "emp_seq", sequenceName = "seq_employee")
        @GeneratedValue(generator = "emp_seq")
        @Column(name = "EMPLOYEE_ID")
        private Integer employeeId;

        @Column(name = "EMPLOYEE_NAME")
        private String employeeName;

        @ManyToOne
        @JoinColumn(name = "DEPARTMENT_ID")
        private Department department;
    }

These are the queries that were generated when I called em.find(Department.class, 1);

- With FetchMode.JOIN:

  SELECT department0_.DEPARTMENT_ID AS DEPARTMENT_ID1_0_0_,
         department0_.DEPARTMENT_NAME AS DEPARTMENT_NAME2_0_0_,
         department0_.LOCATION AS LOCATION3_0_0_,
         employees1_.DEPARTMENT_ID AS DEPARTMENT_ID3_1_1_,
         employees1_.EMPLOYEE_ID AS EMPLOYEE_ID1_1_1_,
         employees1_.EMPLOYEE_ID AS EMPLOYEE_ID1_1_2_,
         employees1_.DEPARTMENT_ID AS DEPARTMENT_ID3_1_2_,
         employees1_.EMPLOYEE_NAME AS EMPLOYEE_NAME2_1_2_
  FROM DEPARTMENT department0_
  LEFT OUTER JOIN EMPLOYEE employees1_
         ON department0_.DEPARTMENT_ID = employees1_.DEPARTMENT_ID
  WHERE department0_.DEPARTMENT_ID = ?

- With FetchMode.SUBSELECT:

  SELECT department0_.DEPARTMENT_ID AS DEPARTMENT_ID1_0_0_,
         department0_.DEPARTMENT_NAME AS DEPARTMENT_NAME2_0_0_,
         department0_.LOCATION AS LOCATION3_0_0_
  FROM DEPARTMENT department0_
  WHERE department0_.DEPARTMENT_ID = ?

  SELECT employees0_.DEPARTMENT_ID AS DEPARTMENT_ID3_1_0_,
         employees0_.EMPLOYEE_ID AS EMPLOYEE_ID1_1_0_,
         employees0_.EMPLOYEE_ID AS EMPLOYEE_ID1_1_1_,
         employees0_.DEPARTMENT_ID AS DEPARTMENT_ID3_1_1_,
         employees0_.EMPLOYEE_NAME AS EMPLOYEE_NAME2_1_1_
  FROM EMPLOYEE employees0_
  WHERE employees0_.DEPARTMENT_ID = ?

I just want to know which one is preferable, FetchMode.JOIN or FetchMode.SUBSELECT, and which one to choose in which scenario.

join hibernate jpa sql-subselect

eatSleepCode Oct 7 '15 at 5:59

4 answers

gabrielgiussi · Answer 1 · 2016-05-02T13:08:01+0000

The subquery strategy that Marmite refers to corresponds to FetchMode.SELECT, not SUBSELECT.

The console output you posted for FetchMode.SUBSELECT is curious, because that is not how it is supposed to work.

FetchMode.SUBSELECT

uses a subselect query to load the additional collections

Hibernate docs:

If one lazy collection or single-valued proxy has to be fetched, Hibernate will load all of them, re-running the original query in a subselect. This works in the same way as batch-fetching but without the piecemeal loading.

FetchMode.SUBSELECT should look something like this:

 SELECT <employees columns> FROM EMPLOYEE employees0_ WHERE employees0_.DEPARTMENT_ID IN (SELECT department0_.DEPARTMENT_ID FROM DEPARTMENT department0_)

You can see that this second query brings into memory all employees that belong to some department (i.e. employee.department_id is not null), regardless of whether it is the department you retrieved in your first query. This is potentially a serious problem if the employee table is large, because it may accidentally load an entire table into memory.

However, FetchMode.SUBSELECT significantly reduces the number of queries, since it takes only two queries compared with the N + 1 queries of FetchMode.SELECT.

You may be thinking that FetchMode.JOIN issues even fewer queries, just one, so why use SUBSELECT at all? Well, that is true, but at the cost of duplicated data and a heavier response.

If a single-valued proxy has to be fetched with JOIN, the query may return:

 +---------------+---------+-----------+
 | DEPARTMENT_ID | BOSS_ID | BOSS_NAME |
 +---------------+---------+-----------+
 |             1 |       1 | GABRIEL   |
 |             2 |       1 | GABRIEL   |
 |             3 |       2 | ALEJANDRO |
 +---------------+---------+-----------+

The boss's data is duplicated if he manages more than one department, and that has a cost in bandwidth.

If a lazy collection has to be fetched with JOIN, the query may return:

 +---------------+-----------------+---------------+
 | DEPARTMENT_ID | DEPARTMENT_NAME | EMPLOYEE_NAME |
 +---------------+-----------------+---------------+
 |             1 | Sales           | GABRIEL       |
 |             1 | Sales           | ALEJANDRO     |
 |             2 | RRHH            | DANILO        |
 +---------------+-----------------+---------------+

The department data is duplicated if the department has more than one employee (the natural case). We not only pay the bandwidth cost, we also get duplicated Department objects and must use a Set or DISTINCT_ROOT_ENTITY to remove the duplicates.
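To make the de-duplication step concrete, here is a minimal plain-Java sketch that collapses the three joined rows shown above into one entry per department. It does not use Hibernate; the class name JoinRowDeduplication and the row layout are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: collapse the flat rows of a JOIN result
// into one entry per department, keeping first-seen order.
class JoinRowDeduplication {

    // Each row is {departmentId, departmentName, employeeName}.
    static Map<String, List<String>> groupByDepartment(List<String[]> rows) {
        Map<String, List<String>> employeesByDepartment = new LinkedHashMap<>();
        for (String[] row : rows) {
            // The department columns repeat on every row; use them as the key.
            String departmentKey = row[0] + ":" + row[1];
            employeesByDepartment
                    .computeIfAbsent(departmentKey, k -> new ArrayList<>())
                    .add(row[2]);
        }
        return employeesByDepartment;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"1", "Sales", "GABRIEL"});
        rows.add(new String[] {"1", "Sales", "ALEJANDRO"});
        rows.add(new String[] {"2", "RRHH", "DANILO"});
        // Three rows go in, two departments come out.
        System.out.println(groupByDepartment(rows));
    }
}
```

The department columns travel over the wire once per employee, and the grouping has to happen on the application side.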

However, duplicated data in exchange for lower latency is a good trade-off in many cases, as Markus Winand says:

An SQL join is still more efficient than the nested selects approach, even though it performs the same index lookups, because it avoids a lot of network communication. It is even faster if the total amount of transferred data is bigger because of the duplication of employee attributes for each sale. That is because of the two dimensions of performance: response time and throughput; in computer networks we call them latency and bandwidth. Bandwidth has only a minor impact on the response time, but latencies have a huge impact. That means that the number of database round trips is more important for the response time than the amount of data transferred.
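The latency argument can be sketched as a back-of-the-envelope formula: response time is roughly round trips times latency, plus bytes divided by bandwidth. The class below is a hypothetical model with made-up numbers (1 ms latency, about 100 MB/s bandwidth), not a measurement of any real system.

```java
// Hypothetical back-of-the-envelope model, not a benchmark:
// response time = round trips * latency + bytes / bandwidth.
class ResponseTimeModel {

    static double responseTimeMs(int roundTrips, double latencyMs,
                                 long bytes, double bytesPerMs) {
        return roundTrips * latencyMs + bytes / bytesPerMs;
    }

    public static void main(String[] args) {
        double latencyMs = 1.0;        // assumed network round-trip time
        double bytesPerMs = 100_000.0; // assumed bandwidth (about 100 MB/s)

        // One joined query that transfers twice the data...
        double join = responseTimeMs(1, latencyMs, 200_000, bytesPerMs);
        // ...versus 101 small queries (the N + 1 pattern) with half the data.
        double nested = responseTimeMs(101, latencyMs, 100_000, bytesPerMs);

        System.out.println("join:   " + join + " ms");   // 3.0 ms
        System.out.println("nested: " + nested + " ms"); // 102.0 ms
    }
}
```

Under these assumed numbers, doubling the payload costs about 1 ms while the extra 100 round trips cost about 100 ms. Real figures depend on the network and the database, so measure before deciding.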

So the main problem with SUBSELECT is that it is hard to control and may load a whole graph of entities into memory. With batch fetching you load the associated entities in a separate query, as with SUBSELECT (so you do not suffer from duplicates), but gradually, and, most importantly, you query only the related entities (so you do not risk loading a huge graph), because the IN clause is filtered by the ids retrieved by the outer query.

 Hibernate: select ... from mkyong.stock stock0_
 Hibernate: select ... from mkyong.stock_daily_record stockdaily0_ where stockdaily0_.STOCK_ID in ( ?, ?, ?, ?, ?, ?, ?, ?, ?, ? )

(It would be interesting to test whether batch fetching with a very large batch size behaves like SUBSELECT, but without loading the whole table.)
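The batching idea in the log above can be sketched in plain Java: split the ids of the already-loaded owners into chunks and issue one IN query per chunk. This illustrates the concept only; it is not Hibernate's actual batching algorithm, and the class name BatchFetchSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the batching idea, not Hibernate's actual code:
// split owner ids into fixed-size chunks and emit one IN query per chunk.
class BatchFetchSketch {

    static List<String> batchQueries(List<Integer> ownerIds, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> queries = new ArrayList<>();
        for (int start = 0; start < ownerIds.size(); start += batchSize) {
            int end = Math.min(start + batchSize, ownerIds.size());
            // One "?" placeholder per owner id in this chunk.
            String placeholders = String.join(", ", Collections.nCopies(end - start, "?"));
            queries.add("select ... from stock_daily_record where STOCK_ID in ("
                    + placeholders + ")");
        }
        return queries;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int id = 1; id <= 25; id++) {
            ids.add(id);
        }
        // 25 owners with a batch size of 10 give three queries (10, 10 and 5 ids).
        for (String query : batchQueries(ids, 10)) {
            System.out.println(query);
        }
    }
}
```

Because the IN list contains only ids of owners that were actually loaded, rows belonging to other owners are never fetched.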

A few posts showing the different fetching strategies and their SQL logs (very important):

Summary:

JOIN: avoids the main problem with N + 1 queries, but can duplicate data.
SUBSELECT: avoids N + 1 and does not duplicate data, but loads all objects of the associated type into memory.

Tables were built using ascii-tables.

Marmite bomber · Answer 2 · 2015-10-07T07:18:26+0000

I would say it depends ...

Suppose you have N employees in a department, the department holds D bytes of information, and the average employee takes E bytes. (The byte count is the sum of the attribute lengths plus some overhead.)

Using the join strategy, you execute 1 query and transfer N * (D + E) bytes.

Using the subquery strategy, you execute 1 + N queries, but transfer only D + N * E bytes.

Typically, N + 1 queries are a no-go if N is large, so JOIN is preferred.

But in practice you have to weigh the number of queries against the amount of data transferred.

Note that I am not considering other aspects, such as Hibernate caching.

An additional subtle aspect may be relevant if the employee table is large and partitioned: partition pruning on the index access then comes into consideration as well.
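The two formulas can be checked with a few lines of Java. The sizes used here (N = 100, D = 200, E = 50) are made-up inputs for illustration, and the class name FetchCostModel is hypothetical.

```java
// Sketch of the arithmetic above; the byte sizes are made-up inputs.
class FetchCostModel {

    // JOIN: one query, every row carries department plus employee columns.
    static long joinBytes(long n, long d, long e) {
        return n * (d + e);
    }

    // Separate selects: the department once, then the employees.
    static long selectBytes(long n, long d, long e) {
        return d + n * e;
    }

    public static void main(String[] args) {
        long n = 100; // employees in the department (assumed)
        long d = 200; // bytes per department row (assumed)
        long e = 50;  // bytes per employee row (assumed)

        System.out.println("join:   1 query, " + joinBytes(n, d, e) + " bytes");
        System.out.println("select: " + (1 + n) + " queries, "
                + selectBytes(n, d, e) + " bytes");
    }
}
```

With these inputs the join transfers 25000 bytes in 1 query, while the separate selects transfer 5200 bytes in 101 queries.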

michaeak · Answer 3 · 2017-08-02T07:03:55+0000

A customer of mine (financial services) had a similar problem and wanted to "get the data in a single query". I explained that it is better to use more than one query, for the following reasons:

With FetchMode.JOIN, the department is transferred from the database to the application once per employee, because the join multiplies each department row by its number of employees. If you have 10 departments with 100 employees each, each of these 10 departments is transferred 100 times within a single query; that is plain SQL. So each department is transferred 99 times more often than necessary, which causes data transfer overhead for the department.

With FetchMode.SUBSELECT, two queries are sent to the database. One fetches the 1000 employees, the other fetches the 10 departments. That sounds much more efficient to me. Of course, you would make sure that indexes are in place so that the data can be retrieved immediately.
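The numbers in this example (10 departments with 100 employees each) can be reproduced with a small calculation. The class below is a hypothetical sketch that counts rows only and assumes every department has the same number of employees, at least one.

```java
// Hypothetical row-count sketch for the 10 x 100 example above.
// Assumes every department has the same number (>= 1) of employees.
class DepartmentTransferCount {

    // Rows returned by a JOIN of departments with their employees.
    static int joinRows(int departments, int employeesPerDepartment) {
        return departments * employeesPerDepartment;
    }

    // Copies of department data sent by the JOIN beyond the one copy needed.
    static int redundantDepartmentCopies(int departments, int employeesPerDepartment) {
        return departments * (employeesPerDepartment - 1);
    }

    // Rows returned by the two SUBSELECT queries together.
    static int subselectRows(int departments, int employeesPerDepartment) {
        return departments + departments * employeesPerDepartment;
    }

    public static void main(String[] args) {
        System.out.println("JOIN rows: " + joinRows(10, 100));
        System.out.println("redundant department copies: "
                + redundantDepartmentCopies(10, 100));
        System.out.println("SUBSELECT rows: " + subselectRows(10, 100));
    }
}
```

The JOIN returns 1000 wide rows containing 990 redundant copies of department data; SUBSELECT returns 1010 narrower rows with no repetition, at the price of a second round trip.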

I would prefer FetchMode.SUBSELECT.

It would be a different story if each department had only one employee, but, as the name "department" suggests, that is unlikely.

I suggest measuring access times to back up this theory. For my customer I measured the different types of access, and their "department" table had many more fields (I did not design it). So it soon became apparent that FetchMode.SUBSELECT was much faster.

gabrielgiussi · Answer 4 · 2017-09-03T23:53:02+0000

Planky said:

(1) This is grossly misleading. (2) The subselect will not load your entire database into memory. The linked article is about a quirk where subselect (3) ignores paging commands from the parent, (4) but it is still a subquery.

(1) After your comment I investigated FetchMode.SUBSELECT again and found that my answer is not entirely correct.
(2) That was a hypothetical situation in which hydrating each entity that is fully loaded into memory (Employee in this case) ends up hydrating many other entities. The real problem is loading the entire subselected table when that table contains thousands of rows (even if each of them does not eagerly fetch other entities from other tables).
(3) I do not know what you mean by paging commands from the parent.
(4) Yes, it is still a subquery, but I do not know what you are trying to point out with that.

The console output you posted for FetchMode.SUBSELECT is curious, because that is not how it is supposed to work.

This is true, but only when more than one Department entity has been hydrated (meaning there is more than one uninitialized employees collection). I tested it with 3.6.10.Final and 4.3.8.Final. In scenarios 2.2 (FetchMode.SUBSELECT hydrating 2 of 3 Departments) and 3.2 (FetchMode.SUBSELECT hydrating all Departments), SubselectFetch.toSubselectString returns the following (the references to Hibernate classes are taken from the 4.3.8.Final tag):

 select this_.DEPARTMENT_ID from SUBSELECT_DEPARTMENT this_

This subquery is then used to build the where clause in OneToManyJoinWalker.initStatementString, ending with

 employees0_.DEPARTMENT_ID in (select this_.DEPARTMENT_ID from SUBSELECT_DEPARTMENT this_)

Then the where clause is added in CollectionJoinWalker.whereString, ending with

 select employees0_.DEPARTMENT_ID as DEPARTMENT3_2_1_,
        employees0_.EMPLOYEE_ID as EMPLOYEE1_1_,
        employees0_.EMPLOYEE_ID as EMPLOYEE1_3_0_,
        employees0_.DEPARTMENT_ID as DEPARTMENT3_3_0_,
        employees0_.EMPLOYEE_NAME as EMPLOYEE2_3_0_
 from SUBSELECT_EMPLOYEE employees0_
 where employees0_.DEPARTMENT_ID in (select this_.DEPARTMENT_ID from SUBSELECT_DEPARTMENT this_)

With this query, all Employees are retrieved and hydrated in both cases. This is clearly a problem in scenario 2.2, because we hydrate only Departments 1 and 2 but also hydrate all Employees, even those that do not belong to these Departments (in this case, the Employees of Department 3).

If only one Department entity has been hydrated in the Session, the query is like the one eatSleepCode posted. Check scenario 1.2:

 select subselectd0_.department_id as departme1_2_0_,
        subselectd0_.department_name as departme2_2_0_,
        subselectd0_.location as location3_2_0_
 from subselect_department subselectd0_
 where subselectd0_.department_id=?

From FetchStyle:

    /**
     * Performs a separate SQL select to load the indicated data. This can either be eager (the second select is
     * issued immediately) or lazy (the second select is delayed until the data is needed).
     */
    SELECT,
    /**
     * Inherently an eager style of fetching. The data to be fetched is obtained as part of an SQL join.
     */
    JOIN,
    /**
     * Initializes a number of indicated data items (entities or collections) in a series of grouped sql selects
     * using an in-style sql restriction to define the batch size. Again, can be either eager or lazy.
     */
    BATCH,
    /**
     * Performs fetching of associated data (currently limited to only collections) based on the sql restriction
     * used to load the owner. Again, can be either eager or lazy.
     */
    SUBSELECT

~~So far, I have not been able to decide what this Javadoc means:~~

based on the sql restriction used to load the owner

UPDATE Planky said:

Instead, it will just load the table at worst, and even then only if your initial query does not have a where clause. So I would say that using subselect queries can unexpectedly load the whole table if you LIMIT the results and you have no WHERE criteria.

This is true, and it is a very important detail, which I tested in the new scenario 4.2.

The query generated to fetch the employees is

 select employees0_.department_id as departme3_4_1_,
        employees0_.employee_id as employee1_5_1_,
        employees0_.employee_id as employee1_5_0_,
        employees0_.department_id as departme3_5_0_,
        employees0_.employee_name as employee2_5_0_
 from subselect_employee employees0_
 where employees0_.department_id in
        (select this_.department_id from subselect_department this_ where this_.department_name>=?)

The subquery inside the where clause contains the original restriction this_.department_name>=?, which avoids loading all employees. This is what the Javadoc means by

based on the sql restriction used to load the owner
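The effect of the owner restriction can be sketched with plain string building. This is a hypothetical illustration of the shape of the generated SQL, not Hibernate's implementation; the class name SubselectQuerySketch and its method are made up, and the table names are the ones from the test scenarios above.

```java
// Hypothetical sketch, not Hibernate's implementation: the collection query
// reuses the owner query (restriction included) as an IN subquery.
class SubselectQuerySketch {

    static String employeesQuery(String ownerRestriction) {
        String ownerIds = "select this_.department_id from subselect_department this_";
        if (ownerRestriction != null && !ownerRestriction.isEmpty()) {
            // The restriction that loaded the owners also narrows the subquery.
            ownerIds += " where " + ownerRestriction;
        }
        return "select ... from subselect_employee employees0_"
                + " where employees0_.department_id in (" + ownerIds + ")";
    }

    public static void main(String[] args) {
        // Owner query with a restriction: only matching departments' employees load.
        System.out.println(employeesQuery("this_.department_name>=?"));
        // Owner query without a restriction: every employee with a department loads.
        System.out.println(employeesQuery(null));
    }
}
```

Without a restriction on the owner query, the subquery matches every department, which is the case where the whole employee table ends up in memory.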

Everything I said about FetchMode.JOIN and its differences from FetchMode.SUBSELECT remains true (and also applies to FetchMode.SELECT).
