How should you separate dimension tables from fact tables if you are not creating a data warehouse?

Question

I understand that calling these dimension and fact tables is not entirely appropriate. I am at a loss for better terminology, so please excuse the categorization I use in this post.

I am creating an employee record keeping application.

The database will contain information about organizations. That information is mainly defined in three tables: “Locations”, “Divisions” and “Departments”. There are other tables with similar issues, however. First, I need to store the available values for these tables. This lets the application offer those values when managing an employee, and lets the values themselves be managed when departments are added or removed, etc. For example, a Locations table might look like this:

 LocationId | LocationName | LocationStatus
 1          | New York     | Active
 2          | Denver       | Inactive
 3          | New Orleans  | Active

Then I need to store these values for each employee and keep their history. My first thought was to create LocationHistory, DivisionHistory, and DepartmentHistory tables. I can’t pinpoint why, but that felt like a bad design to me. My next inclination was to create a set of tables DimLocation / FactLocation, DimDivision / FactDivision, DimDepartment / FactDepartment. I do not believe that makes sense either. I have also considered naming them as combinations with Employee, i.e. EmployeeLocations, EmployeeDivisions, etc. Regardless of the naming convention for these tables, I believe the data will look like the simplified version below:

 EmployeeId | LocationId | EffectiveDate | EndDate
 1          | 3          | 2008-07-01    | NULL
 1          | 2          | 2007-04-01    | 2008-06-30

I realize that any of the hypothetical solutions described above could work, but I am really looking to create a design that is easy for others to maintain, with an intuitive, familiar structure. I would appreciate the community's help, opinions and experience on this, and I welcome any suggestions. For example, should I store the available values for these three tables in the database at all? Should they be maintained in the application code / business logic layer instead? Am I just getting hung up on naming three tables? Should the history really be repeated three times?

Thanks!

sql sql-server sql-server-2008 naming-conventions database-design

K Richard Aug 16 '11 at 18:48

2 answers

If you want to be efficient and keep history, do this. There are several solutions to this problem, but this is the one I keep coming back to:

1. Remember that each row represents a single entity. Making corrections to that entity is fine, but do not reuse its identifier for a new location. Instead of deleting a location, mark it with a deleted bit and hide it from the interface, so that it is still there when it is referenced historically.
2. Create a history table that also contains the current value, or has no rows if no value is currently set. Give it foreign keys to the employee and to the location.
3. Create a column in the employee table that points to the currently active row in the history table. When you need an employee's current location, join to the history table on that identifier. When you need the employee's whole history, join to the history table on the employee.

This structure keeps everything normalized and gives you an easy way to find the current value without having to compare dates.
As for using the word "history", think of it in different terms: since the table contains the current item as well as the historical ones, it is really just a join table that keeps the old items around. As such, you could call it something like EmployeeLocations.
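
A minimal sketch of the structure described in this answer, offered as an illustration rather than a definitive design. It assumes Python's built-in sqlite3 module with an in-memory database as a stand-in for SQL Server, and the names Location, Employee, EmployeeLocation, IsDeleted and CurrentEmployeeLocationId are hypothetical, as is the sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Location (
    LocationId   INTEGER PRIMARY KEY,
    LocationName TEXT NOT NULL,
    IsDeleted    INTEGER NOT NULL DEFAULT 0  -- soft delete: hide the row, never reuse the id
);
CREATE TABLE Employee (
    EmployeeId   INTEGER PRIMARY KEY,
    EmployeeName TEXT NOT NULL,
    CurrentEmployeeLocationId INTEGER        -- points at the current EmployeeLocation row
);
CREATE TABLE EmployeeLocation (
    EmployeeLocationId INTEGER PRIMARY KEY,
    EmployeeId    INTEGER NOT NULL REFERENCES Employee(EmployeeId),
    LocationId    INTEGER NOT NULL REFERENCES Location(LocationId),
    EffectiveDate TEXT NOT NULL
);
INSERT INTO Location VALUES (1, 'New York', 0), (2, 'Denver', 1), (3, 'New Orleans', 0);
INSERT INTO Employee VALUES (1, 'Example Employee', NULL);
INSERT INTO EmployeeLocation VALUES (10, 1, 2, '2007-04-01'), (11, 1, 3, '2008-07-01');
UPDATE Employee SET CurrentEmployeeLocationId = 11 WHERE EmployeeId = 1;
""")

# Current location: follow the pointer column, no date comparison needed.
current = conn.execute("""
    SELECT l.LocationName
    FROM Employee e
    JOIN EmployeeLocation el ON el.EmployeeLocationId = e.CurrentEmployeeLocationId
    JOIN Location l ON l.LocationId = el.LocationId
    WHERE e.EmployeeId = 1
""").fetchone()[0]

# Full history: join on the employee instead. The soft-deleted location still resolves.
history = conn.execute("""
    SELECT el.EffectiveDate, l.LocationName
    FROM EmployeeLocation el
    JOIN Location l ON l.LocationId = el.LocationId
    WHERE el.EmployeeId = 1
    ORDER BY el.EffectiveDate
""").fetchall()

print(current)
print(history)
```

In this sketch the Denver row stays in the Location table even though it is flagged as deleted, which is what keeps the older history row meaningful.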

Lucent fox Aug 16 '11 at 19:08

MatBailie · Accepted Answer · 2011-08-16T19:33:25+0000

Firstly, I see no problem with describing these as dimension and fact tables outside of a warehouse :)

In terms of conceptualizing and understanding the relationships, I personally find that start / end dates are very easy for people to understand. That means having tables for agents and locations, and then time-dependent mapping tables such as Agent_At_Location, etc. They do, however, come with issues that deserve attention.

1. If EndDate is 2008-08-30, was the employee at that location up to but NOT including 30 August, or up to AND including 30 August?
2. Working with overlapping date periods can produce messy queries and, more importantly, slow queries.

The first seems to be just a matter of convention, but it can have real consequences when working with other data. For example, suppose an EndDate of 2008-08-30 means that they are at this location up to AND including 30 August. Then you join to the agent's daily data for that day (when they actually arrived at work, took breaks, etc.). You need to join ON AgentDailyData.EventTimeStamp < '2008-08-30' + 1 in order to include all the events that happened on that day.

This is because EventTimeStamp data is not measured in days, but probably minutes or seconds.

If you instead treat an EndDate of '2008-08-30' as meaning that the agent was at this location UP TO but NOT INCLUDING 30 August, the + 1 is not needed in the join. In fact, you do not need to know whether the value is a whole day or includes a time component. You just need TimeStamp < EndDate.

With EXCLUSIVE end markers, all your queries become simpler and never need + 1 day or + 1 hour to deal with boundary conditions.
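
The exclusive-end convention can be sketched as follows. This is only an illustration: it assumes Python's built-in sqlite3 module as a stand-in for SQL Server, the columns AgentId, StartDate and EndDate and all sample rows are made up, and it relies on ISO-8601 text values comparing in chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Agent_At_Location (
    AgentId    INTEGER NOT NULL,
    LocationId INTEGER NOT NULL,
    StartDate  TEXT NOT NULL,  -- inclusive
    EndDate    TEXT NOT NULL   -- exclusive: up to but NOT including this date
);
CREATE TABLE AgentDailyData (
    AgentId        INTEGER NOT NULL,
    EventTimeStamp TEXT NOT NULL
);
-- The agent moves from location 2 to location 3 at the start of 30 August.
INSERT INTO Agent_At_Location VALUES
    (1, 2, '2008-07-01', '2008-08-30'),
    (1, 3, '2008-08-30', '2008-10-01');
INSERT INTO AgentDailyData VALUES
    (1, '2008-08-29 23:59:59'),
    (1, '2008-08-30 00:00:00'),
    (1, '2008-08-30 17:30:00');
""")

# Half-open interval test: StartDate <= timestamp < EndDate, with no "+ 1" adjustment.
rows = conn.execute("""
    SELECT d.EventTimeStamp, a.LocationId
    FROM AgentDailyData d
    JOIN Agent_At_Location a
      ON  a.AgentId = d.AgentId
      AND d.EventTimeStamp >= a.StartDate
      AND d.EventTimeStamp <  a.EndDate
    ORDER BY d.EventTimeStamp
""").fetchall()

for row in rows:
    print(row)
```

Because one period's EndDate equals the next period's StartDate, every event here falls into exactly one period, whether the timestamps are whole days or carry a time component.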

The second issue is much harder to resolve. The simplest way to find the overlapping period is as follows:

 SELECT
   CASE WHEN TableA.InclusiveFrom > TableB.InclusiveFrom THEN TableA.InclusiveFrom ELSE TableB.InclusiveFrom END AS [NetInclusiveFrom],
   CASE WHEN TableA.ExclusiveFrom < TableB.ExclusiveFrom THEN TableA.ExclusiveFrom ELSE TableB.ExclusiveFrom END AS [NetExclusiveFrom]
 FROM
   TableA
 INNER JOIN
   TableB
     ON  TableA.InclusiveFrom < TableB.ExclusiveFrom
     AND TableA.ExclusiveFrom > TableB.InclusiveFrom
 -- Where InclusiveFrom is the StartDate
 -- And ExclusiveFrom is the EndDate, up to but NOT including that date

The problem with this query is indexing. The first condition, TableA.InclusiveFrom < TableB.ExclusiveFrom, can be resolved using an index, but it may match a massive range of dates. Then, for each of those records, ExclusiveFrom could be almost anything, and certainly not in an order that would help resolve TableA.ExclusiveFrom > TableB.InclusiveFrom quickly.

The solution I have used for this before is to enforce a maximum allowed gap between InclusiveFrom and ExclusiveFrom. This allows something like ...

     ON  TableA.InclusiveFrom <  TableB.ExclusiveFrom
     AND TableA.InclusiveFrom >= TableB.InclusiveFrom - 30
     AND TableA.ExclusiveFrom >  TableB.InclusiveFrom

The condition TableA.ExclusiveFrom > TableB.InclusiveFrom STILL cannot use an index. Instead, we have limited the number of rows that the search on TableA.InclusiveFrom can return. No matching row can start more than 30 days before TableB.InclusiveFrom, because we know the duration is capped at 30 days.
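
A small sketch of the bounded overlap join, under stated assumptions: Python's built-in sqlite3 module stands in for SQL Server, SQLite's date(x, '-30 days') replaces the "- 30" date arithmetic, and the Id columns, the index name and the sample rows are hypothetical. Every sample period is at most 30 days long, which is the precondition the extra predicate depends on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TableA (Id INTEGER PRIMARY KEY, InclusiveFrom TEXT, ExclusiveFrom TEXT);
CREATE TABLE TableB (Id INTEGER PRIMARY KEY, InclusiveFrom TEXT, ExclusiveFrom TEXT);
CREATE INDEX IX_TableA_InclusiveFrom ON TableA (InclusiveFrom);
INSERT INTO TableA VALUES
    (1, '2007-01-01', '2007-01-20'),
    (2, '2007-03-01', '2007-03-31'),
    (3, '2007-04-10', '2007-05-05'),
    (4, '2007-06-01', '2007-06-15');
INSERT INTO TableB VALUES
    (1, '2007-03-15', '2007-04-14');
""")

# Plain overlap test: only the upper bound on InclusiveFrom limits the search.
plain = conn.execute("""
    SELECT a.Id, b.Id
    FROM TableA a
    JOIN TableB b
      ON  a.InclusiveFrom < b.ExclusiveFrom
      AND a.ExclusiveFrom > b.InclusiveFrom
    ORDER BY a.Id, b.Id
""").fetchall()

# Bounded overlap test: InclusiveFrom now has a lower bound too, which is
# valid only because no period is longer than 30 days.
bounded = conn.execute("""
    SELECT a.Id, b.Id
    FROM TableA a
    JOIN TableB b
      ON  a.InclusiveFrom <  b.ExclusiveFrom
      AND a.InclusiveFrom >= date(b.InclusiveFrom, '-30 days')
      AND a.ExclusiveFrom >  b.InclusiveFrom
    ORDER BY a.Id, b.Id
""").fetchall()

print(plain)
print(bounded)
```

Both queries return the same pairs here. The difference is that in the bounded version the January row is excluded by the range on InclusiveFrom rather than by the condition that cannot use the index.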

An example of this is splitting the associations at calendar month boundaries (a maximum duration of 31 days).

 EmployeeId | LocationId | EffectiveDate | EndDate
 1          | 2          | 2007-04-01    | 2007-05-01
 1          | 2          | 2007-05-01    | 2007-06-01
 1          | 2          | 2007-06-01    | 2007-06-25

(Representing Employee 1 being in Location 2 from 1st April to (but not including) 25th June.)
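
One possible way to produce such month-bounded rows is sketched below in Python. The helper name split_by_month is hypothetical, and the sketch assumes an inclusive start and an exclusive end, as recommended earlier in this answer.

```python
from datetime import date


def split_by_month(start, end_exclusive):
    """Split [start, end_exclusive) into pieces that never cross a month boundary."""
    pieces = []
    cursor = start
    while cursor < end_exclusive:
        # First day of the month following `cursor`.
        if cursor.month == 12:
            next_month = date(cursor.year + 1, 1, 1)
        else:
            next_month = date(cursor.year, cursor.month + 1, 1)
        piece_end = min(next_month, end_exclusive)
        pieces.append((cursor, piece_end))
        cursor = piece_end
    return pieces


# Employee 1 at location 2 from 1 April 2007 up to but not including 25 June 2007.
pieces = split_by_month(date(2007, 4, 1), date(2007, 6, 25))
for piece_start, piece_end in pieces:
    print(piece_start.isoformat(), piece_end.isoformat())
```

Each piece ends where the next one begins, so the exclusive-end convention carries over to the split rows without any gaps or overlaps.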

This is effectively a trade-off: using more disk space to gain performance.

I have even seen this taken to the extreme: not storing date ranges at all, but storing the actual mapping for every individual day. Essentially, it is like restricting the maximum duration to 1 day ...

 EmployeeId | LocationId | EffectiveDate
 1          | 2          | 2007-06-23
 1          | 2          | 2007-06-24
 1          | 3          | 2007-06-25
 1          | 3          | 2007-06-26
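
A sketch of how queries look against a one-row-per-day mapping, again assuming Python's built-in sqlite3 module as a stand-in for SQL Server. The table names EmployeeLocationDay and EmployeeEvent and the event rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EmployeeLocationDay (
    EmployeeId    INTEGER NOT NULL,
    LocationId    INTEGER NOT NULL,
    EffectiveDate TEXT NOT NULL,
    PRIMARY KEY (EmployeeId, EffectiveDate)  -- one location per employee per day
);
CREATE TABLE EmployeeEvent (
    EmployeeId     INTEGER NOT NULL,
    EventTimeStamp TEXT NOT NULL
);
INSERT INTO EmployeeLocationDay VALUES
    (1, 2, '2007-06-23'),
    (1, 2, '2007-06-24'),
    (1, 3, '2007-06-25'),
    (1, 3, '2007-06-26');
INSERT INTO EmployeeEvent VALUES
    (1, '2007-06-24 09:00:00'),
    (1, '2007-06-25 08:45:00');
""")

# The range comparison becomes a plain equality join on the day.
event_locations = conn.execute("""
    SELECT e.EventTimeStamp, m.LocationId
    FROM EmployeeEvent e
    JOIN EmployeeLocationDay m
      ON  m.EmployeeId    = e.EmployeeId
      AND m.EffectiveDate = date(e.EventTimeStamp)
    ORDER BY e.EventTimeStamp
""").fetchall()

# Days spent at each location is a simple count.
days_per_location = conn.execute("""
    SELECT LocationId, COUNT(*)
    FROM EmployeeLocationDay
    WHERE EmployeeId = 1
    GROUP BY LocationId
    ORDER BY LocationId
""").fetchall()

print(event_locations)
print(days_per_location)
```

With this layout there are no boundary conditions or overlap tests left to get wrong, at the cost of more rows to store and maintain.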

Instinctively, I initially rebelled against this. But in the subsequent ETL, warehousing, reporting, etc. I found it very powerful, adaptable and maintainable. I actually saw people make fewer coding mistakes and write code in less time, the code ran faster, and it was much better able to adapt to the clients' changing needs.

The only two downsides were:
1. More disk space (but trivial compared to the size of the fact table)
2. Inserts and updates for this mapping were slower

The slowdown for inserts and updates only really mattered once, when this model was used to represent a constantly changing process network, where the application wanted to change the mapping about 30 times per second. Even then it worked; it just used more CPU time than was ideal.
