I am trying to reconstruct a client's state at a specific point in time. Each client has many attributes that can change at any moment (for example, risk score, billing status, customer satisfaction).
Every time a client applies for a loan, I would like to see the values of all of these attributes at the time of submission. I then want to use those values to build a predictive model.
My first thought was to create a slowly changing Type 2 dimension with effective and expiration dates, and to use the half-open join condition time_effective <= date_of_application < time_expired.
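For concreteness, here is a minimal sketch of that half-open SCD Type 2 join, using an in-memory SQLite database. All table and column names (client_dim, loan_application, risk_score) are illustrative assumptions, not names from my actual schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE client_dim (
        client_id      INTEGER,
        risk_score     INTEGER,
        time_effective TEXT,   -- version valid from (inclusive)
        time_expired   TEXT    -- version valid until (exclusive)
    );
    CREATE TABLE loan_application (
        application_id      INTEGER,
        client_id           INTEGER,
        date_of_application TEXT
    );
    -- Two versions of client 1: the risk score changed on 2023-02-01.
    INSERT INTO client_dim VALUES
        (1, 420, '2023-01-01', '2023-02-01'),
        (1, 610, '2023-02-01', '9999-12-31');
    INSERT INTO loan_application VALUES (100, 1, '2023-01-15');
""")

# Half-open interval join: time_effective <= date_of_application < time_expired.
# ISO-8601 date strings compare correctly as text in SQLite.
row = con.execute("""
    SELECT a.application_id, d.risk_score
    FROM loan_application a
    JOIN client_dim d
      ON d.client_id = a.client_id
     AND d.time_effective <= a.date_of_application
     AND a.date_of_application < d.time_expired
""").fetchone()
print(row)  # (100, 420) -- the version in effect on 2023-01-15
```

The half-open interval guarantees that exactly one dimension row matches any given application date, with no gaps or overlaps at version boundaries.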
However, most of these attributes are behavioral and require complex calculations over historical data from the fact tables. Moreover, the calculated values cannot be banded into ranges (0 to 500, $500 to $750, etc.) either. Tracking all of these attributes in the dimension causes it to explode. Note: some values change daily, while others change at arbitrary points in time.
My ideal data extract would look like this:
- ID # for a loan application
- Submission timestamp
- Attribute 1 value at time of submission
- Attribute 2 value at time of submission
- ...
- Attribute N value at time of submission
In addition to loan applications, there are other fact tables for which I want to find the attribute values that were in effect at the time of each event.
What are the guidelines for handling this? I see several approaches:
- Let the dimension explode
- Create separate tables holding one or a few attributes each, and query only the tables containing the attributes of interest
- Add columns to the loan application fact table containing a snapshot of all the attributes of interest
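The third option above can be sketched as follows: the derived attributes are computed from historical fact rows at ETL load time and frozen into the fact row, so no point-in-time join is needed later. Every name here (payments, avg_payment_90d, snapshot_fact_row) is a hypothetical illustration, not part of my actual design.

```python
from datetime import date, timedelta

# Historical fact rows: (client_id, payment_date, amount).
payments = [
    (1, date(2023, 1, 2), 200.0),
    (1, date(2023, 1, 9), 150.0),
    (1, date(2022, 10, 1), 500.0),  # falls outside the 90-day window below
]

def avg_payment_90d(client_id, as_of):
    """Average payment over the 90 days before `as_of` (exclusive)."""
    window = [amt for cid, d, amt in payments
              if cid == client_id and as_of - timedelta(days=90) <= d < as_of]
    return sum(window) / len(window) if window else None

def snapshot_fact_row(application_id, client_id, submitted):
    # Compute each behavioral attribute as of the submission date and
    # store the result directly on the fact row at load time.
    return {
        "application_id": application_id,
        "submitted": submitted,
        "avg_payment_90d": avg_payment_90d(client_id, submitted),
    }

row = snapshot_fact_row(100, 1, date(2023, 1, 15))
print(row["avg_payment_90d"])  # 175.0 -- only the two January payments qualify
```

The trade-off is that the snapshot is computed once and cannot be cheaply recomputed if the attribute definition changes, but it makes the model-training extract a simple scan of the fact table.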
Some of these issues are discussed in Kimball's ETL Toolkit (pp. 190-192) and in his Data Warehouse Toolkit (pp. 187-191). Pages 154-157 discuss "rapidly changing monster dimensions", which seems very relevant. However, I am finding it difficult to apply these recommendations.