Avoid writing SQL queries in SSIS

While working on a data warehouse project, the consultant who ran our tutorial advised us to use SQL queries for a large share of the data flow transformations, arguing that doing them in SSIS would consume a lot of memory in the ETL layer, so we should prefer to leave the processing to the database layer. Is that really advisable? Where is the balance between using the GUI tools and just executing a bunch of SQL scripts from your integration package?

And to be honest, I would prefer to avoid writing SQL queries as much as possible. (But that is just my preference; I would really like to look at this objectively.)

+7
ssis data-warehouse
6 answers

Answer: it depends, but you want to choose one or the other for any given task and avoid mixing where possible.

As a rule, it is best either to do as much as possible inside the tool or to do as much as possible in stored procedure code. When you have a significant number of logical hand-offs between layers, the system becomes harder to trace and debug.

  • If the tool can perform the transformations without the data flows becoming unwieldy and confusing, use the tool and try to keep little or no logic in your queries. That way a single layer holds the business logic, and it is fairly obvious where to find it. However, ETL tools tend to cope relatively poorly with very complex transformations. The sweet spot for this approach is a system with a large number of data sources but relatively simple transformations.

  • If you have relatively complex transformations, you may be better off putting all of the business logic and transformation at the stored procedure level. SQL code handles complex transformations in a maintainable form better - I have it on fairly good authority that roughly half of all data warehouse projects in the banking and insurance sectors use this type of architecture for precisely this reason. In this case the ETL tool is used for relatively dumb data copies: the source data is copied into a staging area essentially verbatim and then picked up by the body of stored procedure code that does the actual transformation work. The ETL tool handles the data copies, bulk load operations, logging, scheduling, and other infrastructure tasks (see the sketch after this list).
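A minimal sketch of what that stored-procedure layer could look like, assuming hypothetical staging and warehouse tables (stg.Customer, dw.DimCustomer) that SSIS has already loaded verbatim; the names and cleansing rules are illustrative only:

    -- Hypothetical example: SSIS has bulk-copied source rows into stg.Customer
    -- verbatim; the business logic lives in one stored procedure in the database.
    CREATE PROCEDURE dw.LoadDimCustomer
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Insert customers that are not yet in the dimension, applying
        -- example cleansing rules on the way in.
        INSERT INTO dw.DimCustomer (CustomerKey, CustomerName, Country)
        SELECT s.CustomerKey,
               UPPER(LTRIM(RTRIM(s.CustomerName))),   -- trim and normalize the name
               COALESCE(s.Country, 'Unknown')          -- default a missing country
        FROM   stg.Customer AS s
        WHERE  NOT EXISTS (SELECT 1
                           FROM   dw.DimCustomer AS d
                           WHERE  d.CustomerKey = s.CustomerKey);
    END;

With this split, the SSIS package only needs a dumb copy into stg.Customer plus an Execute SQL Task that calls the procedure.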

In either case it is best to pick one approach and stick with it. Otherwise business logic ends up spread across the extraction layer, database views, data flows, and stored procedure code. Logic distributed across multiple layers is much harder to verify.

When all the logic is contained in (for example) stored procedures or focused ETL transformation jobs, you can unit test a transformation in isolation against a given test data set. Clarity of design also helps with maintenance and auditing.
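One hedged illustration of testing in isolation when the logic lives in a stored procedure: load a known staging row, run the procedure, and check the result. The table and procedure names are the hypothetical ones from the sketch above.

    -- Arrange: a known input row in the staging table.
    TRUNCATE TABLE stg.Customer;
    INSERT INTO stg.Customer (CustomerKey, CustomerName, Country)
    VALUES (42, '  acme corp ', NULL);

    -- Act: run the transformation under test.
    EXEC dw.LoadDimCustomer;

    -- Assert: the cleansing rules were applied to the test row.
    IF NOT EXISTS (SELECT 1
                   FROM   dw.DimCustomer
                   WHERE  CustomerKey  = 42
                     AND  CustomerName = 'ACME CORP'
                     AND  Country      = 'Unknown')
        THROW 50000, 'LoadDimCustomer did not transform the test row as expected.', 1;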

+7

I find that using SQL code is not only faster to run, but also faster to develop and much easier to maintain.

+4

In general, if you need to process each row individually, use a data flow; otherwise it may be better to use a SQL command.

Personally, I would write SQL wherever I can. It's easier to optimize later and (usually) faster. Google will give more detailed answers.
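As a hypothetical illustration of "write SQL where you can" (not taken from the original answer): an aggregation the database handles in one set-based pass can be typed straight into the source's SQL command instead of being built with an Aggregate transformation in the data flow.

    -- Hypothetical source query for an OLE DB Source: let the database engine
    -- aggregate in one set-based pass rather than pushing every detail row
    -- through an Aggregate transformation in the data flow.
    SELECT   CustomerKey,
             SUM(Amount) AS TotalAmount,
             COUNT(*)    AS OrderCount
    FROM     dbo.SalesOrder
    GROUP BY CustomerKey;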

Another factor to consider is the provider you use for your connections.

You have to make the decision based on your needs. We use a Postgres DB, so we have to create a load of staging tables for some processes, which speeds everything up.

You should also take into account the boxes it runs on: if you have a big, powerful DB box and a small ETL box, there is little point doing the heavy lifting anywhere but the database.

If you do all your processing in the ETL layer, you will also be dragging a lot of data across the network.

Check out these links to get started:

ssistalk.com/category/ssis/ssis-advanced-techniques/

msdn.microsoft.com/en-us/library/ms141031.aspx

weblogs.sqlteam.com/jamesn/Default.aspx

+3

I think this is a difficult question, and an interesting one as well.

One of the reasons for using SSIS is to improve maintainability, IMHO. If you pack all the logic into SQL statements (and you can do it!), you usually defeat that reason for using SSIS. You can no longer see the data flow.

On the other hand, I feel that there are times when a well-placed SQL statement has its value. For example, when you read data from a table and you already know you only need the rows that satisfy condition X, I don't see a reason to read the whole table and then conditionally split most of it away in the next step (see the sketch below).
By the way, I don't know what this means in terms of performance. Is SSIS smart enough to see what is going on and turn "read whole table and conditionally split it" into "SELECT Y FROM table WHERE X" on the fly (or at build/deploy time)?
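A hypothetical sketch of the point above (table and column names are invented): push the predicate into the source query so only the needed rows ever enter the data flow, instead of reading everything and discarding most of it with a Conditional Split.

    -- Instead of:  SELECT * FROM dbo.Orders   followed by a Conditional Split
    -- on Status == "Open" inside the data flow, read only what is needed:
    SELECT OrderID, CustomerKey, OrderDate, Amount
    FROM   dbo.Orders
    WHERE  Status = 'Open';   -- "condition X" from the paragraph above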

The big question is where to draw the line. And that depends to some extent on the people working on your ETL process. If everyone who will ever support the process knows SQL well, you can afford more SQL in your ETL than if you have employees (or clients, or successors you care about) who are unlikely to understand what happens in all of your SQL, let alone change, improve, or add to it.

So I believe the bottom line is that neither extreme - avoiding SQL entirely or doing everything in SQL - is the answer. Try to make some simple rules that suit your requirements and that everyone can live with, and then follow them. That gives you the most value from using SSIS.

+1

SQL Server does some things well and other things not so well. I use SSIS to import data into or export it from SQL Server. Along the way, I use SSIS transformations where they make sense. SSIS can easily work on each row individually, which SQL Server does not do efficiently (cursors). Saying that you should not use transformations and data flows in the ETL layer because they are too expensive there sounds like "don't drive too fast because it makes the engine run." The whole point of ETL and SSIS is to take the processing that SQL Server does not do well and move it to the engine that does.
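As a hedged illustration of the row-by-row work meant here (table and column names are hypothetical), this is the kind of T-SQL cursor that SQL Server executes one row at a time - exactly the pattern a data flow, or a single set-based statement, usually handles better:

    -- Row-by-row processing with a cursor: easy to write, but the loop body
    -- runs once per row, which is usually the slow way in SQL Server.
    DECLARE @OrderID int, @Amount decimal(18, 2);

    DECLARE order_cursor CURSOR FAST_FORWARD FOR
        SELECT OrderID, Amount FROM dbo.Orders;

    OPEN order_cursor;
    FETCH NEXT FROM order_cursor INTO @OrderID, @Amount;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        UPDATE dbo.OrderSummary          -- hypothetical summary table
        SET    TotalAmount = TotalAmount + @Amount
        WHERE  OrderID = @OrderID;

        FETCH NEXT FROM order_cursor INTO @OrderID, @Amount;
    END;

    CLOSE order_cursor;
    DEALLOCATE order_cursor;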

+1

You have to use the right tool for the job. Typically you do most things in SSIS, with some things done in plain SQL.

For example, in cases where you do a lot of UPDATEs (diffing incoming rows against a dimension table in a dimensional model, say), you really do not want to perform an UPDATE for each row. In that case you do a regular insert into a temporary table and then perform the UPDATE in SQL, joining on the appropriate keys (sketched below).
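A hedged sketch of that pattern with hypothetical table names: the data flow bulk-inserts the changed rows into a work table, and a single set-based UPDATE then applies them all at once.

    -- Step 1 (done by the data flow): bulk-insert the changed rows into a work
    -- table such as stg.CustomerChanges, instead of firing a command per row.

    -- Step 2 (an Execute SQL Task): one set-based UPDATE joined on the business key.
    UPDATE d
    SET    d.CustomerName = c.CustomerName,
           d.Country      = c.Country
    FROM   dw.DimCustomer      AS d
    JOIN   stg.CustomerChanges AS c
           ON c.CustomerKey = d.CustomerKey;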

+1
