Excluding duplicate fields in a join

I have a data set on which I am analyzing. It turns out that it can be easily enriched with demographic and community data, which greatly improves analytical results.

To do this, I enter demographic and community data before doing the analysis. I need to exclude some fields from my base sample set, so my connection looks something like this:

select sampledata.c1, sampledata.c2, demographics.*, community.* from sample data join demographics using (zip) join community using (fips) 

This gives me a few zip or fips output columns that my analysis engine cannot handle. I cannot specify each field manually - enrichment tables lead to hundreds of columns at the end.

I could select *, but then I would have all the columns from my sample data that I do not want.

How can I join my enrichment data without duplicating fields, while maintaining the choice of columns that I want to get from the sample table?

I thought that if postgres (my database) could fully qualify each column in the output (e.g. sample.c1, demographics.c1, etc.), I would be very pleased with this.

+6
source share
1 answer

There is no column exclusion syntax in SQL; only the column inclusion syntax exists (via the * operator for all columns or explicitly specify column names).

Create a list of only the columns you need.

However, you can generate an SQL statement with your hundreds of column names, minus a few repeating columns that you don't need, using schema tables and some of the built-in functions of your database.

 SELECT 'SELECT sampledata.c1, sampledata.c2, ' || ARRAY_TO_STRING(ARRAY( SELECT 'demographics' || '.' || column_name FROM information_schema.columns WHERE table_name = 'demographics' AND column_name NOT IN ('zip') UNION ALL SELECT 'community' || '.' || column_name FROM information_schema.columns WHERE table_name = 'community' AND column_name NOT IN ('fips') ), ',') || ' FROM sampledata JOIN demographics USING (zip) JOIN community USING (fips)' AS statement 

It only prints the expression; it does not execute it. Then you just copy the result and run it.

If you want to dynamically generate and run a statement in one go, then you can read about how to run dynamic SQL in the PostgreSQL documentation .

Prepare column names with table name

Alternatively, this generates a selection list of all columns, including those that contain duplicate data, but then pseudonizes them to include the table name for each column.

 SELECT 'SELECT ' || ARRAY_TO_STRING(ARRAY( SELECT table_name || '.' || column_name || ' AS ' || table_name || '_' || column_name FROM information_schema.columns WHERE table_name in ('sampledata', 'demographics', 'community') ), ',') || ' FROM sampledata JOIN demographics USING (zip) JOIN community USING (fips)' AS statement 

Again, this only gives rise to a statement. If you want to dynamically generate and run the statement, then you will need to execute dynamic SQL execution for your database, otherwise just copy and run the result.

If you really need a dot delimiter in column aliases, you will have to use double-quoted aliases such as SELECT table_name || '.' || column_name || ' AS "' || table_name || '.' || column_name || '"' SELECT table_name || '.' || column_name || ' AS "' || table_name || '.' || column_name || '"' SELECT table_name || '.' || column_name || ' AS "' || table_name || '.' || column_name || '"' . However, double-quoted aliases can cause additional complications (case sensitivity, etc.); so I used the underscore instead to separate the table name from the column name inside the alias, and then the aliases can be treated like regular column names otherwise.

+2
source

All Articles