Group ranges

I have a data.table with many rows, and I would like to conditionally merge the ranges defined by two columns, Begin and End. These columns give the months during which the person in question does something. Here is some example data (you can read it with R, or use the clean tables below if you are not using R):

    # base:
    test <- read.table(text = "
      1 A mnb USA prim  4 12
      2 A mnb USA x     13 15
      3 A mnb USA un    16 25
      4 A mnb USA fdfds 1  2
      5 B ghf CAN sdg   3  27
      6 B ghf CAN hgh   28 29
      7 B ghf CAN y     24 31
      8 B ghf CAN ghf   38 42
    ", header = FALSE)
    library(data.table)
    setDT(test)
    names(test) <- c("row", "Person", "Name", "Country", "add info", "Begin", "End")

    out <- read.table(text = "
      1 A mnb USA fdfds 1  2
      2 A mnb USA -     4  25
      3 B ghf CAN -     3  31
      4 B ghf CAN ghf   38 42
    ", header = FALSE)
    setDT(out)
    names(out) <- c("row", "Person", "Name", "Country", "add info", "Begin", "End")

The grouping should work as follows: if person A made a trip from the 4th to the 15th month and travelled again from the 16th to the 24th month, I want to merge these consecutive (i.e. gap-free) activities into one range from the 4th to the 24th month. If person A then surfed from month 25 to 28, I would add this as well, and the whole group of activities would last from 4 to 28.

It gets tricky when periods overlap. For example, person A could also go fishing from 11 to 31, so the whole group would then run from 4 to 31. However, if person A did something from 1 to 2, that would stay separate (unlike 1 to 3, which would have to be merged too, because 3 connects to 4).

I hope this is clear; if not, you can find more examples in the code above. I use data.table because my dataset is quite large. So far I have started with sqldf, but that becomes problematic when there are many activities per person (let's say 8 or more). Can this be done in data.table, plyr, or sqldf?

Please note: I am also looking for an answer in SQL, because I could use it directly in sqldf or try to convert it to another language. sqldf supports (1) the SQLite backend database (the default), (2) the H2 Java database, (3) the PostgreSQL database, and (4) from sqldf 0.4-0 onwards also MySQL.
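To pin the rule down, here is a minimal Python sketch of it. Python is used only for illustration, and the function name `merge_ranges` and the tuple layout are assumptions made for this example, not part of any library. It sorts each person's ranges by Begin and extends the running range whenever the next Begin is at most one month past the current End:

```python
from itertools import groupby

# Sample rows from the question: (person, name, country, info, begin, end).
rows = [
    ("A", "mnb", "USA", "prim", 4, 12),
    ("A", "mnb", "USA", "x", 13, 15),
    ("A", "mnb", "USA", "un", 16, 25),
    ("A", "mnb", "USA", "fdfds", 1, 2),
    ("B", "ghf", "CAN", "sdg", 3, 27),
    ("B", "ghf", "CAN", "hgh", 28, 29),
    ("B", "ghf", "CAN", "y", 24, 31),
    ("B", "ghf", "CAN", "ghf", 38, 42),
]


def merge_ranges(rows):
    """Merge overlapping or directly adjacent ranges per person (hypothetical helper)."""
    merged = []
    # Sort by person, then begin, so each person's ranges are scanned left to right.
    ordered = sorted(rows, key=lambda r: (r[0], r[4], r[5]))
    for person, group in groupby(ordered, key=lambda r: r[0]):
        current = None
        for _, name, country, info, begin, end in group:
            if current is not None and begin <= current[5] + 1:
                # Touches or overlaps the running range: extend it, blank the info.
                current[3] = "-"
                current[5] = max(current[5], end)
            else:
                # A gap of at least one month: close the running range, start a new one.
                if current is not None:
                    merged.append(tuple(current))
                current = [person, name, country, info, begin, end]
        merged.append(tuple(current))
    return merged


for row in merge_ranges(rows):
    print(row)
```

On the sample data this yields the four rows of the desired output table: (1, 2) and (4, 25) for person A, (3, 31) and (38, 42) for person B.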

Edit: Here are the "clean" tables:

IN:

    Person  Name  Country  add info  Begin  End
    A       mnb   USA      prim      4      12
    A       mnb   USA      x         13     15
    A       mnb   USA      un        16     25
    A       mnb   USA      fdfds     1      2
    B       ghf   CAN      sdg       3      27
    B       ghf   CAN      hgh       28     29
    B       ghf   CAN      y         24     31
    B       ghf   CAN      ghf       38     42

OUT:

    Person  Name  Country  add info  Begin  End
    A       mnb   USA      fdfds     1      2
    A       mnb   USA      -         4      25
    B       ghf   CAN      -         3      31
    B       ghf   CAN      ghf       38     42
2 answers

I wrote the query below. It worked in my tests, and almost every mainstream database should be able to run it as is. Note that I suffixed my column names with an underscore, so please rename them before testing:

    SELECT
        r1.person_,
        r1.name_,
        r1.country_,
        CASE
            WHEN max(r2.begin_) = max(r1.begin_) THEN max(r1.info_)
            ELSE '-'
        END info_,
        MAX(r2.begin_) begin_,
        r1.end_
    FROM stack_39626781 r1
    INNER JOIN stack_39626781 r2 ON 1=1
        AND r2.person_ = r1.person_
        AND r2.begin_ <= r1.begin_ -- just optimizing...
    LEFT JOIN stack_39626781 r3 ON 1=1
        AND r3.person_ = r1.person_
        -- matches when another range overlaps this range end
        AND r3.end_ >= r1.end_ + 1
        AND r3.begin_ <= r1.end_ + 1
    LEFT JOIN stack_39626781 r4 ON 1=1
        AND r4.person_ = r2.person_
        -- matches when another range overlaps this range begin
        AND r4.end_ >= r2.begin_ - 1
        AND r4.begin_ <= r2.begin_ - 1
    WHERE 1=1
        -- get rows
        -- with no overlaps on end range and
        -- with no overlaps on begin range
        AND r3.person_ IS NULL
        AND r4.person_ IS NULL
    GROUP BY
        r1.person_,
        r1.name_,
        r1.country_,
        r1.end_

This query relies on the fact that no range in the output touches or overlaps any other range. If the output is to contain five ranges, then exactly five begin values and five end values exist that have no connection or overlap. Finding those and pairing them up should be easier than generating every connection and overlap. So here is what the query does:

  • Find all ranges per person whose end value has no overlap or connection with another range;
  • Find all ranges per person whose begin value has no overlap or connection with another range;
  • These are the valid boundaries, so join them all together to find the right pairs;
  • For each person and end, the correct begin is the largest available begin that is less than or equal to that end. The rule is easy to check: first, begin cannot be greater than end; second, if two or more candidate begin values lie below end, e.g. begin1 = end - 2 and begin2 = end - 5, then choosing the smaller one (begin2) would place the larger one (begin1) inside the resulting range, which contradicts begin1 being free of overlaps.
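The steps above can also be sketched outside SQL. The following Python snippet is only an illustration of the pairing idea for a single person; the function name `pair_free_ends` and the helper `covered` are hypothetical names chosen for this example, and the sketch follows the bullet list rather than reproducing the query line by line:

```python
def pair_free_ends(ranges):
    """Merge (begin, end) ranges of ONE person by pairing free begins with free ends.

    Hypothetical helper for illustration; assumes integer months.
    """
    def covered(month):
        # True when at least one range contains the given month.
        return any(b <= month <= e for b, e in ranges)

    # Ends that nothing continues: no range covers end + 1.
    free_ends = sorted({e for _, e in ranges if not covered(e + 1)})
    # Begins that nothing precedes: no range covers begin - 1.
    free_begins = sorted({b for b, _ in ranges if not covered(b - 1)})
    # For each free end, the matching begin is the largest free begin <= end.
    return [(max(b for b in free_begins if b <= e), e) for e in free_ends]


person_a = [(4, 12), (13, 15), (16, 25), (1, 2)]
person_b = [(3, 27), (28, 29), (24, 31), (38, 42)]
print(pair_free_ends(person_a))
print(pair_free_ends(person_b))
```

The `covered` check is quadratic in the number of ranges per person, which mirrors the self-joins in the query; it is meant to show the logic, not to be fast.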

Hope this helps.


If you are running SQL Server 2012 or later, you can use the LAG and LEAD functions to build the logic that produces the desired final dataset. I believe these functions have also been available in Oracle since Oracle 8i.

Below is the solution I created in SQL Server 2012; it should help you. The sample values you posted are loaded into a temporary table that stands in for what you call your first "clean table". Using these two functions together with the OVER clause, I arrived at your final dataset with the T-SQL code below. I left some commented-out lines in the code to show how I built up the overall solution piece by piece. The CASE expression for the GapMarker column, which acts as the grouping flag, covers the various scenarios.

    IF OBJECT_ID('tempdb..#MyTable') IS NOT NULL
        DROP TABLE #MyTable

    CREATE TABLE #MyTable
    (
         Person   CHAR(1)
        ,[Name]   VARCHAR(3)
        ,Country  VARCHAR(10)
        ,add_info VARCHAR(10)
        ,[Begin]  INT
        ,[End]    INT
    )

    INSERT INTO #MyTable (Person, Name, Country, add_info, [Begin], [End])
    VALUES ('A', 'mnb', 'USA', 'prim', 4, 12),
           ('A', 'mnb', 'USA', 'x', 13, 15),
           ('A', 'mnb', 'USA', 'un', 16, 25),
           ('A', 'mnb', 'USA', 'fdfds', 1, 2),
           ('B', 'ghf', 'CAN', 'sdg', 3, 27),
           ('B', 'ghf', 'CAN', 'hgh', 28, 29),
           ('B', 'ghf', 'CAN', 'y', 24, 31),
           ('B', 'ghf', 'CAN', 'ghf', 38, 42);

    WITH CTE
    AS (SELECT mt.Person
              ,mt.Name
              ,mt.Country
              ,mt.add_info
              ,mt.[Begin]
              ,mt.[End]
              --,LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End])
              --,CASE WHEN [End] + 1 = LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End])
              --      --AND LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End]) = LEAD([End], 1) OVER (PARTITION BY mt.Person ORDER BY [End])
              --      THEN 1
              --      ELSE 0
              -- END AS Grp
              --,MARKER = COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [End]))
              ,CASE WHEN mt.[End] + 1 = COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [End]))
                      OR 1 + COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [End])) = mt.[Begin]
                      OR COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin])) BETWEEN mt.[Begin] AND mt.[End]
                      OR [End] BETWEEN LAG([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]) AND LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin])
                    THEN 1
                    ELSE 0
               END AS GapMarker
              ,InBetween = COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]))
              ,EndInBtw = LAG([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin])
              ,LagEndInBtw = LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin])
        FROM #MyTable mt
        --ORDER BY mt.Person, mt.[Begin]
       )
    SELECT DISTINCT
           X.Person
          ,X.[Name]
          ,X.Country
          ,t.add_info
          ,X.MinBegin
          ,X.MaxEnd
    FROM (SELECT c.Person
                ,c.[Name]
                ,c.Country
                ,c.add_info
                ,c.[Begin]
                ,c.[End]
                ,c.GapMarker
                ,c.InBetween
                ,c.EndInBtw
                ,c.LagEndInBtw
                ,MIN(c.[Begin]) OVER (PARTITION BY c.Person, c.GapMarker ORDER BY c.Person) AS MinBegin
                ,MAX(c.[End]) OVER (PARTITION BY c.Person, c.GapMarker ORDER BY c.Person) AS MaxEnd
                --, CASE WHEN c.[End]+1 = c.MARKER
                --         OR c.MARKER +1 = c.[Begin]
                --       THEN 1
                --       ELSE 0
                --  END Grp
          FROM CTE AS c) X
    LEFT JOIN #MyTable AS t
           ON t.[Begin] = X.[MinBegin]
          AND t.[End] = X.[MaxEnd]
          AND t.Person = X.Person
    ORDER BY X.Person, X.MinBegin
    --ORDER BY Person, [Begin]

And here is a screenshot of the results that match your desired final data set:

[Image: screenshot of the query results]
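For readers on the SQLite backend that sqldf uses by default, the window-function idea can be sketched as follows. This is not the T-SQL above but a different, commonly used variant: it compares each Begin with the running maximum of all earlier End values instead of using LAG and LEAD. The table name `ranges`, the underscored column names, and the `new_group`/`grp` helper columns are assumptions made for this example, and it assumes SQLite 3.25 or newer (the first release with window functions). Python's standard `sqlite3` module is used only to make the query runnable:

```python
import sqlite3

# Sample rows from the question: (person, name, country, info, begin_, end_).
rows = [
    ("A", "mnb", "USA", "prim", 4, 12), ("A", "mnb", "USA", "x", 13, 15),
    ("A", "mnb", "USA", "un", 16, 25), ("A", "mnb", "USA", "fdfds", 1, 2),
    ("B", "ghf", "CAN", "sdg", 3, 27), ("B", "ghf", "CAN", "hgh", 28, 29),
    ("B", "ghf", "CAN", "y", 24, 31), ("B", "ghf", "CAN", "ghf", 38, 42),
]

QUERY = """
WITH flagged AS (
    -- A row starts a new group unless its begin is at most one month past
    -- the largest end seen so far for the same person.
    SELECT person, name, country, info, begin_, end_,
           CASE WHEN begin_ <= MAX(end_) OVER (
                         PARTITION BY person ORDER BY begin_, end_
                         ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) + 1
                THEN 0 ELSE 1 END AS new_group
    FROM ranges
),
grouped AS (
    -- The running sum of the flags numbers the groups within each person.
    SELECT person, name, country, info, begin_, end_,
           SUM(new_group) OVER (
               PARTITION BY person ORDER BY begin_, end_
               ROWS UNBOUNDED PRECEDING) AS grp
    FROM flagged
)
SELECT person, name, country,
       CASE WHEN COUNT(*) = 1 THEN MAX(info) ELSE '-' END AS info,
       MIN(begin_) AS first_begin,
       MAX(end_) AS last_end
FROM grouped
GROUP BY person, name, country, grp
ORDER BY person, first_begin
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ranges (person TEXT, name TEXT, country TEXT, "
    "info TEXT, begin_ INTEGER, end_ INTEGER)"
)
conn.executemany("INSERT INTO ranges VALUES (?, ?, ?, ?, ?, ?)", rows)
result = conn.execute(QUERY).fetchall()
for row in result:
    print(row)
```

Using the running maximum rather than only the previous row's End matters for cases like person B, where the range 28 to 29 is followed by nothing adjacent but is already covered by 24 to 31.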
