Is there a data structure / library for in-memory OLAP / pivot tables in Java / Scala?

Related questions

This question is quite relevant, but it is 2 years old: In-memory OLAP engine in Java

Background

I would like to create a pivot table (a matrix-like summary) from a given tabular dataset in memory,

e.g. age by marital status (rows: age, columns: marital status).

  • Input: a list of people, each with an age and some Boolean property (e.g. married)

  • Desired result: the number of people by age (row) and isMarried (column)

What I tried (Scala)

    case class Person(age: Int, isMarried: Boolean)
    ...
    val people: List[Person] = ...
    val peopleByAge = people.groupBy(_.age)                 // only by age
    val peopleByMaritalStatus = people.groupBy(_.isMarried) // only by marital status

I managed to do it naively: first I group by age, then I map each group to its counts by marital status, and finally I foldRight to accumulate the running totals.

    import scala.collection.immutable.TreeMap

    TreeMap(peopleByAge.toSeq: _*).map(x => {
      val age = x._1
      val rows = x._2
      val numMarried = rows.count(_.isMarried)
      val numNotMarried = rows.length - numMarried
      (age, numMarried, numNotMarried)
    }).foldRight(List[FinalResult]())((row, list) => {
      // running totals, accumulated starting from the highest age
      val cumMarried = row._2 + (if (list.isEmpty) 0 else list.last.cumMarried)
      val cumNotMarried = row._3 + (if (list.isEmpty) 0 else list.last.cumNotMarried)
      list :+ new FinalResult(row._1, row._2, row._3, cumMarried, cumNotMarried)
    }).reverse

I do not like the code above: it is inefficient and hard to read, and I am sure there is a better way.

Question(s)

How do I group by both fields, and how do I get a count for each subgroup? For example:

How many people are exactly 30 years old and married?

Another question is how I can compute a running total to answer questions such as:

How many people over 30 are married?


Edit:

Thank you for your great answers.

Just to clarify, I would like the output to be a "table" with the following columns:

  • Age (increasing)
  • Num married
  • Num not married
  • Running total married
  • Running total not married

The goal is not only to answer these specific queries, but to produce a report that answers all such questions.

+6
4 answers

Here is an option that is a bit more verbose, but does it in a general way instead of using strict data types. Of course, you could use generics to do it better, but I think you get the idea.

    /** Creates a new pivot structure by finding correlated values
      * and performing an operation on these values
      *
      * @param accuOp the accumulator function (e.g. sum, max, etc.)
      * @param xCol the "x" axis column
      * @param yCol the "y" axis column
      * @param accuCol the column to collect and perform accuOp on
      * @return a new Pivot instance that has been transformed with the accuOp function
      */
    def doPivot(accuOp: List[String] => String)(xCol: String, yCol: String, accuCol: String) = {
      // create list of indexes that correlate to x, y, accuCol
      val colsIdx = List(xCol, yCol, accuCol).map(headers.getOrElse(_, 1))

      // group by x and y, sending the resulting collection of
      // accumulated values to the accuOp function for post-processing
      val data = body.groupBy(row => {
        (row(colsIdx(0)), row(colsIdx(1)))
      }).map(g => {
        (g._1, accuOp(g._2.map(_(colsIdx(2)))))
      }).toMap

      // get distinct axis values
      val xAxis = data.map(g => {g._1._1}).toList.distinct
      val yAxis = data.map(g => {g._1._2}).toList.distinct

      // create result matrix
      val newRows = yAxis.map(y => {
        xAxis.map(x => {
          data.getOrElse((x,y), "")
        })
      })

      // collect it with axis labels for results
      Pivot(List((yCol + "/" + xCol) +: xAxis) :::
        newRows.zip(yAxis).map(x=> {x._2 +: x._1}))
    }

My Pivot type is pretty simple:

    class Pivot(val rows: List[List[String]]) {
      val headers = rows.head.zipWithIndex.toMap
      val body = rows.tail
      ...
    }

And to test it, you can do something like this:

    val marriedP = Pivot(
      List(
        List("Name", "Age", "Married"),
        List("Bill", "42", "TRUE"),
        List("Heloise", "47", "TRUE"),
        List("Thelma", "34", "FALSE"),
        List("Bridget", "47", "TRUE"),
        List("Robert", "42", "FALSE"),
        List("Eddie", "42", "TRUE")
      )
    )

    // counts the values in each cell
    def accum(values: List[String]) = {
      values.map(x => {1}).sum.toString
    }

    println(marriedP + "\n")
    println(marriedP.doPivot(accum)("Age", "Married", "Married"))

Which gives:

    Name     Age  Married
    Bill     42   TRUE
    Heloise  47   TRUE
    Thelma   34   FALSE
    Bridget  47   TRUE
    Robert   42   FALSE
    Eddie    42   TRUE

    Married/Age  47  42  34
    TRUE         2   2
    FALSE            1   1

The nice thing is that you can use currying to pass in any function over the values, much as you would in a traditional Excel pivot table.
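The remark about generics can be made concrete. The following is only a minimal sketch, not the pivotfun implementation: the `pivot` function, the `name` field on `Person`, and the object name are assumptions made for illustration, and the sample rows are the ones from the test above. The data comes first in the parameter lists so that Scala can infer the remaining types from it.

```scala
object GenericPivotSketch {
  // Hypothetical record type; the name field is added only for this example.
  final case class Person(name: String, age: Int, isMarried: Boolean)

  /** Groups rows by (row key, column key) and aggregates each cell with agg.
    * Cells with no matching rows are simply absent from the resulting map. */
  def pivot[A, R, C, V](rows: List[A])(rowKey: A => R, colKey: A => C)(agg: List[A] => V): Map[(R, C), V] =
    rows.groupBy(a => (rowKey(a), colKey(a))).map { case (key, cell) => key -> agg(cell) }

  // Same rows as the marriedP test data.
  val people: List[Person] = List(
    Person("Bill", 42, true),
    Person("Heloise", 47, true),
    Person("Thelma", 34, false),
    Person("Bridget", 47, true),
    Person("Robert", 42, false),
    Person("Eddie", 42, true)
  )

  // Count per (age, isMarried) cell.
  val counts: Map[(Int, Boolean), Int] =
    pivot(people)(_.age, _.isMarried)(_.size)

  // A different aggregator on the same axes: the names in each cell.
  val names: Map[(Int, Boolean), String] =
    pivot(people)(_.age, _.isMarried)(_.map(_.name).mkString(","))

  def main(args: Array[String]): Unit = {
    println(counts.getOrElse((42, true), 0))
    println(names.getOrElse((47, true), ""))
  }
}
```

Swapping the last argument is all it takes to change the aggregation, which is the same idea as passing `accum` to `doPivot`, but with typed keys and values instead of strings.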

More details can be found here: https://github.com/vinsonizer/pivotfun

+4

You can do

    val groups = people.groupBy(p => (p.age, p.isMarried))

and then

    // throws NoSuchElementException if nobody is 30 and married
    val thirty_and_married = groups((30, true))
    val over_thirty_and_married_count =
      groups.filterKeys(k => k._1 > 30 && k._2).map(_._2.length).sum
+4

I think it would be better to use the count method on List directly.

For question 1

 people.count { p => p.age == 30 && p.isMarried } 

For question 2

 people.count { p => p.age > 30 && p.isMarried } 

If you also want the actual people who match these predicates, use filter.

 people.filter { p => p.age > 30 && p.isMarried } 

Perhaps you could optimize this by traversing the list only once, but is that a requirement?
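If a single traversal does matter, one way to get both counts at once is a fold whose accumulator carries both counters. This is a minimal sketch under assumptions: the object name and the sample data are invented for illustration.

```scala
object SinglePassSketch {
  final case class Person(age: Int, isMarried: Boolean)

  // Sample data invented for this example.
  val people: List[Person] = List(
    Person(30, true), Person(30, false), Person(30, true),
    Person(35, true), Person(41, true), Person(41, false),
    Person(25, true), Person(52, true)
  )

  // One traversal: the accumulator is the pair (exactly 30 and married, over 30 and married).
  val (exactly30Married, over30Married) =
    people.foldLeft((0, 0)) { case ((at30, over30), p) =>
      if (!p.isMarried) (at30, over30)
      else if (p.age == 30) (at30 + 1, over30)
      else if (p.age > 30) (at30, over30 + 1)
      else (at30, over30)
    }

  def main(args: Array[String]): Unit =
    println(s"exactly 30 and married: $exactly30Married, over 30 and married: $over30Married")
}
```

For a handful of questions, the two separate `count` calls are easier to read; the fold only pays off when the list is large or expensive to traverse.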

+1

You can group using a tuple:

    val res1 = people.groupBy(p => (p.age, p.isMarried))
    // or, if you don't care about the Person instances
    val res2 = people.groupBy(p => (p.age, p.isMarried)).mapValues(_.size)

You can then answer both questions:

    res2.getOrElse((30, true), 0)
    res2.filter { case (k, _) => k._1 > 30 && k._2 }.values.sum
    res2.filterKeys(k => k._1 > 30 && k._2).values.sum // nicer with filterKeys from Rex Kerr's answer

You can also answer both questions using the count method on List:

    people.count(p => p.age == 30 && p.isMarried)
    people.count(p => p.age > 30 && p.isMarried)

Or using filter and size:

    people.filter(p => p.age == 30 && p.isMarried).size
    people.filter(p => p.age > 30 && p.isMarried).size

Edit: a slightly cleaner version of your code:

    TreeMap(peopleByAge.toSeq: _*).map { case (age, ps) =>
      // partition splits the whole group; span would only take the leading run of married people
      val (married, notMarried) = ps.partition(_.isMarried)
      (age, married.size, notMarried.size)
    }.foldLeft(List[FinalResult]()) { case (acc, (age, married, notMarried)) =>
      // running totals come from the most recently added row, if any
      def prevValue(f: (FinalResult) => Int) = acc.headOption.map(f).getOrElse(0)
      new FinalResult(age, married, notMarried,
        prevValue(_.cumMarried) + married,
        prevValue(_.cumNotMarried) + notMarried) :: acc
    }.reverse
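The running-total step can also be written with `scanLeft`, which passes the previous row along directly, so no lookup of the last element is needed. This is a sketch under assumptions rather than a definitive version: `ReportRow` is a hypothetical stand-in for `FinalResult` (whose definition is not shown), and the sample ages are taken from the test data in the first answer.

```scala
object RunningTotalsSketch {
  final case class Person(age: Int, isMarried: Boolean)

  // Hypothetical stand-in for FinalResult, which is not defined in the question.
  final case class ReportRow(age: Int, married: Int, notMarried: Int,
                             cumMarried: Int, cumNotMarried: Int)

  def report(people: List[Person]): List[ReportRow] = {
    // Per-age counts, sorted by increasing age.
    val perAge: List[(Int, Int, Int)] =
      people.groupBy(_.age).toList.sortBy(_._1).map { case (age, ps) =>
        val married = ps.count(_.isMarried)
        (age, married, ps.size - married)
      }

    // scanLeft emits the seed first, so the artificial all-zero row is dropped with tail.
    perAge.scanLeft(ReportRow(0, 0, 0, 0, 0)) { case (prev, (age, married, notMarried)) =>
      ReportRow(age, married, notMarried,
        prev.cumMarried + married,
        prev.cumNotMarried + notMarried)
    }.tail
  }

  val people: List[Person] = List(
    Person(42, true), Person(47, true), Person(34, false),
    Person(47, true), Person(42, false), Person(42, true)
  )

  val rows: List[ReportRow] = report(people)

  def main(args: Array[String]): Unit = rows.foreach(println)
}
```

The totals here run from the lowest age upward, as in the foldLeft version. A question such as "how many people over 30 are married" is then the final cumMarried minus the cumMarried of the last row with age 30 or less.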
+1

Source: https://habr.com/ru/post/928141/

