What is a good cross tabulation data model?

I am introducing a cross tabulation library in Python as a programming exercise for my new job, and I have an implementation of requirements that work, but are inelegant and redundant. I would like to have a better model in it, which allows easy and clean data movement between the base model, stored as tabular data in flat files, and all the statistical analysis results that can be set from this.

Currently, I have a transition from a set of tuples for each row in the table, to a histogram that counts the frequency of occurrence of the sets of interest to us, to a serializer that somewhat awkwardly compiles the output into a set of table cells for display. However, I have to return to the table or the histogram more often than I want, because it lacks information.

So any ideas?

Edit: Here is an example of some data and what I want to build from This. Note that "." indicates a little "missing" data, that is, only conditionally counted.

1 . 1 1 0 3 1 0 3 1 2 3 2 . 1 2 0 . 2 2 2 2 2 4 2 2 . 

If I were considering the correlation between columns 0 and 2 above, this is the table I would have:

  . 1 2 3 4 1 0 1 0 3 0 2 2 1 1 0 1 

In addition, I would like to be able to calculate the ratio of frequency / total value, frequency / subtotal, etc.

+6
python algorithm data-structures statistics crosstab
source share
4 answers

You can use the sqlite database as a data structure and define the desired operations as SQL queries.

 import sqlite3 c = sqlite3.Connection(':memory:') c.execute('CREATE TABLE data (a, b, c)') c.executemany('INSERT INTO data VALUES (?, ?, ?)', [ (1, None, 1), (1, 0, 3), (1, 0, 3), (1, 2, 3), (2, None, 1), (2, 0, None), (2, 2, 2), (2, 2, 4), (2, 2, None), ]) # queries # ... 
+1
source share

SW posted a good basic recipe for this on activestate.com .

The essence seems ...

  • Define xsort = [] and ysort = [] as arrays of your axes. Fill them out by iterating through your data or in some other way.
  • Define rs = {} as the dict dicts of your tabulated data, iterating through your data and incrementing rs [yvalue] [xvalue]. Create missing keys if necessary.

Then, for example, the sum for the string y will be sum([rs[y][x] for x in xsort])

+1
source share

Since this is early Python programming, they probably want you to see which Python built-in mechanisms are appropriate for the original version of the problem. Dictionary structure seems to be a good candidate. The first column value from your tab-sep file may be the key in the dictionary. The record found by this key can itself be a dictionary whose key is the second value of the column. Subdomain entries will be counters initialized to 1 when you add new subtasks when the couple first encounters.

0
source share

Why not save it using HTML tables? It may not be the best, but you could very easily view it in a browser.

Edit:

I am just re-reading the question and you are asking for a data model, not a storage model. To answer this question ...

It all depends on how you report the data. For example, if you are going to do a lot of rotation or aggregation, it might make sense to store it in the main order of the column, so you can just sum the column to get the number of samples, for example.

This will help a lot if you explain what information you are trying to extract.

-one
source share

All Articles