I am implementing a cross-tabulation library in Python as a programming exercise for my new job. I have an implementation of the requirements that works, but it is inelegant and redundant. I would like a better model for it, one that allows data to move easily and cleanly between the base model, stored as tabular data in flat files, and all the statistical analysis results that can be built from it.
Currently, I go from a set of tuples, one for each row in the table, to a histogram that counts how often the combinations of interest occur, to a serializer that somewhat awkwardly compiles the output into a set of table cells for display. However, I end up going back to the table or the histogram more often than I would like, because the later stages lack information.
So any ideas?
Edit: Here is an example of some data and what I want to build from it. Note that "." denotes a "missing" value, that is, one that is only conditionally counted.
    1 . 1
    1 0 3
    1 0 3
    1 2 3
    2 . 1
    2 0 .
    2 2 2
    2 2 4
    2 2 .
If I were considering the correlation between columns 0 and 2 above, this is the table I would have:
      . 1 2 3 4
    1 0 1 0 3 0
    2 2 1 1 0 1
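For illustration, here is a minimal sketch of how the table above could be derived from the rows using `collections.Counter`. The names `ROWS` and `crosstab` are hypothetical, not part of any existing implementation, and the sketch assumes "." is always counted as an ordinary category rather than conditionally.

```python
from collections import Counter

# Sample rows from the example data; "." marks a missing value.
ROWS = [
    ("1", ".", "1"),
    ("1", "0", "3"),
    ("1", "0", "3"),
    ("1", "2", "3"),
    ("2", ".", "1"),
    ("2", "0", "."),
    ("2", "2", "2"),
    ("2", "2", "4"),
    ("2", "2", "."),
]


def crosstab(rows, row_col, col_col):
    """Count co-occurrences of the values found in two columns.

    Hypothetical helper: returns (row_labels, col_labels, table), where
    table[i][j] is the number of rows whose row_col value is row_labels[i]
    and whose col_col value is col_labels[j].
    """
    counts = Counter((r[row_col], r[col_col]) for r in rows)
    row_labels = sorted({r for r, _ in counts})
    col_labels = sorted({c for _, c in counts})
    # Counter returns 0 for combinations that never occurred.
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table


row_labels, col_labels, table = crosstab(ROWS, 0, 2)
for label, row in zip(row_labels, table):
    print(label, *row)
```

Keeping the labels next to the counts is one way to avoid going back to the source table when the cells are later serialized.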
In addition, I would like to be able to calculate ratios such as frequency / total, frequency / subtotal, etc.
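As a rough sketch of those ratios, the following computes each cell as a share of the grand total, of its row subtotal, and of its column subtotal. The counts are copied from the cross tabulation of columns 0 and 2; the names `COUNTS`, `ratio`, and `proportions` are hypothetical, and returning 0.0 for an empty subtotal is an assumption rather than a requirement.

```python
# Counts copied from the cross tabulation of columns 0 and 2.
# Rows are labelled "1", "2"; columns are labelled ".", "1", "2", "3", "4".
COUNTS = [
    [0, 1, 0, 3, 0],
    [2, 1, 1, 0, 1],
]


def ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def proportions(counts):
    """Return (of_total, of_row, of_col) tables of cell ratios."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(row_totals)

    of_total = [[ratio(n, grand_total) for n in row] for row in counts]
    of_row = [
        [ratio(n, row_totals[i]) for n in row]
        for i, row in enumerate(counts)
    ]
    of_col = [
        [ratio(n, col_totals[j]) for j, n in enumerate(row)]
        for row in counts
    ]
    return of_total, of_row, of_col


of_total, of_row, of_col = proportions(COUNTS)
```

Computing the subtotals once and reusing them means the ratios need only the count table, not the original rows.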
python algorithm data-structures statistics crosstab
Chris r