Speeding up set comparisons by hashing them (overkill..?)

After my last, failed attempt to ask a question here, this time I'm asking a more precise one:

What I have:

  • A huge dataset of ISet<Point> (finite, but I spent days of multicore processing time to calculate it ...)
  • A list of input values from 0 to 2^n, with n ≤ 17

What I need:

A table of [1] against [2], where each value from [2] is combined with each value from [1].

Processing:

For this calculation, I have a formula that takes a bit value (from [2]) and a set of positions (from [1]) and creates a new ISet<Point>. I need to find out which of the source sets is value-equal to the result set (i.e., the cell "A7" in the table can point to "B").

Naive way:

Compute the new ISet<Point> and use .Contains(mySet) or something similar on the list of values from [1]. I did this in previous versions of this proof of concept / pet project, and it got slow once I started to feed in huge numbers. Yes, I used a profiler. No, this was not the only slow part of the system, but a lot of time was spent in this naive search / matching.
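Roughly, this naive lookup amounts to the following (a simplified sketch with made-up names, not the actual project code):

    using System.Collections.Generic;
    using System.Drawing;

    static class NaiveLookup
    {
        // Scan the whole source list and compare each set by value.
        // Every lookup walks the list, and every comparison walks a set,
        // which is exactly the part that got slow with huge inputs.
        public static int FindIndex(IList<ISet<Point>> sourceSets, ISet<Point> result)
        {
            for (int i = 0; i < sourceSets.Count; i++)
            {
                if (sourceSets[i].SetEquals(result))
                    return i;
            }
            return -1; // no value-equal set found
        }
    }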

The question finally:

Since I basically just need to map inputs to outputs, I thought about creating a list of hashed values for the ISet<Point> list, doing the same for my processing results, and thus avoiding comparing whole sets.

Is that a good idea? Would you call this premature optimization (I know the naive way above is too slow, but should I try something less clever first? Performance really matters here; think of that processing time again)? Any other suggestions for taking load off here, or things I should read up on?


Update: Sorry for not providing a more detailed explanation or a sample right away.

Sample for [1] (Note: these are real possible data points, but obviously they are limited):

 new List<ISet<Point>>() { new HashSet<Point>() { new Point(0,0) }, new HashSet<Point>() { new Point(0,0), new Point(2,1) }, new HashSet<Point>() { new Point(0,1), new Point(3,1) } }

[2] is just a Boolean vector of length n. For n = 2 these are:

  • 0,0
  • 0,1
  • 1,0
  • 1,1

I can represent this with an int or a long, basically.
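Just to illustrate that representation (a sketch; the exact encoding does not matter), the vectors can simply be the integers 0 .. 2^n - 1, with individual bits read by shift and mask:

    // All bit vectors of length n as plain ints (n <= 17 easily fits in an int).
    int n = 2;
    for (int vector = 0; vector < (1 << n); vector++)
    {
        for (int bit = 0; bit < n; bit++)
        {
            bool isSet = (vector & (1 << bit)) != 0; // value of this bit in the vector
            // ... use isSet ...
        }
    }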

Now I have a function that takes a vector and an ISet<Point> and returns a new ISet<Point>. This is not a 1:1 conversion: a set of 5 points can lead to a set of 11 points or whatever. However, the resulting ISet<Point> is always one of the sets from the input list [1] (by value).

Using letters for the point sets and numbers for the bit vectors, I start with this:

   A B C D E F
 1
 2
 3
 4
 5
 6
 7

In the end I need

   A B C D E F
 1 - C A E - -
 2 B C E F A -
 3 ...
 4 ...
 5 ...
 6 F C B A - -
 7 E - C A - D

There are several expensive operations in this project; one is the preparation of the point sets ([1]). But this question is about the matching: I can calculate the target ISet for a given bit vector and source ISet easily enough (more or less, not so important now). What I need is to match / find that result in the source list.

The whole beast will be a finite state machine, where each set of points is an actual state. Later on I am not interested in the individual states, so I can refer to them by anything (a letter, an index, whatever). I just need to keep the associations:

1, B => C
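In code, keeping only those associations could be as simple as a transition table (a sketch, assuming states end up as int indices and vectors as ints; all names here are made up):

    using System.Collections.Generic;

    // (bit vector, source state) -> target state; "1, B => C" with B = 1, C = 2:
    var transitions = new Dictionary<(int vector, int state), int>();
    transitions[(1, 1)] = 2;

    int next = transitions[(1, 1)]; // follow the transition later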


Update: Eric asked whether a HashSet is possible. The answer is yes, but only if the dataset stays small enough. My question regarding hashing is: is it possible / advisable to use a hashing algorithm for this? My idea is this:

  • Go through the (lazily generated) list / sequence of ISet<Point> (I could change this type; I just want to emphasize that this is a mathematical set of points, without duplicates).

    • Create a simpler representation of the input (a hash?) and store it (in a HashSet?)
    • Compute all target sets for this input, but only keep a simple representation (see above)
    • Discard the set
  • Record the mapping (equal hash = equal state)

A good idea? Problems with it? One issue I can think of is collisions (how likely are they?), and I would not know a good hash function to start with.
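For example, an order-independent hash over the points could look like the sketch below (an arbitrary choice, not a vetted hash function). Since collisions are possible, either the real sets have to be kept around to verify a match, or a small risk of wrongly merging two states has to be accepted:

    using System.Collections.Generic;
    using System.Drawing;

    static class PointSetHash
    {
        // Combine per-point hashes with a commutative operation (unchecked addition)
        // so that enumeration order does not affect the result.
        public static int Compute(ISet<Point> set)
        {
            unchecked
            {
                int hash = 0;
                foreach (var p in set)
                    hash += (p.X * 397) ^ p.Y; // arbitrary per-point hash
                return hash;
            }
        }
    }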

1 answer

OK, I think I understand the problem now. Let me see if I can rephrase.

Let's start by abstracting away the details. We will keep it abstract.

You have a large list L containing instances of the reference type S (for "set"). Such a list, of course, is logically a mapping from natural numbers N to S.

 L: N --> S 

S has the property that two instances can be compared both for reference equality and for value equality. That is, there can be two instances of S that are not reference-equal, but logically represent the same value.

You have a function F that takes a value of type V (for "vector") and an instance of type S and creates another instance of type S.

 F: (V, S) --> S 

In addition, you know that if F is given an instance of S from L, then the resulting instance of S will be value-equal to some instance in the list, but not necessarily reference-equal to it.

The problem you are facing: given an instance s of S that is the result of calling F, which member L(n) is value-equal to s?

Yes?

The naive method, walking down L(1), L(2), ... and testing for set value equality along the way, will be dead slow. It is at least linear in the size of L.

I can think of several different approaches. The easiest is your initial thought: make L something other than a list. Can you make it a HashSet<S> instead of a List<S>? If you implement a hashing algorithm and an equality method on S, you can build a fast lookup table.
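A sketch of that idea: if the concrete sets are HashSet<Point>, the BCL already provides a value-based comparer via HashSet<Point>.CreateSetComparer(), so the list can be indexed once and each computed result resolved with a single hash of that set instead of a scan of the list (the data and names here are illustrative):

    using System.Collections.Generic;
    using System.Drawing;

    // Illustrative data standing in for the list from [1].
    var sourceSets = new List<HashSet<Point>>
    {
        new HashSet<Point> { new Point(0, 0) },
        new HashSet<Point> { new Point(0, 0), new Point(2, 1) },
        new HashSet<Point> { new Point(0, 1), new Point(3, 1) },
    };

    // Index the sets once, keyed by value equality rather than reference equality.
    var comparer = HashSet<Point>.CreateSetComparer();
    var index = new Dictionary<HashSet<Point>, int>(comparer);
    for (int i = 0; i < sourceSets.Count; i++)
        index[sourceSets[i]] = i;

    // Resolving a computed result costs one hash of that set plus a bucket probe.
    var result = new HashSet<Point> { new Point(0, 0), new Point(2, 1) };
    if (index.TryGetValue(result, out int stateIndex))
    {
        // stateIndex identifies which source set the result is value-equal to
    }

If the sets are only available as ISet<Point>, you would need a custom IEqualityComparer<ISet<Point>> instead, but the idea is the same.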

If this does not fit, we will consider other options.

UPDATE:

OK, so I see two main ways to solve the memory problem: (1) store everything in memory using data structures that are much smaller than your current implementation, or (2) change how you store things on disk so that you can keep an "index" in memory and quickly seek to the right "page" of the disk file to extract the necessary information.

You could represent a point as a single short, where the top byte is x and the bottom byte is y, instead of representing it as two ints; that saves 75%.

A set of points can then be implemented as a sorted array of shorts, which is quite compact and easy to write a hash algorithm for.

This is probably the approach I would use, since your data is so compressible.
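A sketch of that packing, assuming x and y each fit into a byte (0..255); adjust the widths if your grid is larger:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class PackedPointSet
    {
        // One point in one short: high byte = x, low byte = y.
        public static short Pack(int x, int y) => unchecked((short)((x << 8) | y));

        public static (int X, int Y) Unpack(short packed) =>
            ((packed >> 8) & 0xFF, packed & 0xFF);

        // A set of points as a sorted short[]: compact, and two value-equal sets
        // produce identical arrays, which makes hashing and comparison easy.
        public static short[] ToSortedArray(IEnumerable<short> packedPoints)
        {
            var array = packedPoints.Distinct().ToArray();
            Array.Sort(array);
            return array;
        }

        // Simple order-dependent hash over the sorted array.
        public static int Hash(short[] sortedPoints)
        {
            unchecked
            {
                int hash = 17;
                foreach (short p in sortedPoints)
                    hash = hash * 31 + p;
                return hash;
            }
        }
    }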

