Best choice for memory data structure for IP address filter in Java

I have a file which is a CIDR format like this 192.168.1.0/24 and it will convert to these two strucutre columns

 3232236030 3232235777 

Each string IP address translation happens with this code:

 String subnet = "192.168.1.0/24"; SubnetUtils utils = new SubnetUtils(subnet); Inet4Address a = (Inet4Address) InetAddress.getByName(utils.getInfo().getHighAddress()); long high = bytesToLong(a.getAddress()); Inet4Address b = (Inet4Address) InetAddress.getByName(utils.getInfo().getLowAddress()); long low = bytesToLong(b.getAddress()); private static long bytesToLong(byte[] address) { long ipnum = 0; for (int i = 0; i < 4; ++i) { long y = address[i]; if (y < 0) { y += 256; } ipnum += y << ((3 - i) * 8); } return ipnum; } 

Consider more than 5 million records (low high : 3232236030 3232235777) .
There will also be intersections, so IP can come from several ranges. Only the first is more than normal.
Read only data.
What would be the fastest way to find the range that ipToBefiltered belongs ipToBefiltered ? The structure will be completely in memory, so there will be no database search.

UPDATE:

I found this Peerblock project (it has over a million downloads, so I think it should have some quick algorithms): http://code.google.com/p/peerblock/source/browse/trunk/src/pbfilter/filter_wfp .c

Does anyone know which method the project uses to create a list of ranges and a search?

+7
source share
5 answers

When it comes to this, I just need to know if IP is present in any of the 5M bands.

I would consider an n-ary tree, where n = 256, and work with the exact address, not the converted integer.

The top level will be an array of 256 objects. A null entry means “No”, there is no range that contains the address, so if your example is 192.168.1.0/24 array [192] will contain an object, but array [100] can be zero, because for any 100.xxx/

The saved object contains a link (link to) of another array [256] and a range specifier, only one of the two will be set, therefore 192.0.0.0/8 will have a range specifier indicating that all addresses in this range correspond to filter. This will allow you to use things like 192.255.0.0/10 , where the first 10 bits of the address are significant 1100 0000 11xx xxxx - otherwise you need to check the next octet in the array of the 2nd level.

Initially coalescing overlapping ranges, if any, in large ranges ... for example. 3 .. 10 and 7 .. 16 becomes 3 .. 16 ... allows this, since you do not need to associate a given IP address with a specific range.

This requires no more than 8 comparisons. Each octet is initially used directly as an index, followed by a comparison for null, a comparison for terminal-node (this is a range or pointer to the next level of the tree)

In the worst case, memory consumption is theoretically 4 GB (256 ^ 4) if each IP address is in the filtering range, but, of course, will be combined into one range, so in reality it will be only 1 object in the range. A more realistic worst case is likely to be more like (256 ^ 3) or 16.7 MB. Use in the real world is likely to have most array nodes [256] at each level empty.

This is essentially similar to the Huffman / prefix encoding. The shortest single prefix may end as soon as an answer (range) is found, so often you will compare average values < 4 .

+7
source

I would use a sorted int array (base address) and another array of the same size (end address). This will use 5M * 8 = 40 MB. The first IP is the base, and the second IP is the last address in the range. You will need to remove the intersection.

To find out if the address is filtered in binary search O (log N), and if there is no exact match, check that it is less than (or equal to) the upper boundary.

+1
source

I found this binary interrupt algorithm in a Vuze project (aka azureus) :

 public IpRange isInRange(long address_long) { checkRebuild(); if (mergedRanges.length == 0) { return (null); } // assisted binary chop int bottom = 0; int top = mergedRanges.length - 1; int current = -1; while (top >= 0 && bottom < mergedRanges.length && bottom <= top) { current = (bottom + top) / 2; IpRange e = mergedRanges[current]; long this_start = e.getStartIpLong(); long this_end = e.getMergedEndLong(); if (address_long == this_start) { break; } else if (address_long > this_start) { if (address_long <= this_end) { break; } // lies to the right of this entry bottom = current + 1; } else if (address_long == this_end) { break; } else { // < this_end if (address_long >= this_start) { break; } top = current - 1; } } if (top >= 0 && bottom < mergedRanges.length && bottom <= top) { IpRange e = mergedRanges[current]; if (address_long <= e.getEndIpLong()) { return (e); } IpRange[] merged = e.getMergedEntries(); if (merged == null) { //inconsistent merged details - no entries return (null); } for (IpRange me : merged) { if (me.getStartIpLong() <= address_long && me.getEndIpLong() >= address_long) { return (me); } } } return (null); } 

It seems to work pretty well. If you know something faster, let me know.

+1
source

If you only have a CIDR address (or list of them) and you want to check if any ipAddress is within that CIDR (or CIDR list), simply define a Set of SubnetUtils object.

If you do not filter very large N addresses, this is all String comparison and is very fast. You do not need to create a binary tree based on higher / lower order bits and all this complex Jazz.

 String subnet = "192.168.1.0/24"; SubnetUtils utils = new SubnetUtils(subnet); //... //for each subnet, create a SubnetUtils object Set<SubnetUtils> subnets = getAllSubnets(); //... 

Use the Guava predicate to filter ipAddresses that are not in your subnet range:

  Set<String> ipAddresses = getIpAddressesToFilter(); Set<String> ipAddressesInRange = Sets.filter(ipAddresses, filterIpsBySubnet(subnets)) Predicate<String> filterIpsBySubnet(final Set<SubnetUtils> subnets){ return new Predicate<String>() { @Override public boolean apply(String ipAddress) { for (SubnetUtils subnet : subnets) { if (subnet.getInfo().isInRange(ipAddress)) { return true; } } return false; } }; } 

Now, if the IP is on any of the subnets, you have a simple simple filter, and you do not need to create a data structure that you will need a unit test. If this is not effective enough, go to optimization. Do not prematurely optimize :)

+1
source

Here is the beginning of the answer, I'll be back when I get more free time

Setup:

  • Sort ranges by starting number.
  • Since these are IP addresses, I assume that none of the ranges overlap. If there is overlap, you should probably start merging ranges of lists and trim unwanted ranges (for example, if you have a range of 1 to 10, you can trim a range of 5 to 7).
    • To combine or crop, do this (suppose range a immediately precedes range b):
      • If b.end <a.end, then range b is a subset of range a, and you can remove range b.
      • If b.start <b.end and b.end> a.end, then you can combine ranges a and b. Set a.end = b.end, then delete range b.
0
source

All Articles