How can a class be implemented, such as the .NET ConcurrentBag <T>?
I am very intrigued by the existence of the ConcurrentBag<T> class in the following .NET 4.0 platform:
Bags are useful for storing objects when ordering does not matter, and unlike sets, bags support duplicates.
My question is: how can this idea be implemented? Most of the collections that I am familiar with basically make up (under the hood) some form of array, in which order there can be "matter", but there is order (which is why, even if it is not necessary, the listing will almost always go through an unchanged collection, be it List , Queue , Stack , etc. in the same sequence).
If I were to guess, I could assume that inside it could be Dictionary<T, LinkedList<T>> ; but this actually seems rather doubtful, since it does not make sense to use any type of T as a key.
What I expect / hope is that it is actually an already established type of object that has already been "figured out" and that someone who knows about this established type can tell me about it. This is so unusual for me - one of those concepts that is easy to understand in real life, but difficult to translate into a useful class as a developer - thatβs why I am interested in opportunities.
EDIT
Some respondents suggested that a Bag could be a hash table form inside. This was my initial thought, but I foresaw two problems with this idea:
- A hash table is not so useful when you do not have a suitable hash code function for the type in question.
- Simple tracking of the βaccountβ object in the collection does not coincide with the storage of the object.
As suggested by Meta-Knight, perhaps an example will make this clearer:
public class ExpensiveObject() { private ExpensiveObject() { // very intense operations happening in here } public ExpensiveObject CreateExpensiveObject() { return new ExpensiveObject(); } } static void Main() { var expensiveObjects = new ConcurrentBag<ExpensiveObject>(); for (int i = 0; i < 5; i++) { expensiveObjects.Add(ExpensiveObject.CreateExpensiveObject()); } // after this point in the code, I want to believe I have 5 new // expensive objects in my collection while (expensiveObjects.Count > 0) { ExpensiveObject expObj = null; bool objectTaken = expensiveObjects.TryTake(out expObj); if (objectTaken) { // here I THINK I am queueing a particular operation to be // executed on 5 separate threads for 5 separate objects, // but if ConcurrentBag is a hashtable then I've just received // the object 5 times and so I am working on the same object // from 5 threads at the same time! ThreadPool.QueueUserWorkItem(DoWorkOnExpensiveObject, expObj); } else { break; } } } static void DoWorkOnExpensiveObject(object obj) { ExpensiveObject expObj = obj as ExpensiveObject; if (expObj != null) { // some work to be done } } If you look at the details of the ConcurrentBag<T> , you will find that it is inside a basically customizable linked list.
Since Bags may contain duplicates and are not accessible by index, a bidirectional list is a very good option to implement. This allows you to block a rather fine-grained area for insertion and deletion (you do not need to block the entire collection, only the nodes where you insert / delete). Since you are not worried about duplicates, this is not related to hashing. This makes a double link list ideal.
Here is some good info about ConcurrentBag: http://geekswithblogs.net/BlackRabbitCoder/archive/2011/03/03/c.net-little-wonders-concurrentbag-and-blockingcollection.aspx
How ConcurrentBag works is to take advantage of the new ThreadLocal type (new in System.Threading for .NET 4.0) so that each thread using the bag has a list locally for this topic only.
This means that adding or removing a local stream list requires very low synchronization. The problem arises where the thread goes to consume the item but its local list is empty. In this case, if the bag performs a "theft of work", where it will rob an item from another that contains elements in its list. This requires a higher level of synchronization, which adds a bit of overhead to the operation.
Since ordering doesn't matter, ConcurrentBag can use a hash table behind the scenes to provide quick data retrieval. But unlike the Hashset, the bag accepts duplicates. Perhaps each element can be paired with the Count property, which is set to 1 when adding an element. If you add the same element a second time, you can simply increase the Count property of that element.
Then, to remove an item with a number greater than one, you can simply reduce the number for that item. If there was only one account, you would remove the Item-Count pair from the hash table.
Well, in smalltalk (where the concept of Bag came from), the collection is basically the same as a hash, although it does allow duplication. Instead of storing a duplicate object, it maintains a "number of errors", for example, recounting each object. If ConcurrentBag is the right implementation, this should give you a starting point.
I believe the concept of "bag" is synonymous with "Multiset".
There are several "Bag" / "Multiset" implementations (they seem to be java) that are open source if you are interested in how they are implemented.
These implementations show that the βBagβ can be implemented in any number of ways depending on your needs. There are examples of TreeMultiset, HashMultiset, LinkedHashMultiset, ConcurrentHashMultiset.
Google Collections
Google has a number of "MultiSet" implementations , one of which is ConcurrentHashMultiset.
Apache commons
Apache has a number of Bag implementations.