How to efficiently create the vector and index of this vector when processing a data stream?

Question

How to efficiently create the vector and index of this vector when processing a data stream?

I have a Foo structure:

 struct Foo { v: String, // Other data not important for the question }

I want to process the data stream and save the result in Vec<Foo> , and also create an index for this Vec<Foo> in the Foo::v field.

I want to use HashMap<&str, usize> for an index where the keys are &Foo::v and the value is the position in Vec<Foo> , but I am open to other suggestions.

I want to make data flow processing as fast as possible, which requires not to do obvious things twice.

For example, I want:

allocate String only once per data stream, reading
do not search the index twice, check once that the key does not exist, once to insert a new key.
do not increase runtime with Rc or RefCell .

The borrowing controller does not allow this code:

 let mut l = Vec::<Foo>::new(); { let mut hash = HashMap::<&str, usize>::new(); //here is loop in real code, like: //let mut s: String; //while get_s(&mut s) { let s = "aaa".to_string(); let idx: usize = match hash.entry(&s) { //a Occupied(ent) => { *ent.get() } Vacant(ent) => { l.push(Foo { v: s }); //b ent.insert(l.len() - 1); l.len() - 1 } }; // do something with idx }

There are several problems:

hash.entry takes the key, so s must have a "longer" lifetime than hash
I want to move s in line (b), while I have a read-only link in line (a)

So, how do I implement this simple algorithm without calling String::clone or calling HashMap::get after calling HashMap::insert ?

+7

rust

user1244932 Apr 17 '17 at 22:13

source share

3 answers

Shepmaster · Answer 1 · 2017-04-23T01:00:22+0000

In general, what you are trying to accomplish is unsafe, and Rust correctly prevents you from doing something you shouldn't. For a simple example, consider a Vec<u8> . If a vector has one element and unit capacity, adding another value to the vector will redistribute and copy all the values in the vector, which will invalidate any links in the vector. This will cause all your keys in your index to point to arbitrary memory addresses, which will lead to unsafe behavior. The compiler prevents this.

In this case, there are two additional pieces of information that the compiler does not know, but the programmer does not:

An additional indirect notation is added there - String , so moving the pointer to this heap distribution is not a problem.
String will never be changed. If this were so, then he could redistribute by canceling the address mentioned.

In such cases, it is normal to use unsafe code if you correctly document why it is unsafe .

 use std::collections::HashMap; use std::mem; #[derive(Debug)] struct Player { name: String, } fn main() { let names = ["alice", "bob", "clarice", "danny", "eustice", "frank"]; let mut players = Vec::new(); let mut index = HashMap::new(); for &name in &names { let player = Player { name: name.into() }; let idx = players.len(); // INSERT REASON WHY THIS CODE IS NOT UNSAFE let stable_name: &str = unsafe { mem::transmute(&*player.name) }; players.push(player); index.insert(idx, stable_name); } for (k, v) in &index { println!("{:?} -> {:?}", k, v); } for v in &players { println!("{:?}", v); } }

However, I assume that you do not want to use this code in your main method, but want to return it from some function. This will be a problem since you will quickly come across Why can't I keep the value and the link to that value in the same structure? .

Honestly, there are code styles that don't fit into Rust’s limitations. If you come across this, you can:

Decide that Rust is not suitable for you or your problem.
use unsafe code, preferably thoroughly tested and only exposing a secure API.
Explore alternative views.

For example, I would probably rewrite the code so that the index is the primary owner of the key:

 use std::collections::BTreeMap; #[derive(Debug)] struct Player<'a> { name: &'a str, data: &'a PlayerData, } #[derive(Debug)] struct PlayerData { hit_points: u8, } #[derive(Debug)] struct Players(BTreeMap<String, PlayerData>); impl Players { fn new<I, S>(iter: I) -> Self where I: IntoIterator<Item = S>, S: AsRef<str>, { let players = iter.into_iter() .map(|name| (name.as_ref().to_string(), PlayerData { hit_points: 100 })) .collect(); Players(players) } fn get<'a>(&'a self, name: &'a str) -> Option<Player<'a>> { self.0.get(name).map(|data| { Player { name: name, data: data, } }) } } fn main() { let names = ["alice", "bob", "clarice", "danny", "eustice", "frank"]; let players = Players::new(&names); for (k, v) in &players.0 { println!("{:?} -> {:?}", k, v); } println!("{:?}", players.get("eustice")); }

Alternatively, as shown in What is an idiomatic way to make a lookup table that uses an element field as a key? , you can wrap your type and save it in a container with a set:

 use std::collections::BTreeSet; #[derive(Debug, PartialEq, Eq)] struct Player { name: String, hit_points: u8, } #[derive(Debug, Eq)] struct PlayerByName(Player); impl PlayerByName { fn key(&self) -> &str { &self.0.name } } impl PartialOrd for PlayerByName { fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> { Some(self.cmp(other)) } } impl Ord for PlayerByName { fn cmp(&self, other: &Self) -> std::cmp::Ordering { self.key().cmp(&other.key()) } } impl PartialEq for PlayerByName { fn eq(&self, other: &Self) -> bool { self.key() == other.key() } } impl std::borrow::Borrow<str> for PlayerByName { fn borrow(&self) -> &str { self.key() } } #[derive(Debug)] struct Players(BTreeSet<PlayerByName>); impl Players { fn new<I, S>(iter: I) -> Self where I: IntoIterator<Item = S>, S: AsRef<str>, { let players = iter.into_iter() .map(|name| PlayerByName(Player { name: name.as_ref().to_string(), hit_points: 100 })) .collect(); Players(players) } fn get(&self, name: &str) -> Option<&Player> { self.0.get(name).map(|pbn| &pbn.0) } } fn main() { let names = ["alice", "bob", "clarice", "danny", "eustice", "frank"]; let players = Players::new(&names); for player in &players.0 { println!("{:?}", player.0); } println!("{:?}", players.get("eustice")); }

do not increase runtime with Rc or RefCell

Guessing performance characteristics without profiling is never a good idea. I honestly don’t think that there will be a noticeable performance loss from incrementing an integer when cloning or deleting a value. If this problem requires both an index and a vector, then I would achieve some sort of sharing.

Matthieu M. · Answer 2 · 2017-04-27T11:38:53+0000

do not increase runtime with Rc or RefCell .

@Shepmaster has already demonstrated this using unsafe as soon as you ask you to check how much Rc will really cost you. Unfortunately, Rc<str> not easy to create , as it resolves the problem here.

However, just in case, here is the full version with Rc (dumb). It is safe and probably quite effective. A more efficient version can be made by creating Rc<[u8]> and converting it to Rc<str> with a small number of unsafe in the Foo constructor.

 use std::collections::HashMap; use std::collections::hash_map::Entry; use std::rc::Rc; #[derive(Debug)] struct Foo { v: Rc<String> } #[derive(Debug)] struct Collection { vec: Vec<Foo>, index: HashMap<Rc<String>, usize>, } impl Foo { fn new(s: &str) -> Foo { Foo { v: Rc::new(s.into()) } } } impl Collection { fn new() -> Collection { Collection { vec: Vec::new(), index: HashMap::new() } } fn insert(&mut self, foo: Foo) { match self.index.entry(foo.v.clone()) { Entry::Occupied(o) => panic!( "Duplicate entry for: {}, {:?} inserted before {:?}", foo.v, o.get(), foo ), Entry::Vacant(v) => v.insert(self.vec.len()), }; self.vec.push(foo) } } fn main() { let mut collection = Collection::new(); for foo in vec![Foo::new("Hello"), Foo::new("World"), Foo::new("Go!")] { collection.insert(foo) } println!("{:?}", collection); }

Peter Hall · Answer 3 · 2017-04-17T22:47:45+0000

Mistake:

 error: `s` does not live long enough --> <anon>:27:5 | 16 | let idx: usize = match hash.entry(&s) { //a | - borrow occurs here ... 27 | } | ^ `s` dropped here while still borrowed | = note: values in a scope are dropped in the opposite order they are created

note: at the end is the answer.

s must survive hash because you use &s as the key in the HashMap . This link will become invalid when dropping s . But, as noted in the note, hash will be discarded after s . A quick fix is to replace the order of their declarations:

 let s = "aaa".to_string(); let mut hash = HashMap::<&str, usize>::new();

But now you have another problem:

 error[E0505]: cannot move out of `s` because it is borrowed --> <anon>:22:33 | 17 | let idx: usize = match hash.entry(&s) { //a | - borrow of `s` occurs here ... 22 | l.push(Foo { v: s }); //b | ^ move out of `s` occurs here

This is more obvious. s borrowed by Entry , which will wait until the end of the block. Cloning s record the following:

 l.push(Foo { v: s.clone() }); //b

I want to select s only once, and not clone it

But the type of Foo.v is String , so in any case, it will have its own copy of str . Only this type means you must copy s .

Instead, you can replace it with &str , which allows it to remain a reference in s:

 struct Foo<'a> { v: &'a str, } pub fn main() { // s now lives longer than l let s = "aaa".to_string(); let mut l = Vec::<Foo>::new(); { let mut hash = HashMap::<&str, usize>::new(); let idx: usize = match hash.entry(&s) { Occupied(ent) => { *ent.get() } Vacant(ent) => { l.push(Foo { v: &s }); ent.insert(l.len() - 1); l.len() - 1 } }; } }

Note that I used to have to move the declaration s to hash so that it survived it. But now l contains a reference to s , so it needs to be declared even earlier, so that it survives l .

How to efficiently create the vector and index of this vector when processing a data stream?

More articles: