Best way to parse and search a string?

I was looking for acceleration of a basic Python function that basically just takes a line of text and checks for a substring string. The Python program is as follows:

import time def fun(line): l = line.split(" ", 10) if 'TTAGGG' in l[9]: pass # Do nothing line = "FCC2CCMACXX:4:1105:10758:14389# 81 chrM 1 32 10S90M = 16151 16062 CATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTTTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCTTAGGGGATAGCATTG bbb^Wcbbbbccbbbcbccbba]WQG^bbcdcb_^_c_^`ccdddeeeeeffggggiiiiihiiiiihiiihihiiiihghhiihgfgfgeeeeebbb NM:i:1 AS:i:85 XS:i:65 RG:Z:1_DB31" time0 = time.time() for i in range(10000): fun(line) print time.time() - time0 

I wanted to see if I could use some of the high-level Rust features to get some performance, but the code is much slower. Rust Conversion:

 extern crate regex; extern crate time; use regex::Regex; fn main() { let line = "FCC2CCMACXX:4:1105:10758:14389# 81 chrM 1 32 10S90M = 16151 16062 CATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTTTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCTTAGGGGATAGCATTG bbb^Wcbbbbccbbbcbccbba]WQG^bbcdcb_^_c_^`ccdddeeeeeffggggiiiiihiiiiihiiihihiiiihghhiihgfgfgeeeeebbb NM:i:1 AS:i:85 XS:i:65 RG:Z:1_DB31"; let substring: &str = "TTAGGG"; let time0: f64 = time::precise_time_s(); for _ in 0..10000 { fun(line, substring); } let time1: f64 = time::precise_time_s(); let elapsed: f64 = time1 - time0; println!("{}", elapsed); } fn fun(line: &str, substring: &str) { let l: Vec<&str> = line.split(" ") .enumerate() .filter(|&(i, _)| i==9) .map(|(_, e) | e) .collect(); let re = Regex::new(substring).unwrap(); if re.is_match(&l[0]) { // Do nothing } } _ ^ _ c_ ^` ccdddeeeeeffggggiiiiihiiiiihiiihihiiiihghhiihgfgfgeeeeebbb NM: i: extern crate regex; extern crate time; use regex::Regex; fn main() { let line = "FCC2CCMACXX:4:1105:10758:14389# 81 chrM 1 32 10S90M = 16151 16062 CATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTTTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCTTAGGGGATAGCATTG bbb^Wcbbbbccbbbcbccbba]WQG^bbcdcb_^_c_^`ccdddeeeeeffggggiiiiihiiiiihiiihihiiiihghhiihgfgfgeeeeebbb NM:i:1 AS:i:85 XS:i:65 RG:Z:1_DB31"; let substring: &str = "TTAGGG"; let time0: f64 = time::precise_time_s(); for _ in 0..10000 { fun(line, substring); } let time1: f64 = time::precise_time_s(); let elapsed: f64 = time1 - time0; println!("{}", elapsed); } fn fun(line: &str, substring: &str) { let l: Vec<&str> = line.split(" ") .enumerate() .filter(|&(i, _)| i==9) .map(|(_, e) | e) .collect(); let re = Regex::new(substring).unwrap(); if re.is_match(&l[0]) { // Do nothing } } 

On my machine, Python multiplies this by 0.0065s versus Rusts 1.3946s.

Just checking some basic timings, the line.split() code takes about 1 s, and the regex step is about 0.4 s. Could this be right, or is there a problem with timing right?

+4
source share
2 answers

As a baseline, I ran your Python program using Python 2.7.6. Over 10 runs, it had an average time of 12.2 ms with a standard deviation of 443 μs. I do not know how you got a very good time 6.5ms .

Running Rust code with Rust 1.4.0-dev ( febdc3b20 ) without optimization, I got an average of 958 ms and a standard deviation of 33 ms.

Running the code with optimization ( cargo run --release ), I got an average of 34.6 ms and a standard deviation of 495 μs. Always benchmark in release mode .

Further optimizations you can do:

Compiling a regex once, out of sync loop:

 fn main() { // ... let substring = "TTAGGG"; let re = Regex::new(substring).unwrap(); // ... for _ in 0..10000 { fun(line, &re); } // ... } fn fun(line: &str, re: &Regex) { // ... } 

It produces an average of 10.4 ms with a standard deviation of 678 μs.

Switch to substring matching:

 fn fun(line: &str, substring: &str) { // ... if l[0].contains(substring) { // Do nothing } } 

It has an average value of 8.7 ms and a standard deviation of 334 μs.

And finally, if you look at only one result, and not at collection in a vector:

 fn fun(line: &str, substring: &str) { let col = line.split(" ").nth(9); if col.map(|c| c.contains(substring)).unwrap_or(false) { // Do nothing } } 

It has an average value of 6.30 ms and a standard deviation of 114 μs.

+8
source

Python direct translation will be

 extern crate time; fn fun(line: &str) { let mut l = line.split(" "); if l.nth(9).unwrap().contains("TTAGGG") { // do nothing } } fn main() { let line = "FCC2CCMACXX:4:1105:10758:14389# 81 chrM 1 32 10S90M = 16151 16062 CATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTTTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCTTAGGGGATAGCATTG bbb^Wcbbbbccbbbcbccbba]WQG^bbcdcb_^_c_^`ccdddeeeeeffggggiiiiihiiiiihiiihihiiiihghhiihgfgfgeeeeebbb NM:i:1 AS:i:85 XS:i:65 RG:Z:1_DB31"; let time0 = time::precise_time_s(); for _ in 0..10000 { fun(line); } println!("{}", time::precise_time_s() - time0); } _ ^ _ c_ ^` ccdddeeeeeffggggiiiiihiiiiihiiihihiiiihghhiihgfgfgeeeeebbb NM: i: extern crate time; fn fun(line: &str) { let mut l = line.split(" "); if l.nth(9).unwrap().contains("TTAGGG") { // do nothing } } fn main() { let line = "FCC2CCMACXX:4:1105:10758:14389# 81 chrM 1 32 10S90M = 16151 16062 CATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTTTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCTTAGGGGATAGCATTG bbb^Wcbbbbccbbbcbccbba]WQG^bbcdcb_^_c_^`ccdddeeeeeffggggiiiiihiiiiihiiihihiiiihghhiihgfgfgeeeeebbb NM:i:1 AS:i:85 XS:i:65 RG:Z:1_DB31"; let time0 = time::precise_time_s(); for _ in 0..10000 { fun(line); } println!("{}", time::precise_time_s() - time0); } 

Using cargo run --release on stable (1.2.0), I get about 0.0267 compared to 0.0240 for Python (CPython, 2.7.10). Given that Python in for strings is just a C routine, that makes sense.

Impressively, in beta (1.3.0) and nightly (1.4.0), this decreases to about 0.0122 , or about twice as fast as CPython!

+3
source

All Articles