Optimization of vector indexing and matrix creation in R

First of all, I am sorry if this topic was already discussed somewhere else, but I could not find anything suitable in the search.

My problem is this: I have 4 vectors with partially overlapping names, and I want to organize all this data into a matrix. I want the final matrix to have an entry for all the names that are present in at least one of the input vectors. I used the following code.

IDs <- unique(c(names(v1), names(v2), names(v3), names(v4))) 
mat <- matrix(c(v1[IDs], v2[IDs], v3[IDs], v4[IDs]), nrow=length(IDs), ncol=4)
mat[is.na(mat)] <- 0 
# This last line is to convert NAs generated when the entry isn't present in all vectors into 0 values. 
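
To make the intended result concrete, here is a minimal sketch of the same code run on four tiny vectors. The vector contents and the column labels are made up for illustration; only the three statements from above are taken from the original code.

```r
# Hypothetical toy vectors with partially overlapping names (illustration only).
v1 <- c(a = 1, b = 2)
v2 <- c(b = 3, c = 4)
v3 <- c(a = 5, d = 6)
v4 <- c(c = 7, d = 8)

# Union of all names, in order of first appearance: "a" "b" "c" "d".
IDs <- unique(c(names(v1), names(v2), names(v3), names(v4)))

# Indexing by a name that is absent from a vector yields NA;
# matrix() fills column by column, so each vector becomes one column.
mat <- matrix(c(v1[IDs], v2[IDs], v3[IDs], v4[IDs]),
              nrow = length(IDs), ncol = 4,
              dimnames = list(IDs, paste0("v", 1:4)))  # labels added for readability

# Replace the NAs produced by absent names with 0.
mat[is.na(mat)] <- 0
print(mat)
```

Each row corresponds to one name and each column to one input vector, with 0 wherever a name does not occur in that vector.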

It works well, but since I have more than 2.2 million records in total, it is very slow (it took 2.5 days to run ...). Therefore, I am looking for a way to speed up the process.

I tried other structures (for example, creating a data frame instead of a matrix), but without much improvement. After some tests, the bottleneck seems to be the following step (even when it is run on its own):

v1[IDs]

which is repeated for each of the vectors (from 1 to 4). Note that usually only about 50% of the names overlap between two vectors (and therefore only about 50% of the names used for indexing are actually present in a given vector's names).
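
One alternative worth timing is to replace the character indexing with an explicit match() followed by integer indexing. This is a sketch of a different technique, not the code from the question; the helper name lookup and its fill argument are hypothetical, and whether it is actually faster on 2.2 million names should be checked on the real data.

```r
# Hypothetical helper: fetch the values of `v` for the names in `ids`,
# using match() to compute integer positions instead of indexing by name.
lookup <- function(v, ids, fill = 0) {
  idx <- match(ids, names(v))   # position of each id in names(v), NA if absent
  out <- v[idx]                 # integer indexing; an NA position yields NA
  out[is.na(idx)] <- fill       # fill only the entries whose name was absent
  unname(out)
}

# Toy data (made up for illustration).
v1 <- c(a = 1, b = 2)
v2 <- c(b = 3, c = 4)
IDs <- unique(c(names(v1), names(v2)))

mat <- cbind(lookup(v1, IDs), lookup(v2, IDs))
rownames(mat) <- IDs
print(mat)
```

Unlike mat[is.na(mat)] <- 0, this fills only the positions where a name is absent, so genuine NA values stored in an input vector would be kept as NA.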

I monitored CPU and memory usage during the run, and it does not seem to be a memory problem (6 GB remained free throughout).



You can do this with dcast from the reshape2 package. First, combine the vectors into one long data.frame:

df <- rbind(data.frame(IDs=names(v1), value=v1, vec=1),
data.frame(IDs=names(v2), value=v2, vec=2),
data.frame(IDs=names(v3), value=v3, vec=3),
data.frame(IDs=names(v4), value=v4, vec=4))

Then cast it to wide format:

library(reshape2)
# The ID column is named "IDs" in df, so the formula must use the same case.
dcast(df, IDs ~ vec, value.var="value")

The result is a data.frame, which you can then convert to a matrix.
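
Going from the data.frame to a matrix could look roughly like this. The object wide below is a hand-built stand-in for what dcast returns (an ID column followed by one column per vector), so the sketch runs without reshape2; its values are made up.

```r
# Stand-in for the wide data.frame returned by dcast(); values are illustrative.
wide <- data.frame(IDs = c("a", "b", "c"),
                   `1` = c(1, 2, NA),
                   `2` = c(NA, 3, 4),
                   check.names = FALSE)   # keep "1" and "2" as column names

mat <- as.matrix(wide[, -1])              # drop the ID column, keep the values
rownames(mat) <- as.character(wide$IDs)   # carry the IDs over as row names
mat[is.na(mat)] <- 0                      # absent entries become 0, as in the question
print(mat)
```

Dropping the ID column before as.matrix() keeps the matrix numeric; converting the whole data.frame at once would coerce every cell to character.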

The speedup depends on N: about 5x for N=5000, 30x for N=10000, and 67x for N=50000, where N is the length of v1.
