First of all, I apologize if this topic has already been discussed elsewhere, but I could not find anything relevant when searching.
My problem is this: I have 4 named vectors with partially overlapping names, and I want to organize all this data into a matrix. The final matrix should have one row for every name that is present in at least one of the input vectors. I used the following code:
IDs <- unique(c(names(v1), names(v2), names(v3), names(v4)))
mat <- matrix(c(v1[IDs], v2[IDs], v3[IDs], v4[IDs]), nrow=length(IDs), ncol=4)
mat[is.na(mat)] <- 0
It works well, but since I have more than 2.2 million records in total, it is very slow (it took 2.5 days to run ...). Therefore, I am looking for a way to speed up the process.
I tried other structures (for example, building a data frame instead of a matrix), but without much improvement. After some tests, the bottleneck seems to be the following step (even when it is run on its own):
v1[IDs]
which is repeated for each of the vectors (1 to 4). Note that usually only ~50% of the names overlap between any two vectors (so only about 50% of the IDs used for indexing are actually present in the names of a given vector).
I monitored CPU and memory usage during the run, and it does not seem to be a memory problem (6 GB remained free during the process).
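To make the step concrete, here is a minimal sketch of one alternative that may be worth timing: looking up positions with match() and indexing by integer position instead of by character name. The toy vectors and the list name vecs are assumptions for illustration only, standing in for the real v1 to v4, and I have not measured this on the full data.

```r
# Toy stand-ins for the real vectors (assumption: named numeric vectors).
v1 <- c(a = 1, b = 2, c = 3)
v2 <- c(b = 10, d = 20)
v3 <- c(a = 100, d = 200, e = 300)
v4 <- c(e = 1000)

vecs <- list(v1, v2, v3, v4)  # hypothetical helper list
IDs <- unique(unlist(lapply(vecs, names)))

# match() returns, for each ID, its position in names(v), or NA if absent.
# Indexing by those integer positions avoids character subscripting;
# an NA position yields an NA value, just like a missing name would.
mat <- vapply(
  vecs,
  function(v) unname(v[match(IDs, names(v))]),
  numeric(length(IDs))
)
rownames(mat) <- IDs

# Same final step as in the original code: absent entries become 0.
mat[is.na(mat)] <- 0
```

The result should have the same shape as the original mat (one row per ID, one column per vector), so the two versions can be compared on a small subset with system.time() before running on all records.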