Vectorization (as the term is commonly used) refers to SIMD (single instruction, multiple data) operation.
In essence, one instruction carries out the same operation on multiple operands in parallel. For example, to multiply a vector of size N by a scalar, let M be the number of operands the hardware can operate on simultaneously. The number of instructions needed is then roughly N / M, whereas with purely scalar operations the processor would have to execute N instructions.
For example, Intel's current AVX 2 instruction set uses 256-bit registers. Each register can hold (and operate on) a set of 4 operands of 64 bits apiece, or 8 operands of 32 bits apiece.
So, assuming you are dealing with 32-bit single-precision floating-point numbers, one instruction can carry out 8 operations (multiplications, in your case) at once, so (at least in theory) you can finish N multiplications using only N / 8 multiplication instructions. At least in theory, that should let the operation finish about 8 times faster than executing one operation per instruction would allow.
Of course, the exact benefit depends on how many operands each instruction can handle. Intel's SIMD support originally used only 64-bit registers, so to work on 8 elements at once, each element could be at most 8 bits. They currently support 256-bit registers, and have announced support for 512-bit registers (they may even have shipped it in a few high-performance processors, but not in ordinary consumer processors, at least for now). Making good use of this capability can also be nontrivial, to put it mildly. Scheduling instructions so that you actually have N operands available, in the right place at the right time, is not necessarily an easy task (in general).
To put things in perspective, the (now ancient) Cray 1 got its speed largely that way. Its vector unit operated on sets of 64 registers of 64 bits apiece, so it could do 64 double-precision operations per clock cycle. On optimally vectorized code, it came much closer to the speed of a current CPU than you might expect based solely on its (much lower) clock speed. Taking full advantage of that was not always easy, though (and still isn't).
Keep in mind, however, that vectorization is not the only way a CPU can carry out operations in parallel. There is also instruction-level parallelism, which lets a single processor (or a single core) execute multiple instructions simultaneously. Most modern processors include hardware to (theoretically) execute up to 4 instructions per clock cycle, if the instructions are a mix of loads, stores, and ALU operations. They can fairly routinely execute close to 2 instructions per clock on average, or more in well-tuned loops when memory is not a bottleneck.
Then, of course, there is multithreading: running multiple streams of instructions on (at least logically) separate processors/cores.
Thus, a modern processor might have, say, 4 cores, each of which can issue 2 vector multiplications per clock, with each of those instructions operating on 8 operands. So, at least in theory, it can perform 4 * 2 * 8 = 64 operations per clock cycle.
Some instructions have better or worse throughput than others. For example, FP add throughput is lower than that of FMA or multiply on Intel before Skylake (1 vector per clock instead of 2). But simple boolean logic like AND or XOR has a throughput of 3 vectors per clock; an AND/XOR/OR execution unit does not take many transistors to build, so CPUs replicate them. With high-throughput instructions, bottlenecks on the total pipeline width (the front-end that decodes and issues into the out-of-order part of the core) are common, rather than bottlenecks on a particular execution unit.