This is not an easy question. Computer architecture is, unsurprisingly, complex. Below are some guidelines, but even these are simplifications. A lot of this comes down to your application and the constraints you work within (both business and technical).
CPUs have several levels of cache (2-3 levels are common) on the chip. Some modern CPUs also have a memory controller on the die. This can significantly speed up memory exchange between cores; memory traffic between sockets has to travel over an external bus, which tends to be slower.
Complicating all of this is the bus architecture. Intel's Core 2 Duo/Quad parts use a shared bus. Think of it like Ethernet or cable internet: there is only so much bandwidth, and every new participant just takes another share of the whole. AMD/ATI chips use HyperTransport, which is a point-to-point protocol. Core i7 and the newer Xeons use QuickPath, which is very similar to HyperTransport.
More cores in a single socket will take up less space, draw less power, and cost less (unless you are using really low-power processors), both for the processor itself and for the surrounding hardware (motherboards, for example).
Generally speaking, a single-socket system will be the cheapest (in both hardware and software), because you can use commodity hardware. As soon as you move to a second socket, you usually have to use different chipsets, more expensive motherboards, and often more expensive RAM (e.g. ECC fully-buffered RAM), so there is a huge cost jump from one socket to two. This is one reason many large sites (including Flickr, Google, and others) use thousands of commodity servers (although Google's servers are somewhat customized to include things like a 9V battery, the principle is the same).
Your edit doesn't change much. "Performance" is a very subjective notion. Performance at what? Keep in mind that if your application isn't multi-threaded (or multi-process) enough to take advantage of extra cores, adding more cores can actually decrease performance.
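To make the point concrete, here is a minimal Python sketch of the distinction: a CPU-bound task only uses extra cores if the work is explicitly split across workers. The function names (`burn`, `run_serial`, `run_parallel`) are illustrative, not from any particular library.

```python
# Sketch: extra cores only help a CPU-bound task if the work is actually
# divided among workers. Names here are illustrative assumptions.
import multiprocessing
import os


def burn(n):
    """CPU-bound busy work: sum of squares up to n."""
    return sum(i * i for i in range(n))


def run_serial(chunks):
    """Runs every chunk on one core, one after another."""
    return [burn(n) for n in chunks]


def run_parallel(chunks):
    """Spreads chunks across a pool sized to os.cpu_count() by default."""
    with multiprocessing.Pool() as pool:
        return pool.map(burn, chunks)


if __name__ == "__main__":
    chunks = [200_000] * 8
    # Same answers either way; only the parallel version can use >1 core.
    assert run_serial(chunks) == run_parallel(chunks)
    print("cores available:", os.cpu_count())
```

If the workload stays in `run_serial`, buying more cores buys you nothing; the single-threaded path is exactly why more cores can even hurt (lower clocks per core, more coordination overhead).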
I/O-bound applications probably won't favor one option over the other; after all, they are limited by I/O, not by the CPU.
For compute-bound applications, it depends on the nature of the computation. If you do a lot of floating-point work, you may gain far more by offloading the calculation to a GPU (for example, with Nvidia CUDA); the performance gains can be enormous. Look at the GPU client for Folding@Home for an example.
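The kind of workload that offloads well is one where the same floating-point operation is applied independently to every element. A real CUDA version would launch this as a kernel over thousands of threads; the pure-Python sketch below only shows the shape of the computation, and `saxpy` is the conventional name for this classic example, not a library call.

```python
# Sketch of the data-parallel pattern that maps well to a GPU: the same
# floating-point operation applied independently to every element.
# On a GPU each element would be handled by its own thread; here a plain
# loop stands in for that, purely for illustration.
def saxpy(a, x, y):
    """Element-wise y <- a*x + y, the classic GPU 'hello world'."""
    return [a * xi + yi for xi, yi in zip(x, y)]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    y = [10.0, 20.0, 30.0]
    print(saxpy(2.0, x, y))  # [12.0, 24.0, 36.0]
```

If your hot loop looks like this (no dependencies between elements), a GPU can run millions of such elements at once, which is where the "tremendous" gains come from.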
In short, your question does not lend itself to a concrete answer, because the topic is complex and there isn't enough information. The technical architecture has to be designed for the specific application.