In general, you want your block and grid dimensions to match your data while maximizing occupancy, that is, how many threads are active simultaneously. The main factors affecting occupancy are shared memory usage, register usage, and thread block size.
A CUDA-enabled GPU has its processing capability split up into SMs (streaming multiprocessors), and the number of SMs depends on the actual card, but here we'll focus on a single SM for simplicity (they all behave the same). Each SM has a finite number of 32-bit registers, a finite amount of shared memory, a maximum number of active blocks, and a maximum number of active threads. These numbers depend on the CC (compute capability) of your GPU and can be found in the middle of the Wikipedia article http://en.wikipedia.org/wiki/CUDA .
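If you'd rather not dig through that table, the runtime API can report most of these limits directly. A minimal sketch (the per-SM register and shared memory fields assume a reasonably recent CUDA toolkit):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    printf("Compute capability:      %d.%d\n", prop.major, prop.minor);
    printf("SM count:                %d\n", prop.multiProcessorCount);
    printf("Max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("32-bit registers per SM: %d\n", prop.regsPerMultiprocessor);
    printf("Shared memory per SM:    %zu bytes\n", prop.sharedMemPerMultiprocessor);
    return 0;
}
```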
First of all, your thread block size should always be a multiple of 32, because the hardware issues instructions in warps (groups of 32 threads). For example, if you have a block size of 50 threads, the GPU will still issue instructions to 64 threads and you'd just be wasting the other 14.
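As a sketch of the usual pattern: pick a warp-multiple block size up front, then round the grid size up so the whole data set is covered (`myKernel` and the doubling it does are just placeholders):

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: doubles each of n elements.
__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // guard: the last block may be partly idle
}

void launch(float *d_data, int n) {
    int threadsPerBlock = 256;   // always a multiple of the 32-thread warp
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up to cover n
    myKernel<<<blocks, threadsPerBlock>>>(d_data, n);
}
```

Because the grid is rounded up, the kernel needs that `if (i < n)` guard so the idle threads of the last block don't touch out-of-range data.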
Second, before worrying about shared memory and registers, try to size your blocks based on the maximum numbers of threads and blocks that correspond to your card's compute capability. Sometimes there are multiple ways to do this... for example, a CC 3.0 card can have 16 active blocks and 2048 active threads per SM. This means that with 128 threads per block, you could fit 16 blocks on an SM before hitting the 2048-thread limit. With 256 threads per block you can only fit 8, but you're still using all of the available threads and will still have full occupancy. However, with 64 threads per block only 1024 threads are in use once you hit the 16-block limit, so only 50% occupancy. If shared memory and register usage are not a bottleneck, this should be your main concern (other than your data dimensions).
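To make that arithmetic concrete, here's a tiny host-side sketch that reproduces the 64/128/256 comparison (the two limits are the CC 3.0 numbers from above; registers and shared memory are ignored):

```cuda
#include <algorithm>
#include <cstdio>

int main() {
    const int maxBlocksPerSM  = 16;    // CC 3.0 block limit per SM
    const int maxThreadsPerSM = 2048;  // CC 3.0 thread limit per SM

    for (int blockSize = 64; blockSize <= 256; blockSize *= 2) {
        int blocksByThreads = maxThreadsPerSM / blockSize;             // capped by the thread limit
        int activeBlocks = std::min(blocksByThreads, maxBlocksPerSM);  // capped by the block limit
        int activeThreads = activeBlocks * blockSize;
        printf("%3d threads/block -> %2d blocks, %4d threads, %3.0f%% occupancy\n",
               blockSize, activeBlocks, activeThreads,
               100.0 * activeThreads / maxThreadsPerSM);
    }
    return 0;
}
```

This prints 50% occupancy for 64-thread blocks and 100% for both 128 and 256, matching the reasoning above.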
On the topic of your grid... the blocks in your grid are spread out over the SMs to start, and the remaining blocks are placed into a pipeline. Blocks are moved onto an SM for processing as soon as that SM has enough resources to take the block, so as blocks complete on an SM, new ones are moved in. You could make the argument that smaller blocks (128 instead of 256 in the previous example) may complete faster, since a particularly slow block will hog fewer resources, but this depends very much on the code.
Regarding registers and shared memory, look at these next, as they may be limiting your occupancy. Shared memory is finite for a whole SM, so try to use it in an amount that allows as many blocks as possible to still fit on an SM at once. The same goes for register use. Again, these limits depend on compute capability and are tabulated on the Wikipedia page.
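As a closing sketch, the runtime can tell you both what a kernel actually uses and how many blocks would fit per SM as a result. `myKernel` is again a placeholder, and the occupancy call assumes CUDA 6.5 or newer:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel, just so we have something to inspect.
__global__ void myKernel(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i * 2.0f;
}

int main() {
    // How many registers and how much static shared memory does the kernel use?
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);
    printf("registers/thread: %d, static shared mem: %zu bytes\n",
           attr.numRegs, attr.sharedSizeBytes);

    // Let the runtime work out how many blocks of 256 threads fit on one SM,
    // taking registers and shared memory into account.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, 256, 0);
    printf("active blocks per SM at 256 threads/block: %d\n", blocksPerSM);
    return 0;
}
```

Good luck!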