I finally figured this problem out, and confirmed that my interpretation was correct by working through Z. Zhang's paper “Flexible Camera Calibration by Viewing a Plane from Unknown Orientations”, International Conference on Computer Vision (ICCV'99), Corfu, Greece, pp. 666-673, September 1999.
Let me explain everything from scratch. The next figure shows the original pinhole camera model and the projected result on the image sensor. However, this is not what we are supposed to see in the “image”.

What we are supposed to see is:

Comparing Figures 1 and 2, we should notice that the image on the sensor is flipped up-down and left-right. A friend of mine who works for a CMOS sensor company told me that sensors have built-in functions that automatically flip the captured image for display.
Since we want to model the relationship between image coordinates and world coordinates, we should treat the image sensor directly as the projection plane. What confused me earlier was that the projection is always drawn on the projected side, which misled me when I tried to understand the derivation geometrically.
Now we have to look at the image sensor from its "back side", as indicated by the blue (View Perspective) arrow.
The result is shown in Figure 2. The x1 and y1 axes now point to the right and down, respectively, so the equations are
x1 = -f*(X/Z),  y1 = -f*(Y/Z)
Now, in terms of the x-y coordinates, the equations are
x = f*(X/Z) + u0,  y = f*(Y/Z) + v0
which are the ones given in the paper.
Now let's look at an equivalent model that does not exist in the real world, but helps visual interpretation.

The principle is the same. Look from the center of projection toward the image plane. The result is:

where the projected "F" appears flipped left-right. The equations are
x1 = f*(X/Z),  y1 = f*(Y/Z)
Now, in terms of the x-y coordinates, the equations are
x = f*(X/Z) + u0,  y = f*(Y/Z) + v0
which are the ones given in the paper.
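To make the two models concrete, here is a minimal Python sketch of the equations above. The function names (`sensor_coords`, `virtual_plane_coords`, `to_image_origin`) are my own and not from the paper, and I assume the point is given in camera coordinates with Z > 0. It only illustrates that the coordinates on the physical sensor are the negation of the coordinates on the virtual image plane, and that adding (u0, v0) moves the origin.

```python
def sensor_coords(X, Y, Z, f):
    """Real pinhole model: x1 = -f*(X/Z), y1 = -f*(Y/Z)."""
    if Z <= 0:
        raise ValueError("Z must be positive (point in front of the camera)")
    return (-f * X / Z, -f * Y / Z)


def virtual_plane_coords(X, Y, Z, f):
    """Equivalent model with the image plane in front: x1 = f*(X/Z), y1 = f*(Y/Z)."""
    if Z <= 0:
        raise ValueError("Z must be positive (point in front of the camera)")
    return (f * X / Z, f * Y / Z)


def to_image_origin(x1, y1, u0, v0):
    """Shift the origin to the image corner: x = x1 + u0, y = y1 + v0."""
    return (x1 + u0, y1 + v0)


if __name__ == "__main__":
    # Hypothetical numbers chosen only for illustration.
    X, Y, Z, f = 2.0, -1.0, 4.0, 8.0
    xs, ys = sensor_coords(X, Y, Z, f)
    xv, yv = virtual_plane_coords(X, Y, Z, f)
    print("sensor:       ", (xs, ys))
    print("virtual plane:", (xv, yv))
    print("with offset:  ", to_image_origin(xv, yv, 10.0, 20.0))
```

The sign flip in both axes is the same as rotating the sensor image by 180 degrees about the optical axis, which is why the two models lead to the same final x-y equations.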
And last but not least, because world coordinates are measured in mm or inches while image coordinates are measured in pixels, there is a scaling factor, which some books write as
x = a*f*(X/Z) + u0,  y = b*f*(Y/Z) + v0
or
x = fx*(X/Z) + u0,  y = fy*(Y/Z) + v0
where fx = a*f and fy = b*f.
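As a rough sketch of this last step in Python: the function `project_to_pixels` and the numbers below are hypothetical, chosen only for illustration. I assume a and b are given in pixels per mm, f in mm, and the point in camera coordinates with Z > 0.

```python
def project_to_pixels(X, Y, Z, f, a, b, u0, v0):
    """Project a 3D point (camera coordinates) to pixel coordinates.

    Implements x = fx*(X/Z) + u0, y = fy*(Y/Z) + v0 with fx = a*f, fy = b*f.
    """
    if Z <= 0:
        raise ValueError("Z must be positive (point in front of the camera)")
    fx = a * f  # focal length expressed in pixels along x
    fy = b * f  # focal length expressed in pixels along y
    x = fx * (X / Z) + u0
    y = fy * (Y / Z) + v0
    return (x, y)


if __name__ == "__main__":
    # Assumed values: f = 4 mm, 250 pixels/mm on both axes,
    # principal point at (320, 240).
    print(project_to_pixels(0.5, -0.25, 2.0, f=4.0, a=250.0, b=250.0,
                            u0=320.0, v0=240.0))
```

Note that X, Y and Z only need to share the same unit, since only the ratios X/Z and Y/Z enter the equations; the conversion to pixels is carried entirely by fx and fy.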