Calculating world coordinates from screen coordinates with OpenCV

I have computed the intrinsic and extrinsic parameters of the camera using OpenCV. Now I want to calculate the world coordinates (x, y, z) from the screen coordinates (u, v).

How do I do this?

NB: since I use a Kinect, I already know the z coordinate.

Any help is greatly appreciated. Thanks!

2 answers

First, to understand how you calculate it, it will help if you read up on the pinhole camera model and simple perspective projection. Check this for a quick glimpse. I will try to update with more.

So, let's start with the inverse operation, describing how the camera works: it projects a 3D point in the world coordinate system onto a 2D point of our image. According to the camera model:

P_screen = I * P_world

or (using homogeneous coordinates)

| x_screen | = I * | x_world |
| y_screen |       | y_world |
| 1        |       | z_world |
                   | 1       |

Where

I = | f_x  0    c_x  0 |
    | 0    f_y  c_y  0 |
    | 0    0    1    0 |

is the 3x4 intrinsics matrix, f being the focal length and c the center of projection.

If you solve the system above, you get:

x_screen = (x_world/z_world)*f_x + c_x
y_screen = (y_world/z_world)*f_y + c_y

But you want to do the opposite, so your answer is:

x_world = (x_screen - c_x) * z_world / f_x
y_world = (y_screen - c_y) * z_world / f_y

z_world is the depth the Kinect returns to you, and you know f and c from your intrinsics calibration, so for every pixel you apply the above and get the actual world coordinates.
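
As a minimal sketch of that back-projection, assuming a CV_32F depth map already in the units you want and a 3x3 CV_64F camera matrix from calibration (the function and variable names are just illustrative):

#include <opencv2/core.hpp>
#include <vector>

// Back-project every pixel of a depth map into 3D camera coordinates
// using x = (u - c_x) * z / f_x and y = (v - c_y) * z / f_y.
std::vector<cv::Point3f> backproject(const cv::Mat& depth, const cv::Mat& K)
{
    const double fx = K.at<double>(0, 0);
    const double fy = K.at<double>(1, 1);
    const double cx = K.at<double>(0, 2);
    const double cy = K.at<double>(1, 2);

    std::vector<cv::Point3f> points;
    for (int v = 0; v < depth.rows; ++v)
    {
        for (int u = 0; u < depth.cols; ++u)
        {
            const float z = depth.at<float>(v, u);
            if (z <= 0.0f) continue;  // no depth measurement for this pixel
            const float x = static_cast<float>((u - cx) * z / fx);
            const float y = static_cast<float>((v - cy) * z / fy);
            points.push_back(cv::Point3f(x, y, z));
        }
    }
    return points;
}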

Edit 1 (why the above correspond to world coordinates, and what the extrinsics we get during calibration are):

Check this one first; it explains the various coordinate systems very well.

Your 3D coordinate systems are: Object ---> World ---> Camera. There is a transformation that takes you from the object coordinate system to the world, and another one that takes you from the world to the camera (the extrinsics you refer to). You usually assume that:

  • either the object system corresponds to the world system,
  • or the camera system corresponds to the world system.

1. When capturing an object using Kinect

When you use the Kinect to capture an object, what the sensor returns to you is the distance from the camera. That means the z coordinate is already in camera coordinates. By transforming x and y with the equations above, you get the point in camera coordinates.

Now, the world coordinate system is defined by you. One common approach is to assume that the camera sits at (0,0,0) of the world coordinate system. In that case, the extrinsics matrix actually corresponds to the identity matrix, and the camera coordinates you found correspond to world coordinates.

Sidenote: since the Kinect returns the z in camera coordinates, there is also no need for a transformation from the object coordinate system to the world coordinate system. Say, for example, that you had another camera that captured faces and, for each point, returned the distance from the nose (which you considered the center of the object coordinate system). In that case, since the returned values would be in the object coordinate system, we would indeed need a rotation and translation matrix to bring them into the camera coordinate system.
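
That rigid transformation is just P_camera = R * P_object + t. A small sketch, assuming you already have a rotation vector and translation vector (e.g. from cv::solvePnP or cv::calibrateCamera; the names here are illustrative):

#include <opencv2/core.hpp>
#include <opencv2/calib3d.hpp>

// Bring a point from the object coordinate system into the camera
// coordinate system: P_camera = R * P_object + t.
cv::Point3f objectToCamera(const cv::Point3f& pObj,
                           const cv::Mat& rvec, const cv::Mat& tvec)
{
    cv::Mat R;
    cv::Rodrigues(rvec, R);  // 3x1 rotation vector -> 3x3 rotation matrix

    cv::Mat p = (cv::Mat_<double>(3, 1) << pObj.x, pObj.y, pObj.z);
    cv::Mat pCam = R * p + tvec;  // rvec/tvec assumed CV_64F, as OpenCV returns them

    return cv::Point3f(static_cast<float>(pCam.at<double>(0)),
                       static_cast<float>(pCam.at<double>(1)),
                       static_cast<float>(pCam.at<double>(2)));
}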

2. During camera calibration

I suppose you calibrate the camera with OpenCV using a calibration board in various poses. The usual way is to assume that the board is actually stable and the camera moves, instead of the opposite (the transformation is the same in both cases). That means the world coordinate system now corresponds to the object coordinate system. So, for each frame, we find the checkerboard corners and assign them 3D coordinates, doing something like:

std::vector<cv::Point3f> objectCorners;
for (int i = 0; i < noOfCornersInHeight; i++)
{
    for (int j = 0; j < noOfCornersInWidth; j++)
    {
        objectCorners.push_back(cv::Point3f(float(i*squareSize), float(j*squareSize), 0.0f));
    }
}

where noOfCornersInWidth, noOfCornersInHeight and squareSize depend on your calibration board. If, for example, noOfCornersInWidth = 4, noOfCornersInHeight = 3 and squareSize = 100, we get the 3D points

(0,0,0)   (0,100,0)   (0,200,0)   (0,300,0)
(100,0,0) (100,100,0) (100,200,0) (100,300,0)
(200,0,0) (200,100,0) (200,200,0) (200,300,0)

So, here our coordinates are actually in the object coordinate system. (We arbitrarily assumed that the upper left corner of the board is (0,0,0), and the remaining corner coordinates follow from that.) Here we therefore do need a rotation and translation matrix to take us from the object (world) system into the camera system. These are the extrinsics that OpenCV returns for each frame.
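
For reference, a minimal calibration sketch around cv::calibrateCamera. The variable names are placeholders: objectPoints holds one copy of objectCorners per view, and imagePoints holds the corresponding corners detected with cv::findChessboardCorners.

#include <opencv2/calib3d.hpp>
#include <vector>

void calibrate(const std::vector<std::vector<cv::Point3f>>& objectPoints,
               const std::vector<std::vector<cv::Point2f>>& imagePoints,
               cv::Size imageSize)
{
    cv::Mat cameraMatrix, distCoeffs;   // intrinsics and distortion coefficients
    std::vector<cv::Mat> rvecs, tvecs;  // extrinsics: one rotation/translation per view

    double rms = cv::calibrateCamera(objectPoints, imagePoints, imageSize,
                                     cameraMatrix, distCoeffs, rvecs, tvecs);

    // rvecs[i] and tvecs[i] take you from the board (object/world) system of
    // view i into the camera system; rms is the reprojection error.
}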

To summarize, in the Kinect case:

  • The camera and world coordinate systems are considered the same, so there is no need for extrinsics there.
  • There is no need for an object-to-world (camera) transformation either, since the values the Kinect returns are already in the camera system.

Edit 2 (on the coordinate system used):

This is a convention, and I think it also depends on which drivers you use and what data you get back. Check this and this, for example.

Sidenote: it will help you a lot if you visualize the point cloud and play around with it a little. You can save your points as 3D objects (e.g. ply or obj) and then just import them into a program like MeshLab (very easy to use).
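
A minimal sketch of dumping the cloud to an ASCII .ply file that MeshLab opens directly (just x y z per vertex, no colors or normals; the function name is illustrative):

#include <opencv2/core.hpp>
#include <fstream>
#include <string>
#include <vector>

// Write a point cloud as an ASCII PLY file.
void savePly(const std::string& filename, const std::vector<cv::Point3f>& points)
{
    std::ofstream out(filename);
    out << "ply\n"
        << "format ascii 1.0\n"
        << "element vertex " << points.size() << "\n"
        << "property float x\n"
        << "property float y\n"
        << "property float z\n"
        << "end_header\n";
    for (const cv::Point3f& p : points)
        out << p.x << " " << p.y << " " << p.z << "\n";
}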


Quoting Edit 2 above (on the coordinate system used): "This is a convention, and I think it also depends on which drivers you use and what data you get back. Check, for example, this, this and that."

If you use, for example, the Microsoft SDK: then Z is not the distance to the camera but the "planar" distance to the camera. This may change the appropriate formulas.
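
For illustration only (this conversion is not from the answers above; the names are hypothetical): a planar depth can be plugged straight into the formulas given earlier, whereas if a sensor returned the Euclidean distance d along the viewing ray, you would first recover the planar z, roughly like this:

#include <cmath>

// Convert a Euclidean distance along the viewing ray of pixel (u, v)
// into the planar depth z, given the intrinsics f_x, f_y, c_x, c_y.
float rayDistanceToPlanarDepth(float d, float u, float v,
                               float fx, float fy, float cx, float cy)
{
    const float xn = (u - cx) / fx;  // normalized image coordinates
    const float yn = (v - cy) / fy;
    return d / std::sqrt(xn * xn + yn * yn + 1.0f);
}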

