When working with computer vision models, we frequently need information about the camera that was used to generate the image. This includes the position of the camera in the world coordinate system as well as intrinsic properties of the camera such as the focal length. We will write a more detailed post about cameras and coordinate systems later on, but for now we will only address the positioning of the camera in the world coordinate system. Think of the world coordinate system as the 3D space in which all the objects in the image live, including the camera.
The position and orientation of the camera can be expressed as a 3x3 rotation matrix R and a translation T. Together these can represent either:
- a “world-to-view” transform: this maps points from the world coordinate system, where the camera lives, into the camera's own frame, defined by its position and viewpoint
- a “view-to-world” transform: this is the inverse transformation, mapping points from the camera's frame back to the world coordinate system
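To make the two options concrete, here is one way to write them down. This is a sketch that assumes points are column vectors multiplied on the left by the rotation; the subscripts w2v and v2w are our own labels for illustration. Some libraries (PyTorch3D's documentation among them) multiply row vectors on the right instead, in which case R appears transposed.

```latex
% world-to-view: takes a point from world coordinates into the camera frame
X_{\text{view}} = R_{\text{w2v}} \, X_{\text{world}} + T_{\text{w2v}}

% view-to-world: the inverse mapping
X_{\text{world}} = R_{\text{v2w}} \, X_{\text{view}} + T_{\text{v2w}}

% relationship between the two (R is a rotation, so R^{-1} = R^{\top})
R_{\text{v2w}} = R_{\text{w2v}}^{\top}, \qquad
T_{\text{v2w}} = -R_{\text{w2v}}^{\top} \, T_{\text{w2v}}
```

Setting X_view = 0 in the second equation shows that, under this convention, the view-to-world translation is the camera's position in world coordinates.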
Datasets and codebases have different conventions for how cameras are represented. It’s very important to understand the conventions of the codebase and dataset that you are using. For example, synthetic data generated using habitat-sim uses the “view-to-world” convention, whereas PyTorch3D uses the “world-to-view” convention.
It’s easy to convert between the two: compose R and T into a single 4x4 transformation matrix and invert it.
```python
from typing import Tuple

import torch


def convert_cameras(
    R: torch.Tensor, T: torch.Tensor
) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Convert camera extrinsics definition from view-to-world
    to world-to-view. This requires inverting the 4x4 RT matrix.

    Args:
        R: torch.FloatTensor of shape (3, 3) representing the
            view-to-world rotation
        T: torch.FloatTensor of shape (1, 3) representing the
            camera position in world coordinates

    Returns:
        R of shape (3, 3) and T of shape (3,), converted to be
        a world-to-view transform
    """
    assert R.shape == (3, 3)
    assert T.shape == (1, 3)

    # Compose R and T into a single 4x4 view-to-world matrix
    Pinv = torch.eye(4)
    Pinv[0:3, 0:3] = R
    Pinv[0:3, 3] = T.reshape(3)

    # Invert to get the world-to-view matrix
    P = torch.linalg.inv(Pinv).to(torch.float32)

    # Split back into R and T
    R = P[0:3, 0:3]
    T = P[0:3, 3]
    return R, T
```
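Because R is a rotation, the same inverse can also be written in closed form without a general matrix inversion. The snippet below is a minimal sketch of that idea, not part of any library: `invert_rigid_transform` is a hypothetical helper name, the pose values are made up for illustration, and it assumes the column-vector convention (x' = R x + T) and that R really is a proper rotation matrix.

```python
from typing import Tuple

import torch


def invert_rigid_transform(
    R: torch.Tensor, T: torch.Tensor
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Invert x' = R @ x + T, assuming R is a proper 3x3 rotation matrix."""
    R_inv = R.transpose(0, 1)  # the inverse of a rotation is its transpose
    T_inv = -R_inv @ T.reshape(3)  # the inverse translation is -R^T T
    return R_inv, T_inv


# Made-up view-to-world pose: 90 degree rotation about z, camera at (1, 2, 3)
R_v2w = torch.tensor([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
T_v2w = torch.tensor([1.0, 2.0, 3.0])

R_w2v, T_w2v = invert_rigid_transform(R_v2w, T_v2w)

# Cross-check against the general 4x4 inverse
P_v2w = torch.eye(4)
P_v2w[:3, :3] = R_v2w
P_v2w[:3, 3] = T_v2w
P_w2v = torch.linalg.inv(P_v2w)

assert torch.allclose(P_w2v[:3, :3], R_w2v, atol=1e-6)
assert torch.allclose(P_w2v[:3, 3], T_w2v, atol=1e-6)
```

A quick sanity check for either approach: the camera's world position should land at the origin of the camera frame after applying the world-to-view transform. If R might not be an exact rotation (for example, after accumulated numerical error), the general 4x4 inverse is the safer choice.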
In a later blog post we’ll cover coordinate system conventions in more detail. Understanding the conventions of your data and the coordinate system in which the data lives is extremely important when setting up a deep learning training pipeline.