We believe that deep learning combined with synthetic data can help with the task of estimating the depth to a person. In this article, we will discuss the details of the dataset generation process and demonstrate depth estimation results on real data.
Why train on pose data?
Domain adaptation from synthetic to real-world images can be challenging. In this light, the idea of using a predicted pose as the input, instead of a full image, emerges as low-hanging fruit that lets us assess the feasibility of the DL + synthetic data approach without the domain adaptation overhead.
Unlike images, synthetic pose data is very similar to real pose data: just 17 keypoints representing the human pose, with no variation in clothes, backgrounds, facial expressions, hairstyles, lighting conditions, color balance, or other image attributes. This is demonstrated by the image below:
The real image data looks quite unlike the synthetic one.
But the real pose data looks very much like its synthetic counterpart.
Also, a pose sample is just 51 floats (17 keypoints: x, y, and confidence) plus 1 float for the ground-truth depth, so even a desktop GPU can hold a dataset with tens of millions of samples. This eliminates the need to load samples from disk in the training loop. Combined with the fact that we experiment with computationally cheap models (shallow fully-connected networks that process a sample within a few milliseconds), we end up with very efficient training: it converges within a few minutes.
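For a rough sense of scale, here is a minimal training-loop sketch showing how the whole dataset can live in GPU memory (assuming PyTorch, which is used here purely for illustration; the layer sizes, batch size, and data are placeholders, not our actual setup):

import torch
import torch.nn as nn

# Hypothetical shapes: N samples, 17 keypoints * (x, y, conf) = 51 inputs, 1 depth target.
N = 10_000_000
X = torch.rand(N, 51)   # pose features (placeholder data)
y = torch.rand(N, 1)    # ground-truth depth (placeholder data)

# 10M float32 samples of 52 values each is roughly 2 GB, so the whole dataset
# fits on a desktop GPU and the disk is never touched during training.
device = "cuda" if torch.cuda.is_available() else "cpu"
X, y = X.to(device), y.to(device)

# A shallow fully-connected regressor; the exact architecture is illustrative.
model = nn.Sequential(
    nn.Linear(51, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1),
).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(1000):
    idx = torch.randint(0, N, (4096,), device=device)  # sample a batch directly on the GPU
    loss = loss_fn(model(X[idx]), y[idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()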
How was the dataset made?
As described in the previous part, one of the main problems of estimating the depth to a person that we try to solve with Sloth, our symbolic AI approach, is the "smaller projection = greater depth" myth, which is ironically tangled with the core concept of depth computation. Yes, we do consider the depth to a person to be inversely proportional to their projection size, but we also want to distinguish between projection size variation caused by an actual depth change and variation caused by a pose change.
In other words, we want to train our model to predict the same depth value for a crouching person (who occupies a smaller area of the image) and a standing person with their hands up (who occupies a larger area of the image) at the same distance.
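To make the trap concrete, here is a toy sketch of the pinhole-style estimate that projection-size-based algorithms rely on (the focal length and person height are hypothetical numbers, not our actual parameters):

# Toy illustration of the "smaller projection = greater depth" trap, assuming a
# simple pinhole model: depth ≈ focal_px * real_height_m / projection_height_px.
FOCAL_PX = 900.0        # hypothetical focal length in pixels
PERSON_HEIGHT_M = 1.75  # hypothetical real height

def naive_depth(projection_height_px: float) -> float:
    return FOCAL_PX * PERSON_HEIGHT_M / projection_height_px

print(naive_depth(300))  # standing person: ~5.25 m
print(naive_depth(150))  # the same person crouching at the same distance:
                         # the naive estimate jumps to ~10.5 m, which is wrong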
Consequently, our dataset should comprise varying poses. How do we achieve that?
Pre-defined poses
The first idea we had was to manually sculpt different poses and randomly apply them to our humanoid dummy. The primary reason we didn't stick with it is that this is hard to achieve with the current Blender Python API. In Blender's GUI, different poses of an object's armature can be stored as assets and applied by right-clicking the asset and selecting the "Apply Pose" option. But this straightforward GUI operation becomes unwieldy through Python: the current version of Blender has no convenient Python bindings for applying previously stored pose assets. However, there are workarounds:
- Iterating over GUI elements until the correct pose asset is selected. We didn't try this approach because we didn't find a reliable way to select the desired pose asset, and scripting blind iteration over GUI elements until it is selected is too fragile an approach to rely on.
- Loading poses with the MB-Lab API. MB-Lab is the Blender plug-in from which we obtain the human models. It has a Python binding for loading a pose from a .json file, and this is the method we used (see the sketch after this list). The drawback is that it only allows loading a pose from disk, which adds latency every time we update the pose, and it doesn't support Blender's native pose manipulation flexibility, such as interpolation between two poses.
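For reference, loading a stored pose through the MB-Lab operator boils down to a single call (the pose path below is hypothetical):

import bpy

# Load a previously saved pose from disk via the MB-Lab operator
# (the same call used later in our generation snippet). The path is illustrative.
pose_path = "/path/to/poses/squat_01.json"
bpy.ops.mbast.restpose_load(filepath=pose_path)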
Automated pose variation
If we manually sculpt poses, we can only have so many options. So we looked into automating pose variation.
We started by measuring, for different joints, the limits of bending angles that still produce anatomically possible poses:
bones_rotation_limits = {
    "thigh_L": {
        "X": (-45, 90),
        "Y": (-60, 30),
        "Z": (-17, 35),
    },
    "calf_L": {
        "X": (-130, 0),
        "Y": (0, 0),
        "Z": (0, 0),
    },
    ...
}
Then, during pose randomization, we iterated over these joints, applying a bend angle sampled from a uniform distribution between the measured limits of each specific joint.
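In the Blender Python API, such a loop could look roughly like this (a sketch, not our exact script; the armature object name and the clipping of the normal-sampling variant are assumptions):

import math
import numpy as np
import bpy

def randomize_bones(armature, limits):
    # Apply a random bend to each listed bone, within its measured limits.
    for bone_name, axes in limits.items():
        bone = armature.pose.bones[bone_name]
        bone.rotation_mode = 'XYZ'
        angles = []
        for axis in ("X", "Y", "Z"):
            lo, hi = axes[axis]
            # Uniform sampling between the measured limits (our first attempt).
            angle = np.random.uniform(lo, hi)
            # The later variant sampled from a normal distribution centred on
            # the middle of the range, e.g.:
            # angle = np.clip(np.random.normal((lo + hi) / 2, (hi - lo) / 6), lo, hi)
            angles.append(math.radians(angle))
        bone.rotation_euler = angles

armature = bpy.data.objects["Armature"]  # hypothetical object name
randomize_bones(armature, bones_rotation_limits)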
The resulting poses were anatomically possible but quite unnatural, rarely occurring in real data. Changing the sampling method from uniform to normal (with the mean in the middle of the bend angle range) made the poses look more natural, but they still weren't very representative of real data.
Hybrid approach (the one we used)
To overcome the limitations of previous depth estimation algorithms (such as the one that uses bounding box height as the reference projection), we needed to teach our model not to rely solely on pose height to determine depth. To achieve that, we decided to densely populate our dataset with samples of crouching people and people with their hands up. We used a hybrid approach to vary the pose of the 3D dummy, combining pre-defined pose presets with random tweaks to individual joints. The presets specifically addressed the dummy's legs to create variation in crouching, kneeling, and standing poses. These presets were loaded half of the time; the other half of the time we simulated walking:
# randomize legs
if np.random.rand() < 0.5:
    # load legs pose (squatting / kneeling)
    bpy.ops.mbast.restpose_load(filepath=str(np.random.choice(poses_paths)))
else:
    if np.random.rand() < 0.9:
        set_random_walk_phase(armature)
    else:
        set_random_walk_phase_both_legs(armature)
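The set_random_walk_phase helper isn't shown above; here is a hypothetical sketch of what such a walking simulation could look like (bone names follow the MB-Lab rig from the limits dictionary; the amplitude and knee handling are illustrative assumptions, not our actual implementation):

import math
import numpy as np
import bpy

def set_random_walk_phase(armature):
    # Hypothetical reconstruction: a random phase swings the thighs in opposite
    # directions, crudely imitating a person caught mid-stride.
    phase = np.random.uniform(0.0, 2.0 * math.pi)
    swing_deg = 35.0 * math.sin(phase)  # illustrative amplitude
    for bone_name, sign in (("thigh_L", 1.0), ("thigh_R", -1.0)):
        thigh = armature.pose.bones[bone_name]
        thigh.rotation_mode = 'XYZ'
        thigh.rotation_euler.x = math.radians(sign * swing_deg)
        # Bend the knee of the forward-swinging leg so the pose isn't stiff;
        # the calf only bends in the negative X range measured earlier.
        calf = armature.pose.bones[bone_name.replace("thigh", "calf")]
        calf.rotation_mode = 'XYZ'
        calf.rotation_euler.x = math.radians(min(0.0, -sign * swing_deg))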
With the legs taken care of, the other body parts were randomized based on the previously measured bend angle limits:
# randomize other parts
if np.random.rand() < 0.3:
    turn_spine(armature)
random_hands(armature)
if np.random.rand() < 0.3:
    hands_up(armature)
if np.random.rand() < 0.3:
    turn_head(armature)
As seen in the code above, we don't necessarily apply variation to every body part in every sample of the dataset. This makes sense: in real-life data we don't expect a person to often have their head turned sideways, hands up, and spine bent forward all at the same time.
The resulting samples consisted of mostly natural poses, while still being diverse enough to convey the idea that "crouching does not equal farther away":

Along with each image, we logged a .json file with the camera parameters (including its distance to the dummy, which is used as the ground truth depth in the final dataset) and the actual 2D pose, which we planned to use to filter out overly imprecise YOLO pose estimations.
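For illustration, the per-sample sidecar could look roughly like this (the field names and values below are hypothetical, not our actual schema):

import json

sample_meta = {
    "camera": {
        "depth_m": 4.2,     # distance to the dummy -> ground-truth depth
        "height_m": 1.6,
        "tilt_deg": -5.0,
    },
    # Ground-truth 2D keypoints, used to filter out imprecise YOLO estimations.
    "pose_2d": [[312.0, 118.5], [318.2, 116.0]],  # truncated for brevity
}

with open("sample_000001.json", "w") as f:
    json.dump(sample_meta, f, indent=2)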
Dataset generation efficiency
Randomize pose first
Calling the pose randomization function takes more time than rotating the dummy or moving the camera. That's why the main loop of dataset generation was nested like this:
while True:
    randomize_pose()
    for _ in range(SAMPLES_PER_POSE):
        randomize_camera_depth()
        randomize_camera_height_and_tilt()
        randomize_dummy_rotation()
        render_sample()
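The camera and dummy randomization helpers aren't shown here; as an illustration, two of them could look roughly like this (the object names and ranges are assumptions, not our actual script):

import math
import numpy as np
import bpy

def randomize_camera_depth(min_m=1.5, max_m=12.0):
    # Move the camera along a single axis to vary the distance to the dummy.
    cam = bpy.data.objects["Camera"]          # hypothetical object name
    cam.location.y = -np.random.uniform(min_m, max_m)

def randomize_dummy_rotation():
    # Spin the dummy around its vertical axis so it faces a random direction.
    dummy = bpy.data.objects["Dummy"]         # hypothetical object name
    dummy.rotation_euler.z = math.radians(np.random.uniform(0.0, 360.0))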
Render quality
The renders were 640×640 pixels at 100% resolution scale, GPU-accelerated, with only 4 render samples. Rendering an image took about 3 seconds on an RTX 4070 SUPER. Modern computer graphics can produce renders at 300 FPS with far more complex scenes and higher resolutions, so most likely we are missing something in the render options, and images of this quality could be rendered hundreds or thousands of times faster.
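For reference, here is a minimal sketch of the render configuration described above, assuming the Cycles engine (which is an assumption on our part) and a hypothetical output path:

import bpy

scene = bpy.context.scene
scene.render.engine = 'CYCLES'
scene.render.resolution_x = 640
scene.render.resolution_y = 640
scene.render.resolution_percentage = 100
scene.cycles.samples = 4          # very low sample count, still ~3 s per frame
scene.cycles.device = 'GPU'
scene.render.filepath = "/tmp/sample_000001.png"
bpy.ops.render.render(write_still=True)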
Don’t move the camera, just downscale
So far, we obtained samples at different distances by moving the camera farther from the object and re-rendering the scene. This can be efficiently approximated by rendering the entire dataset at the same distance and then achieving distance variation by downscaling the images.
One may ask why not go even further and just scale the pose. The answer is that we want to capture the peculiarities of YOLO's pose estimation at different scales. The pose of a person who is 70 pixels tall might be quite noisy, and we want to teach our model to deal with these imprecisions, which won't happen if we only feed it perfect downscaled versions of poses cleanly predicted from images of people who are, say, 300 pixels tall.
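Here is a minimal sketch of the downscaling idea (assuming OpenCV for the resize; the function and its exact policy are illustrative):

import cv2  # assumption: any image library would do

def downscale_sample(image, base_depth_m, scale):
    # Approximate a sample captured from farther away by shrinking the render:
    # under a pinhole model, scaling the projection by `scale` corresponds to
    # dividing the depth by `scale` (e.g. scale = 0.5 -> twice as far away).
    h, w = image.shape[:2]
    small = cv2.resize(image, (int(w * scale), int(h * scale)),
                       interpolation=cv2.INTER_AREA)
    return small, base_depth_m / scale

# The pose is then re-estimated with YOLO on the downscaled image, so the
# scale-dependent estimation noise stays in the dataset.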
Results
The trained model performed reasonably well on real-world data, even though it was trained only on synthetic data. It shows early signs of pose invariance, and we believe that with further development it will become robust enough to be used in real-world applications.
Next steps
Our further goals are:
- Training a model that would handle occlusions.
- Utilizing multiple frames.
- More efficient rendering.
- Zooming into and upscaling the person's ROI to give YOLO's pose estimation more resolution for distant people.