BundleMoCap: Efficient, Robust and Smooth Motion Capture from Sparse Multiview Videos
Abstract.
Capturing smooth motions from videos using markerless techniques typically involves complex processes such as temporal constraints, multiple stages with data-driven regression and optimization, and bundle solving over temporal windows. These processes can be inefficient and require tuning multiple objectives across stages. In contrast, BundleMoCap introduces a novel and efficient approach to this problem. It solves the motion capture task in a single stage, eliminating the need for temporal smoothness objectives while still delivering smooth motions. BundleMoCap outperforms the state-of-the-art without increasing complexity. The key concept behind BundleMoCap is manifold interpolation between latent keyframes. By relying on a local manifold smoothness assumption, we can efficiently solve a bundle of frames using a single code. Additionally, the method can be implemented as a sliding window optimization and requires only the first frame to be properly initialized, reducing the overall computational burden. BundleMoCap’s strength lies in its ability to achieve high-quality motion capture results with simplicity and efficiency. For more results, visit our project’s page: https://moverseai.github.io/bundle-mocap/
1. Introduction
Human motion capture (MoCap) is a long-standing goal and highly researched topic, with its popularity mainly stemming from the multitude of applications that it can enable in domains such as gaming, content creation, sports, and surveillance. The complexity of human motion necessitates the use of costly equipment and complex processes, but in recent years, the emergence of modern machine learning has created new opportunities to reduce complexity and costs. Monocular and/or sparse multi-view markerless capture has seen tremendous growth, as summarized in a recent state-of-the-art report (Tian et al., 2023).
However, most works focus on single image captures or simply extend image-based captures to a video stream of images, producing unrealistic motion streams with a distinct lack of temporal coherence. While simple localization jitter may be reduced using traditional filtering, occlusion creates challenging deviation patterns that cannot simply be addressed via filtering. Various solutions (Wei et al., 2022; Zeng et al., 2022; Tang et al., 2023; Shen et al., 2023) have been proposed for modelling long-range motion dependencies to overcome this.
Still, these are purely data-driven solutions for the highly ill-posed monocular case. Sparse multi-view captures are harder to model with end-to-end machine learning models due to the lack of data and the complexities stemming from the limited expressivity and scalability of neural networks when applied to multi-view data. This has, in turn, led to the predominant use of optimization-based approaches for multi-view MoCap and the integration of motion smoothness objectives in addition to solving entire temporal windows (Huang et al., 2017, 2021; Arnab et al., 2019; Ye et al., 2023).
Another challenge arises from the data constraints used to solve these optimization problems, typically 2D keypoint estimations from an image-based pose estimation model (Cao et al., 2019). These are highly jittery estimates and also suffer from missing or inverted estimates (Ruggero Ronchi and Perona, 2017). Jittery 2D keypoints accentuate jitter in the captured motion, and missing or inverted estimates render the sparse multi-view setting very sensitive to outliers. The solution to this in prior work (Huang et al., 2017; Ye et al., 2023) has been the use of multi-staged body fits, where each stage progressively refines the solution. Initial stages typically solve on a per-frame basis, with subsequent stages adding temporal constraints and sometimes solving a bundle of frames simultaneously. Other solutions involve solving for the entire video simultaneously after extracting initial estimates from a single-shot regressor model (Arnab et al., 2019), or relying on novel motion priors and solving motion segments over multiple stages (Huang et al., 2021; Rempe et al., 2021).
In this work, we present a novel solution for sparse multi-view markerless MoCap that addresses all of the aforementioned challenges. Instead of solving for a group of frames, we solve for a single latent code that reconstructs a bundle of frames via manifold interpolation. The reconstructed bundle is then constrained and solved across a temporal window. This approach has several advantages: it is more robust to the outlier cases prevalent in sparse multi-view settings, it is far more efficient as it solves a video in a single stage, and it produces smooth motion without any motion smoothness objectives.
2. Related Work
2.1. Markerless Multi-view MoCap
In addition to digitizing human motion, capturing the pose and/or motion of humans is critical for dataset creation (Cheng et al., 2022; Zhang et al., 2023; Huang et al., 2022), training data collection (Bhatnagar et al., 2020), bootstrapping dynamic radiance fields (Zhao et al., 2022), and has even found use in physicalizing human motion (Zhang et al., 2018). The standard approach is to rely on 2D keypoint detections (Cao et al., 2019) on calibrated images/videos and fit an articulated skeleton to these observations (Huang et al., 2017). While there are approaches that only estimate 3D joints (Iskakov et al., 2019; Ye et al., 2022; Bartol et al., 2022), these do not provide a consistent structure, suffer from varying bone lengths, and do not offer articulation parameters. Fitting to 2D keypoint detections is challenging even in the multi-view case, especially when the sensors are sparse, as 2D detectors suffer from high amounts of jitter, inversions, and missing detections (Ruggero Ronchi and Perona, 2017) due to occlusions. While direct (Jiang et al., 2022) or progressive (Gong et al., 2023) data-driven regression approaches exist, further progress is required to improve their performance and generalization.
2.2. Temporal Constraints
Capturing motion on a per-frame basis suffers from noisy estimates or fitting to noisy observations. While temporal filtering is one option to resolve the temporal jitter and inconsistency of per-frame captures (Ingwersen et al., 2023), it is very challenging to design filters to handle occlusions and outliers. This led to the integration of temporal constraints and priors when solving for the pose and motion. One of the earlier attempts (Kanazawa et al., 2019) used a bundle of images as input to regressors predicting a coherent bundle of outputs, and specifically the parameters of the SMPL body model. Follow-up approaches (Li et al., 2021) relied on recurrent architectures which are a better architectural fit to the problem at hand, as each prediction directly depends on the previous frame estimates. Regarding temporal constraints within optimization loops, the most common objective is a joint smoothness one (Arnab et al., 2019; Peng et al., 2018; Zhang et al., 2018; Ye et al., 2023) that essentially enforces a position constancy. A DCT smoothness term has also been used when fitting the SMPL body to multiple images (Huang et al., 2017). Another variant involves higher order smoothness terms like velocity constancy (Zanfir et al., 2018; Loper et al., 2014; Mahmood et al., 2019), either on joint/marker positions or on joints’ angles. With the advent of data-driven priors, lots of works have added higher level smoothness constraints to supplement the joint position one. These include different representations, like the feature space of a velocity autoencoder (Zhang et al., 2021; Huang et al., 2022), the latent space of poses (Huang et al., 2021), motion (Saini et al., 2023), or the latent space and transition state of an autoregressive motion prior (Jin and Liu, 2023; Ye et al., 2023).
A common theme to all prior approaches is that motion smoothness is achieved by adding one or more objectives to the optimization problem. This incurs the cost of additional parameters that need tuning, a case that is especially complex considering the high number of objectives involved when fitting a parametric body model to images. Instead, our approach achieves motion smoothness implicitly by smoothly interpolating a pose manifold, requiring no additional hyper-parameters.
2.3. Bundle Solving
Adding such temporal constraints is more effective when solving for a bundle of frames simultaneously. The DCT prior used in (Huang et al., 2017) was added in the second stage of optimization, where a group of $30$ frames was solved simultaneously. Similarly, joint smoothness was imposed on a group of frames solved simultaneously in MoSculp (Zhang et al., 2018) as well as other works (Peng et al., 2018; Saini et al., 2023; Ye et al., 2023). Bundle solving with joint smoothness is also possible in the latent space of regressor models (Peng et al., 2018), albeit costly given its excessive parameter count. Apart from solving a small bundle, it has also been shown that it is possible to solve entire videos simultaneously (Arnab et al., 2019), exploiting the entire temporal context, even with very complex objective functions (Huang et al., 2022) involving human-object interactions. Still, this requires good initialization, which can be achieved using a regressor model (Arnab et al., 2019) or a per-frame body fitting stage (Huang et al., 2022). This type of staged optimization is standard in the ill-posed monocular case (Bogo et al., 2016; Pavlakos et al., 2019), but also finds widespread use in multi-view settings (Huang et al., 2017; Saini et al., 2023). When considering videos, multiple stages are used to initialize estimates for each separate frame, either via direct regression models or by solving an optimization problem per frame, and then bundles are solved using smoothness constraints (Zhang et al., 2021; Ye et al., 2023; Saini et al., 2023; Arnab et al., 2019; Peng et al., 2018).
Nonetheless, this is rather costly, as multiple passes are required over the videos. In addition, the temporal nature of the data is only considered at the constraints level, not at the parameters level, with each frame’s state being solved separately within the bundle. Our approach, BundleMoCap, only solves for a single keyframe for each bundle and only requires a single (first) frame initialization, essentially solving for smooth motion robustly in a single stage.
2.4. Latent Parameters
Earlier body fitting works solved for the pose articulation parameters directly and regularized the solution to plausible poses using Gaussian mixture priors (Bogo et al., 2016; Huang et al., 2017). Modern representation learning offered a more compact latent space to solve instead of the rotation parameter space, and simultaneously allowed for penalizing implausible poses directly on the solved parameters (Pavlakos et al., 2019). As a result, while earlier bundle solving MoCap works solved for the articulation (Arnab et al., 2019; Huang et al., 2017; Zanfir et al., 2018), solving for latent representations is now established (Pavlakos et al., 2019; Ye et al., 2023; Saini et al., 2023). Even though initial attempts used the latent space of regressors, which is not compact (Peng et al., 2018), or pure autoencoder embeddings (Zhang et al., 2021), variational autoencoders (VAE) (Kingma and Welling, 2015) have proven more effective, as a prior over their latent space directs the solution towards more plausible poses. These have been used to model poses, as in the case of VPoser (Pavlakos et al., 2019), as well as motion (Saini et al., 2023; Huang et al., 2021), with the latter typically employing higher dimensional latent spaces. Another category of solved latent spaces are autoregressive models, represented by HuMoR (Jin and Liu, 2023), which has been used in video-based MoCap (Ye et al., 2023) during the third solving stage to offer motion smoothness. Still, a VPoser VAE was used in the second stage for pose plausibility. On the other hand, GAN-based methods (Davydov et al., 2022) discriminate the generated poses from real ones, essentially learning adversarial priors, while Pose-NDF (Tiwari et al., 2022) employs a neural implicit function for modelling the plausible pose manifold.
In BundleMoCap we show that a compact pose prior can be used to solve for smooth motion using a single latent code, instead of relying on motion or autoregressive priors. This carries the advantage of relying less on data that offers a high variety of motion transitions, and more on learning a high-quality, smooth pose manifold, reducing the pressure for large datasets.
3. BundleMoCap
3.1. Preliminaries
To capture human motion, we use a parametric human body model $\mathcal{B}$ that acts as a function over a group of parameters to generate a human body geometry $(\mathbf{V},\mathbf{F})=\mathcal{B}(\boldsymbol{\beta},\boldsymbol{\theta},\mathbf{T})$. A triangular mesh surface $(\mathbf{V},\mathbf{F})$ is defined by the vertices $\mathbf{V}\in\mathbb{R}^{V\times 3}$ and faces $\mathbf{F}\in\mathbb{N}^{F\times 3}$. It is reconstructed by $\mathcal{B}$ using $S$ blendshape coefficients $\boldsymbol{\beta}\in\mathbb{R}^{S}$, articulated by $P$ pose parameters $\boldsymbol{\theta}\in\mathbb{SO}(3)^{P}$, and globally positioned by the transform $\mathbf{T}=\left[\begin{smallmatrix}\mathbf{R}&\mathbf{t}\\ \mathbf{0}&1\end{smallmatrix}\right]\in\mathbb{SE}(3)$. Using linear operations expressed as a regressor matrix $\mathbf{M}\in\mathbb{R}^{J\times V}$, it is possible to extract $J$ different body joints $\mathbf{j}\in\mathbb{R}^{J\times 3}$ via the matrix multiplication $\mathbf{j}=\mathbf{M}\mathbf{V}$. The joints $\mathbf{j}$ are projected to an image domain $\Omega:=\mathbb{R}^{W\times H}$ as keypoints $\mathbf{k}=\boldsymbol{\pi}(\mathbf{j})$ using a projection function $\boldsymbol{\pi}$ parameterized by the intrinsic properties of $\Omega$.
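The joint extraction and projection described above can be sketched as follows; the toy regressor, intrinsics matrix, and dimensions are illustrative assumptions, not the body model's actual values:

```python
import numpy as np

def regress_joints(J_reg, V):
    # Linearly regress J body joints from V mesh vertices via matrix
    # multiplication, as in the joint extraction step above.
    # J_reg: (J, V) regressor matrix, V: (V, 3) vertices -> (J, 3) joints.
    return J_reg @ V

def project(j, K):
    # Pinhole projection of camera-space 3D joints j (J, 3) to 2D
    # keypoints k (J, 2) using an intrinsics matrix K (3, 3).
    uvw = (K @ j.T).T                 # homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]   # perspective divide

# Toy example: 4 "vertices" and a 2-joint regressor averaging vertex pairs.
V = np.array([[0., 0., 2.], [1., 0., 2.], [0., 1., 4.], [1., 1., 4.]])
J_reg = np.array([[0.5, 0.5, 0., 0.], [0., 0., 0.5, 0.5]])
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
j = regress_joints(J_reg, V)          # (2, 3) joints
k = project(j, K)                     # (2, 2) keypoints
```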
3.2. Latent Keyframe Bundle Solving
To estimate the human pose and shape at a single time instance $t$ a minimization problem is formulated using an objective comprising weighted data $\mathcal{E}_{data}$ and prior $\mathcal{E}_{prior}$ terms (Bogo et al., 2016; Pavlakos et al., 2019):
(1) | $\operatorname*{\arg\!\min}_{\mathbf{z}^{t},\boldsymbol{\beta}^{t},\mathbf{T}^{t}}\mathcal{E}^{t}_{data}+\mathcal{E}^{t}_{prior},$ |
with a notable difference being the optimization over a latent code $\mathbf{z}\in\mathbb{R}^{L}$ instead of the pose parameters $\boldsymbol{\theta}$. While earlier works (Bogo et al., 2016) relied on Gaussian mixture pose priors using the joints’ angles $\boldsymbol{\theta}$ as input, recent works use data-driven priors (Pavlakos et al., 2019) to instead optimize lower dimensional latent codes. These reconstruct the pose parameters $\boldsymbol{\theta}=\mathcal{G}(\mathbf{z})$, with the generator function $\mathcal{G}$ being a fixed, pre-trained neural network. Apart from the benefit of optimizing a lower dimensional parameter space, it is also possible to impose a prior term on the latent code itself (Pavlakos et al., 2019; Davydov et al., 2022; Tiwari et al., 2022). This is important as it helps prevent degenerate solutions and provides additional constraints that alleviate the ill-posedness of the problem. Such priors can also improve the convexity of the problem when initially weighted higher, staging the solve by progressively lowering the prior weights as the solution is refined. This type of annealing is typical and used in both challenging monocular (Bogo et al., 2016; Pavlakos et al., 2019) and multi-view (Huang et al., 2017) cases, and even when fitting body models to marker sets (Loper et al., 2014; Mahmood et al., 2019). Still, this multi-staged solving process incurs higher run-times, as each time instance $t$ is solved separately; it is usually accelerated by fixing the shape $\boldsymbol{\beta}$ to a single estimate across the entire sequence.
Our proposition is to solve a temporal window $\mathcal{T}:=[0,\dots,T]$ simultaneously using two latent keyframes at time instances $0$ and $T$:
(2) | $\operatorname*{\arg\!\min}_{\mathbf{z}^{0},\mathbf{z}^{T},\mathbf{T}^{0},\mathbf{T}^{T}}\mathcal{E}^{\mathcal{T}}_{data}+\mathcal{E}^{0,T}_{prior},$ |
using an initial and fixed shape estimate $\boldsymbol{\beta}$, with the prior term imposed only on the keyframes, and the data term defined over the entire window $\mathcal{T}$:
(3) | $\mathcal{E}^{\mathcal{T}}_{data}=\sum\limits_{t=1}^{T}\mathcal{E}^{t}_{data},$ |
where:
(4) | $\mathcal{E}^{t}_{data}=\lambda_{R}\sum_{c=1}^{C}\sum_{i=1}^{J}{w_{c,i}\,\rho(\mathbf{k}^{t}_{c,i}-\mathbf{k}^{t}_{det,c,i})}$ |
Here, $C$ is the number of cameras and $\rho$ is the Geman-McClure robust penalty function, which we favour over a traditional L2 penalty as it deals better with noisy estimates.
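A minimal sketch of the Geman-McClure penalty follows; the parameterization and the scale `sigma` are illustrative assumptions, as the text does not specify them:

```python
import numpy as np

def geman_mcclure(residual, sigma=100.0):
    # Geman-McClure robust penalty (one common parameterization):
    # rho(e) = ||e||^2 * sigma^2 / (||e||^2 + sigma^2).
    # Small residuals behave near-quadratically, while large (outlier)
    # residuals saturate towards sigma^2 and cannot dominate the objective.
    sq = np.sum(np.square(residual), axis=-1)
    return sq * sigma ** 2 / (sq + sigma ** 2)

small = geman_mcclure(np.array([[1.0, 0.0]]))    # near-quadratic regime
large = geman_mcclure(np.array([[1e4, 0.0]]))    # saturates near sigma^2
```

The saturation is what makes spurious keypoint detections far less damaging than under a plain squared error.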
To solve over the entire window and fit our solution to all available data constraints, we reconstruct the intermediate frames via interpolation:
(5) | $\boldsymbol{\theta}^{t}=\mathcal{G}(\mathcal{S}^{t}(\mathbf{z}^{0},\mathbf{z}^{T})),\!\left[\!\begin{smallmatrix}\mathbf{R}^{t}&\mathbf{t}^{t}\\ \mathbf{0}&1\end{smallmatrix}\!\right]\!\!=\!\!\left[\!\begin{smallmatrix}\mathcal{S}^{t}(\mathbf{R}^{0},\mathbf{R}^{T})&\mathcal{L}^{t}(\mathbf{t}^{0},\mathbf{t}^{T})\\ \mathbf{0}&1\end{smallmatrix}\!\right],$ |
with $\mathcal{S}^{t}$ and $\mathcal{L}^{t}$ being spherical and linear interpolation functions that map $t$ to the closed unit interval $[0,1]$ within $\mathcal{T}$ to blend the starting ($t=0$) and end ($t=T$) points, with the spherical variant defined as:
(6) | $\mathcal{S}^{t}(\mathbf{z}^{0},\mathbf{z}^{T})=\frac{\sin\big{(}(1-\frac{t}{T})\varphi\big{)}}{\sin{\varphi}}\mathbf{z}^{0}+\frac{\sin\big{(}\frac{t}{T}\varphi\big{)}}{\sin{\varphi}}\mathbf{z}^{T},$ |
with $\varphi$ representing the angular distance between the two latent codes $\mathbf{z}^{0}$ and $\mathbf{z}^{T}$ (Davydov et al., 2022).
Specifically for the pose $\boldsymbol{\theta}$, this process crucially relies on a well-trained generator $\mathcal{G}$ that captures an expressive manifold $\mathcal{M}$ that can be smoothly interpolated to reconstruct correspondingly smooth pose space transitions.
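The spherical interpolation of Eq. (6) can be sketched as below; the linear fallback for near-parallel codes is a numerical safeguard added here as an implementation assumption, not something stated in the text:

```python
import numpy as np

def slerp(z0, zT, alpha, eps=1e-8):
    # Spherical linear interpolation between latent codes z0 and zT,
    # with alpha = t / T in [0, 1] mapping the frame into the window.
    cos = np.dot(z0, zT) / (np.linalg.norm(z0) * np.linalg.norm(zT))
    ang = np.arccos(np.clip(cos, -1.0, 1.0))  # angular distance of the codes
    if ang < eps:                             # nearly parallel: fall back to lerp
        return (1.0 - alpha) * z0 + alpha * zT
    return (np.sin((1.0 - alpha) * ang) * z0
            + np.sin(alpha * ang) * zT) / np.sin(ang)

z0, zT = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mid = slerp(z0, zT, 0.5)   # stays on the unit circle, unlike lerp
```

Unlike linear interpolation, the spherical variant preserves the codes' norm, which keeps intermediate codes on (rather than inside) the sphere spanned by the keyframes.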
4. Results
4.1. Implementation Details.
We use SMPL-X (Pavlakos et al., 2019) as the parametric human body model $\mathcal{B}$. Even though we only optimize for the body joints and ignore the expressive parts, i.e. the hands and face, we use SMPL-X as its corresponding pose prior, VPoser (Pavlakos et al., 2019), offers a well-trained decoder/generator $\mathcal{G}$ over the manifold $\mathcal{M}$ of plausible poses. To obtain the 2D keypoint constraints $\mathbf{k}_{det}$ and the confidence score $w_{i}$ for the $i^{th}$ joint, we use OpenPose (Cao et al., 2019), an established 2D keypoint estimator that predicts human joint positions in the 2D image space. We solve the optimization problem using the limited-memory BFGS (L-BFGS) optimizer (Wright et al., 1999), with a strong Wolfe line search strategy to determine suitable step sizes. Optimization is performed with a budget of $30$ iterations, striking a balance between convergence speed and computational resources. All experiments are implemented with a custom PyTorch-based (Paszke et al., 2017) framework (moai: PyTorch Model Development Kit, 2021) using the same PyTorch version, specifically 1.12. We set the weights for the data and prior terms to $1.0$ and $10.76$ respectively, and optimize only for a single stage, using a temporal window of length $|\mathcal{T}|=10$.
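The optimizer setup can be sketched with PyTorch's built-in L-BFGS, which supports a strong Wolfe line search; the toy quadratic objective below is a stand-in for the actual mocap energy, not the method's real objective:

```python
import torch

params = torch.zeros(32, requires_grad=True)   # e.g. a latent code z
target = torch.randn(32)                       # stand-in observations

# L-BFGS with strong Wolfe line search and a 30-iteration budget,
# mirroring the settings described above.
opt = torch.optim.LBFGS([params], max_iter=30, line_search_fn="strong_wolfe")

def closure():
    # L-BFGS re-evaluates the objective during line search,
    # hence the closure interface.
    opt.zero_grad()
    loss = torch.sum((params - target) ** 2)   # placeholder data term
    loss.backward()
    return loss

opt.step(closure)                              # runs the full budget
```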
4.2. Sliding Window Optimization
Our formulation allows for an efficient sliding window optimization implementation. Instead of optimizing Eq. (2) over both latent keyframes, we can optimize only for the next keyframe, i.e. $(\mathbf{z}^{T},\mathbf{T}^{T})$, while keeping the first keyframe, i.e. $(\mathbf{z}^{0},\mathbf{T}^{0})$, fixed. We initialize the fitting process with a single-frame fit to acquire the first frame ($t=0$) solution for the first temporal window $\mathcal{T}^{1}$ and then solve for that window’s next latent keyframe $t=T^{1}$. The optimization process then slides to the next temporal window $\mathcal{T}^{2}$, using the latent keyframe $t=T^{1}$ as the first keyframe of $\mathcal{T}^{2}$ and keeping it fixed, while solving for the next latent keyframe $t=T^{2}$. This process repeats until all temporal segments are solved, effectively splitting an $F$-frame sequence into $N=F/T$ temporal windows $\mathcal{T}^{i},i\in[1,N]$ of length $|\mathcal{T}^{i}|=T+1$, represented as $N+1$ latent keyframes $(\mathbf{z}^{i},\mathbf{T}^{i})$. For the first frame we solve Eq. (1) with the same process and weights using $2$ stages of optimization. This initial single-frame fit is initialized using an HMR (Kanazawa et al., 2018) prediction, which also initializes the shape parameters $\boldsymbol{\beta}$. The latter are kept constant for the entire sequence. This process is consistent across all compared methods: any method relying on regressor estimates uses HMR, and the same shape parameters are used for all methods.
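The sliding-window scheme above can be sketched as follows; `fit_first_frame` and `solve_window` are hypothetical stand-ins for the single-frame fit and the per-window keyframe solve:

```python
import numpy as np

def solve_sliding_windows(num_frames, window_len, fit_first_frame, solve_window):
    # Fit the first frame once, then slide over windows of length T:
    # each window keeps the previous end keyframe fixed and solves only
    # for its own end keyframe, yielding N + 1 keyframes for N windows.
    keyframes = [fit_first_frame()]
    for t0 in range(0, num_frames - 1, window_len):
        keyframes.append(solve_window(keyframes[-1], t0))
    return keyframes

# Toy run: 20 frames with T = 10 -> 2 windows -> 3 latent keyframes.
kf = solve_sliding_windows(
    20, 10,
    fit_first_frame=lambda: np.zeros(2),
    solve_window=lambda z_fixed, t0: z_fixed + 1.0)
```

Only the first call involves a full single-frame fit; every subsequent solve optimizes a single new keyframe, which is where the efficiency gain comes from.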
4.3. Experimental Setup
4.3.1. Datasets
We use two standard multi-view datasets for assessing the performance of our approach. They offer a high level of pose variance, ranging from simple performances to challenging motions.
Human3.6M (Ionescu et al., 2014) is a large-scale dataset for 3D human pose estimation, including $3.6M$ video frames from four synchronized cameras and 3D body joint annotations acquired from a marker-based optical MoCap system. It includes $11$ human subjects (five females and six males), and, following previous works (Arnab et al., 2019), S1, S5, S6, S7, and S8 are used for training and S9 and S11 for testing.
MPI-INF-3DHP (Mehta et al., 2017) is a dataset for 3D human pose estimation captured with a multi-camera markerless MoCap system. Since its test data includes single-view images, only the training data, composed of multi-view (i.e. $14$) images, are used in our experiments. Following prior works, one subject (S8) out of the total eight captured subjects is used for testing. Likewise, the views numbered $0$, $2$, $7$, and $8$ are used for our experiments.
4.3.2. Metrics
Performance is assessed over a range of error and accuracy metrics. The typically reported MPJPE evaluates joint position estimation error, accompanied by a joint-level RMSE that accentuates higher errors. Since we capture the full articulation of the subject using a template mesh, we also report the mean angular error (MAE) of the kinematic chain’s rotations. Finally, the accuracy of the estimates is reported with distance-thresholded success rates (PCK) using two different thresholds set at $3$ and $7$ cm.
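The position metrics can be sketched as below (errors in mm, PCK as a fraction; thresholds of 30 and 70 mm correspond to the 3 and 7 cm settings):

```python
import numpy as np

def mpjpe(pred, gt):
    # Mean per-joint position error: average Euclidean distance.
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def rmse(pred, gt):
    # Joint-level RMSE, which accentuates higher errors.
    return float(np.sqrt(np.mean(np.sum((pred - gt) ** 2, axis=-1))))

def pck(pred, gt, threshold):
    # Fraction of joints whose position error falls under the threshold.
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1) < threshold))

gt = np.zeros((4, 3))
pred = np.array([[10., 0., 0.], [0., 20., 0.], [0., 0., 40.], [80., 0., 0.]])
err = mpjpe(pred, gt)       # (10 + 20 + 40 + 80) / 4 = 37.5 mm
acc = pck(pred, gt, 30.0)   # 2 of 4 joints under 30 mm -> 0.5
```

Note how the single 80 mm outlier joint inflates the RMSE relative to the MPJPE, which is exactly why both are reported.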
4.3.3. Methods
As a baseline method we use standard multi-view fitting that considers each frame separately. We adapt a multi-staged fitting approach (Huang et al., 2017) ($MuVS$), using a data-driven prior, VPoser (Pavlakos et al., 2019), instead of optimizing directly over the angular domain and relying on a Gaussian Mixture Model (GMM) for pose plausibility. In addition, we initialize the solution with a data-driven regressor estimate from HMR (Kanazawa et al., 2018).
Then a variety of bundle solving methods are compared with our approach. We start from a multi-staged multi-view MoCap method (Huang et al., 2017) (DCT) that solves for a bundle of frames simultaneously at the last stage using a low frequency basis prior for enforcing smoothly varying joint positions. Another approach exploits the entire temporal context (Arnab et al., 2019) (ETC) and solves for all frames’ parameters simultaneously using a joint smoothness objective, starting from HMR predictions. ETC is adapted from the monocular case to our experimental multi-view setting. Then, we use the bundle solving approach of DMMR (Huang et al., 2021) using a temporal VPoser motion prior. While DMMR solves for the camera poses as well, in our experiments the camera poses are known and fixed, ensuring a proper and isolated bundle solving comparison. DMMR achieves motion smoothness by constraining the nearby optimized latent codes to be close. Similarly, SLAHMR (Ye et al., 2023) also solves for the camera’s motion, but this part is skipped in our experiments, instead using the known extrinsic camera calibration parameters, similar to DMMR. SLAHMR uses a multi-stage fitting approach relying on both VPoser, as well as HuMoR, an autoregressive model for solving over a temporal window and enforcing smooth motion with a joint smoothness objective. Its constraints are straightforwardly adapted from single view to multi-view for our experiments.
4.4. Discussion
4.4.1. Performance and robustness to outliers
Tables 1 and 2 present the quantitative results on the Human3.6M and MPI-INF-3DHP datasets respectively. Our BundleMoCap approach outperforms all other methods, which optimize and solve over multiple stages and use motion smoothness constraints. Notably, BundleMoCap relies on a pose manifold and its local smoothness to reconstruct locally smooth motion via manifold interpolation, in contrast to the autoregressive HuMoR temporal prior (Rempe et al., 2021) used by SLAHMR (Ye et al., 2023), or the recurrent VPoser-t motion prior of DMMR (Huang et al., 2021) that reconstructs motion segments. Further, all other bundle solving methods (Huang et al., 2017, 2021; Arnab et al., 2019; Ye et al., 2023) solve multiple frames in a bundle, whereas BundleMoCap solves for a single latent keyframe while reconstructing the intermediate frames. Crucially, this property offers higher robustness to outliers, as it is not possible to interpolate to a spurious pose that would manifest due to erroneous/conflicting multi-view keypoint estimates. Such cases are illustrated in Figure 3 in a sequence that contains segments suffering from outlier keypoint estimates. These result in temporally inconsistent results for all other methods, compared to the BundleMoCap results that remain temporally coherent.
Table 1. Quantitative results on the Human3.6M dataset.

| Method | MPJPE$\downarrow$ | RMSE$\downarrow$ | MAE$\downarrow$ | PCK3$\uparrow$ | PCK7$\uparrow$ |
|---|---|---|---|---|---|
| ${MuVS}$ | 43.83 $mm$ | 48.33 $mm$ | 4.56${}^{\circ}$ | 40.58% | 93.75% |
| DCT (Huang et al., 2017) | 42.34 $mm$ | 48.13 $mm$ | 4.18${}^{\circ}$ | 40.84% | 94.97% |
| DMMR (Huang et al., 2021) | 41.30 $mm$ | 47.66 $mm$ | 4.76${}^{\circ}$ | 39.52% | 92.11% |
| SLAHMR (Ye et al., 2023) | 40.80 $mm$ | 42.66 $mm$ | 4.06${}^{\circ}$ | 40.86% | 94.97% |
| ETC (Arnab et al., 2019) | 37.51 $mm$ | 41.53 $mm$ | 4.05${}^{\circ}$ | 40.30% | 94.50% |
| BundleMoCap (Ours) | 36.48 $mm$ | 40.34 $mm$ | 3.96${}^{\circ}$ | 41.16% | 95.71% |
Table 2. Quantitative results on the MPI-INF-3DHP dataset.

| Method | MPJPE$\downarrow$ | RMSE$\downarrow$ | MAE$\downarrow$ | PCK3$\uparrow$ | PCK7$\uparrow$ |
|---|---|---|---|---|---|
| ${MuVS}$ | 64.99 $mm$ | 76.12 $mm$ | 6.28${}^{\circ}$ | 28.20% | 73.75% |
| DCT (Huang et al., 2017) | 62.43 $mm$ | 68.13 $mm$ | 6.18${}^{\circ}$ | 35.84% | 83.77% |
| DMMR (Huang et al., 2021) | 57.51 $mm$ | 67.66 $mm$ | 6.06${}^{\circ}$ | 37.52% | 81.11% |
| SLAHMR (Ye et al., 2023) | 61.80 $mm$ | 62.55 $mm$ | 5.74${}^{\circ}$ | 40.86% | 83.97% |
| ETC (Arnab et al., 2019) | 59.51 $mm$ | 61.32 $mm$ | 5.64${}^{\circ}$ | 39.30% | 84.50% |
| BundleMoCap (Ours) | 56.41 $mm$ | 59.12 $mm$ | 5.43${}^{\circ}$ | 44.51% | 85.71% |
4.4.2. Ensuring smooth motions
Apart from the temporal coherence, BundleMoCap delivers smooth motion captures, minimizing jitter despite noisy keypoint estimates. This is achieved without using any temporal smoothness objective, in contrast to the other methods using joint smoothness (Huang et al., 2017; Arnab et al., 2019; Ye et al., 2023) or latent code smoothness (Huang et al., 2021). To illustrate this point, we extract the knee flexion, an angle between three joints, namely the hip, knee and ankle. Any jitter in these joint estimates will result in noisy angle extraction. As evident in Figure 4, the angle extracted from the results of our method exhibits smoother motion compared to both a segment solver (Huang et al., 2017) and an entire sequence solver (Arnab et al., 2019). Additional results showcasing the smooth MoCap that BundleMoCap offers can be found in the supplementary video, either as standalone qualitative results or compared to the other approaches (using the same color coding as Figure 3).
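Extracting the knee flexion described above amounts to computing the angle at the knee between the thigh and shank vectors; a minimal sketch:

```python
import numpy as np

def flexion_angle(hip, knee, ankle):
    # Angle at the knee (degrees) between the knee->hip and knee->ankle
    # vectors; jitter in any of the three joints surfaces as noise here.
    u, v = hip - knee, ankle - knee
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# A fully extended leg (collinear joints) yields 180 degrees.
ang = flexion_angle(np.array([0., 1., 0.]),
                    np.array([0., 0., 0.]),
                    np.array([0., -1., 0.]))
```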
4.4.3. Runtime efficiency
Finally, BundleMoCap is a highly efficient method for solving sparse multi-view sequences compared to other approaches whose complexity does not scale efficiently with the sequence size. BundleMoCap solves a single/first frame in isolation and then solves for frame bundles of length $10$ using a single latent code. In contrast, other approaches either require per-frame initialization before solving the entire sequence (Arnab et al., 2019), single-frame fits over the entire sequence before solving bundles of frames (Huang et al., 2017; Ye et al., 2023), or solve bundles of frames whilst optimizing each frame’s latent code across multiple ($4$) stages (Huang et al., 2021). BundleMoCap solves the entire sequence in a single shot, using a single sliding window optimization stage. Indicatively, the average optimization runtimes for each method are as follows: BundleMoCap takes around 1 hour to process a 4-view sequence of 2000 frames, while DMMR involves a longer process with 4 stages spanning almost 1 day (24 hours). ETC requires an initialization pass over the entire sequence, with the optimization process taking approximately 3 hours. For DCT, the initialization from ${MuVS}$ consumes around 40 minutes, and the subsequent optimization requires an additional 9 hours. Lastly, SLAHMR takes an average of 6 hours for optimization. These times are approximate and may differ based on the specific hardware and software configurations employed; in our implementation, the same software version and hardware were used for consistency.
Even though these timings do not take into account the keypoint detector, and all implementations can surely be improved, the above analysis aims to showcase the complexity differences between methods requiring multiple passes over the videos, either for initialization or as multiple optimization stages, in addition to the gains from solving for fewer parameters and frames. Figure 5 provides a comprehensive overview of the performance of each method alongside its efficiency, illustrated by each point’s size, which represents its relative runtime.
4.4.4. Error Accumulation
Using a sliding window approach comes with the risk of errors accumulating from one temporal window to the next. BundleMoCap is robust to such drifting due to its bundle solving nature. Solving with constraints on the single keyframe only and then reconstructing the temporal window would suffer when outliers manifest at that specific keyframe. Instead, BundleMoCap is constrained by the entire temporal window, which reduces the chances of errors accumulating across the sequence. This is experimentally verified in Table 2, where the solved sequences span minutes. In such long sequences drifting would hurt performance, yet BundleMoCap achieves good quantitative results. Still, BundleMoCap crucially relies on the assumption that small paths on the manifold can reconstruct small motions. It remains to be investigated whether the expressivity of the manifold is hindered when bundle solving larger temporal windows, and how that relates to the accumulation of errors.
5. Conclusion
In this work we have presented a novel method for solving MoCap in sparse multi-view videos. Exploiting manifold interpolation, we solve for a bundle of frames reconstructed via two latent keyframes. The resulting method is efficient, produces smooth motions, and exhibits robustness to outlier observations. However, it assumes that linear interpolation steps on the manifold correspond to linear pose space displacements. While this may hold for some temporal window lengths, it remains to be investigated if this assumption holds for smaller/larger ones. Similarly, BundleMoCap relies on high quality local manifold transitions, a trait that can be improved using different generative models. Still, the diversity of generative models’ learned distributions may be limited by their training data. In closing, while spherical interpolation proved to be sufficient, more involved interpolation schemes may allow for extending the temporal window or boost solving accuracy. Overall, we believe that there is opportunity for further investigation in this line of bundle MoCap solving.
References
- Arnab et al. (2019) Anurag Arnab, Carl Doersch, and Andrew Zisserman. 2019. Exploiting temporal context for 3D human pose estimation in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3395–3404.
- Bartol et al. (2022) Kristijan Bartol, David Bojanić, Tomislav Petković, and Tomislav Pribanić. 2022. Generalizable human pose triangulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11028–11037.
- Bhatnagar et al. (2020) Bharat Lal Bhatnagar, Cristian Sminchisescu, Christian Theobalt, and Gerard Pons-Moll. 2020. Loopreg: Self-supervised learning of implicit surface correspondences, pose and shape for 3d human mesh registration. Advances in Neural Information Processing Systems 33 (2020), 12909–12922.
- Bogo et al. (2016) Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. 2016. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In European conference on computer vision. Springer, 561–578.
- Cao et al. (2019) Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. 2019. OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019).
- Cheng et al. (2022) Wei Cheng, Su Xu, Jingtan Piao, Chen Qian, Wayne Wu, Kwan-Yee Lin, and Hongsheng Li. 2022. Generalizable Neural Performer: Learning Robust Radiance Fields for Human Novel View Synthesis. arXiv preprint arXiv:2204.11798 (2022).
- Davydov et al. (2022) Andrey Davydov, Anastasia Remizova, Victor Constantin, Sina Honari, Mathieu Salzmann, and Pascal Fua. 2022. Adversarial parametric pose prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10997–11005.
- Gong et al. (2023) Xuan Gong, Liangchen Song, Meng Zheng, Benjamin Planche, Terrence Chen, Junsong Yuan, David Doermann, and Ziyan Wu. 2023. Progressive Multi-View Human Mesh Recovery with Self-Supervision. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 676–684.
- Huang et al. (2021) Buzhen Huang, Yuan Shu, Tianshu Zhang, and Yangang Wang. 2021. Dynamic multi-person mesh recovery from uncalibrated multi-view cameras. In 2021 International Conference on 3D Vision (3DV). IEEE, 710–720.
- Huang et al. (2017) Yinghao Huang, Federica Bogo, Christoph Lassner, Angjoo Kanazawa, Peter V Gehler, Javier Romero, Ijaz Akhter, and Michael J Black. 2017. Towards accurate marker-less human shape and pose estimation over time. In 2017 international conference on 3D vision (3DV). IEEE, 421–430.
- Huang et al. (2022) Yinghao Huang, Omid Taheri, Michael J Black, and Dimitrios Tzionas. 2022. InterCap: Joint Markerless 3D Tracking of Humans and Objects in Interaction. In DAGM German Conference on Pattern Recognition. Springer, 281–299.
- Ingwersen et al. (2023) Christian Keilstrup Ingwersen, Christian Møller Mikkelstrup, Janus Nørtoft Jensen, Morten Rieger Hannemose, and Anders Bjorholm Dahl. 2023. SportsPose-A Dynamic 3D sports pose dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5218–5227.
- Ionescu et al. (2014) Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. 2014. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 7 (jul 2014), 1325–1339.
- Iskakov et al. (2019) Karim Iskakov, Egor Burkov, Victor Lempitsky, and Yury Malkov. 2019. Learnable triangulation of human pose. In Proceedings of the IEEE/CVF international conference on computer vision. 7718–7727.
- Jiang et al. (2022) Xiangjian Jiang, Xuecheng Nie, Zitian Wang, Luoqi Liu, and Si Liu. 2022. Multi-view Human Body Mesh Translator. arXiv preprint arXiv:2210.01886 (2022).
- Jin and Liu (2023) Pengle Jin and Xinguo Liu. 2023. Robust human motion estimation using bidirectional motion prior model and spatiotemporal progressive motion optimization. Computers & Graphics (2023).
- Kanazawa et al. (2018) Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. 2018. End-to-end recovery of human shape and pose. In Proceedings of the IEEE conference on computer vision and pattern recognition. 7122–7131.
- Kanazawa et al. (2019) Angjoo Kanazawa, Jason Y Zhang, Panna Felsen, and Jitendra Malik. 2019. Learning 3d human dynamics from video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 5614–5623.
- Kingma and Welling (2015) Diederik P. Kingma and Max Welling. 2015. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR).
- Li et al. (2021) Runze Li, Srikrishna Karanam, Ren Li, Terrence Chen, Bir Bhanu, and Ziyan Wu. 2021. Learning local recurrent models for human mesh recovery. In 2021 International Conference on 3D Vision (3DV). IEEE, 555–564.
- Loper et al. (2014) Matthew Loper, Naureen Mahmood, and Michael J Black. 2014. MoSh: motion and shape capture from sparse markers. ACM Trans. Graph. 33, 6 (2014), 220–1.
- Mahmood et al. (2019) Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. 2019. AMASS: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 5442–5451.
- Mehta et al. (2017) Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. 2017. Monocular 3D Human Pose Estimation In The Wild Using Improved CNN Supervision. In 2017 International Conference on 3D Vision (3DV). IEEE. https://doi.org/10.1109/3dv.2017.00064
- moai: PyTorch Model Development Kit (2021) moai: PyTorch Model Development Kit 2021. moai: Accelerating modern data-driven workflows. https://github.com/ai-in-motion/moai.
- Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in pytorch. (2017).
- Pavlakos et al. (2019) Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. 2019. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10975–10985.
- Peng et al. (2018) Xue Bin Peng, Angjoo Kanazawa, Jitendra Malik, Pieter Abbeel, and Sergey Levine. 2018. Sfv: Reinforcement learning of physical skills from videos. ACM Transactions On Graphics (TOG) 37, 6 (2018), 1–14.
- Rempe et al. (2021) Davis Rempe, Tolga Birdal, Aaron Hertzmann, Jimei Yang, Srinath Sridhar, and Leonidas J Guibas. 2021. Humor: 3d human motion model for robust pose estimation. In Proceedings of the IEEE/CVF international conference on computer vision. 11488–11499.
- Ruggero Ronchi and Perona (2017) Matteo Ruggero Ronchi and Pietro Perona. 2017. Benchmarking and error diagnosis in multi-instance pose estimation. In Proceedings of the IEEE international conference on computer vision. 369–378.
- Saini et al. (2023) Nitin Saini, Chun-Hao P Huang, Michael J Black, and Aamir Ahmad. 2023. SmartMocap: Joint Estimation of Human and Camera Motion Using Uncalibrated RGB Cameras. IEEE Robotics and Automation Letters (2023).
- Shen et al. (2023) Xiaolong Shen, Zongxin Yang, Xiaohan Wang, Jianxin Ma, Chang Zhou, and Yi Yang. 2023. Global-to-Local Modeling for Video-based 3D Human Pose and Shape Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8887–8896.
- Tang et al. (2023) Zhenhua Tang, Zhaofan Qiu, Yanbin Hao, Richang Hong, and Ting Yao. 2023. 3D Human Pose Estimation With Spatio-Temporal Criss-Cross Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4790–4799.
- Tian et al. (2023) Yating Tian, Hongwen Zhang, Yebin Liu, and Limin Wang. 2023. Recovering 3d human mesh from monocular images: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
- Tiwari et al. (2022) Garvita Tiwari, Dimitrije Antić, Jan Eric Lenssen, Nikolaos Sarafianos, Tony Tung, and Gerard Pons-Moll. 2022. Pose-NDF: Modeling Human Pose Manifolds with Neural Distance Fields. In European Conference on Computer Vision. Springer, 572–589.
- Wei et al. (2022) Wen-Li Wei, Jen-Chun Lin, Tyng-Luh Liu, and Hong-Yuan Mark Liao. 2022. Capturing humans in motion: Temporal-attentive 3D human pose and shape estimation from monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13211–13220.
- Wright et al. (1999) Stephen Wright, Jorge Nocedal, et al. 1999. Numerical Optimization. Springer.
- Ye et al. (2022) Hang Ye, Wentao Zhu, Chunyu Wang, Rujie Wu, and Yizhou Wang. 2022. Faster VoxelPose: Real-time 3D Human Pose Estimation by Orthographic Projection. In Proc. European Conference on Computer Vision (ECCV). Springer, 142–159.
- Ye et al. (2023) Vickie Ye, Georgios Pavlakos, Jitendra Malik, and Angjoo Kanazawa. 2023. Decoupling human and camera motion from videos in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 21222–21232.
- Zanfir et al. (2018) Andrei Zanfir, Elisabeta Marinoiu, and Cristian Sminchisescu. 2018. Monocular 3d pose and shape estimation of multiple people in natural scenes-the importance of multiple scene constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2148–2157.
- Zeng et al. (2022) Ailing Zeng, Lei Yang, Xuan Ju, Jiefeng Li, Jianyi Wang, and Qiang Xu. 2022. Smoothnet: A plug-and-play network for refining human poses in videos. In European Conference on Computer Vision. Springer, 625–642.
- Zhang et al. (2023) Juze Zhang, Haimin Luo, Hongdi Yang, Xinru Xu, Qianyang Wu, Ye Shi, Jingyi Yu, Lan Xu, and Jingya Wang. 2023. NeuralDome: A Neural Modeling Pipeline on Multi-View Human-Object Interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8834–8845.
- Zhang et al. (2021) Siwei Zhang, Yan Zhang, Federica Bogo, Marc Pollefeys, and Siyu Tang. 2021. Learning motion priors for 4d human body capture in 3d scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 11343–11353.
- Zhang et al. (2018) Xiuming Zhang, Tali Dekel, Tianfan Xue, Andrew Owens, Qiurui He, Jiajun Wu, Stefanie Mueller, and William T Freeman. 2018. Mosculp: Interactive visualization of shape and time. In Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology. 275–285.
- Zhao et al. (2022) Fuqiang Zhao, Wei Yang, Jiakai Zhang, Pei Lin, Yingliang Zhang, Jingyi Yu, and Lan Xu. 2022. Humannerf: Efficiently generated human radiance field from sparse inputs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7743–7753.