Optimized CUDA Implementation to Improve the Performance of Bundle Adjustment Algorithm on GPUs

Kommera, Pranay R. and Muknahallipatna, Suresh S. and McInroy, John E. (2024) Optimized CUDA Implementation to Improve the Performance of Bundle Adjustment Algorithm on GPUs. Journal of Software Engineering and Applications, 17 (04). pp. 172-201. ISSN 1945-3116


Abstract

The 3D reconstruction pipeline uses the Bundle Adjustment algorithm to refine the camera and point parameters. The Bundle Adjustment algorithm is compute-intensive, and many researchers have improved its performance by implementing it on GPUs. In the previous research work, “Improving Accuracy and Computational Burden of Bundle Adjustment Algorithm using GPUs,” the authors first improved the algorithmic performance of Bundle Adjustment by reducing the mean square error using an additional radial distortion parameter and explicitly computed analytical derivatives, and then reduced its computational burden using GPUs. With the naïve implementation of the CUDA code, a speedup of 10× was achieved for the largest dataset of 13,678 cameras, 4,455,747 points, and 28,975,571 projections. In this paper, we present the optimization of the Bundle Adjustment algorithm CUDA code on GPUs to achieve higher speedup. We propose a new memory layout for the parameters in the Bundle Adjustment algorithm, resulting in contiguous memory access. We demonstrate that it improves the memory throughput on the GPUs, thereby improving the overall performance. We also demonstrate an increase in the computational throughput of the algorithm by optimizing the CUDA kernels to utilize the GPU resources effectively. We present a comparative performance study of explicitly computing an algorithm parameter versus using the Jacobians instead. In the previous work, the Bundle Adjustment algorithm failed to converge for certain datasets because several block matrices of the cameras in the augmented normal equation were rank-deficient. In this work, we identify the cameras that cause rank-deficient matrices and preprocess the datasets to ensure the convergence of the Bundle Adjustment algorithm.
Our optimized CUDA implementation achieves convergence of the Bundle Adjustment algorithm in around 22 seconds for the largest dataset, compared to 654 seconds for the sequential implementation, a speedup of about 30× (654 / 22 ≈ 29.7). For the same dataset, this is a 3× speedup over the previous naïve CUDA implementation.
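
The abstract does not describe the proposed memory layout in detail. As a rough illustration of how a layout change can produce contiguous access, the following host-side C++ sketch contrasts a camera-major layout with a parameter-major one. This is a common technique and not necessarily the authors' layout. The parameter count `kParamsPerCamera` and the function names `aos_index`, `soa_index`, and `to_parameter_major` are hypothetical and do not come from the paper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical number of parameters per camera; the abstract does not state it.
constexpr std::size_t kParamsPerCamera = 9;

// Camera-major layout: all parameters of one camera are adjacent.
// If thread t handles camera t, neighbouring threads reading the same
// parameter touch addresses kParamsPerCamera elements apart (strided access).
inline std::size_t aos_index(std::size_t cam, std::size_t param) {
    return cam * kParamsPerCamera + param;
}

// Parameter-major layout: one parameter of all cameras is adjacent.
// Neighbouring threads reading the same parameter touch neighbouring
// addresses, which is the pattern a GPU can coalesce.
inline std::size_t soa_index(std::size_t cam, std::size_t param,
                             std::size_t num_cams) {
    return param * num_cams + cam;
}

// Reorder a camera-major buffer into a parameter-major buffer.
std::vector<float> to_parameter_major(const std::vector<float>& aos,
                                      std::size_t num_cams) {
    assert(aos.size() == num_cams * kParamsPerCamera);
    std::vector<float> soa(aos.size());
    for (std::size_t cam = 0; cam < num_cams; ++cam) {
        for (std::size_t param = 0; param < kParamsPerCamera; ++param) {
            soa[soa_index(cam, param, num_cams)] = aos[aos_index(cam, param)];
        }
    }
    return soa;
}
```

In a kernel where each thread processes one camera, indexing through `soa_index` would let consecutive threads read consecutive elements. Whether that matches the access pattern of the paper's kernels is an assumption here.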

Item Type: Article
Subjects: Pustakas > Multidisciplinary
Depositing User: Unnamed user with email support@pustakas.com
Date Deposited: 07 May 2024 11:04
Last Modified: 07 May 2024 11:04
URI: http://archive.pcbmb.org/id/eprint/1997
