Providing high-quality video conferencing experience over the best-effort Internet and wireless networks is challenging, because 2D videos are bulky. In this paper, we exploit the common structure of conferencing videos for an ultra-low-bitrate video conferencing system. In particular, we design, implement, optimize, and evaluate a video conferencing system, which: (i) extracts facial landmarks, (ii) transmits the selected facial landmarks and 2D images, and (iii) warps the untransmitted 2D images at the receiver. Several optimization techniques are adopted for minimizing the running time and maximizing the video quality, e.g., the image and warping frames are optimally determined based on network conditions and video content. The experiment results from real conferencing videos reveal that our proposed system: (i) outperforms the stateof-the-art x265 by up to 11.05 dB in PSNR (Peak Signal-to-Noise Ratio), (ii) adapts to different video content and network conditions, and (iii) runs in real-time at about 12 frame-per-second.