This paper presents an architectural technique for efficiently implementing multi-stage additions through operand segmentation. Carry bypass is used to break the dependency between the two segmented adders, which shortens the critical path. The resulting timing margin can be spent on architectural transformations that yield power- and area-efficient hardware, at the cost of one extra clock cycle. Compared with existing segmented adders, the proposed architecture has the lowest hardware overhead and nearly the same execution time. An accumulator and a 16-tap FIR filter are used to demonstrate the delay, power, and area improvements of the proposed technique. Synthesis results show that the delay is improved by up to 42% and 28.1%, respectively. Under the same timing constraint, the adder area is reduced by 27.4% and 12.4%, respectively.
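To make the idea of operand segmentation concrete, the following is a minimal behavioral sketch in Python of a two-segment accumulator. It is an illustration under stated assumptions, not the circuit proposed in the paper: the function name `segmented_accumulate`, the 16-bit width, and the 8-bit split point are all hypothetical, and the carry handling shown here (the low segment's carry-out is held in a register and consumed by the high segment one cycle later) is only one plausible reading of how the dependency between the two adders might be broken. Under that reading, a final flush cycle is needed to absorb the last pending carry, which would correspond to the one extra clock cycle mentioned above.

```python
def segmented_accumulate(values, width=16, split=8):
    """Behavioral model of a two-segment accumulator (illustrative assumption).

    Each loop iteration models one clock cycle. The high segment never waits
    for the low segment's carry in the same cycle; it uses the carry that was
    registered in the previous cycle instead.
    """
    full_mask = (1 << width) - 1
    low_mask = (1 << split) - 1
    high_mask = (1 << (width - split)) - 1

    acc_low = 0      # low-segment accumulator register
    acc_high = 0     # high-segment accumulator register
    carry_reg = 0    # registered carry-out of the low segment

    for v in values:
        v &= full_mask
        v_low = v & low_mask
        v_high = v >> split

        # High segment: adds the carry produced one cycle earlier.
        next_high = (acc_high + v_high + carry_reg) & high_mask

        # Low segment: independent addition; its carry-out is registered.
        low_sum = acc_low + v_low
        next_low = low_sum & low_mask
        next_carry = low_sum >> split

        acc_high, acc_low, carry_reg = next_high, next_low, next_carry

    # Extra cycle: flush the last pending carry into the high segment.
    acc_high = (acc_high + carry_reg) & high_mask
    return (acc_high << split) | acc_low


if __name__ == "__main__":
    samples = [0x00FF, 0x0001, 0x1234, 0xFFFF]
    print(hex(segmented_accumulate(samples)))
```

In this model the result matches an ordinary full-width accumulation modulo 2^width, while each per-cycle addition spans only one segment; the shorter carry chain is where the timing margin would come from in hardware.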