Rate-distortion optimization (RDO) selects the best coding mode and partition to improve coding efficiency, but its serious data dependency and high complexity hinder an efficient hardware encoder implementation. This paper therefore presents a hardware-friendly fast rate-distortion estimation method and its hardware design. For rate estimation, we propose a context-group adaptive, entropy-based method that gives more precise estimates and allows parallel computation; it is applied to both intra and inter prediction, rather than to intra prediction only as in previous approaches. For distortion estimation, we compute distortion in the transform domain instead of the spatial domain, which saves the inverse transform and image reconstruction, and we reduce cost further by fixing the high-frequency sub-blocks of 32×32 and 16×16 blocks to zero. Simulation results show a BD-rate loss of 1.77% on average. The proposed hardware design adopts an interleaved luma/chroma coding schedule to improve hardware utilization. The final implementation in a TSMC 40 nm CMOS process achieves real-time 4K×2K@30fps encoding with a gate count of 57.95K at a clock frequency of 400 MHz.
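The transform-domain distortion idea can be illustrated with a small numerical sketch. It is not the proposed design: it assumes an orthonormal floating-point DCT-II, a plain uniform quantizer, and a hypothetical `keep` parameter that forces every coefficient outside the top-left `keep`×`keep` low-frequency region to zero. All function names are illustrative. Under these assumptions, Parseval's relation makes the sum of squared differences (SSD) between original and dequantized coefficients equal to the SSD of the reconstructed residual, so the inverse transform is not needed to measure distortion. The integer transforms used in practical codecs are only approximately orthonormal, so in hardware the equality holds only approximately.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n list of rows (assumed transform)."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def transpose(m):
    return [list(r) for r in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def forward(block, c):
    # Separable 2-D transform: C * X * C^T
    return matmul(matmul(c, block), transpose(c))

def inverse(coeff, c):
    # Inverse of the orthonormal transform: C^T * Y * C
    return matmul(matmul(transpose(c), coeff), c)

def quantize(coeff, step, keep):
    # Uniform quantize + dequantize; coefficients outside the top-left
    # keep x keep region are forced to zero (hypothetical zero-out rule).
    return [[round(v / step) * step if r < keep and col < keep else 0.0
             for col, v in enumerate(row)]
            for r, row in enumerate(coeff)]

def ssd(a, b):
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def demo(n=8, step=6.0, keep=4):
    c = dct_matrix(n)
    # Synthetic residual block, for illustration only.
    residual = [[((3 * r + 5 * col) % 11) - 5 for col in range(n)]
                for r in range(n)]
    coeff = forward(residual, c)
    recon_coeff = quantize(coeff, step, keep)
    d_transform = ssd(coeff, recon_coeff)               # no inverse transform
    d_spatial = ssd(residual, inverse(recon_coeff, c))  # conventional path
    return d_transform, d_spatial
```

Calling `demo()` returns the two distortion values, which agree up to floating-point rounding. The `d_transform` path stops after quantization, which is where the saving in inverse transform and reconstruction comes from.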