Load instructions usually have long execution latency in a deep processor pipeline, and have significant impact on overall performance. Therefore, how to hide the load latency becomes a serious problem in processor design. The latency of memory load can be separated into two parts: cache-miss latency and load-to-use latency. Previous work which tried to hide the load latency in a deep processor pipeline has some limitations. In this paper, we propose a hardware-based method, called early load, to hide the load-to-use latency with little hardware overhead. Early load scheme allows load instructions to load data from the cache system before it enters the execution stage. In the meantime, a detection method makes sure the correctness of the early operation before the load instruction enters the execution stage. Our experimental results showed that our approach can achieve 11.64% performance improvement in Dhrystone benchmark and 4.97% in average for MiBench benchmark suite.