On-chip shared caches are effective at alleviating the memory bottleneck in modern many-core systems such as GPGPUs. However, when numerous concurrent threads are scheduled on a GPGPU, a cache-capacity-agnostic scheduling scheme can cause severe cache contention among threads and thus significant performance degradation. The diverse working sets of irregular applications make this contention even more serious. As a result, accounting for cache capacity has become a critical issue in GPGPU scheduling. This paper formulates a Cache Capacity Aware Thread Scheduling Problem that captures the impact of cache capacity as well as different architectural considerations. The problem is proven to be NP-hard, and two algorithms are proposed to perform cache-capacity-aware thread scheduling. Simulation results on Nvidia's Fermi configuration show that the proposed scheduling scheme effectively avoids cache contention, achieving an average cache miss reduction of 44.7% and an average runtime improvement of 28.5%. For more complex applications, the runtime improvement reaches up to 62.5%.