Aiming at accurate action video retrieval, we first present an approach that infers the implicit skeleton structure of a query action, given as an RGB video, and then propose to expand the query with the inferred skeleton to improve retrieval performance. This approach is inspired by the observation that skeleton structures compactly and effectively represent human actions, and thus help bridge the semantic gap in action retrieval. The focus is hence on action skeleton estimation in RGB videos. Specifically, an iterative training procedure is developed to select relevant training data for inferring the skeleton of an input action, since corrupt training data not only degrade performance but also complicate the learning process. Through the iterations, relevant training data are gradually revealed, while more accurate skeletons are inferred from the refined training set. The proposed approach is evaluated on the ChaLearn 2013 dataset, where significant performance gains in action retrieval are achieved with the aid of the inferred skeletons.
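The iterative selection of relevant training data can be sketched as follows. This is a minimal illustration, not the paper's actual procedure: it assumes a simple linear regressor from video features to skeleton coordinates, and treats the samples with the largest reconstruction error in each round as corrupt and drops them before retraining. The function name, the `keep_frac` parameter, and the linear model are all hypothetical choices for exposition.

```python
import numpy as np

def iterative_refinement(X, Y, keep_frac=0.8, n_iters=3):
    """Iteratively select relevant training pairs (feature -> skeleton).

    X: (n_samples, n_features) video feature matrix (hypothetical).
    Y: (n_samples, n_joints)   target skeleton coordinates (hypothetical).

    Each iteration fits a least-squares regressor on the retained subset,
    then keeps only the fraction of samples with the smallest residuals,
    so corrupt samples are gradually removed from the training set.
    """
    idx = np.arange(len(X))
    W = None
    for _ in range(n_iters):
        # Fit the skeleton regressor on the currently retained subset.
        W, *_ = np.linalg.lstsq(X[idx], Y[idx], rcond=None)
        # Score every retained sample by its reconstruction error.
        err = np.linalg.norm(X[idx] @ W - Y[idx], axis=1)
        # Keep the samples with the smallest error for the next round.
        n_keep = max(1, int(keep_frac * len(idx)))
        idx = idx[np.argsort(err)[:n_keep]]
    return W, idx
```

On synthetic data where a minority of skeletons are heavily corrupted, the corrupted samples exhibit large residuals and are discarded in the early rounds, after which the regressor is refit on the cleaner subset.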