This paper addresses the problem of temporal action detection in untrimmed videos. Since actions can be recognized from the objects that occur in a video and their motion, we propose a hierarchical model that consists of two object detection networks for temporal action detection. The first network detects objects in each frame, and the second performs temporal action detection. We also propose a method that converts the object detection results of the first network into a new type of frame that can be fed to the second network. The generated frame has six channels carrying spatiotemporal information beneficial to action detection. Quantitative results on the THUMOS14 dataset demonstrate the superiority of the proposed model, with performance gains over state-of-the-art action detection methods.