A scalar (single-issue) processor executes one instruction at a time, so its functional units (ALU, multiplier, shifter, etc.) are never exercised concurrently. Modern processors issue multiple instructions simultaneously (i.e., superscalar or VLIW) to improve functional-unit utilization, but at considerable cost. This paper describes an alternative that activates multiple functional units concurrently by issuing a single composite instruction to cascaded functional units. An automatic generator for application-specific composite functional units is also presented. In our simulations with popular DSP applications, a 35% increase in operations per cycle is obtained with identical functional units. Moreover, the proposed approach saves up to 16.5% and 31.6% power on scalar and VLIW processors, respectively, at comparable performance.
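As an informal illustration of why cascading can raise operations per cycle, the following minimal sketch counts cycles for a short dependent operation sequence under scalar issue and under composite issue. It is not the simulator or generator used in this work. The names `Op`, `scalar_cycles`, `composite_cycles`, `OPS`, and `CASCADES` are hypothetical, the example operation sequence is invented, and the model assumes that every instruction, composite or not, takes one cycle and that only two adjacent dependent operations may be fused.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One primitive operation (hypothetical model, not the paper's IR)."""
    unit: str     # functional unit that executes it: "mul", "alu", "shift"
    dst: str      # destination register name
    srcs: tuple   # source register names


def scalar_cycles(ops):
    """Scalar issue: one operation per cycle (assumed single-cycle units)."""
    return len(ops)


def composite_cycles(ops, cascades):
    """Greedily fuse adjacent dependent ops into one composite instruction.

    Two ops are fused when their units form an allowed cascade and the
    second op consumes the result of the first. Each instruction, fused
    or not, is assumed to take one cycle.
    """
    cycles = 0
    i = 0
    while i < len(ops):
        can_fuse = (
            i + 1 < len(ops)
            and (ops[i].unit, ops[i + 1].unit) in cascades
            and ops[i].dst in ops[i + 1].srcs
        )
        i += 2 if can_fuse else 1
        cycles += 1
    return cycles


# Invented DSP-like sequence: multiply-accumulate, scale, offset, index update.
OPS = [
    Op("mul", "t", ("x", "c")),
    Op("alu", "acc", ("acc", "t")),
    Op("shift", "s", ("acc", "k")),
    Op("alu", "y", ("s", "b")),
    Op("alu", "i", ("i", "one")),
]

# Assumed cascades: multiplier feeding ALU, shifter feeding ALU.
CASCADES = {("mul", "alu"), ("shift", "alu")}

if __name__ == "__main__":
    n = len(OPS)
    for name, cycles in [
        ("scalar", scalar_cycles(OPS)),
        ("composite", composite_cycles(OPS, CASCADES)),
    ]:
        print(f"{name}: {cycles} cycles, {n / cycles:.2f} operations per cycle")
```

In this toy sequence the five operations take five cycles under scalar issue and three cycles with the two assumed cascades, using the same set of functional units. The ratio depends entirely on the invented sequence and should not be compared with the simulation results reported above.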