Superscalar processors are often pipelined as well, but pipelining is a different technique: it lets each execution unit work on more than one instruction at once, at different stages, whereas a superscalar design uses several execution units at once.
Loading more than one instruction in each clock cycle
The simplest processors are scalar processors. On a scalar processor, instructions usually work with one or two data items at once. On a vector processor, instructions usually work with many data items at once. A superscalar processor is a mix of a scalar processor and a vector processor: each instruction processes one data item, but more than one instruction runs at once, so the processor still handles several data items at once.
In a superscalar processor, it's very important to have an accurate instruction dispatcher, so that the execution units are always busy with work that will probably be needed. If the instruction dispatcher isn't accurate, the processor will have to throw away some work and might not be any faster than a scalar processor. As of 2008, all normal CPUs were superscalar, and could have up to 4 ALUs (arithmetic logic units), 2 FPUs (floating-point units), and 2 SIMD units.
Limitations
The efficiency of superscalar design is limited by three things:
How long the dispatcher takes to check dependencies and how long register renaming takes
How branch instructions are processed
How much parallelism the program has. Existing programs have different levels of parallelism. In some cases, instructions are not dependent on each other and can be executed at the same time. In other cases, they are interdependent: one instruction affects another. The instructions a = b + c; d = e + f can be run in parallel because neither result depends on the other, but the instructions a = b + c; b = e + f will probably have to be run in a specific order, because the first instruction reads b and the second one overwrites it.
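The two instruction pairs above can be checked mechanically. The following is a minimal sketch, not how any particular processor does it: the `Instr` type, the `hazards` function, and the use of variable names in place of registers are assumptions made for illustration. It labels each dependency with the usual names: read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW).

```python
from typing import NamedTuple


class Instr(NamedTuple):
    """Hypothetical instruction model: one destination, a set of sources."""
    dest: str
    srcs: frozenset


def hazards(first: Instr, second: Instr) -> set:
    """Return the kinds of dependency that force `second` to stay after `first`."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")  # second reads what first writes
    if second.dest in first.srcs:
        found.add("WAR")  # second overwrites what first reads
    if first.dest == second.dest:
        found.add("WAW")  # both write the same place
    return found


add_a = Instr("a", frozenset({"b", "c"}))  # a = b + c
add_d = Instr("d", frozenset({"e", "f"}))  # d = e + f
add_b = Instr("b", frozenset({"e", "f"}))  # b = e + f

print(hazards(add_a, add_d))  # set()
print(hazards(add_a, add_b))  # {'WAR'}
```

The second pair is a write-after-read case. This is the kind of dependency that register renaming is meant to remove, by giving the second instruction a different physical place to write its result.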
Although there might not be any interdependent instructions in the list, a superscalar processor still has to check for them, because there's no way to be sure there aren't any unless it checks, and if a dependency is missed the results will be wrong. No matter how fast the processor is, this limits how many instructions can be run at the same time. Checking for dependencies gets harder and harder as more instructions are considered together, even while improvements in hardware manufacturing allow for more execution units in each CPU core.
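One way to see why the checking gets harder is to count the comparisons needed when every instruction in a group is checked against every other one. The sketch below is a software model only, with a hypothetical function name and instruction format (a destination name plus a tuple of source names). Real hardware does these comparisons in parallel circuits, and this toy model ignores register renaming.

```python
from itertools import combinations


def can_issue_together(group):
    """Check every pair in `group` for a dependency.

    Each instruction is a (dest, srcs) pair. Returns a tuple
    (independent, comparisons_made).
    """
    comparisons = 0
    independent = True
    for (d1, s1), (d2, s2) in combinations(group, 2):
        comparisons += 1
        if d1 in s2 or d2 in s1 or d1 == d2:
            independent = False
    return independent, comparisons


# Groups of fully independent instructions of increasing size.
for width in (2, 4, 8):
    group = [(f"r{i}", (f"x{i}", f"y{i}")) for i in range(width)]
    ok, count = can_issue_together(group)
    print(width, ok, count)
# 2 True 1
# 4 True 6
# 8 True 28
```

A group of n instructions has n(n-1)/2 pairs, so in this model doubling the number of instructions considered together roughly quadruples the number of comparisons.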
Alternatives
Simultaneous multithreading: often written as "SMT", this is a technique for improving the total speed of superscalar processors. SMT lets several independent threads of execution run at once, to make better use of the resources available inside a modern superscalar processor.
Multi-core processors: a multi-core CPU has many processors that each have their own instruction lists, rather than just many execution units.
Pipelined processors: a pipelined CPU supports multiple instructions at different stages of execution inside each execution unit.
All of these techniques can be (and often are) combined in a single CPU, so it is possible to have a multi-core CPU where each core is an independent processor with many parallel superscalar pipelines. Some multi-core processors also include vector capability.
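The effect of combining pipelining with superscalar execution can be estimated with a very idealized model. The sketch below assumes that every instruction is independent, that every pipeline stage takes one clock cycle, and that there are no stalls or branches. The function names and the numbers used are illustrative assumptions, not measurements of any real CPU.

```python
import math


def cycles_unpipelined(n, stages):
    """Each instruction passes through all stages before the next one starts."""
    return n * stages


def cycles_pipelined(n, stages, width=1):
    """Ideal pipeline that starts up to `width` instructions every cycle.

    The first group needs `stages` cycles to finish, and each later
    group finishes one cycle after the group before it.
    """
    if n == 0:
        return 0
    return stages + math.ceil(n / width) - 1


n, stages = 100, 5
print(cycles_unpipelined(n, stages))         # 500
print(cycles_pipelined(n, stages))           # 104
print(cycles_pipelined(n, stages, width=4))  # 29
```

Real programs fall short of these figures because of the limits described above: dependencies between instructions, branches, and the cost of checking for both.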