??? 06/22/10 17:31 Read: times
#176837 - Most concepts already exist in the wild Responding to: ???'s previous message
You are now describing the pipelined and superscalar implementation of modern high-end processors.
The processors in PCs, game consoles or more expensive mobile phones can execute multiple instructions per clock cycle even if they have only a single core. The Pentium was the first superscalar x86 processor: it could issue one floating-point and two integer instructions in the same clock cycle. By increasing the length of the pipeline, the processor can detect dependencies between instructions and reorder them, allowing multiple instructions to be started at the same time.

The next step up from this is the hyperthreading that got introduced in some of the faster P4 processors. A single processor core has a duplicate set of registers (including program counters), allowing the processor to instantly hand the computation units (ALU, multiplier, ...) to the second program when instructions from the first program stall. The increase in pipeline length needed for instruction reordering and efficient superscalar performance increases the cost of a stall - for example when the memory can't keep up and supply the data in time because the data wasn't already in the cache. Hyperthreading reduced the loss from such stalls, since both programs had to stall at the same time before the core was forced to idle. A new i7 processor can have four cores - each with hyperthreading and each with superscalar performance where multiple instructions are issued each clock cycle. Besides all this, most newer processors also have SIMD instructions for multimedia or signal processing.

But (yes, there is always a but) the increases in pipeline length needed for superscalar performance, and the use of data and code caches, mean that the timing becomes more and more unpredictable. A traditional microcontroller is fully deterministic: if interrupts are not disabled, you can compute the exact number of clock cycles from an event until the ISR has been activated.
With a high-end processor, you may have three levels of cache that need to be filled before you even get the initial instruction. If accessing virtual memory, you may get page-fault exceptions, where you basically reach another level of ISR to map the required memory in or out. This can happen for the instruction bytes and/or for the data bytes. Depending on what other instructions the processor has been busy with previously, it will take an unknown number of cycles from when an instruction enters the pipeline until the required execution unit (such as an ALU) is ready to process it. A processor with a 17-step pipeline will obviously take far longer from fetching an instruction until executing it than a processor with a 2-step pipeline. This is a big reason why most microcontrollers still run at very sedate speeds even when built on 0.13 µm process technology. The lower-end ARM chips are very nice. The high-end ARM chips have walked a long way towards current PC-class technologies, which makes them excellent for running Linux systems, but not as good at hard real time where µs or sub-µs requirements must be fulfilled.

Another concept that does exist is VLIW - very long instruction word. You make the processor load multiple concurrent instructions with each read and process these instructions in lock-step. Running the instructions in lock-step is what differs from a traditional superscalar processor, where the instructions are completely independent and the processor itself takes care of any reordering to get maximum speed out of the core. With VLIW, the compiler must figure out which instructions may be processed at the same time and merge these sub-instructions into one complete instruction word.

Yet another thing that has similarities with your ideas is the virtual design of new processors. They may have multiple cores that share a pool of execution units, and the individual cores then grab any free address decoder, multiplier or whatever they may need.
By pooling the resources, you can squeeze just a bit more performance out of the chip by reducing the stall time - this is a follow-up to hyperthreading. Having virtual processors means that you could basically have a server that allocates 30% of the CPU resources to a specific application - not just by adjusting the number of time slices but by adjusting the number of core elements. You get a very fine-grained tool for handling CPU-time quotas, making sure that your multimedia stream is guaranteed, say, 1.2 billion multiplies and 3 billion adds per second.

There are also companies that develop processing solutions based on graphics cards, where the modern, programmable pipelines of the graphics cards are used to dynamically form computation networks. There are supercomputers based on graphics cards, but I don't think any manufacturer has a processor product that would be suitable for control applications. Most pipelined solutions are about pure throughput, not about quick responses.