??? 05/15/09 05:46
#165360 - It's not that difficult
Responding to: ???'s previous message
Per Westermark said:
The only problem with a pipeline is the startup time after an unexpected jump.
You don't have to worry about any reset conditions. Normal speculative execution just means that you can throw away results at the last step of the pipeline. From the program's view, an instruction hasn't happened unless the last step in the pipeline gets accepted. With a short pipeline, you won't really be able to see the startup time either. After a reset, it may take a couple of extra cycles until the first results get produced. In the same way, an interrupt will require these cycles until it is activated, but these cycles would normally be so much shorter (the pipeline allows a higher clock frequency, since you limit the sequential state changes required for each step of the pipeline) that the total number of nanoseconds for the interrupt response will not be affected. The ALU etc. of the 8051 are so tiny that it is very easy for the pipeline to compute both sides of a conditional branch and throw away the wrong alternative. In the end, a pipeline need not affect predictability at all. You could still count your cycles for the individual instructions. It isn't until you start with instruction reordering or concurrent (accepted, instead of speculative) instructions that you will lose the ability to compute exact timing. It's still predictable, since computer programs can and will "know" exactly what the state of the pipeline is, in the event that the core is pipelined.

I disagree with the notion that the 805x ALU is small compared with the remainder of the logic. I also disagree with the notion that there has to be a lot of propagation delay when reaching into code space, internal data space, SFR space, external data space, etc. The ALU, from where I sit, is pretty large, wide not in data path but in function, and has a substantial pair of data selectors (muxes) at its input and a substantial data distributor at its output. A mux does the shifts and rotates, and a 16-bit adder/subtractor does the address and data arithmetic, though it's 16 bits wide only to support the DPTR and PC/address-bus operations.

Per Westermark said:
Viewing many bytes of the instruction stream without a pipeline gives the processor information it can't make any use of. The ALU, address busses, etc. will not be any faster just because you have knowledge about the following instructions.

Yes, you're right ... the logic depth, which can be fairly well equalized by using short, wide paths rather than narrow, long ones, will provide the rate-determining step. However, if a 3-byte instruction, e.g. MOV DPTR,#HHHH, takes just as long as a single-byte instruction, MOV A,R0, or a two-byte instruction, MOV A,VNAME, things will go quite a bit faster even though the individual cycles are longer.

Per Westermark said:
Think about discrete logic. How much can you manage to do in your discrete logic with just one clock transition? Each gate has a delay, and your information may in some situations have to ripple through the logic gates and flip-flops. Using a two-phase clock, you would still have quite interesting times getting data from the code space, decoding, retrieving the input data, computing, and storing back the result within one low-to-high and one high-to-low clock transition.

I've not built an 805x core ... yet ... though I've done considerable preliminary work on it. I've built other cores, and have found that one can build nearly any MCU core with a simple two-phase clock, e.g. the sort used on the 6801 or 6502, doing data arithmetic on one phase and address arithmetic on the other.
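To make the quoted timing argument concrete, here is a minimal sketch, in C rather than HDL, of why a short in-order pipeline keeps timing exactly computable: every instruction keeps a fixed cycle cost, and the only extra term is a constant pipeline-fill delay after reset, an interrupt, or a taken jump. The opcode set, the cycle counts, and the three-stage depth below are assumptions invented for the illustration, not the timings of any actual 805x core.

```c
/*
 * Toy timing model only: fixed per-opcode cycle costs plus a constant
 * pipeline-fill term.  The depth and the numbers are made up for the
 * illustration, not taken from any real core.
 */
#include <stdio.h>

#define PIPE_DEPTH 3                    /* assumed fetch/decode/execute   */

typedef enum { OP_MOV, OP_ADD, OP_MOVX, OP_SJMP } Op;

static const int cycles_per_op[] = {
    [OP_MOV]  = 1,
    [OP_ADD]  = 1,
    [OP_MOVX] = 2,                      /* external access assumed slower */
    [OP_SJMP] = 1 + (PIPE_DEPTH - 1)    /* a taken jump refills the pipe  */
};

/* The time of a linear run of instructions is just the sum of the table
 * entries plus one pipeline fill after reset or interrupt entry; nothing
 * about the pipeline makes the result data-dependent or unpredictable.  */
static int run_cycles(const Op *code, int n)
{
    int total = PIPE_DEPTH - 1;         /* initial fill                   */
    for (int i = 0; i < n; i++)
        total += cycles_per_op[code[i]];
    return total;
}

int main(void)
{
    /* a hypothetical interrupt handler body, ending with a jump out */
    const Op isr[] = { OP_MOV, OP_MOVX, OP_ADD, OP_MOV, OP_SJMP };
    printf("worst-case ISR latency: %d cycles\n",
           run_cycles(isr, (int)(sizeof isr / sizeof isr[0])));
    return 0;
}
```

Nothing in it has to guess: the total is a straight sum, which is exactly the "you could still count your cycles" claim in the quoted text.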
The two internal data spaces, the SFR space, the code space, and the external data space are all segments of an otherwise contiguous memory space. Since address arithmetic doesn't affect the data, and data arithmetic doesn't affect the addresses, the address arithmetic cycle can be used to access the composite memory space. Consequently, the data arithmetic result can be transferred to memory during the address arithmetic cycle, and the address arithmetic result can be transferred to the appropriate resource (not data or code memory, but possibly the stack or registers) during the data arithmetic cycle. Because the ALU is wide but shallow, it can easily be used for both sets of arithmetic, thereby eliminating the need for long clearable and presettable up/down counters, which require quite large concatenations of gates.

I could go on, but I imagine people's eyes are already glazing over.

RE
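For anyone trying to picture the "segments of one contiguous memory space" arrangement described above, here is a rough behavioural sketch: each architectural space becomes a fixed offset into one flat array, so a single address-arithmetic step can reach any of them. The base offsets and segment sizes are illustrative assumptions, not the layout of any particular core.

```c
/*
 * Behavioural sketch only: the 805x spaces modelled as segments of one
 * flat store.  Bases and sizes below are illustrative assumptions.
 */
#include <stdint.h>
#include <stdio.h>

enum Space { IRAM_LO, IRAM_HI, SFR, CODE, XDATA };

static const uint32_t base[] = {
    [IRAM_LO] = 0x00000,    /* lower 128 bytes of internal RAM        */
    [IRAM_HI] = 0x00080,    /* indirect-only upper 128 bytes          */
    [SFR]     = 0x00100,    /* SFR page (direct addresses 80h..FFh)   */
    [CODE]    = 0x00180,    /* 64K code space                         */
    [XDATA]   = 0x10180     /* 64K external data space                */
};

static uint8_t flat[0x20180];           /* one contiguous backing store   */

/* One unified access path: segment base plus the offset within it. */
static uint8_t *cell(enum Space s, uint16_t offset)
{
    return &flat[base[s] + offset];
}

int main(void)
{
    *cell(SFR, 0x00)     = 0x55;        /* e.g. P0, direct address 80h    */
    *cell(XDATA, 0x1234) = 0xAA;        /* an external data byte          */
    printf("SFR+00h=%02X  XDATA[1234h]=%02X\n",
           *cell(SFR, 0x00), *cell(XDATA, 0x1234));
    return 0;
}
```

The addition base[s] + offset is the sort of address arithmetic the post assigns to the second clock phase, and it never touches the data itself.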