??? 05/15/09 15:56
#165384 - It is a matter of how you choose to view things
Responding to: ???'s previous message
Per Westermark said:
Richard Erlacher said:
It's still predictable, since computer programs can and will "know" exactly what the state of the pipeline is, in the event that the core is pipelined.

Never contested. I just noted that a pipeline has a startup time.

Richard said:
I disagree with the notion that the 805x ALU is small, as compared with the remainder of the logic.

Please stop extrapolating. What I wrote was "The ALU etc of the 8051 are so tiny that it is very easy for the pipeline to compute both sides of a conditional branch, and throw away the wrong alternative." I don't know where you thought you saw "compared with the remainder of the logic". The ALU of an 8051 is small if you compare it with the ALU of more recent processors. If bigger processors can manage to compute both sides of a conditional branch, then it must obviously be possible for an 8051 chip too. The reason? To run the pipeline at a fixed speed without worrying about a pipe stall if the chip mispredicts the branch decision. It is also quite common to separate address calculations from the general ALU, so having two ALUs does not mean having two identical copies of what the original 8051 had. The user is interested in the behaviour, not in which building block a specific transistor is located.

I didn't intend to imply that a 1500-gate ALU, such as I suggested the 805x core might use, should be compared with the 100K-gate ALUs common in modern processors. It's not the ALU's job to control the sequence of operations in the core. It certainly doesn't control the pipeline or branch prediction. My view is that data or address information flows through the ALU twice during each machine cycle on its way from and to source and destination. The information does so whether the ALU alters it or not. It can be incremented, decremented, ANDed, ORed, shifted, rotated, added, etc. If one uses a Wallace-tree or other fast multiplier, it can multiply in a single cycle, too. The classic ALU, which apparently used Booth's algorithm to perform the multiplication, would have to take longer.

Richard said:
Yes, you're right ... the logic depth, which can be fairly well equalized by using short, wide paths rather than long, narrow ones, will provide the rate-determining step. However, if a 3-byte instruction, e.g. MOV DPTR,#HHHH, takes just as long as a single-byte instruction, MOV A,R0, or a two-byte instruction, MOV A,VNAME, things will go quite a bit faster even though the individual cycles are longer.

Correct - it is always good to have a memory interface that can load the full instruction in one read. But your example requires only 24 bits, while your previous post talked about 48 bits. Without a pipeline, what would you do with the information about the following instruction?

The 48-bit "view" of program memory allows the execution of two 3-byte instructions at a time, e.g. MOV DPTR,#HHHH and LJMP AAAA (which is really a "load PC"). It also allows selective out-of-order or concurrent execution of three 2-byte instructions, or of up to six single-byte instructions, selectively, of course. Instructions would be removed from the instruction stream as they're executed, and replaced by subsequent code-space content. Not all of code space is instructions, so this wouldn't always help. Much of the time, however, it would yield additional performance without increasing the speed requirement, hence reducing the speed-power product.

Richard said:
I've not built an 805x core ... yet ... though I've done considerable preliminary work on it. I've built other cores, and have found that one can build nearly any MCU core with a simple two-phase clock, e.g. the sort which was used on the 6801 or 6502, etc. [...]

We are not talking about the use of a two-phase clock here. We are talking about managing the instruction with just two clock transitions.

A single input clock can easily be manipulated into a two-phase clock, producing a non-overlapping clock pair with a ~40% duty cycle on each phase of the input clock. That provides a convenient scheme for clocking the separate address and data arithmetic operations.

I was talking about one-clockers managing with just two phase changes, without pipelining and without internal clock doublers.
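As an aside, the non-overlapping two-phase scheme described above can be sketched in a few lines of Python. This is only an illustrative model (the 10-tick period and the exact phase windows are my own choices, not from either poster); it shows the key property being argued about: each phase is high for roughly 40% of the input-clock period, with dead time around every transition so the two phases are never active at once.

```python
# Toy model of deriving two non-overlapping clock phases from one
# input clock. Each phase is high ~40% of the period; the gaps
# around the transitions are the dead bands. Tick counts are
# illustrative only.

def two_phase(period=10, ticks=100):
    """Yield (phi1, phi2) samples, one per tick of the input clock."""
    for t in range(ticks):
        p = t % period
        phi1 = 1 if 0 <= p < 4 else 0   # high during the first 40%
        phi2 = 1 if 5 <= p < 9 else 0   # high during 40% of the second half
        yield phi1, phi2

samples = list(two_phase())
# The phases never overlap, so latches clocked by phi1 and phi2
# can never both be transparent at the same instant.
assert all(not (p1 and p2) for p1, p2 in samples)
```

In a core built this way, the address-side work would happen while phi1 is high and the data-side work while phi2 is high; the dead bands between the phases are what keep a latch-based design race-free.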
I know that the 6800 does not do 1 MIPS at 1 MHz. I haven't looked at the 6502, even though I know that it uses both phases of the clock and has asynchronous logic. Is it your claim that the 6502 does 1 MIPS for a 1 MHz input clock?

Well, the 6502 had a one-clock-deep pipeline, for which it paid the price in interrupt response and branching. I don't know so much about the Motorola CPU. That was not my point, though. I simply meant that the two CPUs both used a two-phase clock generated from a single input clock of the same frequency, with two non-overlapping phases during which latched operations occurred. I say latched rather than registered because they used latches on account of their relative size (2 gates rather than ~6 for a DFF).

But were these two your proof of two-phase one-clockers?
I implied no proof of any sort. The guy who pays for the CPU is concerned with the cost ... silicon by the pound, if you please ... and while the end-user doesn't care whether it's 5,000 gates or 5 million (unless he has to pay for them), it is nonetheless a significant factor. Single-cycle branch prediction requires a much deeper view into program memory than concurrent or out-of-order execution of instructions. That's why I've chosen to ignore it for now.

I didn't intend to frame this part of the discussion as an argument. I have my own views about how an MCU core, even an 805x-type MCU core, should be architected, and from what I've seen, most people don't follow that model. I prefer a short, wide path from source to destination, with a short, wide ALU section that does most of the heavy lifting while data simply flows through it. The ALU is large for a small core, as is the steering logic, but paths through the steering logic can be set in advance, so its depth isn't as critical. Since the single clock has two phases, those phases can be separated, with an edge-detector between them to activate any gates or gated/clocked latches/registers. Experience has taught me that this approach can be fruitful with single-clock-cycle cores. Pipelining doesn't help those types of cores ... much ... though one stage can be useful. With deep logic, and with multi-clock-per-cycle architectures, pipelining can improve performance significantly: while it increases latency, it typically reduces the propagation delay per stage, which, at least in theory, improves throughput.

RE