The past, present and future of the humble CPU
Light a candle and bake a cake, then pop down to the shop to pick up a hilarious card – your CPU has just turned 30! While its best years aren't behind it quite yet, it could do with cheering up.
In 1978, Intel released its first 16-bit microprocessor, the 8086. Although it was the cheaper, cut-down version with an 8-bit external bus – the 8088 – that made it into the IBM PC and quite literally changed the world as we know it, today's Core 2 and Phenom chips are designed to run code based on what's still called the x86 instruction set. In fact, they still share some important core characteristics with the venerable 8086.
Quite why it should have been the x86 family is a different story for another time. Intel's chips were far from the most advanced, cleverest or cheapest available at the end of the 1970s, and had some fairly serious design bugs – affected chips had to be replaced by IBM free of charge some years later. In the annals of our times, though, that will be deemed irrelevant: this was the general-purpose processor that drove the desktop revolution.
Curiously, one of its competitors – the Zilog Z80, which powered Sinclair's home computer of (almost) the same name – is actually still manufactured and used today. The 8086, however, has been consigned to history.
Why do we bring these curious factoids up? Because later this month also sees the launch of Intel's seventh generation of x86 CPUs, the Core i7 (Nehalem). Intel is touting it as the biggest architectural change in the company's history; and for once we're actually prepared to believe it.
The secret of x86's success is, of course, backwards compatibility. Somewhere in the Core i7's infinitely more complex design are the same 116 instructions that the 8086 could execute, albeit substantially enhanced with later additions, and the same is true of the AMD Phenom. These are the basic arithmetic and logic commands – like ADD, MUL, OR and XOR – along with a few more specialised instructions for moving data between blocks of memory and system registers.
In reality, of course, the two couldn't be more different today if they tried. The 8086 ran at under 5MHz, had a total transistor count of less than 30,000 and was packaged as a 40-pin dual in-line chip: physically, it was one of those long black things with legs sticking out from the sides like an evil metal spider. The Core i7, by contrast, is a two-, four- or eight-core beast, with up to 1.4 billion transistors in its largest variety.
At launch, it will be clocked at around the 3GHz mark. It has 1,366 contact pads (the LGA1366 socket), and comes in the flat FCLGA (flip chip land grid array) packaging that will be familiar from the Core 2 line. That means the processor die is flipped and soldered face-down onto its substrate, which ends in flat pads that are laid on top of the pins in the motherboard socket.
We've come a long way, clearly. The CPUs of the seventies look like single-celled organisms in primordial processor sludge by comparison to the staggering complexity of today's chips. It takes teams of hundreds of people several years to design a new CPU, and it's unlikely that any individual could completely navigate the finished silicon topography by hand.
Inside the shell
We can, however, do our bit to improve general understanding by looking at certain core principles of CPU design. Technically speaking, a CPU is any processor that can execute programmable code, but for the purposes of our sanity, we'll stick to a discussion of modern-day x86 chips here.
Instructions are fetched from a memory store into the registers, then decoded, executed, and the result written back. This result can be output to, say, a graphics card or hard drive, or fed back into the CPU for further processing. However intricate a processor is, these four basic steps – fetch, decode, execute, write back – are a good way to understand how it works and why it is designed the way it is.
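The four-step cycle is easier to see in code than in prose. Here's a toy machine – nothing like real x86, just a sketch with made-up instruction tuples – that fetches, decodes, executes and writes back:

```python
memory = [               # a tiny "program": (opcode, destination, operand)
    ("ADD", "ax", 5),
    ("XOR", "ax", 0xFF),
    ("ADD", "bx", 2),
]
registers = {"ax": 0, "bx": 0, "ip": 0}   # ip = instruction pointer

def step():
    opcode, dest, src = memory[registers["ip"]]   # fetch
    registers["ip"] += 1
    if opcode == "ADD":                           # decode...
        result = registers[dest] + src            # ...and execute
    elif opcode == "XOR":
        result = registers[dest] ^ src
    registers[dest] = result                      # write back

for _ in memory:
    step()

print(registers)  # {'ax': 250, 'bx': 2, 'ip': 3}
```

Everything that follows – pipelining, branch prediction, out-of-order execution – is an elaboration of this one loop.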
That cycle can be sped up, of course, by increasing the clock speed of the CPU – the number of cycles it performs per second. But Intel learnt the hard way that the key to building a really fast processor isn't just raw gigahertz. If a single cycle requires a certain amount of electricity, more cycles per second means more power consumed and – importantly – more heat produced.
The theoretically scalable NetBurst architecture of the Pentium 4 came a cropper when it hit an unexpected top-speed barrier beyond which trying to cool the chip was impracticable for most. Which tells us that processor designers may be very clever, but they can't foresee everything.
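The arithmetic behind that barrier is unforgiving. Dynamic power in a CMOS chip rises roughly with switched capacitance times voltage squared times frequency – and pushing the clock up usually means pushing the voltage up too. The figures below are purely illustrative, not real Pentium 4 numbers:

```python
def dynamic_power(cap_farads, volts, hertz):
    # P = C * V^2 * f: switched capacitance x voltage squared x frequency
    return cap_farads * volts ** 2 * hertz

base = dynamic_power(1e-9, 1.3, 3.0e9)    # a ~3GHz part at 1.3V
pushed = dynamic_power(1e-9, 1.4, 4.0e9)  # 33% more clock, a little more voltage

print(f"{pushed / base:.2f}x the power")  # -> 1.55x
```

A third more clock costs over half again as much power – all of which comes back out as heat.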
In the same way that graphics technology has moved to unified shading in order to make more efficient use of the processing power available, today's design goals are to keep all the various parts of the CPU working on useful information. Note the inclusion of the word useful there.
Fetch, decode, etc.
The simplest form of CPU takes one piece of data, works out what to do with it, does it and then outputs the result. The inherent problem is that it can only work on one piece of data at a time, and while that's being passed through to the part of the execution engine that's designed to perform the requested operation, the rest of the CPU is sitting idle.
The solution to this is to introduce some form of parallelism to the pipeline. To start with, this might have meant simply having the fetch part of the CPU grab a new piece of data while the decode stage works on another. That's been developed somewhat since, mind you: the last iteration of the Pentium 4 had a whopping 31 stages to its pipeline.
The problem with long pipelines, however, is that they aren't always terribly efficient, because they're not always full of useful information. On its journey through the pipeline, a piece of data may return an error, or become reliant on other information being drawn from the registers – if that isn't there, the result will have to be written out while the new piece of data is fetched, and the rest of the pipeline will stand idle.
The key workaround for this in today's CPUs is to build logical areas that are dedicated to branch prediction – in other words, guessing which bits of data are going to be needed next and getting them ready for insertion into the pipe. Of course, branch predictors aren't infallible, and if the wrong information is called then you're back to having large amounts of wasted die area. A large part of processor design is finding a happy balance between the length of the pipeline and the CPU cycles lost to such stalling.
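A rough sketch shows why that balance matters: every mispredicted branch flushes the pipe, wasting roughly as many cycles as the pipeline is deep. The depths and misprediction rate here are illustrative, not measured figures:

```python
def cycles_taken(instructions, pipeline_depth, mispredict_rate):
    # each misprediction flushes the pipeline: ~pipeline_depth wasted cycles
    flushes = instructions * mispredict_rate
    return int(instructions + flushes * pipeline_depth)

short_pipe = cycles_taken(1_000_000, 14, 0.05)  # a shallower, Core 2-ish pipe
long_pipe = cycles_taken(1_000_000, 31, 0.05)   # a late Pentium 4-length pipe

print(short_pipe, long_pipe)  # the deep pipeline loses far more work
```

Same program, same predictor accuracy – but the 31-stage pipe pays more than twice the flush penalty, which is exactly the trap NetBurst fell into.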
There are other ways to speed up data throughput too. Your CPU's inbox is always overflowing with work to be done, but it will rifle through the pages to take the best job next, not necessarily the first one it was given. The order in which instructions are executed is decided by a scheduler, which independently assesses the most efficient way to do them.
That might mean looking ahead in the currently running thread and pulling out commands that aren't dependent on the current operation – known as out-of-order processing – or, in the case of a processor core capable of working on more than one thread at once, starting to work through an entirely different instruction loop that just happens not to need the same parts of the pipeline as the currently running one. To speed things up further, Core i7 can execute up to four instructions per cycle.
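As a sketch of the idea – hypothetical instructions, single-issue, made-up latencies; real schedulers are vastly more sophisticated – compare an in-order core, which stalls behind a slow load, with an out-of-order one that slips independent work into the gap:

```python
# instruction: (name, values it needs, value it produces, latency in cycles)
program = [
    ("load_a", set(),  "a", 3),   # slow load from memory
    ("use_a",  {"a"},  "b", 1),   # depends on load_a's result
    ("inc_d",  set(),  "d", 1),   # independent work
]

def run(out_of_order):
    ready = set()        # values produced so far
    in_flight = []       # (cycle it finishes, value it produces)
    pending = list(program)
    cycle = 0
    while pending or in_flight:
        cycle += 1
        for finish, produces in list(in_flight):   # retire finished work
            if finish <= cycle:
                ready.add(produces)
                in_flight.remove((finish, produces))
        # issue at most one instruction: only the program-order head if
        # in-order, any whose inputs are ready if out-of-order
        candidates = pending if out_of_order else pending[:1]
        for name, needs, produces, latency in candidates:
            if needs <= ready:
                in_flight.append((cycle + latency, produces))
                pending.remove((name, needs, produces, latency))
                break
    return cycle

print(run(out_of_order=False), run(out_of_order=True))  # 6 5
```

The in-order core needs six cycles, the out-of-order one five: `inc_d` fills a cycle that would otherwise be lost waiting for the load.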
Incidentally, it's also interesting to note that a CPU's instruction set – the programming language into which all commands are ultimately decoded – isn't completely hard-wired into the design. There's a software layer that handles much of the interpretation, known as the microcode – a form of low-level firmware stored in an on-board ROM, which works as a mini operating system.
It's a useful tool for chip builders: because the microcode isn't finalised until the chip goes into production – and can be rewritten for a new manufacturing run – any problems or improvements that are discovered after the silicon has been laid out can be changed in the software stack. This is, of course, easier than going back to the drawing board and laying out another million or two transistors.
The execution engine is also broken down further into dedicated areas for tasks like integer operations, floating-point calculation and SSE instructions. The latter is an acronym containing an acronym – Streaming SIMD Extensions, where SIMD stands for Single Instruction, Multiple Data.
It's an on-board vector processor capable of performing the same transformation on several pieces of information at once. It's included on Intel and AMD chips for speeding up things like video processing, where the same command must be performed on, say, all the pixels on a screen simultaneously. There's also, of course, the one important part of a CPU that we haven't talked about yet: the memory.
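The effect is easy to mimic in software. A real SSE unit does this in hardware, 128 bits of register at a time; the sketch below just shows the idea of one operation touching every lane at once – brightening four "pixels" in a single conceptual step rather than four scalar ones:

```python
def simd_add(vector, scalar):
    # one conceptual "instruction" applied to every lane of the vector,
    # clamped to the 0-255 range of an 8-bit colour channel
    return [min(255, v + scalar) for v in vector]

pixels = [10, 120, 200, 250]
print(simd_add(pixels, 20))  # -> [30, 140, 220, 255]
```

Scale that up to millions of pixels per frame and the appeal for video work is obvious.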
Closest to the actual instruction pipeline are the registers: there are 16 general-purpose registers on a 64-bit x86 chip, and each can either store a general piece of information or has a specific task or overlapping tasks. In order to help out those prefetch engines we mentioned earlier, though, there are two levels of fast cache memory to store the data which might be needed for the current process, or that has been written out but may be called again.
The cache memory is much faster than system memory and prevents the whole system bottlenecking while the RAM is slowly scanned for instructions and data. For multi-core chips, where two or more processors are packaged onto the same die, there's often a third cache area that is structured to allow the different cores to swap information quickly.
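The payoff can be sketched as a lookup that tries the fastest store first. The latencies are illustrative cycle counts, not figures for any real chip:

```python
L1 = {"x": 1}                    # tiny, very fast cache
L2 = {"x": 1, "y": 2}            # larger, slower cache
RAM = {"x": 1, "y": 2, "z": 3}   # huge, glacial main memory

def load(addr):
    # try each level in turn, paying that level's latency on a hit
    for store, latency_cycles in ((L1, 4), (L2, 12), (RAM, 200)):
        if addr in store:
            return store[addr], latency_cycles
    raise KeyError(addr)

print(load("x"))  # -> (1, 4)    L1 hit
print(load("y"))  # -> (2, 12)   L1 miss, L2 hit
print(load("z"))  # -> (3, 200)  all the way out to RAM
```

A program whose working set fits in cache barely touches RAM at all; one that misses constantly spends most of its cycles waiting.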
With Core i7, however, Intel has finally introduced a technology it calls QuickPath Interconnect, or QPI. Broadly analogous to AMD's HyperTransport, it allows the CPU to talk directly to components like memory without going via the northbridge on the motherboard, which otherwise acts rather like the router in your home network – a central hub for data transportation that can get quite congested. This should prevent Core i7 bottlenecking despite its enormous appetite for incoming information.
There have been many other improvements to CPUs since the humble 8086. One, for example, is that power management systems are now built onto the die. These serve several purposes – shutting the chip down to protect it from damage when it gets too hot, say, or turning off areas that aren't being used to conserve electricity.
The latter is particularly useful for extending the battery life on notebooks, but in these times of rising energy costs is also handy for datacentres, where a 50W saving per chip over a thousand servers can add up to a serious amount of money per year. Especially when it means you can turn the air-conditioning down a notch or two as well.
Perhaps the biggest ongoing technological advances are in how these complex designs actually get turned into transistors on a silicon die.
Most of us will be familiar with the mind-roastingly tiny figures that are quoted by CPU manufacturers for their manufacturing processes – 45nm, 65nm and so on. These refer to the basic size of components on a chip, and are unimaginably small. Reducing them further has several advantages: performance-wise, the same chip on a smaller process can run cooler and faster; but more importantly – because the chips take up less physical space – more can be squeezed onto a single silicon wafer. So they're cheaper too.
The basic manufacturing process hasn't changed much in 30 years: take a large disc of purified silicon, and use a photolithographic process to build up layers of extra materials that connect together to create data pathways and logic gates. The materials and tools, however, are constantly being refined to achieve the accuracy these tiny dimensions demand, and to reduce the effects of leakage. Put simply, leakage is when electrons begin hopping over the boundaries between interconnects that they're not supposed to be able to cross, and it becomes more of a problem the smaller the manufacturing process gets.
At the moment, Intel leads the way with its 45nm process, made possible thanks to a hafnium-based material used in the transistor gates. Intel states that the next generation of Core i7s will be produced on an even smaller 32nm process.
Moore to come
CPU design and manufacture isn't showing any signs of slowing. The famous Moore's Law – a prediction by Intel founder Gordon Moore that the number of transistors that can be placed on an integrated circuit will double roughly every two years – may not be based on any scientific assessment of future manufacturing capabilities, but it has remained peculiarly true for the last forty-three years.
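It's easy to check the law against the two chips in this very article – the 8086's roughly 29,000 transistors in 1978 and the biggest Core i7's 1.4 billion in 2008:

```python
import math

# how many times did the transistor count double between 1978 and 2008?
doublings = math.log2(1.4e9 / 29_000)
years_per_doubling = (2008 - 1978) / doublings

print(f"{doublings:.1f} doublings, one every {years_per_doubling:.1f} years")
```

A doubling roughly every 1.9 years – remarkably close to Moore's two-year figure.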
Indeed, it could be that we're on the cusp of a far bigger architectural change than even Core i7 augurs. AMD and Intel are keen to move more functions onto the CPU, starting with a basic graphics processor, with the end goal of creating simple, power-efficient systems-on-a-chip that will, essentially, put a desktop PC on your fingernail.
NVIDIA and the ex-ATI part of AMD, meanwhile, seem to recognise that the next big jump in real-time graphics engines is a little further off than previously supposed, and their hugely parallel GPUs are capable of performing important tasks like medical imaging and financial modelling better than an entire farm of CPU servers.
Perhaps more likely to yield results faster, though, are the hardware hooks for virtualisation being built into CPU cores, allowing several operating systems to run at once without a performance penalty. Many speculate that cloud computing – starting an instanced desktop from a web-based grid server – is the way forward, turning all our computing into one big Gmail-type application.
Quite where these developments – and the hundreds of others that are going on simultaneously – will lead, though, is anyone's guess. But before gazing too far into the future, bear this in mind: there's another, even bigger and more significant birthday than the 8086's this year. In September 1958, Texas Instruments welcomed the very first integrated circuit – just a single transistor and a handful of other components on a germanium strip – off its production line and set the ball rolling for the information age.
Did anyone, fifty years ago, predict World of Warcraft or even Microsoft Word? Happy birthday, computers!