Central Processing Unit

Back to Central Processing Unit

The Expanded Infodump About the CPU

Instruction Sets

Early computers had nothing resembling instruction sets. To program a computer, one had to painstakingly wire the inputs to the proper execution units (adders, logic gates, etc.). Of course, this grew into a test of sanity once job requests became more frequent and the calculations went beyond making artillery tables. Once some form of memory became available, it was possible to have the computer "rewire" itself through the use of instruction sets. In the beginning, this was something like: instruction 0000 moves some data, 0001 adds data, and so on. The "rewiring" part came from an instruction translator of sorts, which can be thought of as a bunch of logic gates that determine where the inputs and outputs should go.
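The opcode-to-operation mapping above can be sketched as a toy fetch-decode-execute loop. This is not any real ISA; the opcodes and three-field instruction format are made up for illustration:

```python
# A toy instruction translator: each opcode selects which "execution
# unit" (here, a branch of Python code) the operands are routed to.

def run(program, memory):
    """Execute a list of (opcode, dst, src) tuples against a memory list."""
    for opcode, dst, src in program:
        if opcode == 0b0000:        # hypothetical "move" instruction
            memory[dst] = memory[src]
        elif opcode == 0b0001:      # hypothetical "add" instruction
            memory[dst] = memory[dst] + memory[src]
        else:
            raise ValueError(f"unknown opcode {opcode:04b}")
    return memory

mem = [5, 7, 0]
run([(0b0000, 2, 0),   # move mem[0] -> mem[2]
     (0b0001, 2, 1)],  # add  mem[1] into mem[2]
    mem)
print(mem[2])  # -> 12
```

A real instruction decoder does the same routing with logic gates rather than a chain of comparisons, but the principle is the same.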

There was just one problem. While computation speeds were increasing quite nicely, memory speeds and capacity were not. As higher-level programming languages developed and took off, newer processors were built to support them directly so as to make the executable code as compact as possible. This continued until about the mid-1980s, when computer scientists began to figure out that performing simpler tasks in sequence could be done much quicker, often at one instruction per clock cycle. Processors built on this new design paradigm were given the name Reduced Instruction Set Computing (RISC); "reduced" in this case means the time it takes to complete an instruction is reduced. Retroactively, processors made before then were given the name Complex Instruction Set Computing (CISC). RISC proved to be such an efficient method of execution that by the late 1990s, many manufacturers had made or were making RISC processors. x86 is one of the only, if not the only, CISC ISAs still in widespread use. However, modern x86 processors are designed such that instructions are decoded into micro-ops and performed in a RISC-like manner.

Later improvements were mostly aimed at improving instruction-level parallelism, that is, doing more instructions at once. The major stumbling block is that the program being run has to be highly predictable.

Note that an instruction set is primarily the interface between software and hardware: the "language" the two use to speak to each other. How well the CPU actually performs depends on the implementation, not the instruction set itself. For example, while ARM is RISC-based, for the longest time none of its implementations could beat an x86 one, despite x86 being the "slower" CISC instruction set.

Instruction Set Types

Complex Instruction Set Computer (CISC)

Examples: Intel x86, AMD x86-64

A problem in the early days of computing was that memory was expensive, both in terms of capacity and in how fast the CPU could access it. In addition, there was the rise of early high-level programming languages like FORTRAN and COBOL. To better translate what these languages were doing and to save on the number of instructions needed, CPU designs gravitated towards performing multiple lower-level operations in one instruction. That is, a single instruction could access memory, process data, and save the result in one go. By encoding as many possibilities for an operation as possible, this helped increase code density.

However, as time went on, trying to decode all the possibilities for a single operation grew to be such a problem that in a lot of systems, it was faster to just perform the operations individually than with the one instruction that would've done them all. This led to a different way of designing CPUs in the 1980s.

The name for this paradigm was retroactively applied after the introduction of RISC.

Reduced Instruction Set Computer (RISC)

Examples: ARM, IBM PowerPC, AMD GCN

Developed in the 1970s, the principle of RISC is to reduce the time it takes to complete each instruction. It was found, even on a CISC machine, that doing a string of simpler instructions was faster than doing one complex instruction that handled everything those simpler instructions were doing. For example, a CISC CPU can have an operation like "add the numbers in memory locations A and B" under the "ADD" mnemonic. But "ADD" can also mean "add the number in register A to the one in memory location B" or "add the value A to memory location B". The CISC CPU has to spend extra cycles decoding the actual intent of the instruction.

Soon after, processors were designed with the aim of reducing the amount of time it takes to complete an instruction, with the goal being one instruction per clock. To do this, RISC architectures simplify the instruction execution process by:

  • Making instruction sizes uniform. While using a fixed instruction size for the entire ISA is often the goal, some ISAs allow variable-length instructions, with mechanisms in place to limit the variability. For instance, ARM has 16, 32, and 64-bit instructions, but can only execute each size in a specific CPU mode. Another example is MIPS, which introduced 16, 32, and 48-bit instruction sizes, but different opcodes represent the same operation at different instruction sizes (e.g., a 16-bit ADD instruction or a 48-bit ADD instruction). This contrasts with, say, x86, where the same operation maps to the same opcode but may have multiple versions of different sizes, with some instructions being up to 15 bytes in length.
  • Reducing the kinds of memory access an instruction can do. Operations that modify data can only do so on registers or with an immediate value (a value that's part of the instruction). Contrast x86, which can work on a combination of registers, immediate values, or memory locations.
  • Supporting only integer and floating-point data types, not special types like Binary Coded Decimal or strings.
  • Having a lot of general-purpose registers rather than a smaller set of registers plus more special-purpose registers. Originally x86 had 4 of its 14 registers meant for general-purpose operations, compared to ARM, which has 8-12 general-purpose registers. Most special-purpose registers can be read from and written to by software without restriction anyway, so they gradually became special-purpose in name only (though the convention helps compilers know which registers to use for a certain purpose).
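The memory-access point can be made concrete by modeling both styles in Python. The function names and register names below are made up; the point is that the one memory-to-memory CISC instruction does the same work as three simple RISC instructions:

```python
# Hypothetical models of one CISC-style instruction vs. the equivalent
# RISC-style sequence. Registers are a dict, memory a list.

def cisc_add_mem_mem(memory, a, b):
    """One instruction: add the numbers at memory locations a and b,
    storing the result back at a (memory-to-memory addressing)."""
    memory[a] = memory[a] + memory[b]

def risc_add(memory, regs, a, b):
    """Same effect as three simple instructions: two loads, a
    register-to-register add, and a store."""
    regs["r1"] = memory[a]                # LOAD  r1, [a]
    regs["r2"] = memory[b]                # LOAD  r2, [b]
    regs["r1"] = regs["r1"] + regs["r2"]  # ADD   r1, r1, r2
    memory[a] = regs["r1"]                # STORE [a], r1

m1, m2 = [3, 4], [3, 4]
cisc_add_mem_mem(m1, 0, 1)
risc_add(m2, {}, 0, 1)
print(m1[0], m2[0])  # both 7
```

The RISC version looks like more work, but each step is trivial to decode and can be pipelined, which is the whole point.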

By the 2000s, RISC had largely taken over. While x86 remains in widespread use, most implementations decode and translate it into something more RISC-like, with the execution core acting more like a RISC processor.

Very Long Instruction Word (VLIW)

Examples: Intel IA-64, Transmeta Crusoe, AMD Terascale, MCST Elbrus

Conceptualized in the early 1980s, VLIW was designed to take advantage of multiple, parallel computation units. The instruction word itself contains a number of instructions, similar to packing several RISC instructions together. This may sound similar to CISC, but each sub-instruction is still a simple, independent operation, while a CISC instruction tries to do many things at once. A major issue is that the programs have to be highly predictable, and oftentimes programs are recompiled and sorted ahead of time so they can run in a parallel fashion, something called static scheduling.

While it was tried with varying degrees of success from around 2000 to the early 2010s, it's rare to encounter a native VLIW CPU today. For GPUs, this model made sense, since the workloads they're expected to run are highly parallel in nature anyway. Even so, those designs soon gave way to SIMD instead, where the processor handles a single instruction, but there are dozens of execution lanes handling the different pieces of data that the single instruction is operating on.

No Instruction Set Computing (NISC)

Examples: NVIDIA's GPUs since the GeForce 600

An evolution of VLIW, the idea of NISC is that instead of compiling software into instructions and letting the processor figure out how to issue those instructions on its resources, the compiler figures out where the data being fed into the processor goes, and the instructions only tell the processor how this data flow should work. That is, if you think of a processor's execution units (adders, shifters, etc.) as modules, you only need to say "the input of this module reads from here and the output of this module goes to there." While technically still needing instructions, these only direct data flow, whereas traditional CPU instructions direct control flow.

For a real-life analogy, imagine you're in a cafeteria with various stations, each staffed by someone handing out food. Instead of saying "get some turkey, get some green beans, and get some mashed potatoes", which requires a person to recognize what the foods are and where they are, you could say "go to stations 1, 3, and 6." You get the same result, but the latter is simpler to resolve. This has the obvious downside that NISC-based software only works on one exact hardware implementation.

While it's experimental in CPUs, it's seen adoption in GPUs, namely NVIDIA's since the GeForce 600 series. Like VLIW, this kind of instruction set requires a highly predictable program for maximum effectiveness, and graphics processing is one of those highly predictable workloads.

There's also the similar-sounding "zero instruction set computer", but that describes a computer that runs entirely off of a neural network.

Multitasking

Early computers could only do one thing at a time, and often they would do that one thing until it was done. A job could involve loading instructions and data, executing them, and saving the results. Loading and storing took a lot of time, during which the computer wasn't doing anything really useful. IBM had the clever idea of separating those three tasks: the loading and storing steps were done by two cheaper machines whose only job was to do those things, while the actual processor worked on the job at hand. Even better, with multiple load/store machines, one could queue up a string of jobs to feed the main computer.

Once computers got powerful enough and memory plentiful enough, the idea of context switching came about. Context switching is when the processor saves its state (usually the important registers) into main memory and switches to another task by loading that task's state into itself. Computers around this time were still big honking mainframes with dumb terminals connected to them. With context switching, if one terminal had a job for the computer, the mainframe would run it until some block came up (usually waiting for data or for a peripheral to be free). When that block came up, the mainframe switched to another user's task that was ready to be executed, repeating the process whenever another block came up. These early implementations quickly exposed a number of problems:

  • Someone could hog the mainframe with an extremely long task, or take it down with a poorly written program containing, say, an infinite loop.
  • Someone could tie up a resource, like a printer or storage unit, for extended periods of time, blocking others who were waiting on it.
  • Two users each want a resource the other has, but neither will give theirs up until the other does first (a deadlock).

To combat this, several scheduling methods were created:

  • First in, first out (FIFO): The earliest implementation. Basically jobs are done in the order received.
  • Shortest Job First: All tasks are sorted by completion time. The job with the shortest time to complete is scheduled first.
  • Round Robin: The processor gives a slice of time to a task; once the time is up, it moves on to another task. When the time slices are short enough, it appears to the user that the computer is doing many things at once.
  • Priority scheduling: Every task is given a priority. When a task is done, the scheduler runs the next highest-priority task; a lower-priority task can also be pre-empted by a higher-priority one.
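Of these, round robin is the easiest to sketch in code. A minimal Python model (the task names, remaining-cycle counts, and slice length are all made up for illustration):

```python
from collections import deque

# A minimal round-robin scheduler: each task is (name, cycles_left),
# and every task gets a fixed time slice before rotating to the back.

def round_robin(tasks, time_slice=2):
    queue = deque(tasks)
    order = []                      # which task ran in each slice
    while queue:
        name, remaining = queue.popleft()
        ran = min(time_slice, remaining)
        order.append(name)
        if remaining - ran > 0:     # not done yet: back of the line
            queue.append((name, remaining - ran))
    return order

print(round_robin([("A", 5), ("B", 2), ("C", 3)]))
# -> ['A', 'B', 'C', 'A', 'C', 'A']
```

Interleaving the slices like this is what gives the illusion that A, B, and C are all running "at once."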

These led to two primary types of multitasking:

  • Cooperative: While the computer can have multiple programs in memory ready for execution, the program currently running has complete control of the processor and has to voluntarily give it up for another one to run. This means that if the program crashes or hangs, it can bring down the rest of the system.
  • Preemptive: Typically combines round robin and priority scheduling. This allows multiple programs to run seamlessly together, giving the impression they're running at once.

Improvements

Most of the improvements listed here were designed to prevent one thing: execution bubbles. These are stalls in execution caused by something blocking, such as waiting for a previous operation to complete even though other work could continue.

Pipelining

Pipelining is a processor's way of emulating an assembly line in order to increase throughput, with the goal being one instruction per clock cycle. Virtually every CPU has a pipeline; what differs is how many stages an instruction goes through before being completed. The most basic pipeline is instruction fetch -> execute, though some reference materials add a third stage for writing back the results. The classic RISC pipeline uses 5 stages (fetch -> decode -> execute -> memory access -> write back). Most modern CPUs have at least 8 stages.

To copy from The Other Wiki on how this works:

Consider the assembly of a car: assume that certain steps in the assembly line are to install the engine, install the hood, and install the wheels (in that order, with arbitrary interstitial steps). A car on the assembly line can have only one of the three steps done at once. After the car has its engine installed, it moves on to having its hood installed, leaving the engine installation facilities available for the next car. The first car then moves on to wheel installation, the second car to hood installation, and a third car begins to have its engine installed. If engine installation takes 20 minutes, hood installation takes 5 minutes, and wheel installation takes 10 minutes, then finishing all three cars when only one car can be assembled at once would take 105 minutes. On the other hand, using the assembly line, the total time to complete all three is 75 minutes. At this point, additional cars will come off the assembly line at 20 minute increments.
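The quote's arithmetic can be checked with a small simulation. The stage times come from the example; the scheduling rule, that a stage starts as soon as both the car and the stage are free, is an assumption of this simple model:

```python
# Reproducing the car-assembly arithmetic with a tiny pipeline model.
# Stage times: engine 20 min, hood 5 min, wheels 10 min.

def pipelined_finish_times(stage_times, n_cars):
    """Return when each car leaves the line, assuming a stage starts as
    soon as both the car and the stage are free."""
    stage_free = [0] * len(stage_times)   # when each stage next frees up
    finish = []
    for _ in range(n_cars):
        t = 0                             # when this car reaches stage 0
        for i, dur in enumerate(stage_times):
            start = max(t, stage_free[i])
            t = start + dur
            stage_free[i] = t
        finish.append(t)
    return finish

stages = [20, 5, 10]
print(sum(stages) * 3)                    # one at a time: 105 minutes
print(pipelined_finish_times(stages, 3))  # pipelined: [35, 55, 75]
```

Note how after the first car, a finished car rolls off every 20 minutes: the throughput is set by the slowest stage, which matters again in the clock-speed discussion below.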

The main downside to pipelining, especially when a CPU has a lengthy number of stages, is that branching, such as an if-statement, stalls the pipeline until the branch is resolved. Branch prediction, discussed later, was developed to avoid this, but if the prediction is wrong, it incurs a penalty because the work from the previous steps has to be thrown away. Taking the car factory example: if the factory can only work on one type of product at a time and wants to make another, it has to wait for all of the assembly line's stages to empty out before it can retool itself to produce the new product.

The number of pipeline stages also determines, of all things, how fast the CPU should be clocked. Aside from the physical limitations on why CPUs can't be clocked really fast, there's a theoretical limit based on how long each stage of the pipeline takes to complete; ideally, every stage should take the same amount of time. If, for example, we have a 3-stage pipeline where two stages take one clock cycle each while the third takes two, that slower stage creates gaps where nothing can be done, and those extra clock cycles are wasted.

Cache

Caching is a response to RAM's performance advancing more slowly than the CPU's. A cache stores recently used instructions and data so that the CPU can refer to it instead of RAM, and it is much faster than RAM. However, cache memory is expensive, and adding more doesn't always improve performance. To give an example of how cache memory works:

Consider a library. Popular books are constantly being checked out. To save time so that the librarian doesn't have to process the book, store it back, and have the next person who wants to borrow it find the shelf it's in, recently returned books are put in a cart next to the checkout area. A person who wants to borrow the book can then quickly find it there and get out sooner. If the book is no longer requested as much, it eventually gets put back. If the recently returned cart is too large, then the person or librarian may spend more time looking through it than it would take for the person to go find the book in its normal location.
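The "recently returned cart" behaves like a least-recently-used (LRU) cache, one common eviction policy. A minimal sketch in Python; the `LRUCache` class and the library data are made up for illustration:

```python
from collections import OrderedDict

# The "cart": a small least-recently-used cache sitting in front of a
# slow lookup (the library shelves, standing in for RAM).

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key, slow_lookup):
        if key in self.data:                 # cache hit: fast path
            self.data.move_to_end(key)       # mark as recently used
            return self.data[key]
        value = slow_lookup(key)             # cache miss: go to "RAM"
        self.data[key] = value
        if len(self.data) > self.capacity:   # evict least recently used
            self.data.popitem(last=False)
        return value

shelves = {"moby dick": "row 3", "dune": "row 7"}
cart = LRUCache(capacity=1)
print(cart.get("dune", shelves.get))   # miss: fetched from the shelves
print(cart.get("dune", shelves.get))   # hit: served from the cart
```

The capacity limit matters for the same reason the analogy gives: an oversized cart takes longer to search than the shelves themselves.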

On a semi-related note, if one looks at the specs of a CPU, they'll likely find that it has two types of L1 cache: one for instructions and one for data. This might seem odd, since everything above this level in the Memory Hierarchy allows both instructions and data to exist in it. This separation mostly has to do with two schools of thought regarding how CPUs access memory. The first is the von Neumann architecture, where data and instructions live in the same memory pool. The second is the Harvard architecture, which separates data and instructions into their own memory pools.

The main benefit of the von Neumann architecture is cost, since the system only needs one pool of memory and one memory bus. The Harvard architecture's main benefit is performance, as the CPU can fetch instructions and data independently and doesn't have to figure out which one is which. Modern CPUs use the so-called modified Harvard architecture, where the execution core itself is Harvard but everything else is von Neumann. In addition, modern memory controllers can emulate the Harvard architecture on a von Neumann machine by marking sections of memory space as "no execute," making them data-only.

Branch Prediction and Speculative Execution

In order to help keep the pipeline flowing, logic was added to the instruction decoding and scheduling parts of the CPU to handle branches ahead of time. Branch prediction aims to guess which branch will be taken and queues up those instructions. The guessing is based on several methods, from simply always assuming one direction to keeping a counter that tracks how many times a branch went one way or the other. With branch prediction came speculative execution, which goes ahead and executes instructions ahead of time, even if the CPU doesn't actually know whether those instructions should be executed.
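The counter-based guessing mentioned above is commonly implemented as a two-bit saturating counter per branch, a textbook scheme. A minimal sketch, with a made-up outcome history:

```python
# A two-bit saturating counter: it takes two wrong guesses in a row to
# flip the prediction, so a single odd outcome doesn't derail it.

class TwoBitPredictor:
    def __init__(self):
        self.counter = 2          # 0-1 predict not-taken, 2-3 predict taken

    def predict(self):
        return self.counter >= 2

    def update(self, taken):      # nudge toward what actually happened
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

p = TwoBitPredictor()
history = [True, True, False, True]   # actual outcomes of one branch
correct = 0
for outcome in history:
    correct += (p.predict() == outcome)
    p.update(outcome)
print(correct, "of", len(history), "predicted correctly")  # 3 of 4
```

Loops are why this works so well in practice: a loop branch goes the same way hundreds of times before it finally exits.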

However, speculative execution had issues that, in 2017, resulted in the discovery of some of the biggest security vulnerabilities to date. Dubbed Meltdown (affecting mainly Intel systems) and Spectre (a general class of this problem), the gist is that CPUs up until that point didn't clean up the cache after speculative execution. While this isn't a problem per se, it was found that attackers could exploit it if something sensitive was left behind. For example, if a program was verifying the login for an account, the CPU still has to load data to check the credentials. That data may be loaded ahead of time even if it never gets used, and since it may remain in cache, an attacker can grab it.

There are some ways to avoid branching. The primary method, if branching on a logical level can't be avoided, is something called predication. The basic idea is that a compare sets a flag, and depending on the flag, an instruction's result is either kept or thrown away. While this helps keep the pipeline moving because the CPU no longer has to make jumps, it can waste execution time if the code paths are long enough. Predication's use is usually limited to simpler things, like conditional variable assignment.
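Predication can be mimicked in Python by turning the comparison into a 0/1 flag and letting arithmetic select the result, so no branch is ever taken. The function names are made up; real CPUs do this with predicated or conditional-move instructions:

```python
# Conditional assignment two ways: with a branch, and "predicated."

def max_branching(a, b):
    if a > b:          # a conditional jump the pipeline must predict
        return a
    return b

def max_predicated(a, b):
    flag = int(a > b)                  # the compare just sets a 0/1 flag
    return flag * a + (1 - flag) * b   # both paths "execute"; one wins

print(max_branching(3, 9), max_predicated(3, 9))   # 9 9
```

Notice the cost: the predicated version always evaluates both sides, which is exactly why the technique only pays off when the paths are short.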

Out of Order execution (OOE)

A type of instruction reordering where instructions queued for execution can cut in line if they don't depend on results from an earlier instruction, allowing them to run ahead and avoid execution stalls. OOE's main issue is that it requires complex hardware to ensure the output retains the logical ordering, eating into die space and power consumption. Historically it was kept out of processors meant for small electronics, but it has crept its way back in because the performance boost it offers now outweighs its drawbacks.

Note this is different from speculative execution: out-of-order execution re-orders the instruction queue, rather than queuing up instructions the CPU isn't sure it'll run.
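The "cut in line" rule can be sketched as a dependency check. Everything here (the instruction tuples, register names, and the single-pending-write setup) is a made-up simplification; real hardware tracks this with scoreboards or reservation stations:

```python
# Each instruction is (name, reads, writes); one earlier instruction is
# stalled waiting on slow memory, and we ask which queued instructions
# may safely run ahead of it.

def can_issue(instr, pending_writes):
    _, reads, writes = instr
    # Safe to run early only if it touches nothing the stalled
    # instruction will write (a simplified dependency check).
    return not (set(reads) | set(writes)) & pending_writes

stalled = ("load r1 <- [mem]", [], ["r1"])       # waiting on slow RAM
queued = [
    ("add r2 <- r1 + 1", ["r1"], ["r2"]),        # depends on the load
    ("add r3 <- r4 + r5", ["r4", "r5"], ["r3"]), # independent: can cut in
]
pending = set(stalled[2])
for instr in queued:
    print(instr[0], "->", "issue" if can_issue(instr, pending) else "wait")
```

The expensive part the text mentions is what this sketch leaves out: the retirement logic that puts the results back into program order afterwards.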

Superscalar execution and simultaneous multithreading (SMT)

In order to increase instruction throughput, parts of the CPU are duplicated or expanded to handle more stuff at once. This could be the instruction decoder accepting multiple instructions at the same time, or the execution unit having multiple ALUs so several arithmetic/logic instructions can execute at once. On a simple level, this duplicates the entire execution pipeline. Strictly speaking though, this isn't an implementation of a Multi-Core Processor, as the instructions must come from the same unit of execution of a program (typically a thread).

However, some processor manufacturers found a way to simulate a multicore processor via a technique called simultaneous multithreading (SMT). This allows one program thread to take up as many resources as it needs; then, if there are any left over, the CPU can schedule another thread's execution on the remainder. You can think of this like two children sharing a box of crayons, or technicians sharing a toolbox rather than everyone having their own. One famous implementation is Intel's Hyper-Threading, as Intel was the first company to bring this feature to consumer computers, though most others simply call it SMT.

Microcode, Micro-instructions, and Micro-operations

These three terms are related; the gist is that they solve the problem of a CPU not behaving as intended, whether due to a poor design or a complex implementation gone wrong.

Many parts of the processor are essentially hardware "programs." Because of this, bugs have shown up in hardware that prevented processors from running correctly. These bugs either result from invalid instructions (an instruction value that doesn't map to anything specific, resulting in undefined behavior) or from running a series of instructions that paints the CPU into a corner, like disabling interrupts and then executing a 'sleep until interrupted' instruction. Infamously, these are called "Halt and Catch Fire" routines: urban legend has it that if a processor ran into such a condition, it would execute a single instruction over and over to the point where the electricity going through the processor was concentrated on the part servicing that instruction, causing it to heat up until it caught fire.

In simpler, traditional CPUs, the control unit, which handles how instructions are executed and how data is directed, acted directly on the instructions from the ISA. However, as ISAs get more complicated and innovative ways to execute instructions are created, these hardware bugs become more costly to fix, because fixing them requires another revision of the hardware. By the time a bug is discovered and characterized, thousands upon thousands of units have likely been sold, leaving plenty of people with the issue; often the company must run a free-replacement program in order to not be sued. Software can mitigate the problem by not running the offending instructions or working around them, but as long as the hardware bug exists, there's always the potential of the computer crashing or worse. This is where microcode designs come in, though since microcode only affects how instructions are handled before being sent off to the execution unit, any bugs past decoding can still present problems.

Rather than handle the ISA directly, the control unit and execution unit work on their own internal sort of ISA, typically called micro-instructions. These map to micro-operations that the execution unit handles. The ISA is translated via microcode, which in its simplest implementation is a ROM containing a mapping of which ISA instructions map to which micro-instructions. The main benefit is that if there's an issue with how the ISA is translated, it can be patched with a microcode update. In most modern CPUs that employ this, the microcode is stored in SRAM inside the CPU, so any updates have to be applied again on every boot. This is mostly a means of making sure the CPU isn't made worse by a bad microcode update, and of letting the user decide whether the benefit is worth it (as was the case with the Spectre/Meltdown updates).
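The "ROM containing a mapping" idea can be sketched as a lookup table. All of the instruction and micro-operation names here are made up for illustration:

```python
# A microcode ROM in its simplest form: a table mapping each ISA
# instruction to the micro-operations the execution core actually runs.

MICROCODE_ROM = {
    "ADD_MEM": ["load_operand", "alu_add", "store_result"],
    "INC_REG": ["alu_add_one"],
}

def decode(isa_instruction, rom=MICROCODE_ROM):
    return rom[isa_instruction]

# "Patching" is then just loading a replacement table (into SRAM on a
# real CPU, re-applied each boot), say to add a fix-up step, with no
# new silicon required:
patched = dict(MICROCODE_ROM, ADD_MEM=["load_operand", "alu_add",
                                       "fixup_flags", "store_result"])
print(decode("ADD_MEM"))
print(decode("ADD_MEM", rom=patched))
```

The execution unit only ever sees the micro-operations on the right-hand side, which is why a table update can change behavior without touching the hardware.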

Hybrid design

An approach that started with mobile devices, hybrid designs pair a few large, high-performance CPU cores with more numerous, smaller, high-efficiency cores. The idea is that the high-performance cores (typically called P-cores) handle tasks where faster performance is preferable or low response times are needed, while the high-efficiency cores (typically called E-cores) handle background, time-insensitive tasks or help boost the performance of the P-cores. Some hybrid designs include three or more performance tiers, but usually only two are used.

Companies started considering this as they realized that for certain applications, only a few tasks need high performance or low latency. This is especially true in mobile devices, where about the only time the device needs high performance is to service requests with a hard time limit (such as talking to a cell phone tower) or to run games (at least those with more flair than Candy Crush Saga). Another point is that mobile devices are power limited: while a number of "full-sized" cores can be placed on the die, the power limitations prevent them from reaching full performance anyway. Since processor design has to slide between performance and efficiency, with higher-performing cores gulping up more power, it didn't make sense to service background tasks on the performant cores. Hence the idea of throwing those tasks onto power-efficient cores, where people wouldn't really notice a loss in performance but would, in theory, see improved battery life. The design has since made its way to desktop and laptop computers, such as Intel's Alder Lake and AMD's Strix Point and Strix Halo.

The trick with this design is making sure the E-cores aren't so performance-deficient that they spend more energy overall doing a task. That is, if an E-core is only 50% as performant as a P-core, it needs to consume well below 50% of the power, otherwise it's no better, if not worse, than running the task on a P-core. In addition, you need to be able to stuff more E-cores into the same die space as a P-core to make the value proposition better. In Intel's case, they were able to fit four E-cores in the same space as a P-core, with an E-core performing about 60% as well as a P-core while consuming a quarter of the power. This means four E-cores equate to one P-core in power consumption and area, but can offer about twice the performance in a multi-threaded task. The various manufacturers of ARM-based processors have settled on a Prime, Performance, and Efficiency setup, typically with one Prime CPU and three or four each of the other two. AMD has made "compact" versions of its Zen 4 and Zen 5 cores, which it insists are not efficiency cores, despite being designed for efficiency.
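The Intel figures above can be checked with back-of-the-envelope arithmetic. The 60%/25% numbers are the approximate ones from the text, normalized so a P-core is 1.0 in performance, power, and area:

```python
# Die-area/power arithmetic for the four-E-cores-per-P-core trade-off.

p_perf, p_power, p_area = 1.0, 1.0, 1.0     # P-core, normalized
e_perf, e_power, e_area = 0.6, 0.25, 0.25   # one E-core, approx.

cluster = 4  # E-cores fitting in one P-core's footprint
print("area:",  cluster * e_area)             # 1.0 -> same die space
print("power:", cluster * e_power)            # 1.0 -> same power budget
print("multi-thread perf:", cluster * e_perf) # 2.4 -> ~2x a P-core
```

The catch, as the text notes, is that this only pays off for work that splits across all four cores; a single thread on one E-core is still just 0.6 of a P-core.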

One major problem on the software end is that the E-cores not only have raw performance differences from the other cores, they may also not implement all of the P-core's features. This causes issues when scheduling tasks, such as when you want an application to start slipping onto the E-cores. For instance, Intel's E-cores don't support AVX-512 instructions. This is something AMD deliberately avoided with the compact versions of its cores, which basically look like lower-clocked versions of the non-compact ones.

A note about "Process Node"

The term "process node" generally refers to how small the smallest feature of an integrated circuit (IC) is, with everything else built around that. So for instance, a "10nm process" is meant to convey that the smallest feature of a circuit built on it is 10nm. However, this hasn't held true since around 1994, and the measurement is now a marketing term that avoids having a too-technical way of describing how the IC was built. For instance, would you rather hear "45nm process" or "high-k dielectric process"?

In addition, not every manufacturing firm produces the same result, and their techniques cause various differences despite being marketed under the same process node. For instance, AMD's Zen+ based processors are built on a 12nm process, but the chips themselves are the same size as the previous-generation Zen processors built on a 14nm process; the key difference was that AMD wanted more of a buffer between active components to help with heat transfer. This also led to confusion over where Intel stood versus its immediate competitors, TSMC and Samsung. For instance, the name "Intel 10" suggests the process is worse than TSMC's or Samsung's 7nm, but Intel was able to achieve a transistor density similar to its competitors' 7nm processes.

Simply put, it's best to interpret the number as nothing more than a generational number, rather than anything of technical value.

