The central processing unit (CPU) is an electronic unit or integrated circuit that executes machine instructions (program code); it is the main part of the hardware of a computer or programmable logic controller. It is sometimes referred to as a microprocessor or simply a processor.
Initially, the term central processing unit described a specialized class of logical machines designed to execute complex computer programs. Because this purpose corresponded closely to the functions of the computer processors of the time, the term was naturally transferred to the computers themselves. The term and its abbreviation came into use with respect to computer systems in the 1960s. The design, architecture and implementation of processors have changed many times since then, but their main functions have remained the same.
The main characteristics of a CPU are its clock speed, performance, power consumption, the feature size of the lithographic process used in its production (for microprocessors), and its architecture.
Early CPUs were designed as unique building blocks for unique, even one-of-a-kind, computer systems. Later, computer manufacturers moved from the expensive practice of developing processors to run a single program, or a few highly specialized ones, to the serial production of standard classes of general-purpose processor devices. The trend toward standardization of computer components began in the era of the rapid development of semiconductors, mainframes and minicomputers, and accelerated with the advent of integrated circuits. The creation of microchips made it possible to further increase the complexity of CPUs while reducing their physical size. Standardization and miniaturization of processors have led to the deep penetration of digital devices into everyday life. Modern processors can be found not only in high-tech devices such as computers, but also in cars, calculators, mobile phones, and even children's toys. Most often they take the form of microcontrollers, in which additional components sit on the same chip as the computing device (program and data memory, interfaces, input-output ports, timers, and so on). The computing capabilities of a modern microcontroller are comparable to those of personal-computer processors of thirty years ago, and more often significantly exceed them.
The history of processor manufacturing closely follows the history of manufacturing technology for other electronic components and circuits.
The first stage, spanning the 1940s to the late 1950s, was the creation of processors using electromechanical relays, ferrite cores (as memory devices) and vacuum tubes. These were installed in slots on modules assembled into racks; a large number of such racks, connected by conductors, together made up a processor. Its distinctive features were low reliability, low speed and high heat dissipation.
The second stage, from the mid-1950s to the mid-1960s, was the introduction of transistors. Transistors were mounted on boards close to modern ones in appearance and installed in racks; as before, the average processor consisted of several such racks. Performance and reliability increased, and power consumption fell.
The third stage, which began in the mid-1960s, was the use of microchips. Initially, chips of a low degree of integration were used, containing simple transistor and resistor assemblies. As the technology developed, chips implementing individual elements of digital circuitry came into use (first elementary switches and logic gates, then more complex elements such as registers, counters and adders). Later, chips appeared that contained whole functional blocks of the processor: a microcode control unit, an arithmetic logic unit, registers, and units for working with the data and command buses.
The fourth stage, in the early 1970s, was the creation, thanks to a technological breakthrough in LSI and VLSI (large-scale and very-large-scale integrated circuits), of the microprocessor: a chip on whose die all the main elements and blocks of the processor are physically located. In 1971, Intel created the world's first 4-bit microprocessor, the 4004, designed for use in calculators. Gradually, almost all processors came to be produced in microprocessor form. For a long time the only exceptions were small-scale processors hardware-optimized for special problems (for example, supercomputers or processors for a number of military tasks) or processors with special requirements for reliability, speed, or protection from electromagnetic pulses and ionizing radiation. Gradually, with falling costs and the spread of modern technology, these processors too began to be manufactured in microprocessor form.
Today the words "microprocessor" and "processor" have become practically synonymous, but this was not always so: conventional (large) and microprocessor-based computers coexisted peacefully for at least 10-15 years, and only in the early 1980s did microprocessors supplant their older counterparts. Nevertheless, the central processing units of some supercomputers even today are complex assemblies built from LSI and VLSI chips.
The transition to microprocessors then allowed the creation of personal computers, which penetrated almost every home.
The first publicly available microprocessor was the 4-bit Intel 4004, introduced on November 15, 1971 by Intel Corporation. It contained 2300 transistors, ran at a clock frequency of 740 kHz (executing roughly 92,600 instructions per second) and cost $300.
It was followed by the 8-bit Intel 8080 and the 16-bit 8086, which laid the foundation for the architecture of all modern desktop processors. Because 8-bit memory modules were widespread, the cheaper 8088, a simplified version of the 8086 with an 8-bit data bus, was also released.
This was followed by its modification, the 80186.
The 80286 processor introduced a protected mode with 24-bit addressing, which allowed the use of up to 16 MB of memory.
The Intel 80386 processor appeared in 1985 and introduced an improved protected mode, 32-bit addressing that allowed up to 4 GB of RAM and support for a virtual memory mechanism. This line of processors is built on a register computing model.
In parallel, microprocessors based on a stack computing model have also been developed.
Over the years, microprocessors have developed many different architectures. Many of them (in supplemented and improved form) are still used today. For example, Intel x86, which developed first into 32-bit IA-32, and later into 64-bit x86-64 (which Intel calls EM64T). x86 architecture processors were originally used only in IBM personal computers (IBM PCs), but are now increasingly used in all areas of the computer industry, from supercomputers to embedded solutions. You can also list architectures such as Alpha, POWER, SPARC, PA-RISC, MIPS (RISC architectures) and IA-64 (EPIC architecture).
In modern computers, processors take the form of a compact module (about 5 × 5 × 0.3 cm in size) that is inserted into a ZIF socket (AMD) or an LGA socket (Intel) on the motherboard.
Most modern personal computer processors are generally based on some version of the cyclic serial processing process described by John von Neumann.
In 1946, Burks, Goldstine, and von Neumann published a famous monograph, "Preliminary Discussion of the Logical Design of an Electronic Computing Instrument", which described in detail the design and technical characteristics of the future computer, later known as the "von Neumann architecture". This work developed ideas outlined by von Neumann in 1945 in a manuscript entitled "First Draft of a Report on the EDVAC".
A distinctive feature of the von Neumann architecture is that instructions and data are stored in the same memory.
Different architectures and different commands may require additional steps. For example, arithmetic instructions may require additional memory accesses during which operands are read and results are written.
Run cycle steps:
- The processor places the number stored in the program counter register on the address bus and issues a read command to the memory.
- The number placed on the bus is a memory address; the memory, having received the address and the read command, places the contents stored at that address on the data bus and signals readiness.
- The processor receives the number from the data bus, interprets it as a command (machine instruction) from its instruction set, and executes it.
- If the last instruction is not a jump, the processor increments the number stored in the program counter by one (assuming each instruction is one unit long); as a result, the program counter holds the address of the next instruction.
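The run cycle above can be sketched as a toy emulator. This is an illustrative model, not a real instruction set: the LOAD/ADD/JMP/HALT mnemonics, the single accumulator, and the one-instruction-per-memory-cell layout are all invented for the example.

```python
# A minimal sketch of the fetch-decode-execute cycle described above.
def run(memory):
    """Emulate the cycle: fetch at PC, decode, execute, advance PC."""
    pc, acc = 0, 0
    while True:
        op, arg = memory[pc]          # fetch: PC goes on the "address bus"
        if op == "LOAD":              # decode and execute
            acc = arg
            pc += 1                   # not a jump: increment the counter
        elif op == "ADD":
            acc += arg
            pc += 1
        elif op == "JMP":
            pc = arg                  # jump: PC is replaced, not incremented
        elif op == "HALT":
            return acc

program = [("LOAD", 2), ("ADD", 3), ("JMP", 4), ("ADD", 100), ("HALT", 0)]
print(run(program))  # the JMP skips the ADD 100, so this prints 5
```

Note that the jump instruction is the only one that writes the program counter directly; every other instruction simply increments it, exactly as in the step list above.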
This cycle is executed invariably, and it is this cycle that is called the process (hence the name of the device).
During a process, the processor reads a sequence of instructions from memory and executes them. Such a sequence of commands is called a program and represents the algorithm the processor carries out. The order in which commands are read changes when the processor reads a jump command, in which case the address of the next command may differ. Other examples of a change in the process are the receipt of a halt command or a switch to interrupt servicing.
The commands of the central processor are the lowest level of computer control, so the execution of each command is inevitable and unconditional. No check is made on the admissibility of the actions performed; in particular, possible loss of valuable data is not detected. For the computer to perform only legal actions, the commands must be properly organized into the desired program.
The speed of transition from one stage of the cycle to another is determined by the clock generator, which produces pulses that set the rhythm of the central processor. The frequency of the clock pulses is called the clock frequency.
Pipelining was introduced into central processors in order to increase performance. Executing each instruction usually requires a series of operations of the same type, for example: fetching the instruction from RAM, decoding it, computing the operand address, fetching the operand from RAM, executing the instruction, and writing the result back to RAM. Each of these operations is associated with one stage of the pipeline. For example, a MIPS-I microprocessor pipeline contains four stages:
- instruction fetch and decode,
- operand addressing and fetch from RAM,
- execution of the arithmetic operation,
- writing back the result of the operation.
As soon as a pipeline stage is freed, it immediately begins working on the next instruction. If we assume that each stage takes one unit of time, then executing a single instruction on a pipeline of n stages takes n units of time, but in the most optimistic case a new result is produced every unit of time.
Indeed, without a pipeline an instruction takes n units of time (since it still requires fetching, decoding, and so on), so m instructions require n ⋅ m units of time; with a pipeline, in the most optimistic case, executing m instructions takes only n + m − 1 units of time.
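The arithmetic above can be checked with a few lines of code; the stage count and instruction count below are arbitrary example values.

```python
def serial_time(n_stages, m_instructions):
    """Without a pipeline: every instruction passes through all stages alone."""
    return n_stages * m_instructions

def pipelined_time(n_stages, m_instructions):
    """Ideal pipeline: n cycles to fill it, then one result per cycle."""
    return n_stages + (m_instructions - 1)

n, m = 5, 100
print(serial_time(n, m))     # 500 time units without pipelining
print(pipelined_time(n, m))  # 104 time units on an ideal 5-stage pipeline
```

For large m the speedup approaches the number of stages n, which is why deeper pipelines looked attractive until stalls and mispredictions were taken into account.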
Factors that reduce pipeline efficiency:
1. Pipeline stalls, when some stages go unused (for example, addressing and fetching an operand from RAM are not needed if the instruction works only with registers).
2. Waiting: if one instruction uses the result of the previous one, it cannot start executing before the previous one completes (this is overcome by out-of-order execution).
3. Clearing the pipeline when a branch instruction hits it (this problem can be smoothed out using branch prediction).
Some modern processors have more than 30 pipeline stages, which improves performance but lengthens idle time (for example, after a conditional-branch misprediction). There is no consensus on the optimal pipeline length: different programs may have significantly different requirements.
A superscalar architecture is the ability to execute several machine instructions in one processor clock cycle by increasing the number of execution units. The emergence of this technology led to a significant increase in performance; at the same time, there is a limit to the growth in the number of execution units beyond which performance practically stops growing and the execution units sit idle. A partial solution to this problem is, for example, Intel's Hyper-Threading technology.
Complex instruction set computer (CISC): a processor architecture based on a complex instruction set. Typical representatives of CISC are microprocessors of the x86 family (although for many years these processors have been CISC only in their external instruction set: at the start of execution, complex instructions are broken down into simpler micro-operations executed by a RISC core).
Reduced instruction set computer (RISC): a processor architecture built on a simplified instruction set, characterized by fixed-length instructions, a large number of registers, register-to-register operations, and the absence of indirect addressing. The RISC concept was developed by John Cocke of IBM Research; the name was coined by David Patterson.
Simplifying the instruction set makes it possible to shorten the pipeline, which avoids delays on conditional and unconditional jumps. A homogeneous register set simplifies the compiler's work when optimizing executable program code. In addition, RISC processors are distinguished by lower power consumption and heat dissipation.
Among the first implementations of this architecture were MIPS, PowerPC, SPARC, Alpha, PA-RISC processors. ARM processors are widely used in mobile devices.
Minimum instruction set computer (MISC): a further development of the ideas of Chuck Moore's team, who believes that the principle of simplicity, originally central to RISC processors, was pushed into the background too quickly. In the heat of the race for maximum performance, RISC processors have caught up with and overtaken many CISC processors in complexity. The MISC architecture is based on a stack computing model with a small number of instructions (approximately 20-30).
Very long instruction word (VLIW): a processor architecture in which the parallelism of computations is expressed explicitly in the instruction set. It forms the basis of the EPIC architecture. The key difference from superscalar CISC processors is that in the latter, a part of the processor (the scheduler) assigns work to the execution units, which takes a fairly short time, whereas in a VLIW processor the compiler is responsible for scheduling the execution units, which takes much more time (but the quality of the scheduling, and hence the performance, should theoretically be higher).
For example, Intel Itanium, Transmeta Crusoe, Efficeon and Elbrus.
Multi-core processors contain several processor cores in one package (on one or more dies).
Processors designed to run a single copy of an operating system on multiple cores are a highly integrated implementation of multiprocessing.
The first multi-core microprocessor was IBM's POWER4, which appeared in 2001 and had two cores.
In October 2004, Sun Microsystems released the UltraSPARC IV dual-core processor, which consisted of two modified UltraSPARC III cores. In early 2005, the dual-core UltraSPARC IV+ was created.
On May 9, 2005, AMD introduced the first dual-core, single-chip processor for consumer PCs, the Athlon 64 X2 with the Manchester core. Shipments of the new processors officially began on June 1, 2005.
On November 14, 2005, Sun released the eight-core UltraSPARC T1, with each core running 4 threads.
On January 5, 2006, Intel introduced the first dual-core processor on a single Core Duo chip for a mobile platform.
In November 2006, the first quad-core processor, the Intel Core 2 Quad based on the Kentsfield core, was released; it is an assembly of two Conroe dies in one package. Its descendant was the Intel Core 2 Quad on the Yorkfield core (45 nm), architecturally similar to Kentsfield but with a larger cache and higher operating frequencies.
In October 2007, eight-core UltraSPARC T2s went on sale, each core running 8 threads.
On September 10, 2007, native (single-die) quad-core processors for AMD Opteron servers went on sale, code-named AMD Opteron Barcelona during development. On November 19, 2007, the quad-core AMD Phenom processor for home computers went on sale. These processors implement the new K8L (K10) microarchitecture.
AMD took its own path, manufacturing quad-core processors on a single die (unlike Intel, whose first quad-core processors were in fact two dual-core dies glued together). Despite the progressiveness of this approach, the company's first "quad-core", the AMD Phenom X4, was not very successful. Depending on the model and the specific task, it lagged behind contemporary competing processors by 5 to 30 percent or more.
By the first or second quarter of 2009, both companies had updated their quad-core lines. Intel introduced the Core i7 family, consisting of three models running at different frequencies. The main highlights of this processor are a triple-channel memory controller (DDR3) and a technology that emulates eight cores (useful for some specific tasks). In addition, general optimization of the architecture significantly improved the processor's performance in many kinds of tasks. The weak side of the Core i7 platform is its excessive cost, since installing this processor requires an expensive motherboard based on the Intel X58 chipset and a triple-channel DDR3 memory kit, which at the time was also very expensive.
AMD, in turn, introduced the Phenom II X4 line. In developing it, the company took its mistakes into account: the cache size was increased (compared with the first-generation Phenom), and the processors moved to a 45 nm process technology, which reduced heat dissipation and allowed significantly higher operating frequencies. In general, the AMD Phenom II X4 is on a par with Intel's previous-generation processors (Yorkfield core) in performance but lags far behind the Intel Core i7. With the release of the six-core AMD Phenom II X6 1090T (Thuban core), the situation changed slightly in AMD's favor.
As of 2013, processors with two, three, four and six cores, as well as two-, three- and four-module AMD processors of the Bulldozer generation are widely available (the number of logical cores is 2 times more than the number of modules). In the server segment, 8-core Xeon and Nehalem processors (Intel) and 12-core Opterons (AMD) are also available.
Caching is the use of additional high-speed memory (a cache, from French cacher, "to hide") to store copies of blocks of information from main (RAM) memory that are likely to be accessed in the near future.
There are caches of the 1st, 2nd and 3rd levels (denoted L1, L2 and L3, from Level 1, Level 2 and Level 3). The L1 cache has the lowest latency (access time) but a small size; in addition, L1 caches are often made multi-ported. Thus, AMD K8 processors could perform one 64-bit write and one 64-bit read, or two 64-bit reads, per clock; the AMD K8L can perform two 128-bit reads or writes in any combination; Intel Core 2 processors can perform a 128-bit write and a 128-bit read per clock. An L2 cache usually has significantly higher access latency but can be made much larger. The L3 cache is the largest of all and quite slow, yet still much faster than RAM.
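The benefit of caching can be illustrated with a toy direct-mapped cache model. The line count, block size and access pattern below are invented for the example; real caches are far more sophisticated (associativity, write policies, prefetching).

```python
# A toy direct-mapped cache: each memory block maps to exactly one cache line.
class DirectMappedCache:
    def __init__(self, lines, block=4):
        self.lines, self.block = lines, block
        self.tags = [None] * lines         # which block each line currently holds
        self.hits = self.misses = 0

    def access(self, addr):
        blk = addr // self.block           # which memory block the address is in
        idx = blk % self.lines             # which cache line it maps to
        tag = blk // self.lines            # identifies the block within that line
        if self.tags[idx] == tag:
            self.hits += 1
        else:
            self.misses += 1               # would fetch the block from RAM
            self.tags[idx] = tag

cache = DirectMappedCache(lines=8)
for _ in range(10):                        # a loop re-reading the same small array
    for addr in range(32):
        cache.access(addr)
print(cache.hits, cache.misses)            # 312 hits, only 8 misses
```

A program looping over the same small array misses only on the first touch of each block; every later access is a hit. This locality of reference is exactly what caches exploit.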
The von Neumann architecture has the disadvantage of being sequential. No matter how huge the data array to be processed, each of its bytes must pass through the central processor, even if the same operation is required on all of them. This effect is called the von Neumann bottleneck.
To overcome this shortcoming, processor architectures, which are called parallel, have been proposed and are being proposed. Parallel processors are used in supercomputers.
Possible options for parallel architecture are (according to Flynn's classification):
SISD - one instruction stream, one data stream;
SIMD - one instruction stream, many data streams;
MISD - many instruction streams, one data stream;
MIMD - many instruction streams, many data streams.
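The first two of Flynn's classes can be sketched in ordinary code. Here a Python list comprehension merely stands in for a hardware vector instruction, so this is a model of the idea, not of real SIMD hardware.

```python
data = [1, 2, 3, 4]

# SISD: one instruction stream, one data stream -- one element per step.
sisd_result = []
for x in data:
    sisd_result.append(x * 2)

# SIMD: one instruction applied to many data elements in a single step
# (the list comprehension models one vector instruction).
def simd_mul(vector, scalar):
    return [x * scalar for x in vector]

simd_result = simd_mul(data, 2)
print(sisd_result == simd_result)  # True: same result, different execution model
```

Both models compute the same result; what differs is how many data elements one instruction touches, which is precisely the axis of Flynn's classification.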
For digital signal processing, especially with limited processing time, specialized high-performance signal microprocessors (digital signal processor, DSP) with a parallel architecture are used.
First, the developers receive a technical specification, on the basis of which decisions are made about the architecture of the future processor, its internal structure, and the manufacturing technology. Various groups are tasked with developing the corresponding functional blocks of the processor and ensuring their interaction and electromagnetic compatibility. Because the processor is in essence a digital machine that fully obeys the principles of Boolean algebra, a virtual model of the future processor is built using specialized software running on another computer. On this model the processor is tested: elementary commands and significant amounts of code are executed, the interaction of the device's various blocks is worked out, the design is optimized, and the errors inevitable in a project of this scale are hunted down.
After that, a physical model of the processor is built out of gate-array chips and microcircuits containing elementary functional blocks of digital electronics. On it, the electrical and timing characteristics of the processor are verified, the architecture is tested, the correction of discovered errors continues, and electromagnetic compatibility issues are clarified (for example, at a quite ordinary clock frequency of 1 GHz, conductors only 7 mm long already act as transmitting or receiving antennas).
Then begins the joint work of circuit engineers and process engineers, who use specialized software to convert the electrical circuit embodying the processor architecture into a chip layout. Modern automated design systems can, in the general case, produce a set of photomask stencils directly from the electrical circuit. At this stage, the technologists try to implement the technical solutions laid down by the circuit engineers with the available technology. This stage is one of the longest and most difficult in development, and it often requires the circuit designers to compromise by abandoning some architectural decisions. Some manufacturers of custom chips (foundries) offer developers (design centers or fabless companies) a compromise solution in which libraries of elements and blocks (standard cells), standardized to the available technology, are used during processor design. This imposes a number of restrictions on architectural decisions, but the technology-adjustment stage then effectively comes down to assembling a Lego set. In general, fully custom microprocessors are faster than processors built from existing libraries.
[Image: an 8-inch silicon wafer carrying multiple processor dies]
Main article: Technological process in the electronics industry
The next phase after design is the creation of a prototype microprocessor chip. In the manufacture of modern very-large-scale integrated circuits, lithography is used: layers of conductors, insulators and semiconductors are applied in turn to the substrate of the future microprocessor (a thin disc of single-crystal silicon or sapphire) through special masks containing openings. The corresponding substances are evaporated in a vacuum and deposited through the mask openings onto the processor die. Sometimes etching is used, when an aggressive liquid eats away the areas of the wafer not protected by the mask. Around a hundred processor dies are formed on the substrate at the same time. The result is a complex multilayer structure containing hundreds of thousands to billions of transistors. Depending on how it is connected, a transistor works in the chip as a transistor, resistor, diode or capacitor; creating these elements on the chip separately is, in general, unprofitable. After lithography is finished, the substrate is sawn into individual dies. Thin gold wires are bonded to the gold pads formed on them, serving as adapters to the contact pads of the chip package. Then, in the general case, the die's heat spreader and the chip lid are attached.
Then comes the stage of testing the processor prototype, when its compliance with the specified characteristics is checked and remaining undetected errors are sought. Only after that is the microprocessor put into production. But even during production the processor is constantly optimized in step with improving technology, new design solutions, and error discoveries.
Simultaneously with the development of general-purpose microprocessors, sets of peripheral circuits are developed for use with the microprocessor, on the basis of which motherboards are created. Developing a chipset is a task no less difficult than creating the microprocessor chip itself.
In the last few years, there has been a tendency to transfer part of the chipset components (memory controller, PCI Express bus controller) into the processor.
The power consumption of the processor is closely related to the manufacturing technology of the processor.
The first x86 processors consumed a tiny (by modern standards) amount of power, a fraction of a watt. Growth in the number of transistors and in clock frequency led to a significant increase in this parameter; the most productive models consume 130 W or more. The power consumption factor, insignificant at first, now seriously influences the evolution of processors:
- improvement of production technology to reduce consumption, search for new materials to reduce leakage currents, lowering the supply voltage of the processor core;
- the appearance of processor sockets with a large number of contacts (more than 1000), most of which are intended to power the processor: processors for the popular LGA775 socket have 464 main power contacts (about 60% of the total);
- a change in processor layout: the processor die moved from the inside to the outside of the package, for better heat removal to the cooling system's heatsink;
- installation of temperature sensors in the crystal and an overheating protection system that reduces the frequency of the processor or even stops it if the temperature rises unacceptably;
- the appearance in the latest processors of intelligent systems that dynamically change the supply voltage, the frequency of individual blocks and processor cores, and disable unused blocks and cores;
- the emergence of energy-saving modes that put the processor to "sleep" at low load.
Another CPU parameter is the maximum permissible temperature of the semiconductor die (TJMax) or of the processor surface at which normal operation is possible. Many consumer processors operate at surface (die) temperatures no higher than 85 °C. The processor's temperature depends on its workload and on the quality of heat removal. If the temperature exceeds the manufacturer's maximum, there is no guarantee that the processor will function normally; errors in programs or a computer freeze may occur, and in some cases irreversible changes inside the processor itself are possible. Many modern processors can detect overheating and limit their own performance in that event.
Passive heatsinks and active coolers are used to remove heat from microprocessors. For better contact with the heatsink, thermal paste is applied to the surface of the processor.
To measure the temperature of the microprocessor, a temperature sensor is usually installed inside it, in the center, under the microprocessor lid. In Intel microprocessors the sensor is a thermal diode, or a transistor with the collector and base connected together acting as a thermal diode; in AMD microprocessors it is a thermistor.
The most popular processors today produce:
- for personal computers, laptops and servers - Intel and AMD;
- for supercomputers - Intel and IBM;
- for graphics accelerators and high-performance computing - NVIDIA and AMD;
- for mobile phones and tablets[9] - Apple, Samsung, HiSilicon and Qualcomm.
Most processors for personal computers, laptops and servers are Intel-compatible in terms of instructions. Most of the processors currently used in mobile devices are ARM-compatible, that is, they have a set of instructions and programming interfaces developed by ARM Limited.
Intel processors: 8086, 80286, i386, i486, Pentium, Pentium II, Pentium III, Celeron (simplified Pentium), Pentium 4, Core 2 Duo, Core 2 Quad, Core i3, Core i5, Core i7, Core i9, Xeon (series of processors for servers), Itanium, Atom (series of processors for embedded technology), etc.
AMD's line includes processors of the x86 architecture (analogues of the 80386 and 80486, the K6 family and the K7 family - Athlon, Duron, Sempron) and of x86-64 (Athlon 64, Athlon 64 X2, Phenom, Opteron, etc.). IBM processors (POWER6, POWER7, Xenon, PowerPC) are used in supercomputers, seventh-generation game consoles and embedded technology; they were previously used in Apple computers.
Market shares of sales of processors for personal computers, laptops and servers by years:
- Loongson Family (Godson)
- ShenWei Family (SW)
- YinHeFeiTeng Family (FeiTeng)
- NEC VR (MIPS, 64 bit)
- Hitachi SH (RISC)
A common misconception among consumers is that processors with higher clock frequencies always perform better than processors with lower ones. In fact, performance comparisons based on clock frequency are valid only for processors with the same architecture and microarchitecture.
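A rough model makes the point: throughput depends on both the clock frequency and the average number of instructions completed per cycle (IPC), which is determined by the microarchitecture. The IPC and frequency figures below are hypothetical, chosen only to show that a lower-clocked design can win.

```python
def instructions_per_second(clock_hz, ipc):
    """Throughput = clock frequency * average instructions completed per cycle."""
    return clock_hz * ipc

# Hypothetical figures: a high-clock, low-IPC design vs a slower-clocked,
# wider design that completes more instructions each cycle.
deep_pipeline = instructions_per_second(3.8e9, ipc=0.8)
wide_core = instructions_per_second(2.4e9, ipc=1.6)
print(deep_pipeline < wide_core)  # True: the 2.4 GHz design is faster here
```

This is why comparing megahertz across different microarchitectures, as the paragraph above notes, says little about real performance.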