Advanced CPU Designs: Crash Course Computer Science #9


Last episode

Computers have come a long way, from mechanical devices that could manage one calculation per second to CPUs running at kilohertz and then megahertz speeds.

Nearly all modern computers run at gigahertz speeds, meaning billions of instructions are executed every second, which is a truly enormous amount of computation.

 

Today's episode

In the early days of electronic computing, processors were typically made faster by improving the switching time of the transistors inside the chip (transistors being the building blocks of the logic gates, the ALU, and the other components covered in earlier episodes).

As transistors got faster and more efficient, processor designers also developed various techniques to boost performance, allowing not only simple instructions to run quickly but sophisticated operations to be performed as well.


In the last episode, we wrote a small program that let the CPU divide two numbers by performing a series of repeated subtractions. For example, 16 / 4 is computed by subtracting 4 from 16 four times; when the result reaches zero or goes negative, we know the division is done.

But an approach like this consumes many clock cycles and is far from efficient.
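As a rough sketch of why, here is that repeated-subtraction division in Python (the episode actually used a toy assembly program, so this is only an illustration):

```python
def divide_by_subtraction(dividend, divisor):
    """Divide the way the episode-7 program did: subtract in a loop."""
    quotient, remainder = 0, dividend
    while remainder >= divisor:       # each pass costs several instructions
        remainder -= divisor          # 16 -> 12 -> 8 -> 4 -> 0
        quotient += 1
    return quotient, remainder

print(divide_by_subtraction(16, 4))   # (4, 0), but only after 4 loop iterations
```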

That's why most computer processors today include divide as one of the instructions the ALU can perform directly in hardware.

Of course, this extra circuitry makes the ALU bigger and more complicated to design, but it also makes the CPU more capable. This trade-off between complexity and speed has come up many times in computing history.

For example, modern computer processors have special circuits for things like graphics operations, decoding compressed video, and encrypting files, all operations that would take an enormous number of clock cycles if done with standard operations.

Instructions keep being added to handle this kind of work, so instruction sets tend to grow larger and larger, while maintaining backwards compatibility with old opcodes.

The Intel 4004, the first truly integrated CPU, had just 46 instructions, which was enough to build a fully functional computer.

Modern computer processors, however, have thousands of instructions, which take advantage of clever and complicated internal circuitry.

But high clock speeds and fancy instruction sets lead to another problem: getting data in and out of the CPU quickly enough. It's like having a powerful steam locomotive with no way to shovel in coal fast enough.

In this case, the bottleneck is RAM.

RAM is a memory module that sits outside the CPU, which means data has to be transmitted to and from it along sets of data wires called buses.

 

This bus might only be a few centimeters long, and remember those electrical signals are traveling near the speed of light, but when you are operating at gigahertz speeds – that’s billionths of a second – even this small delay starts to become problematic.

It also takes time for RAM itself to look up the address, retrieve the data, and configure itself for output.

So a “load from RAM” instruction might take dozens of clock cycles to complete, and during this time the processor is just sitting there idly waiting for the data.

 

Cache

One solution is to put a little piece of RAM right on the CPU -- called a cache.

There isn’t a lot of space on a processor’s chip, so most caches are just kilobytes or maybe megabytes in size, where RAM is usually gigabytes.

Having a cache speeds things up in a clever way.

When the CPU requests a memory location from RAM, the RAM can transmit not just that single value but a whole block of data.

This takes only a little bit more time than transmitting a single value, but it allows this data block to be saved into the cache.

This tends to be really useful because computer data is often arranged and processed sequentially.

[Figure: RAM transmitting a whole block of data to the cache]

For example, let's say the processor is totalling up daily sales for a restaurant. It starts by fetching the first transaction from RAM at memory location 100. The RAM, instead of sending back just that one value, sends a block of data, from memory location 100 through 200, which are then all copied into the cache.

Now, when the processor requests the next transaction to add to its running total, the value at address 101, the cache will say “Oh, I’ve already got that value right here, so I can give it to you right away!” And there’s no need to go all the way to RAM.

Because the cache is so close to the processor, it can typically provide the data in a single clock cycle -- no waiting required.

This speeds things up tremendously over having to go back and forth to RAM every single time.

Cache hit & Cache miss

When data requested from RAM is already stored in the cache like this, it's called a cache hit, and if the data requested isn't in the cache, so you have to go to RAM, it's called a cache miss.
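A toy Python simulation of this behavior (the 100-address block size is borrowed from the restaurant example; real caches use fixed-size cache lines and hold many blocks at once):

```python
class ToyCache:
    """One-block toy cache: a miss pulls a whole block of 100 addresses from RAM."""
    BLOCK = 100

    def __init__(self, ram):
        self.ram = ram
        self.base = None              # start address of the block currently cached
        self.lines = []

    def read(self, addr):
        base = (addr // self.BLOCK) * self.BLOCK
        if self.base == base:
            print(f"cache hit  at {addr}")                 # ~1 clock cycle
        else:
            print(f"cache miss at {addr}: load block {base}-{base + self.BLOCK - 1}")
            self.base = base                               # dozens of cycles to reach RAM
            self.lines = self.ram[base:base + self.BLOCK]
        return self.lines[addr - base]

ram = list(range(1000))               # stand-in for main memory
cache = ToyCache(ram)
cache.read(100)                       # miss: addresses 100-199 copied into the cache
cache.read(101)                       # hit: served without going back to RAM
```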

The cache can also be used like a scratch space, storing intermediate values when performing a longer, or more complicated calculation.


Continuing our restaurant example, let’s say the processor has finished totalling up all of the sales for the day, and wants to store the result in memory address 150.

Like before, instead of going all the way back to RAM to save that value, it can be stored in the cached copy, which is faster to save to, and also faster to access later if more calculations are needed.

But this introduces an interesting problem -- the cache's copy of the data is now different from the real version stored in RAM.

This mismatch has to be recorded, so that at some point everything can get synced up.

For this purpose, the cache has a special flag for each block of memory it stores, called the dirty bit.

Dirty bit

Most often this synchronization happens when the cache is full, but a new block of memory is being requested by the processor.

Before the cache erases the old block to free up space, it checks its dirty bit, and if it’s dirty, the old block of data is written back to RAM before loading in the new block.
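Extending the toy cache sketch above with a dirty bit and write-back on eviction (again simplified; a real cache tracks a dirty bit per block it stores):

```python
class WriteBackCache(ToyCache):       # ToyCache from the sketch above
    def __init__(self, ram):
        super().__init__(ram)
        self.dirty = False            # does the cached block differ from RAM?

    def write(self, addr, value):
        self.read(addr)               # make sure the block is in the cache
        self.lines[addr - self.base] = value
        self.dirty = True             # cached copy now differs from RAM

    def evict(self):
        if self.dirty:                # sync the old block back before dropping it
            self.ram[self.base:self.base + self.BLOCK] = self.lines
        self.base, self.dirty = None, False

cache = WriteBackCache(list(range(1000)))
cache.write(150, 9999)                # stored in the cache only; RAM still stale
cache.evict()                         # dirty bit set, so block 100-199 written back
```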

 

Instruction Pipelining

Another trick to boost CPU performance is called instruction pipelining.

In episode 7, our example processor performed the fetch-decode-execute cycle sequentially and in a continuous loop: Fetch-decode-execute, fetch-decode-execute, fetch-decode-execute, and so on.

This meant our design required three clock cycles to execute one instruction.

But each of these stages uses a different part of the CPU, meaning there is an opportunity to parallelize!

While one instruction is getting executed, the next instruction could be getting decoded, and the instruction beyond that fetched from memory.

All of these separate processes can overlap so that all parts of the CPU are active at any given time. In this pipelined design, an instruction is executed every single clock cycle, which triples the throughput.
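A small sketch that prints this overlap, assuming instruction Ii enters the fetch stage at cycle i and advances one stage per cycle:

```python
def print_pipeline(n_instructions):
    """Show fetch/decode/execute overlapping across clock cycles."""
    stages = ["fetch", "decode", "execute"]
    for cycle in range(n_instructions + len(stages) - 1):
        cells = []
        for s, name in enumerate(stages):
            i = cycle - s             # which instruction occupies this stage
            label = f"I{i}" if 0 <= i < n_instructions else "--"
            cells.append(f"{name} {label}")
        print(f"cycle {cycle}: " + " | ".join(cells))

print_pipeline(4)   # from cycle 2 onward, one instruction finishes every cycle
```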

But just like with caching this can lead to some tricky problems. A big hazard is a dependency in the instructions.

For example, you might fetch something that the currently executing instruction is just about to modify, which means you’ll end up with the old value in the pipeline.

To compensate for this, pipelined processors have to look ahead for data dependencies, and if necessary, stall their pipelines to avoid problems.
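A minimal sketch of that look-ahead, assuming made-up three-address instructions of the form (destination, source1, source2):

```python
# Made-up three-address instructions: (destination, source1, source2)
program = [
    ("R1", "R2", "R3"),               # R1 = R2 + R3
    ("R4", "R1", "R5"),               # R4 = R1 + R5 -- reads R1 before it's written back
]

def needs_stall(prev, curr):
    """Read-after-write dependency: curr reads what prev is still computing."""
    return prev[0] in curr[1:]

for prev, curr in zip(program, program[1:]):
    if needs_stall(prev, curr):
        print(f"stall: {curr} must wait for {prev[0]}")
```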

Out-of-order execution

High end processors, like those found in laptops and smartphones, go one step further and can dynamically reorder instructions with dependencies in order to minimize stalls and keep the pipeline moving, which is called out-of-order execution.

As you might imagine, the circuits that figure this all out are incredibly complicated.

Nonetheless, pipelining is tremendously effective and almost all processors implement it today.

Another big hazard is conditional jump instructions -- we talked about one example, JUMP NEGATIVE, last episode.

These instructions can change the execution flow of a program depending on a value.

A simple pipelined processor will perform a long stall when it sees a jump instruction, waiting for the value to be finalized.

Only once the jump outcome is known, does the processor start refilling its pipeline.

But, this can produce long delays, so high-end processors have some tricks to deal with this problem too.


Imagine an upcoming jump instruction as a fork in a road - a branch.

Advanced CPUs guess which way they are going to go, and start filling their pipeline with instructions based off that guess – a technique called speculative execution.

When the jump instruction is finally resolved, if the CPU guessed correctly, then the pipeline is already full of the correct instructions and it can motor along without delay.

Pipeline flush

However, if the CPU guessed wrong, it has to discard all its speculative results and perform a pipeline flush - sort of like when you miss a turn and have to do a u-turn to get back on route, and stop your GPS’s insistent shouting.

Branch prediction

To minimize the effects of these flushes, CPU manufacturers have developed sophisticated ways to guess which way branches will go, called branch prediction.

Instead of being a 50/50 guess, today’s processors can often guess with over 90% accuracy!
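One classic scheme is a 2-bit saturating counter per branch, sketched here on a hypothetical loop branch that is taken nine times out of ten:

```python
def predict(outcomes):
    """2-bit saturating counter: predict 'taken' while the counter is 2 or 3."""
    counter, correct = 2, 0           # start weakly biased toward 'taken'
    for taken in outcomes:
        correct += (counter >= 2) == taken
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)

# A loop branch taken nine times, then falling through once, over and over:
history = ([True] * 9 + [False]) * 10
print(f"accuracy: {predict(history):.0%}")   # 90%
```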

Superscalar

In an ideal case, pipelining lets you complete one instruction every single clock cycle, but then superscalar processors came along which can execute more than one instruction per clock cycle.

Even in a pipelined design, whole areas of the processor might be totally idle during the execute phase.

For example, while executing an instruction that fetches a value from memory, the ALU is just going to be sitting there, not doing a thing.

So why not fetch-and-decode several instructions at once, and whenever possible, execute instructions that require different parts of the CPU all at the same time!?

But we can take this one step further and add duplicate circuitry for popular instructions.

For example, many processors will have four, eight or more identical ALUs, so they can execute many mathematical instructions all in parallel!
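A sketch of that issue logic, assuming hypothetical three-address add instructions and dispatching only instructions whose inputs don't depend on a result produced in the same cycle:

```python
N_ALUS = 4                            # duplicate ALUs available each cycle
program = [                           # made-up (destination, source1, source2) adds
    ("R1", "R2", "R3"),
    ("R4", "R5", "R6"),
    ("R7", "R1", "R8"),               # needs R1, so it can't share a cycle with the first
    ("R9", "R10", "R11"),
]

cycle, i = 0, 0
while i < len(program):
    issued, written = [], set()
    while i < len(program) and len(issued) < N_ALUS:
        dest, s1, s2 = program[i]
        if s1 in written or s2 in written:    # input produced in this same cycle
            break                             # stop issuing; wait for the next cycle
        issued.append(program[i])
        written.add(dest)
        i += 1
    print(f"cycle {cycle}: issue {issued}")
    cycle += 1
```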

Multi-core processors

Ok, the techniques we’ve discussed so far primarily optimize the execution throughput of a single stream of instructions, but another way to increase performance is to run several streams of instructions at once with multi-core processors.

You might have heard of dual core or quad core processors. This means there are multiple independent processing units inside of a single CPU chip.

In many ways, this is very much like having multiple separate CPUs, but because they’re tightly integrated, they can share some resources, like cache, allowing the cores to work together on shared computations.
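As a software-level illustration (Python's multiprocessing module, not anything CPU-specific), running several independent streams of work across cores might look like this:

```python
from multiprocessing import Pool

def partial_sum(bounds):
    lo, hi = bounds
    return sum(range(lo, hi))         # one independent stream of work

if __name__ == "__main__":
    n, cores = 10_000_000, 4          # pretend we have a quad-core CPU
    chunks = [(c * n // cores, (c + 1) * n // cores) for c in range(cores)]
    with Pool(cores) as pool:         # one worker process per core
        total = sum(pool.map(partial_sum, chunks))
    print(total == n * (n - 1) // 2)  # True: same answer as a single stream
```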

But, when more cores just isn’t enough, you can build computers with multiple independent CPUs!

High end computers, like the servers streaming this video from YouTube’s data center, often need the extra horsepower to keep it silky smooth for the hundreds of people watching simultaneously.

Super Computer

Two- and four-processor configurations are the most common right now, but every now and again even that much processing power isn't enough. So we humans get extra ambitious and build ourselves a supercomputer!

If you’re looking to do some really monster calculations – like simulating the formation of the universe - you’ll need some pretty serious compute power.

A few extra processors in a desktop computer just isn’t going to cut it. You’re going to need a lot of processors.

When this video was made (2017), the world's fastest computer was located in the National Supercomputing Center in Wuxi, China. The Sunway TaihuLight contains a brain-melting 40,960 CPUs, each with 256 cores!

That's over ten million cores in total... and each one of those cores runs at 1.45 gigahertz. In total, this machine can process 93 quadrillion -- that's 93 million-billions -- floating point math operations per second, known as FLOPS.
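Those headline numbers check out with a bit of arithmetic (peak figures, so a theoretical upper bound):

```python
cpus, cores_per_cpu, clock_hz = 40_960, 256, 1.45e9
total_cores = cpus * cores_per_cpu       # 10,485,760 -- over ten million cores
peak_flops = 93e15                       # 93 quadrillion operations per second
print(f"{total_cores:,} cores")
print(f"{peak_flops / total_cores / 1e9:.1f} GFLOPS per core")           # ~8.9
print(f"{peak_flops / total_cores / clock_hz:.1f} FLOPs per core-cycle") # ~6.1
```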


So long story short, not only have computer processors gotten a lot faster over the years, but also a lot more sophisticated, employing all sorts of clever tricks to squeeze out more and more computation per clock cycle.

Our job is to wield that incredible processing power to do cool and useful things.

That’s the essence of programming, which we’ll start discussing next episode.

 


Last time we looked at how a CPU works through a very simple CPU and instruction set; this time I got a sense of how real CPUs actually work.

Once again, parts of the Korean subtitles' commentary were inaccurate, so I copied the later portion over in English.