Micro-threads for multi-core and many-core processors are a mechanism for hiding memory latency, similar to hardware multi-threading architectures. Unlike those architectures, however, micro-threading is implemented in software on multi-core processors such as the Cell Broadband Engine, where it dynamically hides latencies caused by memory accesses or I/O operations.
Introduction
Micro-threading is a software-based threading framework that creates small threads inside multi-core or many-core processors. Each core may have two or more tiny threads that make use of its idle time. The idea resembles Intel's Hyper-Threading or the general multi-threading architectures found in modern microprocessors. It allows more than one thread to run on the same core without expensive context switching to the system's main memory, even if the core has no multi-threading hardware logic. Micro-threads mainly hide memory latency inside each core by overlapping computation with memory requests. The main difference between micro-threads and current threading models is that the context-switching overhead of micro-threads is very small. For example, the overhead of the micro-threads implementation on the Cell Broadband Engine is 160 nanoseconds, whereas context switching a whole core's (SPE) thread takes around 2000 microseconds. This low overhead is due to three main factors. First, micro-threads are very small: each micro-thread runs one or two simple but critical functions. Second, a micro-thread's context includes only the register file of the core on which the micro-thread is currently executing. Third, micro-threads are context switched to the core's dedicated cache rather than to main memory, which makes the switch very fast and efficient.
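The overlap of computation with memory requests can be illustrated with a minimal sketch. It is not the Cell implementation: it assumes Python generators as stand-ins for micro-threads, and the names `micro_thread` and `run_round_robin` are hypothetical. Each micro-thread gives up the core at the point where it would otherwise stall on a memory request, and a non-preemptive scheduler resumes the next one.

```python
from collections import deque

def micro_thread(name, chunks, log):
    """Hypothetical micro-thread: request a chunk, yield while the request
    would be in flight, then compute on the chunk once resumed."""
    for chunk in chunks:
        log.append(f"{name}: request {chunk}")
        yield  # give up the core instead of stalling on memory
        log.append(f"{name}: compute {chunk}")

def run_round_robin(threads):
    """Non-preemptive scheduler: resume each micro-thread in turn until
    every one of them has finished."""
    ready = deque(threads)
    while ready:
        thread = ready.popleft()
        try:
            next(thread)          # run until the next yield point
            ready.append(thread)  # still has work left, queue it again
        except StopIteration:
            pass                  # this micro-thread is done

log = []
run_round_robin([micro_thread("A", [0, 1], log),
                 micro_thread("B", [2, 3], log)])
for entry in log:
    print(entry)
```

In the printed trace, the request issued by one micro-thread is followed by work from the other, which is the interleaving that hides latency. The only state kept per micro-thread here is the generator frame, loosely analogous to saving just the register file.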
Background
As microprocessors become faster, mainly because more cores are added every few months, the memory latency gap is becoming wider. Memory latency was a few cycles in 1980 and now approaches 1000 cycles. If the microprocessor has enough cores, and they do not all send requests to main memory at the same time, memory latency is partially hidden in aggregate: some cores execute while others wait for a memory response. This is not the best situation for multi-core processors. High-performance computing experts strive to keep all cores busy all the time, because only then is complete utilization of the whole microprocessor possible. Creating conventional software threads does not solve the problem for one obvious reason: context switching a thread to main memory is far more expensive than the memory latency it is meant to hide. For example, on the Cell Broadband Engine, context switching any core's thread takes 2000 microseconds in the best case. Some software techniques, such as double buffering or multi-buffering, may solve the memory latency problem. However, they apply only to regular algorithms, in which the program knows which data chunk it will retrieve from memory next; it can then send the request to memory while it is processing previously requested data. This technique does not work if the program does not know the next data chunk to retrieve. In other words, it does not work for combinatorial algorithms, such as tree spanning or random list ranking. In addition, multi-buffering assumes that memory latency is constant and can be hidden statically. In reality, memory latency changes from one application to another. It depends on the overall load on the microprocessor's shared resources, such as the rate of memory requests and the load on the interconnect shared by the cores.
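The contrast between regular and combinatorial access patterns can be sketched as follows. This is an illustration under assumptions, not code from any implementation: `fetch` is a hypothetical stand-in for an asynchronous memory request (here it simply returns the data), and both function names are invented for the example.

```python
def process_double_buffered(fetch, num_chunks, compute):
    """Regular access: the index of the next chunk is known in advance,
    so it can be requested before the current chunk is processed."""
    results = []
    if num_chunks == 0:
        return results
    pending = fetch(0)  # prime the first buffer
    for i in range(num_chunks):
        current = pending
        if i + 1 < num_chunks:
            # On real hardware this would be an asynchronous request that
            # proceeds while compute(current) runs below.
            pending = fetch(i + 1)
        results.append(compute(current))
    return results

def rank_list(successor, head):
    """Irregular access: the next node is only known after the current
    node has arrived, so there is nothing to request ahead of time."""
    ranks = {}
    node, rank = head, 0
    while node is not None:
        ranks[node] = rank
        node = successor[node]  # data-dependent: must wait for this load
        rank += 1
    return ranks

sums = process_double_buffered(lambda i: [i, i + 1], 3, sum)
ranks = rank_list({"a": "c", "c": "b", "b": None}, "a")
```

In `rank_list` every step depends on the result of the previous load, which is why static buffering schemes have nothing to overlap with.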
Current implementation
Micro-threading is currently implemented on the Cell Broadband Engine,[1] where a three- to fivefold performance improvement could be achieved. So far it has been demonstrated for regular and combinatorial algorithms, and other efforts are trying to show its viability for scientific algorithms.
Micro-threads offer a way to hide memory latency that adapts to the run-time utilization of the microprocessor. For example, if memory latency is very high compared with processing and context-switching time, more micro-threads can be added; this happens when large data chunks are requested from memory or when there are many memory hot spots. If this ratio is small, fewer micro-threads may be introduced at run time. The right number depends on the application being implemented and on the system's run-time conditions.
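A back-of-the-envelope model makes this ratio concrete. It is an assumption made for illustration, not a formula from the cited work: while one micro-thread waits for memory, each of the others occupies the core for one processing step plus one context switch, so the wait is covered once the other micro-threads together run for at least as long as the latency. The function name `micro_threads_needed` is hypothetical, and all times must be in the same unit.

```python
import math

def micro_threads_needed(latency, compute, switch):
    """Estimate the micro-threads per core needed to cover `latency`,
    assuming each micro-thread runs for `compute` and then pays `switch`
    before the next one starts (simplified, assumed model)."""
    if compute + switch <= 0:
        raise ValueError("compute + switch must be positive")
    if latency <= 0:
        return 1  # nothing to hide, one micro-thread is enough
    # One waiting micro-thread plus enough others to fill the wait.
    return 1 + math.ceil(latency / (compute + switch))

high_ratio = micro_threads_needed(latency=1000, compute=200, switch=50)
low_ratio = micro_threads_needed(latency=1000, compute=900, switch=100)
print(high_ratio, low_ratio)
```

With a latency of 1000 cycles and 250 cycles of work per micro-thread, the estimate is five micro-threads; when the work per micro-thread matches the latency, two suffice.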
Critique
Although micro-threads provide a promising model for hiding memory latency on multi-core and many-core processors, the approach has some important limitations that need to be addressed:
- It requires special hardware support. Each core should have its own local interrupt facility to schedule micro-threads efficiently. However, if a non-preemptive scheduling policy is followed, the built-in interrupt facility is not required.
- It works best when each core has its own local cache that is managed manually by the programmer.
- Adding more micro-threads per core dramatically increases the load on the microprocessor's shared resources. More memory and synchronization requests will likely create congestion on those resources. However, this problem can be mitigated if the run-time system monitors critical measures of the microprocessor, such as memory latency, and slows down overall execution accordingly, either by reducing the number of micro-threads or by modifying the scheduling policy.
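The mitigation described in the last point can be sketched as a simple feedback rule. Everything here is hypothetical: the function name `throttle`, the congestion threshold of twice the baseline latency, and the step size of one micro-thread are assumptions chosen for illustration, not parameters of an existing run-time system.

```python
def throttle(current, target, observed_latency, baseline_latency,
             congestion_factor=2.0, min_threads=1):
    """Return the micro-thread count for the next interval.

    If the observed memory latency signals congestion, drop one
    micro-thread; otherwise move back toward `target` one step at a time.
    The threshold and step size are assumed values.
    """
    if observed_latency > congestion_factor * baseline_latency:
        return max(min_threads, current - 1)
    return min(target, current + 1)

count = 4
history = []
# Latency samples (in cycles) observed over successive intervals.
for latency in [1200, 2500, 2600, 1100, 1000]:
    count = throttle(count, target=4, observed_latency=latency,
                     baseline_latency=1000)
    history.append(count)
print(history)
```

Stepping by one micro-thread at a time keeps the adjustment gradual; a real run-time system might instead change the scheduling policy, as the text notes.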
References