Software pipelining

In computer science, software pipelining is a technique used to optimize loops, in a manner that parallels hardware pipelining. Software pipelining is a type of out-of-order execution, except that the reordering is done by a compiler (or in the case of hand written assembly code, by the programmer) instead of the processor. Some computer architectures have explicit support for software pipelining, notably Intel's IA-64 architecture.

It is important to distinguish software pipelining, which is a target code technique for overlapping loop iterations, from modulo scheduling, the currently most effective known compiler technique for generating software pipelined loops. Software pipelining has been known to assembly language programmers of machines with instruction-level parallelism since such architectures existed. Effective compiler generation of such code dates to the invention of modulo scheduling by Rau and Glaeser.[1] Lam showed that special hardware is unnecessary for effective modulo scheduling. Her technique, modulo variable expansion is widely used in practice.[2] Gao et al. formulated optimal software pipelining in integer linear programming, culminating in validation of advanced heuristics in an evaluation paper.[3] This paper has a good set of references on the topic.

Example

Consider the following loop:

for i = 1 to bignumber
  A(i)
  B(i)
  C(i)
end

In this example, let A(i), B(i), C(i) be instructions, each operating on data i, that are dependent on each other. In other words, A(i) must complete before B(i) can start. For example, A could load data from memory into a register, B could perform some arithmetic operation on the data, and C could store the data back into memory. However, let there be no dependence between operations for different values of i. In other words, A(2) can begin before A(1) finishes.

Without software pipelining, the operations execute in the following sequence:

A(1) B(1) C(1) A(2) B(2) C(2) A(3) B(3) C(3) ...

Assume that each instruction takes 3 clock cycles to complete (ignore for the moment the cost of the looping control flow). Also assume (as is the case on most modern systems) that an instruction can be dispatched every cycle, as long as it has no dependencies on an instruction that is already executing. In the unpipelined case, each iteration thus takes 9 cycles to complete: 3 clock cycles for A(1), 3 clock cycles for B(1), and 3 clock cycles for C(1).

Now consider the following sequence of instructions with software pipelining:

A(1) A(2) A(3) B(1) B(2) B(3) C(1) C(2) C(3) ...

It can be easily verified that an instruction can be dispatched each cycle, which means that the same 3 iterations can be executed in a total of 9 cycles, giving an average of 3 cycles per iteration.

Implementation

Software pipelining is often used in combination with loop unrolling, and this combination of techniques is often a far better optimization than loop unrolling alone. In the example above, we could write the code as follows (assume for the moment that bignumber is divisible by 3):

for i = 1 to (bignumber - 2) step 3
  A(i)
  A(i+1)
  A(i+2)
  B(i)
  B(i+1)
  B(i+2)
  C(i)
  C(i+1)
  C(i+2)
end

Of course, matters are complicated if (as is usually the case) we can't guarantee that the total number of iterations will be divisible by the number of iterations we unroll. See the article on loop unrolling for more on solutions to this problem, but note that software pipelining prevents the use of Duff's device.[citation needed]

In the general case, loop unrolling may not be the best way to implement software pipelining. Consider a loop containing instructions with a high latency. For example, the following code:

for i = 1 to bignumber
  A(i) ; 3 cycle latency
  B(i) ; 3
  C(i) ; 12(perhaps a floating point operation)
  D(i) ; 3
  E(i) ; 3
  F(i) ; 3
end

would require 12 iterations of the loop to be unrolled to avoid the bottleneck of instruction C. This means that the code of the loop would increase by a factor of 12 (which not only affects memory usage, but can also affect cache performance, see code bloat). Even worse, the prologue (code before the loop for handling the case of bignumber not divisible by 12) will likely be even larger than the code for the loop, and very probably inefficient because software pipelining cannot be used in this code (at least not without a significant amount of further code bloat). Furthermore, if bignumber is expected to be moderate in size compared to the number of iterations unrolled (say 10-20), then the execution will spend most of its time in this inefficient prologue code, rendering the software pipelining optimization ineffectual.

By contrast, here is the software pipelining for our example (the prologue and epilogue will be explained later):

prologue
for i = 1 to (bignumber - 6)
  A(i+6)
  B(i+5)
  C(i+4)
  D(i+2) ; note that we skip i+3
  E(i+1)
  F(i)
end
epilogue

Before getting to the prologue and epilogue, which handle iterations at the beginning and end of the loop, let's verify that this code does the same thing as the original for iterations in the middle of the loop. Specifically, consider iteration 7 in the original loop. The first iteration of the pipelined loop will be the first iteration that includes an instruction from iteration 7 of the original loop. The sequence of instructions is:

Iteration 1: A(7) B(6) C(5) D(3) E(2) F(1)
Iteration 2: A(8) B(7) C(6) D(4) E(3) F(2)
Iteration 3: A(9) B(8) C(7) D(5) E(4) F(3)
Iteration 4: A(10) B(9) C(8) D(6) E(5) F(4)
Iteration 5: A(11) B(10) C(9) D(7) E(6) F(5)
Iteration 6: A(12) B(11) C(10) D(8) E(7) F(6)
Iteration 7: A(13) B(12) C(11) D(9) E(8) F(7)

However, unlike the original loop, the pipelined version avoids the bottleneck at instruction C. Note that there are 12 instructions between C(7) and the dependent instruction D(7), which means that the latency cycles of instruction C(7) are used for other instructions instead of being wasted.

The prologue and epilogue handle iterations at the beginning and end of the loop. Here is a possible prologue for our example above:

; loop prologue (arranged on lines for clarity)
A(1)
A(2), B(1)
A(3), B(2), C(1)
A(4), B(3), C(2) ; cannot start D(1) yet
A(5), B(4), C(3), D(1)
A(6), B(5), C(4), D(2), E(1)

Each line above corresponds to an iteration of the main pipelined loop, but without the instructions for iterations that have not yet begun. Similarly, the epilogue progressively removes instructions for iterations that have completed:

; loop epilogue (arranged on lines for clarity)
B(bignumber), C(bignumber-1), D(bignumber-3), E(bignumber-4), F(bignumber-5)
C(bignumber), D(bignumber-2), E(bignumber-3), F(bignumber-4)
D(bignumber-1), E(bignumber-2), F(bignumber-3)
D(bignumber), E(bignumber-1), F(bignumber-2)
E(bignumber), F(bignumber-1)
F(bignumber)

Difficulties of implementation

The requirement of a prologue and epilogue is one of the major difficulties of implementing software pipelining. Note that the prologue in this example is 18 instructions, 3 times as large as the loop itself. The epilogue would also be 18 instructions. In other words, the prologue and epilogue together are 6 times as large as the loop itself. While still better than attempting loop unrolling for this example, software pipelining requires a trade-off between speed and memory usage. Keep in mind, also, that if the code bloat is too large, it will affect speed anyway via a decrease in cache performance.

A further difficulty is that on many architectures, most instructions use a register as an argument, and that the specific register to use must be hard-coded into the instruction. In other words, on many architectures, it is impossible to code such an instruction as "multiply the contents of register X and register Y and put the result in register Z", where X, Y, and Z are numbers taken from other registers or memory. This has often been cited as a reason that software pipelining cannot be effectively implemented on conventional architectures.

In fact, Monica Lam presents an elegant solution to this problem in her thesis, A Systolic Array Optimizing Compiler (1989) (ISBN 0-89838-300-5). She calls it modulo variable expansion. The trick is to replicate the body of the loop after it has been scheduled, allowing different registers to be used for different values of the same variable when they have to be live at the same time. For the simplest possible example, let's suppose that A(i) and B(i) can be issued in parallel and that the latency of the former is 2 cycles. The pipelined body could then be:

A(i+2); B(i)

Register allocation of this loop body runs into the problem that the result of A(i+2) must stay live for two iterations. Using the same register for the result of A(i+2) and the input of B(i) will result in incorrect results.

However, if we replicate the scheduled loop body, the problem is solved:

 A(i+2); B(i)
 A(i+3); B(i+1)

Now a separate register can be allocated to the results of A(i+2) and A(i+3). To be more concrete:

 r1 = A(i+2); B(i) = r1
 r2 = A(i+3); B(i+1) = r2
 i = i + 2 // Just to be clear

On the assumption that each instruction bundle reads its input registers before writing its output registers, this code is correct. At the start of the replicated loop body, r1 holds the value of A(i+2) from the previous replicated loop iteration. Since i has been incremented by 2 in the meantime, this is actually the value of A(i) in this replicated loop iteration.

Of course, code replication increases code size and cache pressure just as the prologue and epilogue do. Nevertheless, for loops with large trip counts on architectures with enough instruction level parallelism, the technique easily performs well enough to be worth any increase in code size.

IA-64 implementation

Intel's IA-64 architecture provides an example of an architecture designed with the difficulties of software pipelining in mind. Some of the architectural support for software pipelining includes:

  • A "rotating" register bank; instructions can refer to a register number that is redirected to a different register each iteration of the loop (eventually looping back around to the beginning). This makes the extra instructions[specify] inserted in the previous example unnecessary.
  • Predicates (used to "predicate" instructions; see Branch predication) that take their value from special looping instructions. These predicates turn on or off certain instructions in the loop, making a separate prologue and epilogue unnecessary.

References

  1. ^ B.R. Rau and C.D. Glaeser, "Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing", In Proceedings of the Fourteenth Annual Workshop on Microprogramming (MICRO-14), December 1981, pages 183-198
  2. ^ M. Lam, "Software pipelining: An effective scheduling technique for VLIW machines", In Proceedings of the ACM SIGPLAN 88 Conference on Programming Language Design and Implementation (PLDI 88), July 1988 pages 318-328. Also published as ACM SIGPLAN Notices 23(7).
  3. ^ J. Ruttenberg, G.R. Gao, A. Stoutchinin, and W. Lichtenstein, "Software pipelining showdown: optimal vs. heuristic methods in a production compiler", In Proceedings of the ACM SIGPLAN 1996 Conference on Programming Language Design and Implementation, June 1996, pages 1-11. Also published as ACM SIGPLAN Notices 31(5).

Read other articles:

Konsonan letup dwibibir nirsuarapNomor IPA101Pengkodean karakterEntitas (desimal)pUnikode (heks)U+0070X-SAMPApKirshenbaumpBrailleSampel suaranoicon sumber · bantuan Konsonan letup dwibibir nirsuara adalah jenis dari suara konsonan dwibibir yang digunakan dalam berbagai bahasa. Simbol IPAnya adalah ⟨p⟩. Dalam bahasa Indonesia huruf [p] terdapat dalam kata seperti panas dan pasir. Huruf [p] terdapat di hampir semua bahasa di dunia. Beberapa bahasa, seperti bahasa-baha...

 

 

Kantor Kejaksaan Agung最高檢察署Zuìgāo Jiǎnchá Shǔ (Mandarin)Chui-kô Kiám-chhat Su (Hakka)Divisi Investigasi Khusus Kantor Kejaksaan AgungInformasi lembagaDibentuk16 November 1928Wilayah hukumTaiwan (Republik Tiongkok)Kantor pusatZhongzheng, TaipeiPejabat eksekutifChiang Hui-min, Jaksa AgungLembaga indukKementerian KehakimanSitus webwww.tps.moj.gov.tw Kantor Kejaksaan Agung (Hanzi: 最高檢察署; Pinyin: Zuìgāo Fǎyuàn Jiǎnchá Shǔ; Pe̍h-ōe-jī: Chòe-ko Kiám...

 

 

SMA Negeri 1 MaturInformasiJenisNegeriAkreditasiBJumlah kelasX,XI,XIIJurusan atau peminatanIPA dan IPSKurikulumKurikulum Tingkat Satuan PendidikanAlamatLokasiJl. Matur-Palembayan 29 E Matur, Sumatera Barat, IndonesiaSitus webwww.sman1matur.comMoto SMA Negeri 1 Matur adalah salah satu sekolah menengah atas yang ada di Kabupaten Agam, Bukittinggi, Sumatera Barat. SMA ini adalah salah satu SMA unggul yang ada Indonesia dan mendapat peringkat akreditasi A dari BAN-DIKDASMEN serta ditunj...

Railway line in Glasgow, Scotland, UK This article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.Find sources: Maryhill Line – news · newspapers · books · scholar · JSTOR (May 2014) (Learn how and when to remove this message) Maryhill LineA Class 156 at Anniesland, the outer terminus of the Maryhill LineOverviewOwnerNetwork RailLoca...

 

 

此條目需要补充更多来源。 (2021年7月4日)请协助補充多方面可靠来源以改善这篇条目,无法查证的内容可能會因為异议提出而被移除。致使用者:请搜索一下条目的标题(来源搜索:美国众议院 — 网页、新闻、书籍、学术、图像),以检查网络上是否存在该主题的更多可靠来源(判定指引)。 美國眾議院 United States House of Representatives第118届美国国会众议院徽章 众议院旗...

 

 

Untuk bendera Republik Dominika, lihat Bendera Republik Dominika. Bendera Dominika Pemakaian Bendera sipil dan negara Perbandingan 1:2 Dipakai 1990 Rancangan Bidang hijau dengan salib bergaris tiga berwarna kuning, putih dan hitam, dan lingkaran merah di tengah yang terdapat Burung Beo Sisserou dilingkari bintang hijau. Perancang Alwin Bully Varian bendera Bendera Dominika Pemakaian Standar presidensial Rancangan Bidang hijau dengan lambang Dominika di tengahnya. Bendera Dominika diadopsi pad...

Country in Central Africa (1971–1997) This article is about the Democratic Republic of the Congo from 1965 to 1997. For the present-day country, see Democratic Republic of the Congo. For other uses, see Zaire (disambiguation). Republic of ZaireRépublique du Zaïre (French)Repubilika ya Zaïre (Kituba)Republíki ya Zaïre (Lingala)Jamhuri ya Zaïre (Swahili)Ditunga dia Zaïre (Luba-Lulua)1971–1997 Flag Emblem Motto: Paix — Justice — Travail[1]...

 

 

Elda Emma Anderson Elda Emma Anderson, (lahir di Green Lake, Wisconsin, Amerika Serikat, 5 Oktober 1899 – meninggal di Oak Ridge, Tennessee, 17 April 1961) adalah seorang fisikawati dan berperan penting dalam pengembangan di bidang fisika kesehatan. Kemampuan intelektuanya telah ada sejak kecil. Setelah lulus dari Ripon College ia mendapatkan gelar master dibidang Fisika dari Universitas Wisconsin pada tahun 1924. Pada tahun 1941, dia juga mendapatkan gelar doktornya di universitas yang sam...

 

 

  هذه المقالة عن المملكة الأردنية الهاشمية. لمعانٍ أخرى، طالع الأردن (توضيح). الأردن المملكة الأردنية الهاشمية الأردنعلم المملكة الأردنية الهاشمية الأردنشعار المملكة الأردنية الهاشمية الشعار الوطنيالله، الوطن، الملك  النشيد: السلام الملكي الأردنيالسلام الملكي ا...

Historical population CensusPop.Note%± 18606,482—18709,65849.0%188040,440318.7%189088,243118.2%1900122,93139.3%1910204,35466.2%1920334,16263.5%1930435,57330.3%1940499,26114.6%1950749,58750.1%19601,302,16173.7%19701,745,90034.1%19802,718,21555.7%19903,665,22834.8%20005,130,63240.0%20106,392,01724.6%20207,151,50211.9%Sources: 1910–2020[1]Note that early censusesmay not includeNative Americans in Arizona As of the 2020 United States census, Arizona had a population of 7,151,502. ...

 

 

International basketball competition EuroBasket 1983 Women19th FIBA European Women'sBasketball ChampionshipTournament detailsHost countryHungaryDates11–18 SeptemberTeams12Venue(s) (in Budapest, Miskolc, Zalaegerszeg host cities)Final positionsChampions Soviet Union (17th title)Official websiteOfficial website (archive)← 1981 1985 → The 1983 European Women Basketball Championship, commonly called EuroBasket Women 1983, was the 19th regional championship held b...

 

 

American artist and illustrator Stahl at work in Florida, 1966 Ben Stahl painting Ben-Hur movie scenes for MGM. Benjamin Albert Stahl (September 7, 1910 – October 19, 1987) was an American artist, illustrator and author. He showed precocious talent, winning a scholarship to the Art Institute of Chicago at age twelve. His artwork appeared in the International Watercolor Show at the Art Institute when he was sixteen. He later taught at the Art Institute, as well as at the American Ac...

University softball team This article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.Find sources: Notre Dame Fighting Irish softball – news · newspapers · books · scholar · JSTOR (April 2017) (Learn how and when to remove this message) Notre Dame Fighting IrishFounded1996–97UniversityUniversity of Notre DameAthletic directorPete B...

 

 

HU-331 Names Preferred IUPAC name 3-Hydroxy-2-[(1R,6R)-3-methyl-6-(prop-1-en-2-yl)cyclohex-2-en-1-yl]-5-pentylcyclohexa-2,5-diene-1,4-dione Identifiers CAS Number 137252-25-6 Y 3D model (JSmol) Interactive image ChemSpider 9568213 Y PubChem CID 126249 UNII I7Q196L4AU Y CompTox Dashboard (EPA) DTXSID10929758 InChI InChI=1S/C21H28O3/c1-5-6-7-8-15-12-18(22)19(21(24)20(15)23)17-11-14(4)9-10-16(17)13(2)3/h11-12,16-17,24H,2,5-10H2,1,3-4H3/t16-,17+/m0/s1 YKey: WDXXEUARVHTWQ...

 

 

Plant considered undesirable in a particular place or situation This article is about plants specifically called weeds. For the psychoactive plant commonly called weed, see Cannabis (drug). For other uses, see Weed (disambiguation). Weeds growing in the cracks of a concrete staircase (Epilobium roseum, Chelidonium majus, Oxalis corniculata, Plantago major) A weed is a plant considered undesirable in a particular situation, growing where it conflicts with human preferences, needs, or goals. ...

Road in Asia Asian Highway 42Route informationLength3,754 km (2,333 mi)Major junctionsNorth endLanzhou, ChinaMajor intersectionsBakhari, Mehsi, Bihar, IndiaSouth endBarhi, India LocationCountriesChina, Nepal, India Highway system Asian Highway Network ← AH41→ AH43 Asian Highway 42 (AH42) is a route of the Asian Highway Network, running 3,754 kilometres (2,333 mi) from AH5 in Lanzhou, China[1] to AH1 in Barhi, India.[2] It passes through the co...

 

 

Arrigo Jacchia (Lugo, 12 dicembre 1891[1] – Roma, 25 dicembre 1963) è stato un giornalista italiano. Indice 1 Biografia 2 Jacchia e Arnold Zweig 3 Opere 4 Note 5 Altri progetti Biografia Discendente della numerosa comunità ebraica di Lugo, si trasferisce giovanissimo a Roma, dove nel 1912, a soli 21 anni, è già redattore del quotidiano Il Messaggero. Successivamente è corrispondente parlamentare per La Stampa di Torino e Il Secolo di Milano. Le leggi razziali interrompono la su...

 

 

Stasiun Bakkai (抜海駅code: ja is deprecated , Bakkai-eki) adalah sebuah stasiun kereta api di Jalur Utama Sōya di Wakkanai, Hokkaido, Jepang, dioperasikan oleh JR Hokkaido. Stasiun Bakkai抜海駅Stasiun BakkaiLokasiKutonebetsu, Bakkai-mura, Wakkanai-shi, HokkaidoJepangKoordinat45°19′0.7″N 141°38′40.4″E / 45.316861°N 141.644556°E / 45.316861; 141.644556Operator JR HokkaidoJalur■ Jalur Utama SōyaLetak245.0 km dari AsahikawaJumlah peron2 jalur peronJu...

本條目存在以下問題,請協助改善本條目或在討論頁針對議題發表看法。 此条目也许具备关注度,但需要可靠的来源来加以彰显。(2022年8月26日)请协助補充可靠来源以改善这篇条目。 此條目没有列出任何参考或来源。 (2022年8月26日)維基百科所有的內容都應該可供查證。请协助補充可靠来源以改善这篇条目。无法查证的內容可能會因為異議提出而被移除。 由零開始郭富城...

 

 

This article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.Find sources: 1981 in video games – news · newspapers · books · scholar · JSTOR (March 2015) (Learn how and when to remove this message) Overview of the events of 1981 in video games List of years in video games … 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 198...