Sum-addressed decoder

In CPU design, the use of a sum-addressed decoder (SAD) or sum-addressed memory (SAM) decoder is a method of reducing the latency of the CPU cache access and address calculation (base + offset). This is achieved by fusing the address generation sum operation with the decode operation in the cache SRAM.

Overview

The L1 data cache should usually be in the most critical CPU resource, because few things improve instructions per cycle (IPC) as directly as a larger data cache, a larger data cache takes longer to access, and pipelining the data cache makes IPC worse. One way of reducing the latency of the L1 data cache access is by fusing the address generation sum operation with the decode operation in the cache SRAM.

The address generation sum operation still must be performed, because other units in the memory pipe will use the resulting virtual address. That sum will be performed in parallel with the fused add/decode described here.

The most profitable recurrence to accelerate is a load, followed by a use of that load in a chain of integer operations leading to another load. Assuming that load results are bypassed with the same priority as integer results, then it's possible to summarize this recurrence as a load followed by another load—as if the program was following a linked list.

The rest of this page assumes an instruction set architecture (ISA) with a single addressing mode (register+offset), a virtually indexed data cache, and sign-extending loads that may be variable-width. Most RISC ISAs fit this description. In ISAs such as the Intel x86, three or four inputs are summed to generate the virtual address. Multiple-input additions can be reduced to a two-input addition with carry save adders, and the remaining problem is as described below. The critical recurrence, then, is an adder, a decoder, the SRAM word line, the SRAM bit line(s), the sense amp(s), the byte steering muxes, and the bypass muxes.

For this example, a direct-mapped 16 KB data cache which returns doubleword (8-byte) aligned values is assumed. Each line of the SRAM is 8 bytes, and there are 2048 lines, addressed by Addr[13:3]. The sum-addressed SRAM idea applies equally well to set associative caches.

Sum-addressed cache: collapse the adder and decoder

The SRAM decoder for this example has an 11-bit input, Addr[13:3], and 2048 outputs, the decoded word lines. One word line is driven high in response to each unique Addr[13:3] value.

In the simplest form of decoder, each of the 2048 lines is logically an AND gate. The 11 bits (call them A[13:3] and their complements (call them B[13:3]) are driven up the decoder. For each line, 11 bits or complements are fed into an 11-input AND gate. For instance, 1026 decimal is equal to 10000000010 binary. The function for line 1026 would be:

wordline[1026] = A[13] & B[12] & B[11] & B[10] & B[9] & B[8] & B[7] & B[6] & B[5] & A[4] & B[3]

Both the carry chain of the adder and the decoder combine information from the entire width of the index portion of the address. Combining information across the entire width twice is redundant. A sum-addressed SRAM combines the information just once by implementing the adder and decoder together in one structure.

Recall that the SRAM is indexed with the result of an add. Call the summands R (for register) and O (for the offset to that register). The sum-addressed decoder is going to decode R+O. For each decoder line, call the line number L.

Suppose that our decoder drove both R and O over each decoder line, and each decoder line implemented:

wordline[L] = (R+O)==L
(R+O)==L <=> R+O-L==0
         <=> R+O+~L+1==0
         <=> R+O+~L==-1==11..1.

A set of full adders can be used to reduce R+O+~L to S+C (this is carry save addition). S+C==11..1 <=> S==~C. There will be no carries in the final add. Note that since C is a row of carries, it's shifted up one bit, so that R[13:3]+O[13:3]+~L[13:3] == {0,S[13:3]} + {C[14:4],0}

With this formulation, each row in the decoder is a set of full adders which reduce the base register, the offset, and the row number to a carry-save format, and a comparator. Most of this hardware will be proven redundant below, but for now it's simpler to think of it all existing in each row.

Ignoring the LSBs: late select on carry

The formulation above checks the entire result of an add. However, in a CPU cache decoder, the entire result of the add is a byte address, and the cache is usually indexed with a larger address, in our example, that of an 8-byte block. It is preferable to ignore a few of the LSBs of the address. However, the LSBs of the two summands can't be ignored because they may produce a carry-out which would change the doubleword addressed.

If R[13:3] and O[13:3] are added to get some index I[13:3], then the actual address Addr[13:3] is equal to either I[13:3], or I[13:3] + 1, depending on whether R[2:0]+O[2:0] generates a carry-out. Both I and I+1 can be fetched if there are two banks of SRAM, one with even addresses and one with odd. The even bank holds addresses 000xxx, 010xxx, 100xxx, 110xxx, etc., and the odd bank holds addresses 001xxx, 011xxx, 101xxx, 111xxx, etc. The carry-out from R[2:0]+O[2:0] can then be used to select the even or odd doubleword fetched later.

Note that fetching from two half-size banks of SRAM will dissipate more power than fetching from one full-size bank, as it causes more switching in the sense amps and data steering logic.

Match generation

I[13:3] even bank
fetches line
odd bank
fetches line
100 100 101
101 110 101
110 110 111

Referring to the adjacent diagram, the even bank will fetch line 110 when I[13:3]==101 or I[13:3]==110. The odd bank will fetch line 101 when I[13:3]==100 or I[13:3]==101.

In general, the odd SRAM bank should fetch line Lo==2N+1 when either I[13:3]==2N or I[13:3]==2N+1. The two conditions can be written as:

I[13:3] = Lo-1 =>  R[13:3] + O[13:3] + ~Lo+1 = 11..11
               =>  R[13:3] + O[13:3] + ~Lo   = 11..10
I[13:3] = Lo   =>  R[13:3] + O[13:3] + ~Lo   = 11..11

Ignore the last digit of the compare: (S+C)[13:4]==11..1

Similarly, the even SRAM bank fetches line Le==2N when either I[13:3]==2N or I[13:3]==2N-1. The conditions are written as follows, and once again ignore the last digit of the compare.

I[13:3] = Le-1 =>  R[13:3] + O[13:3] + ~Le = 11..10
I[13:3] = Le   =>  R[13:3] + O[13:3] + ~Le = 11..11

Gate-level implementation

    R13 ... R6  R5  R4  R3
    O13 ... O6  O5  O4  O3
    L13 ... L6  L5  L4  L3
--------------------------
    S13 ... S6  S5  S4  S3
C14 C13 ... C6  C5  C4

Before collapsing redundancy between rows, review:

Each row of each decoder for each of two banks implements a set of full adders which reduce the three numbers to be added (R[13:3], O[13:3], and L) to two numbers (S[14:4] and C[13:3]). The LSB (==S[3]) is discarded. Carry out (==C[14]) is also discarded. The row matches if S[13:4] == ~C[13:4], which is &( xor(S[13:4], C[13:4])).

It is possible to partially specialize the full adders to 2-input AND, OR, XOR, and XNOR because the L input is constant. The resulting expressions are common to all lines of the decoder and can be collected at the bottom.

S0;i   = S(Ri, Oi, 0) = Ri xor Oi
S1;i   = S(Ri, Oi, 1) = Ri xnor Oi
C0;i+1 = C(Ri, Oi, 0) = Ri and Oi
C1;i+1 = C(Ri, Oi, 1) = Ri or Oi.

At each digit position, there are only two possible Si, two possible Ci, and four possible XORs between them:

Li=0 and Li-1=0: X0;0;i = S0;i xor C0;i = Ri xor Oi xor (Ri-1 and Oi-1)
Li=0 and Li-1=1: X0;1;i = S0;i xor C1;i = Ri xor Oi xor (Ri-1 or Oi-1)
Li=1 and Li-1=0: X1;0;i = S1;i xor C0;i = Ri xnor Oi xor (Ri-1 and Oi-1) = !X0;0;i
Li=1 and Li-1=1: X1;1;i = S1;i xor C1;i = Ri xnor Oi xor (Ri-1 or Oi-1)  = !X0;1;i

One possible decoder for the example might calculate these four expressions for each of the bits 4..13, and drive all 40 wires up the decoder. Each line of the decoder would select one of the four wires for each bit, and consist of a 10-input AND.

What has been saved?

A simpler data cache path would have an adder followed by a traditional decoder. For our example cache subsystem, the critical path would be a 14-bit adder, producing true and complement values, followed by an 11-bit AND gate for each row of the decoder.

In the sum-addressed design, the final AND gate in the decoder remains, although 10 bits wide instead of 11. The adder has been replaced by a four input logical expression at each bit. The latency savings comes from the speed difference between the adder and that four input expression, a savings of perhaps three simple CMOS gates.

If the reader feels that this was an inordinate amount of brain-twisting work for a three gate improvement in a multi-cycle critical path, then the reader has a better appreciation for the level to which modern CPUs are optimized.

Further optimizations: predecode

Many decoder designs avoid high-fan-in AND gates in the decode line itself by employing a predecode stage. For instance, an 11-bit decoder might be predecoded into three groups of 4, 4, and 3 bits each. Each 3-bit group would drive 8 wires up the main decode array, each 4-bit group would drive 16 wires. The decoder line then becomes a 3-input AND gate. This reorganization can save significant implementation area and some power.

This same reorganization can be applied to the sum-addressed decoder. Each bit in the non-predecoded formulation above can be viewed as a local two-bit add. With predecoding, each predecode group is a local three, four, or even five-bit add, with the predecode groups overlapping by one bit.

Predecoding generally increases the number of wires traversing the decoder, and sum-addressed decoders generally have about twice as many wires as the equivalent simple decoder. These wires can be the limiting factor on the amount of feasible predecoding.

References

  • Paul Demone has an explanation of sum-addressed caches in a realworldtech article.
  • Heald et al.[1] have a paper in ISSCC 1998 that explains what may be the original sum-addressed cache in the Ultrasparc III.
  • Sum-addressed memory is described in

United States patent 5,754,819, May 19, 1998, Low-latency memory indexing method and structure. Inventors: Lynch; William L. (Palo Alto, CA), Lauterbach; Gary R. (Los Altos, CA); Assignee: Sun Microsystems, Inc. (Mountain View, CA), Filed: July 28, 1994

  • At least one of the inventors named on a patent related to carry-free address decoding credits the following publication:

Evaluation of A + B = K Conditions without Carry Propagation (1992) Jordi Cortadella, Jose M. Llaberia IEEE Transactions on Computers, [1] [2]

  • The following patent extends this work, to use redundant form arithmetic throughout the processor, and so avoid carry propagation overhead even in ALU operations, or when an ALU operation is bypassed into a memory address:

United States Patent 5,619,664, Processor with architecture for improved pipelining of arithmetic instructions by forwarding redundant intermediate data forms, awarded April 18, 1997, Inventor: Glew; Andrew F. (Hillsboro, OR); Assignee: Intel Corporation (Santa Clara, CA), Appl. No.: 08/402,322, Filed: March 10, 1995

  1. ^ Heald, R.; et al. (1998). "64 kB Sum-Addressed-Memory Cache with 1.6 ns Cycle and 2.6 ns Latency". ISSCC Digest of Technical Papers. pp. 350–351. doi:10.1109/ISSCC.1998.672519.

Read other articles:

П-50Т, ОФАБ-100-120, ОФАБ-250-270 - МАКС-2009 OFAB -100-120 merupakan bom kecil yang dapat dibawa pada Sukhoi Su-17, Sukhoi Su-25, MiG-29, Su-27, Sukhoi Su-30 dan berbagai pesawat lainnya.[1][2][3] Bom ini dirancang untuk menyerang material lapis baja ringan dan fasilitas industri militer, serta tenaga kerja. Ia dijatuhkan dari ketinggian 500 hingga 15.000 m dengan kecepatan 500 hingga 1.150 km/jam. Bom pesawat ini efektif terhadap personel di medan terbu...

 

Katedral Ortodoks Koptik St. Markus di Alexandria, Egypt. Katolikos, dalam bentuk jamaknya, Katolikoi, adalah sebuah gelar yang diberikan kepada kepala gereja-gereja tertentu di beberapa tradisi Kekristenan Timur.[1] Gelar tersebut menunjukkan status otosefalus dan dalam beberapa kasus diberikan kepada kepala Gereja otonom yang ditunjuk, di mana pemegang gelar ini mungkin juga memiliki gelar lainnya seperti Patriark.[1] Dalam kasus lainnya seorang katolikos mengepalai suatu Ge...

 

هذه المقالة يتيمة إذ تصل إليها مقالات أخرى قليلة جدًا. فضلًا، ساعد بإضافة وصلة إليها في مقالات متعلقة بها. (أغسطس 2019) عبدالله بن صنيتان آل سعود عبدالله بن سعود بن عبدالله بن إبراهيم آل سعود معلومات شخصية الميلاد 1290الرياض الوفاة 1325الطرفية سبب الوفاة قتل في معركة الطرفية الإق...

American college football season 2018 Boston College Eagles footballFirst Responder Bowl, no contest vs. Boise StateConferenceAtlantic Coast ConferenceDivisionAtlantic DivisionRecord7–5 (4–4 ACC)Head coachSteve Addazio (6th season)Offensive coordinatorScot Loeffler (3rd season)Co-defensive coordinatorJim Reid (4th season)Co-defensive coordinatorAnthony Campanile (1st season)CaptainJon Baker, Will HarrisHome stadiumAlumni StadiumUniformSeasons← 20172019&...

 

Weiler-Simmerberg Balai kota Weiler-Simmerberg Lambang kebesaranLetak Weiler-Simmerberg di Lindau NegaraJermanNegara bagianBayernWilayahSchwabenKreisLindauPemerintahan • MayorKarl-Heinz RudolphLuas • Total31,30 km2 (1,210 sq mi)Ketinggian tertinggi900 m (3,000 ft)Ketinggian terendah630 m (2,070 ft)Populasi (2013-12-31)[1] • Total6.115 • Kepadatan2,0/km2 (5,1/sq mi)Zona waktuWET/WMPET...

 

† Человек прямоходящий Научная классификация Домен:ЭукариотыЦарство:ЖивотныеПодцарство:ЭуметазоиБез ранга:Двусторонне-симметричныеБез ранга:ВторичноротыеТип:ХордовыеПодтип:ПозвоночныеИнфратип:ЧелюстноротыеНадкласс:ЧетвероногиеКлада:АмниотыКлада:Синапсиды�...

Bangladeshi historian, writer and academic (born 1939) For other people with similar names, see Sirajul Islam (disambiguation). This biography of a living person needs additional citations for verification. Please help by adding reliable sources. Contentious material about living persons that is unsourced or poorly sourced must be removed immediately from the article and its talk page, especially if potentially libelous.Find sources: Sirajul Islam – news · newspapers...

 

Enrico I d'Inghilterradetto Beauclerc (il Chierico)Enrico I Beauclerc in una miniatura della Historia Anglorum di Matthew Paris, 1250 - 1259Re d'InghilterraStemma In carica3 agosto 1100 –1º dicembre 1135 Incoronazione5 agosto 1100 PredecessoreGuglielmo II SuccessoreStefano (de facto)Matilde (de iure) Duca di NormandiaIn carica1106 –1º dicembre 1135 PredecessoreRoberto II SuccessoreStefano di Blois NascitaSelby, 1068 MorteLyons-la-Forêt, 1º dicembre 1135 SepolturaAbbazi...

 

Portion regarding requirements for ratification For the closing endorsement following Article Seven, see Signing of the United States Constitution This article is part of a series on theConstitutionof the United States Preamble and Articles Preamble I II III IV V VI VII Amendments to the Constitution I II III IV V VI VII VIII IX X XI XII XIII XIV XV XVI XVII XVIII XIX XX XXI XXII XXIII XXIV XXV XXVI XXVII Unratified Amendments: Congressional Apportionment Titles of Nobility Corwin Child Labor...

1983 American film directed by Robert Ellis Miller This article is about the 1983 film. For the opera by Marc Blitzstein, see Reuben, Reuben (opera). For the old song, see Reuben and Rachel. Reuben, ReubenReuben, Reuben film posterDirected byRobert Ellis MillerWritten byPeter De Vries (novel)Julius J. EpsteinHerman Shumlin (Spofford play)Produced byJulius J. EpsteinWalter ShensonStarringTom ContiKelly McGillisCinematographyPeter SteinEdited bySkip LuskMusic byBilly GoldenbergProductioncompani...

 

Voce principale: Nuova Cosenza Calcio. Associazione Sportiva CosenzaStagione 1947-1948Sport calcio Squadra Cosenza Allenatore Atilio Demaría Presidente Adolfo Quintieri Serie B10º posto nel girone C. Retrocesso in Serie C. Maggiori presenzeCampionato: Crola, Del Frati, Manni (34) Miglior marcatoreCampionato: Loschi (12) 1946-1947 1948-1949 Si invita a seguire il modello di voce Questa voce raccoglie le informazioni riguardanti l'Associazione Sportiva Cosenza nelle competizioni ufficia...

 

此条目序言章节没有充分总结全文内容要点。 (2019年3月21日)请考虑扩充序言,清晰概述条目所有重點。请在条目的讨论页讨论此问题。 哈萨克斯坦總統哈薩克總統旗現任Қасым-Жомарт Кемелұлы Тоқаев卡瑟姆若马尔特·托卡耶夫自2019年3月20日在任任期7年首任努尔苏丹·纳扎尔巴耶夫设立1990年4月24日(哈薩克蘇維埃社會主義共和國總統) 哈萨克斯坦 哈萨克斯坦政府...

European sports league This article is about the Austrian league. For other leagues, see List of ice hockey leagues. Erste Bank Eishockey Liga redirects here. Not to be confused with Erste Liga (ice hockey). ICE Hockey LeagueCurrent season, competition or edition: 2023–24 ICE Hockey League seasonSportIce hockeyFounded1923; 101 years ago (1923)CEOKarl Safron[1]No. of teams13CountryAustria (8 teams)Italy (3 teams)Hungary (1 team)Slovenia...

 

Welsh footballer Joe Jacobson Jacobson playing for Accrington Stanley in 2011Personal informationFull name Joseph Mark Jacobson[1]Date of birth (1986-11-17) 17 November 1986 (age 37)[2]Place of birth Cardiff, WalesHeight 5 ft 11 in (1.80 m)[2]Position(s) DefenderTeam informationCurrent team Wycombe WanderersNumber 3Youth career000?–2005 Cardiff CitySenior career*Years Team Apps (Gls)2005–2007 Cardiff City 1 (0)2006 → Accrington Stanley (loan )...

 

British royal house Windsors redirects here. For other uses, see Windsor. House of WindsorBadge of the House of Windsor, featuring the Round Tower of Windsor CastleParent houseHouse of Saxe-Coburg and Gotha[a][b]CountryUnited Kingdom and other Commonwealth realmsFounded17 July 1917; 106 years ago (1917-07-17)FounderGeorge VCurrent headCharles IIIMembersList^ a: The children and male-line descendants of Queen Elizabeth II and Prince Philip genealogic...

Queen of Egypt from 51 to 30 BC For other uses, see Cleopatra (disambiguation). CleopatraThe Berlin Cleopatra, a Roman sculpture of Cleopatra wearing a royal diadem, mid-1st century BC, now in the Altes Museum, Germany[1][2][3][note 1]PharaohQueen of the Ptolemaic KingdomReign51–30 BC (21 years)[4]Coregency See list Ptolemy XII (until 51 BC)Ptolemy XIII (51–47 BC)Ptolemy XIV (47–44 BC)Ptolemy XV (44–30 BC) PredecessorPtolemy XII AuletesSucc...

 

Multi-purpose stadium in Sydney, New South Wales, Australia Telstra Stadium redirects here. For the stadium in Melbourne previously known as the Telstra Dome, see Docklands Stadium. ANZ Stadium redirects here. For the stadium in Brisbane previously known as ANZ Stadium, see Queensland Sport and Athletics Centre. For the stadium in Fiji, see ANZ National Stadium. For the sports arena in Paris, see Accor Arena. Stadium AustraliaOlympic Stadium Homebush Stadium Sydney Olympic StadiumThe stadium ...

 

Daniel R. Hokanson Daniel R. Hokanson (lahir 27 Juni 1963) adalah seorang jenderal bintang empat dalam Angkatan Darat Amerika Serikat yang sekarang menjabat sebagai kepala Biro Garda Nasional ke-29. Sebelum itu, ia menjabat sebagai direktur Garda Nasional Angkatan Darat ke-21. Pranala luar Wikimedia Commons memiliki media mengenai Daniel R. Hokanson. Biography, Daniel R. Hokanson at National Guard Bureau General Officer Management Office Biography, Daniel R. Hokanson at United States Northern...

Extinct order of trilobites ProetidaTemporal range: Early Ordovician – End Permian PreꞒ Ꞓ O S D C P T J K Pg N Aulacopleura konincki, Kosov u Berouna, Czech Republic Scientific classification Domain: Eukaryota Kingdom: Animalia Phylum: Arthropoda Class: †Trilobita Order: †ProetidaFortey & Owens, 1975 Superfamilies See text Proetida is an order of trilobite that lived from the Ordovician to the Permian. It was the last surviving order of trilobite, dying out in the Permian-Triass...

 

1997 film by Robert Young and Fred Schepisi This article is about the comedy film. For the episode of the television series Zoboomafoo, see List of Zoboomafoo episodes § Season 1 (1999–2000). Fierce CreaturesTheatrical release posterDirected by Robert Young Fred Schepisi Written by John Cleese Iain Johnstone Produced by Michael Shamberg John Cleese Starring John Cleese Jamie Lee Curtis Kevin Kline Michael Palin Ronnie Corbett Carey Lowell Robert Lindsay Cinematography Ian Baker Adrian...