Fermi (microarchitecture)

Nvidia Fermi
The NVIDIA GeForce GTX 590, part of the GeForce 500 series of graphics cards, was the final major iteration featuring the Fermi microarchitecture (GF110-351-A1).
Release date: April 2010
Manufactured by: TSMC
Designed by: Nvidia
Fabrication process: 40 nm and 28 nm[citation needed]
History
Predecessor: Tesla
Successor: Kepler
Support status: Unsupported
Photo of Enrico Fermi, eponym of architecture

Fermi is the codename for a graphics processing unit (GPU) microarchitecture developed by Nvidia, first released to retail in April 2010 as the successor to the Tesla microarchitecture. It was the primary microarchitecture used in the GeForce 400 series and 500 series. All desktop Fermi GPUs were manufactured on a 40 nm process; mobile Fermi GPUs were manufactured on 40 nm and 28 nm processes[citation needed]. Fermi is the oldest Nvidia microarchitecture to receive support for Microsoft's rendering API Direct3D 12, at feature level 11.

Fermi was followed by Kepler and was used alongside it in the GeForce 600 series, GeForce 700 series, and GeForce 800 series; in the latter two, it appeared only in mobile GPUs.

In the workstation market, Fermi found use in the Quadro x000 series, Quadro NVS models, and in Nvidia Tesla computing modules.

The architecture is named after Enrico Fermi, an Italian physicist.

Overview

NVIDIA GeForce GTX 480 of the GeForce 400-line of graphics-cards; the first iteration to feature the Fermi micro-architecture (GF100-375-A3).
Fig. 1. NVIDIA Fermi architecture
Convention in figures: orange - scheduling and dispatch; green - execution; light blue - registers and caches.
Die shot of the GF100 GPU found inside GeForce GTX 470 cards

Fermi graphics processing units (GPUs) feature 3.0 billion transistors; a schematic is shown in Fig. 1. The main components and headline specifications are:

  • Streaming Multiprocessor (SM): composed of 32 CUDA cores (see Streaming Multiprocessor and CUDA core sections).
  • GigaThread global scheduler: distributes thread blocks to SM thread schedulers and manages the context switches between threads during execution (see Warp Scheduling section).
  • Host interface: connects the GPU to the CPU via a PCI-Express v2 bus (peak transfer rate of 8 GB/s).
  • DRAM: supports up to 6 GB of GDDR5 DRAM thanks to the 64-bit addressing capability (see Memory Architecture section).
  • Clock frequency: 1.5 GHz (not released by NVIDIA, but estimated by Insight 64).
  • Peak performance: 1.5 TFlops.
  • Global memory clock: 2 GHz.
  • DRAM bandwidth: 192 GB/s.
  • H.264 FHD decode support.
  • H.265 FHD decode support (GT 730 only).[1]
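Two of the figures in the list above can be cross-checked with simple arithmetic. The sketch below is illustrative only: it assumes a 16-lane PCI Express 2.0 link (5 GT/s per lane with 8b/10b encoding) and, for DRAM, a 384-bit memory interface that transfers data twice per memory clock. The lane count, bus width, and transfers-per-clock values are assumptions not stated in the list; they are chosen because they reproduce the quoted 8 GB/s and 192 GB/s.

```python
def pcie2_bandwidth_gb_s(lanes=16):
    """Peak one-way PCIe 2.0 bandwidth in GB/s.

    The lane count is an assumption; 5 GT/s per lane and 8b/10b
    encoding are properties of PCI Express 2.0.
    """
    raw_gt_per_s = 5.0                       # signalling rate per lane
    payload_gbit_per_s = raw_gt_per_s * 8 / 10   # 8 payload bits per 10 line bits
    return payload_gbit_per_s / 8 * lanes    # bits -> bytes, then all lanes


def dram_bandwidth_gb_s(memory_clock_ghz=2.0,
                        transfers_per_clock=2,   # assumption
                        bus_width_bits=384):     # assumption
    """Peak DRAM bandwidth in GB/s for the assumed bus configuration."""
    return memory_clock_ghz * transfers_per_clock * bus_width_bits / 8


pcie = pcie2_bandwidth_gb_s()
dram = dram_bandwidth_gb_s()
print(f"PCIe 2.0 x16: {pcie:.0f} GB/s, DRAM: {dram:.0f} GB/s")
```

Under these assumptions the host link is more than an order of magnitude slower than on-board DRAM, which is why keeping data resident on the GPU matters.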

Streaming multiprocessor

Each SM features 32 single-precision CUDA cores, 16 load/store units, four Special Function Units (SFUs), a 64 KB block of high-speed on-chip memory (see L1+Shared Memory subsection), and an interface to the L2 cache (see L2 Cache subsection).

Load/Store Units

The load/store units allow source and destination addresses to be calculated for 16 threads per clock, and load and store data from and to cache or DRAM.

Special Function Units (SFUs)

The SFUs execute transcendental instructions such as sine, cosine, reciprocal, and square root. Each SFU executes one instruction per thread, per clock; a warp executes over eight clocks. The SFU pipeline is decoupled from the dispatch unit, allowing the dispatch unit to issue to other execution units while the SFU is occupied.

CUDA core

Integer Arithmetic Logic Unit (ALU)

Supports full 32-bit precision for all instructions, consistent with standard programming language requirements.[which?] It is also optimized to efficiently support 64-bit operations in workstation and server models, but this capability is artificially limited in consumer versions.

Floating Point Unit (FPU)

Implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single and double precision arithmetic. Up to 16 double precision fused multiply-add operations can be performed per SM, per clock.[2]

Fused multiply-add

Fused multiply-add (FMA) performs a multiplication and an addition (i.e., A*B+C) with a single final rounding step, so no precision is lost in the addition. FMA is more accurate than performing the two operations separately.
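The effect of the single rounding step can be illustrated on a CPU. The sketch below is not GPU code: it emulates a fused multiply-add in double precision by computing A*B+C exactly with rational arithmetic and rounding once, then compares that with the ordinary two-step computation. The function names and the input values are chosen for illustration only.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Emulate FMA: evaluate a*b+c exactly, then round once to a double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


def separate_multiply_add(a, b, c):
    """Multiply and round, then add and round again (two roundings)."""
    return a * b + c


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# The exact product is 1 - 2**-60, which is not representable as a double.
fused = fused_multiply_add(a, b, c)       # keeps the tiny residual
unfused = separate_multiply_add(a, b, c)  # product rounds to 1.0, residual lost

print(fused, unfused)
```

Here the separate computation returns zero because the product is rounded to 1.0 before the addition, while the emulated fused operation preserves the residual of magnitude 2^-60.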

Warp scheduling

The Fermi architecture uses a two-level, distributed thread scheduler.

Each SM can issue instructions consuming any two of the four green execution columns shown in the schematic in Fig. 1. For example, the SM can mix 16 operations from the 16 cores of the first column with 16 operations from the 16 cores of the second column, or 16 operations from the load/store units with four from the SFUs, or any other combination the program specifies.

64-bit floating-point operations require both of the first two execution columns, so they run at half the speed of 32-bit operations.

Dual Warp Scheduler

At the SM level, each warp scheduler distributes warps of 32 threads to its execution units. Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. The dual warp scheduler selects two warps, and issues one instruction from each warp to a group of 16 cores, 16 load/store units, or 4 SFUs. Most instructions can be dual issued; two integer instructions, two floating instructions, or a mix of integer, floating point, load, store, and SFU instructions can be issued concurrently. Double precision instructions do not support dual dispatch with any other operation.[citation needed]
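The unit counts above imply how many clocks one warp instruction occupies a given execution group. The following is a minimal sketch, assuming each unit in a group handles one thread per clock; the dictionary keys and function name are hypothetical labels, not Nvidia terminology.

```python
WARP_SIZE = 32  # threads per warp

# Units available to one issued warp instruction, per the text above.
UNITS_PER_GROUP = {
    "cuda_cores": 16,
    "load_store": 16,
    "sfu": 4,
}


def clocks_per_warp(group):
    """Clocks needed to push all 32 threads of a warp through one group,
    assuming one thread per unit per clock (a simplification)."""
    return WARP_SIZE // UNITS_PER_GROUP[group]


for name in UNITS_PER_GROUP:
    print(f"{name}: {clocks_per_warp(name)} clocks per warp instruction")
```

Under this simplification a warp needs two clocks on a group of 16 cores or on the load/store units and eight clocks on the SFUs, which agrees with the eight-clock figure given for the SFUs.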

Performance

The theoretical single-precision processing power of a Fermi GPU in GFLOPS is computed as 2 (operations per FMA instruction per CUDA core per cycle) × number of CUDA cores × shader clock speed (in GHz). Note that the previous-generation Tesla could dual-issue MAD+MUL to CUDA cores and SFUs in parallel, but Fermi lost this ability: it can issue only 32 instructions per cycle per SM, which is just enough to keep its 32 CUDA cores fully utilized.[3] Therefore, it is not possible to leverage the SFUs to reach more than 2 operations per CUDA core per cycle.

The theoretical double-precision processing power of a Fermi GPU is 1/2 of the single-precision performance on GF100/110. In practice, however, this double-precision throughput is available only on professional Quadro and Tesla cards, while consumer GeForce cards are capped at 1/8 of single-precision performance.[4]
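The formulas in this section can be written out as a short calculation. The sketch below uses figures given elsewhere in this article (16 SMs with 32 CUDA cores each and an estimated 1.5 GHz shader clock); actual shipping cards differ in core count and clock, and the function names are illustrative.

```python
def single_precision_gflops(cuda_cores, shader_clock_ghz):
    """Peak FP32 rate: 2 operations (one FMA) per core per cycle."""
    return 2 * cuda_cores * shader_clock_ghz


def double_precision_gflops(sp_gflops, consumer_card):
    """Peak FP64 rate on GF100/GF110: 1/2 of FP32 on professional cards,
    capped at 1/8 of FP32 on consumer GeForce cards."""
    ratio = 1 / 8 if consumer_card else 1 / 2
    return sp_gflops * ratio


cores = 16 * 32                      # 16 SMs x 32 CUDA cores per SM
sp = single_precision_gflops(cores, 1.5)
dp_pro = double_precision_gflops(sp, consumer_card=False)
dp_geforce = double_precision_gflops(sp, consumer_card=True)

print(f"FP32: {sp:.0f} GFLOPS")
print(f"FP64 (Quadro/Tesla): {dp_pro:.0f} GFLOPS")
print(f"FP64 (GeForce cap): {dp_geforce:.0f} GFLOPS")
```

With these inputs the single-precision result is 1536 GFLOPS, consistent with the peak of about 1.5 TFlops quoted in the overview.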

Memory

Fermi provides an L1 cache per SM and a unified L2 cache that services all operations (load, store and texture).

Registers

Each SM has 32K (32,768) 32-bit registers. Each thread has access to its own registers and not to those of other threads. A CUDA kernel can use at most 63 registers per thread. The number of registers available per thread degrades gracefully from 63 to 21 as the number of resident threads (and hence the resource requirement) increases. Registers have very high bandwidth: about 8,000 GB/s.
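The range from 63 down to 21 registers follows from dividing the register file among the resident threads. The sketch below is a simplification: it ignores register allocation granularity, and the figure of 1,536 resident threads per SM used in the example is an assumption about the hardware limit that is not stated in this section.

```python
REGISTERS_PER_SM = 32 * 1024       # 32K 32-bit registers per SM
MAX_REGISTERS_PER_THREAD = 63      # per-thread cap for a CUDA kernel


def registers_per_thread(resident_threads):
    """Registers each thread can use when the register file is split
    evenly among the resident threads (allocation granularity ignored)."""
    if resident_threads <= 0:
        raise ValueError("resident_threads must be positive")
    return min(MAX_REGISTERS_PER_THREAD, REGISTERS_PER_SM // resident_threads)


# 1536 is an assumed maximum number of resident threads per SM.
for threads in (512, 1024, 1536):
    print(threads, "threads ->", registers_per_thread(threads), "registers each")
```

With few resident threads the per-thread cap of 63 applies; at the assumed maximum occupancy each thread is left with 21 registers, and any further demand spills to local memory.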

L1+Shared Memory

On-chip memory that can be used to cache data for individual threads (register spilling/L1 cache), to share data among several threads (shared memory), or both. This 64 KB memory can be configured as either 48 KB of shared memory with 16 KB of L1 cache, or 16 KB of shared memory with 48 KB of L1 cache. Shared memory is accessible by the threads in the same thread block; it enables those threads to cooperate, facilitates extensive reuse of on-chip data, and greatly reduces off-chip traffic. It provides low-latency access (10-20 cycles) and very high bandwidth (1,600 GB/s) to moderate amounts of data (such as intermediate results in a series of calculations, one row or column of data for matrix operations, or a line of video). David Patterson notes that this shared memory uses the idea of a local scratchpad.[5]

Local Memory

Local memory is a memory location used to hold "spilled" registers. Register spilling occurs when a thread block requires more register storage than is available on an SM. Local memory is used only for some automatic variables (those declared in the device code without any of the __device__, __shared__, or __constant__ qualifiers). Generally, an automatic variable resides in a register, with the following exceptions: (1) arrays that the compiler cannot determine are indexed with constant quantities; (2) large structures or arrays that would consume too much register space; (3) any variable the compiler decides to spill to local memory when a kernel uses more registers than are available on the SM.

L2 Cache

A 768 KB unified L2 cache, shared among the 16 SMs, services all loads and stores from and to global memory, including copies to and from the CPU host, as well as texture requests. The L2 cache subsystem also implements atomic operations, used for managing access to data that must be shared across thread blocks or even kernels.

Global memory

Global memory (VRAM) is accessible directly by all threads, as well as by the host system over the PCIe bus. It has a high latency of 400-800 cycles.[citation needed]

Video decompression/compression

See Nvidia NVDEC (formerly called NVCUVID) as well as Nvidia PureVideo.

The Nvidia NVENC technology was not yet available on Fermi; it was introduced with its successor, Kepler.

Fermi chips

  • GF100
  • GF104
  • GF106
  • GF108
  • GF110
  • GF114
  • GF116
  • GF117
  • GF119

References

  1. ^ "NVIDIA GPU Decoder Device Information".
  2. ^ "NVIDIA's Next Generation CUDA Compute Architecture: Fermi" (PDF). 2009. Retrieved December 7, 2015.
  3. ^ Glaskowsky, Peter N. (September 2009). "NVIDIA's Fermi: The First Complete GPU Computing Architecture" (PDF). p. 22. Retrieved December 6, 2015. A total of 32 instructions from one or two warps can be dispatched in each cycle to any two of the four execution blocks within a Fermi SM
  4. ^ Smith, Ryan (March 26, 2010). "NVIDIA's GeForce GTX 480 and GTX 470: 6 Months Late, Was It Worth the Wait?". AnandTech. p. 6. Retrieved December 6, 2015. the GTX 400 series' FP64 performance is capped at 1/8th (12.5%) of its FP32 performance, as opposed to what the hardware natively can do of 1/2 (50%) FP32
  5. ^ Patterson, David (September 30, 2009). "The Top 10 Innovations in the New NVIDIA Fermi Architecture, and the Top 3 Next Challenges" (PDF). Parallel Computing Research Laboratory & NVIDIA. Retrieved October 3, 2013.