Thursday, April 9, 2015

Assessing Trends in Performance per Watt for Signal Processing Applications

Now that my latest paper, Assessing Trends in Performance per Watt for Signal Processing Applications, has been accepted, I can actually talk about what was important about it. This paper is about power, so here are the important parts:
  • Processing GMAC/W calculations are approaching a Powerwall
  • Memory Power halves only every 29 years, which makes it effectively a fixed power requirement
  • Non-data flow: Cache is king
The summary is that data-flow processing will have stagnant power improvements from scaling, memory transfers are of fixed power, and cache is really not helping as much as you think when it comes to relative performance per Watt.

The GMAC, everybody does it.

The Multiply ACcumulate (MAC) is a mathematical calculation that computers spend a good deal of their time doing. It's used heavily in natural signal processing, such as audio, and video cards have GPUs that are designed to do these calculations quickly. Two names in semiconductor marketing have "Laws" attached to them that are not actual laws but observed trends: Moore and Koomey [1]. Moore said that more transistors will fit on a die, and Koomey did a survey that showed that power consumption is still scaling. The assumption is that as transistors have scaled, their relative behavior has scaled with them, which is not true. Another item that comes into play on the marketing side is how you measure power. Most processors offload some power requirements to RAM, whose sense amps burn current just to present an effective input capacitance, which makes the CPU power lower at the cost of increased RAM power. If you really want to get a good feel for the power requirements of a system, the Giga-MAC per Watt (GMAC/W) is a great calculation.
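To make the metric concrete, here is a minimal Python sketch of a MAC and of a GMAC/W figure. It assumes GMAC/W is peak MAC throughput (MACs per cycle times clock rate, in units of 1e9 MAC/s) divided by power; the paper may use a different power figure (for example measured power rather than a datasheet number), and the part in the example is hypothetical.

```python
def mac(acc, a, b):
    """One multiply-accumulate step: acc + a * b."""
    return acc + a * b


def dot(xs, ws):
    """A dot product (for example one FIR filter output) is a chain of MACs."""
    acc = 0.0
    for x, w in zip(xs, ws):
        acc = mac(acc, x, w)
    return acc


def gmac_per_watt(macs_per_cycle, clock_hz, power_watts):
    """Peak MAC throughput in units of 1e9 MAC/s, divided by power in Watts.

    Assumption: peak throughput and a single power number describe the part.
    """
    return macs_per_cycle * clock_hz / 1e9 / power_watts


# Hypothetical part: 8 MACs per cycle at 1 GHz, drawing 10 W.
print(dot([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))  # 3.0
print(gmac_per_watt(8, 1.0e9, 10.0))          # 0.8
```

The three-tap dot product costs three MACs, so a part rated at 0.8 GMAC/W could in principle sustain 0.8e9 such taps per second for each Watt it draws.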

In the graph above, I present Koomey's system trend and then the trend specific to GMAC/W. Koomey included systems until 2006 but did not focus specifically on data-flow applications. If you are trying to do things like audio, video games, watching the sky, processing speech, RADAR, and just about any natural system, you are burdened by how much data you can move through the system, and by how to get the heat out and the data and power in. I am confident that for digital approaches there exists a Powerwall somewhere between 10 and 30 GMAC/W, i.e., the end of useful scaling for power-constrained systems.
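For readers who want to pull a trend out of the appendix data themselves, the sketch below fits a straight line to log2(GMAC/W) against year and reports the implied doubling time. The seven rows are copied from the appendix table, but picking only desktop-class Intel parts is my own choice for illustration, so this does not reproduce the fit in the paper or the graph.

```python
import math

# (year, GMAC/W) rows copied from the appendix table. The selection of rows
# is an assumption made for this example, not the paper's data selection.
ROWS = [
    (1985, 0.000002),  # IBM PC-AT
    (1991, 0.000065),  # INTEL 486
    (1998, 0.003043),  # Pentium III 600B
    (2003, 0.006080),  # Pentium 4 ee HT 3.2
    (2006, 0.027375),  # Core 2 Extreme X6800
    (2010, 0.107682),  # Core i7 980 ee
    (2013, 0.116308),  # Core i7-4960X
]


def fit_log2_trend(rows):
    """Ordinary least squares of log2(GMAC/W) against year.

    Returns (slope, intercept); the slope is in doublings per year.
    """
    years = [float(year) for year, _ in rows]
    logs = [math.log2(value) for _, value in rows]
    n = len(rows)
    mean_year = sum(years) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_year) ** 2 for x in years)
    sxy = sum((x - mean_year) * (y - mean_log) for x, y in zip(years, logs))
    slope = sxy / sxx
    return slope, mean_log - slope * mean_year


slope, intercept = fit_log2_trend(ROWS)
doubling_years = 1.0 / slope
print(f"doubling time: {doubling_years:.2f} years")
```

A single straight line hides any flattening at the end of the series, so a fairer check of the Powerwall claim would fit early and late rows separately and compare the slopes.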

When I first started looking at this with Dr. Marr, it was because of a program called DARPA ACT. The question was whether the specs were actually achievable with classical digital approaches, and I happen to play in the space of non-classical digital and analog processing. The idea of ACT is to build a phased array system where the amount of bandwidth and the number of beams that can be formed scale per unit Watt. This would allow powerful phased array technology to proliferate in many types of devices, not just expensive military radars. Also, the cloud computing revolution depends, in part, on Koomey's law continuing; if we are to cloud compute with mobile devices, we must be able to transmit, receive, and process wireless data at ever smaller energy levels, assuming fixed or slowly growing energy density in batteries, which so far has been the case.

Conclusion: DARPA ACT will not hit their power target with classical digital approaches.
I have not yet seen the final reports, but I hope the teams prove me wrong. I doubt it.

Memory Power is stagnant?

Memory transfer power halves every 29 years, which makes memory power a fixed commodity calculation for data-flow systems. If you have data that is constantly changing, you are burdened by the speed and power of the data bus.
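To put the 29-year figure in perspective, this short sketch converts a halving period into a yearly improvement and into what is left after a decade. The only input taken from the post is the 29-year halving period; the 2-year period used for comparison is a hypothetical faster pace, not a number from the paper.

```python
HALVING_YEARS_MEMORY = 29.0  # figure quoted in this post
HALVING_YEARS_FAST = 2.0     # hypothetical faster pace, for comparison only


def remaining_fraction(years, halving_years):
    """Fraction of the original power per transfer left after `years`."""
    return 0.5 ** (years / halving_years)


annual_reduction = 1.0 - remaining_fraction(1.0, HALVING_YEARS_MEMORY)
print(f"{annual_reduction:.1%} per year")                               # 2.4% per year
print(f"{remaining_fraction(10.0, HALVING_YEARS_MEMORY):.2f} left after 10 years")  # 0.79
print(f"{remaining_fraction(10.0, HALVING_YEARS_FAST):.2f} left after 10 years")    # 0.03
```

An improvement of roughly 2.4% per year means a design that is limited by memory transfers today will still be spending about 79% of that power a decade from now.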

When it comes to memory, it really depends on architecture. As far as improvements, it's life in the slow lane.

What about measured performance in a non-dataflow system?

In a non-dataflow system, cache is king. If you cannot get data in faster, you need to keep it local and handy. I looked into what SPECint2006 had to say about this using the Stanford CPUDB. If data transfer power is constrained by IO, you need to have a lot of cache to minimize non-data instructions coming into the system, but what about power? Using the data in the CPUDB, I plotted SPECint2006 basemean score per Watt, and then normalized it for processor cache.
I believe that these improvements in SPECint2006 score are based upon increases in processor cache because the graph shows almost no trend when the cache is normalized across the processors. As data-flow applications are not affected by substantial increases in cache size, I believe that there is little improvement between years due to scaling regarding score per Watt.
However, it does make a good argument for simpler code. When you consider how large programs have gotten relative to cache, the Computer Science community should consider trimming the fat a bit.
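The normalization can be sketched in a few lines of Python. I am assuming here that "normalized for cache" means dividing score per Watt by cache size in MB, which is one plausible reading rather than a confirmed description of the plot; the two processors are made-up numbers, not CPUDB entries.

```python
def score_per_watt(score, watts):
    """Benchmark score divided by power in Watts."""
    return score / watts


def score_per_watt_per_mb(score, watts, cache_mb):
    """Score per Watt divided by cache size in MB (assumed normalization)."""
    return score / watts / cache_mb


# Hypothetical processors: (year, score, power in W, cache in MB).
chips = [
    (2006, 13.0, 65.0, 2.0),
    (2010, 38.0, 95.0, 4.0),
]

for year, score, watts, cache_mb in chips:
    raw = score_per_watt(score, watts)
    normalized = score_per_watt_per_mb(score, watts, cache_mb)
    print(f"{year}: {raw:.2f} score/W, {normalized:.2f} score/W/MB")
```

In this made-up pair the raw score per Watt doubles while the cache-normalized figure stays flat, which is the pattern described above: the apparent gain tracks the growth in cache.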

Appendix

Of course, the data for the first figure, and the references.

Year  Node (nm)  Processor                GMAC/W    Ref
1980  3000       IBM PC-XT                0.000001  [19]
1980  3000       IBM PC                   0.000002  [19]
1982  3000       Commodore 64             0.000000  [20]
1984  3500       MACINTOSH 128K           0.000005  [21, 22]
1985  1500       IBM PC-AT                0.000002  [23]
1988  2000       MC68030                  0.000017  [24]
1991  1000       INTEL 486                0.000065  [25]
1991  1000       IBM PS/2E                0.000086  [25]
1995  500        Pentium Pro 200          0.000182  [26, 27]
1992  750        DEC Alpha 21064 EV4      0.002632  [28]
1996  350        DEC Alpha 21264 EV6      0.001989  [28, 29]
1997  200        Early PowerPC            0.024056  [30]
1998  250        Pentium III 600B         0.003043  [31]
2001  180        PPC 7410                 0.014720  [32]
2003  130        PPC 7447                 0.017355  [33]
2003  130        Pentium 4 ee HT 3.2      0.006080  [34]
2003  130        TMS-320C6412-500         0.235215  [35, 36]
2003  90         TMS-320C6455-1000        0.210843  [37]
2004  90         POWER5                   0.133000  [38, 39]
2006  65         Athlon X2 BE2300         0.029556  [40]
2006  65         Core 2 Extreme X6800     0.027375  [41]
2006  65         Core 2 QX6700            0.028711  [42]
2007  65         Pentium Dual-Core E2140  0.017231  [43]
2007  90         Monarch                  0.746592  [44, 45]
2007  65         IBM Cell                 0.248889  [46]
2007  90         PPC 8641D                0.027364  [47]
2007  90         TMS-320C6424-700         0.341988  [48]
2007  65         POWER6                   0.063000  [39, 49]
2008  45         Core 2 T9400             0.050660  [50, 51]
2008  45         Atom 230                 0.140000  [52]
2008  65         TMS-320C6474-850         0.153879  [53]
2008  65         TMS-320C6474-1000        0.166667  [53]
2008  65         TMS-320C6474-1200        0.193846  [53]
2008  40         Tesla C2050              0.757647  [54, 55]
2008  40         Tesla C2075              0.838698  [54]
2008  40         Tesla M2050              0.801422  [54]
2008  40         Tesla M2070              0.801422  [54]
2008  40         Tesla M2090              1.035378  [54, 56]
2008  45         Core i7-965 ee           0.034462  [57]
2009  65         TMS-320C6748             1.360703  [58]
2008  45         Atom N270                0.224000  [59]
2009  90         Tile64                   0.712727  [60]
2009  40         ARM Cortex-A             1.085000  [61]
2010  NA         Qualcomm Snapdragon      0.234500  [62]
2010  NA         Tegra 3                  0.101500  [62]
2010  45         POWER7                   0.112000  [39, 63]
2010  40         Virtex 6                 4.736630  [64]
2010  32         Core i7 980 ee           0.107682  [65]
2010  40         Fermi GTX480             0.941472  [54]
2011  40         Fermi GTX580             0.566977  [54, 66]
2011  40         SPARC T4                 0.069953  [67]
2012  40         Virtex 7                 7.269231  [64]
2012  28         Fermi GTX680             2.773465  [54, 68]
2012  32         Atom D2550               0.130200  [69]
2013  22         POWER8                   0.168000  [39, 70]
2013  22         SPARC T5                 0.168000  [71]
2013  22         Core i7-4960X            0.116308  [72]
2013  40         TMS320C6678-1250         0.823529  [73]
2013  28         R9 290x                  3.942400  [74]
2013  28         Fermi GTX780             3.528000  [75]
2013  40         TILE-Gx72                0.806400  [76]
2014  28         Titan Black              3.584448  [77]
2014  22         E3-1284LV3               0.190638  [4]
2015  14         Kintex Ultrascale        4.806519  [64]

References

[1]    Jonathan G Koomey, Christian Belady, Michael Patterson, Anthony Santos, and Klaus-Dieter Lange, “Assessing trends over time in performance, costs, and energy use for servers,” Lawrence Berkeley National Laboratory, Stanford University, Microsoft Corporation, and Intel Corporation, Tech. Rep, 2009.

[2]    Ariel Bleicher, “5G service on your 4G phone?,” in IEEE Spectrum. IEEE, 2014.

[3]    Chris Auth, “22-nm fully-depleted tri-gate cmos transistors,” in Custom Integrated Circuits Conference (CICC), 2012 IEEE. IEEE, 2012, pp. 1–6.

[4]    S Narasimha, P Chang, et al., “22nm high-performance soi technology featuring dual-embedded stressors, epi-plate high-k deep-trench embedded dram and self-aligned via 15lm beol,” in Electron Devices Meeting (IEDM), 2012 IEEE International. IEEE, 2012, pp. 3–3.

[5]    C Auth, C Allen, et al., “A 22nm high performance and low-power CMOS technology featuring fully-depleted tri-gate transistors, self-aligned contacts and high density mim capacitors,” in VLSI Technology (VLSIT), 2012 Symposium on. IEEE, 2012, pp. 131–132.

[6]    Christopher S Wallace, “A suggestion for a fast multiplier,” Electronic Computers, IEEE Transactions on, no. 1, pp. 14–17, 1964.

[7]    Kaizad Mistry, C Allen, et al., “A 45nm logic technology with high-k+ metal gate transistors, strained silicon, 9 cu interconnect layers, 193nm dry patterning, and 100% pb-free packaging,” in Electron Devices Meeting, 2007. IEDM 2007. IEEE International. IEEE, 2007, pp. 247–250.

[8]    Peng Bai, C Auth, et al., “A 65nm logic technology featuring 35nm gate lengths, enhanced channel strain, 8 cu interconnect layers, low-k ild and 0.57 μm2sram cell,” in Electron Devices Meeting, 2004. IEDM Technical Digest. IEEE International. IEEE, 2004, pp. 657–660.

[9]    K Cheng, A Khakifirooz, et al., “Extremely thin soi (etsoi) CMOS with record low variability for low power system-on-chip applications,” in Electron Devices Meeting (IEDM), 2009 IEEE International. IEEE, 2009, pp. 1–4.

[10]    Himanshu Kaul, Mark Anders, et al., “A 1.45 ghz 52-to-162gflops/w variable-precision floating-point fused multiply-add unit with certainty tracking in 32nm CMOS,” in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International. IEEE, 2012, pp. 182–184.

[11]    Steven Hsu, Amit Agarwal, Mark Anders, Sanu Mathew, Himanshu Kaul, Farhana Sheikh, and Ram Krishnamurthy, “A 280mv-to-1.1 v 256b reconfigurable simd vector permutation engine with 2-dimensional shuffle in 22nm CMOS,” in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International. IEEE, 2012, pp. 178–180.

[12]    Jeffery A Davis, Vivek K De, and James D Meindl, “A stochastic wire-length distribution for gigascale integration (GSI)–part ii: Applications to clock frequency, power dissipation, and chip size estimation,” IEEE Transactions on Electron Devices, vol. 45, no. 3, pp. 590–597, 1998.

[13]    T.H. Ning, “A perspective on the theory of MOSFET scaling and its impact,” Solid-State Circuits Society Newsletter, IEEE, vol. 12, no. 1, pp. 27–30, Winter 2007.

[14]    TA Fjeldly and M. Shur, “Threshold voltage modeling and the subthreshold regime of operation of short-channel MOSFETs,” IEEE Transactions on Electron Devices, vol. 40, no. 1, pp. 137–145, 1993.

[15]    Mark Horowitz, “Computing’s energy problem:(and what we can do about it),” 2014, International Solid-State Circuits Conference (ISSCC).

[16]    Ron Kalla, Balaram Sinharoy, and Joel Tendler, “Simultaneous multi-threading implementation in POWER5,” in Conference Record of Hot Chips, 2003, vol. 15.

[17]    Andrew Danowitz, Kyle Kelley, James Mao, John P Stevenson, and Mark Horowitz, “CPUDB: recording microprocessor history,” Communications of the ACM, vol. 55, no. 4, pp. 55–63, 2012.

[18]    John L Henning, “SPEC CPU2006 benchmark descriptions,” ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1–17, 2006.

[19]    Intel, 8088 8-bit HMOS microprocessor, 8 1990.

[20]    Commodore Semiconductor Group, 6510 microprocessor, 1984.

[21]    Motorola, 68000 programmer’s reference manual, 1 1990, Rev. 1.0.

[22]    Motorola, M68000 8-/16-/32-BIT microprocessors user’s manual, Motorola Inc., 1991.

[23]    Intel Corp., Iapx 286 Hardware Reference Manual, Number 1983/210760-001. Intel Books, Berkeley, CA, USA, 1984.

[24]    Motorola, MC68030 enhanced 32-bit microprocessor user’s manual, Prentice-Hall, 1990.

[25]    Intel Corp., i486 Microprocessor Programmer’s Reference Manual, Osborne/McGraw-Hill, Berkeley, CA, USA, 1990.

[26]    “Intel Pentium Pro processors,” http://www.intel.com/support/processors/Pentiumpro/sb/cs-011161.htm, Accessed: 2014-05-06.

[27]    “Intel Pentium Pro 200,” http://ark.intel.com/products/49952/Intel-Pentium-Pro-Processor-200-MHz-256K-Cache-66-MHz-FSB, Accessed: 2014-05-06.

[28]    Richard E Kessler, Edward J McLellan, and David A Webb, “The alpha 21264 microprocessor architecture,” in Computer Design: VLSI in Computers and Processors, 1998. ICCD’98. Proceedings. International Conference on. IEEE, 1998, pp. 90–95.

[29]    Compaq Computer Corporation, Alpha 21264 Microprocessor DataSheet, 2 1999, Rev. 1.0.

[30]    IBM Microelectronics Division, PowerPC 740 and PowerPC 750 Microprocessor Datasheet, 6 2002, Rev. 2.0.

[31]    “Intel® Pentium® III processor 600 MHz, 512k cache, 133 MHz FSB,” http://ark.intel.com/products/27546/Intel-Pentium-III-Processor-600-MHz-512K-Cache-133-MHz-FSB, Accessed: 2014-05-06.

[32]    Freescale Semiconductor, “MPC7410 RISC Microprocessor Hardware Specifications,” 2005.

[33]    Freescale Semiconductor, “MPC7447 RISC Microprocessor Hardware Specifications,” 2005.

[34]    “Pentium® 4 processor extreme edition supporting HT technology 3.20 ghz, 2m cache, 800 MHz FSB,” http://ark.intel.com/products/27489/Pentium-4-Processor-Extreme-Edition-supporting-HT-Technology-3_20-GHz-2M-Cache-800-MHz-FSB, Accessed: 2014-05-06.

[35]    Texas Instruments, “TMS320C6412 power consumption summary,” http://www.ti.com/lit/an/spra967e/spra967e.pdf, 2005.

[36]    Texas Instruments, “TMS320C6412 fixed-point digital signal processor,” http://www.ti.com/lit/ds/symlink/tms320c6412.pdf, 2005, SPRS219G.

[37]    Texas Instruments, “TMS320C6455 fixed-point digital signal processor,” http://www.ti.com/lit/ds/symlink/tms320c6455.pdf, 2012, SPRS276M.

[38]    Ron Kalla, Balaram Sinharoy, and Joel M Tendler, “IBM POWER5 chip: A dual-core multithreaded processor,” Micro, IEEE, vol. 24, no. 2, pp. 40–47, 2004.

[39]    IBM, “IBM POWER systems, hardware deep dive,” http://www-05.ibm.com/cz/events/febannouncement2012/pdf/power_architecture.pdf, 2013.

[40]    “AMD Athlon BE-2300,” http://products.amd.com/pages/desktopcpudetail.aspx?id=394, Accessed: 2014-05-06.

[41]    “Intel® Core 2 x6800 (4m cache, 2.93 ghz, 1066 MHz FSB),” http://ark.intel.com/products/27258/Intel-Core2-Extreme-Processor-X6800-(4M-Cache-2_93-GHz-1066-MHz-FSB), Accessed: 2014-05-06.

[42]    “Intel® Core 2 qx6700 (8m cache, 2.66 ghz, 1066 MHz FSB),” http://ark.intel.com/products/28028/Intel-Core2-Extreme-Processor-QX6700-8M-Cache-2_66-GHz-1066-MHz-FSB, Accessed: 2014-05-06.

[43]    “Intel® Pentium® processor e2140 (1m cache, 1.60 ghz, 800 MHz FSB),” http://ark.intel.com/products/29738/Intel-Pentium-Processor-E2140-(1M-Cache-1_60-GHz-800-MHz-FSB), Accessed: 2014-05-06.

[44]    John Granacki and Mike Vahey, “Monarch: A MOrphable Networked micro-ARCHitecture,” Tech. Rep., USC/Information Sciences Institute and Raytheon, 2002.

[45]    Michael Vahey et al., “MONARCH: a first generation polymorphic computing processor,” in 10th High Performance Embedded Computing Workshop, 2006.

[46]    TR Maeurer and D Shippy, “Introduction to the cell multiprocessor,” IBM journal of Research and Development, vol. 49, no. 4, pp. 589–604, 2005.

[47]    Freescale Semiconductor, MPC8641 and MPC8641D Integrated Host Processor Hardware Specifications, 7 2009, MPC8641DEC.

[48]    Texas Instruments, “TMS320C6424 fixed-point digital signal processor,” http://www.ti.com/lit/ds/symlink/tms320c6424.pdf, 2009, SPRS347D.

[49]    Hung Q Le, William J Starke, J Stephen Fields, Francis P O’Connell, Dung Q Nguyen, Bruce J Ronchetti, Wolfram M Sauer, Eric M Schwarz, and Michael T Vaden, “IBM POWER6 microarchitecture,” IBM Journal of Research and Development, vol. 51, no. 6, pp. 639–662, 2007.

[50]    Ronak Singhal, “Inside intel core microarchitecture (nehalem),” in A Symposium on High Performance Chips, 2008, vol. 20.

[51]    “Intel® Core 2 Duo t9400 (6m cache, 2.53 ghz, 1066 MHz FSB),” http://ark.intel.com/products/35562/Intel-Core2-Duo-Processor-T9400-(6M-Cache-2_53-GHz-1066-MHz-FSB), Accessed: 2014-05-06.

[52]    “Intel® atom processor 230 (512k cache, 1.60 ghz, 533 MHz FSB),” http://ark.intel.com/products/35635/Intel-Atom-Processor-230-(512K-Cache-1_60-GHz-533-MHz-FSB), Accessed: 2014-05-06.

[53]    Texas Instruments, “TMS320C6474 fixed-point digital signal processor,” http://www.ti.com/lit/ds/symlink/tms320c6474.pdf, 2011, SPRS552H.

[54]    C. Woolley, “CUDA overview,” http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/02-cuda-overview.pdf, 2011.

[55]    “Tesla C2050 and Tesla C2070 computing processor board,” http://www.nvidia.com/docs/IO/43395/Tesla_C2050_Board_Specification.pdf, 2010, BD-04983-001-v02.

[56]    “Tesla M2090 dual-slot computing processor module,” http://www.nvidia.com/docs/IO/43395/Tesla-M2090-Board-Specification.pdf, 2010, BD-05766-001-v02.

[57]    “Intel® core i7-965 processor extreme edition,” http://ark.intel.com/products/37149/Intel-Core-i7-965-Processor-Extreme-Edition-8M-Cache-3_20-GHz-6_40-GTs-Intel-QPI, Accessed: 2014-05-06.

[58]    Texas Instruments, “TMS320C6748 fixed-point digital signal processor,” http://www.ti.com/lit/ds/symlink/tms320c6748.pdf, 2013, SPRS590E.

[59]    “Intel® atom processor n270 (512k cache, 1.60 ghz, 533 MHz FSB),” http://ark.intel.com/products/36331/Intel-Atom-Processor-N270-512K-Cache-1_60-GHz-533-MHz-FSB, Accessed: 2014-05-06.

[60]    Shane Bell, Bruce Edwards, John Amann, Rich Conlin, Kevin Joyce, Vince Leung, John MacKay, Mike Reif, Liewei Bao, John Brown, et al., “Tile64-processor: A 64-core SoC with mesh interconnect,” in Solid-State Circuits Conference, 2008. ISSCC 2008. Digest of Technical Papers. IEEE International. IEEE, 2008, pp. 88–598.

[61]    ARM, “Cortex-a5,” http://www.arm.com/products/processors/cortex-a/cortex-a5.php, 2009, DDI 0433A.

[62]    L. Lewins, “Performance of low power ARM processors,” in Raytheon Information Systems and Computing (ISAC) Symposium. Raytheon, 2012.

[63]    DF Wendel, J Barth, DM Dreps, S Islam, J Pille, and JA Tierno, “IBM POWER7 processor circuit design,” IBM Journal of Research and Development, vol. 55, no. 3, pp. 1–1, 2011.

[64]    Xilinx, “Xilinx power estimator (xpe),” http://www.xilinx.com/products/design_tools/logic_design/xpe.htm.

[65]    “Intel® core i7-980x processor extreme edition,” http://ark.intel.com/products/47932/Intel-Core-i7-980X-Processor-Extreme-Edition-(12M-Cache-3_33-GHz-6_40-GTs-Intel-QPI), Accessed: 2014-05-06.

[66]    “GeForce GTX 580,” http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-580/specifications, Accessed: 2014-05-06.

[67]    Manish Shah et al., “SPARC t4: A dynamically threaded server-on-a-chip,” IEEE Micro, vol. 32, no. 2, pp. 0008–19, 2012.

[68]    “GeForce GTX 680,” http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-680/specifications, Accessed: 2014-05-06.

[69]    “Intel® atom processor d2550 (1m cache, 1.86 ghz),” http://ark.intel.com/products/65470/Intel-Atom-Processor-D2550-1M-Cache-1_86-GHz, Accessed: 2014-05-06.

[70]    Eric J Fluhr et al., “5.1 POWER8 tm: A 12-core server-class processor in 22nm soi with 7.6 tb/s off-chip bandwidth,” in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014 IEEE International. IEEE, 2014, pp. 96–97.

[71]    John Feehrer, Sumti Jairath, Paul Loewenstein, Ram Sivaramakrishnan, David Smentek, Sebastian Turullols, and Ali Vahidsafa, “The oracle SPARC T5 16-core processor scales to eight sockets,” Micro, IEEE, vol. 33, no. 2, pp. 48–57, 2013.

[72]    “Intel® core i7-4960x processor extreme edition (15m cache, up to 4.00 ghz),” http://ark.intel.com/products/77779/Intel-Core-i7-4960X-Processor-Extreme-Edition-15M-Cache-up-to-4_00-GHz, Accessed: 2014-05-06.

[73]    Texas Instruments, “TMS320C6678 Multicore Fixed and Floating-Point Digital Signal Processor,” http://www.ti.com.cn/cn/lit/ds/symlink/tms320c6678.pdf, 2013, SPRS691D.

[74]    “Radeon R9,” http://www.amd.com/en-us/products/graphics/desktop/r9, Accessed: 2014-05-06.

[75]    “GeForce GTX 780,” http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-780/specifications, Accessed: 2014-05-06.

[76]    M. Mattina, “Architecture and performance of the Tilera TILE-Gx8072 manycore processor,” http://www.hoti.org/hoti21/slides/Mattina.pdf, 2013, Tilera.

[77]    “GeForce GTX 700 titan black,” http://www.nvidia.com/gtx-700-graphics-cards/gtx-titan-black/, Accessed: 2014-05-06.
