Otherwise forgotten.: April 2015

Now that my latest paper has been accepted, Assessing Trends in Performance per Watt for Signal Processing Applications, I can actually talk about the what was important about it. This paper is about power, so here are the important parts:

Processing GMAC/W calculations are approaching a Powerwall
Memory Power halves every 29 years, which creates a fixed power requirement
Non-data flow: Cache is king

The summary is that data-flow processing will have stagnant power improvements from scaling, memory transfers are of fixed power, and cache is really not helping as much as you think when it comes to relative performance per Watt.

The GMAC, everybody does it.

The Multiply ACcumulate (MAC) is a mathematical calculation that computers spend a good deal of their time doing. It's used heavily in natural signal processing, such as audio, and video cards have GPUs that designed to quickly do these calculations. There are two players in the marketing of the semiconductor space who have "Law's" that are not actual laws, but more of trends that have been noticed: Moore and Koomey[1]. Moore said that more transistors will fit on a die, and Koomey did a survey that showed that power consumption is still scaling. The assumption is that as transistors have scaled, their relative behavior has scaled, which is not true. Another item that comes into to play on the marketing side is how you measure power. Most processors offload some power requirements to RAM that has sense amps just burning current to have an effective input capacitance, which makes the CPU power lower at the cost of increased RAM power. If you really want to get a good feel for the power requirements of a system the Giga-MAC per Watt (GMAC/W) is a great calculation.

The the graph above, I present Koomey's system trend and then the trend specific to the GMAC/W . Koomey included systems until 2006, but did not focus on specifically on data flow applications. If you are trying to do things like audio, video games, watching the sky, processing speech, RADAR, and just about any natural system, you are burdened by how much data you can move through the system, and how to get the heat out and data and power in. I am confident that for digital approaches there exists a Powerwall at between 10 to 30 GMAC/W. ie: the end of useful scaling for power constrained systems.

When I first started looking at this with Dr. Marr, it was because of a program called DARPA ACT. The question was if the specs were actually achievable with classical digital approaches, and I happen to play in the space of non-classical digital and analog processing. The idea of ACT is to build a phased array system where the amount of bandwidth and number of beams that can be formed scales per unit Watt. This would allow powerful phased array technology to proliferate in many types of devices, not just expensive military radars. Also, the cloud computing revolution, in part, depends on Koomey's law continuing; if we are to cloud compute with mobile devices, we must be able to transmit, receive, and process wireless data for ever smaller energy levels assuming fixed or slowly growing energy density in batteries, which so far has been the case.

Conclusion: DARPA ACT will not hit their power target with classical digital approaches.
I have not yet seen the final reports, but I hope the teams prove me wrong. I doubt it.

Memory Power is stagnant?

Memory transfer power halves every 29 years, which makes memory power a fixed commodity calculation for data-flow systems. If you have data that is constantly changing, you are burdened by the speed and power of the data bus.

When it comes to memory, it really depends on architecture. As far as improvements, it's life in the slow lane.

What about measured performance in a non-dataflow system?

In a non-dataflow system, cache is king. If you cannot get data in faster, you need to keep it local and handy. I looked into what SPECInt2006 had to say about this using the Stanford CPUDB. If data transfer power is constrained by IO, you need have a lot of cache to minimize non-data instructions coming into the system, but what about power? Using the data in the CPUDB, I plotted SPECInt2006 basemean score per Watt, and then normalized it for processor cache.

I believe that these improvements in SPECint96 score are based upon increases in processor cache because the graph shows almost no trend when the cache is normalized across the processors. As data flow applications are not affected by substantial increase in cache sizes, I believe that there is little improvement between years due to scaling regarding score per Watt.
However, it does make a good argument for simpler code. When you consider that how large programs have gotten relative to cache, the Computer Science community should consider trimming the fat a bit.

Appendix

Of course, the data for the first figure, and the references.

year	node(nm)	processor	GMAC/W	ref

1980	3000	IBM PC-XT	0.000001	[19]

1980	3000	IBM PC	0.000002	[19]

1982	3000	commodore 64	0.000000	[20]

1984	3500	MACINTOSH 128K	0.000005	[21, 22]

1985	1500	IBM PC-AT	0.000002	[23]

1988	2000	MC68030	0.000017	[24]

1991	1000	INTEL 486	0.000065	[25]

1991	1000	IBM PS/2E	0.000086	[25]

1995	500	Pentium Pro 200	0.000182	[26, 27]

1992	750	DEC Alpha 21064 EV4	0.002632	[28]

1996	350	DEC Alpha 21264 EV6	0.001989	[28, 29]

1997	200	Early PowerPC	0.024056	[30]

1998	250	Pentium III 600B	0.003043	[31]

2001	180	PPC 7410	0.014720	[32]

2003	130	PPC 7447	0.017355	[33]

2003	130	Pentium 4 ee HT 3.2	0.006080	[34]

2003	130	TMS-320C6412-500	0.235215	[35, 36]

2003	90	TMS-320C6455-1000	0.210843	[37]

2004	90	POWER5	0.133000	[38, 39]

2006	65	Athlon X2 BE2300	0.029556	[40]

2006	65	Core 2 Extreme X6800	0.027375	[41]

2006	65	Core 2 QX6700	0.028711	[42]

2007	65	Pentium Dual-Core E2140	0.017231	[43]

2007	90	Monarch	0.746592	[44, 45]

2007	65	IBM Cell	0.248889	[46]

2007	90	PPC 8641D	0.027364	[47]

2007	90	TMS-320C6424-700	0.341988	[48]

2007	65	POWER6	0.063000	[39, 49]

2008	45	Core 2 T9400	0.050660	[50, 51]

2008	45	Atom 230	0.140000	[52]

2008	65	TMS-320C6474-850	0.153879	[53]

2008	65	TMS-320C6474-1000	0.166667	[53]

2008	65	TMS-320C6474-1200	0.193846	[53]

2008	40	Tesla C2050	0.757647	[54, 55]

2008	40	Tesla C2075	0.838698	[54]

2008	40	Tesla M2050	0.801422	[54]

2008	40	Tesla M2070	0.801422	[54]

2008	40	Tesla M2090	1.035378	[54, 56]

2008	45	Core i7-965 ee	0.034462	[57]

2009	65	TMS-320C6748	1.360703	[58]

2008	45	Atom N270	0.224000	[59]

2009	90	Tile64	0.712727	[60]

2009	40	ARM Cortex-A	1.085000	[61]

2010	NA	Qualcomm Snapdragon	0.234500	[62]

2010	NA	Tegra 3	0.101500	[62]

2010	45	POWER7	0.112000	[39, 63]

2010	40	Virtex 6	4.736630	[64]

2010	32	Core i7 980 ee	0.107682	[65]

2010	40	Fermi GTX480	0.941472	[54]

2011	40	Fermi GTX580	0.566977	[54, 66]

2011	40	SPARC T4	0.069953	[67]

2012	40	Virtex 7	7.269231	[64]

2012	28	Fermi GTX680	2.773465	[54, 68]

2012	32	Atom D2550	0.130200	[69]

2013	22	POWER8	0.168000	[39, 70]

2013	22	SPARC T5	0.168000	[71]

2013	22	Core i7-4960X	0.116308	[72]

2013	40	TMS320C6678-1250	0.823529	[73]

2013	28	R9 290x	3.942400	[74]

2013	28	Fermi GTX780	3.528000	[75]

2013	40	TILE-Gx72	0.806400	[76]

2014	28	Titan Black	3.584448	[77]

2014	22	E3-1284LV3	0.190638	[4]

2015	14	Kintex Ultrascale	4.806519	[64]

References

[1] Jonathan G Koomey, Christian Belady, Michael Patterson, Anthony Santos, and Klaus-Dieter Lange, “Assessing trends over time in performance, costs, and energy use for servers,” Lawrence Berkeley National Laboratory, Stanford University, Microsoft Corporation, and Intel Corporation, Tech. Rep, 2009.

[2] Ariel Bleicher, “5G service on your 4G phone?,” in IEEE Spectrum. IEEE, 2014.

[3] Chris Auth, “22-nm fully-depleted tri-gate cmos transistors,” in Custom Integrated Circuits Conference (CICC), 2012 IEEE. IEEE, 2012, pp. 1–6.

[4] S Narasimha, P Chang, et al., “22nm high-performance soi technology featuring dual-embedded stressors, epi-plate high-k deep-trench embedded dram and self-aligned via 15lm beol,” in Electron Devices Meeting (IEDM), 2012 IEEE International. IEEE, 2012, pp. 3–3.

[5] C Auth, C Allen, et al., “A 22nm high performance and low-power CMOS technology featuring fully-depleted tri-gate transistors, self-aligned contacts and high density mim capacitors,” in VLSI Technology (VLSIT), 2012 Symposium on. IEEE, 2012, pp. 131–132.

[6] Christopher S Wallace, “A suggestion for a fast multiplier,” Electronic Computers, IEEE Transactions on, , no. 1, pp. 14–17, 1964.

[7] Kaizad Mistry, C Allen, et al., “A 45nm logic technology with high-k+ metal gate transistors, strained silicon, 9 cu interconnect layers, 193nm dry patterning, and 100% pb-free packaging,” in Electron Devices Meeting, 2007. IEDM 2007. IEEE International. IEEE, 2007, pp. 247–250.

[8] Peng Bai, C Auth, et al., “A 65nm logic technology featuring 35nm gate lengths, enhanced channel strain, 8 cu interconnect layers, low-k ild and 0.57 μm²sram cell,” in Electron Devices Meeting, 2004. IEDM Technical Digest. IEEE International. IEEE, 2004, pp. 657–660.

[9] K Cheng, A Khakifirooz, et al., “Extremely thin soi (etsoi) CMOS with record low variability for low power system-on-chip applications,” in Electron Devices Meeting (IEDM), 2009 IEEE International. IEEE, 2009, pp. 1–4.

[10] Himanshu Kaul, Mark Anders, et al., “A 1.45 ghz 52-to-162gflops/w variable-precision floating-point fused multiply-add unit with certainty tracking in 32nm CMOS,” in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International. IEEE, 2012, pp. 182–184.

[11] Steven Hsu, Amit Agarwal, Mark Anders, Sanu Mathew, Himanshu Kaul, Farhana Sheikh, and Ram Krishnamurthy, “A 280mv-to-1.1 v 256b reconfigurable simd vector permutation engine with 2-dimensional shuffle in 22nm CMOS,” in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International. IEEE, 2012, pp. 178–180.

[12] Jeffery A Davis, Vivek K De, and James D Meindl, “A stochastic wire-length distribution for gigascale integration (GSI)–part ii: Applications to clock frequency, power dissipation, and chip size estimation,” IEEE Transactions on Electron Devices, vol. 45, no. 3, pp. 590–597, 1998.

[13] T.H. Ning, “A perspective on the theory of MOSFET scaling and its impact,” Solid-State Circuits Society Newsletter, IEEE, vol. 12, no. 1, pp. 27–30, Winter 2007.

[14] TA Fjeldly and M. Shur, “Threshold voltage modeling and the subthreshold regime of operation of short-channel MOSFETs,” IEEE Transactions on Electron Devices, vol. 40, no. 1, pp. 137–145, 1993.

[15] Mark Horowitz, “Computing’s energy problem:(and what we can do about it),” 2014, International Solid-State Circuits Conference (ISSCC).

[16] Ron Kalla, Balaram Sinharoy, and Joel Tendler, “Simultaneous multi-threading implementation in POWER5,” in Conference Record of Hot Chips, 2003, vol. 15.

[17] Andrew Danowitz, Kyle Kelley, James Mao, John P Stevenson, and Mark Horowitz, “CPUDB: recording microprocessor history,” Communications of the ACM, vol. 55, no. 4, pp. 55–63, 2012.

[18] John L Henning, “SPEC CPU2006 benchmark descriptions,” ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1–17, 2006.

[19] Intel, 8088 8-bit HMOS microprocessor, 8 1990.

[20] Commodore Semiconductor Group, 6510 microprocessor, 1984.

[21] Motorola, 68000 programmer’s reference manual, 1 1990, Rev. 1.0.

[22] Motorola, M68000 8-/16-/32-BIT microprocessors user’s manual, Motorola Inc., 1991.

[23] Intel Corp., Iapx 286 Hardware Reference Manual, Number 1983/210760-001. Intel Books, Berkeley, CA, USA, 1984.

[24] Motorola, MC68030 enhanced 32-bit microprocessor user’s manual, Prentice-Hall, 1990.

[25] Intel Corp., i486 Microprocessor Programmer’s Reference Manual, Osborne/McGraw-Hill, Berkeley, CA, USA, 1990.

[26] “Intel Pentium Pro processors,” http://www.intel.com/support/processors/{P}entiumpro/sb/cs-_011161.htm, Accessed: 2014-05-06.

[27] “Intel Pentium Pro 200,” http://ark.intel.com/products/49952/Intel-_{P}entium-_Pro-_Processor-_200-_MHz-_256K-_Cache-_66- _MHz-_FSB, Accessed: 2014-05-06.

[28] Richard E Kessler, Edward J McLellan, and David A Webb, “The alpha 21264 microprocessor architecture,” in Computer Design: VLSI in Computers and Processors, 1998. ICCD’98. Proceedings. International Conference on. IEEE, 1998, pp. 90–95.

[29] Compaq Computer Corporation, Alpha 21264 Microprocessor DataSheet, 2 1999, Rev. 1.0.

[30] IBM Microelectronics Division, PowerPC 740 and PowerPC 750 Microprocessor Datasheet, 6 2002, Rev. 2.0.

[31] “Intel®; Pentium®; III processor 600 MHz, 512k cache, 133 MHz FSB,” http://ark.intel.com/products/27546/Intel-_{P}entium-_{III}- _Processor-_600-_MHz-_512K-_Cache-_133-_MHz-_FSB, Accessed: 2014-05-06.

[32] Freescale Semiconductor, “MPC7410 RISC Microprocessor Hardware Specifications,” 2005.

[33] Freescale Semiconductor, “MPC7447 RISC Microprocessor Hardware Specifications,” 2005.

[34] “Pentium®; 4 processor extreme edition supporting HT technology 3.20 ghz, 2m cache, 800 MHz FSB,” http://ark.intel.com/products/27489/ {P}entium-_4-_Processor-_Extreme-_Edition-_supporting-_HT-_Technology-_3_20-_GHz-_2M-_Cache-_800-_MHz-_FSB, Accessed: 2014-05-06.

[35] Texas Instruments, “TMS320C6412 power consumption summary,” http://www.ti.com/lit/an/spra967e/spra967e.pdf, 2005.

[36] Texas Instruments, “TMS320C6412 fixed-point digital signal processor,” http://www.ti.com/lit/ds/symlink/tms320c6412.pdf, 2005, SPRS219G.

[37] Texas Instruments, “TMS320C6455 fixed-point digital signal processor,” http://www.ti.com/lit/ds/symlink/tms320c6455.pdf, 2012, SPRS276M.

[38] Ron Kalla, Balaram Sinharoy, and Joel M Tendler, “IBM POWER5 chip: A dual-core multithreaded processor,” Micro, IEEE, vol. 24, no. 2, pp. 40–47, 2004.

[39] IBM, “IBM POWER systems, hardware deep dive,” http://www-_05.ibm.com/cz/events/febannouncement2012/pdf/power_ architecture.pdf, 2013.

[40] “AMD Athlon BE-2300,” http://products.amd.com/pages/desktopcpudetail.aspx?id=394, Accessed: 2014-05-06.

[41] “Intel®; Core 2 x6800 (4m cache, 2.93 ghz, 1066 MHz FSB),” http://ark.intel.com/products/27258/Intel-_Core2-_Extreme- _Processor-_X6800-_(4M-_Cache-_2_93-_GHz-_1066-_MHz-_FSB), Accessed: 2014-05-06.

[42] “Intel®; Core 2 qx6700 (8m cache, 2.66 ghz, 1066 MHz FSB),” http://ark.intel.com/products/28028/Intel-_Core2-_Extreme- _Processor-_QX6700-_8M-_Cache-_2_66-_GHz-_1066-_MHz-_FSB, Accessed: 2014-05-06.

[43] “Intel®; Pentium®; processor e2140 (1m cache, 1.60 ghz, 800 MHz FSB),” http://ark.intel.com/products/29738/Intel-_{P}entium- _Processor-_E2140-_(1M-_Cache-_1_60-_GHz-_800-_MHz-_FSB), Accessed: 2014-05-06.

[44] John Granacki and Mike Vahey, “Monarch: A MOrphable Networked micro-ARCHitecture,” Tech. Rep., USC/Information Sciences Institute and Raytheon, 2002.

[45] Michael Vahey et al., “MONARCH: a first generation polymorphic computing processor,” in 10th High Performance Embedded Computing Workshop, 2006.

[46] TR Maeurer and D Shippy, “Introduction to the cell multiprocessor,” IBM journal of Research and Development, vol. 49, no. 4, pp. 589–604, 2005.

[47] Freescale Semiconductor, MPC8641 and MPC8641D Integrated Host Processor Hardware Specifications, 7 2009, MPC8641DEC.

[48] Texas Instruments, “TMS320C6424 fixed-point digital signal processor,” http://www.ti.com/lit/ds/symlink/tms320c6424.pdf, 2009, SPRS347D.

[49] Hung Q Le, William J Starke, J Stephen Fields, Francis P O’Connell, Dung Q Nguyen, Bruce J Ronchetti, Wolfram M Sauer, Eric M Schwarz, and Michael T Vaden, “IBM POWER6 microarchitecture,” IBM Journal of Research and Development, vol. 51, no. 6, pp. 639–662, 2007.

[50] Ronak Singhal, “Inside intel core microarchitecture (nehalem),” in A Symposium on High Performance Chips, 2008, vol. 20.

[51] “Intel®; Core 2 Duo t9400 (6m cache, 2.53 ghz, 1066 MHz FSB),” http://ark.intel.com/products/35562/Intel-_Core2-_Duo- _Processor-_T9400-_(6M-_Cache-_2_53-_GHz-_1066-_MHz-_FSB), Accessed: 2014-05-06.

[52] “Intel®; atom processor 230 (512k cache, 1.60 ghz, 533 MHz FSB),” http://ark.intel.com/products/35635/Intel-_Atom-_Processor- _230-_(512K-_Cache-_1_60-_GHz-_533-_MHz-_FSB), Accessed: 2014-05-06.

[53] Texas Instruments, “TMS320C6474 fixed-point digital signal processor,” http://www.ti.com/lit/ds/symlink/tms320c6474.pdf, 2011, SPRS552H.

[54] C. Woolley, “CUDA overview,” http://www.cc.gatech.edu/~vetter/keeneland/tutorial-_2011-_04-_14/02-_cuda-_overview.pdf, 2011.

[55] “Tesla C2050 and Tesla C2070 computing processor board,” http://www.nvidia.com/docs/IO/43395/Tesla_C2050_Board_Specification. pdf, 2010, BD-04983-001-v02.

[56] “Tesla M2090 dual-slot computing processor module,” http://www.nvidia.com/docs/IO/43395/Tesla-_M2090-_Board-_Specification. pdf, 2010, BD-05766-001-v02.

[57] “Intel®; core i7-965 processor extreme edition,” http://ark.intel.com/products/37149/Intel-_Core-_i7-_965-_Processor-_Extreme- _Edition-_8M-_Cache-_3_20-_GHz-_6_40-_GTs-_Intel-_QPI, Accessed: 2014-05-06.

[58] Texas Instruments, “TMS320C6448 fixed-point digital signal processor,” http://www.ti.com/lit/ds/symlink/tms320c6748.pdf, 2013, SPRS590E.

[59] “Intel®; atom processor n270 (512k cache, 1.60 ghz, 533 MHz FSB),” http://ark.intel.com/products/36331/Intel-_Atom-_Processor- _N270-_512K-_Cache-_1_60-_GHz-_533-_MHz-_FSB, Accessed: 2014-05-06.

[60] Shane Bell, Bruce Edwards, John Amann, Rich Conlin, Kevin Joyce, Vince Leung, John MacKay, Mike Reif, Liewei Bao, John Brown, et al., “Tile64-processor: A 64-core SoC with mesh interconnect,” in Solid-State Circuits Conference, 2008. ISSCC 2008. Digest of Technical Papers. IEEE International. IEEE, 2008, pp. 88–598.

[61] ARM, “Cortex-a5,” http://www.arm.com/products/processors/cortex-_a/cortex-_a5.php, 2009, DDI 0433A.

[62] L. Lewins, “Performance of low power ARM processors,” in Raytheon Information Systems and Computing (ISAC) Symposium. Raytheon, 2012.

[63] DF Wendel, J Barth, DM Dreps, S Islam, J Pille, and JA Tierno, “IBM POWER7 processor circuit design,” IBM Journal of Research and Development, vol. 55, no. 3, pp. 1–1, 2011.

[64] Xilinx, “Xilinx power estimator (xpe),” http://www.xilinx.com/products/design_tools/logic_design/xpe.htm.

[65] “Intel®; core i7-980x processor extreme edition,” http://ark.intel.com/products/47932/Intel-_Core-_i7-_980X-_Processor-_Extreme- _Edition-_(12M-_Cache-_3_33-_GHz-_6_40-_GTs-_Intel-_QPI), Accessed: 2014-05-06.

[66] “GeForce GTX 580,” http://www.geforce.com/hardware/desktop-_gpus/geforce-_gtx-_580/specifications, Accessed: 2014-05-06.

[67] Manish Shah et al., “SPARC t4: A dynamically threaded server-on-a-chip,” IEEE Micro, vol. 32, no. 2, pp. 0008–19, 2012.

[68] “GeForce GTX 680,” http://www.geforce.com/hardware/desktop-_gpus/geforce-_gtx-_680/specifications, Accessed: 2014-05-06.

[69] “Intel®; atom processor d2550 (1m cache, 1.86 ghz),” http://ark.intel.com/products/65470/Intel-_Atom-_Processor-_D2550-_1M- _Cache-_1_86-_GHz, Accessed: 2014-05-06.

[70] Eric J Fluhr et al., “5.1 POWER8 tm: A 12-core server-class processor in 22nm soi with 7.6 tb/s off-chip bandwidth,” in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014 IEEE International. IEEE, 2014, pp. 96–97.

[71] John Feehrer, Sumti Jairath, Paul Loewenstein, Ram Sivaramakrishnan, David Smentek, Sebastian Turullols, and Ali Vahidsafa, “The oracle SPARC T5 16-core processor scales to eight sockets,” Micro, IEEE, vol. 33, no. 2, pp. 48–57, 2013.

[72] “Intel®; core i7-4960x processor extreme edition (15m cache, up to 4.00 ghz),” http://ark.intel.com/products/77779/Intel-_Core-_i7- _4960X-_Processor-_Extreme-_Edition-_15M-_Cache-_up-_to-_4_00-_GHz, Accessed: 2014-05-06.

[73] Texas Instruments, “TMS320C6678 Multicore Fixed and Floating-Point Digital Signal Processor,” http://www.ti.com.cn/cn/lit/ds/symlink/ tms320c6678.pdf, 2013, SPRS691D.

[74] “Radeon R9,” http://www.amd.com/en-_us/products/graphics/desktop/r9, Accessed: 2014-05-06.

[75] “GeForce GTX 780,” http://www.geforce.com/hardware/desktop-_gpus/geforce-_gtx-_780/specifications, Accessed: 2014-05-06.

[76] M. Mattina, “Architecture and performance of the Tilera TILE-Gx8072 manycore processor,” http://www.hoti.org/hoti21/slides/Mattina. pdf, 2013, Tilera.

[77] “GeForce GTX 700 titan black,” http://www.nvidia.com/gtx-_700-_graphics-_cards/gtx-_titan-_black/, Accessed: 2014-05-06.

Otherwise forgotten.

Thursday, April 30, 2015

CMRF8SF: Resistor device mismatch, the saga continues, and the fix.

CMRF8SF: bondpad bad device

Tuesday, April 28, 2015

CMRF8SF: resistor device mismatch

Friday, April 24, 2015

CMRF8SF, wrong metals but not really, so recompile the techfile

CMRF8SF, it's always that last metal: 4-1-3

Saturday, April 11, 2015

Cadence and Virtuoso, versions get me.

A new calendar system: Ivy Mike

Thursday, April 9, 2015

Assessing Trends in Performance per Watt for Signal Processing Applications

The GMAC, everybody does it.

Memory Power is stagnant?

What about measured performance in a non-dataflow system?

Appendix