Monday, December 21, 2015

Simon Cipher for 32/64.

The complete iteration trace for the Simon cipher, where k is the key expansion and c is the encryption block, using the key and test block specified by the Simon and Speck specification:
key value: 1918111009080100
test block: 65656877
result: c69be9bb

k[00] 1918 1110 0908 0100 
c[00] 6565 6877 
k[01] 71c3 1918 1110 0908 
c[01] bca2 6565 
k[02] b649 71c3 1918 1110 
c[02] bee3 bca2 
k[03] 56d4 b649 71c3 1918 
c[03] 37ba bee3 
k[04] e070 56d4 b649 71c3 
c[04] 5327 37ba 
k[05] f15a e070 56d4 b649 
c[05] 2ca6 5327 
k[06] c535 f15a e070 56d4 
c[06] 57fa 2ca6 
k[07] dd94 c535 f15a e070 
c[07] 8fcf 57fa 
k[08] 4010 dd94 c535 f15a 
c[08] 873b 8fcf 
k[09] 250a 4010 dd94 c535 
c[09] 687c 873b 
k[10] 6f66 250a 4010 dd94 
c[10] b397 687c 
k[11] e96b 6f66 250a 4010 
c[11] 7c95 b397 
k[12] 4bd8 e96b 6f66 250a 
c[12] 90fa 7c95 
k[13] 0fe5 4bd8 e96b 6f66 
c[13] 3ae5 90fa 
k[14] 7c47 0fe5 4bd8 e96b 
c[14] 7102 3ae5 
k[15] e0ef 7c47 0fe5 4bd8 
c[15] 1587 7102 
k[16] 3e21 e0ef 7c47 0fe5 
c[16] 6fc2 1587 
k[17] 065b 3e21 e0ef 7c47 
c[17] 676f 6fc2 
k[18] 438c 065b 3e21 e0ef 
c[18] c07e 676f 
k[19] f26a 438c 065b 3e21 
c[19] 86bb c07e 
k[20] b5c0 f26a 438c 065b 
c[20] edb7 86bb 
k[21] 8609 b5c0 f26a 438c 
c[21] a552 edb7 
k[22] 9f8e 8609 b5c0 f26a 
c[22] 79d4 a552 
k[23] d8bf 9f8e 8609 b5c0 
c[23] 6041 79d4 
k[24] 09ac d8bf 9f8e 8609 
c[24] 0d11 6041 
k[25] e812 09ac d8bf 9f8e 
c[25] c20c 0d11 
k[26] 2710 e812 09ac d8bf 
c[26] 9eac c20c 
k[27] 2caa 2710 e812 09ac 
c[27] 4c19 9eac 
k[28] 8d14 2caa 2710 e812 
c[28] bf65 4c19 
k[29] fa04 8d14 2caa 2710 
c[29] 3d16 bf65 
k[30] 32f2 fa04 8d14 2caa 
c[30] 7e01 3d16 
k[31] 7db9 32f2 fa04 8d14 
c[31] e9bb 7e01 
k[32] 4d83 7db9 32f2 fa04 
c[32] c69b e9bb
The test vector that was provided by the NSA seems to be explicitly aligned to hit the critical bits in the key expansion. Pretty nice piece of mathematics.
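For reference, here is a minimal C sketch of Simon 32/64 (16-bit words, 4 key words, 32 rounds) that reproduces the c[i] trace above and the key words the encryption actually uses. The rotations and constants come straight from the specification, but treat this as an illustration rather than a vetted implementation:

#include <stdio.h>
#include <stdint.h>

static uint16_t rol(uint16_t x, int r) { return (uint16_t)((x << r) | (x >> (16 - r))); }
static uint16_t ror(uint16_t x, int r) { return (uint16_t)((x >> r) | (x << (16 - r))); }

/* z0 sequence from the paper; only the first 28 bits are needed here. */
static const uint8_t z0[62] = {
    1,1,1,1,1,0,1,0,0,0,1,0,0,1,0,1,0,1,1,0,0,0,0,1,1,1,0,0,1,1,0,
    1,1,1,1,1,0,1,0,0,0,1,0,0,1,0,1,0,1,1,0,0,0,0,1,1,1,0,0,1,1,0};

int main(void)
{
    uint16_t k[32] = {0x0100, 0x0908, 0x1110, 0x1918};  /* k0..k3, low word first */
    uint16_t x = 0x6565, y = 0x6877;                    /* test block             */
    int i;

    /* Key expansion: k[i+4] = 0xfffc ^ z0[i] ^ k[i] ^ (I ^ S^-1)(S^-3 k[i+3] ^ k[i+1]) */
    for (i = 0; i < 28; i++) {
        uint16_t t = ror(k[i + 3], 3) ^ k[i + 1];
        t ^= ror(t, 1);
        k[i + 4] = (uint16_t)(0xfffc ^ z0[i] ^ k[i] ^ t);
    }

    /* Round function: (x, y) <- (y ^ (S^1 x & S^8 x) ^ S^2 x ^ k[i], x) */
    for (i = 0; i < 32; i++) {
        uint16_t t = x;
        x = (uint16_t)(y ^ (rol(x, 1) & rol(x, 8)) ^ rol(x, 2) ^ k[i]);
        y = t;
        printf("c[%02d] %04x %04x\n", i + 1, x, y);
    }
    return 0;  /* last line printed: c[32] c69b e9bb */
}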

Saturday, December 19, 2015

LaTeX: Center rotated text vertically.

There seem to be hundreds of ways to do this. I only had two columns to tweak, and this worked best:
\raisebox{2.5\normalbaselineskip}[0pt][0pt]{\rotatebox[origin=c]{90}{Rotated Text}}
The 2.5 needs to be adjusted for the height of the row.

Sunday, December 13, 2015

Removing commas from a line of text.

Who knows why it took me four attempts to get it right:

echo "1,1,1,1,1,0,1,0,0,0,1,0,0,1,0,1,0,1,1,0,0,0,0,1,1,1,0,0,1,1,0,1,1,1,1,1,0,1,0,0,0,1,0,0,1,0,1,0,1,1,0,0,0,0,1,1,1,0,0,1,1,0" | sed 's/,//g'
which resulted in:
11111010001001010110000111001101111101000100101011000011100110

Tuesday, December 1, 2015

Finding files with BASH and excluding some from the list.

I wanted to copy some files from the current directory to another based on file type and file name. Bash 4.0 has some great things, such as extended globbing:
shopt -s globstar nullglob extglob
I do not have Bash 4.0; I have Bash 3.2. So how does one list all files of a given type and then exclude a few? The fastest way I found uses no external tools, just case statements:
for f in *; do
  case "$f" in
    *.c | *.h | *.data | *.sh | Makefile)
      # it's a valid file type; now check the exclude list
      case "$f" in
        exportcode.sh | r.sh | project.report.sh)
          # excluded
          ;;
        *)
          echo $f  # output the file name
          ;;
      esac
      ;;
    *)
      # it's not a file type we care about
      ;;
  esac
done
The above code finds every file in the directory that is a .c, .h, .data, .sh, or Makefile, and then excludes the files exportcode.sh, r.sh, and project.report.sh. The matching file names are output via the "echo $f".

Thursday, November 19, 2015

Making a C-array from a sequence of binary.

I've recently been working with the SIMON cipher. There are many nice things about it from a hardware perspective. I needed to extract some bits from the paper and put them into a C format. The paper gives z as:
z = [11111010001001010110000111001101111101000100101011000011100110,
10001110111110010011000010110101000111011111001001100001011010,
10101111011100000011010010011000101000010001111110010110110011,
11011011101011000110010111100000010010001010011100110100001111,
11010001111001101011011000100000010111000011001010010011101111]
I took the above and pasted it into a file called "z.txt".
I called the following bash script cformatu.sh.
#!/bin/sh
FILE_NAME=$1 
STR_LENGTH=$(cat $FILE_NAME | tr -d " z=[],\r\n" | wc -c | tr -d '[[:space:]]')
echo "/* bit count is: $STR_LENGTH, generated by cformatu.sh */"
STR_VAL=$(cat $FILE_NAME | tr -d " z=[],\r\n" | sed 's/\(.\{1\}\)/\1,/g' | sed 's/,$//' )
echo "static const u8 z[$STR_LENGTH]={$STR_VAL};"
Running "./cformatu.sh z.txt" writes the following to stdout:
/* bit count is: 310, generated by cformatu.sh */
static const u8 z[310]={1,1,1,1,1,0,1,0,0,0,1,0,0,1,0,1,0,1,1,0,0,0,0,1,1,1,0,0,1,1,0,1,1,1,1,1,0,1,0,0,0,1,0,0,1,0,1,0,1,1,0,0,0,0,1,1,1,0,0,1,1,0,1,0,0,0,1,1,1,0,1,1,1,1,1,0,0,1,0,0,1,1,0,0,0,0,1,0,1,1,0,1,0,1,0,0,0,1,1,1,0,1,1,1,1,1,0,0,1,0,0,1,1,0,0,0,0,1,0,1,1,0,1,0,1,0,1,0,1,1,1,1,0,1,1,1,0,0,0,0,0,0,1,1,0,1,0,0,1,0,0,1,1,0,0,0,1,0,1,0,0,0,0,1,0,0,0,1,1,1,1,1,1,0,0,1,0,1,1,0,1,1,0,0,1,1,1,1,0,1,1,0,1,1,1,0,1,0,1,1,0,0,0,1,1,0,0,1,0,1,1,1,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,1,0,0,1,1,1,0,0,1,1,0,1,0,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,0,0,1,1,0,1,0,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,1,0,1,1,1,0,0,0,0,1,1,0,0,1,0,1,0,0,1,0,0,1,1,1,0,1,1,1,1};
The above is so much nicer than risking an error by typing it all out.

Wednesday, October 21, 2015

Round to a power of 2.

Sometimes you just need to round up to a power of 2; 17 becomes 32, for example.
I settled on:
unsigned int util_roundpower2(unsigned int u_power)
{
  u_power = u_power-1;
  u_power |= u_power >> 1;
  u_power |= u_power >> 2;
  u_power |= u_power >> 4;
  u_power |= u_power >> 8;
  u_power |= u_power >> 16;
  u_power = u_power+1;
  return(u_power);
}
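A quick usage sketch, assuming the function above is in scope in the same file. Note the edge cases of the bit-smearing trick: a value that is already a power of 2 is returned unchanged, and 0 (or anything above 2^31 for a 32-bit unsigned int) wraps around to 0, so guard those inputs if they matter:

#include <assert.h>

unsigned int util_roundpower2(unsigned int u_power);  /* defined above */

int main(void)
{
    assert(util_roundpower2(17) == 32);
    assert(util_roundpower2(32) == 32);  /* already a power of 2: unchanged */
    assert(util_roundpower2(0)  == 0);   /* degenerate case: wraps to 0     */
    return 0;
}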

Sunday, August 23, 2015

When in doubt, ask a mathematician.

As I have said in my previous posts, NOAA probably isn't part of a conspiracy, but they sure try to make you think they are. I have some major issues with the "Pairwise Homogenization Algorithm" as it creates residuals, which I do not believe are representative of actual changes. I'm going to table this until I can spend time with my math buddy.

Friday, August 21, 2015

NOAA gives you data that you cannot use.

NOAA releases data and example tools to process it, but they do not release data that you can use. Check this note from the "Pairwise Homogenization Algorithm" software readme:
# A New Dataset must have at least a Station List and a Data Files. At a minimum, the
# Station List defines the Station ID/Latitude/Longitude. The Data Files (one for
# each station/element - see data/benchmark/world1/monthly/raw) defines all 
# of the Monthly Temperatures in Annual Records. All files are in the GHCNMv3 format.
Now, the data that is released is in the form of years, so you cannot use the software they supply to recreate their data.
I have the Pairwise Homogenization Algorithm set up in MATLAB, and I can make it do whatever I want to the data, including making it give me totally erroneous data.

I actually emailed NOAA, and they didn't get back to me because I'm asking questions that they already have decided everyone in "the know" should know. It's always nice to get an official answer.
--mail--
To whom it may concern,

I am looking for sources for the weights for the Pairwise Homogenization Algorithm that is used in the NOAA data reported in ushcn.tavg.latest.FLs.52j.tar.gz.   I have MATLAB scripts setup to analyze the data, but I’ve been having trouble.  I believe that I cannot reproduce the data because I am missing the monthly entries, and the exported data is yearly.   Any guidance or references would be appreciated. 

Also, there are about 20% fewer weather stations in 2015 than in 1990.  Do you have a reference for this as well?  I'm just curious why there are so many fewer stations.
--end mail--
After reading a bunch of papers, I've decided that things such as infilling are pretty much magic and should be disregarded when you can actually have a meaningful dataset.
Here are all of the reasons that you might need to tune a dataset:
Changes in type of equipment.
Changes in the surroundings of the thermometer (an urban heat island, for example).
A station has consistently given bad data.
A station has been moved.

After reading through the papers that I could find, I believe that what NOAA has done is make a self-correlating time series. You can use the Pairwise Homogenization Algorithm to create a series that reproduces the very anomalies you are trying to avoid. What I really want to know is why NOAA does not release a useful monthly dataset. This is the sort of thing that makes me shake my head at the soft sciences. They might be right or wrong, but they are definitely sloppy by the standards of engineering.

Where does this leave me? I'm going to look at the actual data, remove anything that is not complete, and then plot it. Even if there are changes due to urbanization, equipment changes, and bad data: it is real data. If there is heating due to urbanization, that's just part of the heat, and that should show an increased trend in heating over years.

The new temperature.sh file is in SVN as revision 6.

Tuesday, August 18, 2015

So, where did the weather stations go and does it matter?

In an attempt to determine whether it actually matters what happened to the weather stations, I looked at the data for the most recent year (2015) and tried to match the station locations against previous years. The stations reporting in 2015 largely also appear in previous years. I plotted the ratio of available stations per year compared to 2015.
What this means is that the weather stations in 2015 are largely available in previous years, which means that they can be used to make a real comparison. I would still love to know why so many stations were removed between 1990 and 2015.

I also found a free SVN site to post my scripts: https://subversion.assembla.com/svn/noaa This post can be recreated by checking out revision 5 and running "runme.m" via MATLAB or OCTAVE after getting the data via temperature.sh

Monday, August 17, 2015

NOAA's data. 20% fewer data points?

I'm still plugging away at NOAA's data. I have not yet reached a point where I'm happy with the data, but that is mostly because it's a terrible export. It is obviously from a database, but the format is indefensible. My respect for data in the soft sciences has waned considerably.

I have come across something that is odd, but not necessarily wrong. The number of weather stations trends up, and then down. I'm not sure what to think about this yet, but there are about 20% fewer stations today than in 1990.

Please excuse the huge graphs, but it's clearer this way. Why would one remove stations if the data is valid?

Thursday, August 6, 2015

NOAA's data, get it with this script.

NOAA's weather data, as I previously mentioned, is a bit difficult. It seems to have no coherent format, and it is obviously a dump from a database. To make it just a bit better, I used a bash script to download and format the data:

temperature.sh

The script downloads the raw files to: ./temperature/downloads, and then tries to create CSV files in ./temperature/data/extracted

I've just started looking at the data, but wow... soft sciences.
The most jarring thing is that the data for a weather station is given per year. I'm sure that the internal database has monthly information, because each annual value is reported along with a count of missing months. So the value 823b means that the temperature was 8.23 degrees C with 2 months missing; the trailing letter denotes how many months are missing.
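Here is a hypothetical little parser for that encoding, based only on my reading of the format above (an integer in hundredths of a degree C, optionally followed by a letter whose position in the alphabet is the number of missing months). It is a sketch, not something lifted from NOAA's readme:

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

/* "823b" -> 8.23 C with 2 months missing (letter 'b' = 2nd letter). */
static double parse_temp(const char *field, int *missing_months)
{
    char *end;
    long hundredths = strtol(field, &end, 10);
    *missing_months = islower((unsigned char)*end) ? (*end - 'a' + 1) : 0;
    return hundredths / 100.0;
}

int main(void)
{
    int missing;
    double t = parse_temp("823b", &missing);
    printf("%.2f C, %d month(s) missing\n", t, missing);  /* 8.23 C, 2 */
    return 0;
}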

I will revisit this data when I feel that I can handle drudging through it.

I heard back from NOAA

I had a typo:
tavg = (tmax + tmin)/2
Claude
Thanks Claude! I found that I transcribed the average equation incorrectly. It happens to the best of us. Now on to look at the data. It seems that some weather stations are always below freezing. I guess you get that in Alaska.

Wednesday, August 5, 2015

NOAA's data is difficult.

When it comes to things like climate change, I'm pretty much a pragmatist. Following a black-body model: we are removing energy integrators (i.e., green space) and replacing them with concrete, while also releasing energy that was stored. Logic dictates that the entropy of the system will increase, and thereby the temperature will increase, because the Earth is not expanding much.

I started looking at weather data from NOAA when another physics friend said "wouldn't it be nice if we could repair missing data like is done with the weather?". In my field, data is what it is, and missing data is missing data. If you publish something and you extrapolate a line, you make sure that you mention how it was calculated. Usually, if we find data missing, we just take the data again. What I decided to do seemed simple:
1) Take the data from NOAA that is modified
2) Take the data from NOAA that is raw
3) Look at the difference and see if I could back extract the variables for the "Pairwise Homogenization Algorithm"

...and then you look at NOAA's data and you start to believe that there is a conspiracy. Even if the underlying data is of good quality, the released datasets are terribly formatted, or seem to be just plain incorrect. Here's an example:
USH00011084 1897   734  3  1292  3  1972  3  1786  3  2084  3  2761  3  2753  3  2547  3  2406  3  1878  3 -9999    -9999   
USH00011084 1900 -9999    -9999     1337a 3  1936  3  2378  3  2589  3  2770  3  2872  3  2700  3  2320  3  1486 3  1100  3
USH00011084 1926 -9999     1245     1251a    1781     2240     2654     2712     2763c    2770     2110     1256a    1421   
USH00011084 1927  1209     1821     1651     2183     2467     2707     2730     2594a    2579     2081     1907      871f 3
USH00011084 1928   800b    1135     1614     1711     2218     2596     2829     2817    -9999    -9999    -9999    -9999 

Here's an excerpt from the tavg dataset, where one would expect the average to be a positive number:
USH00017157 1940  -287      502     1059     1502     1822     2368     2447     2616     2161     1695     1035      897
In the line above, -287 represents a -2.87 degree average at the USH00017157 weather station in 1940. How can something defined as (tmax-tmin)/2 be negative? To NOAA's credit, they have good documentation for this formatting disaster in readme.txt.

Before I do anything else, I'm going to ask NOAA about this data.

Saturday, August 1, 2015

VREF at 1.25v?

The MAX669 datasheet says that it has a bandgap reference that puts out 1.25V. I cannot imagine how that number was actually reached, when I'd expect the bandgap output to be 1.262V. I'm sure it's a great part, but I'm reluctant to use it just because I believe a part of the datasheet to be incorrect. Most datasheets are just marketing data, but there's really not much room for fudging a bandgap reference output voltage.

Thursday, April 30, 2015

CMRF8SF: Resistor device mismatch, the saga continues, and the fix.

Previously, I mentioned that I was having issues with resistor extraction mismatch. I had some success that ended suddenly, and it forced me to look carefully at the kit. You need to preprocess the netlist exported from the schematic. The kit uses the file "cdl_processor.pl" to convert the extraction from the schematic.

The LVS release notes:
Hierarchical LVS has requirements for CDL netlist inputs which require changes in the standard output of CDL. Because of these requirements, a CDL processing program is included within the IBM design kit.

In your schematic:
IBM_PDK->Netlist->CDL
IBM_PDK->Netlist->CDL Processor for LVS
This creates a file called (schematic).netlist.lvs.
In Calibre LVS (nmLVS), under Inputs, select the "Netlist" tab, uncheck "export from schematic viewer", and specify the SPICE file as (schematic).netlist.lvs.

CMRF8SF: bondpad bad device

It took me 3 weeks to completely port an FPGA from cmos14soi to cmrf8sf, so let's say that I generally know what I'm doing in the world of VLSI layout. I have always found that the padframe takes the longest of any component to DRC/LVS clean, and it is the single most under-estimated component. Part of this is my fault, as I optimize my padframe and pads for different performance targets because I'm stuck with a few packages. If you are getting strange "bad device" errors in your bondpad cell, make sure that the M1 at the bondpad edge is connected by a contiguous ring of "BFMOAT IND". My corner cell caused the whole padframe to fail LVS; it took two days to figure that out.

Tuesday, April 28, 2015

CMRF8SF: resistor device mismatch

I use a single resistor in my padframe for my floating-gate tunneling voltage.
It did not extract. Even a cell containing only the resistor would fail Calibre LVS. After a bunch of trial and error, I realized that it had to be something crazy. I tried LVS with ic6.1.4 and it worked! It seems that resistor LVS works with every version of Cadence but 6.1.5.

Friday, April 24, 2015

CMRF8SF, wrong metals but not really, so recompile the techfile

If you dig long enough, you eventually find what you need in the documentation. When I do IC layout, I do most of my work on the lower metals, M1-M3, and then I glue them together with the higher metals. I realized that I did not have M3-M4 in the via menu, which made me very nervous because it made me wonder whether I had the wrong settings in my CMRF8SF PDK.

The answer resided in the cdslib release notes, and I failed to see it when I installed the kit months ago.
"If a design library is created and attached to cmrf8sf as the reference library, note that the cmrf8sf library has been compiled for 3-2 in LM and DM (4-2 in OL, 5-1 in AM) designs. If your location requires a different metallization, the library may be recompiled using the ASCII technology files by the library administrator."

Here's the kicker: you need access to the document "Design Kit and Technology Training CMOS8RF" to get an explanation; fortunately, I happen to know how to do this.

1. From the Cadence CIW, select Tools->Technology File Manager->Load. The "Load Technology File" window will appear.
2. Specify the ASCII Technology File: IBM_PDK/cmrf8sf/relDM/cdslib/cmrf8sf/techfile*.asc (the dialog defaults to *.tf, so select all file types).
3. Classes: choose Select All. This will load all the classes or objects in the techfile.
4. Mode: select Replace.
This will make a new tech.db file in the directory, and then you should be ready to go.

CMRF8SF, it's always that last metal: 4-1-3

130nm 8RF DM is 4thin-1thick-3rf (M1-M4,MQ,LY,E1,MA)
No one ever spells out the metal order in the design manuals.
My SKILL scripts had an error, which is now fixed, but you'd think that it'd be easier to find the metal stacks.

Saturday, April 11, 2015

Cadence and Virtuoso, versions get me.

If you are accustomed to using icadv12 and then use a kit that needs ic6, make sure you realize that they aren't compatible. IC12 didn't throw any warnings on the 8sf kit, but it generated some serious nastiness. I am so used to IC12 that I just didn't consider that the older design kit would need an older tool set.

Anyone want to give a few million to make a quality competitor for schematic and layout?

A new calendar system: Ivy Mike

I have a real issue with dating items. The biggest issue is the non-uniform formats. I got an email today that said they needed something by 08/07/2015. What does that mean? August? July? This is why I personally use the format of 08 JUL 2015.
But 2015, that is pretty arbitrary too. I could use the Islamic calendar or the Japanese calendar; however, I think a better way is to use something that will still be relevant in 1000 years: the Ivy Mike calendar.
Since Ivy Mike was detonated in 1952, that makes this year "63 IM", and the date of this post, in a usable format, 11 APR 63 IM.

Thursday, April 9, 2015

Assessing Trends in Performance per Watt for Signal Processing Applications

Now that my latest paper, Assessing Trends in Performance per Watt for Signal Processing Applications, has been accepted, I can actually talk about what was important in it. This paper is about power, so here are the important parts:
  • Processing GMAC/W calculations are approaching a Powerwall
  • Memory Power halves every 29 years, which creates a fixed power requirement
  • Non-data flow: Cache is king
The summary is that data-flow processing will have stagnant power improvements from scaling, memory transfers are of fixed power, and cache is really not helping as much as you think when it comes to relative performance per Watt.

The GMAC, everybody does it.

The Multiply ACcumulate (MAC) is a mathematical operation that computers spend a good deal of their time doing. It's used heavily in natural signal processing, such as audio, and video cards have GPUs designed to do these calculations quickly. There are two players in the marketing of the semiconductor space who have "laws" that are not actual laws, but rather trends that have been noticed: Moore and Koomey [1]. Moore said that more transistors will fit on a die, and Koomey did a survey showing that power consumption is still scaling. The assumption is that as transistors have scaled, their relative behavior has scaled, which is not true. Another item that comes into play on the marketing side is how you measure power. Most processors offload some power requirements to RAM that has sense amps just burning current to present an effective input capacitance, which makes the CPU power lower at the cost of increased RAM power. If you really want to get a good feel for the power requirements of a system, the giga-MAC per Watt (GMAC/W) is a great calculation.
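As a concrete illustration of the metric (not code from the paper), here is a small C sketch that counts multiply-accumulates in an FIR-style loop, times them, and divides by an assumed system power figure; the 10 W number is a placeholder, not a measurement:

#include <stdio.h>
#include <time.h>

#define N (1 << 20)
static float x[N], h[64], y[N];

int main(void)
{
    clock_t t0 = clock();
    for (int i = 64; i < N; i++) {
        float acc = 0.0f;
        for (int j = 0; j < 64; j++)
            acc += x[i - j] * h[j];          /* one MAC */
        y[i] = acc;
    }
    double seconds = (double)(clock() - t0) / CLOCKS_PER_SEC;
    if (seconds <= 0.0) seconds = 1e-9;      /* guard against a too-coarse clock */
    double macs    = (double)(N - 64) * 64.0;
    double watts   = 10.0;                   /* assumed measured system power */
    printf("%.3f GMAC/W\n", macs / seconds / 1e9 / watts);
    return 0;
}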

In the graph above, I present Koomey's system trend and then the trend specific to GMAC/W. Koomey included systems until 2006, but did not focus specifically on data-flow applications. If you are trying to do things like audio, video games, watching the sky, processing speech, RADAR, and just about any natural system, you are burdened by how much data you can move through the system, and by how to get the heat out and the data and power in. I am confident that for digital approaches there exists a Powerwall at between 10 and 30 GMAC/W, i.e., the end of useful scaling for power-constrained systems.

When I first started looking at this with Dr. Marr, it was because of a program called DARPA ACT. The question was whether the specs were actually achievable with classical digital approaches, and I happen to play in the space of non-classical digital and analog processing. The idea of ACT is to build a phased array system where the amount of bandwidth and the number of beams that can be formed scale per Watt. This would allow powerful phased array technology to proliferate in many types of devices, not just expensive military radars. Also, the cloud computing revolution depends, in part, on Koomey's law continuing; if we are to cloud compute with mobile devices, we must be able to transmit, receive, and process wireless data at ever smaller energy levels, assuming fixed or slowly growing energy density in batteries, which so far has been the case.

Conclusion: DARPA ACT will not hit their power target with classical digital approaches.
I have not yet seen the final reports, but I hope the teams prove me wrong. I doubt it.

Memory Power is stagnant?

Memory transfer power halves every 29 years (roughly a 2.4% improvement per year), which makes memory power a fixed cost for data-flow systems. If you have data that is constantly changing, you are burdened by the speed and power of the data bus.

When it comes to memory, it really depends on architecture. As far as improvements, it's life in the slow lane.

What about measured performance in a non-dataflow system?

In a non-dataflow system, cache is king. If you cannot get data in faster, you need to keep it local and handy. I looked into what SPECInt2006 had to say about this using the Stanford CPUDB. If data transfer power is constrained by IO, you need to have a lot of cache to minimize non-data instructions coming into the system, but what about power? Using the data in the CPUDB, I plotted the SPECInt2006 basemean score per Watt, and then normalized it for processor cache.
I believe that these improvements in SPECInt2006 score are driven by increases in processor cache, because the graph shows almost no trend once cache is normalized across the processors. As data-flow applications are not helped by substantial increases in cache size, I believe there is little year-to-year improvement in score per Watt due to scaling.
However, it does make a good argument for simpler code. When you consider how large programs have gotten relative to cache, the Computer Science community should consider trimming the fat a bit.

Appendix

Of course, the data for the first figure, and the references.

year  node (nm)  processor                GMAC/W    ref
1980  3000       IBM PC-XT                0.000001  [19]
1980  3000       IBM PC                   0.000002  [19]
1982  3000       commodore 64             0.000000  [20]
1984  3500       MACINTOSH 128K           0.000005  [21,22]
1985  1500       IBM PC-AT                0.000002  [23]
1988  2000       MC68030                  0.000017  [24]
1991  1000       INTEL 486                0.000065  [25]
1991  1000       IBM PS/2E                0.000086  [25]
1995  500        Pentium Pro 200          0.000182  [26,27]
1992  750        DEC Alpha 21064 EV4      0.002632  [28]
1996  350        DEC Alpha 21264 EV6      0.001989  [28,29]
1997  200        Early PowerPC            0.024056  [30]
1998  250        Pentium III 600B         0.003043  [31]
2001  180        PPC 7410                 0.014720  [32]
2003  130        PPC 7447                 0.017355  [33]
2003  130        Pentium 4 ee HT 3.2      0.006080  [34]
2003  130        TMS-320C6412-500         0.235215  [35,36]
2003  90         TMS-320C6455-1000        0.210843  [37]
2004  90         POWER5                   0.133000  [38,39]
2006  65         Athlon X2 BE2300         0.029556  [40]
2006  65         Core 2 Extreme X6800     0.027375  [41]
2006  65         Core 2 QX6700            0.028711  [42]
2007  65         Pentium Dual-Core E2140  0.017231  [43]
2007  90         Monarch                  0.746592  [44,45]
2007  65         IBM Cell                 0.248889  [46]
2007  90         PPC 8641D                0.027364  [47]
2007  90         TMS-320C6424-700         0.341988  [48]
2007  65         POWER6                   0.063000  [39,49]
2008  45         Core 2 T9400             0.050660  [50,51]
2008  45         Atom 230                 0.140000  [52]
2008  65         TMS-320C6474-850         0.153879  [53]
2008  65         TMS-320C6474-1000        0.166667  [53]
2008  65         TMS-320C6474-1200        0.193846  [53]
2008  40         Tesla C2050              0.757647  [54,55]
2008  40         Tesla C2075              0.838698  [54]
2008  40         Tesla M2050              0.801422  [54]
2008  40         Tesla M2070              0.801422  [54]
2008  40         Tesla M2090              1.035378  [54,56]
2008  45         Core i7-965 ee           0.034462  [57]
2009  65         TMS-320C6748             1.360703  [58]
2008  45         Atom N270                0.224000  [59]
2009  90         Tile64                   0.712727  [60]
2009  40         ARM Cortex-A             1.085000  [61]
2010  NA         Qualcomm Snapdragon      0.234500  [62]
2010  NA         Tegra 3                  0.101500  [62]
2010  45         POWER7                   0.112000  [39,63]
2010  40         Virtex 6                 4.736630  [64]
2010  32         Core i7 980 ee           0.107682  [65]
2010  40         Fermi GTX480             0.941472  [54]
2011  40         Fermi GTX580             0.566977  [54,66]
2011  40         SPARC T4                 0.069953  [67]
2012  40         Virtex 7                 7.269231  [64]
2012  28         Fermi GTX680             2.773465  [54,68]
2012  32         Atom D2550               0.130200  [69]
2013  22         POWER8                   0.168000  [39,70]
2013  22         SPARC T5                 0.168000  [71]
2013  22         Core i7-4960X            0.116308  [72]
2013  40         TMS320C6678-1250         0.823529  [73]
2013  28         R9 290x                  3.942400  [74]
2013  28         Fermi GTX780             3.528000  [75]
2013  40         TILE-Gx72                0.806400  [76]
2014  28         Titan Black              3.584448  [77]
2014  22         E3-1284LV3               0.190638  [4]
2015  14         Kintex Ultrascale        4.806519  [64]

References

[1]    Jonathan G Koomey, Christian Belady, Michael Patterson, Anthony Santos, and Klaus-Dieter Lange, “Assessing trends over time in performance, costs, and energy use for servers,” Lawrence Berkeley National Laboratory, Stanford University, Microsoft Corporation, and Intel Corporation, Tech. Rep, 2009.

[2]    Ariel Bleicher, “5G service on your 4G phone?,” in IEEE Spectrum. IEEE, 2014.

[3]    Chris Auth, “22-nm fully-depleted tri-gate cmos transistors,” in Custom Integrated Circuits Conference (CICC), 2012 IEEE. IEEE, 2012, pp. 1–6.

[4]    S Narasimha, P Chang, et al., “22nm high-performance soi technology featuring dual-embedded stressors, epi-plate high-k deep-trench embedded dram and self-aligned via 15lm beol,” in Electron Devices Meeting (IEDM), 2012 IEEE International. IEEE, 2012, pp. 3–3.

[5]    C Auth, C Allen, et al., “A 22nm high performance and low-power CMOS technology featuring fully-depleted tri-gate transistors, self-aligned contacts and high density mim capacitors,” in VLSI Technology (VLSIT), 2012 Symposium on. IEEE, 2012, pp. 131–132.

[6]    Christopher S Wallace, “A suggestion for a fast multiplier,” Electronic Computers, IEEE Transactions on, , no. 1, pp. 14–17, 1964.

[7]    Kaizad Mistry, C Allen, et al., “A 45nm logic technology with high-k+ metal gate transistors, strained silicon, 9 cu interconnect layers, 193nm dry patterning, and 100% pb-free packaging,” in Electron Devices Meeting, 2007. IEDM 2007. IEEE International. IEEE, 2007, pp. 247–250.

[8]    Peng Bai, C Auth, et al., “A 65nm logic technology featuring 35nm gate lengths, enhanced channel strain, 8 cu interconnect layers, low-k ild and 0.57 μm2sram cell,” in Electron Devices Meeting, 2004. IEDM Technical Digest. IEEE International. IEEE, 2004, pp. 657–660.

[9]    K Cheng, A Khakifirooz, et al., “Extremely thin soi (etsoi) CMOS with record low variability for low power system-on-chip applications,” in Electron Devices Meeting (IEDM), 2009 IEEE International. IEEE, 2009, pp. 1–4.

[10]    Himanshu Kaul, Mark Anders, et al., “A 1.45 ghz 52-to-162gflops/w variable-precision floating-point fused multiply-add unit with certainty tracking in 32nm CMOS,” in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International. IEEE, 2012, pp. 182–184.

[11]    Steven Hsu, Amit Agarwal, Mark Anders, Sanu Mathew, Himanshu Kaul, Farhana Sheikh, and Ram Krishnamurthy, “A 280mv-to-1.1 v 256b reconfigurable simd vector permutation engine with 2-dimensional shuffle in 22nm CMOS,” in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International. IEEE, 2012, pp. 178–180.

[12]    Jeffery A Davis, Vivek K De, and James D Meindl, “A stochastic wire-length distribution for gigascale integration (GSI)–part ii: Applications to clock frequency, power dissipation, and chip size estimation,” IEEE Transactions on Electron Devices, vol. 45, no. 3, pp. 590–597, 1998.

[13]    T.H. Ning, “A perspective on the theory of MOSFET scaling and its impact,” Solid-State Circuits Society Newsletter, IEEE, vol. 12, no. 1, pp. 27–30, Winter 2007.

[14]    TA Fjeldly and M. Shur, “Threshold voltage modeling and the subthreshold regime of operation of short-channel MOSFETs,” IEEE Transactions on Electron Devices, vol. 40, no. 1, pp. 137–145, 1993.

[15]    Mark Horowitz, “Computing’s energy problem:(and what we can do about it),” 2014, International Solid-State Circuits Conference (ISSCC).

[16]    Ron Kalla, Balaram Sinharoy, and Joel Tendler, “Simultaneous multi-threading implementation in POWER5,” in Conference Record of Hot Chips, 2003, vol. 15.

[17]    Andrew Danowitz, Kyle Kelley, James Mao, John P Stevenson, and Mark Horowitz, “CPUDB: recording microprocessor history,” Communications of the ACM, vol. 55, no. 4, pp. 55–63, 2012.

[18]    John L Henning, “SPEC CPU2006 benchmark descriptions,” ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1–17, 2006.

[19]    Intel, 8088 8-bit HMOS microprocessor, 8 1990.

[20]    Commodore Semiconductor Group, 6510 microprocessor, 1984.

[21]    Motorola, 68000 programmer’s reference manual, 1 1990, Rev. 1.0.

[22]    Motorola, M68000 8-/16-/32-BIT microprocessors user’s manual, Motorola Inc., 1991.

[23]    Intel Corp., Iapx 286 Hardware Reference Manual, Number 1983/210760-001. Intel Books, Berkeley, CA, USA, 1984.

[24]    Motorola, MC68030 enhanced 32-bit microprocessor user’s manual, Prentice-Hall, 1990.

[25]    Intel Corp., i486 Microprocessor Programmer’s Reference Manual, Osborne/McGraw-Hill, Berkeley, CA, USA, 1990.

[26]    “Intel Pentium Pro processors,” http://www.intel.com/support/processors/Pentiumpro/sb/cs-011161.htm, Accessed: 2014-05-06.

[27]    “Intel Pentium Pro 200,” http://ark.intel.com/products/49952/Intel-Pentium-Pro-Processor-200-MHz-256K-Cache-66-MHz-FSB, Accessed: 2014-05-06.

[28]    Richard E Kessler, Edward J McLellan, and David A Webb, “The alpha 21264 microprocessor architecture,” in Computer Design: VLSI in Computers and Processors, 1998. ICCD’98. Proceedings. International Conference on. IEEE, 1998, pp. 90–95.

[29]    Compaq Computer Corporation, Alpha 21264 Microprocessor DataSheet, 2 1999, Rev. 1.0.

[30]    IBM Microelectronics Division, PowerPC 740 and PowerPC 750 Microprocessor Datasheet, 6 2002, Rev. 2.0.

[31]    “Intel® Pentium® III processor 600 MHz, 512K cache, 133 MHz FSB,” http://ark.intel.com/products/27546/Intel-Pentium-III-Processor-600-MHz-512K-Cache-133-MHz-FSB, Accessed: 2014-05-06.

[32]    Freescale Semiconductor, “MPC7410 RISC Microprocessor Hardware Specifications,” 2005.

[33]    Freescale Semiconductor, “MPC7447 RISC Microprocessor Hardware Specifications,” 2005.

[34]    “Pentium® 4 processor extreme edition supporting HT technology 3.20 GHz, 2M cache, 800 MHz FSB,” http://ark.intel.com/products/27489/Pentium-4-Processor-Extreme-Edition-supporting-HT-Technology-3_20-GHz-2M-Cache-800-MHz-FSB, Accessed: 2014-05-06.

[35]    Texas Instruments, “TMS320C6412 power consumption summary,” http://www.ti.com/lit/an/spra967e/spra967e.pdf, 2005.

[36]    Texas Instruments, “TMS320C6412 fixed-point digital signal processor,” http://www.ti.com/lit/ds/symlink/tms320c6412.pdf, 2005, SPRS219G.

[37]    Texas Instruments, “TMS320C6455 fixed-point digital signal processor,” http://www.ti.com/lit/ds/symlink/tms320c6455.pdf, 2012, SPRS276M.

[38]    Ron Kalla, Balaram Sinharoy, and Joel M Tendler, “IBM POWER5 chip: A dual-core multithreaded processor,” Micro, IEEE, vol. 24, no. 2, pp. 40–47, 2004.

[39]    IBM, “IBM POWER systems, hardware deep dive,” http://www-05.ibm.com/cz/events/febannouncement2012/pdf/power_architecture.pdf, 2013.

[40]    “AMD Athlon BE-2300,” http://products.amd.com/pages/desktopcpudetail.aspx?id=394, Accessed: 2014-05-06.

[41]    “Intel® Core 2 X6800 (4M cache, 2.93 GHz, 1066 MHz FSB),” http://ark.intel.com/products/27258/Intel-Core2-Extreme-Processor-X6800-(4M-Cache-2_93-GHz-1066-MHz-FSB), Accessed: 2014-05-06.

[42]    “Intel® Core 2 QX6700 (8M cache, 2.66 GHz, 1066 MHz FSB),” http://ark.intel.com/products/28028/Intel-Core2-Extreme-Processor-QX6700-8M-Cache-2_66-GHz-1066-MHz-FSB, Accessed: 2014-05-06.

[43]    “Intel® Pentium® processor E2140 (1M cache, 1.60 GHz, 800 MHz FSB),” http://ark.intel.com/products/29738/Intel-Pentium-Processor-E2140-(1M-Cache-1_60-GHz-800-MHz-FSB), Accessed: 2014-05-06.

[44]    John Granacki and Mike Vahey, “Monarch: A MOrphable Networked micro-ARCHitecture,” Tech. Rep., USC/Information Sciences Institute and Raytheon, 2002.

[45]    Michael Vahey et al., “MONARCH: a first generation polymorphic computing processor,” in 10th High Performance Embedded Computing Workshop, 2006.

[46]    TR Maeurer and D Shippy, “Introduction to the cell multiprocessor,” IBM journal of Research and Development, vol. 49, no. 4, pp. 589–604, 2005.

[47]    Freescale Semiconductor, MPC8641 and MPC8641D Integrated Host Processor Hardware Specifications, 7 2009, MPC8641DEC.

[48]    Texas Instruments, “TMS320C6424 fixed-point digital signal processor,” http://www.ti.com/lit/ds/symlink/tms320c6424.pdf, 2009, SPRS347D.

[49]    Hung Q Le, William J Starke, J Stephen Fields, Francis P O’Connell, Dung Q Nguyen, Bruce J Ronchetti, Wolfram M Sauer, Eric M Schwarz, and Michael T Vaden, “IBM POWER6 microarchitecture,” IBM Journal of Research and Development, vol. 51, no. 6, pp. 639–662, 2007.

[50]    Ronak Singhal, “Inside intel core microarchitecture (nehalem),” in A Symposium on High Performance Chips, 2008, vol. 20.

[51]    “Intel® Core 2 Duo T9400 (6M cache, 2.53 GHz, 1066 MHz FSB),” http://ark.intel.com/products/35562/Intel-Core2-Duo-Processor-T9400-(6M-Cache-2_53-GHz-1066-MHz-FSB), Accessed: 2014-05-06.

[52]    “Intel® Atom processor 230 (512K cache, 1.60 GHz, 533 MHz FSB),” http://ark.intel.com/products/35635/Intel-Atom-Processor-230-(512K-Cache-1_60-GHz-533-MHz-FSB), Accessed: 2014-05-06.

[53]    Texas Instruments, “TMS320C6474 fixed-point digital signal processor,” http://www.ti.com/lit/ds/symlink/tms320c6474.pdf, 2011, SPRS552H.

[54]    C. Woolley, “CUDA overview,” http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/02-cuda-overview.pdf, 2011.

[55]    “Tesla C2050 and Tesla C2070 computing processor board,” http://www.nvidia.com/docs/IO/43395/Tesla_C2050_Board_Specification.pdf, 2010, BD-04983-001-v02.

[56]    “Tesla M2090 dual-slot computing processor module,” http://www.nvidia.com/docs/IO/43395/Tesla-M2090-Board-Specification.pdf, 2010, BD-05766-001-v02.

[57]    “Intel® Core i7-965 processor extreme edition,” http://ark.intel.com/products/37149/Intel-Core-i7-965-Processor-Extreme-Edition-8M-Cache-3_20-GHz-6_40-GTs-Intel-QPI, Accessed: 2014-05-06.

[58]    Texas Instruments, “TMS320C6448 fixed-point digital signal processor,” http://www.ti.com/lit/ds/symlink/tms320c6748.pdf, 2013, SPRS590E.

[59]    “Intel® Atom processor N270 (512K cache, 1.60 GHz, 533 MHz FSB),” http://ark.intel.com/products/36331/Intel-Atom-Processor-N270-512K-Cache-1_60-GHz-533-MHz-FSB, Accessed: 2014-05-06.

[60]    Shane Bell, Bruce Edwards, John Amann, Rich Conlin, Kevin Joyce, Vince Leung, John MacKay, Mike Reif, Liewei Bao, John Brown, et al., “Tile64-processor: A 64-core SoC with mesh interconnect,” in Solid-State Circuits Conference, 2008. ISSCC 2008. Digest of Technical Papers. IEEE International. IEEE, 2008, pp. 88–598.

[61]    ARM, “Cortex-A5,” http://www.arm.com/products/processors/cortex-a/cortex-a5.php, 2009, DDI 0433A.

[62]    L. Lewins, “Performance of low power ARM processors,” in Raytheon Information Systems and Computing (ISAC) Symposium. Raytheon, 2012.

[63]    DF Wendel, J Barth, DM Dreps, S Islam, J Pille, and JA Tierno, “IBM POWER7 processor circuit design,” IBM Journal of Research and Development, vol. 55, no. 3, pp. 1–1, 2011.

[64]    Xilinx, “Xilinx power estimator (xpe),” http://www.xilinx.com/products/design_tools/logic_design/xpe.htm.

[65]    “Intel® Core i7-980X processor extreme edition,” http://ark.intel.com/products/47932/Intel-Core-i7-980X-Processor-Extreme-Edition-(12M-Cache-3_33-GHz-6_40-GTs-Intel-QPI), Accessed: 2014-05-06.

[66]    “GeForce GTX 580,” http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-580/specifications, Accessed: 2014-05-06.

[67]    Manish Shah et al., “SPARC t4: A dynamically threaded server-on-a-chip,” IEEE Micro, vol. 32, no. 2, pp. 0008–19, 2012.

[68]    “GeForce GTX 680,” http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-680/specifications, Accessed: 2014-05-06.

[69]    “Intel® Atom processor D2550 (1M cache, 1.86 GHz),” http://ark.intel.com/products/65470/Intel-Atom-Processor-D2550-1M-Cache-1_86-GHz, Accessed: 2014-05-06.

[70]    Eric J Fluhr et al., “5.1 POWER8 tm: A 12-core server-class processor in 22nm soi with 7.6 tb/s off-chip bandwidth,” in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014 IEEE International. IEEE, 2014, pp. 96–97.

[71]    John Feehrer, Sumti Jairath, Paul Loewenstein, Ram Sivaramakrishnan, David Smentek, Sebastian Turullols, and Ali Vahidsafa, “The oracle SPARC T5 16-core processor scales to eight sockets,” Micro, IEEE, vol. 33, no. 2, pp. 48–57, 2013.

[72]    “Intel® Core i7-4960X processor extreme edition (15M cache, up to 4.00 GHz),” http://ark.intel.com/products/77779/Intel-Core-i7-4960X-Processor-Extreme-Edition-15M-Cache-up-to-4_00-GHz, Accessed: 2014-05-06.

[73]    Texas Instruments, “TMS320C6678 Multicore Fixed and Floating-Point Digital Signal Processor,” http://www.ti.com.cn/cn/lit/ds/symlink/tms320c6678.pdf, 2013, SPRS691D.

[74]    “Radeon R9,” http://www.amd.com/en-us/products/graphics/desktop/r9, Accessed: 2014-05-06.

[75]    “GeForce GTX 780,” http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-780/specifications, Accessed: 2014-05-06.

[76]    M. Mattina, “Architecture and performance of the Tilera TILE-Gx8072 manycore processor,” http://www.hoti.org/hoti21/slides/Mattina.pdf, 2013, Tilera.

[77]    “GeForce GTX 700 titan black,” http://www.nvidia.com/gtx-700-graphics-cards/gtx-titan-black/, Accessed: 2014-05-06.