Last week I attended the latest in the NYC Google Speaker event series: Luiz Barroso, Google Distinguished Engineer, talking about “Watts, Faults, and Other Fascinating Dirty Words Computer Architects Can No Longer Afford to Ignore”. Luiz was talking about things well beyond my skill or experience, but I still got some great insights into designing hardware and infrastructure. That’s mainly why I go to things like this - it’s new information that expands what I know a little about. So here are some notes from the talk; there was a lot more to it than I noted.
. Power and energy usage have not been very sexy topics when it comes to designing architecture, and that has caught up with people. They are now the centre of plenty of attention.
. In the 90’s there were two big research areas: the MHz race and the DSM (Distributed Shared Memory) race. The first aimed to accelerate single-thread performance, the second to improve the efficiency of shared memory.
. Moore’s law is fundamentally about transistors. The issue is becoming power: transistors are energy wasteful and temperature control is difficult. Power costs are increasing and look like becoming more expensive than the hardware itself. The industry may tend towards the mobile model, where you buy an energy contract and get the hardware thrown in for free.
. They are focusing on reducing conversion losses and improving power conversion. On PCs the power supply consumes much of the energy, running at only 55-70% efficiency.
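To put those efficiency figures in perspective, here is a small sketch (my own arithmetic, not from the talk) of what a 55% vs 70% efficient supply wastes for a fixed load:

```python
def wall_power(load_w, efficiency):
    """Watts drawn from the wall to deliver load_w to the components."""
    return load_w / efficiency

# compare the two ends of the 55-70% range for a 100 W DC-side load
for eff in (0.55, 0.70):
    drawn = wall_power(100, eff)
    print(f"{eff:.0%}: draws {drawn:.1f} W, wastes {drawn - 100:.1f} W as heat")
```

At 55% a 100 W load wastes over 80 W as heat; at 70% it still wastes about 43 W, so conversion is a big lever.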
. Multi-core processors help reduce energy use, but you need to design software differently to take advantage of them, building efficient concurrent programs.
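As a structural sketch of that “design differently” point (my own illustration, not Luiz’s code): carve the work into independent chunks and hand them to a worker pool. Note that in CPython a thread pool only overlaps I/O because of the GIL; a process pool gives true CPU parallelism, but the shape of the program is the same.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # each task handles one slice of the input independently
    return sum(x * x for x in chunk)

# split the problem into independent chunks, one per worker
chunks = [range(i, i + 250_000) for i in range(0, 1_000_000, 250_000)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # map the chunks across the pool and combine the partial results
    total = sum(pool.map(partial_sum, chunks))
```

The key design shift is that the work must be decomposable into pieces with no shared mutable state, so cores can proceed without waiting on each other.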
. Google has been monitoring disk failure. Common wisdom is that the failure rate is <1% and that temperature is a big factor. So we looked at 100k+ drives over 5 years. Failure rates were ~8% after 2 years, far larger than the manufacturers' rates, and temperature did not appear to affect the rate. Trying to find a predictive algorithm has had little success: more than half the disk failures happened with no indicative errors, and the arrival of errors did not indicate time to failure. The models are good for predicting population-wide trends, i.e. how many failures you will have and how many replacement disks you need. And also for telling you that temperature does not matter that much.
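That population-wide use is simple to illustrate (a hypothetical fleet of my own invention, using the ~8% rate from the talk):

```python
fleet_size = 100_000   # hypothetical fleet size
afr = 0.08             # ~8% annualized failure rate observed after year 2

# you can't say WHICH disks will die, but you can budget for how many
expected_failures = fleet_size * afr
print(f"budget roughly {expected_failures:.0f} replacement disks this year")
```

That is exactly the distinction the talk drew: useless for predicting an individual drive, useful for provisioning spares.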
. Looking at power requirements, the average data centre costs $10-22 per watt of capacity to build, whereas US energy averages about $0.80/watt/year. It costs more to build a data centre than to power it for 10 years. You have to optimise energy usage to be close to capacity, thinking about power provisioning, how many machines can be used, and what the unused watts cost.
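The 10-year claim falls straight out of those numbers (rough arithmetic of my own, ignoring cooling overheads and price changes):

```python
build_cost_per_watt = (10, 22)      # $ per watt of provisioned capacity
energy_cost_per_watt_year = 0.80    # US average, $/watt/year
years = 10

# a decade of power per provisioned watt
energy_cost = energy_cost_per_watt_year * years
low, high = build_cost_per_watt
print(f"build: ${low}-{high}/W vs power: ${energy_cost:.0f}/W over {years} years")
```

Even the cheapest build ($10/W) exceeds ten years of energy ($8/W), which is why every unused provisioned watt is so expensive.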
. Studying power usage, we found the data centre never hit peak capacity, even if a rack on its own could have. A PC uses about 60W at idle and 120W at full usage; a human uses around 60W at rest and 1200W at high exertion. Humans are far more energy-proportional - machines have a factor of 2 between idle and peak, humans a factor of 20. To improve energy efficiency for data centres, we should focus on reducing idle power usage.
. So by reducing idle power, with no change in peak, you can get 40-50% savings. Reducing per-machine idle consumption also reduces the peak power requirements for the data centre as a whole.
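A toy model of that savings claim (the utilization figure and the improved idle figure are my own assumptions, with power taken to scale linearly between idle and peak):

```python
def machine_power(idle_w, peak_w, utilization):
    # assume power grows linearly from idle to peak with utilization
    return idle_w + utilization * (peak_w - idle_w)

util = 0.3                               # assumed low average utilization
baseline = machine_power(60, 120, util)  # today's PC: 60 W idle, 120 W peak
improved = machine_power(10, 120, util)  # hypothetical near-proportional machine
savings = 1 - improved / baseline
print(f"{savings:.0%} energy savings at {util:.0%} utilization")
```

With these assumed numbers the model lands at roughly 45%, inside the 40-50% range from the talk: because machines spend most of their time near idle, cutting idle draw attacks exactly where the watts go.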