Dec 6, 2007

Internet Data Center


Download our Solutions Brief for Internet Data Centers
As video-based services and multimedia content expand, the Web is delivering a host of advanced communications, collaboration, sales, and production applications for businesses, as well as new services for consumers. For Internet service providers, this represents a large opportunity, provided their data centers are up for the challenge.

Woven Delivers 10 GE for Internet Data Centers
Woven Systems™ eliminates scalability barriers and minimizes latency with its 10 GE EFX 1000 Ethernet Fabric Switch, which is designed to meet the widest range of application requirements. Internet service providers can now scale their existing infrastructures using Ethernet and achieve performance that previously only complex Layer 3 networks could deliver. Woven's high-port-count, low-latency switch uses an innovative method to build massively scalable multi-path 10 Gigabit Ethernet (GE) fabrics that simplify Internet data center infrastructures and revolutionize service performance. Plus, Woven's unique vSCALE™ technology continuously monitors congestion and automatically re-balances traffic to less congested paths without dropping or re-ordering packets, further enhancing performance and throughput.

Resolving Ethernet’s Scalability Challenge
As web-based services proliferate, so does the number of server connections on network switches. This strains network infrastructure and complicates power and cooling requirements. Driven by video applications, demand for more Internet bandwidth and 10 GE connections is increasing exponentially. However, scaling beyond the limits imposed by Ethernet Spanning Tree requires Layer 3 routing, which adds a new level of complexity to Internet service providers' networks.
Woven’s Ethernet fabric switch solutions eliminate Ethernet's traditional scalability limitations, allowing operators to build massively scalable networks. These switches can be interconnected in a multi-path, meshed topology to create 10 GE fabrics that scale to more than 4000 fully non-blocking 10 GE edge ports. As a result, this resilient network structure is not limited by the bandwidth capacity of a single switch, enabling Internet service providers to deploy new services without having to re-engineer their networks.
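To make the scaling claim concrete, the sketch below works through the edge-port arithmetic for a generic two-tier, non-blocking leaf/spine fabric; the switch radix values are illustrative assumptions, not Woven EFX 1000 specifications.

# Hypothetical sketch: edge-port count of a two-tier, non-blocking leaf/spine
# Ethernet fabric built from fixed-radix switches. The radix values below are
# assumed parameters for illustration, not product specifications.

def nonblocking_edge_ports(radix: int) -> int:
    """Two-tier folded Clos: each leaf devotes half its ports to servers and
    half to spine uplinks, and the spine radix caps the number of leaves."""
    leaf_down = radix // 2      # server-facing ports per leaf
    max_leaves = radix          # one uplink from every leaf to every spine port
    return leaf_down * max_leaves

for radix in (64, 96, 144):
    print(radix, "-port switches ->", nonblocking_edge_ports(radix), "non-blocking edge ports")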

Lower Total Cost of Ownership
As customer and application demands increase, servers will increasingly require 10 GE connectivity. Woven’s 10 GE fabric solutions deliver unsurpassed performance, scalability and manageability using familiar, standards-compliant Ethernet. Woven solutions do not require specialized expertise, re-building existing applications, or complex re-routing and re-configuration. With the highest number of 10 GE ports per chassis at one-third the rack space and one-fifth the power consumption of existing switching solutions, Woven delivers the lowest total cost of ownership.

Deploy New Services Fast
Woven can reduce service deployment schedules by weeks or months, because one Ethernet fabric greatly simplifies launching new applications and services. Applications developed on Ethernet can run on a Woven fabric without having to be re-built. In fact, applications can be built on smaller, less costly systems and moved to a large infrastructure with confidence.

High Performance Computing


Download our Solutions Brief for HPC
Download the Sandia Benchmark Tech Note
Woven Systems™ makes it easier to build High Performance Computing (HPC) cluster systems with its 10 Gigabit Ethernet (GE) fabric switching architecture. Optimized for low latency and delivering massive scalability, the Woven solution allows multiple Ethernet switches to interconnect to form a fabric that scales to over 4000 non-blocking 10 GE ports. HPC system designers can now achieve ultra-high performance without compromising compatibility.

Improved Performance
Woven integrates your storage and computing resources on a single, scalable, high-performance fabric to eliminate the difficulties imposed by proprietary interconnects. With those interconnects, heavy traffic loads create congestion and poor performance in the underlying transport network. If a network node fails, applications running on the fabric can fail because the fabric topology mapping changes and must be remapped by the subnet manager. Re-tuning the network to accommodate a new application is usually a manual, time-consuming process and in many cases too difficult to achieve.
With Woven, on the other hand, a single high-performance 10 GE fabric connects storage and computing, so standard TCP/IP applications and services can run at maximum performance under load. Plus, Woven's unique vSCALE technology continuously monitors congestion and automatically re-balances traffic to less congested paths without dropping or re-ordering packets.

Balanced Systems
Woven also makes it easier to balance storage, computing, and interconnect resources for HPC applications. If a particular application requires higher network bandwidth, the Woven fabric can be partitioned to guarantee resources for that application. No other interconnect fabric offers the flexible assignment and fine control of network resources for guaranteeing application performance.

Faster System Deployment
Although HPC system designers increasingly use commodity-based hardware to reduce costs and deploy systems faster, the underlying interconnect networks continue to introduce unnecessary complexity, which slows system deployment. Woven reduces new deployment schedules by weeks or months, because using a single Ethernet fabric greatly simplifies building new clusters. No additional IT resources are needed to deploy new clusters and re-distribute computing problems from other systems. Applications can be built on smaller, less costly systems, validated, and moved to a larger system with confidence.

Dec 5, 2007

Upload speed is becoming ever more important in the consumer market, too... network investment strategies need to change

telephonyonline.com, 2007/11/19

Until now, carriers have focused almost exclusively on raising download speeds as the way to improve the broadband user experience. Most Internet traffic, after all, flowed from the network down to the user.

Upstream speed, that is, data uploaded from the user to the network, was given no consideration at all, and P2P applications and illegal uploads were treated as a gray area. Since the prevailing view was that P2P traffic distributes content illegally and consumes a great deal of bandwidth, carriers saw no need to invest in their networks to support it.

Legitimate upload traffic, however, is steadily increasing. The following examples show why demand for upstream capacity will grow.

- According to research by the Yankee Group, digital cameras rank second and third on the list of products consumers plan to buy over the next six months, which suggests that many consumers want to store their photos online.
- An average of 65,000 new clips are posted to the video-sharing site YouTube every day.
- The US news channel CNN regularly airs footage shot by amateurs and uploaded to the I-Report section of its website.
- The number of people in the US and Canada who work from home for at least part of the workday is growing rapidly, as more companies recognize that telecommuting does a great deal for employee satisfaction.

Responding to this demand, Verizon, the second-largest US carrier, recently raised the upload speed of its FiOS broadband network to 20Mbps, the same as its download speed. This symmetric service, the first of its kind in the US consumer market, is offered in Connecticut, New Jersey, and New York. Beyond the faster uploads, Verizon also provides 1GB of free online storage, expandable to 50GB.

So what will users do with 20Mbps of upload speed? If it goes largely unused, the effort to boost bandwidth will have been wasted. But faster uploads should not stop at benefiting users of BitTorrent, the P2P service already in use worldwide; what matters more is that carriers keep pushing in this direction.

Full-time telecommuters have no commute, no dress code, and full control of their schedules, but their Internet connectivity is constrained. They can of course use today's residential broadband services, but upload speeds are only 1~3Mbps, and even that capacity is shared.

Consider the value to telecommuters if the home had the same connectivity as the office. In the enterprise market, data traffic patterns are already changing along with the growth of virtualization. A large office no longer has to house enormous databases on site; just as companies can distribute their data, they can distribute their employees around the world.

It is time to stop dismissing upstream traffic as a necessary evil. Legitimate uploading is becoming the norm, and it will increasingly be something users are willing to pay for.

Nov 29, 2007

Power efficiency: the problem global companies now have in common

Chosun Ilbo, Nov 29, 2007
Google to invest hundreds of millions of dollars in renewable energy
HP plans to save hundreds of thousands of dollars with solar power
MS considering an IDC in Siberia, where cooling is more efficient


Global companies are drawing attention with a string of new strategies for maximizing power efficiency. Google and HP have decided to cut their electricity costs with green energy, while MS plans to build a facility in the frozen ground of eastern Siberia, Russia, to maximize data center cooling efficiency.

◆ Google: hundreds of millions of dollars for renewable energy technology
Google, the world's largest Internet search engine, announced on the 27th (local time) that it will invest hundreds of millions of dollars over the next few years in renewable energy generation such as solar, wind, and geothermal power, and attention has turned to the reasons behind the move.

In a press release distributed to local media that day, Google co-founder Larry Page said the company "will invest hundreds of millions of dollars in the Renewable Energy Cheaper Than Coal project (http://www.google.com/renewable-energy), known as 'REC'," and that it "will focus on developing alternative energy sources such as advanced solar thermal power, wind power, and advanced geothermal generation."

The broad outline of the project, which also targets advances in hybrid and electric vehicles and maximizing the energy efficiency of Google's data centers, was already introduced last spring.

The core aim of the newly announced project is to develop eco-friendly renewable energy at a lower price than power from coal-fired plants, which drive global warming and other environmental damage. Google plans to hire 20 to 30 specialists in the field by next year.

Page said the goal is "to produce one gigawatt of renewable energy capacity" and that he is "optimistic this can be achieved within years, at a cost below that of coal." One gigawatt is enough electricity to supply the entire city of San Francisco.

"Oil, which accounts for some 40 percent of the world's power generation, has approached 100 dollars a barrel," he said, stressing that "renewable energy generation now makes economic sense, so the time has come to develop it."

He explained that "producing cheap energy from solar or wind could be possible within a few years at the earliest, and the price of solar electricity in particular will come down 25~50 percent more than expected."

For the project, Google has partnered with two green energy developers: eSolar Inc., which develops solar power, and Makani Power Inc., a wind power specialist.

Dr. Larry Brilliant, executive director of Google.org (http://Google.org), said in the release, "We wanted to do something to get to cheaper technology faster," adding that "the ultra-low-cost, clean, renewable energy needed to avoid the worst effects of climate change is not something that ordinarily attracts investment."

According to Google, its main efforts to date have included ▲ developing state-of-the-art technology for powering and cooling data centers in the US and worldwide, ▲ installing a 1.6-megawatt corporate solar panel array in Mountain View to generate electricity, ▲ the $10 million RechargeIT electric vehicle project (http://www.google.org/recharge), and ▲ working with other industry leaders to establish the Climate Savers Computing Initiative (http://www.climatesaverscomputing.org).

◆ The aim: raising the energy efficiency of Google's power-hungry IDCs
Google is well known for building the world's largest clusters out of cheap commodity PC servers. In 2006 the industry estimated Google was running more than 450,000 servers (as reported by the New York Times), and some now put the figure at several million worldwide. Google's platform has drawn particular attention for overcoming the weak performance and reliability of cheap hardware through intelligent system software design; the industry credits this with cutting costs by more than a third while still delivering scalability, reliability, and ease of development. The new strategy likewise reflects a determination to rein in IDC power costs as the number of servers grows exponentially.

In the material released to the press that day, Larry Page said "technology can keep evolving as an industry matures that can supply power cheaper than coal," and that Google is "paying more attention to developing other technologies that are cost-competitive and green." The emphasis is on using technology to drive the business toward cost efficiency.

On Google's official blog, Page added, "this would benefit us when we build very large, energy-intensive facilities such as data centers," and said Google "will apply our creativity and capacity for innovation to producing cheap renewable electricity at scale." In effect, this hints at a plan to lower the power bills of the data centers Google already operates or intends to build through renewable energy development.

Google co-founder Sergey Brin reinforced the point in an interview with the UK's Guardian. Asked why Google is entering an already crowded renewable energy market, he said, "we're attracted to developing clean energy technology that sells for 10 cents per kilowatt-hour; of course, some existing sources are already that cheap."

He continued, "the problem is that the overall issue is not solved until we drive prices down to a competitive level; as a business you can make money once you hit the 10-cent target, but getting to 4 cents will require overcoming more significant technical hurdles."

This is not the only sign of Google's attention to data center power efficiency. At last year's Intel Developer Forum, Google asked the PC hardware industry to establish a more power-efficient PC power supply unit (PSU) standard. It is unusual for Google to comment directly on hardware design.

In a white paper presented at that Intel Developer Forum, two of Google's data center designers argued that "the PC industry is using inefficient power supplies because of legacy baggage dating back to the 1981 IBM PC," and that "since most recent PCs convert the voltages they need with regulators on the motherboard, there is no reason for the power supply standard to be multivoltage." They proposed "consolidating the multivoltage power supply into a single 12-volt output and converting all required voltages on the PC motherboard."

◆ HP also investing in clean energy... costs drop sharply
HP (http://www.hp.com/environment) likewise said in a press release on the 27th (local time) that it "plans to install a 1-megawatt solar power generation system at its San Diego manufacturing site and, next year, to purchase 80 gigawatt-hours of wind energy in Ireland." The company says this will save it more than $800,000.

To that end, HP signed a contract with SunPower to operate and maintain the solar installation for 15 years. The installation consists of roughly 5,000 panels across seven buildings. HP expects to save more than $750,000 over the next 15 years, and the system is large enough to cut carbon dioxide emissions by one million pounds a year.

Starting next year, HP will also buy 80 gigawatt-hours of renewable energy in Ireland from the wind power company Airtricity, saving more than $40,000. That amounts to about 90% of the energy HP uses in Ireland.

HP, too, frames the move both as "a way to reduce carbon dioxide" and, emphatically, as something that "can cut costs significantly."

◆ MS eyeing a giant data center in Siberia... cooling efficiency in mind
MS is wrestling with the same problem. Russian media, citing material released late last week by Birger Sten, CEO of MS Russia, reported that "MS plans to build a data center with 10,000 servers in the Irkutsk region of Siberia."

The reports noted that MS's data center in Ireland cost around $500 million to build and said the new facility would be on a similar scale. More specific details also emerged, including that the main circuits would run over Transtelecom fiber-optic cable.

Other outlets, however, pointed out that "MS is considering a data center in Russia, but Angarsk and Irkutsk are merely candidate sites and nothing has been decided." Both candidates are nonetheless in eastern Siberia. The MS Russia CEO described them as "regions that can provide a stable supply of around 50 megawatts of power year-round," though opinions in the local industry are divided.

IT strategy expert Nicholas G. Carr, author of "IT Doesn't Matter," explained on his blog that "MS is building many data centers around the world to compete with Google," and that "this year alone it has added new facilities in southern and northern California, Oklahoma, Iowa, and the Netherlands."

He noted that "Google and MS are both looking for creative ways to cut data center power consumption," pointing out that Google draws cooling water from the rapids of the Columbia River at The Dalles, Oregon, while "MS exploited Chicago's chilly weather to cool its many server racks more easily." The mention of Siberia, in his reading, serves much the same purpose.

Indeed, Siberia is known for winter temperatures that fall to minus 50 degrees Celsius and, in places, summers that rarely climb above freezing.

Mike Manos, who heads MS's data center operations, told CNET in an interview that "very large data centers generally need a lot of water for cooling," and that "just as the Windy City data center in Chicago takes advantage of cold winter weather, Siberia would offer the same kind of advantage."


Nov 28, 2007

What Google's secretive home-built 10 Gigabit Ethernet switch could mean for the market

KISTI Global Trend Briefing (GTB), 2007-11-27

According to a recent report from Nyquist Capital, Google is using 10 Gigabit Ethernet switches it built itself in its large data centers, and if such a product were to reach the market through the related startup Arastra, the impact would be substantial.
Andrew Schmitt of Nyquist Capital says Google is distorting the market for optical components for high-speed networks. "Many companies are building components for 10 Gigabit Ethernet today but cannot find buyers for them. Part of that, of course, is the slump in the optical networking market as a whole. But putting together our conversations with a range of equipment makers, we concluded that Google, a networking giant in its own right, has built its own 10 Gigabit Ethernet switch and is using it to connect the servers in its data centers," Schmitt said.
Schmitt also asserts that Google's switch uses Broadcom switching silicon. He believes Google is likely using its own non-standard, short-reach, low-cost optical data format in the data center, and given Google's market-leading position, such a format could well end up in similar data center builds by other companies.
The theory gains further weight from the fact that Arastra, a startup that recently launched new data center products, has a system architecture resembling Google's. Arastra founder Andy Bechtolsheim worked at Sun at the same time as Google CEO Eric Schmidt and was also one of Google's earliest investors.
If Google were to bring to market, in any form, the switch products and optical networking systems it has optimized for running its vast data centers, the impact would be enormous. Google has emerged as the new Goliath of IT, and its technology and new ventures now carry enough weight to change the market's paradigm. One can only hope that the initiatives it is preparing on its own continue to develop symbiotically, creating blue oceans and sharing the gains, rather than starving existing markets or serving as a vehicle for monopoly profits.


http://www.networkworld.com/

Nov 9, 2007

Google's Android: a turning point in the mobile phone's transformation into an open Internet device... and a likely tie-in with OpenSocial

CNet News, 2007/11/07

Before people had even finished talking about Google's OpenSocial, the company followed up with Android, a fully open mobile phone platform, and the Open Handset Alliance (OHA), a consortium of 34 companies formed to drive it.

The OHA is a Google-led consortium for a fully open handset platform; its goal is to unify the phone platforms that currently differ from carrier to carrier and handset maker to handset maker into a single free, open platform. Android is the first platform the OHA has announced toward that goal. It is provided as open source under the Apache v2 license, which does not oblige adopters to release their source code back to the community.

The 34 companies currently participating in the OHA are Aplix, Ascender, Audience, Broadcom, China Mobile, eBay, Esmertec, Google, HTC, Intel, KDDI, Living Image, LG, Marvell, Motorola, NMS Communications, Noser, NTT DoCoMo, Nuance, Nvidia, PacketVideo, Qualcomm, Samsung, SiRF, SkyPop, SONiVOX, Sprint Nextel, Synaptics, TAT - The Astonishing Tribe, Telecom Italia, Telefonica, Texas Instruments, T-Mobile, and Wind River.

The participation of Japan's DoCoMo and KDDI is particularly noteworthy. DoCoMo's involvement is somewhat unexpected, since it has already invested heavily in Nokia's Symbian, which can be seen as a competitor to Android. Because DoCoMo currently supports handset makers in developing a Symbian-based 'DoCoMo common platform,' the industry expects that, despite joining the OHA, it will release few Android handsets at first. KDDI likewise has its own common platform, 'KCP,' and is not expected to adopt Android across the board. Both companies are therefore likely to take a passive stance toward Android through this year and next.

From a global perspective, however, the story is different. With handset makers such as Motorola and LG, chip makers such as Qualcomm and Nvidia, and software and application vendors such as Wind River and Nuance on board, it is worth watching for the possibility that, as early as next year, tens of millions of Android-based handsets will square off against the several million iPhones already in circulation.

The iPhone has certainly captured consumers' attention as a sexy, attractive device, but from a business standpoint it clearly trails Google: through the license-free Android, Google will be able to put phones carrying its advertising into the hands of users worldwide. That is Google's standout business instinct at work.

Symbian, Android's rival, has its strengths compared with Java, the application development platform used until now: faster execution and access to native libraries. On the downside, it requires engineers versed in C++, a high technical barrier, so supporting it is far from easy.

Android is Linux-based, so it is unclear how fast it will run, but since it has brought many carriers and handset makers on board, there is reason to hope that native libraries will be easily accessible. In any case, with Symbian parent Nokia holding more than half of the mobile OS market today, it is plain that a new Nokia-versus-Google rivalry will soon unfold.

It is hard to draw conclusions while the software development kit (SDK) has yet to appear, but OpenSocial will ultimately tie in with Android. If that happens, video chat could flow seamlessly between IM on the PC and video calls on the phone, and an SNS friend list could update continuously based on the phone's call and mail history. Applications that track traffic congestion in real time, and many others, could also emerge.

Of course, because each handset maker and carrier has built its business model around different specifications, joining Android still leaves test operators and solution vendors with no small challenge. Even so, the OHA deserves a welcome for transforming the mobile phone into a genuinely open Internet device. Just imagining the applications that might be developed has industry watchers excited. The boundary between PC and mobile is truly breaking down.

Nov 7, 2007

Google's mobile platform 'Android': the goal is to bring mash-ups to the phone and expand ad revenue

IT-media, 2007/11/06

Android, which Google announced on November 5 together with 33 mobile industry partners, is billed as the world's first fully integrated mobile platform, and Google's stated ambition is to make the mash-ups that happened on the PC possible on the phone as well.

Android is a bundle of mobile software consisting of a Linux-based OS, UI, and applications. Because it is provided as open source under the Apache License, participating companies can pick out and use only the features they need. It supports not only high-end devices such as smartphones but ordinary handsets as well.

The Open Handset Alliance (OHA), the 33-company coalition now developing Android with Google, includes US-centered carriers and handset vendors such as Sprint, T-Mobile, Qualcomm, and Motorola, as well as Japan's DoCoMo and KDDI and Korea's Samsung Electronics and LG Electronics.

The Android SDK, bundled with emulators for Windows, Macintosh, and Linux, will be released within the next week, and an alpha version is due shortly afterward. Phones running Android should appear in mid-2008. For the next two to three years the focus will be on phones, but development for consumer electronics such as media players, car navigation systems, and set-top boxes (STBs) is also being considered.

Andy Rubin of Google, who leads the project, says he wants Android to be the link between the online and mobile worlds. Open-platform online networks are evolving rapidly, but in mobile, specifications differ by carrier and by handset, so service development evolves slowly. Each carrier's mobile Internet service has its own specification, and the OS and UI differ from handset to handset. As a result, handset vendors and application developers have less freedom and development costs have risen.

Japan's KDDI and Softbank Mobile have announced integrated platforms of their own, KCP+ and POP-i respectively, but these too are vertically integrated around a single carrier and have not led to service development that crosses carrier boundaries.

Today carriers monopolize the platform, but if everyone were supplied with the same platform, vendors would have more choices and development costs could be cut. Google argues that the freely provided Android can lower handset development costs by about 10%.

Making mash-ups happen

Google created the mash-up culture of the online world, and now it wants to do the same in mobile. With Android, a service that combines Google Maps with another site can be built easily. Each function is modular, so vendors can pick only the functions they need and turn them into their own services.

Google expects this to greatly increase the number of Internet users. Because Android handsets make Internet access easy, users can readily reach Google's search and advertising. Today people use the Internet mainly at work or at home, but in the future they will be able to use it freely from the phone in a moving car or outdoors.

Google's rival MS is also pushing hard on its Windows Mobile platform, but the decisive difference is that Google's structure is open and democratic, built with everyone rather than as a monopoly.

"Android 도입으로 휴대폰 제조원가 최소 10%는 인하될 것"…구글폰 개발의 핵심인물 Andy Rubin
Tech-On, 2007/11/06

On November 6, Google held a press conference on its open mobile OS platform Android and the cross-industry Open Handset Alliance (OHA) behind it. Andy Rubin, the key figure in the Google phone effort, attended and outlined Android and the OHA. The main points of his remarks follow.

--What is the purpose of offering Android?

There are 3 billion mobile subscribers worldwide, but only a small fraction of them use the wireless Internet. Providing a complete open source platform that integrates the OS, middleware, user interface, and applications enables faster handset development and lower costs. Through that, we hope many more users will be able to reach the Internet from mobile devices.

By adopting Android, handset makers can reduce their licensing costs. They should be able to cut manufacturing costs by at least about 10%. We also plan to apply the platform to digital consumer electronics in the future, and we expect new open source businesses to emerge that look nothing like today's vertically integrated mobile software platforms.

--What is the OHA's role?

The 34 member companies span chip makers, handset makers, carriers, and software makers. The participation of semiconductor companies is especially noteworthy: beyond standardizing the handset architecture, we plan to bring drivers for specific chips into the platform through open source.

Software makers can develop to the platform and integrate their own proprietary technologies into the software stack. In other words, it becomes possible to realize mash-ups on the mobile phone. Giving users a wide range of choices through this is what the OHA is for.

--What license does the software use? Is it the GPL (GNU General Public License)?

Linux is under the GPL, but we use Linux only for the kernel. The GPL will be confined to the kernel; the software running above it uses the Apache license. Linux accounts for only about 2% of the total code.

--How do you expect Android to affect Google's revenue?

Since it is open source, we cannot expect revenue from supplying the software itself. But once an environment is in place where people can keep using Google's services on mobile, we expect advertising and other revenue to grow as a result.

--Will the platform go into high-end or low-end handsets?

The software is modular, so features can be included or left out as needed. For example, a low-cost phone can be built using only a subset of the modules. Going forward we plan to use it not just in phones but also in digital consumer electronics, set-top boxes, media players, and car navigation systems.

The Google phone unveiled: the core is the open OS platform 'Android'... any handset with an ARM9 or better can run it

CNet News, 2007/11/06

Google, joining hands with 34 mobile industry companies including HTC, Motorola, and Samsung Electronics, officially announced the open mobile software platform Android on the 5th, US time. The OHA (Open Handset Alliance) formed to develop the Android platform includes handset makers HTC and Motorola, US carrier T-Mobile, and chip maker Qualcomm, and it looks like one step toward Google's handset development community.

There had been endless talk that Google was developing new software to put its applications on mobile phones. But the goal goes beyond simply developing software: it is to provide an open OS platform for handsets through alliances with many companies.

Google CEO Eric Schmidt stressed in a statement, "The business model we envision is not simply a Google phone itself but a powerful mobile platform on a much larger scale. Our goal is to put Google's OS platform on every mobile handset."

The Android platform consists of an OS, middleware, a user-friendly interface, and applications. Phones with Android on board are expected to ship around the end of 2008.

The Android software gives handset makers and carriers an open platform on which to develop innovative applications. The new software cannot avoid a head-on collision with the smartphone software offered by Apple, Microsoft, Nokia, Palm, Research In Motion, and others. Unlike those mobile OSes, however, Android is not tied to particular devices; the plan is to support models from every handset maker, including Motorola, HTC, Samsung Electronics, and LG Electronics.

According to Andy Rubin, Google's director of mobile platforms and co-founder of Android, the mobile software company Google acquired in 2005, any handset with at least a 200MHz-class ARM9 processor can run Android. He explains that the Android platform is optimized for both small and large screens and supports input methods other than a keyboard.

Rubin said, "Android's user experience in particular is excellent. We will release the software development kit within a week, and we will show it directly." He also said Google plans to offer a hosted service through which third-party developers can deliver services and content very easily over USB, memory cards, and wireless, adding that system details would be discussed when the SDK is released.

Qualcomm CEO Paul Jacobs added that the company plans to put Android on its 7225 smartphone chipset, rather than on low-end handsets, and sell devices at a consumer price under $200.

The aim of the Alliance is to help handset makers and carriers develop user-friendly services and devices and to further strengthen Internet functionality on the phone. An open mobile OS also lets participating companies scale up their development, making it more likely that advanced features can be delivered at lower cost. In short, it becomes possible to offer more attractive handsets, with friendlier interfaces and a richer array of Internet services, than ever before.

HTC CEO Peter Chou said in a statement, "Joining the Open Handset Alliance and integrating the Android platform in the second half of 2008 lets us extend our product portfolio into a new category of Internet-connected mobile phones. This will be a catalyst for changing the structure of the mobile industry and will reset users' expectations of the mobile phone."

The Open Handset Alliance member companies plan to release the software development kit next week.

Oct 29, 2007

The Future of Connected Mobile Computing

"It's what computers have become" neatly sums up Nokia's position on its recent flagship consumer handset, the N95. The handset, which integrates high-specification camera, video, satellite navigation and web browsing, is a prime example of the continued convergence of computing and communications technology. Ian Drew, vice president of segment marketing, ARM, considers the way that the industry ecosystem has developed to support the needs of the new connected mobile computing market.
Three consumer needs are currently at the forefront of handset manufacturing. First, despite buying products that offer more and more functionality, consumers do not want to have to recharge their batteries during the day. Demand for a full-day use model, an always-on handset that will last all day on a single charge, makes low power a critical need that is high on the design and manufacturing agenda.
In addition to long battery life, consumers want their portable devices to deliver sufficient performance for the applications they are running, including office applications for business and multimedia capabilities incorporating music, video and gaming. It is also essential to provide an excellent browser experience if users are to embrace mobile internet.
And third, both businesses and consumers want to be able to find, use, and communicate information as and when they need to, so devices need to be permanently connected. Some want the same level of access to information on the move as they have when using their desktop PCs. The web browser is a key technology in meeting that need.
Demand for office, internet, multimedia, games and social networking applications is driving the smart phone market to volumes of around 150 million units per year and, potentially, further significant growth. According to global market research firm Ipsos Insight, the number of people accessing the internet from a mobile phone is growing faster than those using wireless access from a notebook PC. A huge installed user base puts the mobile phone in a strong position to become the dominant internet platform outside the home.
Comparing Market Growth Rates
The significant opportunities for mobile applications become even more apparent in the light of wired and wireless technology market trends. For example, use of SMS is still growing at a compound rate of 24 percent, and data at 41 percent. Shipments of internet-enabled phones outstrip media-enabled phones, but both have very high compound growth rates: 48 percent and 30 percent respectively (Figure 1).

Figure 1: Growth in SMS; Internet-enabled phones;
Media-enabled phones; Data usage

Sources: Gartner; Jon Peddie

The question facing manufacturers is how best to tap into these growth trends. Integration of several capabilities within a single mobile device has proven to be a winning formula for manufacturers in the past, and there is no reason to believe that this will change. For example, always-connected functionality will allow users to adopt mobile instant messaging as well as SMS.
There is also a potential upside for mobile operators: an opportunity to increase revenues is presented by handsets supporting new services, including email, access to music, video and other downloadable content, and location-based services like provision of maps. At the same time, operators must manage potential threats that disruptive technology brings, the obvious one being wireless voice over IP (VoIP). A phone that supports WiFi and includes a VoIP client offers a no-cost alternative to using the operator's network for voice.
Supporting Innovation
During the last 10 years, handset design has evolved from a uniform product that enables basic voice communication to a multitude of devices that support a broad range of communication options. Today's smart phone platform offers scope for a range of form factors and features for different target markets.
However, cellular handsets are not the only option for connected, converged portable products (Figure 2). Personal media players, satellite navigation devices and handheld games players benefit from enhanced connectivity. Sub-laptop mobile computers offer consumers a more portable alternative to the traditional laptop computer.

Figure 2: Convergence of connected mobile computing


Like all consumer markets, the market for connected mobile computing devices is highly competitive, so differentiation is critical to product success. Designing products for low power and cost is essential, as is providing a level of performance appropriate to the target application.
The structure of the mobile computing ecosystem is one of the factors that enable companies to achieve these design goals. Companies can license IP from multiple sources to create their own differentiated designs and have them fabricated by any of the major chip manufacturers or foundries. They can choose fabrication process technologies that suit their applications and manufacturers appropriate to their business models.
OEMs integrate the manufactured devices into the final product and add software to differentiate the application further. Again, a broad choice of software IP and development environments enables OEMs to optimize the software for their products. The connected mobile computing industry is vibrant, fast changing, competitive and characterized by innovation that continues to yield a rapidly expanding and diverse range of products. The structure of the mobile ecosystem supports innovation at every point in the development flow.
Nurturing Ecosystem Growth
An industry-wide willingness to collaborate has led to the growth of a diverse ecosystem of companies offering a broad range of software, with support for different standards, development tools and environments. This collaborative approach offers a number of benefits:
Choice and Flexibility
ARM provides processor IP to chip designers and manufacturers, who then have complete choice in how to build on that IP. By customizing commercial IP and adding their own proprietary IP to meet specific needs, each chip manufacturer is able to focus on differentiation, developing their devices for a broad range of mobile products that target a range of different markets.
Fostering Differentiation and Innovation
In this respect, ARM's business model enables the design community to differentiate and encourages innovation. It gives entrepreneurs the means to bring their new ideas to market and reduces barriers to entry by removing the need to raise prohibitive amounts of capital. The fabless business model also offers a greater degree of flexibility for OEMs. The result is that, by taking advantage of ARM IP and the ability to choose manufacturers, more than 200 licensees are continuously innovating and creating highly differentiated products.
Low Risk
Another benefit is that having a broader supply base of silicon and software reduces the risk involved in this innovation. As a result, the number of ARM licensees has doubled from over 100 in 2002 to around 200 in 2006. The number of ARM processor-based product shipments increased to something approaching 2.5 billion in 2006.
Experienced Development Community
The success of efforts to build a development community is evident in the numbers: collectively tens of thousands of SoC engineers and hundreds of thousands of software engineers are creating millions of ARM processor-based connected mobile devices.
Low Cost
This development community is now spread across all regions of the world. Globalization of the electronics industry has introduced consumer electronics to vast new markets in regions such as China, India, South America and Eastern Europe. While these countries offer the prospect of huge numbers of consumers, individuals' disposable income is much lower than that of people in the established Western markets. According to telecoms research and consulting firm ARCchart, many in the industry see a sub-$100 3G handset as a key to unlock untapped developing world markets. Product price points and cost become even more critical if manufacturers are to compete profitably in these new high-growth markets.
ARM helps manufacturers minimize their costs by providing area-optimized IP architectures, by supporting choice in routes to manufacturing, by working with Synopsys and other EDA vendors to develop design flows that minimize power and support design for manufacturing, and by encouraging choice and flexibility within the mobile computing ecosystem.
Collaborating to Achieve Low Power
Low power is the critical technology driver for this market. ARM and Synopsys have collaborated on a project to deliver a low-power implementation solution for ARM's Intelligent Energy Manager (IEM) technology through the proven Synopsys Galaxy design flow for dynamic voltage and frequency scaling.
ARM's IEM technology combines hardware and software features to control voltage and frequency depending on the dynamic needs of the application. The power management software and hardware controls the system to use as little power as possible for a given task, running the processor only as fast as necessary (commensurately using only the voltage required to operate at that frequency). This is adjusted on-the-fly, for each task or timeslice. This approach can reduce processor power needs by as much as 60 percent, which can equate to as much as 15-20 percent of the overall system power.
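The arithmetic behind those percentages follows from the standard dynamic power relation P ≈ C·V²·f. A minimal sketch of that calculation, using made-up operating points rather than ARM's measured figures:

# Illustrative DVFS arithmetic: dynamic CPU power scales roughly as C * V^2 * f.
# All numbers below are hypothetical operating points, not ARM IEM data.

def dynamic_power(c_eff, volts, freq_hz):
    return c_eff * volts**2 * freq_hz

full_speed = dynamic_power(c_eff=1e-9, volts=1.2, freq_hz=600e6)   # demanding task
scaled     = dynamic_power(c_eff=1e-9, volts=0.9, freq_hz=300e6)   # light task

cpu_saving = 1 - scaled / full_speed
print(f"CPU dynamic power saving: {cpu_saving:.0%}")               # roughly 72% here

# If the CPU accounts for about a quarter of system power, the system-level
# saving is proportionally smaller, in line with the 15-20% range quoted above.
print(f"System-level saving at 25% CPU share: {cpu_saving * 0.25:.0%}")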
ARM's IEM technology works with the operating system and handset applications to adjust the required CPU performance level through a standard programmer's model. It does this by balancing processor workload and energy consumption, while maximizing system responsiveness to meet end-user performance expectations. This will not affect the user experience, except that the battery will last longer between charges. ARM and Synopsys have also built on their extensive low power collaborative research and silicon technology demonstrators to create the Low Power Methodology Manual (LPMM) for SoC Design, published by Springer.
The LPMM enables designers to adopt aggressive power management techniques and take advantage of the latest low power features in ARM IP and Synopsys tools; an enhanced ARM-Synopsys implementation Reference Methodology for the ARM1176JZF-S will incorporate the LPMM techniques for an automated power gating flow through the Galaxy design platform. One of the technology demonstrator SoCs described in the LPMM, the Synopsys-ARM Low Power Technology (SALT) demonstrator SoC showed more than 96 percent leakage power savings using the techniques described in the LPMM.
Enabling Manufacturing Choice
The link between design and manufacturing is increasingly important in achieving performance, power and area goals. ARM works closely with foundries and EDA vendors to optimize design reference flows for particular processes, and in fabricating reference designs to demonstrate best-in-class implementation.
Choice of manufacturing partner enables choice of process technology. Because silicon-on-insulator (SOI) enables higher performance and lower power than bulk CMOS, it is the process of choice for all gaming platforms, satellite chips, ultra-low power circuits for watches and so on.
ARM has worked closely with foundry partner UMC to migrate 65nm CMOS to 65nm SOI (L65SOI). This is the first open foundry and IP offering for worldwide availability of 65nm SOI process technology. The solution comprises ARM's portfolio of standard cell library, I/O library and SRAM compiler, and UMC's manufacturing capability.
Summary
The connected mobile computing market needs rich wireless and wired communication to provide always-on, connected functionality. It must combine high performance with low power to enable full-day use of office-class applications for business, multi-day standby, rich multimedia support including music, TV, video and gaming, and an excellent browser experience.
This market is characterized by innovation and competition. Product companies need support to implement their ideas quickly and efficiently. Above all, they need support to differentiate. The partnership approach advocated by ARM encourages a broad supply base that provides choice of hardware and software intellectual property, silicon, development tools and expertise that is fundamental in offering an industry platform for differentiation.

Oct 23, 2007

Google's mobile strategy: is the real aim leverage over carriers? "We have many options even without acquiring spectrum"

MarketWatch, 2007/10/18

At the third-quarter earnings call on the 18th, Google founder Larry Page and CEO Eric Schmidt disclosed no plans for the mobile phone strategy that Wall Street analysts had singled out as the hot issue. They made clear, however, that carriers will need to keep watching Google's moves.

Google has hinted that it may bid billions of dollars in the FCC spectrum auction scheduled for January 2008, and for months news has trickled out that it is developing handset software that could run on spectrum it might acquire. Google, however, has taken no clear position on any of this.

Page said, "Google has a wide range of business options across spectrum, Internet access, and wireless," and "we do not have to buy spectrum or build mobile infrastructure, but these areas could turn out to be opportunities. We want the applications Google already offers to be used more widely by more people." Schmidt added, "In some specific areas, Google is already becoming the leader in mobile applications."

A range of mobile applications such as Google Maps is already available on popular handsets like the iPhone. Page said, "We have struck business partnerships with many carriers so far, and we will keep doing so."

But as Google expands further into mobile, it may use the possibility of independence from carriers as a bargaining weapon to demand more from them.

Darren Chervitz, research director at the Jacob Internet Fund, said, "Google does not want to be one-sidedly dependent on the carriers. It can credibly say it has a plan B. The possibility of buying spectrum will serve as a key to closer negotiations with carriers." As of the end of August, Google was the Jacob Internet Fund's second-largest holding, at 6.6% of the fund.

Google had asked the FCC to designate part of the spectrum being auctioned for openness provisions, putting forward four open-access conditions covering applications, devices, services, and networks. The FCC accepted some of them, allowing any device or application to be used regardless of carrier, but did not adopt the rest. Page said, "We are pleased that the FCC approved the open-access band."

Verizon, meanwhile, has voiced firm opposition to the open spectrum auction plan. Still, the industry expects a time will come when the major carriers have to go along with Google's proposals. Chervitz predicted that "Google may well break the existing carriers' business model in order to bring the same efficiency it has on the Internet to mobile."

Sep 30, 2007

List of device bandwidths

List of device bandwidths (Wikipedia)

This is a list of device bandwidths: the channel capacity (or, more informally, bandwidth) of some computer devices employing methods of data transport is listed by bit/s, kilobit/s (kbit/s), megabit/s (Mbit/s), or gigabit/s (Gbit/s) as appropriate and also MB/s or megabytes per second. They are listed in order from lowest bandwidth to highest.

Whether to use bit/s or byte/s (B/s) is often a matter of convention. The most commonly cited measurement is bolded. In general, parallel interfaces are quoted in byte/s (B/s), serial in bit/s. On devices like modems, bytes may be more than 8 bits long because they may be individually padded out with additional start and stop bits; the figures below will reflect this. Where channels use line codes, such as Ethernet, Serial ATA and PCI Express, quoted speeds are for the decoded signal.

Many of these figures are theoretical maxima, and various real-world considerations will generally keep the actual effective throughput much lower. The actual throughput achievable on Ethernet networks, for example (especially when heavily loaded or when running over substandard media), is debatable. The figures are also simplex speeds, which may conflict with the duplex speeds vendors sometimes use in promotional materials.

All of the figures listed here use decimal (metric) prefixes, not binary prefixes (1 kilobit, for example, is 1000 bits, not 1024 bits). Similarly, kB, MB, GB mean kilobytes, megabytes, gigabytes, not kibibytes, mebibytes, gibibytes.
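As a quick illustration of these conventions, the helper below converts decimal bit rates to byte rates; the example speeds are arbitrary, and the conversion assumes 8-bit bytes (which, as noted above, does not hold for modems that pad bytes with start and stop bits).

# Convert between bit/s and byte/s using the decimal (SI) prefixes the list uses.
# Assumes 8-bit bytes; asynchronous serial links with start/stop bits differ.

SI = {"k": 1e3, "M": 1e6, "G": 1e9}

def bits_to_bytes_per_s(value, prefix):
    """E.g. bits_to_bytes_per_s(100, 'M') -> 100 Mbit/s expressed in MB/s."""
    bits = value * SI[prefix]
    return bits / 8 / SI[prefix]

print(bits_to_bytes_per_s(100, "M"), "MB/s")   # 100 Mbit/s Fast Ethernet = 12.5 MB/s
print(bits_to_bytes_per_s(10, "G"), "GB/s")    # 10 Gbit/s Ethernet = 1.25 GB/s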

List..

The Common System Interface: Intel's Future Interconnect

By: David Kanter (dkanter@realworldtech.com)
Updated: 08-28-2007

Introduction


In the competitive x86 microprocessor market, there are always swings and shifts based on the latest introductions from the two main protagonists: Intel and AMD. The next anticipated shift is coming in 2008-9 when Intel will finally replace their front side bus architecture. This report details Intel’s next generation system interconnect and the associated cache coherency protocol, likely deployment plans across the desktop, notebook and server market as well as the economic implications.

Intel’s front-side bus has a long history that dates back to 1995 with the release of the Pentium Pro (P6). The P6 was the first processor to offer cheap and effective multiprocessing support; up to four CPUs could be connected to a single shared bus with very little additional effort for an OEM. The importance of cheap and effective cannot be overestimated. Before the P6, multiprocessor systems used special chipsets and usually a proprietary variant of UNIX; consequently they were quite expensive. Initially, Intel’s P6 could not always match the performance of these high end systems from the likes of IBM, DEC or Sun, but the price was so much lower that the performance gap became a secondary consideration. The workstation and low-end server markets embraced the P6 precisely because the front-side bus enabled inexpensive multiprocessors.

Ironically, the P6 bus was the subject of considerable controversy at Intel. It was originally based on the bus used in the i960 project and the designers came under pressure from various corporate factions to re-use the bus from the original Pentium so that OEMs would not have to redesign and validate new motherboards, and so end users could easily upgrade. However, the Pentium bus was strictly in-order and could only have a single memory access in flight at once, making it entirely inadequate for an out-of-order microprocessor like the P6 that would have many simultaneous memory accesses. Ultimately a compromise was reached that preserved most of the original P6 bus design, and the split-transaction P6 bus is still being used in new products 10 years after the design was started. The next step for Intel’s front side bus was to shift to the P4 bus, which was electrically similar to the P6 bus and issued commands at roughly the same rate, but clocked the data bus four times faster to provide fairly impressive throughput.

While the inexpensive P4 bus is still in use for Intel’s x86 processors, the rest of the world moved on to newer point-to-point interconnects rather than shared buses. Compared to systems based on HP’s EV7 and more importantly AMD’s Opteron, Intel’s front-side bus shows its age; it simply does not scale as well. Intel’s own Xeon and Xeon MP chipsets illustrate the point quite well, as both use two separate front-side bus segments in order to provide enough bandwidth to feed all the processors. Similarly, Intel designed all of their MPUs with relatively large caches to reduce the pressure on the front-side bus and memory systems, exemplified by the cases of the Xeon MP and Itanium 2, sporting 16MB and 24MB of L3 cache respectively. While some critics claim that Intel is pushing an archaic solution and patchwork fixes on the industry, the truth is that this is simply a replay of the issues surrounding the Pentium versus P6 bus debate writ large. The P4 bus is vastly simpler and less expensive than a higher performance, point-to-point interconnect, such as HyperTransport or CSI. After 10 years of shipping products, there is a massive amount of knowledge and infrastructure invested in the front-side bus architecture, both at Intel and at strategic partners. Tossing out the front-side bus will force everyone back to square one. Intel opted to defer this transition by increasing cache sizes, adding more bus segments and including snoop filters to create competitive products.

While Intel’s platform engineers devised more and more creative ways to improve multiprocessor performance using the front-side bus, a highly scalable next generation interconnect was being jointly designed by engineers from teams across Intel and some of the former Alpha designers acquired from Compaq. This new interconnect, known internally as the Common System Interface (CSI), is explicitly designed to accommodate integrated memory controllers and distributed shared memory. CSI will be used as the internal fabric for almost all future Intel systems starting with Tukwila, an Itanium processor and Nehalem, an enhanced derivative of the Core microarchitecture, slated for 2008. Not only will CSI be the cache coherent fabric between processors, but versions will be used to connect I/O chips and other devices in Intel based systems.

The design goals for CSI are rather intimidating. Roughly 90% of all servers sold use four or fewer sockets, and that is where Intel faces the greatest competition. For these systems, CSI must provide low latency and high bandwidth while keeping costs attractive. On the other end of the spectrum, high-end systems using Xeon MP and Itanium processors are intended for mission critical deployment and require extreme scalability and reliability for configurations as large as 2048 processors, but customers are willing to pay extra for those benefits. Many of the techniques that larger systems use to ensure reliability and scalability are more complicated than necessary for smaller servers (let alone notebooks or desktops), producing somewhat contradictory objectives. Consequently, it should be no surprise that CSI is not a single implementation, but rather a closely related family of implementations that will serve as the backbone of Intel’s architectures for the coming years.


Physical Layer


Unlike the front-side bus, CSI is a cleanly defined, layered network fabric used to communicate between various agents. These ‘agents’ may be microprocessors, coprocessors, FPGAs, chipsets, or generally any device with a CSI port. There are five distinct layers in the CSI stack, from lowest to highest: Physical, Link, Routing, Transport and Protocol [27]. Table 1 below describes the different layers and responsibilities of each layer.



Table 1 - Common System Interface Layers

While all five layers are clearly defined, they are not all necessary. For example, the routing layer is optional in less complex systems, such as a desktop, where there are only two CSI agents (the MPU and chipset). Similarly, in situations where all CSI agents are directly connected, the transport layer is redundant, as end-to-end reliability is equivalent to link layer reliability.

CSI is defined as a variable width, point to point, packet-based interface implemented as two uni-directional links with low-voltage differential signaling. A full width CSI link is physically configured with 20 bit lanes in each direction; these bit lanes are divided into four quadrants of 5 bit lanes, as depicted in Figure 1 [25]. While most CSI links are full width, half width (2 quadrants) and quarter width (a single quadrant) options are also possible. Reduced width links will likely be used for connecting MPUs and chipset components. Additionally, some CSI ports can be bifurcated, so that they can connect to two different agents (for example, so that an I/O hub can directly connect to two different MPUs) [25]. The width of the link determines the physical unit of information transfer, or phit, which can be 5, 10 or 20 bits.



Figure 1 - Anatomy of a CSI Link

In order to accommodate various link widths (and hence phit sizes) and bit orderings, each nibble of output is muxed on-chip before being transmitted across the physical transmission pins and the inverse is done on the receive side [25]. The nibble muxing eliminates trace length mismatches which reduces skew and improves performance. To support port bifurcation efficiently, the bit lanes are swizzled to avoid excessive wire crossings which would require additional layers in motherboards. Together these two techniques permit a CSI port to reverse the pins (i.e. send output for pin 0 to pin 19, etc.), which is needed when the processor sockets are mounted on both sides of a motherboard.

CSI is largely defined in a way that does not require a particular clocking mechanism for the physical layer. This is essential to balance current latency requirements, which tend to favor parallel interfaces, against future scalability, which requires truly serial technology. Clock encoding and clock and data recovery are prerequisites for optical interconnects, which will eventually be used to overcome the limitations of copper. By specifying CSI in an expansive fashion, the architects created a protocol stack that can naturally be extended from a parallel implementation over copper to optical communication.

Initial implementations appear to use clock forwarding, probably with one clock lane per quadrant to reduce skew and enable certain power saving techniques [16] [19] [27]. While some documents reference a single clock lane for the entire link, this seems unlikely as it would require much tighter skew margins between different data lanes. This would result in more restrictive board design rules and more expensive motherboards.

When a CSI link first boots up, it goes through a handshake based physical layer calibration and training process [14] [15]. Initially, the link is treated like a collection of independent serial lanes. Then the transmitter sends several specially designed phit patterns that will determine and communicate back the intra-lane skew and detect any lane failures. This information is used to train the receiver circuitry to compensate for skew between the different lanes that may arise due to different trace lengths, and process, temperature and voltage variation. Once the link has been trained, it then begins to operate as a parallel interface, and the circuitry used for training is shut down to save power. The link and any de-skewing circuitry will also be periodically recalibrated, based on a timing counter; according to one patent this counter triggers every 1-10ms [13]. When the retraining occurs, all higher level functionality including flow control and data transmission is temporarily halted. This skew compensation enables motherboard designs with less restrictive design rules for trace length matching, which are less expensive as a result.

It appears some variants of CSI can designate data lanes as alternate clocking lanes, in case of a clock failure [16]. In that situation, the transmitter and receiver would then disable the alternate clock lane and probably that lane’s whole quadrant. The link would then re-initialize at reduced width, using the alternate clock lane for clock forwarding, albeit with reduced data bandwidth. The advantage is that clock failures are no longer fatal, and gracefully degrade service in the same manner as a data lane failure, which can be handled through virtualization techniques in the link layer.

Initial CSI implementations in Intel’s 65nm and 45nm high performance CMOS processes target 4.8-6.4GT/s operation, thus providing 12-16GB/s of bandwidth in each direction and 24-32GB/s for each link [30] [33]. Compared to the parallel P4 bus, CSI uses vastly fewer pins running at much higher data rates, which not only simplifies board routing, but also makes more CPU pins available for power and ground.
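Those bandwidth figures follow directly from the lane count and transfer rate, as the quick check below shows (it assumes each of the 20 lanes carries one bit per transfer, which is how the numbers above work out).

# Bandwidth arithmetic for a full-width (20-lane) CSI link, per the figures above.
def link_bandwidth_GBps(transfers_per_s, lanes=20):
    bits_per_s = transfers_per_s * lanes        # one bit per lane per transfer
    return bits_per_s / 8 / 1e9                 # bytes per second, in GB/s

for gt in (4.8e9, 6.4e9):
    per_dir = link_bandwidth_GBps(gt)
    print(f"{gt/1e9:.1f} GT/s -> {per_dir:.0f} GB/s per direction, {2*per_dir:.0f} GB/s per link")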


Link Layer


The CSI link layer is concerned with reliably sending data between two directly connected ports, and virtualizing the physical layer. Protocol packets from the higher layers of CSI are transmitted as a series of 80 bit flow control units (flits) [25]. Depending on the width of a physical link, transmitting each flit takes either 4, 8 or 16 cycles. A single flit can contain up to 64 bits of data payload; the remaining 16 bits in the header are used for flow control, packet interleave, virtual networks and channels, error detection and other purposes [20] [22]. A higher level protocol packet can consist of as little as a single flit, for control messages, power management and the like, but could include a whole cache line, which is currently 64B for x86 MPUs and 128B for IPF.
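The flit and phit sizes quoted above fix the transfer counts; the short sketch below works through that arithmetic, with the single header/command flit per packet being a simplifying assumption rather than the actual CSI packet format.

# Flit arithmetic implied by the sizes above: an 80-bit flit carries 64 bits of
# payload, and a phit is 5, 10 or 20 bits depending on link width.
FLIT_BITS, PAYLOAD_BITS = 80, 64

def phits_per_flit(link_width_bits):
    return FLIT_BITS // link_width_bits          # 16, 8 or 4 transfer cycles

def flits_for_cache_line(line_bytes, header_flits=1):
    # header_flits=1 is a simplifying assumption for the command/routing info
    data_flits = (line_bytes * 8) // PAYLOAD_BITS
    return header_flits + data_flits

print(phits_per_flit(20), phits_per_flit(10), phits_per_flit(5))  # 4 8 16
print(flits_for_cache_line(64))    # 9 flits for a 64B x86 cache line
print(flits_for_cache_line(128))   # 17 flits for a 128B IPF cache line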

Flow control and error detection/correction are part of the CSI link layer, and operate between each transmitter and receiver pair. CSI uses a credit based flow control system to detect errors and avoid collisions or other quality-of-service issues [22] [34]. CSI links have a number of virtual channels, which can form different virtual networks [8]. These virtual channels are used to ensure deadlock free routing, and to group traffic according to various characteristics, such as transaction size or type, coherency, ordering rules and other information [23]. These particular details are intertwined with other aspects of CSI-based systems and are discussed later. To reduce the storage requirements for the different virtual channels, CSI uses two level adaptive buffering. Each virtual channel has a small dedicated buffer, and all channels share a larger buffer pool [8].

Under ordinary conditions, the transmitter will first acquire enough credits to send an entire packet, which as previously noted could be anywhere from 1-18+ flits. The flits will be transmitted to the receiver, and also copied into a retry buffer. Every flit is protected by an 8 bit CRC (or 16 bits in some cases), which will alert the receiver to corruption or transmission errors. When the receiver gets the flits, it will compute the CRC to check that the data is correct. If everything is clear, the receiver will send an acknowledgement (and credits) with the next flit that goes from the receiver-side to the transmitter-side (remember, there are two uni-directional links). Then the transmitter will clear the flits out of the retry buffer window. If the CRC indicates an error, the receiver-side will send a link layer retry request to the transmitter-side. The transmitter then begins resending the contents of the retry buffer until the flits have been correctly received and acknowledged. Figure 2 below shows an example of several transactions occurring across a CSI link using the flow control counters.
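The following toy model sketches the credit-and-retry scheme just described, assuming a simple per-link credit counter and an 8-bit CRC; the class and method names are illustrative, not taken from the CSI specification.

# Toy model of CSI-style link-layer flow control: credits gate transmission,
# every flit is CRC-protected and kept in a retry buffer until acknowledged.
import zlib
from collections import deque

class LinkTx:
    def __init__(self, credits):
        self.credits = credits
        self.retry_buffer = deque()

    def send(self, payload: bytes, wire):
        if self.credits == 0:
            return False                      # must wait for credits from the receiver
        crc = zlib.crc32(payload) & 0xFF      # stand-in for the 8-bit flit CRC
        self.retry_buffer.append(payload)
        self.credits -= 1
        wire.append((payload, crc))
        return True

    def on_ack(self, acked, credits_returned):
        for _ in range(acked):
            self.retry_buffer.popleft()       # drop acknowledged flits
        self.credits += credits_returned

    def on_retry_request(self, wire):
        for payload in self.retry_buffer:     # resend everything still unacknowledged
            wire.append((payload, zlib.crc32(payload) & 0xFF))

def receive(wire):
    good = [p for p, crc in wire if (zlib.crc32(p) & 0xFF) == crc]
    return len(good), len(good)               # (flits acknowledged, credits returned)

wire = []
tx = LinkTx(credits=4)
for flit in (b"hdr", b"data0", b"data1"):
    tx.send(flit, wire)
tx.on_ack(*receive(wire))
print(tx.credits, len(tx.retry_buffer))       # credits restored, retry buffer empty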



Figure 2 - Flow Control Example, [34]

While CSI’s flow control mechanisms will prevent serious contention, they do not necessarily guarantee low latency. To ensure that high priority control packets are not blocked by longer latency data packets, CSI incorporates packet interleaving in the link layer [21]. A bit in the flit header indicates whether the flit belongs to a normal or interleaved packet. For example, if a CSI agent is sending a 64B cache line (8+ flits) and it must send out a cache coherency snoop, rather than delaying the snoop by 8 or more cycles, it could interleave the snoop. This would significantly improve the latency of the snoop, while barely slowing down the data transmission. Similarly, this technique could be used to interleave multiple data streams so that they arrive in a synchronized fashion, or simply to reduce the variance in packet latency.

The link layer can also virtualize the underlying physical bit lanes. This is done by turning off some of the physical transmitter lanes, and assigning these bit lanes either a static logical value, or a value based on the remaining bits in each phit [24]. For example, a failed data lane could be removed, and replaced by one of the lanes which sends CRC, thus avoiding any data pollution or power consumption as a result. The link would then continue to function with reduced CRC protection, similar to the failover mechanisms for FB-DIMMs.

Once the physical layer has been calibrated and trained, as discussed previously, the link layer goes through an initialization process [31]. The link layer is configured to auto-negotiate and exchange various parameters, which are needed for operation [20]. Table 2 below is a list of some (but not necessarily all) of the parameters that each link will negotiate. The process starts with each side of the link assuming the default values, and then negotiating which values to actually use during normal operation. The link layer can also issue an in-band reset command, which stops the clock forwarding, and forces the link to recalibrate the physical layer and then re-initialize the link layer.



Table 2 - CSI Link Layer Parameters and Values, [20]

Most of these parameters are fairly straightforward. The only one that has not been discussed is the agent profile. This field characterizes the role of the device and contains other link level information, which is used to optimize for specific roles. For example, a "mobile" profile agent would likely have much more aggressive power saving features than a desktop part. Similarly, a server agent might disable some prefetch techniques that are effective for multimedia workloads but tend to reduce performance for more typical server applications. Additionally, the two CSI agents will communicate what 'type' each one belongs to. Some different types would include caching agents, memory agents, I/O agents, and other agents that are defined by the CSI specification.


Power Saving Techniques


Power and heat have become first order constraints in modern MPU design, and are equally important in the design of interconnects. Thus it should come as no surprise that CSI has a variety of power saving techniques, which tend to span both the physical and link layer.

The most obvious technique to reduce power is using reduced or low-power states, just as in a microprocessor. CSI incorporates at least two different power states which boost efficiency by offering various degrees of power reduction and wake-up penalties for each side of a link [17]. The intermediate power state, L0s, saves some power, but has a relatively short wake-up time to accommodate brief periods of inactivity [28]. There are several triggers for entering the L0s state; they can include software commands, an empty or near empty transaction queue (at the transmitter), a protocol message from an agent, etc. When the CSI link enters the L0s state, an analog wake-up detection circuit is activated, which monitors for any signals which would trigger an exit from L0s. Additionally, a wake-up can be caused by flits entering the transaction queue.

During normal operation, even if one half of the link is inactive, it will still have to send idle flits to the other side to maintain flow control and provide acknowledgement that flits are being received correctly. In the L0s state, the link stops flow control temporarily and can shut down some, but not all, of the circuitry associated with the physical layer. Circuits are powered down based on whether or not they can be woken up within a predetermined period of time. This wake-up timer is configurable and the value likely depends upon factors such as the target market (mobile, desktop or server) and power source (AC versus battery). For instance, the bit lanes can generally be held in an electrical idle so they do not consume any power. However, the clock recovery circuits (receiver side PLLs or DLLs) must be kept active and periodically recalibrated. This ensures that when the link is activated, no physical layer initialization is required, which keeps the wake up latency relatively low. Generally, increasing the timer would improve the power consumption in L0s, but could negatively impact performance. Intel’s patents indicate that the wake-up timer can be set as low as 20ns, or roughly 96-128 cycles [19].
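Relating that 20ns figure to the transfer rates quoted earlier is straightforward arithmetic, treating one 'cycle' as one transfer interval:

# Relating the ~20ns L0s wake-up timer to transfer cycles at CSI's quoted rates.
for rate in (4.8e9, 6.4e9):                 # transfers per second
    cycles = 20e-9 * rate
    print(f"{rate/1e9:.1f} GT/s: 20ns is about {cycles:.0f} cycles")
# 4.8 GT/s -> ~96 cycles, 6.4 GT/s -> ~128 cycles, matching the 96-128 range above.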

For more dramatic power savings, CSI links can be put into a low power state. The L1 state is optimized specifically for the lowest possible power, without regard for the wake-up latency, as it is intended to be used for prolonged idle periods. The biggest difference between the L0s and L1 states is that in the latter, the DLLs or PLLs used for clock recovery and skew compensation are turned off. This means that the physical layer of the link must be retrained when it is turned back on, which is fairly expensive in terms of latency – roughly 10us [17]. However, the benefit is that the link barely dissipates any power when in the L1 state. Figure 3 shows a state diagram for the links, including various resets and the L1 and L0s states.



Figure 3 – CSI Initialization and Power Saving State Diagram, [19]
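
As a rough illustration of the trade-off between the two states, the sketch below picks the deepest power state whose wake-up penalty is tolerable for an expected idle period. The latency figures echo the numbers cited above (a wake-up timer as low as roughly 20ns for L0s and on the order of 10us to retrain out of L1); the decision rule itself is an assumption, not Intel's actual policy.

```python
# Illustrative power state selection for one direction of a CSI link.
# The decision rule is an assumption; the latency figures echo the patents
# cited above (~20ns minimum L0s wake-up timer, ~10us to retrain out of L1).

L0S_WAKEUP_NS = 20          # configurable wake-up timer for L0s
L1_EXIT_NS = 10_000         # physical layer retrain cost when leaving L1

def choose_state(expected_idle_ns: int) -> str:
    """Pick the deepest power state whose wake-up cost is tolerable."""
    if expected_idle_ns <= L0S_WAKEUP_NS:
        return "L0"          # stay active; not worth a transition
    if expected_idle_ns < L1_EXIT_NS:
        return "L0s"         # idle the bit lanes, keep PLLs/DLLs running
    return "L1"              # turn off clock recovery too; retrain on exit

for idle_ns in (10, 500, 50_000):
    print(f"idle {idle_ns:>6} ns -> {choose_state(idle_ns)}")
```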

Another power saving trick for CSI addresses situations where the link is underutilized, but must remain operational. Intel’s engineers designed CSI so that the link width can be dynamically modulated [27]. This is not too difficult, since the physical link between two CSI agents can be 5, 10 or 20 bits wide, and the link layer must be able to efficiently accommodate each configuration. The only additional work is designing a mechanism to switch between full, half and quarter-width and ensuring that the link will operate correctly during and after a width change.

Note that width modulation is separate for each unidirectional portion of a link, so one direction might be wider to provide more bandwidth, while the opposite direction is mostly inactive. When the link layer is auto-negotiating, each CSI agent will keep track of the configurations supported by the other side (i.e. full width, half-width, quarter-width). Once the link has been established and is operating, each transmitter will periodically check to see if there is an opportunity to save power, or if more bandwidth is required.

If the link bandwidth is not being used, then the transmitter will select a narrower link configuration that is mutually supported and notify the receiver. Then the transmitter will modulate to a new width, and place the inactivated quadrants into the L0s or L1 power saving states, and the receiver will follow suit. One interesting twist is that the unused quadrants can be in distinct power states. For example, a full-width link could modulate down to a half-width link, putting quadrants 0 and 1 into the L1 state, and then later modulate down to a quarter-width link, putting quadrant 2 into the L0s state. In this situation, the link could respond to an immediate need for bandwidth by activating quadrant 2 quickly, while still saving a substantial amount of power.

If more bandwidth is required, the process is slightly more complicated. First, the transmitter will wake up its own circuitry, and also send out a wake-up signal to the receiver. However, because the wake-up is not instantaneous, the transmitter will have to wait for a predetermined and configurable period of time. Once this period has passed, and the receiver is guaranteed to be awake, then the transmitter can finally modulate to a wider link, and start transmitting data at the higher bandwidth.
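
A minimal sketch of the width-modulation decision follows, assuming the negotiated widths are full, half and quarter (20, 10 and 5 lanes) as described above; the utilization thresholds and the polling model are invented for illustration.

```python
# Illustrative width modulation for one direction of a link. The supported
# widths come from the article; the utilization thresholds are assumptions.

SUPPORTED_WIDTHS = (5, 10, 20)   # quarter, half and full width (bit lanes)

def pick_width(current: int, utilization: float) -> int:
    """Step down when the link is underused, step back up when it saturates."""
    widths = sorted(SUPPORTED_WIDTHS)
    i = widths.index(current)
    if utilization > 0.8 and i + 1 < len(widths):
        return widths[i + 1]     # widen: wake the receiver first, then modulate
    if utilization < 0.2 and i > 0:
        return widths[i - 1]     # narrow: notify the receiver, park idle quadrants
    return current

print(pick_width(20, 0.05))   # -> 10: drop to half width, idle two quadrants
print(pick_width(10, 0.95))   # -> 20: return to full width for more bandwidth
```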

Most of the previously discussed power saving techniques are highly dynamic and difficult to predict. This means that engineers will naturally have to build in substantial guard-banding to guarantee correct operation. However, CSI also offers deterministic thermal throttling [26]. When a CSI agent reaches a thermal stress point, such as exceeding TDP for a period of time, or exceeding a specific temperature on-die, the overheating agent will send a thermal management request to other agents that it is connected to via CSI. The thermal management request typically includes a specified acknowledgement window and a sleep timer (these could be programmed into the BIOS, or dynamically set by the overheating agent). If the other agent responds affirmatively within the acknowledgement window, then both sides of the link will shut down for the specified sleep time. Using an acknowledgement window ensures that the other agent has the flexibility to finish in-flight transactions before de-activating the CSI link.
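
The handshake might look something like the sketch below, where the acknowledgement window bounds how long the overheating agent waits before giving up. The callables, window and sleep values are placeholders; as noted above, the real values could come from the BIOS or be set dynamically.

```python
# Illustrative thermal throttling handshake. send_request and await_ack are
# placeholder callables; window and sleep values would come from the BIOS or
# the overheating agent, not from these defaults.

import time

def thermal_throttle(send_request, await_ack, ack_window_s=0.001, sleep_s=0.010):
    """Ask the peer to idle the link; only sleep if it agrees within the window."""
    send_request(ack_window_s, sleep_s)          # thermal management request
    deadline = time.monotonic() + ack_window_s
    while time.monotonic() < deadline:
        if await_ack():                          # peer drained in-flight traffic
            time.sleep(sleep_s)                  # both sides idle for sleep_s
            return True
    return False                                 # no ack in time: keep link up
```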


Coherency Leaps Forward at Intel


CSI is a switched fabric and a natural fit for cache coherent non-uniform memory architectures (ccNUMA). However, simply recycling Intel’s existing MESI protocol and grafting it onto a ccNUMA system is far from efficient. The MESI protocol complements Intel’s older bus-based architecture and elegantly enforces coherency. But in a ccNUMA system, the MESI protocol would send many redundant messages between different nodes, often with unnecessarily high latency. In particular, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. However, the requesting processor only needs a single copy of the data, so the system is wasting a bit of bandwidth.

Intel's solution to this issue is rather elegant. They adapted the standard MESI protocol to include an additional state, the Forwarding (F) state, and changed the role of the Shared (S) state. In the MESIF protocol, only a single instance of a cache line may be in the F state and that instance is the only one that may be duplicated [3]. Other caches may hold the data, but it will be in the shared state and cannot be copied. In other words, the cache line in the F state is used to respond to any read requests, while the S state cache lines are now silent. This makes the line in the F state a first amongst equals, when responding to snoop requests. By designating a single cache line to respond to requests, coherency traffic is substantially reduced when multiple copies of the data exist.

When a cache line in the F state is copied, the F state migrates to the newer copy, while the older one drops back to S. This has two advantages over pinning the F state to the original copy of the cache line. First, because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. In essence, this takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.
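
The sketch below shows the F-state migration for a clean line: the single F copy supplies the data and the F designation moves to the newest copy. The dictionary-of-states model is purely illustrative.

```python
# Illustrative MESIF read-sharing of a clean line: the single F copy answers
# the snoop, and the F designation migrates to the newest copy.

def read_shared(caches: dict, requester: str, holder: str) -> None:
    """Copy a clean line from `holder` to `requester` under MESIF sharing rules.
    (A Modified line would be written back first, since MESIF has no Owned state.)"""
    assert caches[holder] in ("E", "F"), "sketch only covers clean lines"
    caches[holder] = "S"        # the old copy becomes a silent S copy
    caches[requester] = "F"     # the newest copy answers future snoops

caches = {"cpu0": "F", "cpu1": "S", "cpu2": "I"}
read_shared(caches, requester="cpu2", holder="cpu0")
print(caches)   # {'cpu0': 'S', 'cpu1': 'S', 'cpu2': 'F'}
```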

Figure 4 demonstrates the advantages of MESIF over the traditional MESI protocol, reducing two responses to a single response (acknowledgements are not shown). Note that a peer node is simply a node in the system that contains a cache.



Figure 4 – MESIF versus MESI Protocol

In general, MESIF is a significant step forward for Intel’s coherency protocol. However, there is at least one optimization which Intel did not pursue – the Owner state that is used in the MOESI protocol (found in the AMD Opteron). The O state is used to share dirty cache lines (i.e. lines that have been written to, where memory has older or dirty data), without writing back to memory.

Specifically, if a dirty cache line is in the M (modified) state, then another processor can request a copy. The dirty cache line switches to the Owned state, and a duplicate copy is made in the S state. As a result, any cache line in the O state must be written back to memory before it can be evicted, and the S state no longer implies that the cache line is clean. In comparison, a system using MESIF or MESI would change the cache line to the F or S state, copy it to the requesting cache and write the data back to memory – the O state avoids the write-back, saving some bandwidth. It is unclear why Intel avoided using the O state in the newer coherency protocol for CSI – perhaps the architects decided that the performance gain was too small to justify the additional complexity.
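
A small sketch of the difference, counting the fabric and memory operations needed to share a dirty line under each protocol; the operation names are illustrative.

```python
# Illustrative comparison of sharing a dirty (Modified) line. MOESI keeps the
# dirty data in an Owned copy; MESIF and MESI must update memory first.

def share_dirty_line(protocol: str) -> list:
    """Return the operations needed to give another cache a copy of an M line."""
    if protocol == "MOESI":
        return ["cache-to-cache copy (owner keeps line in O, requester gets S)"]
    # MESIF/MESI have no Owned state, so the dirty data is written back first.
    return ["write-back to memory",
            "cache-to-cache copy (line ends up F/S under MESIF, S/S under MESI)"]

print(share_dirty_line("MOESI"))
print(share_dirty_line("MESIF"))
```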

Table 3 summarizes the different protocols and states for the MESI, MOESI and MESIF cache coherency protocols.



Table 3 – Overview of States in Snoop Protocols


A Two Hop Protocol for Low Latency


In a CSI system, each node is assigned a unique node ID (NID), which serves as an address on the network fabric. Each node also has a Peer Agent list, which enumerates the other nodes in the system that it must snoop when requesting data from memory (typically peers contain a cache, but could also be an I/O hub or device with DMA). Similarly, each transaction is assigned an identifier (TID) for tracking at each involved node. The TID, together with a destination and source ID, forms a globally unique transaction identifier [37]. The number of TIDs, and hence outstanding transactions, is limited and will likely be one differentiating factor between Xeon DP, Xeon MP and Itanium systems. Table 4 describes the different fields that can be used in each CSI message, although some messages do not use all fields. For example, a snoop response from a processor that holds data in the shared state will not contain any data, just an acknowledgement.



Table 4 – CSI Message Fields, [1]
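
To make the identifier concrete, the sketch below composes a global transaction ID from the source NID, destination NID and per-node TID. The field widths are assumptions; the point is only that two nodes can reuse the same local TID without ambiguity.

```python
# Illustrative globally unique transaction identifier. Field widths are
# assumptions; only the (source NID, destination NID, TID) composition matters.

from collections import namedtuple

GlobalTxnId = namedtuple("GlobalTxnId", ["src_nid", "dst_nid", "tid"])

NID_BITS = 6   # assumed: enough node IDs for a mid-sized system
TID_BITS = 8   # assumed: bounds the outstanding transactions per node

def make_global_id(src_nid: int, dst_nid: int, tid: int) -> GlobalTxnId:
    assert src_nid < 2**NID_BITS and dst_nid < 2**NID_BITS and tid < 2**TID_BITS
    return GlobalTxnId(src_nid, dst_nid, tid)

# Two requesters can safely reuse local TID 17; the triples remain distinct.
print(make_global_id(src_nid=3, dst_nid=0, tid=17))
print(make_global_id(src_nid=5, dst_nid=0, tid=17))
```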

CSI was designed as a natural extension of the existing front side bus protocol; although there are some changes, many of the commands can be easily traced to the commands on the front side bus. A set of commands is listed in the ‘250 patent.

In a three hop protocol, such as the one used by AMD’s Opteron, read requests are first sent to the home node (i.e. where the cache line is stored in memory). The home node then snoops all peer nodes (i.e. caching agents) in the system, and reads from memory. Lastly, all snoop responses from peer nodes and the data in memory are sent to the requesting processor. This transaction involves three point-to-point messages: requestor to home, home to peer and peer to requestor, and a read from memory before the data can be consumed.

Rather than implement a three hop cache coherency protocol, CSI was designed with a novel two hop protocol that achieves lower latency. In the protocol used by CSI, transactions go through three phases; however, data can be used after the second phase or hop. First, the requesting node sends out snoops to all peer nodes (i.e. caches) and the home node. Each peer node sends a snoop response to the requesting node. When the second phase has finished, the requesting node sends an acknowledgement to the home node, where the transaction is finally completed.
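
The sketch below lays out the two message flows side by side; it is only a schematic of the hops described above, with data becoming usable at the step marked with an asterisk.

```python
# Schematic of the two coherency flows discussed above. Data becomes usable
# after the hop marked with '*'; the acknowledgement in the two hop protocol
# is off the critical path.

THREE_HOP = [
    (1, "requester -> home: read request"),
    (2, "home -> peers: snoop (home also reads memory)"),
    (3, "peers/home -> requester: data *"),
]

TWO_HOP = [
    (1, "requester -> peers + home: snoop"),
    (2, "peer -> requester: data *"),
    (3, "requester -> home: acknowledgement"),
]

for name, flow in (("three hop", THREE_HOP), ("two hop", TWO_HOP)):
    critical = max(hop for hop, msg in flow if msg.endswith("*"))
    print(f"{name}: data usable after hop {critical}")
```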

In the rare case of a conflict, the home node is notified and will step in to resolve transactions in the appropriate order to ensure correctness. This could force one or more processors in the system to roll back, replay or otherwise cancel the effects of a load instruction. However, the additional control circuitry is neither frequently used nor on any critical path, so it can be tuned for low leakage power.

In the vast majority of transactions, the home node is a silent observer, and the requestor can use the new data as soon as it is received from the peer agent’s cache, which is the lowest possible latency. In particular, a two hop protocol does not have to wait to access memory in the home node, in contrast to three hop protocols. Figure 5 compares the critical paths between two hop and three hop protocols, when data is in a cache (note that not all snoops and responses are shown ? only the critical path).



Figure 5 – Critical Path Latency for Two and Three Hop Protocols

This arrangement is somewhat unusual in that the requesting processor is conceptually pushing transactions into the system and home node. In three hop protocols, the home node acts as a gate keeper and can defer a transaction if the appropriate queues are full, while only stalling the requestor. In a CSI-based system, the home node receives messages after the transaction is in progress or has already occurred. If these incoming transactions were lost, the system would be unable to maintain coherency. Therefore, to ensure correctness, CSI home nodes must have a relatively large pre-allocated buffer to support as many transactions as can reasonably be initiated.


Virtual Channels


One of the most difficult challenges in designing multiprocessor systems is guaranteeing forward progress, and avoiding strange network behavior that limits performance. Unfortunately, this problem is an inherent aspect of multiprocessor system design, and really impacts almost every decision made; there is no clean way to separate it from other concerns. Every coherent transaction across CSI has three phases: the snoop from the requesting CPU, the responses from the peer nodes, and the acknowledgement to the home node. As noted previously, CSI uses separate link layer virtual channels to improve performance and avoid livelocks or deadlocks. Each transaction phase has one or more associated virtual channels: snoop, response and home. These arrangements should come as no surprise, since the EV7 used similar techniques and designations. Additional channels discussed in patents include short message, data and data response channels [34].

One reason for providing different virtual channels is that the traffic characteristics of the three are quite distinct. Packets sent across the home channel are typically very small and must be received in order. In the most common case, a home packet would simply be an acknowledgement that a transaction can retire. The response channel sometimes includes larger packets, often containing actual cache lines (although these may go on the data response channel), and can be processed out of order to improve performance. The snoop channel is mostly smaller packets, and can also operate out of order. The optimizations for each channel are different and by separating each class of traffic, Intel architects can more carefully tune the system for high performance and lower power.

There are also priority relationships between the different classes of traffic. When the system is saturated, the home phase channels will be given the highest priority, which ensures that some transactions will retire, leaving the system and reducing traffic. The next highest priority is the response phase and associated channels, which provide data to processors so they can continue computation, and initiate the home phase. The lowest priority traffic is the snoop phase channels, which are used to start new transactions and are the first to throttle back.
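
A toy arbiter makes the ordering concrete: when the fabric saturates, home traffic drains first so transactions can retire, responses come next, and new snoops wait. The queue structure is an assumption of the sketch; only the priority order comes from the discussion above.

```python
# Illustrative virtual channel arbitration under saturation. Only the priority
# order (home > response > snoop) comes from the article; the rest is assumed.

from collections import deque

PRIORITY = ("home", "response", "snoop")   # highest to lowest

def next_flit(channels):
    """Send from the highest-priority virtual channel that has traffic queued."""
    for vc in PRIORITY:
        if channels[vc]:
            return vc, channels[vc].popleft()
    return None

channels = {
    "home": deque(["ack: retire txn 12"]),
    "response": deque(["data for txn 9"]),
    "snoop": deque(["snoop line 0x80"]),
}
print(next_flit(channels))   # ('home', 'ack: retire txn 12')
```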



Dynamic Reconfiguration


One of the problems with the existing bus infrastructure is that the interface presented to software is not particularly clean or isolated. Specifically, components of Intel’s system architecture cannot be dynamically added or removed from the front-side bus; instead the bus and all attached components must be shut down, and then restarted after disabling or adding the component in question. For instance, to remove one faulty processor in a 16 socket server, an entire node (4 processors, one memory controller hub, the local memory and I/O) must be off-lined.

CSI supports both in-band (coordinated by a system component) and out-of-band (coordinated by a service processor) dynamic reconfiguration of system resources, also known as hot plug [39]. A system agent and the firmware work together to quiesce individual components and then modify the routing tables and system addressing decoders, so that the changes appear to be atomic to the operating system and software.

To add a system resource, such as a processor, first the firmware creates a physical and logical profile for the new processor in the rest of the system. Next, the firmware enables the CSI links between the new processor and the rest of the system. The firmware initializes the new processor’s CSI link and sends data about the system configuration to the new processor. The new processor initializes itself and begins self-testing, and will notify the firmware when it is complete. At that point, the firmware notifies the OS and the rest of the system to begin operating with the new processor in place.
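
In pseudocode form, the hot-add sequence might look like the sketch below. The objects and method names are placeholders standing in for firmware routines and CSI configuration registers, not a real interface.

```python
# Illustrative hot-add sequence; all objects and method names are placeholders
# for the firmware and OS steps described above.

def hot_add_processor(firmware, os_agent, new_cpu):
    firmware.create_profile(new_cpu)           # physical and logical profile
    firmware.enable_links(new_cpu)             # bring up the CSI links
    firmware.push_system_config(new_cpu)       # routing tables, address decoders
    new_cpu.init_and_self_test()               # new processor initializes itself
    new_cpu.wait_until_ready()                 # notifies firmware on completion
    os_agent.notify_resource_online(new_cpu)   # OS begins using the new CPU
```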

System resources, such as a processor, memory or I/O hub, are removed through a complementary mechanism. These two techniques can also be combined to move resources seamlessly between different system partitions.


Multiprocessor Systems


When the P6 front side bus was first released, it caused a substantial shift in the computer industry by supporting up to four processors without any chipset modifications. As a result, Intel based systems using Linux or Windows penetrated and dominated the workstation and entry level server market, largely because the existing architectures were priced vastly higher.

However, Intel hesitated to extend itself beyond that point. This hesitancy was partially due to economic incentives to maintain the same infrastructure, but also the preferences of key OEMs such as IBM, HP and others, who provide added value in the form of larger multiprocessor systems. Balancing all the different priorities inside of Intel while pleasing partners is nearly impossible, and has handicapped Intel for the past several years. However, it is quite clear that any reservations at Intel disappeared around 2002-3, when CSI development started.

Intel's patents clearly anticipate two and four processor systems, as shown in Figure 6. Each processor in a dual socket system will require a single coherent full width CSI link, with one or two half-width links to connect to I/O bridges, making the system fully symmetric (half-width links are shown as dotted lines). Processors in four socket systems will be fully connected, and each processor could also connect directly to the I/O bridge. More likely, each processor, or pair of processors, could connect to a separate I/O bridge to provide higher I/O bandwidth in the four socket systems.




Figure 6 – 2 and 4P CSI System Diagrams [2] [34]

Fully interconnected systems, such as those shown in Figure 6, enjoy several advantages over partially connected solutions. First of all, transactions occur at the speed of the slowest participant. Hence, a system where every caching agent (including the I/O bridge) is only one hop away ensures lower transaction latency. Secondly, by lowering transaction latency, the number of transactions in flight is reduced (since the average transaction lifetime is shorter). This means that the buffers for each caching agent can be smaller, faster and more power efficient. Lastly, operating systems and applications have trouble handling NUMA optimizations, so more symmetrical systems are ideal from a software perspective.



Interacting with I/O


Of course, ensuring optimal communication between multiple processors is just one part of system design. The I/O architecture for Intel’s platform is also important, and CSI brings along several important changes in that area as well [36].

As Figure 6 indicates, some CSI based systems contain multiple I/O hubs, which need to communicate with each other. Since the I/O hubs are not connected, Intel’s engineers devised an efficient method to forward I/O transactions (typically PCI-Express) through CSI. Because CSI was optimized for coherent traffic, it lacks many of the features which PCI-Express relies upon, such as I/O specific packet attributes. To solve this problem, PCI-E packets are tunneled through CSI, leaving much or all of the PCI-E header information intact.


Beyond Multiprocessors


In a forward-looking decision by Intel, CSI is fairly agnostic with respect to system expansion. Systems can be expanded in a hierarchical manner, which is the path that IBM took for their older X3 chipset, where one agent in each local cell acts as a proxy for the rest of the system. Certainly, the definition of CSI lends itself to hierarchical arrangements, since a “CSI node” is an abstraction and may in fact consist of multiple processors. For instance, in a 16 socket system, there might be four nodes, and each node might contain four sockets and resemble the top diagram in Figure 6. Early Intel patents seem to point to hierarchical expansion as being preferred, although later patents appear to be less restrictive [2] [4]. As an alternative to hierarchical expansion, larger systems can be built using a flat topology (the 2 dimensional torus used by the EV7 would be an example). However, a flat system must have a node ID for each processor, whereas a hierarchical system needs only enough node IDs for the processors in each ‘cell’. So, while a flat 32 socket system would require 32 distinct node IDs, a comparable system using 8 nodes of 4 sockets would only need 4 distinct node IDs.
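
The node ID accounting behind that example is simple enough to state directly; the helper below just restates the arithmetic, assuming the node controller in each cell proxies for the rest of the system.

```python
# Node ID accounting for flat versus hierarchical expansion, matching the
# 32 socket example above.

def node_ids_needed(sockets: int, sockets_per_cell: int, hierarchical: bool) -> int:
    """A flat fabric names every socket; a hierarchical one only names the
    sockets inside a cell (the node controller proxies for everything else)."""
    return sockets_per_cell if hierarchical else sockets

print(node_ids_needed(32, 4, hierarchical=False))  # 32 distinct node IDs
print(node_ids_needed(32, 4, hierarchical=True))   # 4 distinct node IDs per cell
```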

Most MPU vendors have used node ID ranges to differentiate between versions of their processors. For instance, Intel and AMD both draw clear distinctions between 1, 2 and 4P server MPUs; each step up offers more RAS features and node IDs, along with a substantial price increase. Furthermore, a flat system with 8+ processors in all likelihood needs snoop filters or directories for scalability. However, Intel’s x86 MPUs will probably not natively support directories or snoop filters; instead leaving that choice to OEMs. This flexibility for CSI systems means that OEMs with sufficient expertise can differentiate their products with custom node controllers for each local node in a hierarchical system.

Directory based coherency protocols are the most scalable option for system expansion. However, directories use a three hop coherency protocol that is quite different from CSI. In the first phase, the requestor sends a request to the home node, which contains the directory that lists which agents have a copy of the cache line. The home node would then snoop those agents, while sending no messages to uninvolved third parties. Lastly, all the agents receiving a snoop would send a response to the requestor. This presents several problems. The directory itself is difficult to implement, since every cache miss in the system generates both a read (the lookup) and a write to the directory (updating ownership). The latency is also higher than a snooping broadcast protocol, although the system bandwidth used is lower, hence providing better scalability. Snoop filters are a more natural extension of the CSI infrastructure suitable for mid-sized systems.

Snoop filters focus on a subset of the key data to reduce the number of snoop responses. The classic example of a snoop filter, such as Intel’s Blackford, Seaburg or Clarksboro chipsets, tracks remotely cached data. Snoop filters have an advantage because they preserve the low latency of the CSI protocol, while a directory would require changing to a three hop protocol. Not every element in the system must have a snoop filter either; CSI is flexible in that regard as well.


Remote Prefetch


Another interesting innovation that may show up in CSI is remote prefetch [9]. Hardware prefetching is nothing new to modern CPUs; it has been around since the 130nm Pentium III. Typically, hardware prefetch works by tracking cache line requests from the CPU and trying to detect a spatial or temporal pattern. For instance, loading a 128MB movie will result in roughly one million sequential requests (temporal) for 128B cache lines that are probably adjacent in memory (spatial). A prefetcher in the cache controller will figure out this pattern, and then start requesting cache lines ahead of time to hide memory access latencies. However, general purpose systems rely on cache and memory controllers prefetching for the CPU and do not receive feedback from other system agents.
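
A toy sequential stream detector illustrates the idea; the confidence threshold and prefetch depth are assumptions, and real prefetchers track many streams and strides at once.

```python
# Toy sequential stream prefetcher in the spirit of the hardware prefetchers
# described above. Threshold and depth are assumptions for illustration.

CACHE_LINE = 128   # bytes, as in the movie example above

class StreamPrefetcher:
    def __init__(self, threshold: int = 3, depth: int = 4):
        self.last_addr = None
        self.hits = 0
        self.threshold = threshold   # sequential requests before prefetching
        self.depth = depth           # how many lines ahead to fetch

    def on_request(self, addr: int) -> list:
        """Return the addresses to prefetch (possibly none) for this request."""
        if self.last_addr is not None and addr == self.last_addr + CACHE_LINE:
            self.hits += 1
        else:
            self.hits = 0
        self.last_addr = addr
        if self.hits >= self.threshold:
            return [addr + i * CACHE_LINE for i in range(1, self.depth + 1)]
        return []

pf = StreamPrefetcher()
for addr in range(0, 128 * 6, 128):
    prefetches = pf.on_request(addr)
print(prefetches)   # the next four sequential cache lines
```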

One of the patents relating to CSI is for a remote device initiating a prefetch into processor caches. The general idea is that in some situations, remote agents (an I/O device or coprocessor) might have more knowledge about where data is coming from than the simple pattern detectors in the cache or memory controller. To take advantage of that, a remote agent sends a prefetch directive message to a cache prefetcher. This message could be as simple as indicating where the prefetch would come from (and therefore where to respond), but in all likelihood would include information such as data size, priority and addressing information. The prefetcher can then respond by initiating a prefetch or simply ignoring the directive altogether. In the former case, the prefetcher would give direct cache access to the remote agent, which then writes the data into the cache. Additionally, the prefetcher could request that the remote agent pre-process the data. For example, if the data is compressed, encoded or encrypted, the remote agent could transform the data to an immediately readable format, or route it over the interprocessor fabric to a decoder or other mechanism.
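
Sketched as a message, a prefetch directive might carry fields like the ones below. The field names are guesses based on the description above; the patent does not publish an exact format.

```python
# Illustrative prefetch directive from a remote agent (e.g. a network adapter)
# to a processor's cache prefetcher. Field names are assumptions.

from dataclasses import dataclass
from typing import Optional

@dataclass
class PrefetchDirective:
    source_agent: int            # NID of the remote agent issuing the directive
    target_cache: int            # NID of the processor whose cache should prefetch
    base_address: int            # where the incoming data will land
    size_bytes: int              # e.g. a 4KB payload
    priority: int = 0
    preprocess: Optional[str] = None   # e.g. "decrypt" or "decompress"

def handle_directive(d: PrefetchDirective, accept: bool) -> str:
    """The prefetcher may grant direct cache access or ignore the directive."""
    if not accept:
        return "directive ignored"
    return f"DCA granted for {d.size_bytes} bytes at {hex(d.base_address)}"

print(handle_directive(PrefetchDirective(8, 0, 0x10000000, 4096), accept=True))
```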

The most obvious application for remote prefetching is improving I/O performance when receiving data from a high speed Ethernet, FibreChannel or Infiniband interface (the network device would be the remote agent in that case). This would be especially helpful if the transfer unit is large, as is the case for storage protocols such as iSCSI or FibreChannel, since the prefetch would hide latency for most of the data. To see how remote prefetch could improve performance, Figure 7 shows an example using a network device.



Figure 7 – Remote Prefetch for Network Traffic

On the left is a traditional system, which is accessing 4KB of data over a SAN. It receives a packet of data through the network interface, and then issues a write-invalidate snoop for a cache line to all caching agents in the system. A cache line in memory is allocated, and the data is stored through the I/O hub. This repeats until all 4KB of data has been written into the memory, at which point the I/O device issues an interrupt to the processor. Then, the processor requests the data from memory and snoops all the caching agents; lastly it reads the memory into the cache and begins to use the data.

In a system using remote prefetch, the network adapter begins receiving data and the packet headers indicate that the data payload is 4KB. The network adapter then sends a prefetch directive through the I/O hub to the processor’s cache, which responds by granting direct cache access. The I/O hub will issue a write-invalidate snoop for each cache line written, but instead of storing to memory, the data is placed directly in the processor’s cache in the modified state. When all the data has been moved, the I/O hub sends an interrupt and the processor begins operating on the data already in the cache. Compared to the previously described method, remote prefetching demonstrates several advantages. First, it eliminates all of the snoop requests by the processor to read the data from memory to the cache. Second, it reduces the load on the main memory system (especially if the processors can stream results back to the network adapter) and modestly decreases latency.

While the most obvious application of the patented technique is for network I/O, remote prefetch could work with any part of a system that does caching. For instance, in a multiprocessor system, remote prefetch between different processors, or even coprocessors or acceleration devices is quite feasible. It is unclear whether this feature will be made available to coprocessor vendors and other partners, but it would certainly be beneficial for Intel and a promising sign for a more open platform going forward.


Speculations on CSI Clients


While the technical details of CSI are well documented in various Intel patents, there is relatively little information on future desktop or mobile implementations. These next two sections make a transition from fairly solid technical details into the realm of educated, but ultimately speculative predictions.

CSI will modestly impact the desktop and mobile markets, but may not bring any fundamental changes. Certain Intel patents seem to imply that discrete memory controllers will continue to be used with some MPUs [9]. In all likelihood, Intel will offer several different product variations based on the target market. Some versions will use integrated memory controllers, some will offer an on-package northbridge and some will probably have no system integration at all.

Intel has a brisk chipset business on both the desktop and notebook side that keeps older fabs effectively utilized – an essential element of Intel’s capital strategy. If Intel were to integrate the northbridge in all MPUs, it would force the company to find other products which can use older fabs, or shutter some of the facilities. Full integration also increases the pin count for each MPU, which increases the production costs. While an integrated memory controller increases performance by reducing latency, many products do not need the extra performance, nor is it always desirable from a marketing perspective.

For technical reasons, an integrated memory controller can also be problematic. Integrated graphics controllers share main memory to reduce cost. As a result, integrated graphics substantially benefits from sharing a die with the memory controller, as it does currently for Intel based systems. However, integrating graphics on the processor seems a little aggressive for a company that has yet to produce an on-die memory controller, and is a waste of cutting edge silicon – most high performance systems simply do not use integrated graphics.

Intel’s desktop version of Nehalem is code-named Bloomfield, and it seems clear that the high performance MPUs, which are targeted at gamers, will feature on-die memory controllers. The performance benefits of reducing memory latency will probably be used as product differentiation by Intel to encourage gamers to move to the Extreme Edition line and justify the higher prices. However, on-die or on-package graphics is unlikely given that most OEMs will use higher performance discrete solutions from NVIDIA or AMD. The width of the CSI connection between the MPU and the chipset may be another differentiating factor. While a half-width link will work for mid-range systems, high-end gaming systems will require more bandwidth. Modern high performance GPUs use PCI-E x16 slots, which provide 4GB/s in each direction. Hence, it is quite conceivable that by 2009 a pair of high-end GPUs would require ~16GB/s in each direction. Given that gaming systems often stress graphics, network and disk, a full width CSI link may be required to provide appropriate performance.

Other desktop parts based on Bloomfield will focus on low cost and greater integration. It is very likely that these MPUs will be connected via CSI to a second die containing a memory controller and integrated graphics, all packaged inside a single MCM. A CSI link (probably half-width) would connect the northbridge to the rest of the chipset. This solution would let Intel use older fabs to produce the northbridge, and would enable more manufacturing flexibility – each component could be upgraded individually with fewer dependencies between the two. Intel will probably also produce an MPU with no integrated system features, which will let OEMs use chipsets from 3rd party vendors, such as NVIDIA, VIA and SiS.

Gilo, the mobile proliferation of Nehalem, will face many of the same issues as desktop processors, but also some that are unique to the notebook market. Mobile MPUs do not really need the lower latency; in many situations they sacrifice performance by only populating a single channel of memory, or operating at relatively slow transfer rates. An integrated memory controller would also require a separate voltage plane from the cores, hence systems would need an additional VRM on the motherboard. The clock distribution would also need to be designed so that the cores can vary frequency independently of the memory controller. Consequently, an on-die memory controller is unlikely because of the lack of benefits and additional complexity.

The implementations for Gilo will most likely resemble the mid-range and low-end desktop product configuration. The more integrated products will feature the northbridge and graphics in the same package as the MPU, connected by CSI. A more bare-bones MPU would also be offered for OEMs that prefer higher performance discrete graphics, or wish to use alternative chipsets.

While the system architecture for Intel’s desktop and mobile offerings will change a bit, the effects will probably be more subtle. The majority of Intel MPUs will still require external memory controllers, but they will be integrated on the MPU package itself. This will not fundamentally improve Intel’s performance relative to AMD’s desktop and mobile offerings. However, it will make Intel’s products more attractive to OEMs, since the greater integration will reduce the number of discrete components on system boards and lower the overall cost. In many ways the largest impact will be on the graphics vendors – since it will make all their solutions (both integrated and discrete) more expensive relative to a single MCM from Intel.


Speculations on CSI Servers


In the server world, CSI will be introduced in tandem with an on-die memory controller. The impact of these two modifications will be quite substantial, as they address the few remaining shortcomings in Intel’s overall server architecture and substantially increase performance. This performance improvement comes from two places: the integrated memory controller will lower memory latency, while the improved interconnects for 2-4 socket servers will increase bandwidth and decrease latency.

To Intel, the launch of a broad line of CSI based systems will represent one of the best opportunities to retake server market share from AMD. New systems will use the forthcoming Nehalem microarchitecture, which is a substantially enhanced derivative of the Core microarchitecture, and features simultaneous multithreading and several other enhancements. Historically speaking, new microarchitectures tend to win the performance crown and presage market share shifts. This happened with the Athlon, the Pentium 4, Athlon64/Opteron, and the Core 2 and it seems likely this trend will continue with Nehalem. The system level performance benefits from CSI and integrated memory controllers will also eliminate Intel’s two remaining glass jaws: the older front side bus architecture and higher memory latency.

The single-processor server market is likely where CSI will have the least impact. For these entry level servers, the shared front side bus is not a substantial problem, since there is little communication compared to larger systems. Hence, the technical innovations in CSI will have relatively little impact in this market. AMD also has a much smaller presence in this market, because their advantages (which are similar to the advantages of CSI) are less pronounced. Clearly, AMD will try to make inroads into this market; if the market responds positively to AMD’s solution that may hint at future reactions to CSI.

Currently in the two socket (DP) server market, Intel enjoys a substantial performance lead for commercial workloads, such as web serving or transaction processing. Unfortunately, Intel’s systems are somewhat handicapped because they require FB-DIMMs, which use an extra 5-6 watts per DIMM and cost somewhat more than registered DDR2. This disadvantage has certainly hindered Intel in the last year, especially with customers who require lots of memory or extremely low power systems. While Intel did regain some server market share, AMD’s Opteron is still the clear choice for almost all high performance computing, where the superior system architecture provides more memory and processor communication bandwidth. This advantage has been a boon for AMD, as the HPC market is the fastest growing segment within the overall server market.

Gainestown, the first CSI based Xeon DP, will arrive in the second half of 2008, likely before any of the desktop or mobile parts. In the dual socket market, CSI will certainly be welcome and improve Intel’s line up, featuring 2x or more the bandwidth of the previous generation, but the impact will not be as pronounced as for MP systems. Intel’s dual socket platforms are actually quite competitive because the product cycles are shorter, meaning more frequent upgrades and higher bandwidth. Intel’s current Blackford and Seaburg chipsets, with dual front side buses and snoop filters, offer reasonable bandwidth, although at the cost of slightly elevated power and thermal requirements. This too shall pass; it appears that dual socket systems will shift back to DDR3, eliminating the extra ~5W penalty for each FB-DIMM [12]. This will improve Intel’s product portfolio and put additional pressure on AMD, which is still benefitting from the FB-DIMM thermal issues. The DP server market is currently fairly close to ‘equilibrium’; AMD and Intel have split the market approximately along the traditional 80/20 lines. Consequently, the introduction of CSI systems will enhance Intel’s position, but will not spark massive shifts in market share.

The first Xeon MP to use CSI will debut in the second half of 2009, lagging behind its smaller system counterparts by an entire year. Out of all the x86 product families using CSI, Beckton will have the biggest impact, substantially improving Intel’s position in the four socket server market. Beckton will offer roughly 8-10x the bandwidth of its predecessor, dramatically improving performance. The changes in system architecture will also dramatically reduce latency, which is a key element of performance for most of the target workloads, such as transaction processing, virtualization and other mission critical applications. Since the CSI links are point-to-point, they eliminate one chip and one interconnect crossing, which will cut the latency between processors in half, or better. The integrated memory controller in Beckton will similarly reduce latency, since it also removes an extra chip and interconnect crossing.

Intel’s platform shortcomings created a weakness that AMD exploited to gain significant market share. It is estimated that Intel currently holds as little as 50% of the market for MP servers, compared to roughly 75-80% of the overall market. When CSI-based MP platforms arrive in 2009, Intel will certainly try to bring their market share back in-line with the overall market. However, Beckton will be competing against AMD’s Sandtiger, a 45nm server product with 8-16 cores also slated for 2009. Given that little is known about the latter, it is difficult to predict the competitive landscape.



Itanium and CSI


CSI will also be used for Tukwila, a quad-core Itanium processor due in 2008. Creating a common infrastructure for Itanium and Xeon based systems has been a goal for Intel since 2003. Because the economic and technical considerations for these two products are different, they will not be fully compatible. However, the vast majority of the two interconnects will be common between the product lines.

One goal of a common platform for Itanium and Xeon is to share (and therefore better amortize) research, development, design and validation costs, by re-using components across Intel's entire product portfolio. Xeon and Xeon MP products ship in the tens of millions each year, compared to perhaps a million for Itanium. If the same components can be used across all product lines, the non-recurring engineering costs for Itanium will be substantially reduced. Additionally, the inventory and supply chain management for both Intel and its partners will be simplified, since some chipset components will be interchangeable.

Just as importantly, CSI and an integrated memory controller will substantially boost the performance of the Itanium family. Montvale, which will be released at the end of 2007, uses a 667MHz bus that is 128 bits wide – a total of 10.6GB/s of bandwidth. This pales in comparison to the 300GB/s that a single POWER6 processor can tap into. While bandwidth is only one factor that determines performance, a 30x difference is substantial by any measure. When Tukwila debuts in 2008, it will go a long way towards leveling the playing field. Tukwila will offer 120-160GB/s between MPUs (5 CSI links at 4.8-6.4GT/s), and multiple integrated FB-DIMM controllers. The combination of doubling the core count, massively increasing bandwidth and reducing latency should prove compelling for Itanium customers and will likely cause a wave of upgrades and migrations similar to the one triggered by the release of Montecito in 2006.
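
The arithmetic behind those figures is easy to check; the sketch below reproduces the 10.6GB/s Montvale number and, assuming five full-width (20-lane) links counted in both directions, the 120-160GB/s range quoted for Tukwila.

```python
# Worked bandwidth arithmetic for the figures quoted above.

# Montvale: 128-bit front side bus at 667MT/s.
print(0.667e9 * 128 / 8 / 1e9)              # ~10.7 GB/s

# Tukwila (assumed): 5 full-width links, 20 bit lanes per direction,
# both directions counted, at 4.8 and 6.4GT/s.
for gts in (4.8e9, 6.4e9):
    print(5 * 20 * 2 * gts / 8 / 1e9)       # 120 and 160 GB/s
```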


Conclusion


The success of the Pentium Pro and its lineage captured the multi-billion dollar RISC workstation and low-end server market, but that success also created inertia around the bus interface. Politics within the company and with existing partners, OEMs and customers conspired to keep Intel content with the status quo. Unfortunately for Intel, AMD was not content to play second fiddle forever. The Opteron took a portion of the server market, largely by virtue of its superior system architecture and Intel’s simultaneous weakness with the Pentium 4 (Prescott) microarchitecture. While Intel might be prone to internal politics, when an external threat looms large, everything is thrown into high gear. The industry saw that with the RISC versus CISC debate, where Intel P6 engineers hung ads from the now friendly Apple in their cubes for competitive inspiration. The Core microarchitecture, Intel’s current flag bearer, was similarly the labor of a company under intense competitive pressure.

While Intel had multiple internal projects working on a next generation interconnect, the winning design for CSI was the result of collaboration between Intel veterans from Hillsboro, Santa Clara and other sites, as well as the architects who worked on DEC’s Alpha architecture. The EV7, the last Alpha, stands out for having the best system interconnect of its time, and certainly influenced the overall direction for CSI. The CSI design team was given a set of difficult, but not impossible goals: design an interconnect family that would span the range of Intel’s performance oriented computational products, from the affordable Celeron to the high-end Xeon MP and Itanium. The results were delayed, largely due to the cancellation of Whitefield, a quad core x86 processor, and the rescheduling and evisceration of Tanglewood, née Tukwila. However, Tukwila and Nehalem will feature CSI when they debut in the next two years, and the world will be able to judge the outcome.

CSI will be a turning point for the industry. In the server world, CSI paired with an integrated memory controller, will erase or reverse Intel’s system architecture deficit to AMD. Intel’s microprocessors will need less cache because of the lower memory and remote access latency; the specs for Tukwila call for 6MB/core rather than the 12MB/core in Montecito. This in turn will free up more die area for additional cores, or more economical die sizes. These changes will put Intel on a more equal footing with AMD, which has had a leg up in system architecture with their integrated memory controller and HyperTransport. As a result, Intel will be in a good position to retake lost market share in the server world in 2008/9 when CSI based systems debut.

In some ways, CSI and integrated memory controllers are the last piece of the puzzle to get Intel’s servers back on track. The new Core microarchitecture has certainly proven to be a capable design, even when paired with the front side bus and a discrete memory controller. The multithreaded microarchitecture for Nehalem, coupled with an integrated memory controller and the CSI system fabric should be an even more impressive product. For Intel, 2008 will be a year to look forward to, thanks in no small part to the engineers who worked on CSI.


References


[1] Batson, B. et al. Messaging Protocol. US Patent Application 20050262250A1. November 24, 2005.
[2] Batson, B. et al. Cache Coherence Protocol. US Patent Application 20050240734A1. October 27, 2005.
[3] Hum, H. et al. Forward State for use in Cache Coherency in a Multiprocessor System. US Patent No. 6,922,756 B2. July 26, 2005.
[4] Hum, H. et al. Hierarchical Virtual Model of a Cache Hierarchy in a Multiprocessor System. US Patent Application 20040123045A1. June 24, 2004.
[5] Beers, R. et al. Non-Speculative Distributed Conflict Resolution for a Cache Coherency Protocol. US Patent No. 6,954,829 B2. October 11, 2005.
[6] Hum, H. et al. Speculative Distributed Conflict Resolution for a Cache Coherency Protocol. US Patent Application 20040122966A1. June 24, 2004.
[7] Hum, H. et al. Hierarchical Directories for Cache Coherency in a Multiprocessor System. US Patent Application 20060253657A1. November 9, 2006.
[8] Cen, Ling. Method, System, and Apparatus for a Credit Based Flow Control in a Computer System. US Patent Application 20050088967A1. April 28, 2005.
[9] Huggahalli, R. et al. Method and Apparatus for Initiating CPU Data Prefetches by an External Agent. US Patent Application 20060085602A1. April 20, 2006.
[10] Kanter, David. Intel’s Tukwila Confirmed to be Quad Core. Real World Technologies. May 5, 2006. http://www.realworldtech.com/page.cfm?NewsID=361&date=05-05-2006#361
[11] Rust, Adamson. Intel’s Stoutland to have Integrated Memory Controller. The Inquirer. February 1, 2007. http://www.theinquirer.net/default.aspx?article=37373
[12] Intel Thurley has Early CSI Interconnect. The Inquirer. February 2, 2007. http://www.theinquirer.net/default.aspx?article=37392
[13] Cherukuri, N. et al. Method and Apparatus for Periodically Retraining a Serial Links Interface. US Patent No. 7,209,907 B2. April 24, 2007.
[14] Cherukuri, N. et al. Method and Apparatus for Interactively Training Links in a Lockstep Fashion. US Patent Application 20050262184A1. November 24, 2005.
[15] Cherukuri, N. et al. Method and Apparatus for Acknowledgement-based Handshake Mechanism for Interactively Training Links. US Patent Application 20050262280A1. November 24, 2005.
[16] Cherukuri, N. et al. Method and Apparatus for Detecting Clock Failure and Establishing an Alternate Clock Lane. US Patent Application 20050261203A1. December 22, 2005.
[17] Cherukuri, N. et al. Method for Identifying Bad Lanes and Exchanging Width Capabilities of Two CSI Agents Connected Across a Link. US Patent Application 20050262284A1. November 24, 2005.
[18] Frodsham, T. et al. Method, System and Apparatus for Loopback Entry and Exit. US Patent Application 20060020861A1. January 26, 2006.
[19] Cherukuri, N. et al. Methods and Apparatuses for Resetting the Physical Layers of Two Agents Interconnected Through a Link-Based Interconnection. US Patent No. 7,219,220 B2. May 15, 2007.
[20] Mannava, P. et al. Method and System for Flexible and Negotiable Exchange of Link Layer Functional Parameters. US Patent Application 20070088863A1. April 19, 2007.
[21] Spink, A. et al. Interleaving Data Packets in a Packet-based Communication System. US Patent Application 20070047584A1. March 1, 2007.
[22] Chou, et al. Link Level Retry Scheme. US Patent No. 7,016,304 B2. March 21, 2006.
[23] Creta, et al. Separating Transactions into Different Virtual Channels. US Patent No. 7,165,131 B2. January 16, 2007.
[24] Cherukuri, N. et al. Technique for Lane Virtualization. US Patent Application 20050259599A1. November 24, 2005.
[25] Steinman, M. et al. Methods and Apparatuses to Effect a Variable-width Link. US Patent Application 20050259696A1. November 24, 2005.
[26] Kwa et al. Method and System for Deterministic Throttling for Thermal Management. US Patent Application 20060294179A1. December 28, 2006.
[27] Cherukuri, N. et al. Dynamically Modulating Link Width. US Patent Application 20060034295A1. February 16, 2006.
[28] Cherukuri, N. et al. Link Power Saving State. US Patent Application 20050262368A1. November 24, 2005.
[29] Lee, V. et al. Retraining Derived Clock Receivers. US Patent Application 20050022100A1. January 27, 2005.
[30] Fan, Yongping. Matched Current Delay Cell and Delay Locked Loop. US Patent No. 7,202,715 B1. April 10, 2007.
[31] Ayyar, M. et al. Method and Apparatus for System Level Initialization. US Patent Application 20060126656A1. June 15, 2006.
[32] Frodsham, T. et al. Method, System and Apparatus for Link Latency Management. US Patent Application 20060168379A1. July 27, 2006.
[33] Frodsham, T. et al. Technique to Create Link Determinism. US Patent Application 20060020843A1. January 26, 2006.
[34] Spink. A. et al. Buffering Data Packets According to Multiple Flow Control Schemes. US Patent Application 20070053350A1. March 8, 2007.
[35] Cen, Ling. Arrangements Facilitating Ordered Transactions. US Patent Application 20040008677A1. January 15, 2004.
[36] Creta, et al. Transmitting Peer-to-Peer Transactions Through a Coherent Interface. US Patent No. 7,210,000 B2. April 24, 2007.
[37] Hum, H. et al. Globally Unique Transaction Identifiers. US Patent Application 20050251599A1. November 10, 2005.
[38] Cen, Ling. Two-hop Cache Coherency Protocol. US Patent Application 20070022252A1. January 25, 2007.
[39] Ayyar, M. et al. Method and Apparatus for Dynamic Reconfiguration of Resources. US Patent Application 20060184480A1. August 17, 2006.