Sep 30, 2007

The Common System Interface: Intel's Future Interconnect

By: David Kanter (dkanter@realworldtech.com)
Updated: 08-28-2007

Introduction


In the competitive x86 microprocessor market, there are always swings and shifts based on the latest introductions from the two main protagonists: Intel and AMD. The next anticipated shift is coming in 2008-9 when Intel will finally replace their front side bus architecture. This report details Intel’s next generation system interconnect and the associated cache coherency protocol, the likely deployment plans across the desktop, notebook and server markets, and the economic implications.

Intel’s front-side bus has a long history that dates back to 1995 with the release of the Pentium Pro (P6). The P6 was the first processor to offer cheap and effective multiprocessing support; up to four CPUs could be connected to a single shared bus with very little additional effort for an OEM. The importance of cheap and effective multiprocessing cannot be overstated. Before the P6, multiprocessor systems used special chipsets and usually a proprietary variant of UNIX; consequently they were quite expensive. Initially, Intel’s P6 could not always match the performance of these high end systems from the likes of IBM, DEC or Sun, but the price was so much lower that the performance gap became a secondary consideration. The workstation and low-end server markets embraced the P6 precisely because the front-side bus enabled inexpensive multiprocessors.

Ironically, the P6 bus was the subject of considerable controversy at Intel. It was originally based on the bus used in the i960 project, and the designers came under pressure from various corporate factions to re-use the bus from the original Pentium so that OEMs would not have to redesign and validate new motherboards, and so end users could easily upgrade. However, the Pentium bus was strictly in-order and could only have a single memory access in flight at a time, making it entirely inadequate for an out-of-order microprocessor like the P6 that would have many simultaneous memory accesses. Ultimately a compromise was reached that preserved most of the original P6 bus design, and the split-transaction P6 bus is still being used in new products more than a decade after the design was started. The next step for Intel’s front side bus was the shift to the P4 bus, which was electrically similar to the P6 bus and issued commands at roughly the same rate, but clocked the data bus four times faster to provide fairly impressive throughput.

While the inexpensive P4 bus is still in use for Intel’s x86 processors, the rest of the world moved on to newer point-to-point interconnects rather than shared buses. Compared to systems based on HP’s EV7 and, more importantly, AMD’s Opteron, Intel’s front-side bus shows its age; it simply does not scale as well. Intel’s own Xeon and Xeon MP chipsets illustrate the point quite well, as both use two separate front-side bus segments in order to provide enough bandwidth to feed all the processors. Similarly, Intel designed all of their MPUs with relatively large caches to reduce the pressure on the front-side bus and memory systems, exemplified by the Xeon MP and Itanium 2, which sport 16MB and 24MB of L3 cache respectively. While some critics claim that Intel is pushing an archaic solution and patchwork fixes on the industry, the truth is that this is simply a replay of the issues surrounding the Pentium versus P6 bus debate writ large. The P4 bus is vastly simpler and less expensive than a higher performance, point-to-point interconnect, such as HyperTransport or CSI. After 10 years of shipping products, there is a massive amount of knowledge and infrastructure invested in the front-side bus architecture, both at Intel and at strategic partners. Tossing out the front-side bus will force everyone back to square one. Intel opted to defer this transition by increasing cache sizes, adding more bus segments and including snoop filters to create competitive products.

While Intel’s platform engineers devised more and more creative ways to improve multiprocessor performance using the front-side bus, a highly scalable next generation interconnect was being jointly designed by engineers from teams across Intel and some of the former Alpha designers acquired from Compaq. This new interconnect, known internally as the Common System Interface (CSI), is explicitly designed to accommodate integrated memory controllers and distributed shared memory. CSI will be used as the internal fabric for almost all future Intel systems, starting with Tukwila, an Itanium processor, and Nehalem, an enhanced derivative of the Core microarchitecture, both slated for 2008. Not only will CSI be the cache coherent fabric between processors, but versions of it will be used to connect I/O chips and other devices in Intel based systems.

The design goals for CSI are rather intimidating. Roughly 90% of all servers sold use four or fewer sockets, and that is where Intel faces the greatest competition. For these systems, CSI must provide low latency and high bandwidth while keeping costs attractive. On the other end of the spectrum, high-end systems using Xeon MP and Itanium processors are intended for mission critical deployment and require extreme scalability and reliability for configurations as large as 2048 processors, but customers are willing to pay extra for those benefits. Many of the techniques that larger systems use to ensure reliability and scalability are more complicated than necessary for smaller servers (let alone notebooks or desktops), producing somewhat contradictory objectives. Consequently, it should be no surprise that CSI is not a single implementation, but rather a closely related family of implementations that will serve as the backbone of Intel’s architectures for the coming years.


Physical Layer


Unlike the front-side bus, CSI is a cleanly defined, layered network fabric used to communicate between various agents. These ‘agents’ may be microprocessors, coprocessors, FPGAs, chipsets, or generally any device with a CSI port. There are five distinct layers in the CSI stack, from lowest to highest: Physical, Link, Routing, Transport and Protocol [27]. Table 1 below describes the different layers and responsibilities of each layer.



Table 1 - Common System Interface Layers

While all five layers are clearly defined, they are not all necessary. For example, the routing layer is optional in less complex systems, such as a desktop, where there are only two CSI agents (the MPU and chipset). Similarly, in situations where all CSI agents are directly connected, the transport layer is redundant, as end-to-end reliability is equivalent to link layer reliability.

CSI is defined as a variable width, point-to-point, packet-based interface implemented as two uni-directional links with low-voltage differential signaling. A full width CSI link is physically configured with 20 bit lanes in each direction; these bit lanes are divided into four quadrants of 5 bit lanes, as depicted in Figure 1 [25]. While most CSI links are full width, half width (2 quadrants) and quarter width (a single quadrant) options are also possible. Reduced width links will likely be used for connecting MPUs and chipset components. Additionally, some CSI ports can be bifurcated, so that they can connect to two different agents (for example, so that an I/O hub can directly connect to two different MPUs) [25]. The width of the link determines the physical unit of information transfer, or phit, which can be 5, 10 or 20 bits.
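
To make the width options concrete, here is a small Python sketch of the relationship between active quadrants and phit size described above; it is purely illustrative and not derived from any Intel code.

    # Hypothetical sketch: CSI link widths and the resulting phit size.
    # A full width link has 20 lanes per direction, organized as 4 quadrants
    # of 5 lanes; half and quarter width links use 2 quadrants or 1 quadrant.
    LANES_PER_QUADRANT = 5

    def phit_bits(active_quadrants: int) -> int:
        """Return the phit size in bits for a link using 1, 2 or 4 quadrants."""
        if active_quadrants not in (1, 2, 4):
            raise ValueError("CSI links use 1, 2 or 4 quadrants")
        return active_quadrants * LANES_PER_QUADRANT

    for quads in (4, 2, 1):
        print(f"{quads} quadrant(s): {phit_bits(quads)}-bit phit")
    # -> 20-bit, 10-bit and 5-bit phits for full, half and quarter width links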



Figure 1 - Anatomy of a CSI Link

In order to accommodate various link widths (and hence phit sizes) and bit orderings, each nibble of output is muxed on-chip before being transmitted across the physical transmission pins, and the inverse is done on the receive side [25]. The nibble muxing eliminates trace length mismatches, which reduces skew and improves performance. To support port bifurcation efficiently, the bit lanes are swizzled to avoid excessive wire crossings, which would require additional layers in motherboards. Together, these two techniques permit a CSI port to reverse the pins (i.e. send output for pin 0 to pin 19, etc.), which is needed when the processor sockets are mounted on both sides of a motherboard.

CSI is largely defined in a way that does not require a particular clocking mechanism for the physical layer. This is essential to balance current latency requirements, which tend to favor parallel interfaces, against future scalability, which requires truly serial technology. Clock encoding and clock and data recovery are prerequisites for optical interconnects, which will eventually be used to overcome the limitations of copper. By specifying CSI in an expansive fashion, the architects created a protocol stack that can naturally be extended from a parallel implementation over copper to optical communication.

Initial implementations appear to use clock forwarding, probably with one clock lane per quadrant to reduce skew and enable certain power saving techniques [16] [19] [27]. While some documents reference a single clock lane for the entire link, this seems unlikely as it would require much tighter skew margins between different data lanes. This would result in more restrictive board design rules and more expensive motherboards.

When a CSI link first boots up, it goes through a handshake based physical layer calibration and training process [14] [15]. Initially, the link is treated like a collection of independent serial lanes. Then the transmitter sends several specially designed phit patterns that will determine and communicate back the inter-lane skew and detect any lane failures. This information is used to train the receiver circuitry to compensate for skew between the different lanes that may arise due to different trace lengths, and process, temperature and voltage variation. Once the link has been trained, it begins to operate as a parallel interface, and the circuitry used for training is shut down to save power. The link and any de-skewing circuitry will also be periodically recalibrated, based on a timing counter; according to one patent, this counter triggers every 1-10ms [13]. When the retraining occurs, all higher level functionality, including flow control and data transmission, is temporarily halted. This skew compensation enables motherboard designs with less restrictive design rules for trace length matching, which are less expensive as a result.
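
The deskew step can be pictured with a short sketch: given each lane's measured arrival offset for the training pattern, the receiver delays the early lanes until they align with the slowest one. The offsets, units and function below are invented for illustration only.

    # Conceptual sketch of receiver-side lane deskew (illustrative only).
    # measured_offsets: arrival time of the training pattern on each lane,
    # expressed in unit intervals (UI), as observed by the receiver.
    def compute_deskew_delays(measured_offsets):
        """Delay each early lane so that all lanes align with the slowest one."""
        slowest = max(measured_offsets)
        return [slowest - offset for offset in measured_offsets]

    # Example: a 5-lane quadrant where lane 2 arrives 2 UI later than lane 0.
    offsets = [0.0, 0.5, 2.0, 1.0, 0.25]   # hypothetical measurements
    print(compute_deskew_delays(offsets))  # [2.0, 1.5, 0.0, 1.0, 1.75]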

It appears some variants of CSI can designate data lanes as alternate clocking lanes, in case of a clock failure [16]. In that situation, the transmitter and receiver would disable the failed clock lane, and probably that lane’s whole quadrant, and the link would re-initialize at reduced width, using the alternate clock lane for clock forwarding, albeit with reduced data bandwidth. The advantage is that clock failures are no longer fatal; they gracefully degrade service in the same manner as a data lane failure, which can be handled through virtualization techniques in the link layer.

Initial CSI implementations in Intel’s 65nm and 45nm high performance CMOS processes target 4.8-6.4GT/s operation, thus providing 12-16GB/s of bandwidth in each direction and 24-32GB/s for each link [30] [33]. Compared to the parallel P4 bus, CSI uses vastly fewer pins running at much higher data rates, which not only simplifies board routing, but also makes more CPU pins available for power and ground.


Link Layer


The CSI link layer is concerned with reliably sending data between two directly connected ports, and with virtualizing the physical layer. Protocol packets from the higher layers of CSI are transmitted as a series of 80 bit flow control units (flits) [25]. Depending on the width of the physical link, transmitting each flit takes either 4, 8 or 16 cycles. A single flit can contain up to 64 bits of data payload; the remaining 16 bits in the header are used for flow control, packet interleave, virtual networks and channels, error detection and other purposes [20] [22]. A higher level protocol packet can consist of as little as a single flit, for control messages, power management and the like, but could include a whole cache line, which is currently 64B for x86 MPUs and 128B for IPF.
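
The arithmetic behind those figures is easy to check; the short Python sketch below simply restates the 80-bit flit, 64-bit payload and 5/10/20-bit phit numbers given above.

    # Back-of-the-envelope flit arithmetic using the figures quoted above.
    FLIT_BITS = 80
    PAYLOAD_BITS_PER_FLIT = 64

    def phits_per_flit(phit_bits: int) -> int:
        """Phits (and hence transfer cycles) needed to move one 80-bit flit."""
        return FLIT_BITS // phit_bits

    def data_flits_for(payload_bytes: int) -> int:
        """Data flits needed to carry a payload (header flits not counted)."""
        payload_bits = payload_bytes * 8
        return -(-payload_bits // PAYLOAD_BITS_PER_FLIT)   # ceiling division

    for width in (20, 10, 5):
        print(f"{width}-bit phit: {phits_per_flit(width)} cycles per flit")
    print("64B cache line:", data_flits_for(64), "data flits")    # 8
    print("128B cache line:", data_flits_for(128), "data flits")  # 16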

Flow control and error detection/correction are part of the CSI link layer, and operate between each transmitter and receiver pair. CSI uses a credit based flow control system to avoid overwhelming receive buffers and creating quality-of-service problems, while CRCs detect transmission errors [22] [34]. CSI links have a number of virtual channels, which can form different virtual networks [8]. These virtual channels are used to ensure deadlock free routing, and to group traffic according to various characteristics, such as transaction size or type, coherency, ordering rules and other information [23]. These particular details are intertwined with other aspects of CSI-based systems and are discussed later. To reduce the storage requirements for the different virtual channels, CSI uses two-level adaptive buffering. Each virtual channel has a small dedicated buffer, and all channels share a larger buffer pool [8].

Under ordinary conditions, the transmitter will first acquire enough credits to send an entire packet; as previously noted, this could be anywhere from 1 to 18 or more flits. The flits will be transmitted to the receiver, and also copied into a retry buffer. Every flit is protected by an 8 bit CRC (or 16 bits in some cases), which will alert the receiver to corruption or transmission errors. When the receiver gets the flits, it will compute the CRC to check that the data is correct. If everything is clear, the receiver will send an acknowledgement (and credits) with the next flit that goes from the receiver side to the transmitter side (remember, there are two uni-directional links). The transmitter will then clear the flits out of the retry buffer. If the CRC indicates an error, the receiver side will send a link layer retry request to the transmitter side. The transmitter then resends the contents of the retry buffer until the flits have been correctly received and acknowledged. Figure 2 below shows an example of several transactions occurring across a CSI link using the flow control counters.
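
A minimal sketch of this credit-plus-retry scheme is shown below. It is a toy model, not the real link layer: the CRC width, the credit accounting and the acknowledgement path are simplified stand-ins, and corruption is simulated by deliberately garbling one flit on the wire.

    # Toy model of credit-based flow control with a retry buffer and per-flit CRC.
    # All formats here are simplified placeholders, not actual CSI link layer fields.
    import zlib
    from collections import deque

    CRC_MASK = 0xFF   # stand-in for the 8-bit flit CRC

    class CsiLinkModel:
        """One direction of a link: transmitter credits, retry buffer, receiver."""
        def __init__(self, receiver_buffers=8):
            self.credits = receiver_buffers     # credits held by the transmitter
            self.retry_buffer = deque()         # flits awaiting acknowledgement
            self.received = []                  # flits accepted by the receiver

        def transmit(self, flit_payloads, corrupt_index=None):
            if len(flit_payloads) > self.credits:
                raise RuntimeError("not enough credits for the whole packet")
            self.credits -= len(flit_payloads)
            for i, payload in enumerate(flit_payloads):
                crc = zlib.crc32(payload) & CRC_MASK
                self.retry_buffer.append(payload)          # copy kept for retry
                on_wire = b"garbled" if i == corrupt_index else payload
                self._receive(on_wire, crc)

        def _receive(self, payload, crc):
            if (zlib.crc32(payload) & CRC_MASK) != crc:
                # CRC mismatch: the receiver requests a link layer retry and the
                # transmitter replays whatever is still in its retry buffer.
                for original in list(self.retry_buffer):
                    self._receive(original, zlib.crc32(original) & CRC_MASK)
                return
            self.received.append(payload)
            self.retry_buffer.popleft()    # acknowledgement clears the retry copy
            self.credits += 1              # credit returned along with the ack

    link = CsiLinkModel(receiver_buffers=4)
    link.transmit([b"flit0", b"flit1", b"flit2"], corrupt_index=1)
    print(link.received)   # [b'flit0', b'flit1', b'flit2'] despite the bad flit
    print(link.credits)    # back to 4 once everything is acknowledged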



Figure 2 - Flow Control Example [34]

While CSI’s flow control mechanisms will prevent serious contention, they do not necessarily guarantee low latency. To ensure that high priority control packets are not blocked by longer latency data packets, CSI incorporates packet interleaving in the link layer [21]. A bit in the flit header indicates whether the flit belongs to a normal or interleaved packet. For example, if a CSI agent is sending a 64B cache line (8+ flits) and it must send out a cache coherency snoop, rather than delaying the snoop by 8 or more cycles, it could interleave the snoop. This would significantly improve the latency of the snoop, while barely slowing down the data transmission. Similarly, this technique could be used to interleave multiple data streams so that they arrive in a synchronized fashion, or simply to reduce the variance in packet latency.

The link layer can also virtualize the underlying physical bit lanes. This is done by turning off some of the physical transmitter lanes, and assigning these bit lanes either a static logical value, or a value based on the remaining bits in each phit [24]. For example, a failed data lane could be removed, and replaced by one of the lanes which sends CRC, thus avoiding any data pollution or power consumption as a result. The link would then continue to function with reduced CRC protection, similar to the failover mechanisms for FB-DIMMs.

Once the physical layer has been calibrated and trained, as discussed previously, the link layer goes through an initialization process [31]. The link layer is configured to auto-negotiate and exchange various parameters, which are needed for operation [20]. Table 2 below is a list of some (but not necessarily all) of the parameters that each link will negotiate. The process starts with each side of the link assuming the default values, and then negotiating which values to actually use during normal operation. The link layer can also issue an in-band reset command, which stops the clock forwarding, and forces the link to recalibrate the physical layer and then re-initialize the link layer.



Table 2 - CSI Link Layer Parameters and Values [20]

Most of these parameters are fairly straightforward. The only one that has not been discussed is the agent profile. This field characterizes the role of the device and contains other link level information, which is used to optimize for specific roles. For example, a “mobile” profile agent would likely have much more aggressive power saving features than a desktop part. Similarly, a server agent might disable some prefetch techniques that are effective for multimedia workloads but tend to reduce performance for more typical server applications. Additionally, the two CSI agents will communicate what ‘type’ each one belongs to. Some different types would include caching agents, memory agents, I/O agents, and other agents that are defined by the CSI specification.


Power Saving Techniques


Power and heat have become first order constraints in modern MPU design, and are equally important in the design of interconnects. Thus it should come as no surprise that CSI has a variety of power saving techniques, which tend to span both the physical and link layer.

The most obvious technique to reduce power is using reduced or low-power states, just as in a microprocessor. CSI incorporates at least two different power states which boost efficiency by offering various degrees of power reduction and wake-up penalties for each side of a link [17]. The intermediate power state, L0s, saves some power, but has a relatively short wake-up time to accommodate brief periods of inactivity [28]. There are several triggers for entering the L0s state; they can include software commands, an empty or near empty transaction queue (at the transmitter), a protocol message from an agent, etc. When the CSI link enters the L0s state, an analog wake-up detection circuit is activated, which monitors for any signals which would trigger an exit from L0s. Additionally, a wake-up can be caused by flits entering the transaction queue.

During normal operation, even if one half of the link is inactive, it will still have to send idle flits to the other side to maintain flow control and provide acknowledgement that flits are being received correctly. In the L0s state, the link stops flow control temporarily and can shut down some, but not all, of the circuitry associated with the physical layer. Circuits are powered down based on whether or not they can be woken up within a predetermined period of time. This wake-up timer is configurable and the value likely depends upon factors such as the target market (mobile, desktop or server) and power source (AC versus battery). For instance, the bit lanes can generally be held in an electrical idle so they do not consume any power. However, the clock recovery circuits (receiver side PLLs or DLLs) must be kept active and periodically recalibrated. This ensures that when the link is activated, no physical layer initialization is required, which keeps the wake up latency relatively low. Generally, increasing the timer would improve the power consumption in L0s, but could negatively impact performance. Intel’s patents indicate that the wake-up timer can be set as low as 20ns, or roughly 96-128 cycles [19].
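
The quoted 20ns wake-up window is consistent with the 96-128 cycle figure at the transfer rates mentioned earlier; a quick check:

    # Sanity check: 20 ns expressed in transfer cycles at 4.8-6.4 GT/s.
    wake_up_ns = 20
    for transfer_rate_gt_per_s in (4.8, 6.4):
        cycles = wake_up_ns * transfer_rate_gt_per_s   # ns * GT/s = transfers
        print(f"{transfer_rate_gt_per_s} GT/s: {cycles:.0f} transfer cycles")
    # -> 96 cycles at 4.8 GT/s and 128 cycles at 6.4 GT/s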

For more dramatic power savings, CSI links can be put into a low power state. The L1 state is optimized specifically for the lowest possible power, without regard for the wake-up latency, as it is intended to be used for prolonged idle periods. The biggest difference between the L0s and L1 states is that in the latter, the DLLs or PLLs used for clock recovery and skew compensation are turned off. This means that the physical layer of the link must be retrained when it is turned back on, which is fairly expensive in terms of latency, roughly 10us [17]. However, the benefit is that the link barely dissipates any power when in the L1 state. Figure 3 shows a state diagram for the links, including various resets and the L1 and L0s states.



Figure 3 - CSI Initialization and Power Saving State Diagram [19]

Another power saving trick for CSI addresses situations where the link is underutilized, but must remain operational. Intel’s engineers designed CSI so that the link width can be dynamically modulated [27]. This is not too difficult, since the physical link between two CSI agents can vary between 5, 10 and 20 bits wide and the link layer must be able to efficiently accommodate each configuration. The only additional work is designing a mechanism to switch between full, half and quarter-width and ensuring that the link will operate correctly during and after a width change.

Note that width modulation is separate for each unidirectional portion of a link, so one direction might be wider to provide more bandwidth, while the opposite direction is mostly inactive. When the link layer is auto-negotiating, each CSI agent will keep track of the configurations supported by the other side (i.e. full width, half-width, quarter-width). Once the link has been established and is operating, each transmitter will periodically check to see if there is an opportunity to save power, or if more bandwidth is required.

If the link bandwidth is not being used, then the transmitter will select a narrower link configuration that is mutually supported and notify the receiver. Then the transmitter will modulate to a new width, and place the inactivated quadrants into the L0s or L1 power saving states, and the receiver will follow suit. One interesting twist is that the unused quadrants can be in distinct power states. For example, a full-width link could modulate down to a half-width link, putting quadrants 0 and 1 into the L1 state, and then later modulate down to a quarter-width link, putting quadrant 2 into the L0s state. In this situation, the link could respond to an immediate need for bandwidth by activating quadrant 2 quickly, while still saving a substantial amount of power.

If more bandwidth is required, the process is slightly more complicated. First, the transmitter will wake up its own circuitry, and also send out a wake-up signal to the receiver. However, because the wake-up is not instantaneous, the transmitter will have to wait for a predetermined and configurable period of time. Once this period has passed, and the receiver is guaranteed to be awake, then the transmitter can finally modulate to a wider link, and start transmitting data at the higher bandwidth.
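
Putting the last three paragraphs together, a simple sketch of per-direction width modulation might look like the following; the quadrant numbering, state names and policy are simplified for illustration and mirror the example above where idle quadrants can sit in different power states.

    # Illustrative sketch of link width modulation for one direction of a link.
    WIDTH_TO_QUADRANTS = {"full": 4, "half": 2, "quarter": 1}

    class TransmitDirection:
        def __init__(self):
            self.quadrant_state = ["active"] * 4      # one entry per quadrant

        def modulate(self, new_width, idle_state="L0s"):
            """Keep the highest numbered quadrants active and idle the rest."""
            active = WIDTH_TO_QUADRANTS[new_width]
            for q in range(4):
                if q >= 4 - active:
                    self.quadrant_state[q] = "active"
                elif self.quadrant_state[q] == "active":
                    self.quadrant_state[q] = idle_state   # newly idled quadrant
                # quadrants already idle keep whatever power state they were in

    tx = TransmitDirection()
    tx.modulate("half", idle_state="L1")        # quadrants 0 and 1 go to L1
    tx.modulate("quarter", idle_state="L0s")    # quadrant 2 goes to L0s
    print(tx.quadrant_state)                    # ['L1', 'L1', 'L0s', 'active']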

Most of the previously discussed power saving techniques are highly dynamic and difficult to predict. This means that engineers will naturally have to build in substantial guard-banding to guarantee correct operation. However, CSI also offers deterministic thermal throttling [26]. When a CSI agent reaches a thermal stress point, such as exceeding TDP for a period of time, or exceeding a specific temperature on-die, the overheating agent will send a thermal management request to other agents that it is connected to via CSI. The thermal management request typically includes a specified acknowledgement window and a sleep timer (these could be programmed into the BIOS, or dynamically set by the overheating agent). If the other agent responds affirmatively within the acknowledgement window, then both sides of the link will shut down for the specified sleep time. Using an acknowledgement window ensures that the other agent has the flexibility to finish in-flight transactions before de-activating the CSI link.


Coherency Leaps Forward at Intel


CSI is a switched fabric and a natural fit for cache coherent non-uniform memory architectures (ccNUMA). However, simply recycling Intel’s existing MESI protocol and grafting it onto a ccNUMA system is far from efficient. The MESI protocol complements Intel’s older bus-based architecture and elegantly enforces coherency. But in a ccNUMA system, the MESI protocol would send many redundant messages between different nodes, often with unnecessarily high latency. In particular, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. However, the requesting processor only needs a single copy of the data, so the system is wasting a bit of bandwidth.

Intel's solution to this issue is rather elegant. They adapted the standard MESI protocol to include an additional state, the Forwarding (F) state, and changed the role of the Shared (S) state. In the MESIF protocol, only a single instance of a cache line may be in the F state and that instance is the only one that may be duplicated [3]. Other caches may hold the data, but it will be in the shared state and cannot be copied. In other words, the cache line in the F state is used to respond to any read requests, while the S state cache lines are now silent. This makes the line in the F state a first among equals when responding to snoop requests. By designating a single cache line to respond to requests, coherency traffic is substantially reduced when multiple copies of the data exist.

When a cache line in the F state is copied, the F state migrates to the newer copy, while the older one drops back to S. This has two advantages over pinning the F state to the original copy of the cache line. First, because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. In essence, this takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.
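
The F-state migration just described can be captured in a few lines of Python. This toy model only covers clean read sharing (E and F copies); writes, dirty lines and the home node are deliberately out of scope, and the method names are invented for the sketch.

    # Toy model of MESIF read sharing: the single F copy answers snoops, and the
    # F designation migrates to the newest copy. Clean lines only; no writes.
    class CacheLineCopies:
        def __init__(self):
            self.states = {}    # node id -> MESIF state for this one cache line

        def read_request(self, requester):
            forwarder = next((n for n, s in self.states.items() if s in ("F", "E")),
                             None)
            if forwarder is None:
                self.states[requester] = "E"   # first cached copy, from memory
                return "memory"
            self.states[forwarder] = "S"       # old forwarder falls silent
            self.states[requester] = "F"       # newest copy answers future snoops
            return forwarder

    line = CacheLineCopies()
    print(line.read_request("cpu0"))   # 'memory' -> cpu0 holds the line in E
    print(line.read_request("cpu1"))   # 'cpu0'   -> cpu0: S, cpu1: F
    print(line.read_request("cpu2"))   # 'cpu1'   -> cpu1: S, cpu2: F
    print(line.states)                 # {'cpu0': 'S', 'cpu1': 'S', 'cpu2': 'F'}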

Figure 4 demonstrates the advantages of MESIF over the traditional MESI protocol, reducing two responses to a single response (acknowledgements are not shown). Note that a peer node is simply a node in the system that contains a cache.



Figure 4 - MESIF versus MESI Protocol

In general, MESIF is a significant step forward for Intel’s coherency protocol. However, there is at least one optimization which Intel did not pursue: the Owned (O) state used in the MOESI protocol (found in AMD’s Opteron). The O state is used to share dirty cache lines (i.e. lines that have been written to, where memory holds stale data), without writing back to memory.

Specifically, if a dirty cache line is in the M (modified) state, then another processor can request a copy. The dirty cache line switches to the Owned state, and a duplicate copy is made in the S state. As a result, any cache line in the O state must be written back to memory before it can be evicted, and the S state no longer implies that the cache line is clean. In comparison, a system using MESIF or MESI would change the cache line to the F or S state, copy it to the requesting cache and write the data back to memory; the O state avoids the write back, saving some bandwidth. It is unclear why Intel avoided using the O state in the newer coherency protocol for CSI; perhaps the architects decided that the performance gain was too small to justify the additional complexity.

Table 3 summarizes the different protocols and states for the MESI, MOESI and MESIF cache coherency protocols.



Table 3 - Overview of States in Snoop Protocols


A Two Hop Protocol for Low Latency


In a CSI system, each node is assigned a unique node ID (NID), which serves as an address on the network fabric. Each node also has a Peer Agent list, which enumerates the other nodes in the system that it must snoop when requesting data from memory (typically peers contain a cache, but could also be an I/O hub or device with DMA). Similarly, each transaction is assigned an identifier (TID) for tracking at each involved node. The TID, together with a destination and source NID, forms a globally unique transaction identifier [37]. The number of TIDs, and hence outstanding transactions, is limited, and will likely be one differentiating factor between Xeon DP, Xeon MP and Itanium systems. Table 4 describes the different fields that can be used in each CSI message, although some messages do not use all fields. For example, a snoop response from a processor that holds data in the shared state will not contain any data, just an acknowledgement.
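
As a concrete (and entirely hypothetical) picture of that identifier, the sketch below packs a destination NID, source NID and TID into one integer; the field widths are invented for illustration and would really be implementation specific.

    # Hypothetical layout of a globally unique CSI transaction identifier.
    from typing import NamedTuple

    class GlobalTransactionId(NamedTuple):
        dest_nid: int     # node that must act on the message
        source_nid: int   # node that allocated the transaction
        tid: int          # per-node transaction tracker index

    def pack(gtid, nid_bits=6, tid_bits=8):
        """Pack the identifier into a single integer (invented 6/6/8-bit layout)."""
        assert gtid.dest_nid < 2**nid_bits and gtid.source_nid < 2**nid_bits
        assert gtid.tid < 2**tid_bits
        return ((gtid.dest_nid << (nid_bits + tid_bits))
                | (gtid.source_nid << tid_bits)
                | gtid.tid)

    gtid = GlobalTransactionId(dest_nid=3, source_nid=1, tid=42)
    print(hex(pack(gtid)))   # 0xc12a with the invented field widths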



Table 4 - CSI Message Fields [1]

CSI was designed as a natural extension of the existing front side bus protocol; although there are some changes, many of the commands can be easily traced to the commands on the front side bus. A set of commands is listed in the ‘250 patent.

In a three hop protocol, such as the one used by AMD’s Opteron, read requests are first sent to the home node (i.e. where the cache line is stored in memory). The home node then snoops all peer nodes (i.e. caching agents) in the system, and reads from memory. Lastly, all snoop responses from the peer nodes and the data from memory are sent to the requesting processor. This transaction involves three point-to-point messages: requestor to home, home to peer and peer to requestor, plus a read from memory before the data can be consumed.

Rather than implement a three hop cache coherency protocol, CSI was designed with a novel two hop protocol that achieves lower latency. In the protocol used by CSI, transactions go through three phases; however, data can be used after the second phase or hop. First, the requesting node sends out snoops to all peer nodes (i.e. caches) and the home node. Each peer node sends a snoop response to the requesting node. When the second phase has finished, the requesting node sends an acknowledgement to the home node, where the transaction is finally completed.
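
A rough sketch of the conflict-free fast path is shown below; the message formats and the memory fallback are simplified, and conflicts (handled by the home node, as described next) are ignored.

    # Sketch of the two-hop read flow: data is usable after the second hop,
    # while the home node acknowledgement happens off the critical path.
    def two_hop_read(requester, home, peers, caches):
        # Phase 1: the requester snoops all peer nodes and notifies the home node.
        responses = {peer: caches.get((peer, "line")) for peer in peers}
        # Phase 2: a peer holding the line forwards it straight to the requester,
        # which may consume the data immediately; otherwise memory supplies it.
        data = next((d for d in responses.values() if d is not None),
                    f"data from memory at {home}")
        # Phase 3: the requester acknowledges the home node so the transaction
        # can retire; the data itself never waited on this step.
        completion = (requester, "ack", home)
        return data, completion

    caches = {("cpu1", "line"): "cached copy forwarded by cpu1"}
    print(two_hop_read("cpu0", "home0", ["cpu1", "cpu2"], caches))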

In the rare case of a conflict, the home node is notified and will step in to resolve transactions in the appropriate order to ensure correctness. This could force one or more processors in the system to roll back, replay or otherwise cancel the effects of a load instruction. However, the additional control circuitry is neither frequently used, nor is it on any critical path, so it can be tuned for low leakage power.

In the vast majority of transactions, the home node is a silent observer, and the requestor can use the new data as soon as it is received from the peer agent’s cache, which is the lowest possible latency. In particular, a two hop protocol does not have to wait to access memory in the home node, in contrast to three hop protocols. Figure 5 compares the critical paths between two hop and three hop protocols, when data is in a cache (note that not all snoops and responses are shown, only the critical path).



Figure 5 - Critical Path Latency for Two and Three Hop Protocols

This arrangement is somewhat unusual in that the requesting processor is conceptually pushing transactions into the system and the home node. In three hop protocols, the home node acts as a gate keeper and can defer a transaction if the appropriate queues are full, while only stalling the requestor. In a CSI-based system, the home node receives messages after the transaction is in progress or has already occurred. If these incoming transactions were lost, the system would be unable to maintain coherency. Therefore, to ensure correctness, CSI home nodes must have a relatively large pre-allocated buffer to support as many transactions as can reasonably be initiated.


Virtual Channels


One of the most difficult challenges in designing multiprocessor systems is guaranteeing forward progress, and avoiding strange network behavior that limits performance. Unfortunately, this problem is an inherent aspect of multiprocessor system design, and really impacts almost every decision made; there is no clean way to separate it from other concerns. Every coherent transaction across CSI has three phases: the snoop from the requesting CPU, the responses from the peer nodes, and the acknowledgement to the home node. As noted previously, CSI uses separate link layer virtual channels to improve performance and avoid livelocks or deadlocks. Each transaction phase has one or more associated virtual channels: snoop, response and home. These arrangements should come as no surprise, since the EV7 used similar techniques and designations. Additional channels discussed in patents include short message, data and data response channels [34].

One reason for providing different virtual channels is that the traffic characteristics of the three are quite distinct. Packets sent across the home channel are typically very small and must be received in order. In the most common case, a home packet would simply be an acknowledgement that a transaction can retire. The response channel sometimes includes larger packets, often containing actual cache lines (although these may go on the data response channel), and can be processed out of order to improve performance. The snoop channel is mostly smaller packets, and can also operate out of order. The optimizations for each channel are different and by separating each class of traffic, Intel architects can more carefully tune the system for high performance and lower power.

There are also priority relationships between the different classes of traffic. When the system is saturated, the home phase channels are given the highest priority, which ensures that some transactions will retire, leaving the system and reducing traffic. The next highest priority is the response phase and its associated channels, which provide data to processors so they can continue computation, and initiate the home phase. The lowest priority traffic is on the snoop phase channels, which are used to start new transactions and are the first to throttle back.
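
A trivial arbitration sketch makes the ordering concrete: under saturation, home traffic drains first, then responses, then snoops. The channel contents and the strict-priority policy below are illustrative only.

    # Illustrative strict-priority arbitration between the three traffic classes.
    from collections import deque
    from typing import Optional

    PRIORITY_ORDER = ("home", "response", "snoop")   # highest to lowest priority

    def next_channel(channels) -> Optional[str]:
        """Pick the highest priority non-empty virtual channel, if any."""
        for name in PRIORITY_ORDER:
            if channels[name]:
                return name
        return None

    channels = {"home": deque(["ack1"]),
                "response": deque(["data1", "data2"]),
                "snoop": deque(["new snoop A"])}
    drained = []
    while (ch := next_channel(channels)) is not None:
        drained.append(channels[ch].popleft())
    print(drained)   # ['ack1', 'data1', 'data2', 'new snoop A']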



Dynamic Reconfiguration


One of the problems with the existing bus infrastructure is that the interface presented to software is not particularly clean or isolated. Specifically, components of Intel’s system architecture cannot be dynamically added or removed from the front-side bus; instead the bus and all attached components must be shut down, and then restarted after disabling or adding the component in question. For instance, to remove one faulty processor in a 16 socket server, an entire node (4 processors, one memory controller hub, the local memory and I/O) must be off-lined.

CSI supports both in-band (coordinated by a system component) and out-of-band (coordinated by a service processor) dynamic reconfiguration of system resources, also known as hot plug [39]. A system agent and the firmware work together to quiesce individual components and then modify the routing tables and system addressing decoders, so that the changes appear to be atomic to the operating system and software.

To add a system resource, such as a processor, first the firmware creates a physical and logical profile for the new processor in the rest of the system. Next, the firmware enables the CSI links between the new processor and the rest of the system. The firmware initializes the new processor’s CSI link and sends data about the system configuration to the new processor. The new processor initializes itself and begins self-testing, and will notify the firmware when it is complete. At that point, the firmware notifies the OS and the rest of the system to begin operating with the new processor in place.

System resources, such as a processor, memory or I/O hub, are removed through a complementary mechanism. These two techniques can also be combined to move resources seamlessly between different system partitions.


Multiprocessor Systems


When the P6 front side bus was first released, it caused a substantial shift in the computer industry by supporting up to four processors without any chipset modifications. As a result, Intel based systems using Linux or Windows penetrated and dominated the workstation and entry level server market, largely because the existing architectures were priced vastly higher.

However, Intel hesitated to extend itself beyond that point. This hesitancy was partially due to economic incentives to maintain the same infrastructure, but also due to the preferences of key OEMs such as IBM, HP and others, who provide added value in the form of larger multiprocessor systems. Balancing all the different priorities inside of Intel, and pleasing partners, is nearly impossible and has handicapped Intel for the past several years. However, it is quite clear that any reservations at Intel disappeared around 2002-3, when CSI development started.

Intel's patents clearly anticipate two and four processor systems, as shown in Figure 6. Each processor in a dual socket system will require a single coherent full width CSI link, with one or two half-width links to connect to I/O bridges, making the system fully symmetric (half-width links are shown as dotted lines). Processors in four socket systems will be fully connected, and each processor could also connect directly to the I/O bridge. More likely, each processor, or pair of processors, could connect to a separate I/O bridge to provide higher I/O bandwidth in the four socket systems.




Figure 6 - 2 and 4P CSI System Diagrams [2] [34]

Fully interconnected systems, such as those shown in Figure 6, enjoy several advantages over partially connected solutions. First of all, transactions occur at the speed of the slowest participant, so a system where every caching agent (including the I/O bridge) is only one hop away ensures lower transaction latency. Secondly, by lowering transaction latency, the number of transactions in flight is reduced (since the average transaction lifetime is shorter). This means that the buffers for each caching agent can be smaller, faster and more power efficient. Lastly, operating systems and applications have trouble handling NUMA optimizations, so more symmetrical systems are ideal from a software perspective.



Interacting with I/O


Of course, ensuring optimal communication between multiple processors is just one part of system design. The I/O architecture for Intel’s platform is also important, and CSI brings along several important changes in that area as well [36].

As Figure 6 indicates, some CSI based systems contain multiple I/O hubs, which need to communicate with each other. Since the I/O hubs are not connected, Intel’s engineers devised an efficient method to forward I/O transactions (typically PCI-Express) through CSI. Because CSI was optimized for coherent traffic, it lacks many of the features which PCI-Express relies upon, such as I/O specific packet attributes. To solve this problem, PCI-E packets are tunneled through CSI, leaving much or all of the PCI-E header information intact.


Beyond Multiprocessors


In a forward looking decision by Intel, CSI is fairly agnostic with respect to system expansion. Systems can be expanded in a hierarchical manner, which is the path that IBM took for their older X3 chipset, where one agent in each local cell acts as a proxy for the rest of the system. Certainly, the definition of CSI lends itself to hierarchical arrangements, since a “CSI node” is an abstraction and may in fact consist of multiple processors. For instance, in a 16 socket system, there might be four nodes, and each node might contain four sockets and resemble the top diagram in Figure 6. Early Intel patents seem to point to hierarchical expansion as being preferred, although later patents appear to be less restrictive [2] [4]. As an alternative to hierarchical expansion, larger systems can be built using a flat topology (the 2 dimensional torus used by the EV7 would be an example). However, a flat system must have a node ID for each processor, whereas a hierarchical system needs only enough node IDs for the processors in each ‘cell’. So, while a flat 32 socket system would require 32 distinct node IDs, a comparable system using 8 nodes of 4 sockets would only need 4 distinct node IDs.

Most MPU vendors have used node ID ranges to differentiate between versions of their processors. For instance, Intel and AMD both draw clear distinctions between 1, 2 and 4P server MPUs; each step up offers more RAS features and more node IDs at a substantially higher price. Furthermore, a flat system with 8+ processors in all likelihood needs snoop filters or directories for scalability. However, Intel’s x86 MPUs will probably not natively support directories or snoop filters, instead leaving that choice to OEMs. This flexibility for CSI systems means that OEMs with sufficient expertise can differentiate their products with custom node controllers for each local node in a hierarchical system.

Directory based coherency protocols are the most scalable option for system expansion. However, directories imply a three hop coherency protocol that is quite different from CSI’s two hop approach. In the first phase, the requestor sends a request to the home node, which contains the directory that lists which agents have a copy of the cache line. The home node then snoops those agents, while sending no messages to uninvolved third parties. Lastly, all the agents receiving a snoop send a response to the requestor. This presents several problems. The directory itself is difficult to implement, since every cache miss in the system generates both a read (the lookup) and a write (updating ownership) to the directory. The latency is also higher than a snooping broadcast protocol, although less system bandwidth is used, hence providing better scalability. Snoop filters are a more natural extension of the CSI infrastructure, suitable for mid-sized systems.

Snoop filters track a subset of the key coherency data to reduce the number of snoop responses. The classic example of a snoop filter, found in Intel’s Blackford, Seaburg and Clarksboro chipsets, tracks remotely cached data. Snoop filters have an advantage because they preserve the low latency of the CSI protocol, while a directory would require changing to a three hop protocol. Not every element in the system must have a snoop filter either; CSI is flexible in that regard as well.


Remote Prefetch


Another interesting innovation that may show up in CSI is remote prefetch [9]. Hardware prefetching is nothing new for modern CPUs; it has been around since the 130nm Pentium III. Typically, hardware prefetch works by tracking cache line requests from the CPU and trying to detect a spatial or temporal pattern. For instance, loading a 128MB movie will result in roughly one million sequential requests (temporal) for 128B cache lines that are probably adjacent in memory (spatial). A prefetcher in the cache controller will figure out this pattern, and then start fetching cache lines ahead of time to hide memory access latencies. However, general purpose systems rely on the cache and memory controllers prefetching for the CPU and do not receive feedback from other system agents.

One of the patents relating to CSI is for a remote device initiating a prefetch into processor caches. The general idea is that in some situations, remote agents (an I/O device or coprocessor) might have more knowledge about where data is coming from, than the simple pattern detectors in the cache or memory controller. To take advantage of that, a remote agent sends a prefetch directive message to a cache prefetcher. This message could be as simple as indicating where the prefetch would come from (and therefore where to respond), but in all likelihood would include information such as data size, priority and addressing information. The prefetcher can then respond by initiating a prefetch or simply ignoring the directive altogether. In the former case, the prefetcher would give direct cache access to the remote agent, which then writes the data into the cache. Additionally, the prefetcher could request that the remote agent pre-process the data. For example, if the data is compressed, encoded or encrypted, the remote agent could transform the data to an immediately readable format, or route it over the interprocessor fabric to a decoder or other mechanism.
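
A hypothetical shape for such a prefetch directive, and the prefetcher's freedom to grant direct cache access or ignore the hint, is sketched below; the field names, sizes and accept policy are all invented for illustration.

    # Hypothetical prefetch directive message and a toy accept/ignore policy.
    from typing import NamedTuple

    class PrefetchDirective(NamedTuple):
        source_agent: str   # e.g. a NIC or coprocessor behind the I/O hub
        address: int        # where the incoming data will land
        size_bytes: int
        priority: int

    class CachePrefetcher:
        def __init__(self, dca_budget_bytes=64 * 1024):
            self.dca_budget = dca_budget_bytes   # cache capacity set aside for DCA

        def handle(self, directive):
            # The prefetcher may simply ignore the hint, e.g. if it is too large
            # or too low priority to be worth polluting the cache.
            if directive.size_bytes > self.dca_budget or directive.priority < 1:
                return "ignored"
            return "direct cache access granted"

    hint = PrefetchDirective(source_agent="10GbE NIC", address=0x1000_0000,
                             size_bytes=4096, priority=2)
    print(CachePrefetcher().handle(hint))   # direct cache access granted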

The most obvious application for remote prefetching is improving I/O performance when receiving data from a high speed Ethernet, FibreChannel or Infiniband interface (the network device would be the remote agent in that case). This would be especially helpful if the transfer unit is large, as is the case for storage protocols such as iSCSI or FibreChannel, since the prefetch would hide latency for most of the data. To see how remote prefetch could improve performance, Figure 7 shows an example using a network device.



Figure 7 - Remote Prefetch for Network Traffic

On the left is a traditional system, which is accessing 4KB of data over a SAN. It receives a packet of data through the network interface, and then issues a write-invalidate snoop for a cache line to all caching agents in the system. A cache line in memory is allocated, and the data is stored through the I/O hub. This repeats until all 4KB of data has been written into the memory, at which point the I/O device issues an interrupt to the processor. Then, the processor requests the data from memory and snoops all the caching agents; lastly it reads the memory into the cache and begins to use the data.

In a system using remote prefetch, the network adapter begins receiving data and the packet headers indicate that the data payload is 4KB. The network adapter then sends a prefetch directive, through the I/O hub to the processor’s cache, which responds by granting direct cache access. The I/O hub will issue a write-invalidate snoop for each cache line written, but instead of storing to memory, the data is placed directly in the processor’s cache in the modified state. When all the data has been moved, the I/O hub sends an interrupt and the processor begins operating on the data already in the cache. Compared to the previously described method, remote prefetching demonstrates several advantages. First, it eliminates all of the snoop requests by the processor to read the data from memory to the cache. Second, it reduces the load on the main memory system (especially if the processors can stream results back to network adapter) and modestly decreases latency.

While the most obvious application of the patented technique is for network I/O, remote prefetch could work with any part of a system that does caching. For instance, in a multiprocessor system, remote prefetch between different processors, or even coprocessors or acceleration devices is quite feasible. It is unclear whether this feature will be made available to coprocessor vendors and other partners, but it would certainly be beneficial for Intel and a promising sign for a more open platform going forward.


Speculations on CSI Clients


While the technical details of CSI are well documented in various Intel patents, there is relatively little information on future desktop or mobile implementations. These next two sections make a transition from fairly solid technical details into the realm of educated, but ultimately speculative predictions.

CSI will modestly impact the desktop and mobile markets, but may not bring any fundamental changes. Certain Intel patents seem to imply that discrete memory controllers will continue to be used with some MPUs [9]. In all likelihood, Intel will offer several different product variations based on the target market. Some versions will use integrated memory controllers, some will offer an on-package northbridge and some will probably have no system integration at all.

Intel has a brisk chipset business on both the desktop and notebook side that keeps older fabs effectively utilized, an essential element of Intel’s capital strategy. If Intel were to integrate the northbridge in all MPUs, it would force the company to find other products which can use older fabs, or shutter some of the facilities. Full integration also increases the pin count for each MPU, which increases the production costs. While an integrated memory controller increases performance by reducing latency, many products do not need the extra performance, nor is it always desirable from a marketing perspective.

For technical reasons, an integrated memory controller can also be problematic. Integrated graphics controllers share main memory to reduce cost. As a result, integrated graphics substantially benefits from sharing a die with the memory controller, as it does currently for Intel based systems. However, integrating graphics on the processor seems a little aggressive for a company that has yet to produce an on-die memory controller, and is a waste of cutting edge silicon; most high performance systems simply do not use integrated graphics.

Intel’s desktop version of Nehalem is code-named Bloomfield and it seems clear that the high performance MPUs, which are targeted at gamers, will feature on-die memory controllers. The performance benefits of reducing memory latency will probably be used as product differentiation by Intel to encourage gamers to move to the Extreme Edition line and justify the higher prices. However, on-die or on-package graphics is unlikely given that most OEMs will use higher performance discrete solutions from NVIDIA or AMD. The width of the CSI connection between the MPU and the chipset may be another differentiating factor. While a half-width link will work for mid-range systems, high-end gaming systems will require more bandwidth. Modern high performance GPUs use PCI-E x16 slots, which provide 4GB/s in each direction. Hence, it is quite conceivable that by 2009 a pair of high-end GPUs would require ~16GB/s in each direction. Given that gaming systems often stress graphics, network and disk, a full width CSI link may be required to provide appropriate performance.

Other desktop parts based on Bloomfield will focus on low cost and greater integration. It is very likely that these MPUs will be connected via CSI to a second die containing a memory controller and integrated graphics, all packaged inside a single MCM. A CSI link (probably half-width) would connect the northbridge to the rest of the chipset. This solution would let Intel use older fabs to produce the northbridge, and would enable more manufacturing flexibility; each component could be upgraded individually with fewer dependencies between them. Intel will probably also produce an MPU with no integrated system features, which will let OEMs use chipsets from 3rd party vendors, such as NVIDIA, VIA and SiS.

Gilo, the mobile proliferation of Nehalem, will face many of the same issues as desktop processors, but also some that are unique to the notebook market. Mobile MPUs do not really need the lower latency; in many situations they sacrifice performance by only populating a single channel of memory, or operating at relatively slow transfer rates. An integrated memory controller would also require a separate voltage plane from the cores, hence systems would need an additional VRM on the motherboard. The clock distribution would also need to be designed so that the cores can vary frequency independently of the memory controller. Consequently, an on-die memory controller is unlikely because of the lack of benefits and additional complexity.

The implementations for Gilo will most likely resemble the mid-range and low-end desktop product configuration. The more integrated products will feature the northbridge and graphics in the same package as the MPU, connected by CSI. A more bare-bones MPU would also be offered for OEMs that prefer higher performance discrete graphics, or wish to use alternative chipsets.

While the system architecture for Intel’s desktop and mobile offerings will change a bit, the effects will probably be more subtle. The majority of Intel MPUs will still require external memory controllers, but they will be integrated on the MPU package itself. This will not fundamentally improve Intel’s performance relative to AMD’s desktop and mobile offerings. However, it will make Intel’s products more attractive to OEMs, since the greater integration will reduce the number of discrete components on system boards and lower the overall cost. In many ways the largest impact will be on the graphics vendors, since it will make all their solutions (both integrated and discrete) more expensive relative to a single MCM from Intel.


Speculations on CSI Servers


In the server world, CSI will be introduced in tandem with an on-die memory controller. The impact of these two modifications will be quite substantial, as they address the few remaining shortcomings in Intel’s overall server architecture and substantially increase performance. This performance improvement comes from two places: the integrated memory controller will lower memory latency, while the improved interconnects for 2-4 socket servers will increase bandwidth and decrease latency.

To Intel, the launch of a broad line of CSI based systems will represent one of the best opportunities to retake server market share from AMD. New systems will use the forthcoming Nehalem microarchitecture, which is a substantially enhanced derivative of the Core microarchitecture, and features simultaneous multithreading and several other enhancements. Historically speaking, new microarchitectures tend to win the performance crown and presage market share shifts. This happened with the Athlon, the Pentium 4, Athlon64/Opteron, and the Core 2 and it seems likely this trend will continue with Nehalem. The system level performance benefits from CSI and integrated memory controllers will also eliminate Intel’s two remaining glass jaws: the older front side bus architecture and higher memory latency.

The single-processor server market is likely where CSI will have the least impact. For these entry level servers, the shared front side bus is not a substantial problem, since there is little communication compared to larger systems. Hence, the technical innovations in CSI will have relatively little impact in this market. AMD also has a much smaller presence in this market, because their advantages (which are similar to the advantages of CSI) are less pronounced. Clearly, AMD will try to make inroads into this market; if the market responds positively to AMD’s solution that may hint at future reactions to CSI.

Currently in the two socket (DP) server market, Intel enjoys a substantial performance lead for commercial workloads, such as web serving or transaction processing. Unfortunately, Intel’s systems are somewhat handicapped because they require FB-DIMMs, which use an extra 5-6 watts per DIMM and cost somewhat more than registered DDR2. This disadvantage has certainly hindered Intel in the last year, especially with customers who require lots of memory or extremely low power systems. While Intel did regain some server market share, AMD’s Opteron is still the clear choice for almost all high performance computing, where the superior system architecture provides more memory and processor communication bandwidth. This advantage has been a boon for AMD, as the HPC market is the fastest growing segment within the overall server market.

Gainestown, the first CSI based Xeon DP, will arrive in the second half of 2008, likely before any of the desktop or mobile parts. In the dual socket market, CSI will certainly be welcome and improve Intel’s line up, featuring 2x or more the bandwidth of the previous generation, but the impact will not be as pronounced as for MP systems. Intel’s dual socket platforms are actually quite competitive because the product cycles are shorter, meaning more frequent upgrades and higher bandwidth. Intel’s current Blackford and Seaburg chipsets, with dual front side buses and snoop filters, offer reasonable bandwidth, although at the cost of slightly elevated power and thermal requirements. This too shall pass; it appears that dual socket systems will shift back to DDR3, eliminating the extra ~5W penalty for each FB-DIMM [12]. This will improve Intel’s product portfolio and put additional pressure on AMD, which is still benefitting from the FB-DIMM thermal issues. The DP server market is currently fairly close to ‘equilibrium’; AMD and Intel have split the market approximately along the traditional 80/20 lines. Consequently, the introduction of CSI systems will enhance Intel’s position, but will not spark massive shifts in market share.

The first Xeon MP to use CSI will debut in the second half of 2009, lagging behind its smaller system counterparts by an entire year. Out of all the x86 product families using CSI, Beckton will have the biggest impact, substantially improving Intel’s position in the four socket server market. Beckton will offer roughly 8-10x the bandwidth of its predecessor, dramatically improving performance. The changes in system architecture will also dramatically reduce latency, which is a key element of performance for most of the target workloads, such as transaction processing, virtualization and other mission critical applications. Since the CSI links are point-to-point, they eliminate one chip and one interconnect crossing, which will cut the latency between processors in half, or better. The integrated memory controller in Beckton will similarly reduce latency, since it also removes an extra chip and interconnect crossing.

Intel’s platform shortcomings created a weakness that AMD exploited to gain significant market share. It is estimated that Intel currently holds as little as 50% of the market for MP servers, compared to roughly 75-80% of the overall market. When CSI-based MP platforms arrive in 2009, Intel will certainly try to bring its market share back in line with the overall market. However, Beckton will be competing against AMD’s Sandtiger, a 45nm server product with 8-16 cores also slated for 2009. Given that little is known about the latter, it is difficult to predict the competitive landscape.



Itanium and CSI


CSI will also be used for Tukwila, a quad-core Itanium processor due in 2008. Creating a common infrastructure for Itanium and Xeon based systems has been a goal for Intel since 2003. Because the economic and technical considerations for these two products are different, they will not be fully compatible. However, the vast majority of the interconnect will be common to the two product lines.

One goal of a common platform for Itanium and Xeon is to share (and therefore better amortize) research, development, design and validation costs, by re-using components across Intel's entire product portfolio. Xeon and Xeon MP products ship in the tens of millions each year, compared to perhaps a million for Itanium. If the same components can be used across all product lines, the non-recurring engineering costs for Itanium will be substantially reduced. Additionally, the inventory and supply chain management for both Intel and its partners will be simplified, since some chipset components will be interchangeable.

Just as importantly, CSI and an integrated memory controller will substantially boost the performance of the Itanium family. Montvale, which will be released at the end of 2007, uses a 667MHz bus that is 128 bits wide, for a total of 10.6GB/s of bandwidth. This pales in comparison to the 300GB/s that a single POWER6 processor can tap into. While bandwidth is only one factor that determines performance, a 30x difference is substantial by any measure. When Tukwila debuts in 2008, it will go a long way towards leveling the playing field. Tukwila will offer 120-160GB/s between MPUs (5 CSI links at 4.8-6.4GT/s), and multiple integrated FB-DIMM controllers. The combination of doubling the core count, massively increasing bandwidth and reducing latency should prove compelling for Itanium customers and will likely cause a wave of upgrades and migrations similar to the one triggered by the release of Montecito in 2006.
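
The bandwidth figures above fall out of simple arithmetic. The sketch below reproduces them under the assumption that a full-width CSI link carries 20 lanes in each direction and that the 120-160GB/s aggregate counts both directions across all five links; only the end results quoted in the text come from Intel, while the lane accounting is an assumption.

# Rough arithmetic behind the quoted bandwidth figures. The 20-lane,
# both-directions accounting for CSI links is an assumption used to
# reproduce the 120-160GB/s aggregate; it is not an official breakdown.

def fsb_bandwidth(clock_hz, width_bits):
    # Shared front side bus: one transfer across the full width per clock.
    return clock_hz * width_bits / 8.0          # bytes per second

def csi_bandwidth(links, gt_per_s, lanes_per_direction=20):
    # One bit per lane per transfer, summed over both directions of every link.
    per_direction = gt_per_s * 1e9 * lanes_per_direction / 8.0
    return links * per_direction * 2

GB = 1e9
print("Montvale FSB:          %.1f GB/s" % (fsb_bandwidth(667e6, 128) / GB))  # ~10.7 GB/s (quoted as 10.6)
print("Tukwila CSI @ 4.8GT/s: %.0f GB/s" % (csi_bandwidth(5, 4.8) / GB))      # 120 GB/s
print("Tukwila CSI @ 6.4GT/s: %.0f GB/s" % (csi_bandwidth(5, 6.4) / GB))      # 160 GB/s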


Conclusion


The success of the Pentium Pro and its lineage captured the multi-billion dollar RISC workstation and low-end server market, but that success also created inertia around the bus interface. Politics within the company and with existing partners, OEMs and customers conspired to keep Intel content with the status quo. Unfortunately for Intel, AMD was not content to play second fiddle forever. The Opteron took a portion of the server market, largely by virtue of its superior system architecture and Intel’s simultaneous weakness with the Pentium 4 (Prescott) microarchitecture. While Intel might be prone to internal politics, when an external threat looms large, everything is thrown into high gear. The industry saw that with the RISC versus CISC debate, where Intel P6 engineers hung ads from the now friendly Apple in their cubes for competitive inspiration. The Core microarchitecture, Intel’s current flag bearer, was similarly the labor of a company under intense competitive pressure.

While Intel had multiple internal projects working on a next generation interconnect, the winning design for CSI was the result of collaboration between Intel veterans from Hillsboro, Santa Clara and other sites, as well as the architects who worked on DEC’s Alpha architecture. The EV7, the last Alpha, stands out for having the best system interconnect of its time, and certainly influenced the overall direction for CSI. The CSI design team was given a set of difficult, but not impossible goals: design an interconnect family that would span the range of Intel’s performance-oriented computational products, from the affordable Celeron to the high-end Xeon MP and Itanium. The results were delayed, largely due to the cancellation of Whitefield, a quad core x86 processor, and the rescheduling and evisceration of Tukwila, née Tanglewood. However, Tukwila and Nehalem will feature CSI when they debut in the next two years, and the world will be able to judge the outcome.

CSI will be a turning point for the industry. In the server world, CSI, paired with an integrated memory controller, will erase or reverse Intel’s system architecture deficit to AMD. Intel’s microprocessors will need less cache because of the lower memory and remote access latency; the specs for Tukwila call for 6MB/core rather than the 12MB/core in Montecito. This in turn will free up more die area for additional cores, or allow more economical die sizes. These changes will put Intel on a more equal footing with AMD, which has had a leg up in system architecture with its integrated memory controller and HyperTransport. As a result, Intel will be in a good position to retake lost market share in the server world in 2008/9, when CSI based systems debut.

In some ways, CSI and integrated memory controllers are the last pieces of the puzzle needed to get Intel’s servers back on track. The new Core microarchitecture has certainly proven to be a capable design, even when paired with the front side bus and a discrete memory controller. The multithreaded microarchitecture of Nehalem, coupled with an integrated memory controller and the CSI system fabric, should make for an even more impressive product. For Intel, 2008 will be a year to look forward to, thanks in no small part to the engineers who worked on CSI.


References


[1] Batson, B. et al. Messaging Protocol. US Patent Application 20050262250A1. November 24, 2005.
[2] Batson, B. et al. Cache Coherence Protocol. US Patent Application 20050240734A1. October 27, 2005.
[3] Hum, H. et al. Forward State for use in Cache Coherency in a Multiprocessor System. US Patent No. 6,922,756 B2. July 26, 2005.
[4] Hum, H. et al. Hierarchical Virtual Model of a Cache Hierarchy in a Multiprocessor System. US Patent Application 20040123045A1. June 24, 2004.
[5] Beers, R. et al. Non-Speculative Distributed Conflict Resolution for a Cache Coherency Protocol. US Patent No. 6,954,829 B2. October 11, 2005.
[6] Hum, H. et al. Speculative Distributed Conflict Resolution for a Cache Coherency Protocol. US Patent Application 20040122966A1. June 24, 2004.
[7] Hum, H. et al. Hierarchical Directories for Cache Coherency in a Multiprocessor System. US Patent Application 20060253657A1. November 9, 2006.
[8] Cen, Ling. Method, System, and Apparatus for a Credit Based Flow Control in a Computer System. US Patent Application 20050088967A1. April 28, 2005.
[9] Huggahalli, R. et al. Method and Apparatus for Initiating CPU Data Prefetches by an External Agent. US Patent Application 20060085602A1. April 20, 2006.
[10] Kanter, David. Intel’s Tukwila Confirmed to be Quad Core. Real World Technologies. May 5, 2006. http://www.realworldtech.com/page.cfm?NewsID=361&date=05-05-2006#361
[11] Rust, Adamson. Intel’s Stoutland to have Integrated Memory Controller. The Inquirer. February 1, 2007. http://www.theinquirer.net/default.aspx?article=37373
[12] Intel Thurley has Early CSI Interconnect. The Inquirer. February 2, 2007. http://www.theinquirer.net/default.aspx?article=37392
[13] Cherukuri, N. et al. Method and Apparatus for Periodically Retraining a Serial Links Interface. US Patent No. 7,209,907 B2. April 24, 2007.
[14] Cherukuri, N. et al. Method and Apparatus for Interactively Training Links in a Lockstep Fashion. US Patent Application 20050262184A1. November 24, 2005.
[15] Cherukuri, N. et al. Method and Apparatus for Acknowledgement-based Handshake Mechanism for Interactively Training Links. US Patent Application 20050262280A1. November 24, 2005.
[16] Cherukuri, N. et al. Method and Apparatus for Detecting Clock Failure and Establishing an Alternate Clock Lane. US Patent Application 20050261203A1. December 22, 2005.
[17] Cherukuri, N. et al. Method for Identifying Bad Lanes and Exchanging Width Capabilities of Two CSI Agents Connected Across a Link. US Patent Application 20050262284A1. November 24, 2005.
[18] Frodsham, T. et al. Method, System and Apparatus for Loopback Entry and Exit. US Patent Application 20060020861A1. January 26, 2006.
[19] Cherukuri, N. et al. Methods and Apparatuses for Resetting the Physical Layers of Two Agents Interconnected Through a Link-Based Interconnection. US Patent No. 7,219,220 B2. May 15, 2007.
[20] Mannava, P. et al. Method and System for Flexible and Negotiable Exchange of Link Layer Functional Parameters. US Patent Application 20070088863A1. April 19, 2007.
[21] Spink, A. et al. Interleaving Data Packets in a Packet-based Communication System. US Patent Application 20070047584A1. March 1, 2007.
[22] Chou, et al. Link Level Retry Scheme. US Patent No. 7,016,304 B2. March 21, 2006.
[23] Creta, et al. Separating Transactions into Different Virtual Channels. US Patent No. 7,165,131 B2. January 16, 2007.
[24] Cherukuri, N. et al. Technique for Lane Virtualization. US Patent Application 20050259599A1. November 24, 2005.
[25] Steinman, M. et al. Methods and Apparatuses to Effect a Variable-width Link. US Patent Application 20050259696A1. November 24, 2005.
[26] Kwa et al. Method and System for Deterministic Throttling for Thermal Management. US Patent Application 20060294179A1. December 28, 2006.
[27] Cherukuri, N. et al. Dynamically Modulating Link Width. US Patent Application 20060034295A1. February 16, 2006.
[28] Cherukuri, N. et al. Link Power Saving State. US Patent Application 20050262368A1. November 24, 2005.
[29] Lee, V. et al. Retraining Derived Clock Receivers. US Patent Application 20050022100A1. January 27, 2005.
[30] Fan, Yongping. Matched Current Delay Cell and Delay Locked Loop. US Patent No. 7,202,715 B1. April 10, 2007.
[31] Ayyar, M. et al. Method and Apparatus for System Level Initialization. US Patent Application 20060126656A1. June 15, 2006.
[32] Frodsham, T. et al. Method, System and Apparatus for Link Latency Management. US Patent Application 20060168379A1. July 27, 2006.
[33] Frodsham, T. et al. Technique to Create Link Determinism. US Patent Application 20060020843A1. January 26, 2006.
[34] Spink, A. et al. Buffering Data Packets According to Multiple Flow Control Schemes. US Patent Application 20070053350A1. March 8, 2007.
[35] Cen, Ling. Arrangements Facilitating Ordered Transactions. US Patent Application 20040008677A1. January 15, 2004.
[36] Creta, et al. Transmitting Peer-to-Peer Transactions Through a Coherent Interface. US Patent No. 7,210,000 B2. April 24, 2007.
[37] Hum, H. et al. Globally Unique Transaction Identifiers. US Patent Application 20050251599A1. November 10, 2005.
[38] Cen, Ling. Two-hop Cache Coherency Protocol. US Patent Application 20070022252A1. January 25, 2007.
[39] Ayyar, M. et al. Method and Apparatus for Dynamic Reconfiguration of Resources. US Patent Application 20060184480A1. August 17, 2006.