Sep 30, 2007

List of device bandwidths

List of device bandwidths (Wikipedia)

This is a list of device bandwidths: the channel capacity (or, more informally, bandwidth) of computer devices that transport data, listed in bit/s, kilobit/s (kbit/s), megabit/s (Mbit/s) or gigabit/s (Gbit/s) as appropriate, and also in MB/s (megabytes per second). Devices are listed in order from lowest bandwidth to highest.

Whether to use bit/s or byte/s (B/s) is often a matter of convention. The most commonly cited measurement is bolded. In general, parallel interfaces are quoted in byte/s (B/s), serial in bit/s. On devices like modems, bytes may be more than 8 bits long because they may be individually padded out with additional start and stop bits; the figures below will reflect this. Where channels use line codes, such as Ethernet, Serial ATA and PCI Express, quoted speeds are for the decoded signal.

Many of these figures are theoretical maxima, and various real-world considerations will generally keep the actual effective throughput much lower. The actual throughput achievable on Ethernet networks, for example (especially when heavily loaded or when running over substandard media), is debatable. The figures are also simplex speeds, which may conflict with the duplex speeds vendors sometimes use in promotional materials.

All of the figures listed here use metric (decimal) prefixes, not binary prefixes (1 kilobit, for example, is 1000 bits, not 1024 bits). Similarly, kB, MB and GB mean kilobytes, megabytes and gigabytes, not kibibytes, mebibytes and gibibytes.
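As a rough illustration of these conventions (the figures below are examples, not entries from the list): asynchronous serial framing adds a start and stop bit around each 8-bit byte, and all prefixes are decimal.

```python
# Illustration only: start/stop framing and decimal (metric) prefixes.

def framed_byte_rate(line_bits_per_second, bits_per_frame=10):
    """Payload bytes per second when each byte is framed as start + 8 data + stop."""
    return line_bits_per_second / bits_per_frame

def in_metric_units(bits_per_second):
    """Decimal prefixes, as used throughout the list (1 kbit = 1000 bits)."""
    return {"kbit/s": bits_per_second / 1_000, "Mbit/s": bits_per_second / 1_000_000}

print(framed_byte_rate(56_000))   # 5600.0 B/s of payload on a 56 kbit/s serial line
print(in_metric_units(56_000))    # {'kbit/s': 56.0, 'Mbit/s': 0.056}
```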


The Common System Interface: Intel's Future Interconnect

By: David Kanter (dkanter@realworldtech.com)
Updated: 08-28-2007

Introduction


In the competitive x86 microprocessor market, there are always swings and shifts based on the latest introductions from the two main protagonists: Intel and AMD. The next anticipated shift is coming in 2008-9, when Intel will finally replace their front side bus architecture. This report details Intel’s next generation system interconnect and the associated cache coherency protocol, likely deployment plans across the desktop, notebook and server markets, as well as the economic implications.

Intel’s front-side bus has a long history that dates back to 1995 with the release of the Pentium Pro (P6). The P6 was the first processor to offer cheap and effective multiprocessing support; up to four CPUs could be connected to a single shared bus with very little additional effort for an OEM. The importance of cheap and effective cannot be overstated. Before the P6, multiprocessor systems used special chipsets and usually a proprietary variant of UNIX; consequently they were quite expensive. Initially, Intel’s P6 could not always match the performance of these high end systems from the likes of IBM, DEC or Sun, but the price was so much lower that the performance gap became a secondary consideration. The workstation and low-end server markets embraced the P6 precisely because the front-side bus enabled inexpensive multiprocessors.

Ironically, the P6 bus was the subject of considerable controversy at Intel. It was originally based on the bus used in the i960 project, and the designers came under pressure from various corporate factions to re-use the bus from the original Pentium so that OEMs would not have to redesign and validate new motherboards, and so end users could easily upgrade. However, the Pentium bus was strictly in-order and could only have a single memory access in flight at a time, making it entirely inadequate for an out-of-order microprocessor like the P6 that would have many simultaneous memory accesses. Ultimately a compromise was reached that preserved most of the original P6 bus design, and the split-transaction P6 bus is still being used in new products 10 years after the design was started. The next step for Intel’s front side bus was the shift to the P4 bus, which was electrically similar to the P6 bus and issued commands at roughly the same rate, but clocked the data bus four times faster to provide fairly impressive throughput.

While the inexpensive P4 bus is still in use for Intel’s x86 processors, the rest of the world moved on to newer point-to-point interconnects rather than shared buses. Compared to systems based on HP’s EV7 and, more importantly, AMD’s Opteron, Intel’s front-side bus shows its age; it simply does not scale as well. Intel’s own Xeon and Xeon MP chipsets illustrate the point quite well: both use two separate front-side bus segments in order to provide enough bandwidth to feed all the processors. Similarly, Intel designed all of their MPUs with relatively large caches to reduce the pressure on the front-side bus and memory systems, exemplified by the Xeon MP and Itanium 2, sporting 16MB and 24MB of L3 cache respectively. While some critics claim that Intel is pushing an archaic solution and patchwork fixes on the industry, the truth is that this is simply a replay of the issues surrounding the Pentium versus P6 bus debate writ large. The P4 bus is vastly simpler and less expensive than a higher performance, point-to-point interconnect, such as HyperTransport or CSI. After 10 years of shipping products, there is a massive amount of knowledge and infrastructure invested in the front-side bus architecture, both at Intel and at strategic partners. Tossing out the front-side bus will force everyone back to square one. Intel opted to defer this transition by increasing cache sizes, adding more bus segments and including snoop filters to create competitive products.

While Intel’s platform engineers devised more and more creative ways to improve multiprocessor performance using the front-side bus, a highly scalable next generation interconnect was being jointly designed by engineers from teams across Intel and some of the former Alpha designers acquired from Compaq. This new interconnect, known internally as the Common System Interface (CSI), is explicitly designed to accommodate integrated memory controllers and distributed shared memory. CSI will be used as the internal fabric for almost all future Intel systems, starting with Tukwila, an Itanium processor, and Nehalem, an enhanced derivative of the Core microarchitecture, both slated for 2008. Not only will CSI be the cache coherent fabric between processors, but versions will be used to connect I/O chips and other devices in Intel based systems.

The design goals for CSI are rather intimidating. Roughly 90% of all servers sold use four or fewer sockets, and that is where Intel faces the greatest competition. For these systems, CSI must provide low latency and high bandwidth while keeping costs attractive. On the other end of the spectrum, high-end systems using Xeon MP and Itanium processors are intended for mission critical deployment and require extreme scalability and reliability for configurations as large as 2048 processors, but customers are willing to pay extra for those benefits. Many of the techniques that larger systems use to ensure reliability and scalability are more complicated than necessary for smaller servers (let alone notebooks or desktops), producing somewhat contradictory objectives. Consequently, it should be no surprise that CSI is not a single implementation, but rather a closely related family of implementations that will serve as the backbone of Intel’s architectures for the coming years.


Physical Layer


Unlike the front-side bus, CSI is a cleanly defined, layered network fabric used to communicate between various agents. These ‘agents’ may be microprocessors, coprocessors, FPGAs, chipsets, or generally any device with a CSI port. There are five distinct layers in the CSI stack, from lowest to highest: Physical, Link, Routing, Transport and Protocol [27]. Table 1 below describes the different layers and responsibilities of each layer.



Table 1: Common System Interface Layers

While all five layers are clearly defined, they are not all necessary. For example, the routing layer is optional in less complex systems, such as a desktop, where there are only two CSI agents (the MPU and chipset). Similarly, in situations where all CSI agents are directly connected, the transport layer is redundant, as end-to-end reliability is equivalent to link layer reliability.

CSI is defined as a variable width, point to point, packet-based interface implemented as two uni-directional links with low-voltage differential signaling. A full width CSI link is physically configured with 20 bit lanes in each direction; these bit lanes are divided into four quadrants of 5 bit lanes, as depicted in Figure 1 [25]. While most CSI links are full width, half width (2 quadrants) and quarter width (a single quadrant) options are also possible. Reduced width links will likely be used for connecting MPUs and chipset components. Additionally, some CSI ports can be bifurcated, so that they can connect to two different agents (for example, so that an I/O hub can directly connect to two different MPUs) [25]. The width of the link determines the physical unit of information transfer, or phit, which can be 5, 10 or 20 bits.



Figure 1: Anatomy of a CSI Link

In order to accommodate various link widths (and hence phit sizes) and bit orderings, each nibble of output is muxed on-chip before being transmitted across the physical transmission pins and the inverse is done on the receive side [25]. The nibble muxing eliminates trace length mismatches which reduces skew and improves performance. To support port bifurcation efficiently, the bit lanes are swizzled to avoid excessive wire crossings which would require additional layers in motherboards. Together these two techniques permit a CSI port to reverse the pins (i.e. send output for pin 0 to pin 19, etc.), which is needed when the processor sockets are mounted on both sides of a motherboard.

CSI is largely defined in a way that does not require a particular clocking mechanism for the physical layer. This is essential to balance current latency requirements, which tend to favor parallel interfaces, against future scalability, which requires truly serial technology. Clock encoding and clock and data recovery are prerequisites for optical interconnects, which will eventually be used to overcome the limitations of copper. By specifying CSI in an expansive fashion, the architects created a protocol stack that can naturally be extended from a parallel implementation over copper to optical communication.

Initial implementations appear to use clock forwarding, probably with one clock lane per quadrant to reduce skew and enable certain power saving techniques [16] [19] [27]. While some documents reference a single clock lane for the entire link, this seems unlikely as it would require much tighter skew margins between different data lanes. This would result in more restrictive board design rules and more expensive motherboards.

When a CSI link first boots up, it goes through a handshake based physical layer calibration and training process [14] [15]. Initially, the link is treated like a collection of independent serial lanes. Then the transmitter sends several specially designed phit patterns that will determine and communicate back the intra-lane skew and detect any lane failures. This information is used to train the receiver circuitry to compensate for skew between the different lanes that may arise due to different trace lengths, and process, temperature and voltage variation. Once the link has been trained, it then begins to operate as a parallel interface, and the circuitry used for training is shut down to save power. The link and any de-skewing circuitry will also be periodically recalibrated, based on a timing counter; according to one patent this counter triggers every 1-10ms [13]. When the retraining occurs, all higher level functionality including flow control and data transmission is temporarily halted. This skew compensation enables motherboard designs with less restrictive design rules for trace length matching, which are less expensive as a result.

It appears some variants of CSI can designate data lanes as alternate clocking lanes, in case of a clock failure [16]. In that situation, the transmitter and receiver would then disable the alternate clock lane and probably that lane’s whole quadrant. The link would then re-initialize at reduced width, using the alternate clock lane for clock forwarding, albeit with reduced data bandwidth. The advantage is that clock failures are no longer fatal, and gracefully degrade service in the same manner as a data lane failure, which can be handled through virtualization techniques in the link layer.

Initial CSI implementations in Intel’s 65nm and 45nm high performance CMOS processes target 4.8-6.4GT/s operation, thus providing 12-16GB/s of bandwidth in each direction and 24-32GB/s for each link [30] [33]. Compared to the parallel P4 bus, CSI uses vastly fewer pins running at much higher data rates, which not only simplifies board routing, but also makes more CPU pins available for power and ground.
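The arithmetic behind those bandwidth figures is straightforward; the quick sketch below (assuming 20 data lanes per direction and one bit per lane per transfer, as described above) reproduces them.

```python
# CSI link bandwidth back-of-the-envelope: 20 bit lanes per direction,
# one bit per lane per transfer, at the quoted transfer rates.
def csi_link_bandwidth(gigatransfers_per_s, lanes=20):
    per_direction_gbytes = gigatransfers_per_s * lanes / 8
    return per_direction_gbytes, 2 * per_direction_gbytes

for rate in (4.8, 6.4):
    one_way, total = csi_link_bandwidth(rate)
    print(f"{rate} GT/s -> {one_way:.0f} GB/s per direction, {total:.0f} GB/s per link")
# 4.8 GT/s -> 12 GB/s per direction, 24 GB/s per link
# 6.4 GT/s -> 16 GB/s per direction, 32 GB/s per link
```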


Link Layer


The CSI link layer is concerned with reliably sending data between two directly connected ports, and virtualizing the physical layer. Protocol packets from the higher layers of CSI are transmitted as a series of 80 bit flow control units (flits) [25]. Depending on the width of a physical link, transmitting each flit takes either 4, 8 or 16 cycles. A single flit can contain up to 64 bits of data payload; the remaining 16 bits form a header used for flow control, packet interleave, virtual networks and channels, error detection and other purposes [20] [22]. A higher level protocol packet can consist of as little as a single flit, for control messages, power management and the like, but could include a whole cache line, which is currently 64B for x86 MPUs and 128B for IPF.
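A small sketch using the sizes quoted above shows how link width maps to cycles per flit, and how many data flits a cache line occupies.

```python
# Flit/phit arithmetic from the figures above: 80-bit flits carrying up to
# 64 bits of payload, sent over phits of 20, 10 or 5 bits depending on width.
FLIT_BITS, PAYLOAD_BITS = 80, 64

def cycles_per_flit(phit_bits):
    return FLIT_BITS // phit_bits                 # full/half/quarter width link

def data_flits_for_line(cache_line_bytes):
    bits = cache_line_bytes * 8
    return (bits + PAYLOAD_BITS - 1) // PAYLOAD_BITS

print({w: cycles_per_flit(w) for w in (20, 10, 5)})       # {20: 4, 10: 8, 5: 16}
print(data_flits_for_line(64), data_flits_for_line(128))  # 8 data flits (x86), 16 (IPF)
```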

Flow control and error detection/correction are part of the CSI link layer, and operate between each transmitter and receiver pair. CSI uses a credit-based flow control system to avoid overwhelming receive buffers and causing quality-of-service problems; errors are detected separately with per-flit CRCs, as described below [22] [34]. CSI links have a number of virtual channels, which can form different virtual networks [8]. These virtual channels are used to ensure deadlock free routing, and to group traffic according to various characteristics, such as transaction size or type, coherency, ordering rules and other information [23]. These particular details are intertwined with other aspects of CSI-based systems and are discussed later. To reduce the storage requirements for the different virtual channels, CSI uses two level adaptive buffering. Each virtual channel has a small dedicated buffer, and all channels share a larger buffer pool [8].

Under ordinary conditions, the transmitter will first acquire enough credits to send an entire packet; as previously noted, this could be anywhere from a single flit to 18 or more. The flits will be transmitted to the receiver, and also copied into a retry buffer. Every flit is protected by an 8 bit CRC (or 16 bits in some cases), which will alert the receiver to corruption or transmission errors. When the receiver gets the flits, it will compute the CRC to check that the data is correct. If everything is clear, the receiver will send an acknowledgement (and credits) with the next flit that goes from the receiver side to the transmitter side (remember, there are two uni-directional links). Then the transmitter will clear the flits out of the retry buffer window. If the CRC indicates an error, the receiver side will send a link layer retry request to the transmitter side. The transmitter then begins resending the contents of the retry buffer until the flits have been correctly received and acknowledged. Figure 2 below shows an example of several transactions occurring across a CSI link using the flow control counters.
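That credit/retry sequence can be modeled very roughly in software; the toy sketch below simplifies the CRC, credit accounting and acknowledgement timing, but follows the steps described above.

```python
from collections import deque
import zlib

def flit_crc(payload: bytes) -> int:
    # Stand-in checksum; the real link layer uses a dedicated 8- or 16-bit CRC.
    return zlib.crc32(payload) & 0xFF

class ToyReceiver:
    def __init__(self):
        self.accepted = []
    def deliver(self, payload, crc):
        if flit_crc(payload) != crc:
            return "retry"            # corruption detected: request a link layer retry
        self.accepted.append(payload)
        return "ack"                  # acknowledgement implicitly returns a credit

class ToyTransmitter:
    def __init__(self, credits):
        self.credits = credits
        self.retry_buffer = deque()   # copies kept until acknowledged
    def send_packet(self, flits, receiver):
        if self.credits < len(flits):
            return False              # must first accumulate credits for the whole packet
        self.credits -= len(flits)
        self.retry_buffer.extend(flits)
        while self.retry_buffer:
            flit = self.retry_buffer[0]
            if receiver.deliver(flit, flit_crc(flit)) == "ack":
                self.retry_buffer.popleft()
                self.credits += 1     # credit comes back with the acknowledgement
            # on "retry", loop around and resend from the retry buffer
        return True

tx, rx = ToyTransmitter(credits=8), ToyReceiver()
print(tx.send_packet([b"flit%d" % i for i in range(8)], rx), len(rx.accepted))  # True 8
```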



Figure 2: Flow Control Example [34]

While CSI’s flow control mechanisms will prevent serious contention, they do not necessarily guarantee low latency. To ensure that high priority control packets are not blocked by longer latency data packets, CSI incorporates packet interleaving in the link layer [21]. A bit in the flit header indicates whether the flit belongs to a normal or interleaved packet. For example, if a CSI agent is sending a 64B cache line (8+ flits) and it must send out a cache coherency snoop, rather than delaying the snoop by 8 or more cycles, it could interleave the snoop. This would significantly improve the latency of the snoop, while barely slowing down the data transmission. Similarly, this technique could be used to interleave multiple data streams so that they arrive in a synchronized fashion, or simply to reduce the variance in packet latency.

The link layer can also virtualize the underlying physical bit lanes. This is done by turning off some of the physical transmitter lanes, and assigning these bit lanes either a static logical value, or a value based on the remaining bits in each phit [24]. For example, a failed data lane could be removed, and replaced by one of the lanes which sends CRC, thus avoiding any data pollution or power consumption as a result. The link would then continue to function with reduced CRC protection, similar to the failover mechanisms for FB-DIMMs.

Once the physical layer has been calibrated and trained, as discussed previously, the link layer goes through an initialization process [31]. The link layer is configured to auto-negotiate and exchange various parameters, which are needed for operation [20]. Table 2 below is a list of some (but not necessarily all) of the parameters that each link will negotiate. The process starts with each side of the link assuming the default values, and then negotiating which values to actually use during normal operation. The link layer can also issue an in-band reset command, which stops the clock forwarding, and forces the link to recalibrate the physical layer and then re-initialize the link layer.



Table 2: CSI Link Layer Parameters and Values [20]

Most of these parameters are fairly straightforward. The only one that has not been discussed is the agent profile. This field characterizes the role of the device and contains other link level information, which is used to optimize for specific roles. For example, a “mobile” profile agent would likely have much more aggressive power saving features than a desktop part. Similarly, a server agent might disable some prefetch techniques that are effective for multi-media workloads but tend to reduce performance for more typical server applications. Additionally, the two CSI agents will communicate what ‘type’ each one belongs to. Some different types would include caching agents, memory agents, I/O agents, and other agents that are defined by the CSI specification.


Power Saving Techniques


Power and heat have become first order constraints in modern MPU design, and are equally important in the design of interconnects. Thus it should come as no surprise that CSI has a variety of power saving techniques, which tend to span both the physical and link layer.

The most obvious technique to reduce power is using reduced or low-power states, just as in a microprocessor. CSI incorporates at least two different power states which boost efficiency by offering various degrees of power reduction and wake-up penalties for each side of a link [17]. The intermediate power state, L0s, saves some power, but has a relatively short wake-up time to accommodate brief periods of inactivity [28]. There are several triggers for entering the L0s state; they can include software commands, an empty or near empty transaction queue (at the transmitter), a protocol message from an agent, etc. When the CSI link enters the L0s state, an analog wake-up detection circuit is activated, which monitors for any signals which would trigger an exit from L0s. Additionally, a wake-up can be caused by flits entering the transaction queue.

During normal operation, even if one half of the link is inactive, it will still have to send idle flits to the other side to maintain flow control and provide acknowledgement that flits are being received correctly. In the L0s state, the link stops flow control temporarily and can shut down some, but not all, of the circuitry associated with the physical layer. Circuits are powered down based on whether or not they can be woken up within a predetermined period of time. This wake-up timer is configurable and the value likely depends upon factors such as the target market (mobile, desktop or server) and power source (AC versus battery). For instance, the bit lanes can generally be held in an electrical idle so they do not consume any power. However, the clock recovery circuits (receiver side PLLs or DLLs) must be kept active and periodically recalibrated. This ensures that when the link is activated, no physical layer initialization is required, which keeps the wake up latency relatively low. Generally, increasing the timer would improve the power consumption in L0s, but could negatively impact performance. Intel’s patents indicate that the wake-up timer can be set as low as 20ns, or roughly 96-128 cycles [19].

For more dramatic power savings, CSI links can be put into a low power state. The L1 state is optimized specifically for the lowest possible power, without regard for the wake-up latency, as it is intended to be used for prolonged idle periods. The biggest difference between the L0s and L1 states is that in the latter, the DLLs or PLLs used for clock recovery and skew compensation are turned off. This means that the physical layer of the link must be retrained when it is turned back on, which is fairly expensive in terms of latency: roughly 10us [17]. However, the benefit is that the link barely dissipates any power when in the L1 state. Figure 3 shows a state diagram for the links, including various resets and the L1 and L0s states.



Figure 3: CSI Initialization and Power Saving State Diagram [19]

Another power saving trick for CSI addresses situations where the link is underutilized, but must remain operational. Intel’s engineers designed CSI so that the link width can be dynamically modulated [27]. This is not too difficult, since the physical link between two CSI agents can vary between 5, 10 and 20 bits wide and the link layer must be able to efficiently accommodate each configuration. The only additional work is designing a mechanism to switch between full, half and quarter-width and ensuring that the link will operate correctly during and after a width change.

Note that width modulation is separate for each unidirectional portion of a link, so one direction might be wider to provide more bandwidth, while the opposite direction is mostly inactive. When the link layer is auto-negotiating, each CSI agent will keep track of the configurations supported by the other side (i.e. full width, half-width, quarter-width). Once the link has been established and is operating, each transmitter will periodically check to see if there is an opportunity to save power, or if more bandwidth is required.

If the link bandwidth is not being used, then the transmitter will select a narrower link configuration that is mutually supported and notify the receiver. Then the transmitter will modulate to a new width, and place the deactivated quadrants into the L0s or L1 power saving states, and the receiver will follow suit. One interesting twist is that the unused quadrants can be in distinct power states. For example, a full-width link could modulate down to a half-width link, putting quadrants 0 and 1 into the L1 state, and then later modulate down to a quarter-width link, putting quadrant 2 into the L0s state. In this situation, the link could respond to an immediate need for bandwidth by activating quadrant 2 quickly, while still saving a substantial amount of power.

If more bandwidth is required, the process is slightly more complicated. First, the transmitter will wake up its own circuitry, and also send out a wake-up signal to the receiver. However, because the wake-up is not instantaneous, the transmitter will have to wait for a predetermined and configurable period of time. Once this period has passed, and the receiver is guaranteed to be awake, then the transmitter can finally modulate to a wider link, and start transmitting data at the higher bandwidth.
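As a rough illustration of the width modulation logic, the sketch below uses the width options, quadrant structure and L0s/L1 states from the text; the thresholds and the policy itself are invented purely for illustration.

```python
# Hypothetical width-modulation policy. A link has four 5-lane quadrants and can
# run at full (4), half (2) or quarter (1) width; parked quadrants sit in L0s
# (quick wake) or L1 (deep sleep), independently of one another.
QUADRANTS_FOR = {"full": 4, "half": 2, "quarter": 1}

def pick_width(utilization, supported=("full", "half", "quarter")):
    # Purely illustrative thresholds.
    if utilization > 0.50 and "full" in supported:
        return "full"
    if utilization > 0.20 and "half" in supported:
        return "half"
    return "quarter"

def modulate(quadrants, width, park_state="L0s"):
    active = quadrants[:QUADRANTS_FOR[width]]
    parked = {q: park_state for q in quadrants[QUADRANTS_FOR[width]:]}
    return active, parked

print(modulate([0, 1, 2, 3], pick_width(utilization=0.10)))
# ([0], {1: 'L0s', 2: 'L0s', 3: 'L0s'})
```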

Most of the previously discussed power saving techniques are highly dynamic and difficult to predict. This means that engineers will naturally have to build in substantial guard-banding to guarantee correct operation. However, CSI also offers deterministic thermal throttling [26]. When a CSI agent reaches a thermal stress point, such as exceeding TDP for a period of time, or exceeding a specific temperature on-die, the overheating agent will send a thermal management request to other agents that it is connected to via CSI. The thermal management request typically includes a specified acknowledgement window and a sleep timer (these could be programmed into the BIOS, or dynamically set by the overheating agent). If the other agent responds affirmatively within the acknowledgement window, then both sides of the link will shut down for the specified sleep time. Using an acknowledgement window ensures that the other agent has the flexibility to finish in-flight transactions before de-activating the CSI link.


Coherency Leaps Forward at Intel


CSI is a switched fabric and a natural fit for cache coherent non-uniform memory architectures (ccNUMA). However, simply recycling Intel’s existing MESI protocol and grafting it onto a ccNUMA system is far from efficient. The MESI protocol complements Intel’s older bus-based architecture and elegantly enforces coherency. But in a ccNUMA system, the MESI protocol would send many redundant messages between different nodes, often with unnecessarily high latency. In particular, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. However, the requesting processor only needs a single copy of the data, so the system is wasting a bit of bandwidth.

Intel's solution to this issue is rather elegant. They adapted the standard MESI protocol to include an additional state, the Forwarding (F) state, and changed the role of the Shared (S) state. In the MESIF protocol, only a single instance of a cache line may be in the F state and that instance is the only one that may be duplicated [3]. Other caches may hold the data, but it will be in the shared state and cannot be copied. In other words, the cache line in the F state is used to respond to any read requests, while the S state cache lines are now silent. This makes the line in the F state a first amongst equals, when responding to snoop requests. By designating a single cache line to respond to requests, coherency traffic is substantially reduced when multiple copies of the data exist.

When a cache line in the F state is copied, the F state migrates to the newer copy, while the older one drops back to S. This has two advantages over pinning the F state to the original copy of the cache line. First, because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. In essence, this takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.
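A toy model of the F-state hand-off described above is sketched below; real coherency transition tables cover far more cases (writes, invalidations, evictions), so this captures only the read-sharing behaviour.

```python
# Toy MESIF read: only the F-state copy answers the snoop, and the F state
# migrates to the freshest copy while the old forwarder drops back to S.
def read_shared(line_states, requester):
    """line_states: dict mapping node -> MESIF state for one cache line."""
    forwarder = next((n for n, s in line_states.items() if s == "F"), None)
    if forwarder is None:
        return None                    # no forwarder cached: memory supplies the data
    line_states[forwarder] = "S"       # old copy becomes a silent sharer
    line_states[requester] = "F"       # newest copy answers future snoops
    return forwarder

line = {"cpu0": "F", "cpu1": "S", "cpu2": "I"}
print(read_shared(line, "cpu2"), line)
# cpu0 {'cpu0': 'S', 'cpu1': 'S', 'cpu2': 'F'}
```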

Figure 4 demonstrates the advantages of MESIF over the traditional MESI protocol, reducing two responses to a single response (acknowledgements are not shown). Note that a peer node is simply a node in the system that contains a cache.



Figure 4: MESIF versus MESI Protocol

In general, MESIF is a significant step forward for Intel’s coherency protocol. However, there is at least one optimization which Intel did not pursue: the Owned (O) state that is used in the MOESI protocol (found in the AMD Opteron). The O state is used to share dirty cache lines (i.e. lines that have been written to, so that memory holds stale data) without writing back to memory.

Specifically, if a dirty cache line is in the M (modified) state, then another processor can request a copy. The dirty cache line switches to the Owned state, and a duplicate copy is made in the S state. As a result, any cache line in the O state must be written back to memory before it can be evicted, and the S state no longer implies that the cache line is clean. In comparison, a system using MESIF or MESI would change the cache line to the F or S state, copy it to the requesting cache and write the data back to memory; the O state avoids the write back, saving some bandwidth. It is unclear why Intel avoided using the O state in the newer coherency protocol for CSI; perhaps the architects decided that the performance gain was too small to justify the additional complexity.

Table 3 summarizes the different protocols and states for the MESI, MOESI and MESIF cache coherency protocols.



Table 3: Overview of States in Snoop Protocols


A Two Hop Protocol for Low Latency


In a CSI system, each node is assigned a unique node ID (NID), which serves as an address on the network fabric. Each node also has a Peer Agent list, which enumerates the other nodes in the system that it must snoop when requesting data from memory (typically peers contain a cache, but could also be an I/O hub or device with DMA). Similarly, each transaction is assigned an identifier (TID) for tracking at each involved node. The TID, together with the destination and source node IDs, forms a globally unique transaction identifier [37]. The number of TIDs, and hence of outstanding transactions, is limited and will likely be one differentiating factor between Xeon DP, Xeon MP and Itanium systems. Table 4 describes the different fields that can be used in each CSI message, although some messages do not use all fields. For example, a snoop response from a processor that holds data in the shared state will not contain any data, just an acknowledgement.



Table 4: CSI Message Fields [1]

CSI was designed as a natural extension of the existing front side bus protocol; although there are some changes, many of the commands can be easily traced to the commands on the front side bus. A set of commands is listed in the ‘250 patent.

In a three hop protocol, such as the one used by AMD’s Opteron, read requests are first sent to the home node (i.e. where the cache line is stored in memory). The home node then snoops all peer nodes (i.e. caching agents) in the system, and reads from memory. Lastly, all snoop responses from peer nodes and the data in memory are sent to the requesting processor. This transaction involves three point to point messages: requestor to home, home to peer and peer to requestor, and a read from memory before the data can be consumed.

Rather than implement a three hop cache coherency protocol, CSI was designed with a novel two hop protocol that achieves lower latency. In the protocol used by CSI, transactions go through three phases; however, data can be used after the second phase or hop. First, the requesting node sends out snoops to all peer nodes (i.e. caches) and the home node. Each peer node sends a snoop response to the requesting node. When the second phase has finished, the requesting node sends an acknowledgement to the home node, where the transaction is finally completed.
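Counting link crossings on the critical path makes the difference concrete; in the sketch below the hop and lookup latencies are placeholders, not measurements, and only the cache-hit case is modeled.

```python
# Critical-path comparison for a line that hits in a peer cache.
# Three hop: requestor -> home -> peer (snoop) -> requestor (data).
# Two hop:   requestor -> peer (snoop) -> requestor (data); the acknowledgement
#            to the home node happens off the critical path.
def three_hop_latency(hop_ns, cache_lookup_ns):
    return 3 * hop_ns + cache_lookup_ns

def two_hop_latency(hop_ns, cache_lookup_ns):
    return 2 * hop_ns + cache_lookup_ns

HOP, LOOKUP = 40, 20          # placeholder nanosecond figures
print(three_hop_latency(HOP, LOOKUP), two_hop_latency(HOP, LOOKUP))   # 140 100
```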

In the rare case of a conflict, the home node is notified and will step in and resolve transactions in the appropriate order to ensure correctness. This could force one or more processors in the system to roll back, replay or otherwise cancel the effects of a load instruction. However, the additional control circuitry is neither frequently used nor on any critical path, so it can be tuned for low leakage power.

In the vast majority of transactions, the home node is a silent observer, and the requestor can use the new data as soon as it is received from the peer agent’s cache, which is the lowest possible latency. In particular, a two hop protocol does not have to wait to access memory in the home node, in contrast to three hop protocols. Figure 5 compares the critical paths between two hop and three hop protocols, when data is in a cache (note that not all snoops and responses are shown, only the critical path).



Figure 5: Critical Path Latency for Two and Three Hop Protocols

This arrangement is somewhat unusual in that the requesting processor is conceptually pushing transactions into the system and home node. In three hop protocols, the home node acts as a gate keeper and can defer a transaction if the appropriate queues are full, while only stalling the requestor. In a CSI-based system, the home node receives messages after the transaction is in progress or has already occurred. If these incoming transactions were lost, the system would be unable to maintain coherency. Therefore, to ensure correctness, CSI home nodes must have a relatively large pre-allocated buffer to support as many transactions as can reasonably be initiated.


Virtual Channels


One of the most difficult challenges in designing multiprocessor systems is guaranteeing forward progress, and avoiding strange network behavior that limits performance. Unfortunately, this problem is an inherent aspect of multiprocessor system design, and really impacts almost every decision made; there is no clean way to separate it from other concerns. Every coherent transaction across CSI has three phases: the snoop from the requesting CPU, the responses from the peer nodes, and the acknowledgement to the home node. As noted previously, CSI uses separate link layer virtual channels to improve performance and avoid livelocks or deadlocks. Each transaction phase has one or more associated virtual channels: snoop, response and home. These arrangements should come as no surprise, since the EV7 used similar techniques and designations. Additional channels discussed in patents include short message, data and data response channels [34].

One reason for providing different virtual channels is that the traffic characteristics of the three are quite distinct. Packets sent across the home channel are typically very small and must be received in order. In the most common case, a home packet would simply be an acknowledgement that a transaction can retire. The response channel sometimes includes larger packets, often containing actual cache lines (although these may go on the data response channel), and can be processed out of order to improve performance. The snoop channel is mostly smaller packets, and can also operate out of order. The optimizations for each channel are different and by separating each class of traffic, Intel architects can more carefully tune the system for high performance and lower power.

There are also priority relationships between the different classes of traffic. When the system is saturated, the home phase channels will be given the highest priority, which ensures that some transactions will retire, leaving the system and reducing traffic. The next highest priority is the response phase and associated channels, which provide data to processors so they can continue computation, and initiate the home phase. The lowest priority goes to the snoop phase channels, which are used to start new transactions and are the first to be throttled back.



Dynamic Reconfiguration


One of the problems with the existing bus infrastructure is that the interface presented to software is not particularly clean or isolated. Specifically, components of Intel’s system architecture cannot be dynamically added or removed from the front-side bus; instead the bus and all attached components must be shut down, and then restarted after disabling or adding the component in question. For instance, to remove one faulty processor in a 16 socket server, an entire node (4 processors, one memory controller hub, the local memory and I/O) must be off-lined.

CSI supports both in-band (coordinated by a system component) and out-of-band (coordinated by a service processor) dynamic reconfiguration of system resources, also known as hot plug [39]. A system agent and the firmware work together to quiesce individual components and then modify the routing tables and system addressing decoders, so that the changes appear to be atomic to the operating system and software.

To add a system resource, such as a processor, first the firmware creates a physical and logical profile for the new processor in the rest of the system. Next, the firmware enables the CSI links between the new processor and the rest of the system. The firmware initializes the new processor’s CSI link and sends data about the system configuration to the new processor. The new processor initializes itself and begins self-testing, and will notify the firmware when it is complete. At that point, the firmware notifies the OS and the rest of the system to begin operating with the new processor in place.

System resources, such as a processor, memory or I/O hub, are removed through a complementary mechanism. These two techniques can also be combined to move resources seamlessly between different system partitions.


Multiprocessor Systems


When the P6 front side bus was first released, it caused a substantial shift in the computer industry by supporting up to four processors without any chipset modifications. As a result, Intel based systems using Linux or Windows penetrated and dominated the workstation and entry level server market, largely because the existing architectures were priced vastly higher.

However, Intel hesitated to extend itself beyond that point. This hesitancy was partially due to economic incentives to maintain the same infrastructure, but also the preferences of key OEMs such as IBM, HP and others, who provide added value in the form of larger multiprocessor systems. Balancing all the different priorities inside of Intel, and pleasing partners, is nearly impossible and has handicapped Intel for the past several years. However, it is quite clear that any reservations at Intel disappeared around 2002-3, when CSI development started.

Intel's patents clearly anticipate two and four processor systems, as shown in Figure 6. Each processor in a dual socket system will require a single coherent full width CSI link, with one or two half-width links to connect to I/O bridges, making the system fully symmetric (half-width links are shown as dotted lines). Processors in four socket systems will be fully connected, and each processor could also connect directly to the I/O bridge. More likely, each processor, or pair of processors, could connect to a separate I/O bridge to provide higher I/O bandwidth in the four socket systems.




Figure 6: 2 and 4P CSI System Diagrams [2] [34]

Fully interconnected systems, such as those shown in Figure 6, enjoy several advantages over partially connected solutions. First of all, transactions occur at the speed of the slowest participant; hence, a system where every caching agent (including the I/O bridge) is only one hop away ensures lower transaction latency. Secondly, by lowering transaction latency, the number of transactions in flight is reduced (since the average transaction lifetime is shorter). This means that the buffers for each caching agent can be smaller, faster and more power efficient. Lastly, operating systems and applications have trouble handling NUMA optimizations, so more symmetrical systems are ideal from a software perspective.



Interacting with I/O


Of course, ensuring optimal communication between multiple processors is just one part of system design. The I/O architecture for Intel’s platform is also important, and CSI brings along several important changes in that area as well [36].

As Figure 6 indicates, some CSI based systems contain multiple I/O hubs, which need to communicate with each other. Since the I/O hubs are not connected, Intel’s engineers devised an efficient method to forward I/O transactions (typically PCI-Express) through CSI. Because CSI was optimized for coherent traffic, it lacks many of the features which PCI-Express relies upon, such as I/O specific packet attributes. To solve this problem, PCI-E packets are tunneled through CSI, leaving much or all of the PCI-E header information intact.


Beyond Multiprocessors


In a forward looking decision by Intel, CSI is fairly agnostic with respect to system expansion. Systems can be expanded in a hierarchical manner, which is the path that IBM took for their older X3 chipset, where one agent in each local cell acts as a proxy for the rest of the system. Certainly, the definition of CSI lends itself to hierarchical arrangements, since a “CSI node” is an abstraction and may in fact consist of multiple processors. For instance, in a 16 socket system, there might be four nodes, and each node might contain four sockets and resemble the top diagram in Figure 6. Early Intel patents seem to point to hierarchical expansion as being preferred, although later patents appear to be less restrictive [2] [4]. As an alternative to hierarchical expansion, larger systems can be built using a flat topology (the 2 dimensional torus used by the EV7 would be an example). However, a flat system must have a node ID for each processor, whereas a hierarchical system needs only enough node IDs for the processors in each ‘cell’. So, while a flat 32 socket system would require 32 distinct node IDs, a comparable system using 8 nodes of 4 sockets would only need 4 distinct node IDs.
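The node ID arithmetic is trivial but worth spelling out; the helper below is purely illustrative.

```python
# Node ID requirements: a flat topology needs one ID per socket, while a
# hierarchical system only needs IDs for the sockets within one cell, because
# the node controller proxies the rest of the machine.
def node_ids_needed(total_sockets, sockets_per_cell=None):
    if sockets_per_cell is None:
        return total_sockets          # flat topology
    return sockets_per_cell           # hierarchical: IDs are reused per cell

print(node_ids_needed(32))            # 32 for a flat 32 socket system
print(node_ids_needed(32, 4))         # 4 for 8 cells of 4 sockets each
```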

Most MPU vendors have used node ID ranges to differentiate between versions of their processors. For instance, Intel and AMD both draw clear distinctions between 1, 2 and 4P server MPUs, each step up offering more RAS features, more node IDs and a substantially higher price. Furthermore, a flat system with 8+ processors in all likelihood needs snoop filters or directories for scalability. However, Intel’s x86 MPUs will probably not natively support directories or snoop filters, instead leaving that choice to OEMs. This flexibility for CSI systems means that OEMs with sufficient expertise can differentiate their products with custom node controllers for each local node in a hierarchical system.

Directory based coherency protocols are the most scalable option for system expansion. However, directories use a three hop coherency protocol that is quite different from CSI’s. In the first phase, the requestor sends a request to the home node, which contains the directory that lists which agents have a copy of the cache line. The home node then snoops those agents, while sending no messages to uninvolved third parties. Lastly, all the agents receiving a snoop send a response to the requestor. This presents several problems. The directory itself is difficult to implement, since every cache miss in the system generates both a read (the lookup) and a write (updating ownership) to the directory. The latency is also higher than that of a snooping broadcast protocol, although less system bandwidth is used, hence providing better scalability. Snoop filters are a more natural extension of the CSI infrastructure, suitable for mid-sized systems.

Snoop filters focus on a subset of the key data to reduce the number of snoop responses. The classic example of a snoop filter, found in Intel’s Blackford, Seaburg and Clarksboro chipsets, tracks remotely cached data. Snoop filters have an advantage because they preserve the low latency of the CSI protocol, whereas a directory would require changing to a three hop protocol. Not every element in the system must have a snoop filter either; CSI is flexible in that regard as well.


Remote Prefetch


Another interesting innovation that may show up in CSI is remote prefetch [9]. Hardware prefetching is nothing new to modern CPUs; it has been around since the 130nm Pentium III. Typically, hardware prefetch works by tracking cache line requests from the CPU and trying to detect a spatial or temporal pattern. For instance, loading a 128MB movie will result in roughly one million sequential requests (temporal) for 128B cache lines that are probably adjacent in memory (spatial). A prefetcher in the cache controller will figure out this pattern, and then start requesting cache lines ahead of time to hide memory access latencies. However, general purpose systems rely on the cache and memory controllers prefetching for the CPU and do not receive feedback from other system agents.

One of the patents relating to CSI is for a remote device initiating a prefetch into processor caches. The general idea is that in some situations, remote agents (an I/O device or coprocessor) might have more knowledge about where data is coming from than the simple pattern detectors in the cache or memory controller. To take advantage of that, a remote agent sends a prefetch directive message to a cache prefetcher. This message could be as simple as indicating where the prefetch would come from (and therefore where to respond), but in all likelihood would include information such as data size, priority and addressing information. The prefetcher can then respond by initiating a prefetch or simply ignoring the directive altogether. In the former case, the prefetcher would grant direct cache access to the remote agent, which then writes the data into the cache. Additionally, the prefetcher could request that the remote agent pre-process the data. For example, if the data is compressed, encoded or encrypted, the remote agent could transform the data to an immediately readable format, or route it over the interprocessor fabric to a decoder or other mechanism.

The most obvious application for remote prefetching is improving I/O performance when receiving data from a high speed Ethernet, FibreChannel or Infiniband interface (the network device would be the remote agent in that case). This would be especially helpful if the transfer unit is large, as is the case for storage protocols such as iSCSI or FibreChannel, since the prefetch would hide latency for most of the data. To see how remote prefetch could improve performance, Figure 7 shows an example using a network device.



Figure 7: Remote Prefetch for Network Traffic

On the left is a traditional system, which is accessing 4KB of data over a SAN. It receives a packet of data through the network interface, and then issues a write-invalidate snoop for a cache line to all caching agents in the system. A cache line in memory is allocated, and the data is stored through the I/O hub. This repeats until all 4KB of data has been written into the memory, at which point the I/O device issues an interrupt to the processor. Then, the processor requests the data from memory and snoops all the caching agents; lastly it reads the memory into the cache and begins to use the data.

In a system using remote prefetch, the network adapter begins receiving data and the packet headers indicate that the data payload is 4KB. The network adapter then sends a prefetch directive, through the I/O hub to the processor’s cache, which responds by granting direct cache access. The I/O hub will issue a write-invalidate snoop for each cache line written, but instead of storing to memory, the data is placed directly in the processor’s cache in the modified state. When all the data has been moved, the I/O hub sends an interrupt and the processor begins operating on the data already in the cache. Compared to the previously described method, remote prefetching demonstrates several advantages. First, it eliminates all of the snoop requests by the processor to read the data from memory to the cache. Second, it reduces the load on the main memory system (especially if the processors can stream results back to network adapter) and modestly decreases latency.
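A hypothetical sketch of that flow is given below; all names, message fields and the acceptance policy are invented for illustration, and only the overall sequence follows the description above.

```python
# Hypothetical remote-prefetch flow: the adapter announces an incoming payload,
# and if granted direct cache access it writes lines straight into the cache in
# the Modified state instead of bouncing them through main memory.
from dataclasses import dataclass

@dataclass
class PrefetchDirective:
    source: str          # remote agent the data will come from
    size_bytes: int      # e.g. the 4KB payload in the example above
    priority: int

class ToyCachePrefetcher:
    def __init__(self):
        self.lines = {}                                   # addr -> (state, data)
    def accept(self, directive):
        return directive.size_bytes <= 64 * 1024          # invented policy; may ignore
    def direct_cache_write(self, addr, data):
        self.lines[addr] = ("M", data)                    # installed in Modified state

def deliver_payload(payload, prefetcher, base_addr, line_bytes=64):
    directive = PrefetchDirective(source="nic0", size_bytes=len(payload), priority=1)
    if not prefetcher.accept(directive):
        return False                                      # fall back to the memory path
    for offset in range(0, len(payload), line_bytes):
        prefetcher.direct_cache_write(base_addr + offset, payload[offset:offset + line_bytes])
    return True

cache = ToyCachePrefetcher()
print(deliver_payload(bytes(4096), cache, 0x1000), len(cache.lines))   # True 64
```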

While the most obvious application of the patented technique is for network I/O, remote prefetch could work with any part of a system that does caching. For instance, in a multiprocessor system, remote prefetch between different processors, or even coprocessors or acceleration devices is quite feasible. It is unclear whether this feature will be made available to coprocessor vendors and other partners, but it would certainly be beneficial for Intel and a promising sign for a more open platform going forward.


Speculations on CSI Clients


While the technical details of CSI are well documented in various Intel patents, there is relatively little information on future desktop or mobile implementations. These next two sections make a transition from fairly solid technical details into the realm of educated, but ultimately speculative predictions.

CSI will modestly impact the desktop and mobile markets, but may not bring any fundamental changes. Certain Intel patents seem to imply that discrete memory controllers will continue to be used with some MPUs [9]. In all likelihood, Intel will offer several different product variations based on the target market. Some versions will use integrated memory controllers, some will offer an on-package northbridge and some will probably have no system integration at all.

Intel has a brisk chipset business on both the desktop and notebook side that keeps older fabs effectively utilized, an essential element of Intel’s capital strategy. If Intel were to integrate the northbridge in all MPUs, it would force the company to find other products which can use older fabs, or shutter some of the facilities. Full integration also increases the pin count for each MPU, which increases the production costs. While an integrated memory controller increases performance by reducing latency, many products do not need the extra performance, nor is it always desirable from a marketing perspective.

For technical reasons, an integrated memory controller can also be problematic. Integrated graphics controllers share main memory to reduce cost. As a result, integrated graphics substantially benefits from sharing a die with the memory controller, as it does currently for Intel based systems. However, integrating graphics on the processor seems a little aggressive for a company that has yet to produce an on-die memory controller, and is a waste of cutting edge silicon: most high performance systems simply do not use integrated graphics.

Intel’s desktop version of Nehalem is code-named Bloomfield and it seems clear that the high performance MPUs, which are targeted at gamers, will feature on-die memory controllers. The performance benefits of reducing memory latency will probably be used as product differentiation by Intel to encourage gamers to move to the Extreme Edition line and justify the higher prices. However, on-die or on-package graphics is unlikely given that most OEMs will use higher performance discrete solutions from NVIDIA or AMD. The width of the CSI connection between the MPU and the chipset may be another differentiating factor. While a half-width link will work for mid-range systems, high-end gaming systems will require more bandwidth. Modern high performance GPUs use a PCI-E x16 slot, which provides 4GB/s in each direction. Hence, it is quite conceivable that by 2009 a pair of high-end GPUs would require ~16GB/s in each direction. Given that gaming systems often stress graphics, network and disk, a full width CSI link may be required to provide appropriate performance.

Other desktop parts based on Bloomfield will focus on low cost, and greater integration. It is very likely that these MPUs will be connected via CSI to a second die containing a memory controller and integrated graphics, all packaged inside a single MCM. A CSI link (probably half-width) would connect the northbridge to the rest of the chipset. This solution would let Intel use older fabs to produce the northbridge, and would enable more manufacturing flexibility: each component could be upgraded individually with fewer dependencies between the two. Intel will probably also produce an MPU with no integrated system features, which will let OEMs use chipsets from 3rd party vendors, such as NVIDIA, VIA and SiS.

Gilo, the mobile proliferation of Nehalem, will face many of the same issues as desktop processors, but also some that are unique to the notebook market. Mobile MPUs do not really need the lower latency; in many situations they sacrifice performance by only populating a single channel of memory, or operating at relatively slow transfer rates. An integrated memory controller would also require a separate voltage plane from the cores, hence systems would need an additional VRM on the motherboard. The clock distribution would also need to be designed so that the cores can vary frequency independently of the memory controller. Consequently, an on-die memory controller is unlikely because of the lack of benefits and additional complexity.

The implementations for Gilo will most likely resemble the mid-range and low-end desktop product configuration. The more integrated products will feature the northbridge and graphics in the same package as the MPU, connected by CSI. A more bare-bones MPU would also be offered for OEMs that prefer higher performance discrete graphics, or wish to use alternative chipsets.

While the system architecture for Intel’s desktop and mobile offerings will change a bit, the effects will probably be more subtle. The majority of Intel MPUs will still require external memory controllers, but they will be integrated on the MPU package itself. This will not fundamentally improve Intel’s performance relative to AMD’s desktop and mobile offerings. However, it will make Intel’s products more attractive to OEMs, since the greater integration will reduce the number of discrete components on system boards and lower the overall cost. In many ways the largest impact will be on the graphics vendors, since it will make all their solutions (both integrated and discrete) more expensive relative to a single MCM from Intel.


Speculations on CSI Servers


In the server world, CSI will be introduced in tandem with an on-die memory controller. The impact of these two modifications will be quite substantial, as they address the few remaining shortcomings in Intel’s overall server architecture and substantially increase performance. This performance improvement comes from two places: the integrated memory controller will lower memory latency, while the improved interconnects for 2-4 socket servers will increase bandwidth and decrease latency.

To Intel, the launch of a broad line of CSI based systems will represent one of the best opportunities to retake server market share from AMD. New systems will use the forthcoming Nehalem microarchitecture, which is a substantially enhanced derivative of the Core microarchitecture, and features simultaneous multithreading and several other enhancements. Historically speaking, new microarchitectures tend to win the performance crown and presage market share shifts. This happened with the Athlon, the Pentium 4, Athlon64/Opteron, and the Core 2 and it seems likely this trend will continue with Nehalem. The system level performance benefits from CSI and integrated memory controllers will also eliminate Intel’s two remaining glass jaws: the older front side bus architecture and higher memory latency.

The single-processor server market is likely where CSI will have the least impact. For these entry level servers, the shared front side bus is not a substantial problem, since there is little communication compared to larger systems. Hence, the technical innovations in CSI will have relatively little impact in this market. AMD also has a much smaller presence in this market, because their advantages (which are similar to the advantages of CSI) are less pronounced. Clearly, AMD will try to make inroads into this market; if the market responds positively to AMD’s solution that may hint at future reactions to CSI.

Currently in the two socket (DP) server market, Intel enjoys a substantial performance lead for commercial workloads, such as web serving or transaction processing. Unfortunately, Intel’s systems are somewhat handicapped because they require FB-DIMMs, which use an extra 5-6 watts per DIMM and cost somewhat more than registered DDR2. This disadvantage has certainly hindered Intel in the last year, especially with customers who require lots of memory or extremely low power systems. While Intel did regain some server market share, AMD’s Opteron is still the clear choice for almost all high performance computing, where the superior system architecture provides more memory and processor communication bandwidth. This advantage has been a boon for AMD, as the HPC market is the fastest growing segment within the overall server market.

Gainestown, the first CSI based Xeon DP, will arrive in the second half of 2008, likely before any of the desktop or mobile parts. In the dual socket market, CSI will certainly be welcome and improve Intel’s line up, featuring 2x or more the bandwidth of the previous generation, but the impact will not be as pronounced as for MP systems. Intel’s dual socket platforms are actually quite competitive because the product cycles are shorter, meaning more frequent upgrades and higher bandwidth. Intel’s current Blackford and Seaburg chipsets, with dual front side buses and snoop filters, offer reasonable bandwidth, although at the cost of slightly elevated power and thermal requirements. This too shall pass; it appears that dual socket systems will shift from FB-DIMMs back to standard DDR3, eliminating the extra ~5W penalty for each FB-DIMM [12]. This will improve Intel’s product portfolio and put additional pressure on AMD, which is still benefiting from the FB-DIMM thermal issues. The DP server market is currently fairly close to ‘equilibrium’; AMD and Intel have split the market approximately along the traditional 80/20 lines. Consequently, the introduction of CSI systems will enhance Intel’s position, but will not spark massive shifts in market share.

The first Xeon MP to use CSI will debut in the second half of 2009, lagging behind its smaller system counterparts by an entire year. Out of all the x86 product families using CSI, Beckton will have the biggest impact, substantially improving Intel’s position in the four socket server market. Beckton will offer roughly 8-10x the bandwidth of its predecessor, dramatically improving performance. The changes in system architecture will also dramatically reduce latency, which is a key element of performance for most of the target workloads, such as transaction processing, virtualization and other mission critical applications. Since the CSI links are point-to-point, they eliminate one chip and one interconnect crossing, which will cut the latency between processors in half, or better. The integrated memory controller in Beckton will similarly reduce latency, since it also removes an extra chip and interconnect crossing.

Intel’s platform shortcomings created a weakness that AMD exploited to gain significant market share. It is estimated that Intel currently holds as little as 50% of the market for MP servers, compared to roughly 75-80% of the overall market. When CSI-based MP platforms arrive in 2009, Intel will certainly try to bring their market share back in-line with the overall market. However, Beckton will be competing against AMD’s Sandtiger, a 45nm server product with 8-16 cores also slated for 2009. Given that little is known about the latter, it is difficult to predict the competitive landscape.



Itanium and CSI


CSI will also be used for Tukwila, a quad-core Itanium processor due in 2008. Creating a common infrastructure for Itanium and Xeon based systems has been a goal for Intel since 2003. Because the economic and technical considerations for these two products are different, they will not be fully compatible. However, the vast majority of the two interconnects will be common between the product lines.

One goal of a common platform for Itanium and Xeon is to share (and therefore better amortize) research, development, design and validation costs, by re-using components across Intel's entire product portfolio. Xeon and Xeon MP products ship in the tens of millions each year, compared to perhaps a million for Itanium. If the same components can be used across all product lines, the non-recurring engineering costs for Itanium will be substantially reduced. Additionally, the inventory and supply chain management for both Intel and its partners will be simplified, since some chipset components will be interchangeable.

Just as importantly, CSI and an integrated memory controller will substantially boost the performance of the Itanium family. Montvale, which will be released at the end of 2007, uses a 667MHz bus that is 128 bits wide, for a total of 10.6GB/s of bandwidth. This pales in comparison to the 300GB/s that a single POWER6 processor can tap into. While bandwidth is only one factor that determines performance, a 30x difference is substantial by any measure. When Tukwila debuts in 2008, it will go a long way towards leveling the playing field. Tukwila will offer 120-160GB/s between MPUs (5 CSI lanes at 4.8-6.4GT/s), and multiple integrated FB-DIMM controllers. The combination of doubling the core count, massively increasing bandwidth and reducing latency should prove compelling for Itanium customers and will likely cause a wave of upgrades and migrations similar to the one triggered by the release of Montecito in 2006.
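As a quick sanity check of those numbers, the arithmetic is simple enough to spell out (a minimal Python sketch; the 667MHz rate, 128-bit width and ~300GB/s POWER6 figure are the ones quoted above):

fsb_mt_per_s = 667e6        # transfers per second on the Montvale front side bus
fsb_width_bytes = 128 // 8  # 128-bit wide data bus

fsb_gb_per_s = fsb_mt_per_s * fsb_width_bytes / 1e9
print(f"Montvale FSB: {fsb_gb_per_s:.1f} GB/s")        # ~10.7 GB/s
print(f"POWER6 advantage: ~{300 / fsb_gb_per_s:.0f}x") # roughly the 30x gap cited above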


Conclusion


The success of the Pentium Pro and its lineage captured the multi-billion dollar RISC workstation and low-end server market, but that success also created inertia around the bus interface. Politics within the company and with existing partners, OEMs and customers conspired to keep Intel content with the status quo. Unfortunately for Intel, AMD was not content to play second fiddle forever. The Opteron took a portion of the server market, largely by virtue of its superior system architecture and Intel’s simultaneous weakness with the Pentium 4 (Prescott) microarchitecture. While Intel might be prone to internal politics, when an external threat looms large, everything is thrown into high gear. The industry saw that with the RISC versus CISC debate, where Intel P6 engineers hung ads from the now friendly Apple in their cubes for competitive inspiration. The Core microarchitecture, Intel’s current flag bearer, was similarly the labor of a company under intense competitive pressure.

While Intel had multiple internal projects working on a next generation interconnect, the winning design for CSI was the result of collaboration between Intel veterans from Hillsboro, Santa Clara and other sites, as well as the architects who worked on DEC’s Alpha architecture. The EV7, the last of the Alphas, stands out for having the best system interconnect of its time, and it certainly influenced the overall direction of CSI. The CSI design team was given a set of difficult, but not impossible, goals: design an interconnect family that would span the range of Intel’s performance oriented computational products, from the affordable Celeron to the high-end Xeon MP and Itanium. The results were delayed, largely due to the cancellation of Whitefield, a quad core x86 processor, and the rescheduling and evisceration of Tanglewood (now Tukwila). However, Tukwila and Nehalem will feature CSI when they debut in the next two years, and the world will be able to judge the outcome.

CSI will be a turning point for the industry. In the server world, CSI paired with an integrated memory controller, will erase or reverse Intel’s system architecture deficit to AMD. Intel’s microprocessors will need less cache because of the lower memory and remote access latency; the specs for Tukwila call for 6MB/core rather than the 12MB/core in Montecito. This in turn will free up more die area for additional cores, or more economical die sizes. These changes will put Intel on a more equal footing with AMD, which has had a leg up in system architecture with their integrated memory controller and HyperTransport. As a result, Intel will be in a good position to retake lost market share in the server world in 2008/9 when CSI based systems debut.

In some ways, CSI and integrated memory controllers are the last piece of the puzzle to get Intel’s servers back on track. The new Core microarchitecture has certainly proven to be a capable design, even when paired with the front side bus and a discrete memory controller. The multithreaded microarchitecture for Nehalem, coupled with an integrated memory controller and the CSI system fabric should be an even more impressive product. For Intel, 2008 will be a year to look forward to, thanks in no small part to the engineers who worked on CSI.


References


[1] Batson, B. et al. Messaging Protocol. US Patent Application 20050262250A1. November 24, 2005.
[2] Batson, B. et al. Cache Coherence Protocol. US Patent Application 20050240734A1. October 27, 2005.
[3] Hum, H. et al. Forward State for use in Cache Coherency in a Multiprocessor System. US Patent No. 6,922,756 B2. July 26, 2005.
[4] Hum, H. et al. Hierarchical Virtual Model of a Cache Hierarchy in a Multiprocessor System. US Patent Application 20040123045A1. June 24, 2004.
[5] Beers, R. et al. Non-Speculative Distributed Conflict Resolution for a Cache Coherency Protocol. US Patent No. 6,954,829 B2. October 11, 2005.
[6] Hum, H. et al. Speculative Distributed Conflict Resolution for a Cache Coherency Protocol. US Patent Application 20040122966A1. June 24, 2004.
[7] Hum, H. et al. Hierarchical Directories for Cache Coherency in a Multiprocessor System. US Patent Application 20060253657A1. November 9, 2006.
[8] Cen, Ling. Method, System, and Apparatus for a Credit Based Flow Control in a Computer System. US Patent Application 20050088967A1. April 28, 2005.
[9] Huggahalli, R. et al. Method and Apparatus for Initiating CPU Data Prefetches by an External Agent. US Patent Application 20060085602A1. April 20, 2006.
[10] Kanter, David. Intel’s Tukwila Confirmed to be Quad Core. Real World Technologies. May 5, 2006. http://www.realworldtech.com/page.cfm?NewsID=361&date=05-05-2006#361
[11] Rust, Adamson. Intel’s Stoutland to have Integrated Memory Controller. The Inquirer. February 1, 2007. http://www.theinquirer.net/default.aspx?article=37373
[12] Intel Thurley has Early CSI Interconnect. The Inquirer. February 2, 2007. http://www.theinquirer.net/default.aspx?article=37392
[13] Cherukuri, N. et al. Method and Apparatus for Periodically Retraining a Serial Links Interface. US Patent No. 7,209,907 B2. April 24, 2007.
[14] Cherukuri, N. et al. Method and Apparatus for Interactively Training Links in a Lockstep Fashion. US Patent Application 20050262184A1. November 24, 2005.
[15] Cherukuri, N. et al. Method and Apparatus for Acknowledgement-based Handshake Mechanism for Interactively Training Links. US Patent Application 20050262280A1. November 24, 2005.
[16] Cherukuri, N. et al. Method and Apparatus for Detecting Clock Failure and Establishing an Alternate Clock Lane. US Patent Application 20050261203A1. December 22, 2005.
[17] Cherukuri, N. et al. Method for Identifying Bad Lanes and Exchanging Width Capabilities of Two CSI Agents Connected Across a Link. US Patent Application 20050262284A1. November 24, 2005.
[18] Frodsham, T. et al. Method, System and Apparatus for Loopback Entry and Exit. US Patent Application 20060020861A1. January 26, 2006.
[19] Cherukuri, N. et al. Methods and Apparatuses for Resetting the Physical Layers of Two Agents Interconnected Through a Link-Based Interconnection. US Patent No. 7,219,220 B2. May 15, 2007.
[20] Mannava, P. et al. Method and System for Flexible and Negotiable Exchange of Link Layer Functional Parameters. US Patent Application 20070088863A1. April 19, 2007.
[21] Spink, A. et al. Interleaving Data Packets in a Packet-based Communication System. US Patent Application 20070047584A1. March 1, 2007.
[22] Chou, et al. Link Level Retry Scheme. US Patent No. 7,016,304 B2. March 21, 2006.
[23] Creta, et al. Separating Transactions into Different Virtual Channels. US Patent No. 7,165,131 B2. January 16, 2007.
[24] Cherukuri, N. et al. Technique for Lane Virtualization. US Patent Application 20050259599A1. November 24, 2005.
[25] Steinman, M. et al. Methods and Apparatuses to Effect a Variable-width Link. US Patent Application 20050259696A1. November 24, 2005.
[26] Kwa et al. Method and System for Deterministic Throttling for Thermal Management. US Patent Application 20060294179A1. December 28, 2006.
[27] Cherukuri, N. et al. Dynamically Modulating Link Width. US Patent Application 20060034295A1. February 16, 2006.
[28] Cherukuri, N. et al. Link Power Saving State. US Patent Application 20050262368A1. November 24, 2005.
[29] Lee, V. et al. Retraining Derived Clock Receivers. US Patent Application 20050022100A1. January 27, 2005.
[30] Fan, Yongping. Matched Current Delay Cell and Delay Locked Loop. US Patent No. 7,202,715 B1. April 10, 2007.
[31] Ayyar, M. et al. Method and Apparatus for System Level Initialization. US Patent Application 20060126656A1. June 15, 2006.
[32] Frodsham, T. et al. Method, System and Apparatus for Link Latency Management. US Patent Application 20060168379A1. July 27, 2006.
[33] Frodsham, T. et al. Technique to Create Link Determinism. US Patent Application 20060020843A1. January 26, 2006.
[34] Spink. A. et al. Buffering Data Packets According to Multiple Flow Control Schemes. US Patent Application 20070053350A1. March 8, 2007.
[35] Cen, Ling. Arrangements Facilitating Ordered Transactions. US Patent Application 20040008677A1. January 15, 2004.
[36] Creta, et al. Transmitting Peer-to-Peer Transactions Through a Coherent Interface. US Patent No. 7,210,000 B2. April 24, 2007.
[37] Hum, H. et al. Globally Unique Transaction Identifiers. US Patent Application 20050251599A1. November 10, 2005.
[38] Cen, Ling. Two-hop Cache Coherency Protocol. US Patent Application 20070022252A1. January 25, 2007.
[39] Ayyar, M. et al. Method and Apparatus for Dynamic Reconfiguration of Resources. US Patent Application 20060184480A1. August 17, 2006.

Sep 27, 2007

Data Communications Lecture Series

Chapter 1: Information and Data Communications
Chapter 2: Concepts for Understanding Data Communications
Chapter 3: Modems and Digital Transmission Technology
Chapter 4: Interface Standards, Flow Control, and Error Control
Chapter 5: What Is a Protocol?
Chapter 6: Multiplexing and Data Compression
Chapter 7: Switching Technology
Chapter 8: Review of Chapters 1-7
Chapter 9: Network Architecture and Open Systems Interconnection (OSI)
Chapter 10: Understanding Packet Communications
Chapter 11: Understanding B-ISDN (Broadband Integrated Services Digital Network)
Chapter 12: Local Area Networks
Chapter 13: Wireless Data Communications
Chapter 14: Satellite Communications
Chapter 15: Optical Communications
Chapter 16: Review of Chapters 9-15
Chapter 17: Communications Security
Chapter 18: The Internet
Chapter 19: EDI (Electronic Data Interchange)
Chapter 20: The Global Positioning System (GPS) and Its Applications
Chapter 21: Network Management
Chapter 22: Network Selection
Chapter 23: The Future of Data Communications Technology

Modems and Digital Transmission Technology

The most important task in data communications is delivering the electrical signals that carry information reliably to a remote destination.
Only when signals are transmitted and received accurately over circuits ranging from a few hundred meters to tens of thousands of kilometers can accurate information transfer take place, and transmission technology is what makes this possible.
In fact, much of data communications is tied to transmission technology, but this chapter introduces only a few fundamental topics. From Chapter 3 you should come to understand what analog and digital transmission are, how digital transmission is advantageous compared to analog transmission, what a modem does and how it transmits information composed of 1s and 0s, and which transmission methods are used to send and receive characters accurately.


This lecture concerns the technologies involved in transmitting information accurately. Information generated by a user is converted into a signal form suitable for communication and is then carried to the other party as an electrical signal over the channel by various transmission techniques.
In delivering that information, what matters above all is that the transmitted information be received correctly and accurately by the other party.
If person A sends the message 'leading player of the information age' but person B receives 'supporting player of the information age', or something completely unintelligible, the communication between A and B is meaningless.
The 'first principle' of data communications is therefore the accurate delivery of information, and the many forms of transmission technology exist precisely to carry out this first principle.
We begin with the signal form that information must take to travel well over a transmission medium, that is, with the signal conversion of information.
In particular, we focus on modem technology for carrying digital information as analog signals, the most widely used form of transmission today.
We also cover the different modes of transmission: serial and parallel transmission, and synchronous and asynchronous transmission.
The next lecture looks at the interface technologies used to attach various data equipment to transmission media, centering on RS-232C, the most widely used interface today.
It also covers how the information path from sender to receiver is organized, and explains error control and flow control, the mechanisms that most explicitly enforce the first principle of data communications.

As noted in the previous lecture, the information we wish to express takes two forms: analog and digital.
When either form of information travels over a transmission medium it does so as an electromagnetic wave, that is, as a signal; transmission in digital form is called 'digital signal transmission' and transmission in analog form 'analog signal transmission'.
In other words, like information itself, transmission comes in two forms, digital and analog.
'Information transmission' can therefore be said to deal with the relationship between the information being represented and the signal that carries it.
Four combinations are possible: analog information carried as an analog or a digital signal, and digital information carried as an analog or a digital signal (see Table 3.1).

The most common form of data communication today is to connect a computer to the public switched telephone network and exchange information over ordinary telephone lines. The telephone network, however, was originally built to carry voice and is primarily suited to carrying analog signals.
To communicate data over the telephone network, the information must therefore be converted into a signal form the network can accept, namely an analog signal. The device responsible for converting digital information into analog signal form is the modem.
The word modem is a contraction of Modulator and Demodulator. As shown in Figure 3-1, it is a signal converter that performs modulation, turning digital information into an analog signal, and demodulation, recovering the original digital information from that analog signal.
There are three basic methods of modulating digital information onto an analog signal: amplitude-shift keying, frequency-shift keying, and phase-shift keying.
Combinations of these methods are also used. Amplitude-shift keying represents '0' and '1' with carriers of two different amplitudes; because it is inefficient on its own, it is usually combined with phase-shift keying rather than used alone.

Frequency-shift keying represents the binary values '0' and '1' with sine waves of different frequencies near the carrier frequency.
For example, the Bell 202 series, which specifies an 1800 bps half-duplex modulation scheme, assigns a frequency of 1200 Hz to the binary value '1' and 2200 Hz to '0'.
Phase-shift keying carries the information in the phase of a sinusoidal carrier of fixed frequency and amplitude: the phase is divided into two, four, or more values, and either single bits or groups of two or three bits at a time are assigned to the different phases and sent to the receiver.
Beyond these three basic methods, combinations are also possible. By mixing amplitude-shift and phase-shift keying, for example, data rates as high as 56 kbit/s can now be achieved over voice-grade lines. Figure 3-2 shows the carrier waveform and the shapes of amplitude, frequency, and phase modulation.

Carrier waveform and the three basic modulation methods
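To make the idea concrete, here is a minimal frequency-shift keying sketch in Python using the Bell 202 tone assignment described above (1200 Hz for '1', 2200 Hz for '0'); the 48 kHz sample rate and the 1200 baud symbol rate are illustrative assumptions, not values from the text:

import numpy as np

SAMPLE_RATE = 48_000           # samples per second (assumed)
BAUD = 1_200                   # symbols per second (assumed for this sketch)
FREQ = {1: 1200.0, 0: 2200.0}  # mark/space tone frequencies in Hz

def fsk_modulate(bits):
    """Return an FSK waveform: one constant-frequency tone burst per bit."""
    samples_per_bit = SAMPLE_RATE // BAUD
    t = np.arange(samples_per_bit) / SAMPLE_RATE
    return np.concatenate([np.sin(2 * np.pi * FREQ[b] * t) for b in bits])

waveform = fsk_modulate([1, 0, 1, 1, 0])
print(waveform.shape)  # (200,): 5 bits x 40 samples per bit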

Modems can be classified in several ways. By form factor there are internal modems, built into a system, and external modems, which are stand-alone units.
Most of the modems used to dial into PC online services are internal. General-purpose modems are also classified by data rate into low-, medium-, and high-speed modems.
Beyond these classifications, modems may also be grouped by channel bandwidth, usable distance, or the number of ports they support.
As explained above, the modem is an essential piece of equipment for data communication over the public telephone network, and modems are generally built to one of two standards.
One is the Bell standard, centered on the United States; the other is the CCITT (ITU-T) standard, centered on Europe. The CCITT standards in particular carry international authority, and the CCITT V series lays out the recommendations for data communication over analog transmission lines.
When you see these recommendations cited for your own modem or in magazines, remember that they are modem standards classified by data rate and modulation method.
In any case, even as more and more communications equipment goes digital, the modem remains one of the most widely used pieces of communications gear; although the industry regards modems as a sunset business, modem sales have held steady.
Looking at today's modem market, modems are steadily becoming smaller, cheaper, lower-power, and faster, and beyond their basic modulation and demodulation role they now offer features such as error control and data compression.
For example, CCITT's V.34 recommendation specifies 28,800 bps modems, and in line with the push toward higher speeds many modem makers, including AT&T and Racal-Milgo, build high-speed modems that conform to it.
The MNP modems widely deployed in the domestic banking industry are multifunction modems that add error control and data compression to the basic modulation function. The recently announced 56 kbit/s modems are not yet covered by an international standard, so the products of the two companies involved (Rockwell and US Robotics) are not interoperable.

Just as digital information can be converted into an analog signal, analog information such as voice can be converted into a digital signal for transmission. Transmission in this digital signal form is called digital transmission.
As public telephone networks and private branch exchanges have gone digital, converting voice into digital signals rather than leaving it in analog form has made a much wider range of functions possible.
As shown in the figure, the device that converts analog voice into a digital signal and restores the original voice from that digital signal is called a codec.
The word codec is a contraction of coder and decoder.
Information converted into a digital signal can be processed in that form, or, as in Figure 3-3, passed through a modem and converted back into an analog signal for transmission over an analog network. This new analog signal, however, carries the meaning of the binary digital signal, so its shape differs from that of the original analog voice signal.

The most widely used method of converting voice into a digital signal is pulse code modulation (PCM).
PCM first samples the voice at regular intervals to obtain a pulse amplitude modulated signal, then passes it through a quantizer that maps each sample amplitude onto a standard level, and finally assigns a binary code to each quantized value.
Sampling divides the voice signal into samples taken at fixed time intervals and assigns an amplitude to each sample; this step is also called pulse amplitude modulation (PAM).
Quantization rounds each sampled amplitude to a standard level, for example representing 1.08 as 1.1.
The error introduced in this step is called quantization error.
The quantized amplitude values are divided into equal intervals, and a binary value is assigned to each.
When human speech is carried by PCM, the resulting data rate is 64 kbit/s.
This follows from sampling the voice 8,000 times per second and representing each sample with 8 bits. Figure 3-4 illustrates the PCM (pulse code modulation) process.
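The arithmetic above can be sketched directly in Python (a minimal example; the 8,000 samples per second and 8 bits per sample follow the text, while the 1 kHz sine wave is just an assumed stand-in for a voice signal):

import numpy as np

SAMPLE_RATE = 8_000   # samples per second
BITS = 8              # bits per sample
LEVELS = 2 ** BITS    # 256 quantization levels

t = np.arange(SAMPLE_RATE) / SAMPLE_RATE    # one second of sample times
analog = np.sin(2 * np.pi * 1_000 * t)      # assumed 1 kHz test tone
codes = np.round((analog + 1) / 2 * (LEVELS - 1)).astype(np.uint8)  # 8-bit PCM codes

reconstructed = codes / (LEVELS - 1) * 2 - 1    # map codes back to [-1, 1]
print(f"bit rate: {SAMPLE_RATE * BITS} bit/s")  # 64000 bit/s
print(f"max quantization error: {np.max(np.abs(reconstructed - analog)):.4f}")  # about half a step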
To place digital information on a channel that accepts digital signals, the original digital information can be sent as-is, but it is usually converted into a different digital signal form to improve its transmission characteristics.
Unlike the modem or the codec, which convert information because the medium accepts only analog or only digital signals, here the signal form is changed purely to improve how well it travels over the channel.
Transmitting digital information as another form of digital signal requires simpler circuitry and costs less than a modem, which has to convert digital information into an analog signal.
The simplest implementation represents the binary values '1' and '0' as negative (-) and positive (+) voltage levels, respectively. This code is called NRZ-L (Non-Return-to-Zero-Level), the name indicating that the signal never returns to the zero level within a bit interval. While this scheme has the virtue of simplicity, it is hard to tell one bit from the next, so the probability of error is high and synchronization is easily lost.

The pulse code modulation process

To address these problems, the Manchester and differential Manchester codes shown in Figure 3-5 are widely used. As the figure shows, these codes place a signal transition inside every bit, which makes error detection and synchronization much easier.

Manchester and differential Manchester codes
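A few lines of Python illustrate the difference between NRZ-L and Manchester coding (the NRZ-L polarity follows the text; the Manchester convention shown is the common IEEE 802.3 one and is an assumption here):

def nrz_l(bits):
    """NRZ-L: '0' -> positive level, '1' -> negative level, held for the whole bit."""
    return [(-1 if b else +1) for b in bits]

def manchester(bits):
    """Manchester: every bit has a mid-bit transition ('0' -> high-to-low, '1' -> low-to-high)."""
    out = []
    for b in bits:
        out.extend([-1, +1] if b else [+1, -1])
    return out

bits = [1, 0, 1, 1, 0]
print(nrz_l(bits))       # [-1, 1, -1, -1, 1]
print(manchester(bits))  # two half-bit levels per bit, so a transition in every bit for clocking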

The device that handles the digital transmission of digital information is generally called a digital service unit (DSU) (see Figure 3-6).
Analog information can easily be converted into an analog signal of the same bandwidth. The most typical example is voice: the human voice, a sound wave in the 300 to 3400 Hz range, can be represented as an electrical analog signal occupying the same frequency band, and that signal can be sent directly over a telephone line.

Functions of the DSU

To improve the transmission characteristics of an analog signal, analog information can also be converted into a different analog signal, just as in the analog transmission of digital information, by varying one parameter of a carrier such as its amplitude or frequency.
Amplitude modulation varies the carrier's amplitude according to the analog information, while frequency modulation varies the carrier's frequency according to the information.
These methods are used in broadcasting and are known, respectively, as amplitude modulation (AM) and frequency modulation (FM) broadcasting.
We have now looked at digital transmission and analog transmission, in which each form of information is carried as a digital or an analog signal.
As noted earlier, the reason information is converted into a signal before transmission is, above all, to deliver it to the other party more accurately and reliably.

In analog transmission the transmitted signal is analog, and the network simply carries that analog signal without regard to the information it contains.
To cover long distances, the signal must therefore be re-amplified along the way to make up for attenuation.
During amplification, however, not only the information-bearing signal but also any faint noise mixed in with it is amplified, so the probability of error is especially high when digital information is carried this way.
Digital transmission, by contrast, conveys the content of the transmitted signal: even if noise is picked up along the way, repeaters regenerate the original signal before sending it on, overcoming the attenuation introduced by the medium.

Table 3-2 summarizes the other advantages of digital transmission. As the table shows, digital transmission outperforms analog transmission across all three of the communication goals introduced in Chapter 1.

Because most equipment that handles computer data works with binary digital signals, and because digital transmission has more advantages than analog transmission, digital transmission is the preferred choice for computer-based data communications; in practice, however, analog transmission is often adopted in order to exploit the enormous installed base of telephone networks originally built to carry voice.

With the growing demand for data communications, however, network services for digital transmission are expected to keep expanding, and the integrated services digital network toward which the information society is heading is a digital network that carries digital signals directly; as it is built out, the use of digital transmission will spread further still.

The government's ongoing effort to build a high-speed integrated information network can ultimately be seen as a drive to digitize transmission across all domestic communications.


There are two ways to transmit digital information: serial transmission and parallel transmission. In serial transmission the bits that make up a character are sent one after another over a single line; in parallel transmission the bits are sent simultaneously over several lines.
If a character consists of 8 bits, parallel transmission requires at least 8 lines. Parallel transmission is generally used between a computer and its peripherals; over longer distances it is rarely used because of the cost of the additional lines.

Data communications therefore relies almost entirely on serial transmission. Figure 3-7 compares serial and parallel transmission.

Serial transmission versus parallel transmission


When characters are exchanged between sender and receiver, if the start and end of each character are not precisely defined, the receiver cannot correctly recognize the characters the sender has transmitted.
The sender and receiver must therefore agree in advance on exactly how characters will be transmitted, and asynchronous and synchronous transmission are the two widely used approaches.
Asynchronous transmission synchronizes very small blocks of bits, usually a single character, by inserting a start bit in front of each block and a stop bit after it; it is also called start-stop transmission.
Each character of 5 to 8 bits is framed by a start bit and a stop bit that mark its beginning and end, and characters are transmitted one at a time.

Teletype-style terminals, for example, mostly transmit data asynchronously: one character is sent each time a key is pressed.
There are therefore idle periods between characters, during which bits equivalent to the stop bit are sent continuously. If the receiver's clock is, say, 5% faster or slower than the sender's, the last sampled bit will be displaced by as much as 45% (5% × 9 bit times), but the data are still recognized correctly.
If, however, the block making up a character is larger than 8 bits, or the timing mismatch between sender and receiver exceeds 5%, a timing error occurs.
That is, the last sampled bit is misread and the bit count goes wrong.
If bit 7 were 0 and bit 8 were 1, bit 8 would be mistaken for a start bit; this kind of error is called a framing error.
In general, asynchronous transmission is simple and inexpensive, but each character carries 2 to 3 bits of overhead, including the start and stop bits, so its transmission efficiency is low and it is normally used only at low data rates.
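A minimal Python sketch of the start-stop framing and the timing-drift arithmetic described above (the 8-bit character and the 5% clock mismatch are the figures used in the text):

def frame_async(byte, data_bits=8):
    """One asynchronous character: start bit (0), data bits LSB first, stop bit (1)."""
    data = [(byte >> i) & 1 for i in range(data_bits)]
    return [0] + data + [1]

frame = frame_async(ord('A'))
print(frame)                                     # [0, 1, 0, 0, 0, 0, 0, 1, 0, 1]
print(f"framing overhead: {2 / len(frame):.0%}") # 20% of the line time is start/stop bits

drift = 0.05 * 9                                 # 5% clock mismatch over 9 bit times
print(f"sampling drift at the last bit: {drift:.0%}")  # 45%, still inside the bit cell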

Synchronous transmission, on the other hand, sends data not one character at a time but in blocks containing many characters.
Timing is supplied by equipment such as the modems or multiplexers at each end, and because entire data blocks delimited by sync characters or flags must be received, the terminal must be fitted with a buffer; it is normally used at rates above 2000 bps.
For reasonably large data blocks, synchronous transmission is also more efficient than asynchronous transmission, which is why most of the communication protocols used in data communications adopt it.
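A quick efficiency comparison along the lines of the paragraph above; the 1,000-character block and the 48-bit frame overhead assumed for the synchronous case are purely illustrative:

chars, bits_per_char = 1_000, 8
async_total = chars * (bits_per_char + 2)   # start + stop bit for every character
sync_total = chars * bits_per_char + 48     # assumed fixed overhead of flags and control fields

print(f"asynchronous efficiency: {chars * bits_per_char / async_total:.1%}")  # 80.0%
print(f"synchronous efficiency:  {chars * bits_per_char / sync_total:.1%}")   # ~99.4%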

As shown in Figure 3-8, synchronous transmission is further divided into character-oriented and bit-oriented schemes, and most recent protocols adopt the bit-oriented form.
HDLC and SDLC are typical examples of bit-oriented synchronous transmission protocols.

Synchronous and asynchronous transmission

Glossary of Terms

IEEE to Draft 40 Gbps and 100 Gbps Ethernet Specifications

Published in EEKOREA, September 10, 2007, by Rick Merritt

Following a consensus to put both 40 Gbps and 100 Gbps Ethernet networking on its roadmap, the IEEE will launch a formal task force next March to write specifications for the two data rates.

At its July meeting, the IEEE Higher Speed Study Group resolved to support drafting 40 Gbps and 100 Gbps Ethernet specifications. The 100 Gbps specification will include versions reaching 40 km and 10 km over single-mode fiber, 100 m over multimode fiber, and 10 m over copper cabling.

The 40 Gbps proposal, backed mainly by data center equipment makers, was the more contentious part of the decision. Telecom and networking engineers have recently been divided over 40 Gbps Ethernet.

Competing proposals

Telecom engineers argued that the new speed should at minimum be compatible with the 40 Gbps optical transport networks that have been deployed for at least the past six years. Steve Trowbridge, a standards expert at Alcatel-Lucent, wrote a proposal to that effect and won support from at least seven companies.

The proposal implies that a 40 Gbit standard could not use the 64/66-bit encoding scheme used by today's 10GbE, because that scheme does not leave enough payload to map data directly onto the optical network. Alcatel's proposal put forward 512/513-bit encoding as one possible approach.
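For context, the raw coding overhead of the two schemes under discussion works out as follows (a quick calculation, not taken from the article):

for payload, coded in ((64, 66), (512, 513)):
    overhead = (coded - payload) / payload
    print(f"{payload}b/{coded}b line code overhead: {overhead:.3%}")
# 64b/66b   -> 3.125%
# 512b/513b -> 0.195%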

"At 10G we ended up with two standards and four proprietary approaches, many of them point-to-point links that could not be networked," Trowbridge said. "Different customers picked different solutions, some of which did not interoperate well, and far too many systems got built as a result," he explained. "People do not want to repeat that experience."

Networking engineers, on the other hand, did not want a 40 Gbps speed grade at all; they worried it would fragment the sizable market for high-speed routers and switches that have no need to connect directly to optical transport networks.

"You would have two solutions in the same space. That essentially splits the market, and you would have to build two transceivers and two line cards for an overall market that is not going to be that large," said one networking engineer who asked not to be named.

Reaching consensus

In the end, all the stakeholders agreed to support three variants of the 40 Gbps standard: 100 m over multimode fiber, 10 m over copper cabling, and a 1 m reach over a backplane. IEEE materials state that the 40 Gbps specification will "provide appropriate support for optical transport networks."

"That wording leaves room for several different ways of meeting the goal," said Robert Hays, strategic marketing manager for Intel's Ethernet products and one of the backers of the 40 Gbps proposal. "One of the concerns people had in pushing this forward was that the 40 Gbps proposal would be too restrictive," he added.

In side discussions at the July meeting, Hays said, engineers were able to satisfy themselves that no group of companies would be forced to carry more of a burden than necessary to ensure compatibility between 40 Gbps Ethernet and optical transport networks. "I think everyone is satisfied because we reached a consensus that meets most of the requirements," he added.

Notably, the engineers also informally agreed to work with groups such as the Ethernet Alliance to educate the market on the different roles of 40GbE and 100GbE products.

The 40GbE rate will be used mainly within and between servers and switches, and demand for it is not expected until 2011 or later. 100GbE products will mainly link switches in data centers and backbone networks, and will be needed sooner, said Dan Dove, a consultant and long-time veteran of HP's networking business. He has been an active participant in the IEEE effort.

According to Dove, networking engineers will write the final specifications so that chip makers can design silicon supporting both speeds. "We don't want to face a situation where we double the silicon and double the number of systems," Dove said, adding that having one group write both specifications should help drive common technology across the two data rates.

The encoding scheme remains a difficult open question. Ethernet uses 64/66-bit encoding, while the telecom companies are proposing approaches such as 512/513-bit encoding. Engineers currently expect that much of the specification, especially the short-reach requirements, will be met with multiple channels of 10GbE, but the implementation question of parallel versus serial remains unresolved.

If the standardization effort stays on track, final 40 Gbit and 100 Gbit specifications could be completed before June 2010. Draft standards should be in place by early 2009, and some commercial products could be announced by the end of 2009.

Because the IEEE must write both the 40GbE and 100GbE specifications, however, the schedule could slip, Dove said.

Intel and Broadcom lobbied in support of 40 Gbps Ethernet networking.


Sep 20, 2007

Intel and Partners Form the USB 3.0 Promoter Group

Intel and several other industry leaders announced that they have formed the USB 3.0 Promoter Group to develop a superspeed USB running at 5 Gbps, ten times today's transfer rate.

Developed by Intel in cooperation with HP, NEC Corporation, NXP Semiconductors and Texas Instruments, the technology targets the fast sync-and-go transfer applications in the PC, consumer and mobile segments whose importance has grown as digital media become ubiquitous and file sizes climb to 25 GB and beyond.

Like earlier USB technologies, USB (Universal Serial Bus) 3.0 will establish a backward-compatible standard featuring the same ease of use and plug-and-play capability.

Targeting a more than tenfold performance increase, the technology builds on the same architecture as wired USB. In addition, the USB 3.0 specification will be optimized for lower power consumption and greater protocol efficiency. USB 3.0 ports and cabling will be designed to enable backward compatibility as well as future-proofing for optical capabilities.

"USB 3.0 is the next step for the PC's most widely used wired connection," said Jeff Ravencraft, Intel technology strategist and president of the USB Implementers Forum (USB-IF). "The digital era requires high-speed, highly reliable connectivity to move the enormous amounts of digital content that are now part of everyday life.

"USB 3.0 will meet that need for reliable connectivity while preserving the ease of use that made consumers favor USB technology in the first place and that they continue to expect from it."

Intel formed the USB 3.0 Promoter Group with the understanding that the USB Implementers Forum will serve as the industry trade association for the USB 3.0 specification. The completed USB 3.0 specification is expected in the first half of 2008, and USB 3.0 will initially take the form of discrete silicon.

The USB 3.0 Promoter Group aims to preserve the infrastructure and investment in class drivers for existing USB devices, along with the same look-and-feel and ease of use, while continuing to extend the capabilities of this great technology.

Industry Leaders Develop Superspeed USB Interconnect

Popular USB Computer Connection Technology Expands performance with Proposed USB 3.0 Specification

INTEL DEVELOPER FORUM, San Francisco, Sept. 18, 2007 - Intel Corporation and other industry leaders have formed the USB 3.0 Promoter Group to create a superspeed personal USB interconnect that can deliver over 10 times the speed of today's connection. The technology, also developed by HP, Microsoft Corporation, NEC Corporation, NXP Semiconductors and Texas Instruments Incorporated, will target fast sync-and-go transfer applications in the PC, consumer and mobile segments that are necessary as digital media become ubiquitous and file sizes increase up to and beyond 25 Gigabytes.

USB (Universal Serial Bus) 3.0 will create a backward-compatible standard with the same ease-of-use and plug and play capabilities of previous USB technologies. Targeting over 10x performance increase, the technology will draw from the same architecture of wired USB. In addition, the USB 3.0 specification will be optimized for low power and improved protocol efficiency. USB 3.0 ports and cabling will be designed to enable backward compatibility as well as future-proofing for optical capabilities.

"USB 3.0 is the next logical step for the PC's most popular wired connectivity," said Jeff Ravencraft, technology strategist with Intel and president of the USB Implementers Forum (USB-IF). "The digital era requires high-speed performance and reliable connectivity to move the enormous amounts of digital content now present in everyday life. USB 3.0 will meet this challenge while maintaining the ease-of-use experience that users have come to love and expect from any USB technology."


Intel formed the USB 3.0 Promoter Group with the understanding that the USB-IF would act as the trade association for the USB 3.0 specification. A completed USB 3.0 specification is expected by the first half of 2008. USB 3.0 implementations will initially be in the form of discrete silicon.

The USB 3.0 Promoter Group is committed to preserving the existing USB device class driver infrastructure and investment, look-and-feel and ease-of-use of USB while continuing to expand this great technology's capabilities.

About the USB-IF
The non-profit USB Implementers Forum, Inc. was formed to provide a support organization and forum for the advancement and adoption of USB technology. The USB-IF facilitates the development of high-quality compatible USB devices, through its logo and compliance program and promotes the benefits of USB and the quality of products that have passed compliance testing. Further information, including postings of the most recent product and technology announcements, is available by visiting the USB-IF Web site at www.usb.org.

Participating Company Press Quotes:
"HP's commitment to providing customers with a reliable method for connecting peripherals is evident through our support of both USB 2.0 and Wireless USB technologies," said Phil Schultz, vice president, Consumer Inkjet Solutions, HP. "Now, with USB 3.0, we're creating an even better experience for customers when connecting their printers, digital cameras or other peripheral devices to their PCs."

"Intel worked jointly with industry leaders in the development and adoption of two generations of USB, which has become the number one peripheral interface in computing and hand-held consumer electronic devices," said Patrick Gelsinger, senior vice president and general manager, Digital Enterprise Group, Intel Corporation. "As the market evolves to support customer demands for storing and moving larger amounts of digital content, we look forward to developing the third generation of USB technology that leverages the current USB interface and optimize it to meet these demands."
"NEC has been a supporter of USB technologies since the first installment of wired USB," said Katsuhiko Itagaki, general manager, SoC Systems Division, NEC Electronics Corporation. "Now it's time to evolve an already successful interface to meet market demands for moving large amounts of content at faster speeds to minimize users wait time."

"NXP is pleased to join other top-tier companies in advancing the number one interconnect technology in the world to meet the needs of next-generation peripherals," said Pierre-Yves Couteau, director of Strategy & Business Development, Business Line Connected Entertainment, NXP Semiconductors. "As a leading provider of USB semiconductor solutions, NXP is committed to drive the standardization and applications of Superspeed USB in the industry."

"With the proliferation of Hi-Speed USB in a wide number of market segments, including personal computing, consumer electronics, and mobility, we anticipate that USB 3.0 will rapidly become the de facto standard as the replacement of USB 2.0 ports in applications where higher bandwidth is valued," said Greg Hantak, vice president Worldwide ASIC at Texas Instruments Incorporated. "TI is excited about the new applications and improved user experience that will be enabled by the performance of USB 3.0."

For more news coverage out of IDF, visit the complete press kit at www.intel.com/pressroom/idf.

About Intel
Intel, the world leader in silicon innovation, develops technologies, products and initiatives to continually advance how people work and live. Additional information about Intel is available at www.intel.com/pressroom and blogs.intel.com.