ExpressFabric: Thinking Outside the Box
ExpressFabric® is the revolutionary PLX platform that enables, for the first time, a converged fabric to be deployed using the universal PCI Express interconnect. The fabric allows:
- SR-IOV devices to be shared among multiple hosts with the standard, existing hardware, drivers, and application software
- Multiple hosts to reside on a single fabric, using standard PCIe enumeration mechanisms
- Hosts to communicate through Ethernet-like DMA or InfiniBand-like RDMA using standard devices and application software
The Soul of a New Interconnect
PCI Express (PCIe) has been deployed throughout nearly every market since the first version was introduced in 2003. The specification has progressed from a per-lane data transfer rate of 2 Gb/sec, to the current Gen3 per lane transfer rate of 8 Gb/sec, and can be aggregated to allow a bidirectional data transfer rate as high as 256 Gb/sec.
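The aggregate figure follows directly from the per-lane rate. A minimal back-of-the-envelope sketch, assuming a Gen3 x16 link (the widest common configuration):

```python
# PCIe Gen3 bandwidth arithmetic for an assumed x16 link
LANE_RATE_GBPS = 8            # Gen3 raw signaling rate: 8 GT/s per lane
LANES = 16                    # a x16 link

per_direction = LANE_RATE_GBPS * LANES      # 128 Gb/s each way
bidirectional = per_direction * 2           # 256 Gb/s aggregate, as in the text

# Gen3 uses 128b/130b line encoding, so usable throughput is slightly lower
effective_per_direction = per_direction * 128 / 130  # about 126 Gb/s

print(bidirectional)  # 256
```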
Until recently, the success of PCIe has been primarily as a fanout interconnect, enabling CPUs, I/O, and storage devices – all of which have a PCIe interface – to communicate. There has been penetration into more sophisticated applications, such as host failover, and the PCIe interconnect standard has even been used as a backplane to connect PCIe-based subsystems. But given the performance of PCIe at Gen3, and its widespread adoption on devices, the popular interconnect has become an attractive alternative to current solutions as a fabric for cloud and enterprise data center applications.
PLX has extended the reach of PCIe for use as a cloud and enterprise data center fabric through its ExpressFabric initiative. By building on the natural strengths of PCIe – it's everywhere, it's fast, it's low power, it's low latency, it's affordable – and by adding some straightforward, standards-compliant extensions that address multi-host and I/O sharing applications, PLX has created a universal interconnect that substantially improves on the status quo.
Saving Space, Money, Power, and Effort
PCIe is the main connection to the world on almost every device used in the data center. CPUs, communication devices, storage devices, GPUs, FPGAs – even other fabrics – connect to each other through PCIe. Using PCIe as the main fabric for the data center rack eliminates the cost, power, and latency of the "bridging" devices that make up the status quo today: primarily Ethernet and InfiniBand.
The backbone of the cloud or enterprise data center will continue to remain mostly Ethernet, with InfiniBand supporting HPC installations, but PCIe will serve as the fabric within the rack. This peaceful co-existence – with each technology deployed where its capabilities fit best – will be the norm.
Ethernet & InfiniBand, the incumbents, have advantages – Ethernet has low cost and a large software base; InfiniBand has low latency and high performance. But neither has the advantages of PCIe:
Almost all storage, I/O, and compute devices used in data centers have PCIe connections, often more than one. This means that a high speed fabric hooking them all together can be constructed without using bridges or other translating devices to match the source and destination subsystems. This has the advantages of requiring fewer components, leading to lower latency, lower cost, and lower power. But it also allows a unified view from an architectural and software standpoint. Two PCIe devices, whether a few inches apart or across the room, can be connected seamlessly without regard to the distance or topology.
Some CPU devices have Ethernet on them – in addition to PCIe – but fewer non-CPU devices have such ports. And even the CPUs that have Ethernet as an on-chip port do not directly support the high speeds necessary for a mainstream fabric. In fact, when Ethernet is used as a fabric, the Ethernet component is often translating to and from the PCIe subsystems. And the situation with InfiniBand is even less promising - very few devices have InfiniBand as a direct connection.
It is highly efficient to allow different data types to travel along the same pathway, and then be consumed by the appropriate end point, without needing to determine prior to sending the data whether it is I/O or storage. PCIe is the most practical interconnect for enabling this type of convergence of communication and storage data traffic. The subsystems can be separated and shared, rather than duplicated on each data center blade.
Software Defined Fabric
ExpressFabric is built on a foundation that combines dedicated hardware with firmware that initializes and manages it.
The fabric is defined by several different types of connection points – or ports. Each port on the fabric can be specified to be one of these, and takes on specific characteristics once it has been defined:
A host port is assumed to be hooked up to a CPU complex or server. The host "believes" that it is acting as the root complex of the PCIe network – and runs the enumeration software or firmware for that purpose. It is not actually acting in that capacity, as will be explained. But by making it believe that it is a root complex, it can run standard software. Each host port has a dedicated DMA/RDMA engine, and a special low latency mechanism called a Tunneled Window Connection (TWC) for host-to-host communication. A host port also has the circuitry necessary to share the end points.
A downstream port is where the devices or end points reside. When a port is defined to be a downstream port, it invokes the logic that enables sharable end points (SR-IOV or multifunction) to be shared among multiple hosts. This sharing is done using the standard vendor hardware and device drivers. The ability to share SR-IOV devices across multiple hosts is called ExpressIOV.
A fabric port is the connection point between different ExpressFabric devices. This allows a scalable fabric to be created using the standard building blocks. Although the CPUs & devices are using standard PCIe, the routing inside the fabric is done by a Global ID (GID) rather than an address. The fabric port understands the GID, and thus allows a unified fabric to be deployed.
The mCPU port – there can be one per fabric device – initializes and controls the fabric. It configures the routing and other tables, handles serious errors and events such as hot plug, and enables multiple hosts to reside on the same PCIe network without the problems normally associated with that situation.
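The four port roles above can be modeled with a small sketch. All names here are invented for illustration; in the actual product the assignment is done by the mCPU firmware through device registers:

```python
from enum import Enum

class PortType(Enum):
    HOST = "host"              # attached to a CPU/server; gets DMA/RDMA and TWC
    DOWNSTREAM = "downstream"  # attached to end points; enables SR-IOV sharing
    FABRIC = "fabric"          # links ExpressFabric devices; routes by Global ID
    MCPU = "mcpu"              # management CPU; at most one per fabric device

class FabricDevice:
    """Toy model of per-device port configuration (illustrative only)."""
    def __init__(self, num_ports):
        self.ports = [None] * num_ports

    def configure_port(self, index, ptype):
        # Enforce the one-mCPU-port-per-device rule described in the text
        if ptype is PortType.MCPU and PortType.MCPU in self.ports:
            raise ValueError("a fabric device has at most one mCPU port")
        self.ports[index] = ptype

dev = FabricDevice(num_ports=4)
dev.configure_port(0, PortType.MCPU)
dev.configure_port(1, PortType.HOST)
dev.configure_port(2, PortType.DOWNSTREAM)
dev.configure_port(3, PortType.FABRIC)
```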
Creating a Synthetic Hierarchy
One of the main obstacles to creating a fabric with PCIe has been attaching multiple hosts to the network without a lot of custom enumeration and software. ExpressFabric completely removes that issue, and does so with standard PCIe enumeration mechanisms – allowing existing servers to be on the fabric without changing the firmware or software.
ExpressFabric achieves this new, unique capability by having the host enumeration accesses redirected to the mCPU upon initialization. The mCPU, which is the actual root complex of the network, provides enumeration responses to the host that are "appropriate", but instead of providing the real topology, they give the host a topology that the mCPU creates. When the enumeration is complete, the host has a topology in its memory that has been synthesized for it by the mCPU. The host then knows about the other ports in a manner that allows normal operation.
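Conceptually, the redirection behaves like a table of canned configuration-space responses held by the mCPU. This toy sketch (structure and names hypothetical; 0x10B5 is PLX's real vendor ID, the device ID is made up) shows the idea of answering the host's enumeration reads from a synthesized topology rather than the physical one:

```python
# Toy model: the mCPU intercepts the host's config reads during enumeration
# and answers from a synthetic topology, not from the physical fabric.

class MCPU:
    def __init__(self):
        # (bus, device, function) -> (vendor ID, device ID) the host will see
        self.synthetic_topology = {}

    def publish(self, bdf, vendor_id, device_id):
        self.synthetic_topology[bdf] = (vendor_id, device_id)

    def handle_config_read(self, bdf):
        # Unpopulated slots read back as all-ones, as on real PCIe
        return self.synthetic_topology.get(bdf, (0xFFFF, 0xFFFF))

mcpu = MCPU()
mcpu.publish((0, 1, 0), 0x10B5, 0x8700)   # one synthesized endpoint
assert mcpu.handle_config_read((0, 1, 0)) == (0x10B5, 0x8700)
assert mcpu.handle_config_read((0, 9, 0)) == (0xFFFF, 0xFFFF)
```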
ExpressFabric offers a range of flexible options for hosts to communicate with high performance and low latency – and to do so using standard mechanisms and application software.
The majority of applications that run within a data center use Ethernet as the fabric, and there is a vast library of applications that have been deployed for this purpose. ExpressFabric enables that application software to run unchanged through the use of a virtual Ethernet NIC on each host port.
The DMA engine includes a fabric-friendly broadcast/multicast capability, enabling efficient communication with more than one host when necessary.
When performance is critical in clustering applications, Remote DMA (RDMA) – also called zero copy – is used to eliminate most of the software overhead of copying the data repeatedly. ExpressFabric has dedicated RDMA hardware to handle this function, offering InfiniBand-like performance without specialized hardware.
When there is a need for a small message to be passed between hosts, an approach called Tunneled Window Connection (TWC) is available. This allows messages to be sent from one host to another in a very low latency manner, and without the overhead associated with DMA.
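One way to think about the three mechanisms above is as a size/overhead trade-off. The sketch below is illustrative only; the threshold and function names are invented, not taken from the ExpressFabric specification:

```python
def pick_transport(message_bytes, zero_copy_capable=False):
    """Illustrative host-to-host transport selection (threshold invented)."""
    SMALL = 256          # tiny control messages: TWC avoids DMA setup cost
    if message_bytes <= SMALL:
        return "TWC"     # Tunneled Window Connection: lowest latency
    if zero_copy_capable:
        return "RDMA"    # zero copy: eliminates intermediate buffer copies
    return "DMA"         # virtual-NIC style transfer, Ethernet-like semantics

assert pick_transport(64) == "TWC"
assert pick_transport(1_000_000, zero_copy_capable=True) == "RDMA"
assert pick_transport(9000) == "DMA"
```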
Building High Performance SSD-Based Systems
One of the major enabling capabilities of ExpressFabric is the universal nature of PCIe. And nowhere is this more true than in the case of storage systems. PCIe has been used for many years throughout the storage market as a major interconnection mechanism, and there is history and momentum with the architecture and a large software application base. In storage, PCIe is the incumbent.
The newest, and the fastest growing, segment of the storage market is Solid State Drives (SSDs), and the industry has standardized on PCIe as the primary connection to this new, high performance storage element. ExpressFabric allows PCIe-based SSDs to be directly connected to the fabric with no latency-inducing translation devices. And it furthermore allows them to be connected to other PCIe-based devices and subsystems – basically everything else in a system.
In addition to this basic ability to interoperate with the rest of the system, ExpressFabric offers the extended reliability that is now part of the PCIe specification.
Most servers have difficulty handling serious errors, especially when an end-point disappears from the system due to, for example, a cable being pulled. The problem tends to propagate through the system until recovery becomes impractical. Downstream Port Containment (DPC) allows a downstream link to be disabled after an uncorrectable error. This makes error recovery feasible with the appropriate software, and is especially critical in storage systems, since the removal of a drive needs to be handled in a controlled and robust manner.
In addition to offering this PCI-SIG ECN, ExpressFabric devices track outstanding reads to downstream ports, and synthesize a completion so that the host does not get a completion time-out if the end-point is removed.
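A toy model of that completion-synthesis behavior, with all structures invented for illustration: the switch tracks reads outstanding to a downstream port and, if the end point is removed, returns synthesized completions so the host never hits a completion timeout.

```python
# Toy model of completion synthesis on surprise removal (illustrative only)

class DownstreamPort:
    def __init__(self):
        self.outstanding_reads = set()   # tags of in-flight read requests

    def issue_read(self, tag):
        self.outstanding_reads.add(tag)

    def complete_read(self, tag):
        self.outstanding_reads.discard(tag)
        return ("completion", tag)

    def surprise_removal(self):
        # Synthesize a completion for every read still pending, so the host
        # sees a response instead of a completion timeout.
        synthesized = [("synthesized_completion", t)
                       for t in sorted(self.outstanding_reads)]
        self.outstanding_reads.clear()
        return synthesized

port = DownstreamPort()
port.issue_read(1)
port.issue_read(2)
assert port.surprise_removal() == [("synthesized_completion", 1),
                                   ("synthesized_completion", 2)]
```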
Highly Scalable, Flexible Fabric Topologies
ExpressFabric interfaces with standard PCIe devices, using standard software. Once inside the fabric, however, the information is routed through a Global ID (GID), rather than an address. The mapping between the address and the GID is provided by the management CPU (mCPU), and this approach allows the fabric to eliminate the hierarchical topology restriction of standard PCIe.
ExpressFabric allows other topologies such as mesh, fat tree, and many others. And it does this while allowing the components to remain architecturally and software compatible with standard PCIe.
ExpressFabric allows topologies that have more than one path between the elements of the system, and this enables the devices in the system to invoke spread routing, where the data can be sent through more than one path. Congestion information is also shared between devices, allowing the source of the data to make better decisions on how to handle the routing.
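The two ideas above – address-to-GID mapping programmed by the mCPU, and congestion-aware spread routing across multiple paths – can be sketched as follows. Every table and name here is invented for illustration; the real mechanism lives in switch hardware and mCPU firmware:

```python
# Toy sketch of ID routing with spread routing (illustrative only)

# mCPU-programmed table: address range -> destination Global ID (GID)
ADDRESS_TO_GID = [
    (0x1000_0000, 0x1FFF_FFFF, 7),
    (0x2000_0000, 0x2FFF_FFFF, 12),
]

def lookup_gid(address):
    for start, end, gid in ADDRESS_TO_GID:
        if start <= address <= end:
            return gid
    raise LookupError("address not mapped to a GID")

# Several physical paths to the same GID, with shared congestion estimates
PATHS = {7: [("path_a", 0.2), ("path_b", 0.7)]}

def pick_path(gid):
    # Spread routing: prefer the least congested path to the destination
    return min(PATHS[gid], key=lambda p: p[1])[0]

gid = lookup_gid(0x1234_5678)
assert gid == 7
assert pick_path(gid) == "path_a"
```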
Mainstream Applications for ExpressFabric
ExpressFabric-based products are an outstanding solution when designing a heterogeneous system where there is a requirement for a flexible mix of processors, storage elements, and communication devices.
Typical server boxes that are used to create modern cloud and enterprise data centers consist of racks that include modular subsystems and communicate with each other over a backplane or through cables. The connections within the racks are well suited to ExpressFabric. Instead of treating each subsystem as a separate server node (with some predetermined or limited quanta of processing, storage, and communication), an ExpressFabric platform can be assembled from dedicated blades that each perform a specific function.
Appliances - dedicated function boxes that offer a specific capability and are connected to the rest of the system through a standard interface – are especially well-suited to an ExpressFabric approach. Common appliance applications are test equipment and storage.
Most modern high speed storage subsystems have a mix of rotating media and SSDs to balance performance and cost, and include some processing as well to manage the system. These systems can be deployed efficiently with ExpressFabric, since the storage subsystems all hook up to PCIe either directly (SSDs) or indirectly (SAS or SATA controllers), and can communicate directly with the processors and communication chips.
HPC clusters are made up of high-performance processing elements that communicate through high bandwidth, low latency pathways in order to execute applications such as medical imaging, financial trading, data warehousing, etc. InfiniBand is often used in these applications due to its native support of RDMA.
An ExpressFabric solution can offer that same capability – high bandwidth, low latency, & native RDMA – without the need for the InfiniBand HCAs and switches. The processing subsystems can be hooked up directly to the PCIe fabric and run the same application software, benefiting from lower cost & power due to the elimination of the bridging devices.
And clustering systems can be built with I/O sharing as an additional native capability when needed. This is not normally provided with traditional clustering systems built on InfiniBand.
A MicroServer is a system designed with a large number of lower power and lower cost processing engines rather than larger (and thus much higher power and cost) high-end server processors. MicroServers offer substantial benefits when an application needs a lot of aggregate processing but can be spread among many smaller engines. Some typical applications are Web servers and Hadoop data analysis.
Most MicroServer elements today are Systems-on-a-Chip (SoCs) that combine processing, storage, and communication, and these elements are hooked together with either proprietary or low speed Ethernet connections. Since these processing elements generally have PCIe on them, ExpressFabric is an ideal interconnect for a MicroServer system.
Rapid and Effective Development Tool Suite
One of the most important aspects of a technology is how quickly it can be brought to market. And PLX makes ExpressFabric quick and easy to design and deploy with the FabricBuilder suite. FabricBuilder includes:
A full rack-level Top-of-Rack (ToR) switch box implementation, offering 32 Gen 3 ports in a 1U form factor. This switch attaches to the rack servers through an optional redriver-based PCIe plug-in card. The connection between the server and the ToR switch is through industry standard QSFP+ connectors and either copper or optical cables.
Firmware that runs on the management CPU (mCPU) and controls the features of the device hardware. PLX includes a basic reference software solution with the switching devices that provides a fully usable system. It boots and initializes the system, and supports a functional solution with I/O sharing, DMA, & RDMA.
Host drivers that allow standard TCP/IP Ethernet-based or OFED-based applications to run.
The software that PLX includes with the package is provided in source form. This enables the designer to modify and enhance the functions of the solution.
ExpressFabric: A Fabric for the Next Generation
ExpressFabric offers the advantages of PCIe, and includes the enhancements necessary to create a state-of-the-art fabric.
- ExpressFabric enables the use of existing hardware and software to create a converged system, where you add just the right mix of components based on the application need.
- ExpressFabric offers native sharing of I/O devices among multiple hosts, and allows the hosts themselves to communicate through standard cloud and enterprise data center approaches.
- ExpressFabric requires fewer components to create the fabric, translating to lower cost, power, and latency.
- ExpressFabric connects directly to PCIe-based SSD devices, and offers robust operation through downstream port containment.
Since ExpressFabric is based on PCI Express, it provides a platform with high volume, mainstream price points.