Understanding Mesh Networking, Part II Protocols & Performance
Mesh networks are an integral part of the burgeoning decentralization movement and critical to realizing its vision. In Part I we discussed the history of mesh networks, and some challenges in making them work. To recap briefly, the basic multi-hop mobile wireless mesh networking concept has been around since the 1970’s, under different names such as “packet radio networks”, “mobile ad hoc networks”, and “sensor networks” and has found applications in the military, first response, and recreation amongst others. Challenges specific to mesh networks include managing the link performance tradeoffs (e.g. throughput vs range), decentralized access control and decentralized routing.
In this part, we will overview some of the approaches that have been developed over the years for these problems. Often, we will find multiple approaches to the same problem, each valid in its own right, and defended with fervor by proponents of each. We will discuss the pros and cons of some of these, and their applicability to various mesh network types. Finally, we will discuss how goTenna’s unique protocol approach has radically transformed the state of art. We keep the treatment at a high level as in Part 1, and focus on the key intuition behind the ideas that make mesh networking work.
Wireless Technology
Let us begin with the whack-a-mole problem of optimizing SWaP (Size, Weight and Power), cost, throughput and range, that is, the fact that you can only optimize some of these. As such, this is an unsolved problem — e.g., a 100 Mbps device with 1 mile range at less than $100 that fits into your pocket does not exist. The best one can do is study the requirements of the application and pick the right tradeoff. The adjoining figure shows some of the radio technologies and how they have traded off data rate vs range — the cost and SWaP dimensions are not shown.
At one end of the tradeoff space, we have LoRa (which stands for Long Range) that uses ultra-low-power transmissions that go up to 10 km but at very low rate. This is ideal for many Internet-of-Things (IoT) applications. At the other end, we have LiFi which uses LED-based light for communication and has been shown to achieve upwards of 200 Gbps in the laboratory but cannot penetrate walls. In between we have several well-known technologies. Although 4G/LTE appears to have the best of both worlds, it typically requires centralized equipment, and is costly.
Given the limitations based on physical laws, one approach to achieving both range and throughput is to use devices that have multiple technologies inbuilt, say using multiple transceivers each picking a different throughput-range combination. For example, the Zigbee PRO 2017 enables compatible devices to mesh over both the 2.4 GHz and 900 MHz ISM bands, leveraging the throughput-advantage of 2.4 GHz, and the range-advantage of 900 MHz. Multiple link layers in a node can be coordinated and integrated into the mesh at a higher layer. In future, we will likely see more such devices at progressively lower price points as a kind of “indirect” way of solving the RF whack-a-mole problem.
Access Control
Accessing the channel in a mesh network is somewhat like participating in an audio-only teleconference. Anybody who has participated in one that has more than 10 or so loquacious participants knows how chaotic it can get! In centralized systems such as WiFi and cellular, the presence of a central node that manages access makes the problem fundamentally easier, but without this luxury, mesh networks need distributed mechanisms, that is, protocols where all nodes execute a set of steps (same set across nodes) to arrive at a mutually consistent decision.
The earliest access control protocol for mesh networks was developed around 1971 at the University of Hawaii. Called ALOHA, the protocol simply transmitted a packet, and if there was a collision, retransmitted it. ALOHA gets the job done in a minimalistic manner for very low load, but it was mathematically shown to have at most 18% efficiency.
The fairly straightforward next step of listening before talking (as in teleconferences) is known as Carrier Sense Multiple Access (CSMA). CSMA has many variants, and one that goes by the name of CSMA/CA (CA stands for “collision avoidance”) has a node perform a random backoff if it finds the channel busy. That is, it postpones its transmission by a period randomly selected within a time window, and tries again. This idea has been a major hit — the IEEE 802.11 set of standards, basically WiFi, employs a kind of CSMA/CA.
While CSMA/CA performs significantly better than ALOHA, it still is susceptible to the hidden node problem outlined in Part 1. A hidden node is a node that can interfere with a transmission but cannot be sensed. One of the first solutions to this problem was the use of a Request-To-Send (RTS) and Clear-To-Send (CTS) handshake prior to the transmission of data. As the adjoining figure shows, the CTS forces hidden nodes to back off, allowing for safe passage of the ensuing data packet. The RTS-CTS idea with some refinements has been incorporated into the IEEE 802.11 standards as Virtual Carrier Sense. While this elegantly solves the hidden node problem, it introduces additional control overhead that reduces overall throughput, perhaps even negating the advantage it provides.
At this point you are probably thinking: why not schedule the transmissions? Indeed, allocating a periodic slot within a repeating frame in what is known as Time Division Multiple Access (TDMA) is extremely well established in centralized systems (e.g. cellular networks) since the central controller can hand out slots. However, establishing common frame and slot boundaries and mutually agreeing on non-conflicting slots is quite hard to do in a decentralized manner, and doubly so when the nodes are moving around and the schedules need to be re-computed.
On the other hand, TDMA utilizes the channel far more efficiently than CSMA/CA, and this has motivated the pursuit of TDMA for mesh networks (see table for more detailed pros/cons). A typical approach involves splitting up a repeating frame into a two parts — an initial part where nodes do ALOHA or CSMA to contend for rights in the second part. Control packets containing schedules may be exchanged to de-conflict schedules, although a single change has the potential to percolate throughout the network. From my experience, TDMA works best in stationary mesh networks. Currently, CSMA/CA dominates in commercial mesh networks whereas both TDMA and CSMA/CA are used in military networks. Despite being around for decades, neither has emerged a clear winner, and the topic continues to be the subject of research and hot debates!
Multi-hop Routing
As discussed in Part 1, when there are more than a trivially small number of nodes, we cannot afford to use “flooding” to send packets around. In simplest terms, routing is the decentralized process by which every node decides whether or not to forward a received packet such that the sequence of these decisions gets a packet to where it needs to go using minimal transmissions. There are, broadly speaking, three kinds of routing: unicast (one-to-one), broadcast (one-to-all), and multicast (one-to-many). For this article, we focus on unicast routing.
Much of mesh routing has evolved as adaptations of the ideas first developed for the Internet. Two such ideas are distance-vector routing and link-state routing. Distance-vector routing was first developed in the mid-1970’s for ARPANET — the DoD funded network from which the Internet evolved. The basic idea is for each node to advertise how many hops it is from each destination (a “vector of distances”). A node receives such advertisements from each of its neighbors and figures out a new table combining the information — essentially figuring out which neighbor offers the smallest hops, and what that number is — and advertises that table. This process continues till the set of tables converges to something consistent.
Distance-vector is simple and elegant, but convergence is elusive in dynamic topologies. By the late 1970’s, frustrated by persistent loops and the excessive size of advertisements, ARPANET researchers replaced it with link-state routing. In link-state routing, each node periodically polls the “state” (up or down) of each of its links and broadcasts it to every other node. Consequently, every node has an up-to-date snapshot of the entire network topology. Using this information, every node generates a route to every other node and stores the next hop in a routing table. The most popular way to do this is using the Dijkstra shortest path algorithm — a neat algorithm invented by E.W. Dijkstra way back in 1959 in the context of graph theory. Since every node uses the same algorithm on the same data, a consistent sequence of next hops to the destination is obtained.
Mesh network researchers naturally started with link-state and distance-vector, but soon realized that they had to modify them quite a bit to accommodate the much lower bandwidth, wireless link unreliability, and mobility of nodes. Further, for networks with battery constraints and intermittent traffic, constantly updating routes is wasteful. Thus, mesh network routing diverged into two main approaches: proactive routing where a node knows the route in advance of actually having packets to send; and reactive routing where a node only computes routes when necessary. It turns out that while link-state is fundamentally proactive, distance-vector can be spun as both reactive and proactive. Reactive protocols typically broadcast (flood) a route request, which is like paging for the destination. The destination responds with a route reply, which is used to establish a unicast route for further traffic between the nodes. Reactive protocols are decidedly better in power-constrained settings, whereas proactive protocols provide lower latencies. The adjoining table classifies some of the more well-known mesh networking protocols today. AODV and OLSR were proposed standards within the Internet Engineering Task Force (IETF), and AODV derivatives are part of Zigbee and IEEE 802.11s.
Control overhead
At the highest level, one can think of mesh protocols as having two parts: (1) send around some control packets to establish state, i.e., information required to properly handle data packets; and (2) use that state to decide what to do with data packets (forward, access the channel etc.). The amount of control packet bytes is known as control overhead — the larger this is, the worse off one is, since there is less capacity available for actual data.
Control overhead typically grows with network size (number of nodes). Take link-state routing (e.g. OLSR) for example. Control packets are essentially flooded and so grow quadratically with size. Therefore, at some size, all capacity is consumed by control overhead, even when there is no single data packet that needs to be sent! Reactive protocols do better under low load, but they still have considerable overhead. Control overhead limits scalability.
While this is not a big issue in wireline or high bandwidth wireless mesh networks, it is a major one in low capacity networks such as goTenna, which deliberately trades off throughput for lower cost, SWaP and best-in-class range (see our recent white paper for why this is the right tradeoff). In such a situation, even the slightest overhead is problematic. On the other hand, the use of control packets is almost axiomatic — all of the protocols mentioned above, or used in industry, or even in academic papers have control packets. What then, can be done?
At goTenna, faced with this challenge, we started with the question: is the use of control packets really a must? In doing so, we came up with a novel approach that uses zero control packets, and yet provide performance that matches or exceeds that of existing protocols. State is learned from observation of data packet patterns and header information. Both our broadcasting protocol (ECHO) and unicast protocol (VINE) use zero control packets. Detailed simulations have shown that VINE uses 1/12th of the overhead of AODV and yet provides better delivery ratio. (For a 30 node mobile mesh network randomly placed in a 3km x 3km area, nodes moving at 4 m/s, unicast traffic from each node at 2 packets per min. Overhead comprises of all non-payload bits.) Our breakthrough is a game changer that finally makes long-range wireless mesh networks scalable. Both ECHO and VINE are slated to be part of a future release of goTenna’s Aspen Grove 2.0 protocol suite. We’re preparing papers on these protocols and their performance.
Mesh protocols and performance is a diverse and complex topic, and an article of this type can only cover a small part of it. For readers interested in a more in-depth treatment, we recommend the collection of papers in the book Mobile ad hoc networking.
Comments