Flow Completion Time (FCT) is probably the most important user-perceived metric. I wrote this post is to crystallize several ingredients of the line of related work.

Why FCT matters?

One has to differentiate between user-perceived metrics and metrics cared by the network operators. Users are mainly concerned about the time (i.e., FCT) to finish the current network transaction (web browsing, file transfer), unlike operators that are interested in network utilization, throughput, fairness and so forth. Optimizing operator defined metrics such as the network throughput does not necessarily imply the improvement of FCT. Why Flow-Completion Time is the Right Metric for Congestion Control provides details on such gap for canonical congestion control mechanisms.

What is the optimal policy to minimize FCT theoretically?

Shortest Remaining Processing Time (SRPT) is the optimal policy in a single link setting in terms of optimizing FCT, regardless of the flow size distribution. Analysis of SRPT Scheduling: Investigating Unfairness investigates the objections to SRTP (not knowing the flow size a priori, unclear performance improvements over PS, starvation for large flows) with the metrics of job response time and job slowdown in an M/G/1 system. The paper mainly looks at a truncated heavy-tailed distribution, i.e., the job size is sampled i.i.d. from the Bounded Pareto distribution. It proves that SRPT outperforms PS scheduling w.r.t. mean slow down for all job distributions, besides the well-known facts about its FCT optimality. It also derives formulas for M/G/1/SRPT under overload.

Unfortunately, there is no universally optimal policy in a multi-link setting. pFabric: Minimal Near-Optimal Datacenter Transport summarizes the issue: even under the simplified network fabric, minimizing the average FCT corresponds to the NP-hard sum-multicoloring problem. However, the greedy flow scheduling algorithm with an ideal big switch view promises to provide at least a 2-approximation to the optimal.

How to minimize FCT with a limited number of service queues?

Even with a single link, there is a practical problem: commodity switches typically support an only limited number of $K$ FIFO queues (e.g., $K=4\sim8$), which prevents us from differentiating flows by the key value of flow size in a fine-grained manner (ps, there are active works to enhance the switch hardware to support e.g., priority queue heap with O(log n) operation complexity).

With this, one could only use $K-1$ thresholds to roughly differentiate the flows w.r.t. the flow size. pFabric: Minimal Near-Optimal Datacenter Transport provides a derivation of a 2-queue example, as summarized below (I included the omitted intermediate steps). The authors also generalize it to an arbitrary finite number of queues in the technical report.

  • Objective: theoretically justify that the average FCT optimization depends on the threshold and the flow size distribution
  • Assumption: assume the flow size is known or accurately measured a priori
  • Model representation: link capacity 1; flow arrival rate (NOT packet arrival rate) $\lambda$ following Poisson process; flow size CDF $S\sim F_{S}(\cdot)$; the link is NOT overloaded, i.e., total load $\rho=\lambda E(s) < 1$; a single threshold $t$ that classifies flows into high priority queue 0 and low priority queue 1
  • Derivation: the aggregate Poisson process could be split into 2 independent poison processes at each service queue, with the rate $\lambda_{0}=\lambda F_{S}(t)$ and $\lambda_{1}=\lambda(1-F_{S}(t))$. The drain rate of queue 0 is $\mu_{0}=1$ and the drain rate of queue 1 is the remaining bandwidth after serving queue 0, which is $\mu_{1}=1-\rho B_{S}(t)$, where $B_{S}(\cdot)$ denotes the traffic fraction in bytes (which means when setting the threshold $t$, the fraction of total bytes of flows less than the size $t$ compared with the total bytes of all traffic). we have Pollaczek-Khinchin (P-K) Mean Formula states that the mean waiting time $W$ of a M/G/1 queue follows where $\rho=\lambda/\mu=\lambda E(S)$ refers to the workload/utilization. Here a flow is treated as a job in the theoretical model, instead of diving into the microscopic view of packet streams (though it feels to deviate from the actual case). Here $E(S^{2})$ refers to the second moment of the service time distribution, not the flow size distribution. Also, here $\mu$ refers to the service rate for each flow, NOT the drain rate. Note that the $W$ here is different from the mean spent time $W^{‘}$ (waiting time + service time) in the Little’s Law. For each queue $i (i=0, 1)$ we have $\rho_{i}=\frac{\rho B_{S}^{i}}{\mu_{i}}$, where $\mu_{i}$ is the service/drain rate for queue $i$. Hence, we could obtain the mean waiting time for each queue,

We could now compute the average normalized flow completion time (by flow size):

With $F^{‘}_{S}(t)=f_{S}(t)$, we could obtain the final result, as stated in the pFabric paper:

It is clear from the result that $FCT_{n}$ depends on the threshold, flow size distribution and the workload. The paper also attaches a numerical plot visualizing the function $FCT_{n}(t)$ for various workload 10–80% and web search flow size distribution.

pFabric

What if we are agnostic of the flow size oracle?

Information-Agnostic Flow Scheduling for Commodity Data Centers eliminates the assumption on the availability of the flow size information and borrows the Multi-Level-Feedback-Queue (MLFQ) in OS to emulate SJF without the oracle. The key to the proposal is how to set the optimal demotion thresholds. The paper gives derivation to the optimal thresholds given the fixed workload distribution and load. Though Sum-of-Linear-Ratios (SoLR) problem is generally NP-hard, the paper provides a closed-form analytical solution assuming M/M/1 queue. The paper also identifies that due to the traffic variations, the demotion thresholds mismatch could lead to performance degradation severely. The follow-up work AuTO: Scaling Deep Reinforcement Learning for Datacenter-Scale Automatic Traic Optimization leverages novel Deep Reinforcement Learning to automatically set the configurations adaptive to the traffic pattern.