Optimal Allocation Strategies for the Dark Pool Problem

In this paper we consider the problem of allocating stocks to dark pools. As described by Ganchev et al. (2009), dark pools are a recent type of stock exchange designed to facilitate large transactions. A key aspect of dark pools is the censored feedback that the trader receives. At every round the trader has a certain number $V^t$ of shares to allocate amongst $K$ different dark pools. Dark pool $i$ trades as many of the allocated shares $v_i$ as it can with the available liquidity. The trader only finds out how many of these allocated shares were successfully traded at each dark pool, but not how many would have been traded if more had been allocated. It is natural to assume that the actions of the trader affect the volume available at all dark pools at later times. Similarly, it seems natural that at a given time, the liquidities available at different venues should be correlated: we would expect counterparties to distribute large trades across many dark pools, simultaneously affecting their liquidity. Furthermore, in a realistic scenario, these variables are governed not only by the trader's actions, but also by the actions of other competing traders, each trying to maximize profits. Since the gain of one trader is at the expense of another, this problem naturally lends itself to an adversarial analysis. Generalizing the setup of Ganchev et al. (2009), we assume that the sequences of volumes and available liquidity at each venue are chosen by an adversary who knows the previous allocations of our algorithm.

We propose an exponentiated gradient (henceforth EG) style algorithm that has an optimal regret guarantee against the best allocation strategy in hindsight. Our algorithm uses a parametrization that allows it to handle the problem of changing constraint sets easily. Through a standard online-to-batch conversion, this also yields a significantly better algorithm in the iid setup studied in (Ganchev et al., 2009).
However, the EG algorithm has the drawback that it recommends continuous-valued allocations. We describe how the problem of allocating an integral number of shares closely resembles a multi-armed bandit problem. As a result, we use ideas from the Exp3 algorithm for adversarial bandit problems (Auer et al., 2003) to design an algorithm that produces integer-valued allocations and enjoys a regret of order $T^{2/3}$ with high probability. While this regret bound holds in an adversarial setting, it also implies an improvement on (Ganchev et al., 2009) in an iid setting. We also study an efficient implementation of our algorithm using the idea of greedy approximations in Hilbert spaces (Jones, 1992; Barron, 1993).

In the next section we describe the problem setup in more detail and survey previous work. We describe the EG algorithm for continuous allocations and prove its regret bound and optimality in Section 3. In Section 4 we describe the algorithm for integer-valued allocations. Section 4.4 describes an efficient implementation. Finally we present experiments comparing our algorithms with that of (Ganchev et al., 2009) using the data simulator described in their paper.

We generalize the setup of (Ganchev et al., 2009). A learning algorithm receives a sequence of volumes $V^1, \dots, V^T$ where $V^t \in \{1, \dots, V\}$. It has $K$ available venues, amongst which it can allocate up to $V^t$ units at time $t$. The learner chooses an allocation $v_i^t$ for the $i$th venue at time $t$ that satisfies $\sum_{i=1}^{K} v_i^t \le V^t$. Each venue has a maximum consumption level $s_i^t$. The learner then receives the number of units $r_i^t = \min(v_i^t, s_i^t)$ consumed at venue $i$. We allow the sequence of volumes and maximum consumption levels to be chosen adversarially, i.e. $V^t$ and $s_i^t$ can depend on $\{v_i^1, \dots, v_i^{t-1}\}_{i=1}^{K}$. We measure the performance of our learner in terms of its regret

$$R = \max_{\mathrm{opt}} \sum_{t=1}^{T} \sum_{i=1}^{K} \min\left(v_i^{t,\mathrm{opt}}, s_i^t\right) - \sum_{t=1}^{T} \sum_{i=1}^{K} \min\left(v_i^t, s_i^t\right),$$

where the outer maximization is over the vector $\mathrm{opt} \in \{1, \dots, K\}^V$ and

$$v_i^{t,\mathrm{opt}} = \sum_{v=1}^{V^t} \mathbb{1}\{\mathrm{opt}_v = i\},$$

i.e., we compete against any strategy that chooses a fixed sequence of venues $\mathrm{opt}_1, \dots, \mathrm{opt}_V$ and always allocates the $v$th unit to venue $\mathrm{opt}_v$.

The work most closely related to ours is (Ganchev et al., 2009). In that paper, the authors consider the sequence of volumes $V^1, \dots, V^T$ and allocation limits $s_i^t$ to be distributed in an iid fashion. They propose an algorithm based on Kaplan-Meier estimators. Their algorithm mimics an optimal allocation strategy by estimating the tail probabilities of $s_i^t$ being larger than a given value. They show that the allocations of their algorithm are $\epsilon$-suboptimal with probability at least $1 - \epsilon$ after seeing sufficiently many samples. Theorem 1 in (Ganchev et al., 2009) shows that, if the $s_i^t$ are chosen iid, then the optimal strategy always allocates the $i$th unit to a fixed venue. This justifies our definition of regret in comparison to this class of strategies.

The ideas used in our paper draw on the rich literature on online adversarial learning. The algorithm of Section 3 is based on the classical EG algorithm (Littlestone and Warmuth, 1994). When playing integral allocations, we describe how the multi-armed bandit problem is a special case of our problem for $V = 1$. For the general case, we describe an adaptation of the Exp3 algorithm (Auer et al., 2003) for adversarial multi-armed bandits. To provide regret bounds that hold with high probability, we use a variance correction similar to the Exp3.P algorithm (Auer et al., 2003). Our lower bounds use information theoretic techniques, building on Fano's method (Yu, 1993). The efficient implementation of our algorithm relies on greedy approximation techniques in Hilbert space (Jones, 1992; Barron, 1993).
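To make the regret benchmark from the problem setup concrete, the following minimal sketch computes the censored reward of an allocation and brute-forces the best fixed unit-to-venue assignment on a tiny instance. The function names (`reward`, `fixed_allocation`, `best_fixed_reward`) and all numbers are illustrative assumptions, not taken from the paper, and the brute-force search over $K^V$ assignments is only practical for very small $K$ and $V$.

```python
from itertools import product


def reward(alloc, liquidity):
    """Total units consumed: venue i trades min(v_i, s_i)."""
    return sum(min(v, s) for v, s in zip(alloc, liquidity))


def fixed_allocation(opt, volume, num_venues):
    """Allocation of a fixed strategy: the v-th unit always goes to opt[v],
    and only the first `volume` units are allocated in a given round."""
    alloc = [0] * num_venues
    for venue in opt[:volume]:
        alloc[venue] += 1
    return alloc


def best_fixed_reward(volumes, liquidities, num_venues, max_volume):
    """Best fixed assignment in hindsight, by enumerating K**V candidates."""
    best = 0
    for opt in product(range(num_venues), repeat=max_volume):
        total = sum(
            reward(fixed_allocation(opt, vol, num_venues), liq)
            for vol, liq in zip(volumes, liquidities)
        )
        best = max(best, total)
    return best


# Hypothetical toy instance: K = 2 venues, V = 3, T = 3 rounds.
volumes = [3, 2, 3]
liquidities = [[2, 1], [1, 5], [2, 0]]
played = [[1, 2], [1, 1], [2, 1]]  # allocations some learner might have made

learner_total = sum(reward(a, s) for a, s in zip(played, liquidities))
regret = best_fixed_reward(volumes, liquidities, 2, 3) - learner_total
```

In this toy instance the learner collects 6 units while the fixed assignment that sends units 1 and 3 to the first venue and unit 2 to the second collects 7, so the regret is 1.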
Although the dark pool problem requires us to allocate an integral number of shares at every venue, we start by studying the simpler case where we can allocate any nonnegative real value to every venue, so long as the allocations satisfy $\sum_{i=1}^{K} v_i^t \le V^t$. We start by noting that the reward function $\sum_{i=1}^{K} \min(v_i^t, s_i^t)$ is concave in the allocation $v^t$. Maximization of concave functions is well understood, even in an adversarial scenario, through approaches such as online gradient ascent. We note that in this problem, the algorithm has access to a subgradient of the reward function. To see this, we define

$$g_i^t = \begin{cases} 1 & \text{if } r_i^t = v_i^t, \\ 0 & \text{otherwise.} \end{cases}$$

Then it is easy to check that $g_i^t$ can be constructed from the feedback we receive, and it lies in the subgradient set of the reward at $v^t$. Hence, we can run a standard online (sub)gradient ascent algorithm on this sequence of reward functions. However, the allocations $v_i^t$ are chosen from a different set $S_t = \{v^t : \sum_{i=1}^{K} v_i^t \le V^t\}$ at every round. Using standard online gradient ascent analysis, we can demonstrate low regret only against a comparator that lies in the intersection of all these constraint sets, $\cap_{t=1}^{T} S_t$. Such a regret guarantee can be rather meaningless if $V^t$ is extremely small at even a single round. Ideally, we would like to compete with an optimal allocation strategy as in (Ganchev et al., 2009).

A slightly different parameterization allows us to do exactly that. Let $\Delta_K$ denote the probability simplex over the $K$ venues, and let $\Delta_K^V$ denote the product of $V$ copies of it. Then we can construct an algorithm for allocations as follows: for each unit $v \in \{1, \dots, V\}$, we have a distribution $x_t^v \in \Delta_K$ over the venues $\{1, \dots, K\}$ where that unit is allocated. At time $t$, the algorithm plays $v_i^t = \sum_{v=1}^{V^t} x_{t,i}^v$. It is clear that this allocation satisfies the volume constraint. The comparator is now defined as a fixed point $u \in \Delta_K^V$. We compete with the strategy that allocates $\sum_{v=1}^{V^t} u_i^v$ units to venue $i$ at time $t$. Then the best comparator $u$ is equivalent to the best fixed allocation strategy $\mathrm{opt} \in \{1, \dots, K\}^V$.
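The per-unit parameterization and the feedback-based gradient can be sketched in a few lines. This is a minimal illustration under stated assumptions: the helper names are hypothetical, and the gradient is taken to be the indicator that everything allocated to a venue was consumed, which is one choice computable from the censored feedback alone.

```python
def allocation_from_units(x, volume):
    """x[v][i] is the probability that unit v goes to venue i.
    Summing the first `volume` rows gives the fractional allocation,
    which totals exactly `volume` because every row sums to one."""
    num_venues = len(x[0])
    return [sum(x[v][i] for v in range(volume)) for i in range(num_venues)]


def feedback_gradient(alloc, consumed):
    """1 where the whole allocation was consumed (v_i <= s_i), else 0.
    Only the observed consumption r_i = min(v_i, s_i) is needed."""
    return [1.0 if r == v else 0.0 for v, r in zip(alloc, consumed)]


# Hypothetical numbers: V = 3 units, K = 2 venues, only 2 units this round.
x = [[0.5, 0.5], [0.25, 0.75], [1.0, 0.0]]
alloc = allocation_from_units(x, volume=2)
liquidity = [0.5, 2.0]  # hidden from the learner
consumed = [min(v, s) for v, s in zip(alloc, liquidity)]
grad = feedback_gradient(alloc, consumed)
```

Here the allocation is `[0.75, 1.25]`, the first venue is liquidity-limited, and the gradient is `[0.0, 1.0]`. The volume constraint holds automatically, whatever the value of the round's volume.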
It is also clear that if we can compete with the best strategy in an adversarial setup, online-to-batch conversion techniques (see Cesa-Bianchi et al., 2001) will give a small expected error in the case where the volumes and maximum consumptions are drawn in an iid fashion. An online gradient ascent algorithm for this setup is presented in Algorithm 1.

Algorithm 1. Input: learning rate $\eta$, bound on volumes $V$.

It can be shown that the algorithm enjoys the following regret guarantee.

Theorem 1. For any choices of the volumes $V^t \in [0, V]$ and of the maximum consumption levels $s_i^t$, the regret of Algorithm 1 with $\eta = \sqrt{\ln K / ((e-2)T)}$ is $O(V\sqrt{T \ln K})$.

Proof. We follow the proof of Theorem 11.3 from Cesa-Bianchi and Lugosi (2006), applied to the per-unit distributions $x_t^v$. We note that the gradient is zero for $v > V^t$, so we can sum over $v$ from $1$ to $V$ rather than $V^t$. We then bound the regret using the concavity of the reward; some rewriting and simplification, using the definition of KL-divergence and the fact that the telescoping terms cancel out, gives a bound in terms of the KL divergences between the comparator and the initial distributions $x_1^v$. Each of these KL divergence terms is equal to $\ln K$. This is because the optimal comparator will have a 1 for exactly one venue for each unit $v$. As we choose $x_1^v$ to be uniform over all venues, we get the KL divergence between a vertex of the $K$-simplex and the uniform distribution, which is $\ln K$. The stated regret bound then follows from setting $\eta = \sqrt{\ln K / ((e-2)T)}$.

We will now show that the online exponentiated gradient ascent algorithm in Algorithm 1 has the best regret guarantee possible. We start by noting that a regret bound of $O(\sqrt{T \ln K})$ is known to be optimal for the experts prediction problem (Haussler et al., 1998; Abernethy et al., 2009). Hence we can show the optimality of our algorithm for $V = 1$ by reducing the experts prediction problem to the dark pools problem. Recall that in the experts prediction problem, the algorithm picks an expert from $1, \dots, K$ according to a probability distribution $p_t$ at round $t$. Then it receives a vector of rewards $\rho_t$ with $\rho_{t,i} \in [0, 1]$, $i = 1, \dots, K$. In order to describe a reduction, we need to map the allocations of an algorithm for the dark pools problem to the probabilities for experts, and map the rewards of experts to the liquidities at each venue. We consider a special setting where $V^t = 1$ at all times. Since $V^t = 1$, the allocations of any dark pools algorithm are probabilities: they are non-negative and add to 1. Hence we set $p_{t,i} = v_i^t$. We also set the liquidity $s_i^t = \rho_{t,i}\, p_{t,i}$. Then the net reward of a dark pools algorithm at round $t$ is

$$\sum_{i=1}^{K} \min(v_i^t, s_i^t) = \sum_{i=1}^{K} \min(p_{t,i}, \rho_{t,i}\, p_{t,i}) = \sum_{i=1}^{K} \rho_{t,i}\, p_{t,i},$$

where the last line follows from the observation that $0 \le \rho_{t,i} \le 1$. Hence the net reward in the dark pools problem is the same as the expected reward in the experts prediction problem. Using the known lower bounds on the optimal regret in experts prediction problems, we get that the regret of any algorithm for the dark pools problem with $V = 1$ is $\Omega(\sqrt{T \ln K})$.

We also note that the regret in the experts prediction problem scales linearly with the scaling of the rewards. Hence, if the rewards take values in $[0, V]$, then the regret of any algorithm is guaranteed to be $\Omega(V\sqrt{T \ln K})$. For arbitrary $V$, we again consider the special setting with $V^t$ identically equal to $V$. We would now like to reduce the experts prediction problem where every expert's reward is a value in $[0, V]$. At every round, we receive a vector of allocations $v_i^t$. We set $p_{t,i} = v_i^t / V$. We receive the rewards $\rho_{t,i}$ from the experts problem, and assign the liquidities $s_i^t = \rho_{t,i}\, p_{t,i}$, so that

$$\sum_{i=1}^{K} \min(v_i^t, s_i^t) = V \sum_{i=1}^{K} \min\left(p_{t,i}, \frac{\rho_{t,i}\, p_{t,i}}{V}\right) = \sum_{i=1}^{K} \rho_{t,i}\, p_{t,i}.$$

The last step relies on observing that $\rho_{t,i} \le V$ so that $\rho_{t,i}\, p_{t,i} / V \le p_{t,i}$. Now we can argue that the regrets of the two problems are identical as before. Hence the optimal regret on the dark pools problem is at least $\Omega(V\sqrt{T \ln K})$. As Algorithm 1 gets the same bound up to constant factors in a harder adversarial setting than used in the lower bounds, we conclude that it attains the minimax optimal regret up to constant factors.
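The body of Algorithm 1 is not reproduced in the text above, so the following is only a sketch of what an exponentiated gradient ascent step on the per-unit distributions could look like, assuming the standard multiplicative update $x_{t+1,i}^v \propto x_{t,i}^v \exp(\eta\, g_i^t)$ for the units $v \le V^t$ and the indicator gradient built from the censored feedback. Function names and the toy liquidity pattern are hypothetical.

```python
import math


def eg_update(x_v, grad, eta):
    """One multiplicative-weights step on a single unit's distribution."""
    weights = [p * math.exp(eta * g) for p, g in zip(x_v, grad)]
    total = sum(weights)
    return [w / total for w in weights]


def run_round(x, volume, liquidity, eta):
    """Play one round, observe censored feedback, update the active units."""
    num_venues = len(liquidity)
    alloc = [sum(x[v][i] for v in range(volume)) for i in range(num_venues)]
    consumed = [min(a, s) for a, s in zip(alloc, liquidity)]
    grad = [1.0 if r == a else 0.0 for a, r in zip(alloc, consumed)]
    for v in range(volume):  # units beyond `volume` see a zero gradient
        x[v] = eg_update(x[v], grad, eta)
    return sum(consumed)


K, V, T = 3, 2, 200
eta = math.sqrt(math.log(K) / ((math.e - 2) * T))  # learning rate of Theorem 1
x = [[1.0 / K] * K for _ in range(V)]  # uniform initial distributions
total = 0.0
for _ in range(T):
    # Hypothetical stationary liquidity: only the third venue can trade.
    total += run_round(x, volume=V, liquidity=[0, 0, 5], eta=eta)
```

With this liquidity pattern both unit distributions concentrate on the third venue, since it is the only one whose allocation is ever fully consumed.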
While the above algorithm is simple and optimal in theory, it is a bit unrealistic as it can recommend that we allocate 1.5 units to a venue, for example. One might choose to naively round the recommendations of the algorithm, but such a rounding would incur an additional approximation error which in general could be as large as $O(T)$. In this section we describe a low-regret algorithm that allocates an integral number of units to each venue.

To get some intuition about an algorithm for this scenario, consider the case when $V = 1$. Then the algorithm has to allocate 1 unit to a venue at every round. It receives feedback about the maximum allocation level $s_i^t$ only at the venue where $v_i^t = 1$. This is clearly a reformulation of the classical $K$-armed bandit problem. An adaptation of Algorithm 1 that uses the Exp3 algorithm (Auer et al., 2003) would hence attain a regret bound of $O(\sqrt{TK \ln K})$ for $V = 1$. Contrasting this with the bound of Theorem 1 for $V = 1$, we can easily see that the regret for playing integral allocations can be higher than that of continuous allocations by a factor of up to $\sqrt{K}$. Indeed we will now show a modification of the Exp3 approach that works for arbitrary values of $V$. We will also show a lower bound. The upper bound shows that our algorithm incurs $O(T^{2/3})$ regret in expectation, which does not match the $\Omega(\sqrt{T})$ lower bound. However, it is still a significant improvement on Ganchev et al. (2009), as we will discuss later.

We need some new notation before describing the algorithm. For a fractional allocation $v_i^t$, we let $f_i^t = \lfloor v_i^t \rfloor$ and $d_i^t = v_i^t - f_i^t$. Now suppose we have a strategy that wants to allocate $v_i^t$ units to venue $i$ at time $t$. Suppose that we instead allocate $u_i^t = f_i^t$ units with probability $1 - d_i^t$ and $u_i^t = f_i^t + 1$ units with probability $d_i^t$. Using the fact that the maximum consumption limits are integral too,

$$E\left[\min(u_i^t, s_i^t)\right] = (1 - d_i^t)\min(f_i^t, s_i^t) + d_i^t \min(f_i^t + 1, s_i^t) = \min(v_i^t, s_i^t).$$

Thus, playing an integral allocation $u_i^t$ according to such a scheme would be unbiased in expectation.
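The unbiasedness of this randomized rounding is easy to check numerically. The sketch below (function names are illustrative) computes the exact expected censored reward of the rounded allocation and shows that it matches the reward of the fractional allocation whenever the consumption limit is an integer, and that the match can fail otherwise.

```python
import math
import random


def round_randomly(v, rng=random):
    """Return floor(v) + 1 with probability frac(v), else floor(v)."""
    f = math.floor(v)
    d = v - f
    return f + 1 if rng.random() < d else f


def expected_rounded_reward(v, s):
    """Exact expectation of min(u, s) under the rounding scheme above."""
    f = math.floor(v)
    d = v - f
    return (1 - d) * min(f, s) + d * min(f + 1, s)


# Integer limits: the expectation equals min(v, s).
same_low = expected_rounded_reward(1.5, 1)    # limit below the allocation
same_high = expected_rounded_reward(2.25, 3)  # limit above the allocation
# Non-integer limit (hypothetical): the rounding is no longer unbiased.
biased = expected_rounded_reward(1.5, 1.2)
```

For a limit strictly between $f_i^t$ and $f_i^t + 1$ the rounded allocation loses reward on average (about 1.1 against 1.2 in the last example), which is why the integrality of the consumption limits matters in the argument.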
Of course we need to ensure that we don't violate the constraint $\sum_{i=1}^{K} u_i^t \le V^t$ in this process. To do so, we let $m = \sum_{i=1}^{K} d_i^t$. Then we will use a distribution over subsets of $\{1, \dots, K\}$ of size $m$ that has the property that the $i$th element gets sampled with probability $d_i^t$. It is clear that if there is such a distribution, then we will have the unbiasedness needed above. It will also ensure feasibility of $u_i^t$ if $v_i^t$ was a feasible allocation. Our next result shows that such a distribution always exists.

Theorem 2. Let $0 \le d_i^t \le 1$ for $i = 1, \dots, K$ with $\sum_{i=1}^{K} d_i^t = m$ for an integer $m$. Then there is always a distribution over subsets of $\{1, \dots, K\}$ of size $m$ such that the $i$th element is sampled with probability $d_i^t$.

Proof. The proof is by induction on $K$. For the case $K = 2$, $m = 1$, we sample the first element with probability $d_1^t$. If it is not picked, we pick element 2. It is clear that the marginals are correct, establishing the base case. Let us assume the claim holds up to $K - 1$ for all $m \le K - 1$. Consider the inductive step for some $K$, $m$. We are given a set of marginals $0 \le d_i^t \le 1$ with $\sum_{i=1}^{K} d_i^t = m$. We would like a distribution $p$ on subsets of size $m$ of $\{1, \dots, K\}$ that matches these marginals. We partition these subsets into two groups: those that do and those that do not contain the first element. We correspondingly partition $p = (p_1, p_2)$. Let $N_1 = \binom{K-1}{m-1}$ and $N_2 = \binom{K-1}{m}$ be the number of subsets in the two cases. Then we want the total mass of $p_1$ to equal $d_1^t$ in order to get the right marginal at element 1. Hence, we can write $p_1 = d_1^t q_1$, $p_2 = (1 - d_1^t) q_2$ for some distributions $q_1$ and $q_2$ on $N_1$ and $N_2$ subsets respectively. We then split the remaining marginals $d_2^t, \dots, d_K^t$ into marginals on subsets of size $m - 1$ and $m$ respectively of $\{1, \dots, K - 1\}$, each lying in $[0, 1]$. Hence there exist distributions $q_1$ and $q_2$ that attain these marginals using the inductive hypothesis. We set $p_1 = d_1^t q_1$, $p_2 = (1 - d_1^t) q_2$. Then Equations 2 and 3 together imply that we get the correct marginals for every element.

For any allocation sequence $v^t$, let $p(d^t)$ be the probability distribution over subsets of $\{1, \dots, K\}$ guaranteed by Theorem 2. For some constant $\gamma \in (0, 1]$, let $\tilde d_{t,i} = (1 - \gamma) d_i^t + \gamma m / K$. Then let $p(\tilde d_t)$ be a distribution over subsets that samples the $i$th venue with probability $\tilde d_{t,i}$. We can construct this by mixing $p(d^t)$, which exists by Theorem 2, with the uniform distribution over subsets of size $m$. Also, we let $\tilde V_{t,i} \le V^t$ be the largest index $v_0$ such that $\sum_{v=1}^{v_0} x_{t,i}^v \le f_i^t$. Based on these quantities we define a gradient estimator $\tilde g_t$ (Equation 4). To see why this gradient estimator is good, we compare it with the gradient of the objective function at $v_i^t$, which leads to the following useful lemma.

Lemma 1. If an algorithm plays $u_i^t = f_i^t + 1$ with probability $\tilde d_{t,i}$ and $u_i^t = f_i^t$ otherwise, then $\tilde g_t$ as described in Equation (4) is an unbiased estimator of the gradient at $(v_1^t, \dots, v_K^t)$.

An algorithm for playing integer-valued allocations at every round is shown in Algorithm 2.

Algorithm 2. An algorithm for playing integer-valued allocations to the dark pools. Input: learning rate $\eta$, threshold $\gamma$, bound on volumes $V$.

We can also demonstrate a guarantee on the expected regret of this algorithm.

Theorem 3. For suitable choices of $\eta$ and $\gamma$, Algorithm 2 has expected regret growing as $O(T^{2/3})$ in the number of rounds, with factors depending on $K$ and $V$, where $V$ is the bound on volumes $V^t$, and the volumes and maximum consumption levels $s_i^t$ are chosen by an oblivious adversary.

An oblivious adversary is one that chooses $V^t$ and $s_i^t$ without seeing the algorithm's (random) allocations $u_i^t$. We note that the requirement that the adversary is oblivious can be removed by proving a high probability bound. We will describe a slight modification of Algorithm 2 that enjoys such a guarantee.

Proof. Since the adversary is oblivious, we can fix a comparator $u \in \Delta_K^V$ ahead of time. For the remainder, we let $E_t$ denote conditional expectation at time $t$ conditioned on the past moves of the algorithm and the adversary. We split the expected regret into the regret of the continuous-valued allocations and a correction term, using the fact that $u_i^t$ would be unbiased for $v_i^t$ without the $\gamma m / K$ adjustment.
However, this adjustment costs us at most $\gamma \sum_{t=1}^{T} m_t \le \gamma T K$ in terms of expected regret over $T$ rounds. For the first term, it is as if we had played the continuous-valued allocation $v_i^t$ itself. Again using the concavity of our reward function, this term is bounded by a term linear in the gradient $g_t$, in which $g_t$ can be replaced by $\tilde g_t$ in expectation: $\tilde g_t$ is an unbiased estimator of $g_t$ by construction, just like in Exp3 (Auer et al., 2003). Now we note that the algorithm is doing exponentiated gradient ascent on the sequence $\tilde g_t$. Hence, we can proceed as in the proof of Theorem 1, with $x_t^v$ as before. Assuming a choice of $\eta$ such that $\eta\, \tilde g_{t,i}^v \le 1$, we note again that $\nu_i^v \le 1$, so we can use the quadratic bound on the exponential again and simplify as before. Swapping the sums over $v$ and $i$ leaves two gradient terms, which we look at separately. For the first, we use the fact that $\tilde d_{t,i} \ge \gamma / K$, as $m \ge 1$, and that indicator variables are bounded by 1. The second gradient term is examined in the same way, and substituting both terms in the bound gives the result.

We note that the term responsible for the $O(T^{2/3})$ regret is the one that arises when the liquidity $s_i^t$ equals $f_i^t$. While we assume that this can accumulate at every round in the worst case, it seems unlikely that the liquidity $s_i^t$ will be equal to $f_i^t$ very frequently. In particular, if the $s_i^t$'s are generated by a stochastic process, one can control this probability using the distribution of $s_i^t$ and obtain improved regret bounds.

We would like to show that the analysis of the previous section holds not just in expectation but also with high probability. This has two advantages. First, it tells us that on most random choices made by our algorithm, it has a low regret. Further, the high probability guarantee can be easily combined with a union bound to give a regret bound for non-oblivious (adaptive) adversaries as well. High probability bounds in bandit problems are often tricky because even though the gradient estimator is unbiased, its variance is typically large.
Hence, using standard martingale concentration on the estimator directly gives a worse $O(T^{3/4})$ regret bound. To demonstrate a high probability guarantee of $O(T^{2/3})$, we need to make a variance correction to our estimator $\tilde g_t$; we denote the corrected estimator by $\hat g_t$ (Equation 5). The high probability analysis makes repeated use of the classical Hoeffding-Azuma inequality as well as a version of Freedman's inequality from Bartlett et al. (2008), which we state for completeness.

Lemma 2 (Hoeffding-Azuma inequality). Let $X_1, \dots, X_T$ be a martingale difference sequence. Suppose that $|X_t| \le c$ almost surely for all $t \in \{1, \dots, T\}$. Then for all $\delta > 0$,

$$P\left(\sum_{t=1}^{T} X_t > \sqrt{2 T c^2 \ln(1/\delta)}\right) \le \delta.$$

Lemma 3 (Bartlett et al. (2008)). Let $X_1, \dots, X_T$ be a martingale difference sequence with $|X_t| \le b$. Let $V = \sum_{t=1}^{T} \mathrm{Var}_t X_t$ be the sum of conditional variances of the $X_t$'s and $\sigma = \sqrt{V}$. Then, for any $\delta \le 1/e$ and $T \ge 4$, the sum $\sum_{t=1}^{T} X_t$ is bounded with high probability in terms of $\sigma$, $b$ and $\ln(1/\delta)$.

We will now prove a series of concentration results which will immediately give the desired regret bound when put together. The steps in our analysis closely resemble the technique of Abernethy and Rakhlin (2009). The first concentration lemma (Lemma 4) shows that the regret of the integral allocations is close to that of their continuous-valued counterparts. Its proof applies Lemma 2 to a martingale difference sequence that is bounded by construction, and the statement then follows from the resulting inequality and a union bound over all $K$ venues.

The next step is to control the terms in which the gradient is replaced by its estimate. We proceed indirectly by first bounding the conditional variances (Lemma 5). We then combine this with Freedman's inequality to bound $(g_t^v - \hat g_t^v) \cdot (u^v - x_t^v)$ (Lemma 6). The proof of Lemma 6 defines a martingale whose increments are bounded by Hölder's inequality; applying the Hoeffding-Azuma inequality gives the result. Finally, we also need to show that the size of the gradient estimator, which is controlled in expectation, is also bounded with high probability (Lemma 7).
For the proof of Lemma 7, we define a martingale $X_t$. Using the bound on $\tilde g_t$ and the bound on the expectation from the proof of Theorem 3, we have $X_t \le 2\frac{K^2}{\gamma^2} V$. Application of the Hoeffding-Azuma inequality gives the result.

We are now in a position to prove a high probability bound on the regret of Algorithm 2 when run with the gradient estimator $\hat g_t$ instead of $\tilde g_t$.

Theorem 4. With probability at least $1 - \frac{1}{T}$, the regret of Algorithm 2 using the gradient estimator $\hat g_t$ against oblivious adversaries is $O\left(V (TK)^{2/3}\right)$.

The proof essentially involves putting the lemmas together, along with the full information analysis of the quantity $(u^v - x_t^v) \cdot \hat g_t^v$.

Proof. We use Lemma 4, which holds with probability at least $1 - \delta/3$, and then invoke Lemma 6, so that both hold with probability at least $1 - 2\delta/3$. Once again we note that we are doing exponentiated gradient ascent on $\hat g_t$, so that the bound from the proof of Theorem 1 applies. Using Lemma 7 and setting $\delta = \frac{1}{T}$ gives the statement of the theorem on optimizing for $\gamma$ and $\eta$.

Note that our regret analysis so far has been against a fixed comparator. When the adversary adapts to the player's sequence, the comparator is random as well and depends on the player's moves. However, the comparator consists of delta vectors for every unit $v$. Hence, there are a total of $K^V$ possible comparators. We can therefore take a union bound over all the comparators as well, and this increases our regret bound by a factor of at most $V \ln K$. This gives us the following corollary.

Corollary 1. With probability at least $1 - \frac{1}{T}$, the regret of Algorithm 2 against adaptive adversaries exceeds the bound of Theorem 4 by a factor of at most $V \ln K$.

Comparison with results of Ganchev et al. (2009): We note that although our results are in the adversarial setup, the same results also apply to iid problems. In particular, using online-to-batch conversion techniques (Cesa-Bianchi et al., 2001), we can show that, after $T$ rounds, with high probability the allocations of our algorithm on each round are within $O(V^2 T^{-1/3} K^{2/3})$ of the optimal allocation. This is a significant improvement on the result of Ganchev et al. (2009): it is straightforward to check that the proof they provide gives a corresponding upper bound no better than $O(T^{-1/4})$. As we shall see, the generalization to adversarial setups leads to improved performance in simulations.

As mentioned in the previous section, the problem of $K$-armed bandits is a special case of the dark pools problem with integral allocations. Hence, we would like to leverage the proof techniques from existing lower bounds on the optimal regret in the $K$-armed bandit problem. As before we consider a special case with $V^t = V$ at every round. Following Auer et al. (2003), we construct $K$ different distributions for generating the liquidities $s_i^t$. At each round, the $i$th distribution samples $s_i^t = V$ with probability $\frac{1}{2} + \epsilon$ and $s_j^t = V$ with probability $\frac{1}{2}$ for $j \ne i$. We now mimic the proof of Theorem 5.1 in Auer et al. (2003). We start with a lemma analogous to Lemma A.1 of Auer et al. (2003). Let $V_i = \sum_t v_i^t$. Let $E_i$ and $E_{\mathrm{unif}}$ denote expectations with respect to the $i$th distribution and the uniform reward distribution respectively.

Lemma 8. Let $f$ be a function of the reward sequence $r$ taking values in $[0, M]$. Then $E_i[f(r)]$ exceeds $E_{\mathrm{unif}}[f(r)]$ by at most a term that scales with $M$ and is controlled as in Lemma A.1 of Auer et al. (2003).

Proof. The bound on the difference of the two expectations follows from Hölder's inequality and Pinsker's inequality, after which we can proceed as in the proof of Auer et al. (2003).

Using this lemma, we can prove a lower bound on the regret of any algorithm that plays integer-valued allocations.

Theorem 5. Any algorithm that plays integer-valued allocations has expected regret that is $\Omega\left(\sqrt{TV(K + V \ln K)}\right)$.

Proof. We consider the net reward of the algorithm when distribution $i$ is picked. As in the proof of Theorem 5.1 of Auer et al. (2003), we now apply Lemma 8 to the function $V_i$ of the reward sequence, noting that $V_i \in [0, TV]$.
Applying Jensen's inequality to the second term and averaging over the choice of the index $i$, which was chosen uniformly at random, gives an upper bound on the expected reward. Noting again that the reward of the optimal comparator is still $\left(\frac{1}{2} + \epsilon\right) TV$, we obtain a bound on the expected regret. Setting $\epsilon$ optimally to $c\sqrt{K/(TV)}$ gives an $\Omega(\sqrt{TVK})$ lower bound. We also note that the lower bound of $\Omega(V\sqrt{T \ln K})$ shown for continuous-valued allocations applies to the integer-valued case as well. Combining the two, we get that the regret is $\Omega\left(\sqrt{TV(K + V \ln K)}\right)$. There is a gap between our lower and upper bounds in this case. We do not know which bound is loose.

All that remains to specify in Algorithm 2 is the construction of the distribution $p$ over subsets at every round. Since we don't know what the distribution is, it would seem that we cannot sample from it easily. If $K$ is small, one can use non-negative least squares to find a distribution that has the given marginals. However, once the number of venues $K$ is large, $p$ is a distribution over $\binom{K}{m}$ subsets, for which the least squares solver might be too slow. One way around this is to use the idea of greedy approximations in Hilbert spaces from the classic paper of Jones (1992). We can greedily construct a distribution on subsets which approximately matches the marginals on every element in an efficient manner. Exact sampling from the distribution without ever constructing it explicitly is also possible. The explicit algorithms giving the implementations can be found in the full version of the paper.

We compared four methods experimentally. We refer to Algorithms 1 and 2 as ExpGrad and Exp3 respectively. We also run the Optimistic Kaplan-Meier estimator based algorithm of (Ganchev et al., 2009), which is called OptKM. Finally, we implemented the parametric maximum likelihood estimation-allocation based algorithm described in (Ganchev et al., 2009) as well, which we call ParML.
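Returning to the subset-sampling step discussed above: the paper defers its own sampling algorithms to the full version, so the sketch below uses a different, well-known technique, systematic sampling, purely as an illustration that a size-$m$ subset with prescribed marginals can be drawn without building the distribution over all subsets. The function name and the example marginals are assumptions, and the marginals are chosen to be exactly representable in floating point to avoid rounding issues at interval boundaries.

```python
import math
import random


def systematic_subset(marginals, u=None):
    """Draw a subset in which element i appears with probability marginals[i].

    Assumes every marginal lies in [0, 1] and that they sum to an integer m.
    Lay the marginals end to end on [0, m) and pick the elements whose
    intervals contain one of the points u, u + 1, ..., u + m - 1.
    """
    if u is None:
        u = random.random()  # single uniform draw in [0, 1)
    chosen = []
    low = 0.0
    for i, d in enumerate(marginals):
        high = low + d
        # Number of integers k with low <= u + k < high (0 or 1 since d <= 1).
        if math.ceil(high - u) > math.ceil(low - u):
            chosen.append(i)
        low = high
    return chosen


# Hypothetical marginals summing to m = 2.
marginals = [0.5, 0.25, 0.75, 0.5]
grid = [(j + 0.5) / 1000 for j in range(1000)]  # deterministic stand-in for u
counts = [0] * len(marginals)
sizes = set()
for u in grid:
    subset = systematic_subset(marginals, u)
    sizes.add(len(subset))
    for i in subset:
        counts[i] += 1
frequencies = [c / len(grid) for c in counts]
```

Every drawn subset has exactly two elements, and the empirical inclusion frequencies over the grid of offsets match the prescribed marginals. Systematic sampling produces only one particular distribution with these marginals; it is not claimed to coincide with the one constructed in the proof of Theorem 2.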
As we did not have access to real dark pool data, we decided to implement a data simulator similar to that of (Ganchev et al., 2009). We used a combination of a Zero Bin parameter and a power law distribution to generate the $s_i^t$'s, while the sequence $V^t$ was kept fixed. Parameters for the Zero Bin and power law were set to lie in the same regimes as the ones observed in the real data of (Ganchev et al., 2009).

We started by generating the data from the parametric model of (Ganchev et al., 2009). We used 48 venues and $T = 2000$ to match the experiments of (Ganchev et al., 2009). The values of the $s_i^t$'s were sampled iid from Zero Bin + power law distributions with appropriately chosen parameters. A plot of the resulting cumulative rewards averaged over 100 trial runs can be seen in Figure 1. We see that ParML has a slightly superior performance on this data, understandably, as the data is being generated from the specific parametric model that the algorithm is designed for. However, ExpGrad gets net allocations quite close to ParML. Furthermore, both Exp3 and ExpGrad are far superior to OptKM, which is our true competitor in some sense, being a non-parametric approach just like ours.

Next, we study the performance of all four algorithms under a variety of adversarial scenarios. We start with a simple setup of two venues. The parameters of the power law initially favor Venue 1 for 12500 rounds, and then we switch the power law parameters to favor Venue 2. We study both the cumulative rewards and the allocations to both venues for each algorithm. Clearly an algorithm will be more robust to adversarial perturbations if it can detect this change quickly and switch its allocations accordingly. We show the results of this experiment in Figure 2. Because there are just 2 venues, rounding has a rather negligible effect in this case and both our methods have an almost identical performance.
Our algorithms ExpGrad and Exp3 switch much faster to the new optimal venue when the distributions switch. Consequently, the cumulative reward of both our algorithms also turns out significantly higher, as shown in Figure 2(b). We wanted to investigate how this behavior changes when the switching involves a larger number of venues. We created another experiment where there are 5 venues and maximum volume $V = 200$. Venues 1 and
