270x Filetype PDF File size 1.06 MB Source: icml.cc
Minimal Loss Hashing for Compact Binary Codes
Mohammad Norouzi norouzi@cs.toronto.edu
David J. Fleet fleet@cs.toronto.edu
Department of Computer Science, University of Toronto, Canada
Abstract high-dimensional inputs, x ∈ Rp, onto binary codes,
h ∈ H ≡ {0,1}q, which preserves some notion of
Wepropose a method for learning similarity- similarity. The canonical approach assumes centered
preserving hash functions that map high- (mean-subtracted) inputs, linear projection, and bi-
dimensional data onto binary codes. The nary quantization. Such hash functions, parameter-
formulation is based on structured predic- ized by W ∈Rq×p, are given by
tion with latent variables and a hinge-like
loss function. It is efficient to train for large b(x;w) = thr(Wx), (1)
datasets, scales well to large code lengths, where w ≡ vec(W), and the ith bit of the vector
and outperforms state-of-the-art methods. thr(Wx) is 1 iff the ith element of (Wx) is positive. In
th th
other words, the i row of W determines the i bit
1. Introduction of the hash function in terms of a hyperplane in the
input space; 0 is assigned to points on one side of the
hyperplane, and 1 to points on the other side.1
Approximate nearest neighbor (ANN) search in large
datasets is used widely. In computer vision, for exam- Random projections are used in LSH (Indyk & Mot-
ple, it has been used for content-based retrieval (J´egou wani, 1998; Charikar, 2002) and related methods (Ra-
et al., 2008), object recognition (Lowe, 2004), and hu- ginsky & Lazebnik, 2009). They are dataset indepen-
man pose estimation (Shakhnarovich et al., 2003). dent, make no prior assumption about the data dis-
Acommonapproach, particularly well suited to high- tribution, and come with theoretical guarantees that
dimensional data, is to use similarity-preserving hash specificmetrics(e.g., cosine similarity) are increasingly
functions, where similar inputs are mapped to nearby well preserved in Hamming space as the code length
binary codes. One can preserve Euclidean distances, increases. But they require large code lengths for good
e.g., with Locality-Sensitive Hashing (LSH) (Indyk & retrieval accuracy, and they are not applicable to gen-
Motwani, 1998), or one might want to preserve the eral similarity measures, like human ratings.
similarity of discrete class labels, or real-valued pa- Recent work has focused on learning compact
rameters associated with training exemplars. codes. Two early approaches showed how one might
CompactbinarycodesareparticularlyusefulforANN. preserve semantically meaningful data abstractions
If the nearest neighbors of a point are within a small (Shakhnarovich et al., 2003; Salakhutdinov & Hinton,
hypercube in the Hamming space, then ANN search 2009). Multilayer neural networks worked well for
can be performed in sublinear time, treating binary document retrieval (Salakhutdinov & Hinton, 2009)
codes as hash keys. Even for an exhaustive, linear and for large image corpora (Torralba et al., 2008).
scan through the database, binary codes enable very However, these methods typically require large train-
fast search. Compact binary codes also allow one to ing sets, long learning times, relatively slow indexing,
store large databases in memory. and they exhibit poorer performance than more recent
methods (Kulis & Darrell, 2009).
We formulate the learning of compact binary codes
in terms of structured prediction with latent variables Continuous relaxations have been used to avoid opti-
using a new class of loss functions designed for hash- mizationwithdiscontinuousmappingstobinarycodes.
ing. The task is to find a hash function that maps Spectral Hashing (SH) aims to preserve Euclidean dis-
tance with an eigenvector formulation (Weiss et al.,
Appearing in Proceedings of the 28th International Con-
1One can add an offset from the origin, but we find the
ference on Machine Learning, Bellevue, WA, USA, 2011. gain is marginal. Nonlinear projections are also possible,
Copyright 2011 by the author(s)/owner(s). but in this paper we concentrate on linear projections.
Minimal Loss Hashing for Compact Binary Codes
2008). The resulting projection directions can be in- Weassume mappings from Rp to H given by (1). The
terpreted in terms of the principal directions of the quality of a mapping is determined by a loss func-
data. Empirically, SH works well for relatively small tion L : H × H × {0,1} → R that assigns a cost to
codes, but often underperforms LSH for longer code a pair of binary codes and a similarity label. For bi-
lengths. Wang et al. (2010) extended SH to incor- nary codes h,g ∈ H, and a label s ∈ {0,1}, the loss
porate discrete supervision where sets of similar and function L(h,g,s) measures how compatible h and g
dissimilar training points are available. Like Wang et are with s. For example, when s = 1, the loss assigns
al., Lin et al. (2010) propose an algorithm that learns a small cost if h and g are nearby codes, and large
binay hash functions, one bit at a time, based on some cost otherwise. Ultimately, to learn w, we minimize
similarity labels. In contrast, our method is not se- (regularized) empirical loss over training pairs:
quential; it optimizes all the code bits simultanously. L(w) = X L(b(x ;w), b(x ;w), s ). (3)
i j ij
Binaryreconstructiveembedding(BRE)(Kulis&Dar- (i,j)∈S
rell, 2009) uses a loss function that penalizes the dif-
ference between Euclidean distance in the input space The loss function we advocate is specific to learn-
and the Hamming distance between binary codes: ing binary hash functions, and bears some similar-
2 ity to the hinge loss used in SVMs. It includes a
ℓ (m ,d ) = 1m −1d . (2) hyper-parameter ρ, which is a threshold in the Ham-
bre ij ij q ij 2 ij ming space that differentiates neighbors from non-
neighbors. This is important for learning hash keys,
Here, dij is the Euclidean distance between two in- since we will want similar training points to map to
puts of unit length, and m is the Hamming distance
ij binary codes that differ by no more that ρ bits. Non-
between their corresponding q-bit binary codes. The neighbors should map to codes no closer than ρ bits.
hash function is found by minimizing empirical loss,
Let kh−gk denote the Hamming distance between
i.e., the sum of the pairwise loss, ℓbre, over training H
pairs. Experiments on several datasets demonstrated codes h,g ∈ H. Our hinge-like loss function, denoted
ℓ , depends on kh−gk and not on the individual
improved precision of BRE over SH, and LSH. One ρ H
codes, i.e., L(h,g,s) = ℓ (kh−gk ,s):
limitation of BRE is the high storage cost required for ρ H
training, making large datasets impractical. (
ℓ (m,s) = max(m−ρ+1,0) for s=1 (4)
In this paper we provide a new formulation for learn- ρ λmax(ρ−m+1,0) for s=0
ing binary hash functions, based on structural SVMs
where m ≡ kh−gk , and λ is a loss hyper-parameter
with latent variables (Yu & Joachims, 2009) and an H
effective online learning algorithm. We also introduce that controls the ratio of the slopes of the penalties
a type of loss function specifically designed for hash- incurred for similar (or dissimilar) points when they
ing, taking Hammingdistance and binary quantization are too far apart (or too close). Linear penalties are
into account. The resulting approach is shown to sig- useful as they are robust to outliers (e.g., in contrast to
nificantly outperform state-of-the-art methods. thequadraticpenaltyinℓbre). Further, notethatwhen
similar points are sufficiently close, or dissimilar points
2. Formulation are distant, our loss does not impose any penalty.
Our goal is to learn a binary hash function that pre- 3. Bound on Empirical Loss
serves similarity between pairs of training exemplars.
Let D comprise N centered, p-dimensional training Theempirical loss in (3) is discontinuous and typically
N non-convex, making optimization difficult. As a conse-
points {xi} , and let S be the set of pairs for which
i=1 quence, rather than directly minimizing empirical loss,
similarity labels exist. The binary similarity labels are
given by {s } , where x and x are similar when weinsteadformulate, andminimize, a piecewise linear,
ij (i,j)∈S i j upper bound on empirical loss. Our bound is inspired
s =1, and dissimilar when s =0. To preserve a
ij ij by a bound used, for similar reasons, in structural
specific metric (e.g., Euclidean distance) one can use
binary similarity labels obtained by thresholding pair- SVMs with latent variables (Yu & Joachims, 2009).
wise distances. Alternatively, one can use weakly su- We first re-express the hash function in (1) as a form
pervised data for which each training point is associ- of structured prediction:
ated with a set of neighbors (similar exemplars), and T
non-neighbors (dissimilar exemplars). This is useful b(x;w) = argmax h Wx (5)
for preserving similarity based on semantic content for h∈H
= argmax wTψ(x,h) , (6)
example. h∈H
Minimal Loss Hashing for Compact Binary Codes
T T
where ψ(x,h) ≡ vec(hx ). Here, w ψ(x,h) acts as Our upper bound on the loss function L, given a pair
a scoring function that determines the relevance of of inputs, x and x , a supervisory label s , and the
i j ij
input-code pairs, based on a weighted sum of features parameters of the hash function w, has the form
in the joint feature vector ψ(x,h). Other forms of
L(b(x ;w), b(x ;w), s )
ψ(.,.) are possible, leading to other hash functions. i j ij
≤ max L(g ,g ,s )+gTWx +gTWx
Tomotivate our upper bound on empirical loss, we be- g ,g ∈H i j ij i i j j
i j
gin with a short review of the bound commonly used T T
−max h Wxi −max h Wxj . (10)
for structural SVMs (Taskar et al., 2003; Tsochan- h ∈H i h ∈H j
i j
taridis et al., 2004).
The proof for (10) is similar to that for (8) above. It
3.1. Structural SVM follows from (5) that the second and third terms on
the RHS of (10) are maximized by h =b(x ;w) and
i i
In structural SVMs (SSVM), given input-output train- h =b(x ;w). If the first term were maximized by
j j
∗ N
ing pairs {(xi,y )} , one aims to learn a mapping g =b(x ;w) and g = b(x ;w), then the inequality
i i=1 i i j j
from inputs to outputs in terms of a parameterized in (10) becomes an equality. For all other values of
scoring function f(x,y;w): g and g that maximize the first term, the RHS can
i j
b only increase, hence the inequality. The bound holds
y = argmax f(x,y;w). (7) for ℓ , ℓ , and any similar loss function, with binary
y ρ bre
labels s or real-valued labels d .
ij ij
Givenalossfunctionontheoutputdomain, L(·,·), the Weformulatetheoptimizationfortheweightswofthe
SSVMwithmargin-rescaling introduces a margin vio- hashing function, in terms of minimization of the fol-
lation (slack) variable for each training pair, and min- lowing convex-concave upper bound on empirical loss:
imizes sum of slack variables. For a pair (x, y∗), slack
is defined as max [L(y,y∗)+f(x,y;w)]−f(x,y∗;w). X
y max L(g,g ,s )+gTWx +gTWx
Importantly, the slack variables provide an upper g ,g ∈H i j ij i i j j
(i,j)∈S i j
b
bound on loss for the predictor y; i.e.,
−max hTWx −max hTWx . (11)
b ∗ i i j j
L(y,y ) h ∈H h ∈H
i j
∗ b
≤ max[L(y,y )+f(x,y;w)]−f(x,y;w) (8)
y 4. Optimization
≤ max[L(y,y∗)+f(x,y;w)]−f(x,y∗;w) . (9)
y Minimizing (11) to find w entails the maximization
To see the inequality in (8), note that, if the first term of three terms for each pair (i,j) ∈ S. The second
b and third terms are trivially maximized directly by
on the RHS of (8) is maximized by y = y, then the f
terms cancel, and (8) becomes an equality. Otherwise, the hash function (5). Maximizing the first term is,
the optimal value of the max term must be larger than however, not trivial. It is similar to the loss-adjusted
b inference in the SSVMs. The next section describes
when y = y, which causes the inequality. The second
inequality (9) follows straightforwardly from the defi- an efficient algorithm for finding the exact solution of
b b loss-adjusted inference for hash function learning.
nition of y in (7); i.e., f(x,y;w) ≥ f(x,y;w) for all
y. The bound in (9) is piecewise linear, convex in w,
and easier to optimize than the empirical loss. 4.1. Binary hashing loss-adjusted inference
3.2. Convex-concave bound for hashing We solve loss-adjusted inference for general loss func-
tions of the form L(h,g,s) = ℓ(kh − gk ,s). This
H
Thedifferencebetweenlearninghashfunctionsandthe applies to both ℓ and ℓ . The loss-adjusted infer-
bre ρ
SSVM is that the binary codes for our training data ence is to find the pair of binary codes given by
are not known a priori. But note that the tighter
∗ e e
bound in (8) uses y only in the loss term, which b(xi;xj,w),b(xj;xi,w) =
is useful for hash function learning, because suitable T T
argmax ℓ(kg −g k ,s )+g Wx +g Wx . (12)
loss functions for hashing, such as (4), do not require i j H ij i i j j
g ,g ∈H
i j
ground-truth labels. The bound (8) is piecewise linear,
convex-concave (a sum of convex and concave terms), Before solving (12) in general, first consider the spe-
andis the basis for SSVMs with latent variables (Yu & cific case for which we restrict the Hamming distance
Joachims, 2009). Below we formulate a similar bound between g and g to be m, i.e., kg −g k =m. For
i j i j H
for learning binary hash functions. q-bit codes, m is an integer between 0 and q. When
Minimal Loss Hashing for Compact Binary Codes
kg −g k = m, the loss in (12) depends on m but procedure (Yuille & Rangarajan, 2003). Applying
i j H
not the specific bit sequences gi and gj. Thus, instead this method to our problem, we should iteratively
of (12), we can now solve impute the missing data (the binary codes b(xi;w))
T T and optimize for the convex term (the loss-adjusted
ℓ(m,s ) + max g Wx +g Wx (13)
ij g ,g ∈H i i j j terms in (11)). However, our preliminary experiments
i j
s.t. kg −g k =m. showed that this procedure is slow and not so effective
i j H for learning hash functions.
The key to finding the two codes that solve (13) is to Alternatively, following the structured perceptron
decide which of the m bits in the two codes should be (Collins, 2002) and recent work of McAllester et al.
different. (2010) we considered a stochastic gradient-based ap-
Let v denote the kth element of a vector v. We can proach, based on an iterative, perceptron-like, learn-
[k]
computethejointcontribution of the kth bits of gi and ing rule. At iteration t, let the current weight vector
T T be wt, and let the new training pair be (xt,x′) with
g to [g Wx +g Wx ] by t
j i i j j supervisory signal s . We update the parameters ac-
t
ζk(g , g ) = g (Wxi) +g (Wxj) , cording to the following learning rule:
i[k] j[k] i[k] [k] j[k] [k]
and these contributions can be computed for the four wt+1 = wt +ηhψ(xt,b(xt;wt))+ψ(x′,b(x′;wt))
possible states of the kth bits independently. To this t t
end, e ′ t ′ e ′ t i
−ψ(xt,b(xt;xt,w ))−ψ(xt,b(xt;xt,w )) (14)
∆ =max ζ (1,0),ζ (0,1) −max ζ (0,0),ζ (1,1)
k k k k k T
where η is the learning rate, ψ(x,h)=vec(hx ), and
represents how much is gained by setting the bits g e ′ e ′
b(x ;x ,w ) and b(x ;x ,w ) are provided by the loss-
i[k] t t t t t t
andgj[k] to be different rather than the same. Because adjusted inference above. This learning rule has been
g and g differ only in m bits, the solution to (13) is effective in our experiments.
i j
obtained by setting the m bits with m largest ∆ ’s to
k One interpretation of this update rule is that it fol-
be different. All other bits in the two codes should be lowsthenoisygradientdescentdirectionofourconvex-
the same. When g and g must be different, they
i[k] j[k] concave objective. To see this more clearly, we rewrite
are found by comparing ζk(1,0) and ζk(0,1). Other- the objective (11) as
wise, they are determined by the larger of ζk(0,0) and
ζk(1,1). Now solve (13) for all m; noting that we only Xh T e T e
L +w ψ(x ,b(x ;x ,w))+w ψ(x ,b(x ;x ,w))
compute ∆k for each bit, 1≤k≤q, once. ij i i j j j i
Tosolve (12) it suffices to find the m that provides the (ij)∈S i
−wTψ(x,b(x ;w))−wTψ(x ,b(x ;w)) . (15)
largest value for the objective function in (13). We i i j j
first sort the ∆ ’s once, and for different values of m,
k
we compare the sum of the first m largest ∆ ’s plus e
k Theloss-adjusted inference (12) yields b(xi;xj,w) and
ℓ(m,s ), and choose the m that achieves the highest e
ij b(xj;xi,w). Evaluating the loss function for these two
score. Afterwards, we determine the values of the bits binarycodesgivesLij (whichnolongerdependsonw).
according to their contributions as described above. Taking the negative gradient of the objective (15) with
Given the values of Wx and Wx , this loss-adjusted respect to w, we get the exact learning rule of (14).
i j However, note that this objective is piecewise linear,
inference algorithm takes time O(qlogq). Other than
sorting the ∆ ’s, all other steps are linear in q which due to the max operations, and thus not differentiable
k at isolated points. While the theoretical properties of
makes the inference efficient and scalable to large code
lengths. The computation of Wx can be done once this update rule should be explored further (e.g., see
i (McAllester et al., 2010)), we empirically verified that
per point, although it is used with many pairs.
the update rule lowers the upper bound, and converges
4.2. Perceptron-like learning to a local minima. For example, Fig. 1 plots the em-
pirical loss and the bound, computed over 105 training
In Sec. 3.2, we formulated a convex-concave bound pairs, as a function of the iteration number.
(11) on empirical loss. In Sec. 4.1 we described how
the value of the bound could be computed at a given 5. Implementation details
W. Now consider optimizing the objective i.e., low-
ering the bound. A standard technique for mini- We initialize W using LSH; i.e., the entries of W are
mizing such objectives is called the concave-convex sampled(IID)fromanormaldensityN(0,1),andeach
no reviews yet
Please Login to review.