Active Learning in Recommender Systems

Neil Rubens, Dain Kaplan, and Masashi Sugiyama

[PREPRINT] Please cite as:

N. Rubens, D. Kaplan, M. Sugiyama. Recommender Systems Handbook: Active Learning in Recommender Systems (eds. P.B. Kantor, F. Ricci, L. Rokach,

B. Shapira). Springer, 2011.

@INCOLLECTION{RubensRecSysHB2010, author = {Neil Rubens and Dain Kaplan and Masashi Sugiyama}, title = {Active Learning in Recommender Systems}, booktitle = {Recommender Systems Handbook}, publisher = {Springer}, year = {2011}, editor = {P.B. Kantor and F. Ricci and L. Rokach and B. Shapira}, pages = {735-767}, doi = {10.1007/978-0-387-85820-3_23}}

1 Introduction

Recommender Systems (RSs) are often assumed to present items to users for one reason to recommend items a user will likely be interested in. Of course RSs do recommend, but this assumption is biased, with no help of the title, towards the “recommendingthe system will do. There is another reason for presenting an item to the user: to learn more about their preferences, or their likes and dislikes. This is where Active Learning (AL) comes in. Augmenting RSs with AL helps the user become more self-aware of their own likes/dislikes while at the same time providing new information to the system that it can analyze for subsequent recommendations. In essence, applying AL to RSs allows for personalization of the recommending process, a concept that makes sense as recommending is inherently geared towards personalization. This is accomplished by letting the system actively influence which items the user is exposed to (e.g. the items displayed to the user during sign-up or during regular use), and letting the user explore his/her interests freely.

Unfortunately, there are very few opportunities for the system to acquire information, such as when a user rates/reviews an item, or through a user’s browsing history.1 Since these opportunities are few, we want to be as sure as possible that the data we acquire tells us something important about the user’s preferences. After all, one of the most valuable assets of a company is user data.

For example, when a new user starts using a recommender system, very little is known about his/her preferences [43, 36, 2]. A common approach to

1There is an increasing trend to utilize social networks for acquiring additional data.


learning the user’s preferences is to ask him/her to rate a number of items (known as training points). A model that approximates their preferences is then constructed from this data. Since the number of items reviewed by the user cannot span the system’s entire catalog (and indeed would make the task of AL as well as recommending moot points), the collection of items presented to the user for review must necessarily be very limited. The accuracy of the learned model thus greatly depends on the selection of good training points. A system might ask the user to rate Star Wars I, II,and III.Byratingallthree volumes of this trilogy, we will have a good idea of the user’s preferences for Star Wars, and maybe by extension, an inclination for other movies within the Sci-Fi genre, but overall the collected knowledge will be limited. It is therefore unlikely that picking the three volumes of a trilogy will be informative.2 Another issue with selecting a popular item such as Star Wars is that by definition the majority of people like them (or they would not be popular). It is not surprising then, that often little insight is gained by selecting popular items to learn about the user (unless the user’s tastes are atypical).

There is a notion that AL is a bothersome, intrusive process, but it does not have to be this way [53, 38]. If the items presented to the user are interesting, it could be both a process of discovery and of exploration. Some Recommender Systems provide a “surprise me!button to motivate the user into this explorative process, and indeed there are users who browse suggestions just to see what there is without any intention of buying. Exploration is crucial for users to become more self-aware of their own preferences (changing or not) and at the same time inform the system of what they are. Keep in mind that in a sense users can also be defined by the items they consume, not only by the ratings of their items, so by prompting users to rate dierent items it may be possible to further distinguish their preferences from one another and enable the system to provide better personalization and to better suit their needs.

This chapter is only a brief foray into Active Learning in Recommender Systems.3 We hope that this chapter can, however, provide the necessary foundations.

For further reading, [45] gives a good, general overview of AL in the context of Machine Learning (with a focus on Natural Language Processing and Bioinformatics). For a theoretical perspective related to AL (a major focus in the field of Experimental Design), see [7, 4, 22]; there have also been recent works in Computer Science [16, 5, 50].

1.1 Objectives of Active Learning in Recommender Systems

Dierent RSs have dierent objectives, which necessitate dierent objectives for their Active Learning components as well. As a result, one AL method may be better suited than another for satisfying a given task [35]. For example, what is important in the recommender system being built? The diculty of signing-up (user eort)? If the user is happy with the service (user satisfaction)? How well the system can predict a user’s preferences (accuracy)? How well the system can

2Unless our goal is to learn a kind of micro-preference, which we can define as a person’s tendency to be more ’pickyconcerning alternatives close to one another in an genre they like.

3Supplementary materials on Active Learning can be found at:

express a user’s preferences (user utility)? How well the system can serve other users by what it learns from this one (system utility)? System functionality may also be important, such as when a user inquires about a rating for an item of interest the system has insucient data to predict a rating for, what the system does in response. Does it in such a case give an ambiguous answer, allowing the user to train the system further if they have the interest and the time to do so? Or does it require them to rate several other items before providing a prediction? Perhaps the user has experienced the item (e.g. watched the movie or trailer) and thinks their rating diers substantially from the predicted one [10]. In all these cases how the system responds to the user is important for consideration.

Traditionally AL does not consider the trade-oof exploration (learning user’s preferences) and exploitation (utilizing user’s preferences), that is, it does not dynamically assign weights to exploitation/exploration depending on system objectives. This trade-ois important because for a new user about which nothing or little is known, it may be beneficial to validate the worth of the system by providing predictions the user is likely to be interested in (exploitation), while long-term users may wish to expand their interests through exploration [38, 40].

Though an objective of the RS will likely be to provide accurate predictions to the user, the system may also need to recommend items of high novelty/serendipity, improve coverage, maximize profitability, or determine if the user is even able to evaluate a given item, to name a few [43, 21, 33]. Multiple objectives may need to be considered simultaneously, e.g. minimizing the net acquisition cost of training data while maximizing net profit, or finding the best match between the cost of oering an item to the user, the utility associated with expected output, and the alternative utility of inaction [38]. The utility of training may also be important, e.g. predicting ratings for exotic cars may not be so useful if the user is not capable of purchasing them and so should be avoided. It can be seen that the system objective is often much more complex than mere predictive accuracy, and may include the combination of several objectives.

While Recommender Systems in general often have an ill-defined or open-ended objective, namely to predict items a user would be interested in, Conversation-based AL [32, 37, 9], as the name suggests, engages in a conversation with the user as a goal oriented approach. It seeks to, through each iteration of questioning, elicit a response from the user to best reduce the search space for quickly finding what it is the user seeks (Section 8).

The New User Problem When a user starts using a RS they expect to see interesting results after a minimal amount of training. Though the system knows little about their preferences, it is essential that training points are selected for rating by the user that will maximize understanding what the new user wants [35].

The New Product Problem As new products are introduced into the system, it is important to quickly improve prediction accuracy for these items by selecting users to rate them [24].

Figure 1: Active Learning: illustrative example (See Section 1.2).

Cost of obtaining an output value Dierent means of obtaining an output value come at dierent costs. Implicit strategies, such as treating a user click on asuggested itemaspositive output, ornotclickingasnegative, areinexpensive in relation to user eort. Conversely, asking the user to explicitly rate an item is more costly, though still dependent on the task. Watching a movie like Star Wars to rate may provide good results but requires substantial user eort [20]; rating a joke requires much less. This often dovetails the exploration/exploitation coupling and trade-os between obtaining outputs from dierent inputs should also be considered (e.g. certainty/uncertainty, ease of evaluation, etc.)

Adaptation for dierent AL methods Though we focus on the traditional objective of reducing predictive error, it is equally plausible to construct a method for maximizing other goals, such as profitability. In this case a model would pick points that most likely increase profit rather than a rating’s accuracy.

1.2 An Illustrative Example

Let’s look at a concrete example of Active Learning in a Recommender System. This is only meant to demonstrate concepts, so it is oversimplified. Please note that the similarity metric may dier depending on the method used; here, movies are assumed to be close to one another if they belong to the same genre. Figure 1 shows two charts, the leftmost is our starting state, in which we have already asked the user to rate a movie within the upper right group, which we will say is the Sci-Fi genre. The right chart shows us four possibilities for selecting our next training point: (a), (b), (c), or (d). If we select the training point (a) which is an obscure movie (like The Goldfish Hunter), it does not aect our predictions because no other movies (points) are nearby. If we select the training point (b), we can predict the values for the points in the same area, but these predictions are already possible from the training point in the same area (refer to the chart on the left). If training point (c) is selected, we are able to make new predictions, but only for the other three points in this area, which happens to be Zombie movies. By selecting training point (d), we are able to make predictions for a large number of test points that are in the same area, which belong to Comedy movies. Thus selecting (d) is the ideal choice because it allows us to improve accuracy of predictions the most (for the highest number of training points).4

1.3 Types of Active Learning

AL methods presented in this chapter have been categorized based on our interpretation of their primary motivation/goal. It is important to note, however, that various ways of classification may exist for a given method, e.g. sampling close to a decision boundary may be considered as Output Uncertainty-based since the outputs are unknown, Parameter-based because the point will alter the model, or even Decision boundary-based because the boundary lines will shift as a result. However, since the sampling is performed with regard to decision boundaries, we would consider this the primary motivation of this method and classify it as such.

In addition to our categorization by primary motivation (Section 1), we further subclassify a method’s algorithms into two commonly classified types for easier comprehension: instance-based and model-based.

Instance-based Methods Amethodof this type selects points based on their properties in an attempt to predict the user’s ratings by finding the closest match to other users in the system, without explicit knowledge of the underlying model. Other common names for this type include memory-based, lazy learning, case-based, and non-parametric [2]. We assume that any existing data is accessible, as well as rating predictions from the underlying model.

Model-based Methods Amethod of this type selects points in an attempt to best construct a model that explains data supplied by the user to predict user ratings [2]. These points are also selected to maximize the reduction of expected error of the model. We assume that in addition to any data available to instance-based methods, the model and its parameters are also available.

Modes of Active Learning: Batch and Sequential Because users typically want to see the system output something interesting immediately, a common approach is to recompute a user’s predicted ratings after they have rated a single item, in a sequential manner. It is also possible, however, to allow a user to rate several items, or several features of an item before readjusting the model. On the other hand, selecting training points sequentially has the advantage of allowing the system to react to the data provided by users and make necessary adjustments immediately. Though this comes at the cost of interaction with the user at each step. Thus a trade-oexists between Batch and Sequential AL: the usefulness of the data vs. the number of interactions with the user.

2 Properties of Data Points

When considering any Active Learning method, the following three factors should always be considered in order to maximize the eectiveness of a given point. Supplementary explanations are then given below for the first two. Examples refer to the Illustrative Example (Figure 1).

4This may be dependent on the specific prediction method used in the RS.

(R1) Represented: Is it already represented by the existing training set?

E.g. point (b).

(R2) Representative: Is the point a good candidate for representing other data points? Or is it an outlier? E.g. point (a).

(R3) Results:Will selecting this point result in better prediction ratings or accomplish another objective? E.g. point (d), or even point (c).

(R1) Represented by the Training Data As explained in the introduction to this chapter, asking for ratings of multiple volumes from a trilogy, such as Star Wars, is likely not beneficial, as it may not substantially contribute to the acquisition of new information about the user’s preferences. To avoid obtaining redundant information, therefore an active learning method should favor items that are not yet well represented by the training set [18].

(R2) Representative of the Test Data It is important that any item selected for being rated by an AL algorithm be as representative of the test items as possible (we consider all items as potentially belonging to the test set), since the accuracy of the algorithm will be evaluated based on these items. If a movie is selected from a small genre, like Zombie movies from the Illustrative Example (Figure 1), then obtaining a rating for this movie likely provides little insight into a user’s preferences other, more prominent genres. In addition, users naturally tend to rate movies from genres they like, meaning that any genre that dominates the training set (which is likely composed of items the user likes) may be representative of only a small portion of all items [38]. In order to increase information obtained, it is important to select representative items which may provide information about the other yet unrated items [18, 46, 52].

2.1 Other Considerations

In addition to the three Rs listed in Section 2, it may also be desirable to consider other criteria for data points, such as the following.

Cost As touched upon in the introduction to this chapter, obtaining implicit feedback from user selections is cheaper than asking the user to explicitly rate an item [19]. This can be considered a variable cost problem. One approach for tackling this, is to take into account both the cost of labeling an item and the future cost of estimated misclassification were the item to be added to the training set [27]. Moreover, the cost may be unknown beforehand [47].

Ratability Auser may notalways be able toprovide arating foran item; you cannot properly rate a movie you have not seen! It is suggested therefore that the probability of a user being able to evaluate an item also be considered [20].

Saliency Decision-centric AL places emphasis on items whose ratings are more likely to aect decision-making, and acquires instances that are related to decisions for which a relatively small change in their estimation can change the order of top rated predictions [42]. For example, unless labeling an item would result in displacing or rearranging a list of top ten recommended movies on a user’s home page (the salient items), it may be considered of little use. It is also possible to only consider the eect of obtaining an item’s rating on items that are strongly recommended by the system [6].

Popularity It has also been suggested to take into account an item’s popularity [35], i.e. how many people have rated an item. This operates on the principle that since a popular item is rated by many people, it may be rather informative. Conversely, an item’s rating uncertainty should also be considered since positive items have a tendency to be rated highly by most users, indicating the item may not provide much discriminative power and thus not worth including in the training set.

Best/Worst It has been shown [29] that looking at the best/worst reviews is beneficial when a user makes a decision about an item. Extending this idea to Active Learning we hypothesize with the “best/worstprinciple that in order to make a decision about a user’s preferences it may also be beneficial to obtain his/her best/worst ratings (as it may capture user preferences well). By asking auserto providehis/hermostliked/disliked items, it changesthe problemof AL to one in which a user is asked to provide a rating for an item in a known class (e.g. to select a favorite movie from within a liked genre), and the process of obtaining an item is what incurs the cost [30]. This process is called active class selection. This is the opposite from traditional AL techniques in which the labeling process (and not the items themselves) is what is assumed to incur acost.

3 Active Learning in Recommender Systems

With Traditional AL, users are asked to rate a set of preselected items. This is often at the time of enrollment, though a preselected list may be presented to existing users at a later date as well. It may be argued that since these items are selected by experts, they capture essential properties for determining a user’s preferences. Conceptually this may sound promising, but in practice this often leads towards selecting items that best predict the preferences of only an average user. Since the idea of RS is to provide personalized recommendations, selecting items to rate in a personalized manner should readily make more sense.

3.1 Method Summary Matrix

The following matrix (Table 1) provides a summary of the methods overviewed in this chapter. Explicit performance numbers are not supplied because to our knowledge no such comprehensive comparison in fact exists. AL methods could be compared on an individual basis, but any results would be inconclusive. This is because authors have a tendency to x the predictive method and then apply one or more compatible AL methods to compare performance. Moreover, AL methods are often designed for a specific predictive method, and may therefore not have good performance when applied to a dierent method (which creates potentially misleading results), or may not even be applicable if the underlying system is not able to provide the required information, e.g. distribution of rating Figure 2: Active learning employs an interactive/iterative process for obtaining training data, unlike passive learning, where the data is simply given.

estimates. For these reasons we have opted to omit performances figures of any kind.

4 Active Learning Formulation

Passive Learning (see Figure 2) refers to when training data is provided beforehand, or when the system makes no eort to acquire new data (it simply accumulates through user activities over time). Active Learning,on the other hand, selects training points actively (the input) so as to observe the most informative output (user ratings, behavior, etc.).

Let us define the problem of active learning in a more formal manner. An item is considered to be a multi-dimensional input variable and is denoted by a vector x (also referred to as a data point).5 The set of all items is denoted by

X . The preferences of a user u are denoted by a function fu (also referred to as a target function); for brevity, we use f when referring to a target user. A rating of an item x is considered to be an output value (or label)and is denoted as y =f(x).Each item x could be rated on a finite scale Y ={1,2,...,5}.

In supervised learning, the items and corresponding user ratings are often partitioned into complementary subsets a training set and a testing set (also called a validation set). The task of supervised learning is then too, given a training set (often supplemented by the ratings of all users), learn a function

f that accurately approximates a user’s preferences. Items that belong to the

training set are denoted by X (Train), and these items along with their corresponding ratings constitute a training set, i.e. T = {(xi,yi)}xiX(Train) .We measure how accurately the learned function predicts the true preferences of a user by the generalization error:

G(f)= L f(x),f(x)P(x). (1)


In practice, however, f(x)is not available for all x X ;it is therefore common to approximate the generalization error by the test error:

G(f)= L f(x),f(x)P(x), (2)

xX (T est)

5The way in which an item is represented depends on the RS and the underlying predictive method. In Collaborative Filtering based approaches items could represented through the ratings of the users, or, in content based RSs, items could be represented through their descriptions.

Primary Motivation of Approach Uncertainty Reduction (Section 5) Description/Goal Reducing uncertainty of: rating estimates (Section 5.1), decision boundaries (Section 5.2), model parameters (Section 5.3). Possible Considerations Reducing uncertainty may not always improve accuracy; the model could simply be certain about the wrong thing (e.g. when the predictive method is wrong).
Error Reduction (Section 6) Reducing the predictive error by utilizing the relation between the error and: Estimating reduction of error reliably could be difficult and computationally expensive.
the changes in the output estimates (Section 6.1.1),
the test set error
(Section 6.1.2),
changes in parameter estimates (Section 6.2.1),
the variance of
the parameter estimates (Section 6.2.2).
Ensemble-based (Section 7) Identifying useful training points based on consensus between:
models in the en
semble (Section 7.1),
multiple candidate models (Section 7.1).

The eectiveness depends on the quality of models/candidates, and could be computationally expensive since it is performed with regards to multiple models/candidates.

Table 1: Method Summary Matrix.

x input (item) X inputs (items) y output (item’s rating) Y = {1,2,...,5} possible outputs (ratings), i.e. y Y f user’s preferences function (unknown to the system)

X (Train)

training inputs (rated items) T = {(xi,yi}training set (items and their ratings)


f approximated function of user’s preferences (from training set)

G generalization error (predictive accuracy) see (1) xa item considered for rating G(xa) active learning criterion (estimates how useful rating an item xa would be)

Figure 3: Summary of Notation.

where X (Test) refers to the items in the test set,andpredictionerrorsare measured by utilizing a loss function L,e.g. mean absolute error (MAE):

LMAE f(x),f(x)= f(x) f(x), (3)

or mean squared error (MSE):

LMSE f(x),f(x)= f(x) f(x)2 . (4)

The active learning criterion is defined so as to estimate the usefulness of obtaining a rating of an item x and adding it to the training set X (Train) for achieving a certain objective (Section 1.1). For simplicity, let us consider this objective to be the minimization of generalization error of a learned fuction with respect to the training set. We then denote the active learning criterion as:

G(X (Train) {x}), (5)

or for brevity, denote it as: G(x). (6)

The goal of active learning is to select an item x that would allow us to minimize the generalization error G(x):

argminG(x). (7)


If we consider asking a user to rate an item xj or an item xk,then we would estimate their usefulness by an active learning criterion, i.e. G(xj) and G(xk), and select the one that will result in a smaller generalization error. Note that we need to estimate the usefulness of rating an item without knowing its actual rating. To distinguish a candidate item to be rated from the other items we refer to it as xa.AL can be applied to any predictive method as long as it provides the required information, such as rating estimates [41] and their distribution [23, 25], closeness to the decision boundary [54, 15], method parameters [48], etc.

Regression and Classification The problem of predicting a user’s ratings could be treated as both a regression and a classification problem. It is a regression problem since the ratings are discrete numerical values,such as if we consider their ordinal properties, meaning the ratings could be ordered (e.g. a rating of 4 is higher than a rating of 3). On the other hand, we can disregard the numerical properties of the ratings and treat the problem as a classification one by treating ratings as classes/labels.6 For example, we can use a nearest-neighbor (NN) approach to do classification, e.g. pick the most frequent label of the neighbors; or we can use NN to do regression, e.g. calculate the mean of the ratings of the neighbors. Throughout the chapter we use both classification and regression in examples, selecting the one most appropriate for aiding the current explanation.

5 Uncertainty-based Active Learning

Uncertainty-based AL tries to obtain training points so as to reduce uncertainty in some aspect, such as concerning output values [28], the model’s parameters [23], a decision boundary [44], etc. A possible drawback to this approach is that reducing uncertainty may not always be eective. If a system becomes certain about user ratings, it does not necessarily mean that it will be accurate, since it could simply be certain about the wrong thing (i.e., if the algorithm is wrong, reducing uncertainty will not help). As an example, if the user has so far rated items positively, a system may mistakenly be certain that a user likes all of the items, which is likely incorrect.

5.1 Output Uncertainty

In Output Uncertainty-based methods, an item to label (training point) is selected so as to reduce the uncertainty of rating predictions for test items. In Figure 1, with the assumption that the RS estimates the rating of an item based on the cluster to which it belongs (e.g. items in the same movie genre receive the same rating), if a user’s rating for a movie from the Sci-Fi genre (upper-right) has already been obtained, then there is a higher likelihood that the RS may be more certain about the ratings of other movies in the Sci-Fi genre, likely making it more beneficial to obtain a user’s preference for a movie from a genre (cluster) not yet sampled, i.e. a cluster that is still uncertain.

The dierence between instance-based and model-based approaches for Output Uncertainty-based AL is primarily in how for an arbitrary item x the rating’s distribution P (Yx) is obtained, where a rating’s distribution is defined as the probability of an item being assigned a certain rating. For model-based methods it is possible to obtain the rating’s distribution from the model itself. Probabilistic models are particularly well suited for this as they directly provide the rating’s distribution [23, 25]. For instance-based methods, collected data is used to obtain the rating’s distribution. As an example, methods utilizing nearest-neighbor techniques can obtain a rating’s distribution based on the votes of its neighbors, where “neighborhere means a user with similar preferences,7 using

6If the ordinal properties of the labels are considered, it is referred to as Ordinal Classification.

7Defining a neighbor as a similar item is also feasible depending on the method.

aformula such as:

wnn P(Yx =y)= , (8)



where NNx are neighbors that have rated an item x,and NNx,y are neighbors that have given an item x arating of y,and wnn is the weight of the neighbor (such as similarity).

5.1.1 Active Learning Methods

Some AL methods [28] estimate the usefulness of a potential training point in a local (greedy) manner by measuring the uncertainty of its output value:

GUncertaintylocal (xa)=Uncertainty(Ya). (9)

Since our goal is to minimize G, rating an item with high uncertainty is useful; it will eliminate the uncertainty about the rating of the chosen item. However, labeling an item whose rating is uncertain does not necessarily accomplish the goal of reducing the uncertainty of ratings for other items (e.g. labeling an outlier may only reduce rating uncertainty for a few other similar items, such as when selecting item (c) in the Zombie genre, or even none as in (d), shown in Figure 1.

We may thus consider reducing uncertainty in a global manner by selecting an item which may reduce the uncertainty about other unrated items. One approach [39] for doing this is to define criteria by measuring the uncertainty of ratings over all of the test items X (T est) with respect to a potential training input item xa:


GUncertainty(xa)= X (T est)ET (a) (Uncertainty(Yx)), (10)

xX(T est)


where |X(T est)| is a normalizing factor, and ET (a) (Uncertainty(Yx))is the expected value of uncertainty with respect to adding an estimated rating ya of acandidate item xa to the training set T ;i.e. T (a) =T(xa,ya).

A possible drawback of this non-local approach is that while with the local approach it is only necessary to estimate the uncertainty of a single output value ya,for the non-local approach uncertainty needs to be estimated for the output values of all the test points with respect to apotentialtrainingpoint (xa,ya);this may be dicult to estimate accurately and could be computationally expensive.

5.1.2 Uncertainty Measurement

Uncertainty of an item’s rating (output value) is often measured by its variance, its entropy [28], or by its confidence interval [38]. Variance is maximized when ratings deviate the most from the mean rating, and entropy when all the ratings are equally likely.

Uncertainty of an output value could be calculated by using a definition of variance as follows:

Uncertainty(Ya)= V AR(Ya)= y Ya2 P(Ya = y), (11)


where Ya is the mean rating of all users for an item xa and P(Ya = y) is the probability of an items rating Ya being equal to y,both being calculatedbased on either nearest-neighbors for instance-based, or obtained from the model for model-based approaches.

Uncertainty could also be measured by entropy as follows:

Uncertainty(Ya)= ENT(Ya)= P(Ya = y) log P(Ya = y). (12)


In [46] a method is proposed for measuring the uncertainty of a rating based on the probability of the most likely rating:

Uncertainty(Ya)= P(Ya = y ), (13) where y = argmaxP(Ya = y) is the most likely rating.


In [38] the confidence interval is used as a measure of uncertainty for selecting the training input point:

c = P(bl(Ya) <ya <bu(Ya)), (14)

where c is the confidence that the actual rating ya will lie in the interval between the lower bound bl(Ya) and the upper bound bu(Ya).For example, it is possible for the system to be certain that an item will be assigned a rating between 3 and 5 with a probability c = 90%. Many methods prefer items with a higher upper bound, indicating that an item may be rated highly (good for exploitation), and if the confidence interval is also wide then it may be good for exploration. In some cases where it is desirable to increase the number of items predicted to be more highly rated, it may be beneficial to use the expected change in the lower bound of the confidence interval for selecting an item [38], the higher the expected change the more desirable.

5.2 Decision Boundary Uncertainty

In Decision Boundary-based methods, training points are selected so as to improve decision boundaries. Often an existing decision boundary is assumed to be somewhat accurate, so points are sampled close to the decision boundary to further refine it (Figure 4). In a way this may also be considered Output Uncertainty-based, since the uncertainty of the points close to the decision boundary may be high. This method operates with the assumption that the decision boundary of the underlying learning method (e.g. Support Vector Machine) is easily accessible. A clear advantage of this method is that given a decision boundary, selecting training examples by their proximity to it is computationally inexpensive.

As discussed in [44], training points may be selected for obtaining a more accurate dividing hyperplane (Figure 4 (b)), or if the direction of the hyperplane is already certain, input points may be selected for reducing the size of margin (Figure 4 (c)). While it may seem obvious to sample training points closest


Figure 4: Decision boundary uncertainty.

to the decision boundary [54, 15], there are also methods that select the items furthest away [15] that have potential advantages in scenarios involving several candidate classifiers, which are discussed in Section 7. This is because a classifier should be quite certain about any items far from a decision boundary, but if newly acquired training data reveals the classifier to be inaccurate, the classifier may not t the user’s preferences well, so it should be removed from the pool of candidate classifiers.

5.3 Model Uncertainty

Model Uncertainty-based methods select training points for the purpose of reducing uncertainty within the model, more specifically to reduce uncertainty about the model’s parameters. The assumption is that if we improve the accuracy of the model’s parameters the accuracy of output values will improve as well. If we were to predict a user’s preferences based on membership in dierent interest groups [23], i.e. a group of people with a similar interest, then training points may be selected so as to determine to which groups the user belongs (Section 5.3.1).

5.3.1 Probabilistic Models

Probabilistic models are best explained with an example. The aspect model [23], a probabilistic latent semantic model in which users are considered to be a mixture of multiple interests (called aspects) is a good choice for this. Each user uU has a probabilistic membership in dierent interest groups z Z.Users in the same interest group are assumed to have the same rating patterns (e.g. two users of the same aspect will rate a given movie the same), so users and items x X are independent from each other given the latent class variable z. The probability of the user u assigning an item x the rating y can be computed as follows:

P(y|x,u)=p(y|x,z)p(z|u). (15)


The first term p(y|x,z) is the likelihood of assigning an item x the rating y by users in class z (approximated by a Gaussian distribution in [23]). It does not depend on the target user and represents the group-specific model. The global-model consists of a collection of group-specific models. The second term Figure 5: A learning scenario when the estimated model is far from the true model. Training points are indicated by solid contours.

p(z|u)is the likelihood for the target user u to be in class z,referred to as a user personalization parameter (approximated by a multinomial distribution in [23]). The user model θu consists of one or more user personalization parameters, i.e.

θu ={θuz =p(z|u)}zZ .

AtraditionalAL approach would be tomeasure the usefulness of the candidate training input point xa based on how much it would allow for reduction of the uncertainty about the user model’s parameters θu (i.e. the uncertainty about to which interest group z the user u belongs):

GθUncertainty(xa)=Uncertainty(θu), (16)

Uncertainty(θu)=θuz|xa,y logθuz|xa,y , (17) zZ


where θu denotes the currently estimated parameters of the user u and θuz|x,y aparameter thatis estimated usingan additionaltraining point (xa,y).Since the goal of the above criterion is to reduce the uncertainty of which interest groups the target user belongs to, it favors training points that assign a user to a single interest group. This approach may not be eective for all models, such as with the aspect model, in which a user’s preferences are better modeled by considering that a user belongs to multiple interest groups [23, 25].

Another potential drawback comes from the expected uncertainty being computed over the distribution p(y|x,θu)by utilizing the currently estimated model θu.The currently estimated model could be far from the true model, particularly when the number of training points is small, but the number of parameters to be estimated is large. Therefore, performing AL based only on a single estimated model can be misleading [25]. Let us illustrate this by the following example shown in Figure 5. The four existing training points are indicated by solid line contours, test points by dashed ones. Based on these four training examples, the most likely decision boundary is the horizontal line (dashed), even though the true decision boundary is a vertical line (solid). If we select training input points based only on the estimated model, subsequent training points would likely be obtained from areas along the estimated boundary, which are ineective in adjusting the estimated decision boundary (horizontal line) towards the correct decision boundary (vertical line). This example illustrates that performing AL for the currently estimated model without taking into account the model’s uncertainty can be very misleading