How Spotify Understands Your Music Diversity
Measuring user consumption diversity at Spotify to quantify the impact of recommender systems
Hardly any software product these days lacks a form of personalization. It’s no secret that these features are driven by recommendation algorithms under the hood, and services such as Netflix and YouTube invest large amounts of money each year to optimize these systems. Some apps, such as TikTok, put recommender systems center-stage to drive views and the virality of the app.
Spotify is no different.
Spotify was founded in Sweden by Daniel Ek and Martin Lorentzon in 2006, with the goal of creating a legal digital music platform. At present, Spotify has a library of over 50 million songs spanning over 1,500 genres, with about 40,000 new songs added to the platform every single day!
Given the large pool of content and a large user base, it is not surprising that Spotify relies on recommendation algorithms to promote content to its users. Recommendation algorithms drive one of its top features, Discover Weekly, which lets users try music similar to the kind they already enjoy by finding similarities with other users’ playlists. The company is currently beta testing a sponsored recommendations feature that lets artist teams pay to promote their content via recommendations.
Clearly, recommender systems play a very important role in content consumption at Spotify.
This article, however, isn’t about singing Spotify’s praises¹.
Instead, given a large content and user base, it is expected that users will have different tastes. So the question is:
How does Spotify measure content diversity in its user base?
We answer just that in this article! Code in this article is found here.
What is user consumption diversity?
If you’re a Spotify user, you can find some interesting trends in your music consumption via spotify.me: your most active listening hours of the day, your favorite genres and average music tempo, or whether you have the playlist of a cooking enthusiast.
Interestingly, there’s a banner displayed that presents the diversity of the music you have listened to.
Of course, it would be simple enough to collect popularity and usage statistics, which can be used as data for a recommender system. Suppose, however, that we want to ensure that recommendations to users are diverse. How is diversity measured?
The two standard metrics for measuring diversity are the Gini coefficient and Shannon entropy.
The Gini coefficient has its roots in economics as a measure of inequality in a society. It is usually defined with reference to the Lorenz curve, which plots the proportion of the population’s total income (y-axis) cumulatively earned by the bottom x of the population (x-axis). The 45-degree line denotes perfect income equality. The Gini coefficient equals A/(A+B), where A is the area between the line of equality and the Lorenz curve, and B is the area under the Lorenz curve. Since both axes run from 0 to 1 (inclusive), the equation A + B = 0.5 (the total area of the triangle) holds, so the coefficient is also equal to 2A or 1 − 2B. For recommender systems, we can swap “people” for users and “incomes” for clicks or some other desirable measure.
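The 1 − 2B form lends itself to a short computation. Below is a minimal NumPy sketch (function and variable names are ours) that sorts the values, builds the Lorenz curve, and estimates the area B with the trapezoidal rule:

```python
import numpy as np

def gini(values):
    """Gini coefficient of non-negative values (e.g. clicks per user),
    computed from the Lorenz curve as 1 - 2B, where B is the area
    under the curve."""
    v = np.sort(np.asarray(values, dtype=float))
    n = v.size
    # Lorenz curve: cumulative share of the total, prepended with 0.
    lorenz = np.concatenate(([0.0], np.cumsum(v) / v.sum()))
    # Area B under the Lorenz curve via the trapezoidal rule.
    area_b = np.sum((lorenz[:-1] + lorenz[1:]) / 2) / n
    return 1.0 - 2.0 * area_b

print(gini([1, 1, 1, 1]))  # perfect equality -> 0.0
print(gini([0, 0, 0, 1]))  # all clicks on one item -> 0.75 (max is (n-1)/n)
```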
The Shannon entropy measure is a classic measure named after Claude Shannon, the father of information theory. It was first introduced in the seminal 1948 paper, “A Mathematical Theory of Communication”. It is defined as

H(X) = −Σᵢ pᵢ log_b(pᵢ)
where pᵢ is the probability of a random variable X having a realization xᵢ. Note that it is defined for discrete random variables. The logarithm base is typically set to b = 2, with bits as the unit of measure. Entropy is maximized if and only if the distribution over items is uniform, i.e., every item is equally popular.
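Computing the entropy of a user’s empirical listening distribution from raw counts is a one-liner in NumPy. The sketch below (our function name; play counts are made up for illustration) normalizes counts into probabilities and applies the definition above:

```python
import numpy as np

def shannon_entropy(counts, base=2):
    """Entropy of the empirical distribution given raw counts
    (e.g. play counts per song). base=2 gives bits."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()              # drop zero counts, normalize
    return -np.sum(p * np.log(p)) / np.log(base)

# Four songs played equally often: maximal entropy, log2(4) = 2 bits.
print(shannon_entropy([10, 10, 10, 10]))  # -> 2.0
# Almost always the same song: entropy near 0.
print(shannon_entropy([97, 1, 1, 1]))
```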
There is one major drawback with both measures: they do not account for the similarity between items.
Anyone who has been in sales will tell you that only a small proportion of items in any collection are popular. A typical ranking by popularity would yield a Zipfian-like distribution, with a short head, long tail, and distant tail.
Items in most datasets are not independent, as these measures assume. Instead, there is often a relationship between one item and another; they may, for example, belong to the same genre of music, as in Spotify’s case.
What’s a better way?
Spotify’s method in measuring diversity comes from factoring in the similarity between songs through the use of song embeddings.
Spotify uses Word2Vec, a well-known language model that learns the distributed representation of words from a corpus. In particular, Word2Vec is applied over user playlists (over 4 billion of them!) to learn user tastes and to help users discover music that suit their tastes.
There are a few other excellent articles and implementations out there that use Word2Vec for playlist recommendations. Below is a figure of the embeddings of 100,000 songs in Spotify’s collection, obtained from Word2Vec and projected into 2-dimensional space using t-SNE. The existence of clusters shows that there are similarities between groups of songs, mainly due to the existence of music genres.
Our reference comes from a paper² studying the effects of recommender systems on content consumption at Spotify. It defines the Generalist-Specialist (GS) score over a predefined time period T for a user uᵢ as

GS(uᵢ) = (1/Σⱼ wⱼ) Σⱼ wⱼ · cos(sⱼ, c̄ᵢ)

where sⱼ is the embedding vector of song j and the weighted center is

c̄ᵢ = Σⱼ wⱼ sⱼ / ‖Σⱼ wⱼ sⱼ‖
The weight wⱼ is the number of times the user has listened to the song j in time period T.
Intuitively, if a user listens to very similar songs, the GS score will tend towards 1, since their choices lie close to the weighted center. Conversely, the more of a generalist a user is, the lower the score, since their choices point in many different directions away from the weighted center.
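The definition above can be sketched in a few lines of NumPy. This is our own illustrative implementation (variable names and toy vectors are ours, not the paper’s code), taking song embeddings and play counts for a single user:

```python
import numpy as np

def gs_score(song_vectors, weights):
    """Generalist-Specialist score: weighted mean cosine similarity of a
    user's song vectors to their weighted center.
    song_vectors: (n, d) array of embeddings; weights: (n,) play counts."""
    S = np.asarray(song_vectors, dtype=float)
    w = np.asarray(weights, dtype=float)
    # Weighted center of the user's songs (scale doesn't affect cosine).
    center = (w[:, None] * S).sum(axis=0) / w.sum()
    cos = S @ center / (np.linalg.norm(S, axis=1) * np.linalg.norm(center))
    return np.average(cos, weights=w)

# A "specialist": every song points the same way -> score 1.0
print(gs_score([[1, 0], [2, 0], [3, 0]], [5, 3, 2]))
# A "generalist": songs point in different directions -> score well below 1
print(gs_score([[1, 0], [0, 1], [-1, 0.5]], [1, 1, 1]))
```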
Diversity on the MovieLens Dataset
Let’s apply the GS score to an actual dataset.
We use the Movielens dataset, a publicly available dataset typically used as a benchmark for recommender systems. Here’s the link to the Jupyter notebook. When run, the notebook should download the dataset.
The steps in computing the GS score are as follows:
1. Train a Word2Vec model on user-item interaction sequences to obtain item embeddings.
2. For each user, compute the weighted center of the embeddings of the items they consumed.
3. Take the weighted average of the cosine similarity between each item and the weighted center.
Word2Vec Training on MovieLens
We first train our Word2Vec model on all user-movie interactions in the MovieLens dataset. The MovieLens dataset has ratings spanning the years 1995 to 2018 and contains 283,228 unique users. As this is a manageable number of users, we use the whole dataset; thus, the time period T spans from 1995 to 2018.
Now, Word2Vec is a self-supervised language model, so quantitatively determining the quality of the item (word) vectors can be difficult. Typically, the output of Word2Vec is fed into a downstream task, say a classification task, where the quality of the model can be measured directly. In Spotify’s case, this can be done by measuring clicks on the songs recommended by the model. Measuring the model’s perplexity is the usual way of quantifying its quality, but that too has its limitations.
Note that we made no attempt at optimizing the Word2Vec model’s many hyperparameters.
Since our goal is measuring diversity, a cursory glance helps to make sense of the results. Here, we show the top 5 most similar and dissimilar movies to the classic comedy Mrs. Doubtfire, with results coming from the trained model.
Interestingly, even without being provided the genre information to Word2Vec, we can see that Mrs. Doubtfire is similar to other comedies and dissimilar to dramas. Other movies can be similarly scored, demonstrating that the model is implicitly learning the notion of movie genres.
We can visualize the movie embeddings from the model using t-SNE in Tensorboard. We can see groups of movies, though they are not as tightly clustered as the songs in Spotify’s example.
With the trained model, we proceed to the next phase.
Computing the Generalist-Specialist Score
In the dataset, each user rates a movie on a scale of 1 to 5. So in our version of the GS score, we take each user’s rating of a movie as the weight; this is in contrast to the number of times a user has listened to a song in the original Spotify definition.
We compute GS score for all users in the MovieLens dataset. For comparison, we also compute the Shannon entropy (using base-2).
The Shannon entropy measures diversity, but its meaning is harder to interpret. We can say outright that a few users have 0 entropy, meaning no diversity at all; these are mainly users who have only watched and rated a single movie. Many users have watched and rated a variety of movies, but there isn’t much to be said beyond that.
The GS score for the entire MovieLens dataset is shown here. From the histogram above, the majority of users have a wide range of movies watched and rated. The spike at the 1.0 bin is mainly due to users who have only watched and rated a single movie. Using our trained Word2Vec model, there are 857 users who watched and rated more than a single movie yet have a GS score above 0.90.
There is a clear difference between the two measures. One example from the dataset is a user who has watched and rated 140 movies: their Shannon entropy is 7.13 bits, while their GS score is 0.49. This difference comes from the fact that Shannon entropy treats all movies as equally distinct, while the GS score accounts for movie similarity, implicitly capturing movie genres since a typical user only watches a handful of genres.
In the paper, as the activity level increases, the GS scores tend towards a stable distribution centered at 0.5. Interestingly, for MovieLens, it seems that there are two sets of users drawn from two distributions, leading to the bimodal distribution seen in the figure. Investigating this is left to future work.
Tying it All Up
Underlying all these is one simple principle:
Items generally have similarities with each other.
It’s a fundamental assumption that needs to hold true if recommender systems are to work in the first place. The Spotify method of measuring diversity aims to exploit the underlying relationships between the recommended items themselves using a learning model.
If we take a step back from the implementation, the steps are clear:
1. Learn vector representations of items from user consumption data.
2. Define a similarity measure between items.
3. Measure how tightly each user’s consumed items cluster around their center.
By generalizing these concepts, we can see that there is no necessary reason to use Word2Vec to learn the relationships between items. Spotify’s reason for using Word2Vec is that the company already deploys the model in its recommender systems. Alternative unsupervised language models such as GloVe and WordRank can be used instead. A model with fewer hyperparameters would generally be preferable, as Word2Vec has many to optimize.
Moreover, “similarity” is up to the discretion of the data scientist to define based on the application and product. For instance, if there is an underlying graph structure between the items, then by all means use a graph model instead. If a matrix factorization model is used for recommendations, the measure could be the cosine similarity between the vectors obtained from the item subspace.
A subtle assumption here is that the Word2Vec model takes the sequential order of the user’s consumption to matter. This is certainly important in Spotify’s case, since playlists are structured depending on the user’s mood and the time of day. It may or may not hold in your application; if not, a model that assumes permutation invariance (i.e., that order does not matter) is more suitable.
Will recommender systems move our media consumption from “a world of hits to a world of niches”, or a world where hits become even bigger hits?
We hope that a measure of diversity would help us develop systems that provide all of us richer and more novel experiences.