Now that we’ve reached the end, it’s time to comment on the final model for dnce after removing outliers, working through the process of model building, adding interactions, and assessing accuracy and stability.
Recall, my latest model attempts to predict dnce using bpm, val, acous, and the nrgy*val interaction. Of course, I’ve already divided the data to create training and test sets.
Its R-squared is almost 38%, which means we’re doing a good job (or at least on the right path) with the data we have, but we are nowhere near explaining all (or even most) of the variation in dnce from song to song. So there’s clearly more work to be done.
But let’s also look at prediction accuracy and stability, if we were to make predictions of dnce for a new song not already in our dataset.
Looking at the two histograms and the average error sizes, we can see the model lacks stability. The histograms do in fact differ: although they’re both centered at 0, their shapes are noticeably different, and the mean error sizes in the training and test sets are not all that similar. So the model is pretty unstable.
On average, the model is off by about 1.86 points when estimating dnce. That’s not bad, since we’re talking about songs’ dnce levels! So the model is somewhat accurate, even though it is unstable.
So it’s clear that if we were working with producers to create a song with the hopes of making it onto Spotify’s Top Songs of the Year, we’d really want to collect more data and more variables to help improve our model so that we could make more accurate predictions. Our next step could be to test out variables like song lyric topics by year.
*Of course, we’d also be able to easily measure the dnce level of any song we were studying, so we wouldn’t necessarily need to predict it with a model! But as far as illustration goes, I hope this blog has been helpful!
Now it’s time to try adding interaction terms to the model. My existing variables are bpm, val, and acous, so I could try adding combinations of these three. Alternatively, I could try adding interactions between these and other variables that were not significant when I worked through the backward stepwise selection process. Perhaps dB (loudness) becomes significant when interacting with val (valence), for example?
So I’ll create the following two interaction terms to try adding to my model. If this were a real analysis, I would try many more combinations to be sure that I have extracted all of the value from my existing options for predictors.
Note: I’m starting with the dataset where outliers have been removed.
And now I’ll divide my dataset into training and test sets, and build the model using the training set.
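As a sketch of what this step could look like in R: the two interaction terms and the 70/30 split follow the posts, but the tiny synthetic data frame below is a stand-in for my real dataset, so the column values and any output are illustrative only.

```r
# Synthetic stand-in for the outlier-free dataset; the real analysis
# uses the Spotify data loaded in earlier posts.
set.seed(1)
d <- data.frame(dnce  = runif(100, 30, 90),
                bpm   = runif(100, 80, 180),
                val   = runif(100, 10, 95),
                acous = runif(100, 0, 80),
                nrgy  = runif(100, 20, 95),
                dB    = runif(100, -12, -2))

# 70/30 training/test split
trn_rows <- sample(seq_len(nrow(d)), size = round(0.7 * nrow(d)))
trn <- d[trn_rows, ]
tst <- d[-trn_rows, ]

# In an R formula, nrgy:val and dB:val add just the interaction terms
# alongside the existing predictors
fit <- lm(dnce ~ bpm + val + acous + nrgy:val + dB:val, data = trn)
summary(fit)
```

The `a:b` syntax adds only the product term; writing `nrgy*val` instead would also add nrgy as a standalone predictor.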
It seems my first interaction, between nrgy (a song’s energy) and val, is significant and adds a little more than 5% to the original model’s R-squared of 34.91%. But the other is not significant and so does not contribute to the R-squared.
My interaction term has a negative slope. So I’d say that WHEN a song has high valence (a positive, uplifting mood), INCREASES in its nrgy tend to relate to DECREASED dnce. It acts like a slight correction (about .004 per unit of nrgy, scaled by the song’s val) to the predicted dnce for songs with high valence levels.
Interaction terms are always challenging to interpret, but it’s hard to deny that they add value to a model when they improve the R-squared by 5%. To be sure the model is actually improving (and I’m not just overfitting!), I’d want to see that prediction accuracy and stability have improved as well.
While investigating the distributions of my variables with R a few weeks ago, I could have looked for outliers; here I’ll look again and consider removing any unusual observations.
First, I’ll draw charts of dnce versus each of my three significant (and highly valuable) predictors. Keep in mind that outliers should be removed before modeling begins; you wouldn’t want outliers in your TRAINING set or your TEST set, right? So I’ll use the original dataset, d, to investigate whether it contains outliers.
I don’t see too many extremes, but I do see values of bpm, val, and acous that are equal to 0, and that seems quite strange. So I won’t focus my efforts on trimming top and bottom percentiles, but rather on removing that odd behavior.
Let’s count how often these things occur before removing them:
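A count like this is a one-liner in R, since logical comparisons sum as 1s and 0s. The small data frame below is a made-up stand-in for d (with a few implausible 0s planted), so the counts are illustrative only.

```r
# Made-up stand-in for the real dataset d
d <- data.frame(dnce  = c(0, 55, 70, 62, 48),
                bpm   = c(120, 0, 98, 140, 115),
                val   = c(40, 65, 0, 80, 33),
                acous = c(12, 0, 7, 0, 25))

# TRUE/FALSE comparisons count as 1/0 when summed, giving a count per column
colSums(d[, c("dnce", "bpm", "val", "acous")] == 0)
```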
In this case, there are no real consequences to removing observations where the response (dnce) or a predictor (bpm, val, or acous) is 0, so I’ll go ahead and remove them here. In most scenarios, though, a real-world analysis would come with real consequences, and a decision like this would deserve more care.
Here’s the code I’ll use to remove the 0 observations:
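The original code appeared as an image, so here is a sketch of what the removal step could look like, again on a made-up stand-in for d: keep only the rows where none of the four variables is 0.

```r
# Made-up stand-in for d; two rows contain an implausible 0
d <- data.frame(dnce  = c(0, 55, 70, 62, 48),
                bpm   = c(120, 130, 98, 140, 115),
                val   = c(40, 65, 0, 80, 33),
                acous = c(12, 9, 7, 14, 25))

# Keep only rows where none of the four variables is 0
dd <- d[d$dnce != 0 & d$bpm != 0 & d$val != 0 & d$acous != 0, ]
nrow(dd)  # rows remaining after removal
```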
My dataset now has 530 observations, which is not very many. This is something I might be concerned about if my prediction accuracy starts to suffer. But for now, let’s just see if removing these outliers improves the overall fit of our model to the other, more typical, observations in the dataset.
Now I’ll divide the dataset and build my model:
If you recall, the original dataset (before removing outliers) had an R-squared of 34.91%. Now that I’ve removed these strange observations, my R-squared has increased to 37.64%.
Taking it a step further, it would also be a good idea to check the prediction accuracy and stability before and after removing the outliers to be sure the model is really improving since I have a much smaller sample size now.
Removing outliers is an easy way to improve overall model fit to the typical observations of the response variable, especially when compared to collecting more observations or adding new predictor variables you don’t already have.
Overfitting is the act of adding so many variables to a model that it fits the training data “perfectly” but fails to generalize to new data.
As we aim to build a great model that explains variation in our response, we want to try adding all of the possible predictor variables. The important thing is that we evaluate the contribution of each one–that’s where backward stepwise selection can be helpful.
It is easiest to illustrate this with a linear regression model (because we have R-squared to help judge model “goodness”) but this is a concept that affects all types of models. For my linear regression model, I’ll use dnce as the continuous response variable (as in previous posts). Keep in mind that all model building should happen in the training data only.
I’ve divided my data into training and tests sets as in previous posts (assigning 70% of subjects to the training set) using the following code:
I have 13 variables in my dataset, so I’ll use dnce as the response and 10 of the others as predictors. Keep in mind, I won’t be including (1) artists and (2) top genre, since they are not numerical predictors. Let’s just see what happens:
In the model above with 10 predictors, we can see that only 3 are chosen as significant contributors to the explanatory power of the model: bpm, val, and acous. Given all of these choices, the others were NOT useful contributors even if they would be chosen among a smaller group of variables as significant contributors.
So my new baseline model is:
The R-squared does dip slightly with fewer variables, but this is just a more honest picture of reality, since we have included only significant predictors.
The R-squared for this model indicates that bpm, val, and acous are explaining about 34.9% of the variability in dnce from song to song.
But are any of those variables contributing very little? If there is something contributing only a fraction of a percent, we should consider removing it. To make that determination, we’ll use backward stepwise selection.
So we’ll try removing each of these three variables, one at a time:
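In R, this round of backward stepwise selection amounts to fitting the full three-predictor model and three reduced models, then comparing R-squared values. This is a sketch on synthetic stand-in data; the percentages quoted in the post come from the real dataset.

```r
# Synthetic stand-in for the training set
set.seed(2)
trn <- data.frame(bpm   = runif(80, 80, 180),
                  val   = runif(80, 10, 95),
                  acous = runif(80, 0, 80))
trn$dnce <- 70 - 0.1 * trn$bpm + 0.3 * trn$val - 0.05 * trn$acous + rnorm(80, 0, 8)

r2 <- function(m) summary(m)$r.squared

full <- lm(dnce ~ bpm + val + acous, data = trn)

# Each variable's contribution = full R-squared minus the reduced R-squared
c(full     = r2(full),
  no_acous = r2(lm(dnce ~ bpm + val,   data = trn)),
  no_val   = r2(lm(dnce ~ bpm + acous, data = trn)),
  no_bpm   = r2(lm(dnce ~ val + acous, data = trn)))
```

On training data, dropping a predictor can never raise R-squared, so each difference from the full model is that predictor's contribution.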
As you can see, when acous, or acousticness, is removed, R-squared drops from about 34.9% to 33.06%; acousticness is contributing about 1.84% to the R-squared.
When val, or valence, is removed, R-squared drops from about 34.9% to 11.88%; valence is contributing about 23% to the R-squared.
When bpm, or beats per minute, is removed, R-squared drops from about 34.9% to 30.57%; bpm is contributing about 4.33% to the R-squared.
Conclusions: In this first round, where each of the 3 variables is removed one at a time, two of the three are contributing a substantial amount to R-squared, with the third, acousticness, contributing only 1.84%. However, I won’t remove acousticness as a predictor, since its contribution is still higher than half a percent.
Now I feel more confident that each of my three predictors is helping to create the best possible linear regression model I can generate from this dataset. All variables have been explored, these three are all contributing substantially to the solution, and no other variables in this set should be included.
Now let’s return to the logistic regression model discussed in previous posts and assess its prediction accuracy and stability.
First, I’ll divide the data into training and test sets by selecting 70% of subjects for the training set. I’ll build the model in the training data and compare prediction performance in the training set vs. the unseen subjects in the test set.
The code above creates two new datasets, and their dimensions are shown below:
Now we can build our logistic regression model to predict instances of high danceability in the training data only:
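A sketch of the model-building call (on synthetic stand-in data): I’m assuming highdnce was defined as dnce above 50, per the earlier posts, and the `family = binomial` argument is what makes glm() fit a logistic regression.

```r
# Synthetic stand-in for the training set
set.seed(3)
trn <- data.frame(nrgy = runif(120, 20, 95),
                  bpm  = runif(120, 80, 180),
                  dB   = runif(120, -12, -2))
# Simulated binary response: 1 = high danceability
trn$highdnce <- rbinom(120, 1, plogis(3 + 0.04 * trn$nrgy - 0.03 * trn$bpm))

# family = binomial makes this a logistic regression
fit <- glm(highdnce ~ nrgy + bpm + dB, data = trn, family = binomial)
summary(fit)
```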
For the model run on the training dataset (as opposed to the full dataset, as was done before in the Logistic Regression post), we can see that dB, which represents loudness, is again not a significant predictor. Its lack of significance is consistent, appearing in both the full dataset and the training set. In addition, we can see that our intercept is not as significant as in the full dataset (which had 3 stars), but it is still significant with one star.
Note: I should go back and look to see if including this predictor changes any of my tests of assumptions!
So the model equation is:
log(p/(1-p)) = 3.01 + .04*nrgy – .03*bpm , where p is probability of high danceability within a song.
***Note: Notice how I did not include dB. Since it is not a significant predictor, there was no need to include it in the model equation.
Now I can make predictions for the test set using this model equation:
Once I have predictions I can round those probabilities into 0s and 1s, officially stamping a NO or YES prediction for each subject:
Now trnactual, trnpred, tstactual, and tstpred are lists of 0s and 1s for each subject, and I can use confusion matrices to compare my predictions (trnpred and tstpred) to the actual song titles’ values (trnactual and tstactual).
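The cross-tabulation itself is one call to table(). Here is a sketch with short made-up 0/1 vectors standing in for trnactual and trnpred (tstactual and tstpred work identically); prop.table() turns the counts into the percentages discussed below.

```r
# Made-up stand-ins for the actual and predicted 0/1 values
trnactual <- c(1, 1, 1, 0, 1, 1, 0, 1, 1, 1)
trnpred   <- c(1, 1, 1, 1, 1, 0, 0, 1, 1, 1)

# Counts, then percentages of all observations
table(actual = trnactual, predicted = trnpred)
round(100 * prop.table(table(actual = trnactual, predicted = trnpred)), 1)
```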
None of the code to build confusion matrices ever changes! So this code is always useful once I have built the trnactual, trnpred, tstactual, and tstpred variables. The R output for the confusion matrices is shown below:
Stability: When I look from the training data to the test data, the percentages correct/incorrect for the same categories are pretty different. This is not a good sign, so I’d say the model has poor stability.
Accuracy: My model is correct whenever it predicts a 0 for a song without high dnce (0) or a 1 for a song with high dnce (1). So it is correct 88.7% of the time, almost 89% (totaling the 1.7% for 0/0 and 87% for 1/1). That means my model is making a mistake about 11% of the time. In this case, I would say the model is pretty accurate, considering we are predicting songs’ danceability. However, in other cases, an 11% error rate could be way too high and make the model not accurate enough.
Further, when the model makes a mistake, it is predicting that a song HAS high danceability (above 50) when it does NOT about 8.8% of the time, and predicting that a song does NOT have high danceability when it DOES about 2.2% of the time. These percentages are rather low, so our model is pretty accurate. But it still has poor stability.
In order to assess a model’s prediction capability, it is important to divide the data into training set and test set.
First, I need to divide the data into two parts: 1) TRAINING–to use as a “working set” to build the model with, and 2) TEST–to use exclusively to test the model I have already created.
Since my data is organized by time (by year, to be exact), I’ll need to randomly divide the songs into two sets so that neither set is biased toward particular years:
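A common way to do a random 70/30 split in R looks like this. The set.seed() call makes the split reproducible, and the two-column data frame is just a stand-in for the real 603-song dataset.

```r
set.seed(42)  # reproducible split

# Stand-in for the 603-song dataset
d <- data.frame(dnce = runif(603, 20, 95),
                nrgy = runif(603, 20, 95))

# Randomly choose 70% of the row numbers for training
trn_rows <- sample(seq_len(nrow(d)), size = round(0.7 * nrow(d)))
trn <- d[trn_rows, ]   # 70% training set
tst <- d[-trn_rows, ]  # remaining 30% test set
c(train = nrow(trn), test = nrow(tst))
```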
Now I have two datasets to use. The training data contains 70% of the original 603 observations, and the test data contains 30%.
My next step will be to build the model:
And the output is shown below:
So its model equation is:
dnce = 69.8 + .12*nrgy – .11*bpm + .28*dB
If you remember, this model equation is slightly different from my original (in one of my previous posts, Testing Assumptions of Linear Regression), but that’s because the data is a little bit different. Only 70% of the original data is being used here, so there are natural fluctuations, but the coefficients are all staying relatively close to their original estimates (which is good!).
Now I’ll make predictions for the songs in the test set using the model equation above and calculate their errors:
Once I have calculated the errors for the test set, I can draw histograms of the errors in both sets and look at the average size of the errors in each:
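The steps above can be sketched in R as follows. The synthetic training and test sets stand in for my real split, so the numbers are illustrative; errors are actual dnce minus predicted dnce, and the mean absolute error summarizes their average size.

```r
# Synthetic stand-ins for the training and test sets
set.seed(4)
make_songs <- function(n) {
  s <- data.frame(nrgy = runif(n, 20, 95),
                  bpm  = runif(n, 80, 180),
                  dB   = runif(n, -12, -2))
  s$dnce <- 70 + 0.12 * s$nrgy - 0.11 * s$bpm + 0.28 * s$dB + rnorm(n, 0, 10)
  s
}
trn <- make_songs(80)
tst <- make_songs(30)

fit <- lm(dnce ~ nrgy + bpm + dB, data = trn)

# Errors = actual dnce minus predicted dnce
trn_err <- trn$dnce - predict(fit, newdata = trn)
tst_err <- tst$dnce - predict(fit, newdata = tst)

# Histograms side by side, then average error size in each set
par(mfrow = c(1, 2))
hist(trn_err, main = "Training errors")
hist(tst_err, main = "Test errors")
c(train = mean(abs(trn_err)), test = mean(abs(tst_err)))
```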
As I look at these charts to assess accuracy and stability, can you see what I see?
Stability: The two histograms do not differ much. They’re both centered at 0 and have most observations within say 15 dnce points (we are predicting danceability within songs, so errors are in the same unit). And the means are similar. So the model is pretty stable.
Accuracy: On average, the model is off by just under 11 points when estimating dnce. That’s pretty far! If a song’s dnce was 50 and my music producer estimated it needed to be 61 to even be considered a hit song, an error that size could make all the difference. In addition, the histograms show that there are some songs whose dnce I have under- or overestimated by 30 or 40 points. So the model is HORRIBLY INACCURATE, even though it is stable.
It is important that a model be both stable and accurate, not just one or the other. If my model is stable but not accurate, that means it is consistently “bad”!
Now it’s time to test the assumptions and requirements of logistic regression models, just as we learned to do for linear regression models.
Recall, the logistic regression model I created in the last post is shown below:
Its model equation, then, would look like this:
log(p/(1-p)) = 3.47 + .03*nrgy – .03*bpm where p = probability of high danceability.
Logistic regression models have 6 total checks.
Good, Linear Model
As before in the design of a good model, it would be important to be sure we have included all relevant variables and excluded all irrelevant variables. Since we’re just practicing our modeling skills, we’ll just proceed through this one.
No Perfect Multicollinearity
Multicollinearity still matters in a logistic regression model! The image below shows the code and result as I test my logistic regression model for multicollinearity:
Since dB, also known as loudness, was not significant, we don’t need to include it in our assessment! Just considering the correlation between nrgy and bpm, we do not have any perfect multicollinearity (or even regular old multicollinearity!) in our model.
Independent Errors
Our test for independent errors relies solely on our intuition now, not on charts and statistical tests. But my intuition remains the same as before: if our dataset contains only one song from each artist, then we will have independent errors, so let’s proceed under that assumption.
Complete Information
Complete information is hard to test! But we can start by checking whether we represent a range of nrgy and bpm, as well as instances of high danceability. Below are the histograms used to assess the data in the Investigating my data with R post, with the addition of a histogram of d$highdnce, the binary variable I created representing high danceability. There is only a slight mix in highdnce, since most songs show high danceability. However, we do have a range of observations of nrgy, bpm, and danceability, with no obvious missing data:
I can see that we have mostly mid-range values of bpm, and there are indeed a few song titles with nrgy levels on the lower end. As for dB, it shows a fairly limited range of values. As for highdnce, we have a mix of 0s and 1s (not equal numbers; significantly more 1s than 0s). With only 603 observations, we don’t have “plenty” of data, but it seems our data is reasonably distributed and that we have a reasonable amount of it.
No Complete Separation
To test for complete separation, I’ve drawn two scatterplots:
Because no vertical line can be drawn in either plot to separate songs with and without highdnce by nrgy or bpm, this model is not suffering from complete separation.
Large Sample Size
Typically a logistic regression model requires thousands of observations, and this dataset only has 603. So while my model seems to pass the other requirements, it does not have a large sample size. The model can still run (as is shown in the R output), but we may see that its prediction accuracy is poor as a result of this sample size issue.
Soon we will learn to assess prediction accuracy in our models. If we see that this model has poor prediction accuracy, collecting more data could be one way to improve it.
Now that we’ve learned logistic regression, I can start working to understand / predict instances of danceability within song titles in my dataset.
I’m building this model:
The output from R appears below:
Loudness, or dB in this case, is not significant using our typical .05 threshold, with a “.” indicating its significance level. So we will assign the slope for dB to be 0 in this dataset.
The model equation, then, should be written:
log(p/(1-p)) = 3.47 + .03*nrgy – .03*bpm where p = probability of high danceability.
The intercept and slopes for nrgy and bpm are all significant.
The intercept of 3.47 can be interpreted this way: for song titles with nrgy and bpm of 0, high danceability is very likely. *Remember the rule of thumb for interpretation: when the log odds are < -3, p is near 0; when the log odds are > 3, p is near 1; and when the log odds are 0, p is .5. High danceability is extremely likely when nrgy and bpm are 0, because p comes out to .97, very close to 1.
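The arithmetic behind that interpretation is just the inverse logit applied to the intercept:

```r
logodds <- 3.47  # intercept: the log odds when nrgy = 0 and bpm = 0

# Inverse logit converts log odds to a probability
p <- exp(logodds) / (1 + exp(logodds))
round(p, 2)  # 0.97, so high danceability is very likely
```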
For nrgy, we can say that increased nrgy increases the likelihood a song title has high levels of danceability.
And for bpm, we can say that decreased bpm increases the likelihood a song title has high levels of danceability.
dB, also known as loudness, is not significant, so we have assigned its slope to 0, and we can say that for this dataset and this model predicting instance of high danceability, loudness does not seem to contribute meaningfully when we also have nrgy and bpm present.
In this post, I’ll test the five assumptions for linear regression on the model I’ve been developing in the last few posts. The model equation is shown below to remind readers (and me!) of the model I’m working with:
dnce = 78.39 + .05*nrgy – .10*bpm + 1.12*dB
As we discussed in class, the first assumption (that we have a good, linear model) is a model design question, so I’ll skip over that and test assumptions 2-5, which are more focused on implementation.
Assumption #2: No Perfect Multicollinearity
As previously discussed, there was only one potentially concerning correlation among our predictors. The correlation matrix is shown again below as a reminder. There are no off-diagonal 1 or -1 entries, so we do not have perfect multicollinearity here. However, loudness and energy share a correlation of .54 (which is > .25), large enough to be of concern, so there is some multicollinearity in this model.
Assumption #3: Independent Errors
First, we’ll use our intuition to test this assumption. To pass, we need independent observations: the artists studied need to be independent. If we have multiple top hit songs from the same artist, then the observations would NOT be independent (and we’d violate this assumption). Let’s proceed assuming that the study design would only include one song from each artist. (This would be an issue to investigate if we had collected the data ourselves, or if we were doing a real analysis where the findings would truly impact people!)
Second, let’s plot the residuals to see that there is no pattern (if there is, then perhaps the errors would be related). Remember that we need to actually BUILD the model before we can use the code below (because “fit” is the model we’ve built).
The errors do seem to be randomly distributed around 0, so the observations do appear to be independent.
Third, we’ll look at the ACF plot to see whether errors are correlated with one another:
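The ACF check is a one-liner once the fitted model exists. Here is a sketch with a synthetic fit standing in for mine; the dashed bounds acf() draws are the significance thresholds mentioned below.

```r
# Synthetic stand-in for the fitted model
set.seed(5)
trn <- data.frame(x = runif(100, 0, 10))
trn$y <- 2 * trn$x + rnorm(100)
fit <- lm(y ~ x, data = trn)

# Autocorrelation of the residuals; bars crossing the dashed bounds
# at lags >= 1 would suggest correlated errors
acf(resid(fit))
```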
Looking at the first lag (Lag=1), we can see it is NOT significant (it does not go through the blue threshold on either side of 0 that determines significance). So we do not have significant correlation between each error and the error in the observation NEXT to it.
Based on the results of these three tests, my model passes Assumption #3.
Assumption #4: No Heteroskedasticity
For this, we’ll need to do 2 tests. In the first, we’ll plot each X variable vs. Y to be sure that from LEFT TO RIGHT, observations are distributed evenly. In the second, we’ll plot fit$resid vs. fit$fitted to see that there is a random distribution (errors are not “predictable” in any way). I’ll plot them all at once, like this:
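A sketch of those four plots on synthetic stand-in data (the real version uses dnce, nrgy, bpm, and dB from my training set):

```r
# Synthetic stand-in for the training set and fitted model
set.seed(6)
trn <- data.frame(nrgy = runif(80, 20, 95),
                  bpm  = runif(80, 80, 180),
                  dB   = runif(80, -12, -2))
trn$dnce <- 70 + 0.05 * trn$nrgy - 0.10 * trn$bpm + 1.12 * trn$dB + rnorm(80, 0, 10)
fit <- lm(dnce ~ nrgy + bpm + dB, data = trn)

par(mfrow = c(2, 2))
plot(trn$nrgy, trn$dnce)  # each X vs. Y, checking spread left to right
plot(trn$bpm, trn$dnce)
plot(trn$dB, trn$dnce)
plot(fit$fitted.values, fit$residuals)  # predictions vs. errors
abline(h = 0, lty = 2)
```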
The plot of dnce vs. nrgy does seem to show a bit more range of dnce for the higher-energy songs in the study. However, I would say that from left to right there is not much of a change, and all three pass the assumption test. In the last chart, of predictions vs. errors, we are looking for randomness rather than a pattern. We can see a few outliers, but it looks mostly random, indicating no evidence of heteroskedasticity. If we wanted to investigate further, we could pick out those outliers and see what’s unusual about them; maybe they became popular due to having different traits from the majority of the top songs on the list.
So this model does pass Assumption #4!
Assumption #5: Normally Distributed Errors
For this test, we just need to look at ONE of the charts we’ve covered in class. I’ll show all four, just for good measure, and I’ll plot them all together using the code below.
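A sketch of all four charts (histogram and Q-Q plot of the response, and of the errors), again using a synthetic fit as a stand-in for mine:

```r
# Synthetic stand-in for the training set and fitted model
set.seed(7)
trn <- data.frame(x = runif(100, 0, 10))
trn$dnce <- 60 + 2 * trn$x + rnorm(100, 0, 8)
fit <- lm(dnce ~ x, data = trn)

par(mfrow = c(2, 2))
hist(trn$dnce)                              # response distribution
hist(fit$residuals)                         # error distribution
qqnorm(trn$dnce); qqline(trn$dnce)          # Q-Q plot of response
qqnorm(fit$residuals); qqline(fit$residuals)  # Q-Q plot of errors
```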
While the histograms look fairly normal, the Q-Q plots show one strange deviation from normality in the lower percentiles of dnce (and error). We can conclude that our errors are relatively normally distributed; perhaps that one outlier is the same observation we saw in the fourth chart used to test Assumption #4. I’d say my model passes Assumption #5.
With no violations of any of the assumption tests, we can move further with this model! As I mentioned before, if we wanted to further investigate the outliers, we could pick out those specific songs and see if they’re special for some reason or whether they could be studied separately and removed from our dataset temporarily.
Of course, we’ll continue to illustrate our new skills with this data, and when we get to the end of the process, our conclusions can note that we found no violations of assumptions in our model. This means our significance tests should be free of bias and shouldn’t undermine any interesting results we may have found.
If we were truly studying the relationship between dnce and these variables to make recommendations for an artist whose sole goal in creating a song is to make it onto Spotify’s top songs of the year list, we would need to address these concerns.
In my last post, I presented a model with 3 predictors, though not all were significant:
dnce = 78.39 + .05*nrgy – .10*bpm + 1.12*dB
Now it’s time to test this model for multicollinearity. Multicollinearity is correlation among predictors (X variables). So I need to look at the correlations among my 3 predictor variables. I can do this by looking at correlations in pairs with R’s “cor” function, or I can create a correlation matrix:
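The matrix comes from a single cor() call on the three predictor columns. This sketch uses stand-in data (with dB deliberately built to correlate with nrgy), so the numbers it prints differ from the .13/.54/.18 values quoted below, which come from the real dataset.

```r
# Stand-in for the real dataset's three predictor columns
set.seed(8)
d <- data.frame(nrgy = runif(100, 20, 95),
                bpm  = runif(100, 80, 180))
d$dB <- -12 + 0.08 * d$nrgy + rnorm(100)  # built to correlate with nrgy

# Correlation matrix of the three predictors
round(cor(d[, c("nrgy", "bpm", "dB")]), 2)
```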
The code above produces the following result:
One feature of a correlation matrix is that it is symmetric; measurements above the diagonal match those below (because the correlation between variable 1 and variable 2 is the same as the correlation between variable 2 and variable 1).
The diagonal of this matrix is filled with “1”s because everything has a perfect correlation with itself.
But the NEW information here is the 3 correlations present among the 3 pairs of X variables in the model.
Energy levels and beats per minute have a correlation of .13 (not large enough to be a concern).
Loudness and energy levels have a correlation of .54.
Loudness and beats per minute have a correlation of .18.
Loudness and energy levels share a correlation of .54 (which is > .25), large enough to be of concern, so there is multicollinearity in this model.