Thursday, July 24, 2014

Similarities and Differences Between Predictive Analytics and Business Intelligence

I've been reminded recently of the overlap between business intelligence and predictive analytics. Of course, any reader of this blog (or at least of its title) knows I live in the world of data mining (DM) and predictive analytics (PA), not the world of business intelligence (BI). In general, I don't comment on BI because I am an outsider looking in. Nevertheless, I view BI as a sibling to PA because we have so much in common: we use the same data, often use similar metrics, and sometimes even use the same tools in our analyses.

I was interviewed by Victoria Garment of Software Advice on the topic of testing the accuracy of predictive models in January, 2014 (I think I was first contacted about the interview in December, 2013). What I didn't know was that John Elder and Karl Rexer, two good friends and colleagues in this space, were interviewed as well. The resulting article, "3 Ways to Test the Accuracy of Your Predictive Models," posted on their Plotting Success blog, was well written and generated quite a bit of buzz on Twitter after it was posted.

Prior to the interview, I had no knowledge of Software Advice, and after looking at their blog I understand why: it is clearly a BI blog. But after reading maybe a dozen posts, it is clear that we are siblings, in particular sharing concepts and approaches in big data, data science, staffing, and talent acquisition. I've enjoyed going back to the blog.

The similarities of BI and PA are points I’ve tried to make in talks I’ve given at eMetrics and performance management conferences. After making suitable translations of terms, these two fields can understand each other well. Two sample differences in terminology are described here.

First, one rarely hears the term KPI at a PA conference, but will often hear it at BI conferences. If we use Google as an indicator of the popularity of the term KPI,
  • ' "predictive analytics" KPI ' yielded a mere 103,000 hits, whereas
  • ' "business intelligence" KPI ' yielded 1,510,000 hits.
In PA, one is more likely to hear these ideas described as metrics, or even as features or derived variables that can be used as inputs to models or as a target variable.

As a second example, a "use case" is frequently presented at BI conferences to explain the reason for creating a particular KPI or analysis. "Use cases" are rarely described at PA conferences; in PA we say "case studies." Back to Google, we find
  • ' "business intelligence" "use case" ' – 306,000 hits
  • ' "predictive analytics" "use case" ' – 58,800 hits
  • ' "predictive analytics" "case study" ' – 217,000 hits


Interestingly, the top two links for “predictive analytics” “use case” from the search weren’t even predictive analytics use cases or case studies. The second link of the two actually described how predictive analytics is a use case for cloud computing.

The BI community, however, seems to embrace PA and even consider it part of BI (much to the chagrin of the PA community, I would think). According to the Wikipedia entry on BI, the following chart shows topics that are a part of BI:


Interestingly, DM, PA, and even Prescriptive Analytics are considered a part of BI. I must admit, at all the DM and PA conferences I’ve attended, I’ve never heard attendees describe themselves as BI practitioners. I have heard more cross-branding of BI and PA at other conferences that include BI-specific material, like Performance Management and Web Analytics conferences.

Contrast this with the PA Wikipedia page. This taxonomy of fields related to PA is typical. I would personally include dashed lines to Text Mining and maybe even Link Analysis or Social Networks, as they are related though not directly under PA. Interestingly, statistics falls under PA here, which I'm sure is to the chagrin of statisticians! And I would guess that at a statistics conference, the attendees would not refer to themselves as predictive modelers. But maybe they would consider themselves data scientists! Alas, that's another topic altogether. But that is the way these kinds of lists go; they are difficult to perfect and usually generate discussion over where the dividing lines occur.


This tendency to include fields as part of "our own" is a trap most of us fall into: we tend to be myopic in our views of the fields of study. It frankly reminds me of a map I remember hanging in my house growing up in Natick, MA: "A Bostonian's Idea of The United States of America." Clearly, Cape Cod is far more important than Florida or even California!


Be that as it may, my final point is that BI and PA are both important, complementary disciplines. BI is a much larger field, and understandably so. PA is more of a specialty, but a specialty that is gaining visibility and recognition as an important skill set to have in any organization. Here's to further collaboration in the future!

Monday, May 26, 2014

Why Overfitting is More Dangerous than Just Poor Accuracy, Part II

In Part I, I explained one problem with overfitting the data: estimates of the target variable in regions without any training data can be unstable, whether those regions require the model to interpolate or extrapolate. Accuracy is a problem, but more precisely, the problems with interpolation and extrapolation are not revealed by any accuracy metric; they only arise when new data points are encountered after the model is deployed.
This month, a second problem with overfitting is described: unreliable model interpretation. Predictive modeling algorithms find variables that associate or correlate with the target variable. When models are overfit, the algorithm has latched onto variables that appear to be strongly associated with the target, but these relationships are not repeatable; the variables may not be related to the target at all. When we interpret what the model is telling us, we therefore glean the wrong insights, and those insights can be difficult to shed even after we rebuild the models to simplify them and avoid overfitting.
Consider an example from the 1998 KDD Cup data. One variable, RFA_3, has 70 levels (71 if we include the missing values), a case of a high-cardinality input variable. A decision tree may try to group all levels with the highest association with the target variable, TARGET_B, a categorical variable with labels 0 for non-responders and 1 for responders to a mailing campaign.
RFA_3 turns out to be one of the top predictors when building decision trees, and the tree tries to group together the levels with the highest average rates of TARGET_B equal to 1. The table below shows the 10 highest rates along with the counts of how many records match each value of RFA_3. The question is this: when a value like L4G matches only 10 records, one of which is a responder (a 10% response rate), do we believe it? How sure are we that the 10% rate measured in our sample is reproducible for the next 10 records with the value L4G?
We can gain some insight by applying a simple statistical test, like a binomial distribution test you can find online. The upper and lower bounds of the measured rate, given the sample size, are shown in the table as well (see the short sketch after the table for one way to compute them). For L4G, we are 95% sure from the statistical test that L4G will have a rate between 0% and 28.6%. This means the 10% rate we measured in the small sample could really, in the long run, be 1%. Or it could be 20%. We just don't know.
RFA_3   Count   TARGET_B = 1 Percent   CI Lower Bound   CI Upper Bound   95% Confidence Above Average
A2C         1                 100.0%                                     No
S4B         2                  50.0%                                     No
S4C         9                  11.1%             0.0%            31.6%   No
L4G        10                  10.0%             0.0%            28.6%   No
N1E        46                   8.7%             0.6%            16.8%   No
S2D       200                  11.0%             6.7%            15.3%   Yes
A4D     1,867                   9.1%             7.8%            10.4%   Yes
S3D     1,989                   9.4%             8.1%            10.7%   Yes
S3E     2,262                   8.7%             7.5%             9.9%   Yes
S4D     2,675                   9.6%             8.5%            10.7%   Yes

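The post doesn't say which online test produced these bounds, but a plain normal-approximation (Wald) interval for a proportion, clipped at zero, reproduces the numbers in the table. Here is a minimal Python sketch (the 22 responders for S2D are inferred from 11.0% of 200 records):

    import math

    def wald_interval(responders, n, z=1.96):
        """95% normal-approximation (Wald) interval for a response rate, clipped to [0, 1].
        Wilson or exact (Clopper-Pearson) intervals are better choices for tiny counts,
        but this simple version matches the bounds shown in the table."""
        p = responders / n
        half_width = z * math.sqrt(p * (1 - p) / n)
        return max(0.0, p - half_width), min(1.0, p + half_width)

    # L4G: 1 responder out of 10 records -> roughly 0.0% to 28.6%
    low, high = wald_interval(1, 10)
    print(f"L4G: {low:.1%} to {high:.1%}")

    # S2D: 22 responders out of 200 records (11.0%) -> roughly 6.7% to 15.3%
    low, high = wald_interval(22, 200)
    print(f"S2D: {low:.1%} to {high:.1%}")

The wide interval for L4G is the whole point: with only 10 records, the measured 10% rate is consistent with anything from essentially zero to nearly 30%.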
For the 1998 KDD Cup data, it turns out that RFA_3 isn't one of the better predictors of TARGET_B; it only showed up as a significant predictor when overfitting reared its ugly head.
The solution? Beware of overfitting. For high-cardinality variables, apply a complexity penalty to reduce the likelihood of finding these low-count associations. For continuous variables the problem exists as well and can be just as deceptive. And for every problem you are solving, resample the data so that models are assessed on held-out (testing) data, with cross-validation, or with bootstrap sampling; a brief sketch of this kind of assessment follows.
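None of this code is from the original post, but as a rough sketch of that advice in scikit-learn (synthetic data, with the ccp_alpha cost-complexity pruning parameter standing in for whatever complexity penalty your tool offers), compare training accuracy against held-out and cross-validated accuracy:

    # Sketch: judge models on held-out and cross-validated accuracy, not training accuracy.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    models = {
        "unpruned tree": DecisionTreeClassifier(random_state=0),
        "pruned tree": DecisionTreeClassifier(ccp_alpha=0.005, random_state=0),  # complexity penalty
    }

    for name, model in models.items():
        model.fit(X_train, y_train)
        cv_acc = cross_val_score(model, X_train, y_train, cv=5).mean()
        print(f"{name}: train={model.score(X_train, y_train):.3f} "
              f"test={model.score(X_test, y_test):.3f} cv={cv_acc:.3f}")

The unpruned tree typically looks nearly perfect on the training data and noticeably worse on the testing and cross-validation estimates; that gap is the overfitting.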

note: this post first appeared in Predictive Analytics Times (with minor edits added here)


Thursday, May 01, 2014

Why Overfitting is More Dangerous than Just Poor Accuracy, Part I

Arguably, the most important safeguard in building predictive models is complexity regularization to avoid overfitting the data. When models are overfit, their accuracy is lower on new data that wasn’t seen during training, and therefore when these models are deployed, they will disappoint, sometimes even leading decision makers to believe that predictive modeling “doesn’t work”.

Overfitting, however, is thankfully a well-known problem, and every algorithm has ways to avoid it. CART® and C5 trees use pruning to remove branches that are prone to overfitting; CHAID trees require that splits be statistically significant before adding complexity to the tree. Neural networks use held-out data to stop training when accuracy on the held-out data begins to get worse. Stepwise regression uses information-theoretic criteria like the Akaike Information Criterion (AIC), Minimum Description Length (MDL), or the Bayesian Information Criterion (BIC) to add terms only when the additional complexity is offset by a large enough reduction in error.
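As an illustration only (my own synthetic example, using the statsmodels library), these information-theoretic criteria charge a penalty for every added term, so an extra predictor is kept only if it reduces the error enough to pay for that penalty:

    # Sketch: AIC/BIC penalize added terms, so an irrelevant predictor rarely pays its way.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    noise = rng.normal(size=200)        # an irrelevant candidate predictor
    y = 2.0 * x + rng.normal(size=200)  # the true relationship uses only x

    fit_small = sm.OLS(y, sm.add_constant(x)).fit()
    fit_big = sm.OLS(y, sm.add_constant(np.column_stack([x, noise]))).fit()

    # Lower AIC/BIC is better; the extra term's tiny error reduction usually
    # fails to offset the complexity penalty, so the smaller model wins.
    print(f"small model: AIC={fit_small.aic:.1f} BIC={fit_small.bic:.1f}")
    print(f"big model:   AIC={fit_big.aic:.1f} BIC={fit_big.bic:.1f}")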

But overfitting causes more problems than merely misclassifying cases in holdout data or incurring large errors in regression problems. Without loss of generality, this discussion describes overfitting only for classification problems, but the same principles apply to regression problems as well.

One way modelers reduce the likelihood of overfit is to apply the principle of Occam's Razor: if two models exhibit the same accuracy, we prefer the simpler model because it is more likely to generalize well. By simpler, we must keep in mind that we prefer models that behave more simply rather than models that merely appear simpler because they have fewer terms. John Elder (a regular contributor to the PA Times) has a fantastic discussion of this topic in the book by Seni and Elder, Ensemble Methods in Data Mining.

Consider this example contrasting linear and nonlinear models. The figure below shows decision boundaries for two models that separate two classes of the famous Iris data (http://archive.ics.uci.edu/ml/datasets/Iris). On the left is the decision boundary from a linear model built using linear discriminant analysis (like LDA or the Fisher Discriminant), and on the right is a decision boundary built by a model using quadratic discriminant analysis (like the Bayes Rule). The image can be found at http://scikit-learn.org/0.5/auto_examples/plot_lda_vs_qda.html.

It appears that the accuracy of both models is the same (let's assume that it is), yet the behavior of the two models is very different. If a new data point to be classified appears in the upper left of the plot, the LDA model will call it versicolor whereas the QDA model will call it virginica. Which is correct? We can't know from the training data, but we do know this: there is no justification in the data for increasing the complexity of the model from linear to quadratic. We would probably prefer the linear model here.
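The linked figure comes from a very old scikit-learn release; a rough sketch of the same comparison with the current API (the two features and the test point below are my own choices, not the original figure's) would look something like this:

    # Sketch: fit LDA (linear boundary) and QDA (quadratic boundary) to two Iris classes.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                               QuadraticDiscriminantAnalysis)

    iris = load_iris()
    mask = iris.target > 0                              # keep versicolor and virginica only
    X, y = iris.data[mask][:, 2:4], iris.target[mask]   # petal length and petal width

    lda = LinearDiscriminantAnalysis().fit(X, y)
    qda = QuadraticDiscriminantAnalysis().fit(X, y)
    print("LDA training accuracy:", lda.score(X, y))
    print("QDA training accuracy:", qda.score(X, y))

    # In a region with no training data the two models may disagree, and nothing
    # in the data tells us which answer to trust.
    far_point = np.array([[1.0, 2.5]])
    print("LDA:", iris.target_names[lda.predict(far_point)[0]])
    print("QDA:", iris.target_names[qda.predict(far_point)[0]])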



How models behave in regions without training data is the entire reason for avoiding overfit. The issue with the figure above was model behavior when extrapolating, where we want to make sure that models behave in a reasonable way for values outside (larger than or smaller than) the data used in training. But models also need to behave well when they interpolate, meaning we want models to behave reasonably for data in between the data points that exist in the training data.

Consider the second figure below showing decision boundaries for two models built from a data set derived from the famous KDD Cup data from 1998. The two dimensions in this plot are Average Donation Amount (Y) and Recent Donation Amount (X). This data tells the story that higher values of average and recent donation amounts are related to higher likelihoods of donors responding; note that for the smallest values of both average and recent donation amount, at the very bottom left of the data, the regions are colored cyan.


Both models are built using the Support Vector Machines (SVM) algorithm, but with different values of the complexity constant, C. Obviously, the model at the left is more complex than the model on the right. The magenta regions represent responders and the cyan regions represent non-responders.

In the effort to be more accurate on the training data, the model on the left creates closed decision boundaries around any and all groupings of responders. The model at the right joins these smaller blobs together into a larger blob where the model classifies data as responders. The complexity constant for the model at the right gives up accuracy to gain simplicity.

Which model is more believable? The one on the left will exhibit strange interpolation properties: data falling between the magenta blobs will be called non-responders, sometimes in very thin slivers between magenta regions, and this behavior isn't smooth or believable. The model at the right creates a single region of data to be classified as responders and is clearly better than the model at the left.
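As a sketch only (the two-moons toy data and the parameter values below are my substitutes for the KDD Cup donation variables), here is how the complexity constant C trades training accuracy for a simpler boundary in scikit-learn's SVC:

    # Sketch: a large C chases individual training points; a small C tolerates
    # some training errors in exchange for a smoother, simpler boundary.
    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=600, noise=0.35, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    complex_svm = SVC(kernel="rbf", C=1000.0, gamma=2.0).fit(X_train, y_train)
    simple_svm = SVC(kernel="rbf", C=0.5, gamma=2.0).fit(X_train, y_train)

    for name, model in [("C=1000", complex_svm), ("C=0.5", simple_svm)]:
        print(f"{name}: train={model.score(X_train, y_train):.3f} "
              f"test={model.score(X_test, y_test):.3f}")

The higher-C model generally scores better on the training data while the lower-C model holds up as well or better on the held-out data, mirroring the trade-off in the figures.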

Beware of overfitting the data, and test models not just on testing or validation data but, if possible, on values not in the data to ensure their behavior, whether interpolating or extrapolating, is believable.


In Part II, the problem overfitting causes for model interpretation will be addressed.