Data Science for Business
“A must-read resource for anyone who is serious
about embracing the opportunity of big data.”
— Craig Vaughan
Global Vice President at SAP
“This timely book says out loud what has finally become apparent: in the modern world,
Data is Business, and you can no longer think business without thinking data. Read this
book and you will understand the Science behind thinking data.”
— Ron Bekkerman
Chief Data Officer at Carmel Ventures
“A great book for business managers who lead or interact with data scientists, who wish to
better understand the principles and algorithms available without the technical details of
single-disciplinary books.”
— Ronny Kohavi
Partner Architect at Microsoft Online Services Division
“Provost and Fawcett have distilled their mastery of both the art and science of real-world
data analysis into an unrivalled introduction to the field.”
— Geoff Webb
Editor-in-Chief of Data Mining and Knowledge
Discovery Journal
“I would love it if everyone I had to work with had read this book.”
— Claudia Perlich
Chief Scientist of Dstillery and Advertising Research
Foundation Innovation Award Grand Winner (2013)
“A foundational piece in the fast developing world of Data Science.
A must read for anyone interested in the Big Data revolution.”
— Justin Gapper
Business Unit Analytics Manager
at Teledyne Scientific and Imaging
“The authors, both renowned experts in data science before it had a name, have taken a
complex topic and made it accessible to all levels, but mostly helpful to the budding data
scientist. As far as I know, this is the first book of its kind—with a focus on data science
concepts as applied to practical business problems. It is liberally sprinkled with compelling
real-world examples outlining familiar, accessible problems in the business world: customer
churn, targeted marketing, even whiskey analytics!
The book is unique in that it does not give a cookbook of algorithms, rather it helps the
reader understand the underlying concepts behind data science, and most importantly how
to approach and be successful at problem solving. Whether you are looking for a good
comprehensive overview of data science or are a budding data scientist in need of the basics,
this is a must-read.”
— Chris Volinsky
Director of Statistics Research at AT&T Labs and Winning
Team Member for the $1 Million Netflix Challenge
“This book goes beyond data analytics 101. It’s the essential guide for those of us (all of us?)
whose businesses are built on the ubiquity of data opportunities and the new mandate for
data-driven decision-making.”
— Tom Phillips
CEO of Dstillery and Former Head of
Google Search and Analytics
“Intelligent use of data has become a force powering business to new levels of
competitiveness. To thrive in this data-driven ecosystem, engineers, analysts, and managers
alike must understand the options, design choices, and tradeoffs before them. With
motivating examples, clear exposition, and a breadth of details covering not only the “hows”
but the “whys”, Data Science for Business is the perfect primer for those wishing to become
involved in the development and application of data-driven systems.”
— Josh Attenberg
Data Science Lead at Etsy
“Data is the foundation of new waves of productivity growth, innovation, and richer
customer insight. Only recently viewed broadly as a source of competitive advantage, dealing
well with data is rapidly becoming table stakes to stay in the game. The authors’ deep applied
experience makes this a must read—a window into your competitor’s strategy.”
— Alan Murray
Serial Entrepreneur; Partner at Coriolis Ventures
“One of the best data mining books, which helped me think through various ideas on
liquidity analysis in the FX business. The examples are excellent and help you take a deep
dive into the subject! This one is going to be on my shelf for lifetime!”
— Nidhi Kathuria
Vice President of FX at Royal Bank of Scotland
“An excellent and accessible primer to help businessfolk better appreciate the concepts, tools
and techniques employed by data scientists… and for data scientists to better appreciate the
business context in which their solutions are deployed.”
— Joe McCarthy
Director of Analytics and Data Science at Atigeo
“In my opinion it is the best book on Data Science and Big Data for a professional
understanding by business analysts and managers who must apply these techniques in the
practical world.”
— Ira Laefsky
MS Engineering (Computer Science)/MBA Information
Technology and Human Computer Interaction Researcher
formerly on the Senior Consulting Staff of Arthur D. Little, Inc.
and Digital Equipment Corporation
“With motivating examples, clear exposition and a breadth of details covering not only the
“hows” but the “whys,” Data Science for Business is the perfect primer for those wishing to
become involved in the development and application of data driven systems.”
— Ted O’Brien
Co-Founder / Director of Talent Acquisition at Starbridge
Partners and Publisher of the Data Science Report
Data Science for Business
by Foster Provost and Tom Fawcett
Copyright © 2013 Foster Provost and Tom Fawcett. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles. For more information, contact our corporate/institutional
sales department: XXXXXXXXXX or XXXXXXXXXX.
Editors: Mike Loukides and Meghan Blanchette
Production Editor: Christopher Hearse
Proofreader: Kiel Van Horn
Indexer: WordCo Indexing Services, Inc.
Cover Designer: Mark Paglietti
Interior Designer: David Futato
Illustrator: Rebecca Demarest
July 2013: First Edition
Revision History for the First Edition:
XXXXXXXXXX: First release
XXXXXXXXXX: Second release
See XXXXXXXXXXfor release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Many of the designations used by
manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations
appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been
printed in caps or initial caps. Data Science for Business is a trademark of Foster Provost and Tom Fawcett.
While every precaution has been taken in the preparation of this book, the publisher and authors assume
no responsibility for errors or omissions, or for damages resulting from the use of the information contained
For our fathers.
Table of Contents
Preface
1. Introduction: Data-Analytic Thinking
The Ubiquity of Data Opportunities
Example: Hurricane Frances
Example: Predicting Customer Churn
Data Science, Engineering, and Data-Driven Decision Making
Data Processing and “Big Data”
From Big Data 1.0 to Big Data 2.0
Data and Data Science Capability as a Strategic Asset
Data-Analytic Thinking
Data Mining and Data Science, Revisited
Chemistry Is Not About Test Tubes: Data Science Versus the Work of the Data Scientist
2. Business Problems and Data Science Solutions
Fundamental concepts: A set of canonical data mining tasks; The data mining process;
Supervised versus unsupervised data mining.
From Business Problems to Data Mining Tasks
Supervised Versus Unsupervised Methods
Data Mining and Its Results
The Data Mining Process
Business Understanding
Data Understanding
Data Preparation
Evaluation
Deployment
Implications for Managing the Data Science Team
Other Analytics Techniques and Technologies
Statistics
Database Querying
Data Warehousing
Regression Analysis
Machine Learning and Data Mining
Answering Business Questions with These Techniques
3. Introduction to Predictive Modeling: From Correlation to Supervised Segmentation
Fundamental concepts: Identifying informative attributes; Segmenting data by
progressive attribute selection.
Exemplary techniques: Finding correlations; Attribute/variable selection; Tree
Models, Induction, and Prediction
Supervised Segmentation
Selecting Informative Attributes
Example: Attribute Selection with Information Gain
Supervised Segmentation with Tree-Structured Models
Visualizing Segmentations
Trees as Sets of Rules
Probability Estimation
Example: Addressing the Churn Problem with Tree Induction
4. Fitting a Model to Data
Fundamental concepts: Finding “optimal” model parameters based on data; Choosing
the goal for data mining; Objective functions; Loss functions.
Exemplary techniques: Linear regression; Logistic regression; Support-vector machines.
Classification via Mathematical Functions
Linear Discriminant Functions
Optimizing an Objective Function
An Example of Mining a Linear Discriminant from Data
Linear Discriminant Functions for Scoring and Ranking Instances
Support Vector Machines, Briefly
Regression via Mathematical Functions
Class Probability Estimation and Logistic “Regression”
* Logistic Regression: Some Technical Details
Example: Logistic Regression versus Tree Induction
Nonlinear Functions, Support Vector Machines, and Neural Networks
5. Overfitting and Its Avoidance
Fundamental concepts: Generalization; Fitting and overfitting; Complexity control.
Exemplary techniques: Cross-validation; Attribute selection; Tree pruning;
Generalization
Overfitting
Overfitting Examined
Holdout Data and Fitting Graphs
Overfitting in Tree Induction
Overfitting in Mathematical Functions
Example: Overfitting Linear Functions
* Example: Why Is Overfitting Bad?
From Holdout Evaluation to Cross-Validation
The Churn Dataset Revisited
Learning Curves
Overfitting Avoidance and Complexity Control
Avoiding Overfitting with Tree Induction
A General Method for Avoiding Overfitting
* Avoiding Overfitting for Parameter Optimization
6. Similarity, Neighbors, and Clusters
Fundamental concepts: Calculating similarity of objects described by data; Using
similarity for prediction; Clustering as similarity-based segmentation.
Exemplary techniques: Searching for similar entities; Nearest neighbor methods;
Clustering methods; Distance metrics for calculating similarity.
Similarity and Distance
Nearest-Neighbor Reasoning
Example: Whiskey Analytics
Nearest Neighbors for Predictive Modeling
How Many Neighbors and How Much Influence?
Geometric Interpretation, Overfitting, and Complexity Control
Issues with Nearest-Neighbor Methods
Some Important Technical Details Relating to Similarities and Neighbors
Heterogeneous Attributes
* Other Distance Functions
* Combining Functions: Calculating Scores from Neighbors
Clustering
Example: Whiskey Analytics Revisited
Hierarchical Clustering
Nearest Neighbors Revisited: Clustering Around Centroids
Example: Clustering Business News Stories
Understanding the Results of Clustering
* Using Supervised Learning to Generate Cluster Descriptions
Stepping Back: Solving a Business Problem Versus Data Exploration
7. Decision Analytic Thinking I: What Is a Good Model?
Fundamental concepts: Careful consideration of what is desired from data science
results; Expected value as a key evaluation framework; Consideration of appropriate
comparative baselines.
Exemplary techniques: Various evaluation metrics; Estimating costs and benefits;
Calculating expected profit; Creating baseline methods for comparison.
Evaluating Classifiers
Plain Accuracy and Its Problems
The Confusion Matrix
Problems with Unbalanced Classes
Problems with Unequal Costs and Benefits
Generalizing Beyond Classification
A Key Analytical Framework: Expected Value
Using Expected Value to Frame Classifier Use
Using Expected Value to Frame Classifier Evaluation
Evaluation, Baseline Performance, and Implications for Investments in Data
8. Visualizing Model Performance
Fundamental concepts: Visualization of model performance under various kinds of
uncertainty; Further consideration of what is desired from data mining results.
Exemplary techniques: Profit curves; Cumulative response curves; Lift curves; ROC
Ranking Instead of Classifying
Profit Curves
ROC Graphs and Curves
The Area Under the ROC Curve (AUC)
Cumulative Response and Lift Curves
Example: Performance Analytics for Churn Modeling
9. Evidence and Probabilities
Fundamental concepts: Explicit evidence combination with Bayes’ Rule; Probabilistic
reasoning via assumptions of conditional independence.
Exemplary techniques: Naive Bayes classification; Evidence lift.
Example: Targeting Online Consumers With Advertisements
Combining Evidence Probabilistically
Joint Probability and Independence
Bayes’ Rule
Applying Bayes’ Rule to Data Science
Conditional Independence and Naive Bayes
Advantages and Disadvantages of Naive Bayes
A Model of Evidence “Lift”
Example: Evidence Lifts from Facebook “Likes”
Evidence in Action: Targeting Consumers with Ads
10. Representing and Mining Text
Fundamental concepts: The importance of constructing mining-friendly data
representations; Representation of text for data mining.
Exemplary techniques: Bag of words representation; TFIDF calculation; N-grams;
Stemming; Named entity extraction; Topic models.
Why Text Is Important
Why Text Is Difficult
Representation
Bag of Words
Term Frequency
Measuring Sparseness: Inverse Document Frequency
Combining Them: TFIDF
Example: Jazz Musicians
* The Relationship of IDF to Entropy
Beyond Bag of Words
N-gram Sequences
Named Entity Extraction
Topic Models
Example: Mining News Stories to Predict Stock Price Movement
Data Preprocessing
11. Decision Analytic Thinking II: Toward Analytical Engineering
Fundamental concept: Solving business problems with data science starts with
analytical engineering: designing an analytical solution, based on the data, tools, and
techniques available.
Exemplary technique: Expected value as a framework for data science solution design.
Targeting the Best Prospects for a Charity Mailing
The Expected Value Framework: Decomposing the Business Problem and
Recomposing the Solution Pieces
A Brief Digression on Selection Bias
Our Churn Example Revisited with Even More Sophistication
The Expected Value Framework: Structuring a More Complicated Business
Assessing the Influence of the Incentive
From an Expected Value Decomposition to a Data Science Solution
12. Other Data Science Tasks and Techniques
Fundamental concepts: Our fundamental concepts as the basis of many common data
science techniques; The importance of familiarity with the building blocks of data
Exemplary techniques: Association and co-occurrences; Behavior profiling; Link
prediction; Data reduction; Latent information mining; Movie recommendation;
Bias-variance decomposition of error; Ensembles of models; Causal reasoning from data.
Co-occurrences and Associations: Finding Items That Go Together
Measuring Surprise: Lift and Leverage
Example: Beer and Lottery Tickets
Associations Among Facebook Likes
Profiling: Finding Typical Behavior
Link Prediction and Social Recommendation
Data Reduction, Latent Information, and Movie Recommendation
Bias, Variance, and Ensemble Methods
Data-Driven Causal Explanation and a Viral Marketing Example
13. Data Science and Business Strategy
Fundamental concepts: Our principles as the basis of success for a data-driven
business; Acquiring and sustaining competitive advantage via data science; The
importance of careful curation of data science capability.
Thinking Data-Analytically, Redux
Achieving Competitive Advantage with Data Science
Sustaining Competitive Advantage with Data Science
Formidable Historical Advantage
Unique Intellectual Property
Unique Intangible Collateral Assets
Superior Data Scientists
Superior Data Science Management
Attracting and Nurturing Data Scientists and Their Teams
Examine Data Science Case Studies
Be Ready to Accept Creative Ideas from Any Source
Be Ready to Evaluate Proposals for Data Science Projects
Example Data Mining Proposal
Flaws in the Big Red Proposal
A Firm’s Data Science Maturity
14. Conclusion
The Fundamental Concepts of Data Science
Applying Our Fundamental Concepts to a New Problem: Mining Mobile Device Data
Changing the Way We Think about Solutions to Business Problems
What Data Can’t Do: Humans in the Loop, Revisited
Privacy, Ethics, and Mining Data About Individuals
Is There More to Data Science?
Final Example: From Crowd-Sourcing to Cloud-Sourcing
Final Words
A. Proposal Review Guide
B. Another Sample Proposal
Glossary
Bibliography
Index
Data Science for Business is intended for several sorts of readers:
• Business people who will be working with data scientists, managing data science–
oriented projects, or investing in data science ventures,
• Developers who will be implementing data science solutions, and
• Aspiring data scientists.
This is not a book about algorithms, nor is it a replacement for a book about algorithms.
We deliberately avoided an algorithm-centered approach. We believe there is a relatively
small set of fundamental concepts or principles that underlie techniques for extracting
useful knowledge from data. These concepts serve as the foundation for many well-
known algorithms of data mining. Moreover, these concepts underlie the analysis of
data-centered business problems, the creation and evaluation of data science solutions,
and the evaluation of general data science strategies and proposals. Accordingly, we
organized the exposition around these general principles rather than around specific
algorithms. Where necessary to describe procedural details, we use a combination of
text and diagrams, which we think are more accessible than a listing of detailed
algorithmic steps.
The book does not presume a sophisticated mathematical background. However, by its
very nature the material is somewhat technical: the goal is to impart a significant
understanding of data science, not just to give a high-level overview. In general, we have
tried to minimize the mathematics and make the exposition as “conceptual” as possible.
Colleagues in industry comment that the book is invaluable for helping to align the
understanding of the business, technical/development, and data science teams. That
observation is based on a small sample, so we are curious to see how general it truly is
(see Chapter 5!). Ideally, we envision a book that any data scientist would give to his
collaborators from the development or business teams, effectively saying: if you really
want to design/implement top-notch data science solutions to business problems, we
all need to have a common understanding of this material.
Colleagues also tell us that the book has been quite useful in an unforeseen way: for
preparing to interview data science job candidates. The demand from business for hiring
data scientists is strong and increasing. In response, more and more job seekers are
presenting themselves as data scientists. Every data science job candidate should
understand the fundamentals presented in this book. (Our industry colleagues tell us that
they are surprised how many do not. We have half-seriously discussed a follow-up
pamphlet, “Cliff’s Notes to Interviewing for Data Science Jobs.”)
Our Conceptual Approach to Data Science
In this book we introduce a collection of the most important fundamental concepts of
data science. Some of these concepts are “headliners” for chapters, and others are
introduced more naturally through the discussions (and thus they are not necessarily
labeled as fundamental concepts). The concepts span the process from envisioning the
problem, to applying data science techniques, to deploying the results to improve
decision-making. The concepts also undergird a large array of business analytics
methods and techniques.
The concepts fit into three general types:
1. Concepts about how data science fits in the organization and the competitive
landscape, including ways to attract, structure, and nurture data science teams; ways of
thinking about how data science leads to competitive advantage; and tactical
concepts for doing well with data science projects.
2. General ways of thinking data-analytically. These help in identifying appropriate
data and considering appropriate methods. The concepts include the data mining
process as well as the collection of different high-level data mining tasks.
3. General concepts for actually extracting knowledge from data, which undergird the
vast array of data science tasks and their algorithms.
For example, one fundamental concept is that of determining the similarity of two
entities described by data. This ability forms the basis for various specific tasks. It may
be used directly to find customers similar to a given customer. It forms the core of several
prediction algorithms that estimate a target value such as the expected resource usage
of a client or the probability that a customer will respond to an offer. It is also the basis for
clustering techniques, which group entities by their shared features without a focused
objective. Similarity forms the basis of information retrieval, in which documents or
webpages relevant to a search query are retrieved. Finally, it underlies several common
algorithms for recommendation. A traditional algorithm-oriented book might present
each of these tasks in a different chapter, under different names, with common aspects
buried in algorithm details or mathematical propositions. In this book we instead focus
on the unifying concepts, presenting specific tasks and algorithms as natural
manifestations of them.
1. Of course, each author has the distinct impression that he did the majority of the work on the book.
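To make the similarity concept described above concrete, here is a minimal Python sketch that finds the customers most similar to a given customer. It is only an illustration: the customer names, the two numeric features, and the choice of Euclidean distance are all assumptions made for this example, not something prescribed by the text.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(target, candidates, k=1):
    """Return the names of the k candidates closest to the target vector."""
    ranked = sorted(
        candidates,
        key=lambda name: euclidean_distance(target, candidates[name]),
    )
    return ranked[:k]


# Hypothetical customers described by two made-up features:
# (age in decades, monthly spend in hundreds of dollars).
customers = {
    "alice": (3.0, 1.0),
    "bob": (6.0, 5.0),
    "carol": (3.5, 1.5),
}

new_customer = (3.2, 1.2)
nearest = most_similar(new_customer, customers, k=2)
print(nearest)  # ['alice', 'carol']
```

The same distance function could serve prediction (average the target values of the nearest neighbors) or clustering (group entities whose distances are small), which is the sense in which one concept underlies several tasks.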
As another example, in evaluating the utility of a pattern, we see a notion of lift (how
much more prevalent a pattern is than would be expected by chance) recurring broadly
across data science. It is used to evaluate very different sorts of patterns in different
contexts. Algorithms for targeting advertisements are evaluated by computing the lift
one gets for the targeted population. Lift is used to judge the weight of evidence for or
against a conclusion. Lift helps determine whether a co-occurrence (an association) in
data is interesting, as opposed to simply being a natural consequence of popularity.
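One common way to quantify lift for a co-occurrence, sketched here under stated assumptions, is the ratio of the observed joint frequency to the frequency expected if the two items were independent: P(A, B) / (P(A) × P(B)). The function name and the basket counts below are hypothetical and chosen only for illustration.

```python
def cooccurrence_lift(n_both, n_a, n_b, n_total):
    """Lift of A and B occurring together: P(A, B) / (P(A) * P(B)).

    Computed from raw counts; algebraically this equals
    (n_both * n_total) / (n_a * n_b).
    """
    if n_a <= 0 or n_b <= 0 or n_total <= 0:
        raise ValueError("n_a, n_b, and n_total must be positive")
    return (n_both * n_total) / (n_a * n_b)


# Hypothetical counts: 10,000 baskets, 400 contain item A,
# 1,000 contain item B, and 200 contain both.
value = cooccurrence_lift(n_both=200, n_a=400, n_b=1000, n_total=10000)
print(value)  # 5.0
```

A lift near 1 means the pair appears together about as often as chance would predict; a value well above 1, as in this made-up case, suggests the association is more than a side effect of each item being popular on its own.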
We believe that explaining data science around such fundamental concepts not only
aids the reader, it also facilitates communication between business stakeholders and
data scientists. It provides a shared vocabulary and enables both parties to understand
each other better. The shared concepts lead to deeper discussions that may uncover
critical issues otherwise missed.
To the Instructor
This book has been used successfully as a textbook for a very wide variety of data science
courses. Historically, the book arose from the development of Foster’s multidisciplinary
Data Science classes at the Stern School at NYU, starting in the fall of XXXXXXXXXXThe original
class was nominally for MBA students and MSIS students, but drew students from
schools across the university. The most interesting aspect of the class was not that it
appealed to MBA and MSIS students, for whom it was designed. More interesting, it
also was found to be very valuable by students with strong backgrounds in machine
learning and other technical disciplines. Part of the reason seemed to be that the focus
on fundamental principles and other issues besides algorithms was missing from their
At NYU we now use the book in support of a variety of data science–related programs:
the original MBA and MSIS programs, undergraduate business analytics, NYU/Stern’s
new MS in Business Analytics program, and as the Introduction to Data Science for
NYU’s new MS in Data Science. In addition, (prior to publication) the book has been
adopted by more than twenty other universities for programs in nine countries (and
counting), in business schools, in computer science programs, and for more general
introductions to data science.
Stay tuned to the book’s website (see below) for information on how to obtain helpful
instructional material, including lecture slides, sample homework questions and
problems, example project instructions based on the frameworks from the book, exam
questions, and more to come.
We keep an up-to-date list of known adoptees on the book’s
website. Click Who’s Using It at the top.
Other Skills and Concepts
There are many other concepts and skills that a practical data scientist needs to know
besides the fundamental principles of data science. These skills and concepts will be
discussed in Chapter 1 and Chapter 2. The interested reader is encouraged to visit the
book’s website for pointers to material for learning these additional skills and concepts
(for example, scripting in Python, Unix command-line processing, datafiles, common
data formats, databases and querying, big data architectures and systems like
MapReduce and Hadoop, data visualization, and other related topics).
Sections and Notation
In addition to occasional footnotes, the book contains boxed “sidebars.” These are
essentially extended footnotes. We reserve these for material that we consider interesting
and worthwhile, but too long for a footnote and too much of a digression for the main
text.
Technical Details Ahead — A note on the starred sections
The occasional mathematical details are relegated to optional “starred”
sections. These section titles will have asterisk prefixes, and they will be
preceded by a paragraph rendered like this one. Such “starred” sections
contain more detailed mathematics and/or more technical details
than elsewhere, and each introductory paragraph explains the section’s purpose.
The book is written so that these sections may be skipped without loss
of continuity, although in a few places we remind readers that details
appear there.
Constructions in the text like (Smith and Jones, 2003) indicate a reference to an entry
in the bibliography (in this case, the 2003 article or book by Smith and Jones); “Smith
and Jones (2003)” is a similar reference. A single bibliography for the entire book appears
in the endmatter.
In this book we try to keep math to a minimum, and what math there is we have
simplified as much as possible without introducing confusion. For our readers with
technical backgrounds, a few comments may be in order regarding our simplifying choices.
1. We avoid Sigma (Σ) and Pi (Π) notation, commonly used in textbooks to indicate
sums and products, respectively. Instead we simply use equations with ellipses like
f(x) = w_1 x_1 + w_2 x_2 + ⋯ + w_n x_n
In the technical, “starred” sections we sometimes adopt Sigma and Pi notation when
this ellipsis approach is just too cumbersome. We assume people reading these
sections are somewhat more comfortable with math notation and will not be confused.
2. Statistics books are usually careful to distinguish between a value and its estimate
by putting a “hat” on variables that are estimates, so in such books you’ll typically
see a true probability denoted p and its estimate denoted p̂. In this book we are
almost always talking about estimates from data, and putting hats on everything
makes equations verbose and ugly. Everything should be assumed to be an estimate
from data unless we say otherwise.
3. We simplify notation and remove extraneous variables where we believe they are
clear from context. For example, when we discuss classifiers mathematically, we are
technically dealing with decision predicates over feature vectors. Expressing this
formally would lead to equations like:
f̂ R(x) = xAge × XXXXXXXXXX × xBalance + 60
Instead we opt for the more readable:
f (x) = Age × XXXXXXXXXX × Balance + 60
with the understanding that x is a vector and Age and Balance are components of x.
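As a rough illustration of the notation choices above, the following Python sketch evaluates a linear function written out as a sum of weight-times-feature terms plus a constant, with the attribute names standing in for components of the vector x. The weights and the customer record are made up for this example and are not the coefficients from the equation above; only the constant term 60 is borrowed from it.

```python
def linear_function(weights, features, intercept=0.0):
    """Compute intercept + w1*x1 + w2*x2 + ... + wn*xn."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return intercept + sum(w * x for w, x in zip(weights, features))


# Hypothetical weights, one per named attribute (assumed values).
attribute_names = ["Age", "Balance"]
weights = [2.0, 0.5]

# x is a feature vector; Age and Balance are two of its components.
x = {"Age": 40, "Balance": 100}
features = [x[name] for name in attribute_names]

score = linear_function(weights, features, intercept=60)
print(score)  # 190.0
```

Writing the function over a list of weights is the code analogue of the Sigma notation the text avoids: the ellipsis in the equation becomes the loop inside `sum`.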
We have tried to be consistent with typography, reserving fixed-width typewriter fonts
like sepal_width to indicate attributes or keywords in data. For example, in the text-mining
chapter, a word like 'discussing' designates a word in a document while discuss
might be the resulting token in the data.
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements
such as variable or function names, databases, data types, environment variables,
statements, and keywords.
Constant width italic
Shows text that should be replaced with user-supplied values or by values deter‐
mined by context.
Throughout the book we have placed special inline tips and warnings relevant to the
material. They will be rendered differently depending on whether you’re reading paper,
PDF, or an ebook, as follows:
A sentence or paragraph typeset like this signifies a tip or a suggestion.
This text and element signifies a general note.
Text rendered like this signifies a warning or caution. These are more
important than tips and are used sparingly.
Using Examples
In addition to being an introduction to data science, this book is intended to be useful
in discussions of and day-to-day work in the field. Answering a question by citing this
book and quoting examples does not require permission. We appreciate, but do not
require, attribution. Formal attribution usually includes the title, author, publisher, and
ISBN. For example: “Data Science for Business by Foster Provost and Tom Fawcett
(O’Reilly). Copyright 2013 Foster Provost and Tom Fawcett, XXXXXXXXXX.”
If you feel your use of examples falls outside fair use or the permission given above, feel
free to contact us at XXXXXXXXXX.
Safari® Books Online
Safari Books Online is an on-demand digital library that
delivers expert content in both book and video form from
the world’s leading authors in technology and business.
Technology professionals, software developers, web designers, and business and crea‐
tive professionals use Safari Books Online as their primary resource for research, prob‐
lem solving, learning, and certification training.
Safari Books Online offers a range of product mixes and pricing programs for organi‐
zations, government agencies, and individuals. Subscribers have access to thousands of
books, training videos, and prepublication manuscripts in one fully searchable database
from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Pro‐
fessional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John
Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT
Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course
Technology, and dozens more. For more information about Safari Books Online, please
visit us online.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
XXXXXXXXXXin the United States or Canada)
XXXXXXXXXXinternational or local)
We have two web pages for this book, where we list errata, examples, and any additional
information. You can access the publisher’s page at and the
authors’ page at
To comment or ask technical questions about this book, send email to bookques
For more information about O’Reilly Media’s books, courses, conferences, and news,
see their website at
Thanks to all the many colleagues and others who have provided invaluable ideas, feed‐
back, criticism, suggestions, and encouragement based on discussions and many prior
draft manuscripts. At the risk of missing someone, let us thank in particular: Panos
Adamopoulos, Manuel Arriaga, Josh Attenberg, Solon Barocas, Ron Bekkerman, Josh
Blumenstock, Ohad Brazilay, Aaron Brick, Jessica Clark, Nitesh Chawla, Peter Devito,
Vasant Dhar, Jan Ehmke, Theos Evgeniou, Justin Gapper, Tomer Geva, Daniel Gillick,
Shawndra Hill, Nidhi Kathuria, Ronny Kohavi, Marios Kokkodis, Tom Lee, Philipp
Marek, David Martens, Sophie Mohin, Lauren Moores, Alan Murray, Nick Nishimura,
Balaji Padmanabhan, Jason Pan, Claudia Perlich, Gregory Piatetsky-Shapiro, Tom Phil‐
lips, Kevin Reilly, Maytal Saar-Tsechansky, Evan Sadler, Galit Shmueli, Roger Stein, Nick
Street, Kiril Tsemekhman, Craig Vaughan, Chris Volinsky, Wally Wang, Geoff Webb,
Debbie Yuster, and Rong Zheng. We would also like to thank more generally the students
from Foster’s classes, Data Mining for Business Analytics, Practical Data Science, In‐
troduction to Data Science, and the Data Science Research Seminar. Questions and
issues that arose when using prior drafts of this book provided substantive feedback for
improving it.
Thanks to all the colleagues who have taught us about data science and about how to
teach data science over the years. Thanks especially to Maytal Saar-Tsechansky and
Claudia Perlich. Maytal graciously shared with Foster her notes for her data mining class
many years ago. The classification tree example in Chapter 3 (thanks especially for the
“bodies” visualization) is based mostly on her idea and example; her ideas and example
were the genesis for the visualization comparing the partitioning of the instance space
with trees and linear discriminant functions in Chapter 4, the “Will David Respond”
example in Chapter 6 is based on her example, and probably other things long forgotten.
Claudia has taught companion sections of Data Mining for Business Analytics/Intro‐
duction to Data Science along with Foster for the past few years, and has taught him
much about data science in the process (and beyond).
Thanks to David Stillwell, Thore Graepel, and Michal Kosinski for providing the Face‐
book Like data for some of the examples. Thanks to Nick Street for providing the cell
nuclei data and for letting us use the cell nuclei image in Chapter 4. Thanks to David
Martens for his help with the mobile locations visualization. Thanks to Chris Volinsky
for providing data from his work on the Netflix Challenge. Thanks to Sonny Tambe for
early access to his results on big data technologies and productivity. Thanks to Patrick
Perry for pointing us to the bank call center example used in Chapter 12. Thanks to
Geoff Webb for the use of the Magnum Opus association mining system.
Most of all we thank our families for their love, patience and encouragement.
A great deal of open source software was used in the preparation of this book and its
examples. The authors wish to thank the developers and contributors of:
• Python and Perl
• Scipy, Numpy, Matplotlib, and Scikit-Learn
• Weka
• The Machine Learning Repository at the University of California at Irvine (Bache
& Lichman, 2013)
Finally, we encourage readers to check our website for updates to this material, new
chapters, errata, addenda, and accompanying slide sets.
—Foster Provost and Tom Fawcett
Dream no small dreams for they have no power to
move the hearts of men.
—Johann Wolfgang von Goethe
Introduction: Data-Analytic Thinking
The past fifteen years have seen extensive investments in business infrastructure, which
have improved the ability to collect data throughout the enterprise. Virtually every as‐
pect of business is now open to data collection and often even instrumented for data
collection: operations, manufacturing, supply-chain management, customer behavior,
marketing campaign performance, workflow procedures, and so on. At the same time,
information is now widely available on external events such as market trends, industry
news, and competitors’ movements. This broad availability of data has led to increasing
interest in methods for extracting useful information and knowledge from data—the
realm of data science.
The Ubiquity of Data Opportunities
With vast amounts of data now available, companies in almost every industry are fo‐
cused on exploiting data for competitive advantage. In the past, firms could employ
teams of statisticians, modelers, and analysts to explore datasets manually, but the vol‐
ume and variety of data have far outstripped the capacity of manual analysis. At the
same time, computers have become far more powerful, networking has become ubiq‐
uitous, and algorithms have been developed that can connect datasets to enable broader
and deeper analyses than previously possible. The convergence of these phenomena has
given rise to the increasingly widespread business application of data science principles
and data-mining techniques.
Probably the widest applications of data-mining techniques are in marketing for tasks
such as targeted marketing, online advertising, and recommendations for cross-selling.
Data mining is used for general customer relationship management to analyze customer
behavior in order to manage attrition and maximize expected customer value. The
finance industry uses data mining for credit scoring and trading, and in operations via
fraud detection and workforce management. Major retailers from Walmart to Amazon
apply data mining throughout their businesses, from marketing to supply-chain man‐
agement. Many firms have differentiated themselves strategically with data science,
sometimes to the point of evolving into data mining companies.
The primary goals of this book are to help you view business problems from a data
perspective and understand principles of extracting useful knowledge from data. There
is a fundamental structure to data-analytic thinking, and basic principles that should
be understood. There are also particular areas where intuition, creativity, common
sense, and domain knowledge must be brought to bear. A data perspective will provide
you with structure and principles, and this will give you a framework to systematically
analyze such problems. As you get better at data-analytic thinking you will develop
intuition as to how and where to apply creativity and domain knowledge.
Throughout the first two chapters of this book, we will discuss in detail various topics
and techniques related to data science and data mining. The terms “data science” and
“data mining” often are used interchangeably, and the former has taken a life of its own
as various individuals and organizations try to capitalize on the current hype surround‐
ing it. At a high level, data science is a set of fundamental principles that guide the
extraction of knowledge from data. Data mining is the extraction of knowledge from
data, via technologies that incorporate these principles. As a term, “data science” often
is applied more broadly than the traditional use of “data mining,” but data mining tech‐
niques provide some of the clearest illustrations of the principles of data science.
It is important to understand data science even if you never intend to
apply it yourself. Data-analytic thinking enables you to evaluate pro‐
posals for data mining projects. For example, if an employee, a con‐
sultant, or a potential investment target proposes to improve a par‐
ticular business application by extracting knowledge from data, you
should be able to assess the proposal systematically and decide wheth‐
er it is sound or flawed. This does not mean that you will be able to
tell whether it will actually succeed—for data mining projects, that
often requires trying—but you should be able to spot obvious flaws,
unrealistic assumptions, and missing pieces.
Throughout the book we will describe a number of fundamental data science principles,
and will illustrate each with at least one data mining technique that embodies the prin‐
ciple. For each principle there are usually many specific techniques that embody it, so
in this book we have chosen to emphasize the basic principles in preference to specific
techniques. That said, we will not make a big deal about the difference between data
science and data mining, except where it will have a substantial effect on understanding
the actual concepts.
1. Of course! What goes better with strawberry Pop-Tarts than a nice cold beer?
Let’s examine two brief case studies of analyzing data to extract predictive patterns.
Example: Hurricane Frances
Consider an example from a New York Times story from 2004:
Hurricane Frances was on its way, barreling across the Caribbean, threatening a direct
hit on Florida’s Atlantic coast. Residents made for higher ground, but far away, in Ben‐
tonville, Ark., executives at Wal-Mart Stores decided that the situation offered a great
opportunity for one of their newest data-driven weapons … predictive technology.
A week ahead of the storm’s landfall, Linda M. Dillman, Wal-Mart’s chief information
officer, pressed her staff to come up with forecasts based on what had happened when
Hurricane Charley struck several weeks earlier. Backed by the trillions of bytes’ worth of
shopper history that is stored in Wal-Mart’s data warehouse, she felt that the company
could ‘start predicting what’s going to happen, instead of waiting for it to happen,’ as she
put it. (Hays, 2004)
Consider why data-driven prediction might be useful in this scenario. It might be useful
to predict that people in the path of the hurricane would buy more bottled water. Maybe,
but this point seems a bit obvious, and why would we need data science to discover it?
It might be useful to project the amount of increase in sales due to the hurricane, to
ensure that local Wal-Marts are properly stocked. Perhaps mining the data could reveal
that a particular DVD sold out in the hurricane’s path—but maybe it sold out that week
at Wal-Marts across the country, not just where the hurricane landing was imminent.
The prediction could be somewhat useful, but is probably more general than Ms. Dill‐
man was intending.
It would be more valuable to discover patterns due to the hurricane that were not ob‐
vious. To do this, analysts might examine the huge volume of Wal-Mart data from prior,
similar situations (such as Hurricane Charley) to identify unusual local demand for
products. From such patterns, the company might be able to anticipate unusual demand
for products and rush stock to the stores ahead of the hurricane’s landfall.
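One way to picture "unusual local demand" is to compare how much a product's sales grew in the hurricane's path with how much they grew nationally over the same week. The sketch below is a minimal Python illustration under that assumption. The function names, the threshold, and all sales figures are hypothetical and are not taken from Wal-Mart's data or methods.

```python
def demand_lift(local_before, local_during, national_before, national_during):
    """Return local sales growth divided by national sales growth, per product."""
    lifts = {}
    for product in local_before:
        local_growth = local_during[product] / local_before[product]
        national_growth = national_during[product] / national_before[product]
        lifts[product] = local_growth / national_growth
    return lifts


def unusual_products(lifts, threshold=2.0):
    """List products whose lift meets the threshold, largest lift first."""
    flagged = [p for p in lifts if lifts[p] >= threshold]
    return sorted(flagged, key=lambda p: lifts[p], reverse=True)


# Hypothetical weekly unit sales (made-up numbers for illustration only).
local_before = {"bottled water": 100, "strawberry Pop-Tarts": 20, "new DVD": 50}
local_during = {"bottled water": 400, "strawberry Pop-Tarts": 140, "new DVD": 150}
national_before = {"bottled water": 10000, "strawberry Pop-Tarts": 2000, "new DVD": 5000}
national_during = {"bottled water": 10000, "strawberry Pop-Tarts": 2000, "new DVD": 15000}

lifts = demand_lift(local_before, local_during, national_before, national_during)
unusual = unusual_products(lifts)
print(unusual)
```

With these made-up numbers, the DVD that sold well everywhere has a lift of 1 and is filtered out, while the products with a genuinely local surge are flagged.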
Indeed, that is what happened. The New York Times (Hays, 2004) reported that: “… the
experts mined the data and found that the stores would indeed need certain products
—and not just the usual flashlights. ‘We didn’t know in the past that strawberry Pop-
Tarts increase in sales, like seven times their normal sales rate, ahead of a hurricane,’
Ms. Dillman said in a recent interview. ‘And the pre-hurricane top-selling item was beer.’”¹
Example: Predicting Customer Churn
How are such data analyses performed? Consider a second, more typical business sce‐
nario and how it might be treated from a data perspective. This problem will serve as a
running example that will illuminate many of the issues raised in this book and provide
a common frame of reference.
Assume you just landed a great analytical job with MegaTelCo, one of the largest tele‐
communication firms in the United States. They are having a major problem with cus‐
tomer retention in their wireless business. In the mid-Atlantic region, 20% of cell phone
customers leave when their contracts expire, and it is getting increasingly difficult to
acquire new customers. Since the cell phone market is now saturated, the huge growth
in the wireless market has tapered off. Communications companies are now engaged
in battles to attract each other’s customers while retaining their own. Customers switch‐
ing from one company to another is called churn, and it is expensive all around: one
company must spend on incentives to attract a customer while another company loses
revenue when the customer departs.
You have been called in to help understand the problem and to devise a solution. At‐
tracting new customers is much more expensive than retaining existing ones, so a good
deal of marketing budget is allocated to prevent churn. Marketing has already designed
a special retention offer. Your task is to devise a precise, step-by-step plan for how the
data science team should use MegaTelCo’s vast data resources to decide which customers
should be offered the special retention deal prior to the expiration of their contracts.
Think carefully about what data you might use and how they would be used. Specifically,
how should MegaTelCo choose a set of customers to receive their offer in order to best
reduce churn for a particular incentive budget? Answering this question is much more
complicated than it may seem initially. We will return to this problem repeatedly through
the book, adding sophistication to our solution as we develop an understanding of the
fundamental data science concepts.
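To make the question concrete, here is a deliberately naive first pass in Python. It assumes that some model has already produced a churn probability for each customer and that each customer has a known value if retained; it then ranks customers by the product of the two and makes as many offers as the budget allows. The class, field names, and numbers are hypothetical, and this baseline ignores many of the complications the text alludes to (for example, whether the offer actually changes a customer's behavior).

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    churn_probability: float  # assumed output of some predictive model
    annual_value: float       # hypothetical revenue kept if the customer stays


def choose_offer_targets(customers, offer_cost, budget):
    """Pick the customers with the largest expected loss, within the budget."""
    max_offers = int(budget // offer_cost)
    ranked = sorted(
        customers,
        key=lambda c: c.churn_probability * c.annual_value,
        reverse=True,
    )
    return [c.name for c in ranked[:max_offers]]


# Made-up customers for illustration only.
customers = [
    Customer("A", churn_probability=0.90, annual_value=100.0),
    Customer("B", churn_probability=0.20, annual_value=1000.0),
    Customer("C", churn_probability=0.50, annual_value=300.0),
    Customer("D", churn_probability=0.05, annual_value=500.0),
]

targets = choose_offer_targets(customers, offer_cost=50.0, budget=100.0)
print(targets)
```

Note that customer A, the most likely to leave, is not selected here: ranking by expected loss rather than by churn probability alone already changes who receives the offer.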
In reality, customer retention has been a major use of data mining
technologies—especially in telecommunications and finance busi‐
nesses. These more generally were some of the earliest and widest
adopters of data mining technologies, for reasons discussed later.
Data Science, Engineering, and Data-Driven Decision Making
Data science involves principles, processes, and techniques for understanding phenomena
via the (automated) analysis of data. In this book, we will view the ultimate goal
of data science as improving decision making, as this generally is of direct interest to
business.
Figure 1-1. Data science in the context of various data-related processes in the
organization.
Figure 1-1 places data science in the context of various other closely related and data-
related processes in the organization. It distinguishes data science from other aspects
of data processing that are gaining increasing attention in business. Let’s start at the top.
Data-driven decision-making (DDD) refers to the practice of basing decisions on the
analysis of data, rather than purely on intuition. For example, a marketer could select
advertisements based purely on her long experience in the field and her eye for what
will work. Or, she could base her selection on the analysis of data regarding how con‐
sumers react to different ads. She could also use a combination of these approaches.
DDD is not an all-or-nothing practice, and different firms engage in DDD to greater or
lesser degrees.
The benefits of data-driven decision-making have been demonstrated conclusively.
Economist Erik Brynjolfsson and his colleagues from MIT and Penn’s Wharton School
conducted a study of how DDD affects firm performance (Brynjolfsson, Hitt, & Kim,
2011). They developed a measure of DDD that rates firms as to how strongly they use
data to make decisions across the company. They show that statistically, the more data-
driven a firm is, the more productive it is—even controlling for a wide range of possible
confounding factors. And the differences are not small. One standard deviation higher
on the DDD scale is associated with a 4%–6% increase in productivity. DDD also is
correlated with higher return on assets, return on equity, asset utilization, and market
value, and the relationship seems to be causal.
The sort of decisions we will be interested in in this book mainly fall into two types: (1)
decisions for which “discoveries” need to be made within data, and (2) decisions that
repeat, especially at massive scale, and so decision-making can benefit from even small
increases in decision-making accuracy based on data analysis. The Walmart example
above illustrates a type 1 problem: Linda Dillman would like to discover knowledge that
will help Walmart prepare for Hurricane Frances’s imminent arrival.
In 2012, Walmart’s competitor Target was in the news for a data-driven decision-making
case of its own, also a type 1 problem (Duhigg, XXXXXXXXXXLike most retailers, Target cares
about consumers’ shopping habits, what drives them, and what can influence them.
Consumers tend to have inertia in their habits and getting them to change is very dif‐
ficult. Decision makers at Target knew, however, that the arrival of a new baby in a family
is one point where people do change their shopping habits significantly. In the Target
analyst’s words, “As soon as we get them buying diapers from us, they’re going to start
buying everything else too.” Most retailers know this and so they compete with each
other trying to sell baby-related products to new parents. Since most birth records are
public, retailers obtain information on births and send out special offers to the new
parents.
However, Target wanted to get a jump on their competition. They were interested in
whether they could predict that people are expecting a baby. If they could, they would
gain an advantage by making offers before their competitors. Using techniques of data
science, Target analyzed historical data on customers who later were revealed to have
been pregnant, and were able to extract information that could predict which consumers
were pregnant. For example, pregnant mothers often change their diets, their ward‐
robes, their vitamin regimens, and so on. These indicators could be extracted from
historical data, assembled into predictive models, and then deployed in marketing
campaigns. We will discuss predictive models in much detail as we go through the book.
For the time being, it is sufficient to understand that a predictive model abstracts away
most of the complexity of the world, focusing in on a particular set of indicators that
correlate in some way with a quantity of interest (who will churn, or who will purchase,
who is pregnant, etc.). Importantly, in both the Walmart and the Target examples, the
data analysis was not testing a simple hypothesis. Instead, the data were explored with
the hope that something useful would be discovered.²
2. Target was successful enough that this case raised ethical questions on the deployment of such techniques.
Concerns of ethics and privacy are interesting and very important, but we leave their discussion for another
time and place.
Our churn example illustrates a type 2 DDD problem. MegaTelCo has hundreds of
millions of customers, each a candidate for defection. Tens of millions of customers
have contracts expiring each month, so each one of them has an increased likelihood
of defection in the near future. If we can improve our ability to estimate, for a given
customer, how profitable it would be for us to focus on her, we can potentially reap large
benefits by applying this ability to the millions of customers in the population. This
same logic applies to many of the areas where we have seen the most intense application
of data science and data mining: direct marketing, online advertising, credit scoring,
financial trading, help-desk management, fraud detection, search ranking, product rec‐
ommendation, and so on.
The diagram in Figure 1-1 shows data science supporting data-driven decision-making,
but also overlapping with data-driven decision-making. This highlights the often over‐
looked fact that, increasingly, business decisions are being made automatically by com‐
puter systems. Different industries have adopted automatic decision-making at different
rates. The finance and telecommunications industries were early adopters, largely be‐
cause of their precocious development of data networks and implementation of massive-
scale computing, which allowed the aggregation and modeling of data at a large scale,
as well as the application of the resultant models to decision-making.
In the 1990s, automated decision-making changed the banking and consumer credit
industries dramatically. In the 1990s, banks and telecommunications companies also
implemented massive-scale systems for managing data-driven fraud control decisions.
As retail systems were increasingly computerized, merchandising decisions were auto‐
mated. Famous examples include Harrah’s casinos’ reward programs and the automated
recommendations of Amazon and Netflix. Currently we are seeing a revolution in ad‐
vertising, due in large part to a huge increase in the amount of time consumers are
spending online, and the ability online to make (literally) split-second advertising
decisions.
Data Processing and “Big Data”
It is important to digress here to address another point. There is a lot to data processing
that is not data science—despite the impression one might get from the media. Data
engineering and processing are critical to support data science, but they are more gen‐
eral. For example, these days many data processing skills, systems, and technologies
often are mistakenly cast as data science. To understand data science and data-driven
businesses it is important to understand the differences. Data science needs access to
data and it often benefits from sophisticated data engineering that data processing
technologies may facilitate, but these technologies are not data science technologies per
se. They support data science, as shown in Figure 1-1, but they are useful for much more.
Data processing technologies are very important for many data-oriented business tasks
that do not involve extracting knowledge or data-driven decision-making, such as ef‐
ficient transaction processing, modern web system processing, and online advertising
campaign management.
“Big data” technologies (such as Hadoop, HBase, and MongoDB) have received con‐
siderable media attention recently. Big data essentially means datasets that are too large
for traditional data processing systems, and therefore require new processing technol‐
ogies. As with the traditional technologies, big data technologies are used for many
tasks, including data engineering. Occasionally, big data technologies are actually used
for implementing data mining techniques. However, much more often the well-known
big data technologies are used for data processing in support of the data mining tech‐
niques and other data science activities, as represented in Figure 1-1.
Previously, we discussed Brynjolfsson’s study demonstrating the benefits of data-driven
decision-making. A separate study, conducted by economist Prasanna Tambe of NYU’s
Stern School, examined the extent to which big data technologies seem to help firms
(Tambe, XXXXXXXXXXHe finds that, after controlling for various possible confounding factors,
using big data technologies is associated with significant additional productivity growth.
Specifically, one standard deviation higher utilization of big data technologies is asso‐
ciated with 1%–3% higher productivity than the average firm; one standard deviation
lower in terms of big data utilization is associated with 1%–3% lower productivity. This
leads to potentially very large productivity differences between the firms at the extremes.
From Big Data 1.0 to Big Data 2.0
One way to think about the state of big data technologies is to draw an analogy with the
business adoption of Internet technologies. In Web 1.0, businesses busied themselves
with getting the basic internet technologies in place, so that they could establish a web
presence, build electronic commerce capability, and improve the efficiency of their op‐
erations. We can think of ourselves as being in the era of Big Data 1.0. Firms are busying
themselves with building the capabilities to process large data, largely in support of their
current operations—for example, to improve efficiency.
Once firms had incorporated Web 1.0 technologies thoroughly (and in the process had
driven down prices of the underlying technology) they started to look further. They
began to ask what the Web could do for them, and how it could improve things they’d
always done—and we entered the era of Web 2.0, where new systems and companies
began taking advantage of the interactive nature of the Web. The changes brought on
by this shift in thinking are pervasive; the most obvious are the incorporation of social-
networking components, and the rise of the “voice” of the individual consumer (and
We should expect a Big Data 2.0 phase to follow Big Data 1.0. Once firms have become
capable of processing massive data in a flexible fashion, they should begin asking: “What
can I now do that I couldn’t do before, or do better than I could do before?” This is likely
to be the golden era of data science. The principles and techniques we introduce in this
book will be applied far more broadly and deeply than they are today.
It is important to note that in the Web 1.0 era some precocious com‐
panies began applying Web 2.0 ideas far ahead of the mainstream.
Amazon is a prime example, incorporating the consumer’s “voice”
early on, in the rating of products, in product reviews (and deeper, in
the rating of product reviews). Similarly, we see some companies
already applying Big Data 2.0. Amazon again is a company at the
forefront, providing data-driven recommendations from massive da‐
ta. There are other examples as well. Online advertisers must pro‐
cess extremely large volumes of data (billions of ad impressions per
day is not unusual) and maintain a very high throughput (real-time
bidding systems make decisions in tens of milliseconds). We should
look to these and similar industries for hints at advances in big data
and data science that subsequently will be adopted by other industries.
Data and Data Science Capability as a Strategic Asset
The prior sections suggest one of the fundamental principles of data science: data, and
the capability to extract useful knowledge from data, should be regarded as key strategic
assets. Too many businesses regard data analytics as pertaining mainly to realizing value
from some existing data, and often without careful regard to whether the business has
the appropriate analytical talent. Viewing these as assets allows us to think explicitly
about the extent to which one should invest in them. Often, we don’t have exactly the
right data to best make decisions and/or the right talent to best support making decisions
from the data. Further, thinking of these as assets should lead us to the realization that
they are complementary. The best data science team can yield little value without the
appropriate data; the right data often cannot substantially improve decisions without
suitable data science talent. As with all assets, it is often necessary to make investments.
Building a top-notch data science team is a nontrivial undertaking, but can make a huge
difference for decision-making. We will discuss strategic considerations involving data
science in detail in Chapter 13. Our next case study will introduce the idea that thinking
explicitly about how to invest in data assets very often pays off handsomely.
The classic story of little Signet Bank from the 1990s provides a case in point. Previously,
in the 1980s, data science had transformed the business of consumer credit. Modeling
the probability of default had changed the industry from personal assessment of the
likelihood of default to strategies of massive scale and market share, which brought
along concomitant economies of scale. It may seem strange now, but at the time, credit
cards essentially had uniform pricing, for two reasons: (1) the companies did not have
adequate information systems to deal with differential pricing at massive scale, and (2)
bank management believed customers would not stand for price discrimination.
Around 1990, two strategic visionaries (Richard Fairbanks and Nigel Morris) realized
that information technology was powerful enough that they could do more sophisti‐
cated predictive modeling—using the sort of techniques that we discuss throughout this
book—and offer different terms (nowadays: pricing, credit limits, low-initial-rate bal‐
ance transfers, cash back, loyalty points, and so on). These two men had no success
persuading the big banks to take them on as consultants and let them try. Finally, after
running out of big banks, they succeeded in garnering the interest of a small regional
Virginia bank: Signet Bank. Signet Bank’s management was convinced that modeling
profitability, not just default probability, was the right strategy. They knew that a small
proportion of customers actually account for more than 100% of a bank’s profit from
credit card operations (because the rest are break-even or money-losing). If they could
model profitability, they could make better offers to the best customers and “skim the
cream” of the big banks’ clientele.
But Signet Bank had one really big problem in implementing this strategy. They did not
have the appropriate data to model profitability with the goal of offering different terms
to different customers. No one did. Since banks were offering credit with a specific set
of terms and a specific default model, they had the data to model profitability (1) for
the terms they had actually offered in the past, and (2) for the sort of customer who
was actually offered credit (that is, those who were deemed worthy of credit by the
existing model).
What could Signet Bank do? They brought into play a fundamental strategy of data
science: acquire the necessary data at a cost. Once we view data as a business asset, we
should think about whether and how much we are willing to invest. In Signet’s case,
data could be generated on the profitability of customers given different credit terms
by conducting experiments. Different terms were offered at random to different cus‐
tomers. This may seem foolish outside the context of data-analytic thinking: you’re likely
to lose money! This is true. In this case, losses are the cost of data acquisition. The data-
analytic thinker needs to consider whether she expects the data to have sufficient value
to justify the investment.
So what happened with Signet Bank? As you might expect, when Signet began randomly
offering terms to customers for data acquisition, the number of bad accounts soared.
Signet went from an industry-leading “charge-off” rate (2.9% of balances went unpaid)
to almost 6% charge-offs. Losses continued for a few years while the data scientists
worked to build predictive models from the data, evaluate them, and deploy them to
improve profit. Because the firm viewed these losses as investments in data, they per‐
sisted despite complaints from stakeholders. Eventually, Signet’s credit card operation
10 | Chapter 1: Introduction: Data-Analytic Thinking
3. You can read more about Capital One’s story (Clemons & Thatcher, 1998; McNamee 2001).
turned around and became so profitable that it was spun off to separate it from the
bank’s other operations, which now were overshadowing the consumer credit success.
Fairbanks and Morris became Chairman and CEO, and President and COO, respectively, and
proceeded to apply data science principles throughout the business, not just customer
acquisition but retention as well. When a customer calls looking for a better offer, data-
driven models calculate the potential profitability of various possible actions (different
offers, including sticking with the status quo), and the customer service representative’s
computer presents the best offers to make.
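The offer selection described above can be read as an expected-value comparison. The following is a minimal sketch of that idea, not a description of Capital One's actual system: the offer names, acceptance probabilities, and profit figures are all invented for illustration, and in practice the estimates would come from predictive models built on experimental data.

```python
# Hypothetical illustration: choose the action with the highest expected profit.
# All names and numbers are invented; real estimates would come from models.

def expected_profit(p_accept, profit_if_accepted, profit_otherwise):
    """Expected profit of taking one action with one customer."""
    return p_accept * profit_if_accepted + (1 - p_accept) * profit_otherwise


def rank_offers(offers):
    """Return (offer name, expected profit) pairs, best first."""
    scored = [
        (name, expected_profit(p, gain, fallback))
        for name, (p, gain, fallback) in offers.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# For each action: (estimated probability the customer accepts and stays,
#                   profit if so, profit otherwise)
offers_for_caller = {
    "status quo": (0.40, 120.0, 0.0),
    "lower rate": (0.75, 80.0, 0.0),
    "cash back": (0.60, 95.0, 0.0),
}

ranking = rank_offers(offers_for_caller)
for name, value in ranking:
    print(f"{name}: expected profit {value:.2f}")
```

With these made-up figures, the offer with the highest profit per accepting customer (the status quo) is not the one with the highest expected profit, because it is the least likely to be accepted.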
You may not have heard of little Signet Bank, but if you’re reading this book you’ve
probably heard of the spin-off: Capital One. Fairbanks and Morris’s new company grew
to be one of the largest credit card issuers in the industry with one of the lowest charge-
off rates. In 2000, the bank was reported to be carrying out 45,000 of these “scientific
tests” as they called them.3
Studies giving clear quantitative demonstrations of the value of a data asset are hard to
find, primarily because firms are hesitant to divulge results of strategic value. One exception
is a study by Martens and Provost assessing whether data on the specific
transactions of a bank’s consumers can improve models for deciding what product offers
to make. The bank built models from data to decide whom to target with offers for
different products. The investigation examined a number of different types of data and
their effects on predictive performance. Sociodemographic data provide a substantial
ability to model the sort of consumers that are more likely to purchase one product or
another. However, sociodemographic data only go so far; after a certain volume of data,
no additional advantage is conferred. In contrast, detailed data on customers’ individual
(anonymized) transactions improve performance substantially over just using socio‐
demographic data. The relationship is clear and striking and—significantly, for the point
here—the predictive performance continues to improve as more data are used, increas‐
ing throughout the range investigated by Martens and Provost with no sign of abating.
This has an important implication: banks with bigger data assets may have an important
strategic advantage over their smaller competitors. If these trends generalize, and the
banks are able to apply sophisticated analytics, banks with bigger data assets should be
better able to identify the best customers for individual products. The net result will be
either increased adoption of the bank’s products, decreased cost of customer acquisition,
or both.
The idea of data as a strategic asset is certainly not limited to Capital One, nor even to
the banking industry. Amazon was able to gather data on online customers early, which
has created significant switching costs: consumers find value in the rankings and rec‐
ommendations that Amazon provides. Amazon therefore can retain customers more
easily, and can even charge a premium (Brynjolfsson & Smith). Harrah’s casinos
4. Of course, this is not a new phenomenon. Amazon and Google are well-established companies that get
tremendous value from their data assets.
famously invested in gathering and mining data on gamblers, and moved itself from a
small player in the casino business in the mid-1990s to the acquisition of Caesar’s
Entertainment in 2005 to become the world’s largest gambling company. The huge val‐
uation of Facebook has been credited to its vast and unique data assets (Sengupta, 2012),
including both information about individuals and their likes, as well as information
about the structure of the social network. Information about network structure has been
shown to be important for prediction, and remarkably helpful in
building models of who will buy certain products (Hill, Provost, & Volinsky). It
is clear that Facebook has a remarkable data asset; whether they have the right data
science strategies to take full advantage of it is an open question.
In the book we will discuss in more detail many of the fundamental concepts behind
these success stories, in exploring the principles of data mining and data-analytic thinking.
Data-Analytic Thinking
Analyzing case studies such as the churn problem improves our ability to approach
problems “data-analytically.” Promoting such a perspective is a primary goal of this
book. When faced with a business problem, you should be able to assess whether and
how data can improve performance. We will discuss a set of fundamental concepts and
principles that facilitate careful thinking. We will develop frameworks to structure the
analysis so that it can be done systematically.
As mentioned above, it is important to understand data science even if you never intend
to do it yourself, because data analysis is now so critical to business strategy. Businesses
increasingly are driven by data analytics, so there is great professional advantage in
being able to interact competently with and within such businesses. Understanding the
fundamental concepts, and having frameworks for organizing data-analytic thinking
not only will allow one to interact competently, but will help to envision opportunities
for improving data-driven decision-making, or to see data-oriented competitive threats.
Firms in many traditional industries are exploiting new and existing data resources for
competitive advantage. They employ data science teams to bring advanced technologies
to bear to increase revenue and to decrease costs. In addition, many new companies are
being developed with data mining as a key strategic component. Facebook and Twitter,
along with many other “Digital 100” companies (Business Insider, 2012), have high
valuations due primarily to data assets they are committed to capturing or creating.4
Increasingly, managers need to oversee analytics teams and analysis projects, marketers
have to organize and understand data-driven campaigns, venture capitalists must be
able to invest wisely in businesses with substantial data assets, and business strategists
must be able to devise plans that exploit data.
As a few examples, if a consultant presents a proposal to mine a data asset to improve
your business, you should be able to assess whether the proposal makes sense. If a
competitor announces a new data partnership, you should recognize when it may put
you at a strategic disadvantage. Or, let’s say you take a position with a venture firm and
your first project is to assess the potential for investing in an advertising company. The
founders present a convincing argument that they will realize significant value from a
unique body of data they will collect, and on that basis are arguing for a substantially
higher valuation. Is this reasonable? With an understanding of the fundamentals of data
science you should be able to devise a few probing questions to determine whether their
valuation arguments are plausible.
On a scale less grand, but probably more common, data analytics projects reach into all
business units. Employees throughout these units must interact with the data science
team. If these employees do not have a fundamental grounding in the principles of data-
analytic thinking, they will not really understand what is happening in the business.
This lack of understanding is much more damaging in data science projects than in
other technical projects, because the data science is supporting improved decision-
making. As we will describe in the next chapter, this requires a close interaction between
the data scientists and the business people responsible for the decision-making. Firms
where the business people do not understand what the data scientists are doing are at a
substantial disadvantage, because they waste time and effort or, worse, because they
ultimately make wrong decisions.
The need for managers with data-analytic skills
The consulting firm McKinsey and Company estimates that “there
will be a shortage of talent necessary for organizations to take advan‐
tage of big data. By 2018, the United States alone could face a short‐
age of 140,000 to 190,000 people with deep analytical skills as well as
1.5 million managers and analysts with the know-how to use the
analysis of big data to make effective decisions.” (Manyika). Why
10 times as many managers and analysts as those with deep analytical
skills? Surely data scientists aren’t so difficult to manage that
they need 10 managers! The reason is that a business can get lever‐
age from a data science team for making better decisions in multi‐
ple areas of the business. However, as McKinsey is pointing out, the
managers in those areas need to understand the fundamentals of data
science to effectively get that leverage.
This Book
This book concentrates on the fundamentals of data science and data mining. These are
a set of principles, concepts, and techniques that structure thinking and analysis. They
allow us to understand data science processes and methods surprisingly deeply, without
needing to focus in depth on the large number of specific data mining algorithms.
There are many good books covering data mining algorithms and techniques, from
practical guides to mathematical and statistical treatments. This book instead focuses
on the fundamental concepts and how they help us to think about problems where data
mining may be brought to bear. That doesn’t mean that we will ignore the data mining
techniques; many algorithms are exactly the embodiment of the basic concepts. But
with only a few exceptions we will not concentrate on the deep technical details of how
the techniques actually work; we will try to provide just enough detail so that you will
understand what the techniques do, and how they are based on the fundamental principles.
Data Mining and Data Science, Revisited
This book devotes a good deal of attention to the extraction of useful (nontrivial, hope‐
fully actionable) patterns or models from large bodies of data (Fayyad, Piatetsky-
Shapiro, & Smyth, 1996), and to the fundamental data science principles underlying
such data mining. In our churn-prediction example, we would like to take the data on
prior churn and extract patterns, for example patterns of behavior, that are useful—that
can help us to predict those customers who are more likely to leave in the future, or that
can help us to design better services.
The fundamental concepts of data science are drawn from many fields that study data
analytics. We introduce these concepts throughout the book, but let’s briefly discuss a
few now to get the basic flavor. We will elaborate on all of these and more in later chapters.
Fundamental concept: Extracting useful knowledge from data to solve business problems
can be treated systematically by following a process with reasonably well-defined stages.
The Cross Industry Standard Process for Data Mining, abbreviated CRISP-DM (CRISP-
DM Project, 2000), is one codification of this process. Keeping such a process in mind
provides a framework to structure our thinking about data analytics problems. For
example, in actual practice one repeatedly sees analytical “solutions” that are not based
on careful analysis of the problem or are not carefully evaluated. Structured thinking
about analytics emphasizes these often under-appreciated aspects of supporting
decision-making with data. Such structured thinking also contrasts critical points where
human creativity is necessary versus points where high-powered analytical tools can be
brought to bear.
Fundamental concept: From a large mass of data, information technology can be used to
find informative descriptive attributes of entities of interest. In our churn example, a
customer would be an entity of interest, and each customer might be described by a
large number of attributes, such as usage, customer service history, and many other
factors. Which of these actually gives us information on the customer’s likelihood of
leaving the company when her contract expires? How much information? Sometimes
this process is referred to roughly as finding variables that “correlate” with churn (we
will discuss this notion precisely). A business analyst may be able to hypothesize some
and test them, and there are tools to help facilitate this experimentation (see “Other
Analytics Techniques and Technologies” on page 35). Alternatively, the analyst could
apply information technology to automatically discover informative attributes—essen‐
tially doing large-scale automated experimentation. Further, as we will see, this concept
can be applied recursively to build models to predict churn based on multiple attributes.
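As a rough illustration of searching for informative attributes, the sketch below ranks a few yes/no attributes by how much the churn rate differs between customers who have the attribute and those who do not. The customer records and attribute names are invented, and the gap in churn rates is only a simple stand-in for the more precise measures of informativeness that the book develops later.

```python
# Hypothetical data: (heavy_user, called_support, has_family_plan, churned)
records = [
    (1, 1, 0, 1),
    (0, 1, 0, 1),
    (1, 1, 0, 1),
    (0, 0, 1, 0),
    (1, 0, 1, 0),
    (0, 0, 0, 0),
    (0, 0, 0, 1),
    (0, 1, 1, 0),
]
columns = ["heavy_user", "called_support", "has_family_plan", "churned"]
customers = [dict(zip(columns, row)) for row in records]


def churn_rate(rows):
    """Fraction of the given customers who churned (0.0 for an empty group)."""
    return sum(r["churned"] for r in rows) / len(rows) if rows else 0.0


def churn_rate_gap(rows, attribute):
    """Churn rate with the attribute minus churn rate without it."""
    with_attr = [r for r in rows if r[attribute] == 1]
    without_attr = [r for r in rows if r[attribute] == 0]
    return churn_rate(with_attr) - churn_rate(without_attr)


attributes = ["heavy_user", "called_support", "has_family_plan"]

# Rank attributes by the size of the gap, regardless of its direction.
ranked = sorted(
    attributes,
    key=lambda a: abs(churn_rate_gap(customers, a)),
    reverse=True,
)

for a in ranked:
    print(f"{a}: churn-rate gap {churn_rate_gap(customers, a):+.2f}")
```

A negative gap means customers with the attribute churn less often. Such an attribute can be just as informative as one with a positive gap, which is why the ranking uses the absolute value.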
Fundamental concept: If you look too hard at a set of data, you will find something—but
it might not generalize beyond the data you’re looking at. This is referred to as overfit‐
ting a dataset. Data mining techniques can be very powerful, and the need to detect and
avoid overfitting is one of the most important concepts to grasp when applying data
mining to real problems. The concept of overfitting and its avoidance permeates data
science processes, algorithms, and evaluation methods.
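A small simulation can make the danger concrete. In the sketch below, which is an illustrative setup rather than an example from the book, both the outcome and every candidate attribute are coin flips, so by construction there is nothing real to find. Searching many attributes still turns up one that appears to predict the outcome on the small dataset being examined, but the apparent pattern does not carry over to held-out data. The sample sizes, the number of attributes, and the random seed are arbitrary choices.

```python
import random


def random_bits(rng, n):
    """Return n independent fair coin flips as 0/1 integers."""
    return [rng.getrandbits(1) for _ in range(n)]


def agreement(feature, labels):
    """Fraction of examples where a binary feature equals the binary label."""
    return sum(f == y for f, y in zip(feature, labels)) / len(labels)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_train, n_holdout, n_features = 40, 2000, 500

# Labels and attributes are pure noise: no attribute is truly informative.
train_labels = random_bits(rng, n_train)
holdout_labels = random_bits(rng, n_holdout)
features = [
    (random_bits(rng, n_train), random_bits(rng, n_holdout))
    for _ in range(n_features)
]

# "Look hard": keep whichever attribute best matches the training labels.
best_train, best_holdout = max(
    features, key=lambda f: agreement(f[0], train_labels)
)

train_score = agreement(best_train, train_labels)
holdout_score = agreement(best_holdout, holdout_labels)

print(f"agreement on the data we searched: {train_score:.2f}")
print(f"agreement on held-out data:        {holdout_score:.2f}")
```

Because every attribute is noise, any gap between the two scores is purely an artifact of the search. Checking a discovered pattern against data that played no part in finding it is one basic way to detect overfitting.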
Fundamental concept: Formulating data mining solutions and evaluating the results
involves thinking carefully about the context in which they will be used. If our goal is the
extraction of potentially useful knowledge, how can we formulate what is useful? It
depends critically on the application in question. For our churn-management example,
how exactly are we going to use the patterns extracted from historical data? Should the
value of the customer be taken into account in addition to the likelihood of leaving?
More generally, does the pattern lead to better decisions than some reasonable alterna‐
tive? How well would one have done by chance? How well would one do with a smart
“default” alternative?
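One way to make the last two questions concrete is to compare a model against simple reference points. The sketch below is a hypothetical illustration with invented outcomes and predictions: it compares a model's accuracy with that of a "default" rule that always predicts the most common outcome. Accuracy is used only for simplicity; as the text stresses, the right measure depends on the application.

```python
from collections import Counter


def accuracy(predictions, actual):
    """Fraction of predictions that match the actual outcomes."""
    return sum(p == a for p, a in zip(predictions, actual)) / len(actual)


def majority_baseline(actual):
    """Predict the most common outcome for everyone.

    Simplification: the majority class is taken from the same outcomes being
    evaluated; in practice it would come from historical data.
    """
    most_common_label, _ = Counter(actual).most_common(1)[0]
    return [most_common_label] * len(actual)


# Invented example: 1 = customer churned, 0 = customer stayed.
actual = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
model_predictions = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

model_acc = accuracy(model_predictions, actual)
baseline_acc = accuracy(majority_baseline(actual), actual)

print(f"model accuracy:    {model_acc:.2f}")
print(f"baseline accuracy: {baseline_acc:.2f}")
```

In this invented case the model is right for 8 of 10 customers, which sounds respectable until one notices that always predicting "stays" does exactly as well. Whether the model is nevertheless useful would depend on factors such as which customers it identifies and how valuable they are.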
These are just four of the fundamental concepts of data science that we will explore. By
the end of the book, we will have discussed a dozen such fundamental concepts in detail,
and will have illustrated how they help us to structure data-analytic thinking and to
understand data mining techniques and algorithms, as well as data science applications,
quite generally.
Chemistry Is Not About Test Tubes: Data Science Versus
the Work of the Data Scientist
Before proceeding, we should briefly revisit the engineering side of data science. At the
time of this writing, discussions of data science commonly mention not just analytical
skills and techniques for understanding data but popular tools used. Definitions of data
5. OK: Hadoop is a widely used open source architecture for doing highly parallelizable computations. It is one
of the current “big data” technologies for processing massive datasets that exceed the capacity of relational
database systems. Hadoop is based on the MapReduce parallel processing framework introduced by Google.
scientists (and advertisements for positions) specify not just areas of expertise but also
specific programming languages and tools. It is common to see job advertisements
mentioning data mining techniques (e.g., random forests, support vector machines),
specific application areas (recommendation systems, ad placement optimization),
alongside popular software tools for processing big data (Hadoop, MongoDB). There
is often little distinction between the science and the technology for dealing with large datasets.
We must point out that data science, like computer science, is a young field. The par‐
ticular concerns of data science are fairly new and general principles are just beginning
to emerge. The state of data science may be likened to that of chemistry in the mid-19th
century, when theories and general principles were being formulated and the field was
largely experimental. Every good chemist had to be a competent lab technician. Simi‐
larly, it is hard to imagine a working data scientist who is not proficient with certain
sorts of software tools.
Having said this, this book focuses on the science and not on the technology. You will
not find instructions here on how best to run massive data mining jobs on Hadoop
clusters, or even what Hadoop is or why you might want to learn about it.5 We focus
here on the general principles of data science that have emerged. In 10 years’ time the
predominant technologies will likely have changed or advanced enough that a discus‐
sion here would be obsolete, while the general principles are the same as they were 20
years ago, and likely will change little over the coming decades.
Summary
This book is about the extraction of useful information and knowledge from large vol‐
umes of data, in order to improve business decision-making. As the massive collection
of data has spread through just about every industry sector and business unit, so have
the opportunities for mining the data. Underlying the extensive body of techniques for
mining data is a much smaller set of fundamental concepts comprising data science.
These concepts are general and encapsulate much of the essence of data mining and
business analytics.
Success in today’s data-oriented business environment requires being able to think about
how these fundamental concepts apply to particular business problems—to think data-
analytically. For example, in this chapter we discussed the principle that data should be
thought of as a business asset, and once we are thinking in this direction we start to ask
whether (and how much) we should invest in data. Thus, an understanding of these
fundamental concepts is important not only for data scientists themselves, but for any‐
one working with data scientists, employing data scientists, investing in data-heavy
ventures, or directing the application of analytics in an organization.
Thinking data-analytically is aided by conceptual frameworks discussed throughout the
book. For example, the automated extraction of patterns from data is a process with
well-defined stages, which are the subject of the next chapter. Understanding the process
and the stages helps to structure our data-analytic thinking, and to make it more sys‐
tematic and therefore less prone to errors and omissions.
There is convincing evidence that data-driven decision-making and big data technol‐
ogies substantially improve business performance. Data science supports data-driven
decision-making—and sometimes conducts such decision-making automatically—and
depends upon technologies for “big data” storage and engineering, but its principles are
separate. The data science principles we discuss in this book also differ from, and are
complementary to, other important technologies, such as statistical hypothesis testing
and database querying (which have their own books and classes). The next chapter
describes some of these differences in more detail.
Summary | 17
Business Problems and Data Science Solutions
Fundamental concepts: A set of canonical data mining tasks; The data mining process;
Supervised versus unsupervised data mining.
An important principle of data science is that data mining is a process with fairly well-
understood stages. Some involve the application of information technology, such as the
automated discovery and evaluation of patterns from data, while others mostly require
an analyst’s creativity, business knowledge, and common sense. Understanding the
whole process helps to structure data mining projects, so they are closer to systematic
analyses rather than heroic endeavors driven by chance and individual acumen.
Since the data mining process breaks up the overall task of finding patterns from data
into a set of well-defined subtasks, it is also useful for structuring discussions about data
science. In this book, we will use the process as an overarching framework for our
discussion. This chapter introduces the data mining process, but first we provide ad‐
ditional context by discussing common types of data mining tasks. Introducing these
allows us to be more concrete when presenting the overall process, as well as when
introducing other concepts in subsequent chapters.
We close the chapter by discussing a set of important business analytics subjects that
are not the focus of this book (but for which there are many other helpful books), such
as databases, data warehousing, and basic statistics.
From Business Problems to Data Mining Tasks
Each data-driven business decision-making problem is unique, comprising its own
combination of goals, desires, constraints, and even personalities. As with much engi‐
neering, though, there are sets of common tasks that underlie the business problems.
In collaboration with business stakeholders, data scientists decompose a business prob‐
lem into subtasks. The solutions to the subtasks can then be composed to solve the
overall problem. Some of these subtasks are unique to the particular business problem,
but others are common data mining tasks. For example, our telecommunications churn
problem is unique to MegaTelCo: there are specifics of the problem that are different
from churn problems of any other telecommunications firm. However, a subtask that
will likely be part of the solution to any churn problem is to estimate from historical
data the probability of a customer terminating her contract shortly after it has expired.
Once the idiosyncratic MegaTelCo data have been assembled into a particular format
(described in the next chapter), this probability estimation fits the mold of one very
common data mining task. We know a lot about solving the common data mining tasks,
both scientifically and practically. In later chapters, we also will provide data science
frameworks to help with the decomposition of business problems and with the re-
composition of the solutions to the subtasks.
A critical skill in data science is the ability to decompose a data-
analytics problem into pieces such that each piece matches a known
task for which tools are available. Recognizing familiar problems and
their solutions avoids wasting time and resources reinventing the
wheel. It also allows people to focus attention on more interesting
parts of the process that require human involvement—parts that have
not been automated, so human creativity and intelligence must come
into play.
Despite the large number of specific data mining algorithms developed over the years,
there are only a handful of fundamentally different types of tasks these algorithms ad‐
dress. It is worth defining these tasks clearly. The next several chapters will use the first
two (classification and regression) to illustrate several fundamental concepts. In what
follows, the term “an individual” will refer to an entity about which we have data, such
as a customer or a consumer, or it could be an inanimate entity such as a business. We
will make this notion more precise in Chapter 3. In many business analytics projects,
we want to find “correlations” between a particular variable describing an individual
and other variables. For example, in historical data we may know which customers left
the company after their contracts expired. We may want to find out which other variables
correlate with a customer leaving in the near future. Finding such correlations is the
most basic example of classification and regression tasks.
1. Classification and class probability estimation attempt to predict, for each individual
in a population, which of a (small) set of classes this individual belongs to. Usually
the classes are mutually exclusive. An example classification question would be:
“Among all the customers of MegaTelCo, which are likely to respond to a given
offer?” In this example the two classes could be called will respond and will not respond.
20 | Chapter 2: Business Problems and Data Science Solutions
For a classification task, a data mining procedure produces a model that, given a
new individual, determines which class that individual belongs to. A closely related
task is scoring or class probability estimation. A scoring model applied to an indi‐
vidual produces, instead of a class prediction, a score representing the probability
(or some other quantification of likelihood) that that individual belongs to each
class. In our customer response scenario, a scoring model would be able to evaluate
each individual customer and produce a score of how likely each is to respond to
the offer. Classification and scoring are very closely related; as we shall see, a model
that can do one can usually be modified to do the other.
2. Regression (“value estimation”) attempts to estimate or predict, for each individual,
the numerical value of some variable for that individual. An example regression
question would be: “How much will a given customer use the service?” The property
(variable) to be predicted here is service usage, and a model could be generated by
looking at other, similar individuals in the population and their historical usage. A
regression procedure produces a model that, given an individual, estimates the
value of the particular variable specific to that individual.
Regression is related to classification, but the two are different. Informally, classi‐
fication predicts whether something will happen, whereas regression predicts how
much something will happen. The difference will become clearer as the book progresses.
3. Similarity matching attempts to identify similar individuals based on data known
about them. Similarity matching can be used directly to find similar entities. For
example, IBM is interested in finding companies similar to their best business cus‐
tomers, in order to focus their sales force on the best opportunities. They use sim‐
ilarity matching based on “firmographic” data describing characteristics of the
companies. Similarity matching is the basis for one of the most popular methods
for making product recommendations (finding people who are similar to you in
terms of the products they have liked or have purchased). Similarity measures un‐
derlie certain solutions to other data mining tasks, such as classification, regression,
and clustering. We discuss similarity and its uses at length in Chapter 6.
4. Clustering attempts to group individuals in a population together by their similarity,
but not driven by any specific purpose. An example clustering question would be:
“Do our customers form natural groups or segments?” Clustering is useful in pre‐
liminary domain exploration to see which natural groups exist because these groups
in turn may suggest other data mining tasks or approaches. Clustering also is used
as input to decision-making processes focusing on questions such as: What products
should we offer or develop? How should our customer care teams (or sales teams) be
structured? We discuss clustering in depth in Chapter 6.
5. Co-occurrence grouping (also known as frequent itemset mining, association rule
discovery, and market-basket analysis) attempts to find associations between enti‐
ties based on transactions involving them. An example co-occurrence question
would be: What items are commonly purchased together? While clustering looks at
similarity between objects based on the objects’ attributes, co-occurrence grouping
considers similarity of objects based on their appearing together in transactions.
For example, analyzing purchase records from a supermarket may uncover that
ground meat is purchased together with hot sauce much more frequently than we
might expect. Deciding how to act upon this discovery might require some crea‐
tivity, but it could suggest a special promotion, product display, or combination
offer. Co-occurrence of products in purchases is a common type of grouping known
as market-basket analysis. Some recommendation systems also perform a type of
affinity grouping by finding, for example, pairs of books that are purchased fre‐
quently by the same people (“people who bought X also bought Y”).
The result of co-occurrence grouping is a description of items that occur together.
These descriptions usually include statistics on the frequency of the co-occurrence
and an estimate of how surprising it is.
6. Profiling (also known as behavior description) attempts to characterize the typical
behavior of an individual, group, or population. An example profiling question
would be: “What is the typical cell phone usage of this customer segment?” Behavior
may not have a simple description; profiling cell phone usage might require a com‐
plex description of night and weekend airtime averages, international usage, roam‐
ing charges, text minutes, and so on. Behavior can be described generally over an
entire population, or down to the level of small groups or even individuals.
Profiling is often used to establish behavioral norms for anomaly detection appli‐
cations such as fraud detection and monitoring for intrusions to computer systems
(such as someone breaking into your iTunes account). For example, if we know
what kind of purchases a person typically makes on a credit card, we can determine
whether a new charge on the card fits that profile or not. We can use the degree of
mismatch as a suspicion score and issue an alarm if it is too high.
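The suspicion-score idea can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the profile is reduced to the mean and standard deviation of a customer's past charge amounts, the score is the distance of the new charge from that mean in standard deviations, and the alarm threshold of 3 is an arbitrary choice. Real profiles would describe far more than amounts (merchants, locations, times of day).

```python
from statistics import mean, stdev


def suspicion_score(history, new_amount):
    """Distance of the new charge from the customer's mean, in standard deviations."""
    mu = mean(history)
    sigma = stdev(history)  # requires at least two past charges
    if sigma == 0:
        return 0.0 if new_amount == mu else float("inf")
    return abs(new_amount - mu) / sigma


def should_alarm(history, new_amount, threshold=3.0):
    """Issue an alarm when the mismatch with the profile is too high."""
    return suspicion_score(history, new_amount) > threshold


# Hypothetical past charges for one card.
past_charges = [20.0, 30.0, 25.0, 35.0, 40.0]
print(should_alarm(past_charges, 32.0))   # a typical charge
print(should_alarm(past_charges, 500.0))  # far outside the profile
```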
7. Link prediction attempts to predict connections between data items, usually by
suggesting that a link should exist, and possibly also estimating the strength of the
link. Link prediction is common in social networking systems: “Since you and Karen
share 10 friends, maybe you’d like to be Karen’s friend?” Estimating the strength of a
link matters in other settings as well. For example, for recommending movies to
customers one can think of a graph between customers and the movies they’ve
watched or rated. Within the graph, we search for links that do not exist between
customers and movies, but that we predict should exist and should be strong. These
links form the basis for recommendations.
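The "shared friends" heuristic from the social networking example can be sketched as follows. We assume the network is stored as a dictionary mapping each person to a set of friends, and we use the number of shared friends as a rough stand-in for link strength. The names and the tiny network are invented; production systems use considerably richer scoring.

```python
def common_friend_scores(friends, person):
    """Rank people who are not yet friends with `person` by shared-friend count."""
    scores = {}
    for other, other_friends in friends.items():
        if other == person or other in friends[person]:
            continue  # skip the person themselves and existing links
        shared = len(friends[person] & other_friends)
        if shared > 0:
            scores[other] = shared
    # Strongest predicted links first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical friendship graph (each link is listed in both directions).
friends = {
    "you": {"ann", "bob", "cat"},
    "karen": {"ann", "bob", "cat", "dan"},
    "dan": {"karen", "ann"},
    "ann": {"you", "karen", "dan"},
    "bob": {"you", "karen"},
    "cat": {"you", "karen"},
}
print(common_friend_scores(friends, "you"))
```

Here Karen is suggested first because she shares three friends with "you", while Dan shares only one.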
8. Data reduction attempts to take a large set of data and replace it with a smaller set
of data that contains much of the important information in the larger set. The
smaller dataset may be easier to deal with or to process. Moreover, the smaller
dataset may better reveal the information. For example, a massive dataset on consumer
movie-viewing preferences may be reduced to a much smaller dataset revealing
the consumer taste preferences that are latent in the viewing data (for example,
viewer genre preferences). Data reduction usually involves loss of information.
What is important is whether that loss is an acceptable trade-off for improved insight.
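As a deliberately simple illustration, the sketch below reduces a log of individual viewings to one small genre profile per viewer. It assumes, unrealistically, that a movie-to-genre table is already available; when taste dimensions are truly latent they have to be estimated from the data, for example with matrix factorization, which this sketch does not attempt. All names and data are hypothetical.

```python
from collections import defaultdict


def genre_profile(viewing_log, movie_genre):
    """Reduce (viewer, movie) records to each viewer's share of views per genre."""
    counts = defaultdict(lambda: defaultdict(int))
    for viewer, movie in viewing_log:
        counts[viewer][movie_genre[movie]] += 1

    profiles = {}
    for viewer, by_genre in counts.items():
        total = sum(by_genre.values())
        profiles[viewer] = {genre: c / total for genre, c in by_genre.items()}
    return profiles


# Hypothetical lookup table and viewing log.
movie_genre = {"Alien": "sci-fi", "Solaris": "sci-fi", "Amelie": "romance", "Heat": "crime"}
viewing_log = [
    ("v1", "Alien"), ("v1", "Solaris"), ("v1", "Heat"), ("v1", "Alien"),
    ("v2", "Amelie"), ("v2", "Heat"),
]
print(genre_profile(viewing_log, movie_genre))
```

The reduced profile no longer says which movies were watched or how many times, which is the loss of information; in exchange, each viewer's tastes are visible at a glance.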
9. Causal modeling attempts to help us understand what events or actions actually
influence others. For example, consider that we use predictive modeling to target
advertisements to consumers, and we observe that indeed the targeted consumers
purchase at a higher rate subsequent to having been targeted. Was this because the
advertisements influenced the consumers to purchase? Or did the predictive mod‐
els simply do a good job of identifying those consumers who would have purchased
anyway? Techniques for causal modeling include those involving a substantial in‐
vestment in data, such as randomized controlled experiments (e.g., so-called “A/B
tests”), as well as sophisticated methods for drawing causal conclusions from observational
data. Both experimental and observational methods for causal modeling
generally can be viewed as “counterfactual” analysis: they attempt to estimate the
difference between two situations that cannot both happen, one in which the “treatment”
event (e.g., showing an advertisement to a particular individual) occurs and one in
which it does not.
In all cases, a careful data scientist should state, along with any causal conclusion,
the exact assumptions that must hold for the conclusion to be valid
(there always are such assumptions, so always ask). When undertaking causal modeling,
a business needs to weigh the trade-off of increasing investment to reduce
the assumptions made, versus deciding that the conclusions are good enough given
the assumptions. Even in the most careful randomized, controlled experimentation,
assumptions are made that could render the causal conclusions invalid. The dis‐
covery of the “placebo effect” in medicine illustrates a notorious situation where an
assumption was overlooked in carefully designed randomized experimentation.
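The advertising question above can be approached with a randomized experiment, sketched here in Python. The sketch assumes that customers are split at random into a treatment group (shown the advertisement) and a control group (not shown it), and it estimates the effect as the difference in purchase rates. It also relies on assumptions that should be stated with any conclusion: the randomization was carried out correctly, customers do not influence one another, and the groups are large enough that the difference is not due to chance. The function names and data are hypothetical.

```python
import random


def assign_groups(customer_ids, seed=0):
    """Randomly split customers into equal-sized treatment and control groups."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(customer_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return set(shuffled[:half]), set(shuffled[half:])


def estimated_effect(purchased, treatment, control):
    """Difference in purchase rate between the treatment and control groups."""
    rate_treated = sum(purchased[c] for c in treatment) / len(treatment)
    rate_control = sum(purchased[c] for c in control) / len(control)
    return rate_treated - rate_control


# Hypothetical outcomes observed after the campaign.
purchased = {1: True, 2: True, 3: True, 4: False,
             5: True, 6: False, 7: False, 8: False}
print(estimated_effect(purchased, treatment={1, 2, 3, 4}, control={5, 6, 7, 8}))
```

Because assignment is random, the control group stands in for the counterfactual: what the treated customers would have done had they not seen the advertisement.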
Discussing all of these tasks in detail would fill multiple books. In this book, we present
a collection of the most fundamental data science principles—principles that together
underlie all of these types of tasks. We will illustrate the principles mainly using classi‐
fication, regression, similarity matching, and clustering, and will discuss others when
they provide important illustrations of the fundamental principles (toward the end of
the book).
Consider which of these types of tasks might fit our churn-prediction problem. Often,
practitioners formulate churn prediction as a problem of finding segments of customers
who are more or less likely to leave. This segmentation problem sounds like a classifi‐
cation problem, or possibly clustering, or even regression. To decide the best formula‐
tion, we first need to introduce some important distinctions.
Supervised Versus Unsupervised Methods
Consider two similar questions we might ask about a customer population. The first is:
“Do our customers naturally fall into different groups?” Here no specific purpose or
target has been specified for the grouping. When there is no such target, the data mining
problem is referred to as unsupervised. Contrast this with a slightly different question:
“Can we find groups of customers who have particularly high likelihoods of canceling
their service soon after their contracts expire?” Here there is a specific target defined:
will a customer leave when her contract expires? In this case, segmentation is being done
for a specific reason: to take action based on likelihood of churn. This is called a super‐
vised data mining problem.
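The contrast can be sketched on a single toy attribute. In the Python below, the unsupervised function groups customers by monthly usage alone, using a one-dimensional two-means clustering, while the supervised function uses the known churn outcome to choose the usage cutoff that best separates churners from non-churners. Both techniques are stand-ins chosen for brevity, and the data are invented.

```python
def two_means_1d(values, iterations=20):
    """Unsupervised: find two group centers using the values alone (no target)."""
    low_center, high_center = min(values), max(values)
    for _ in range(iterations):
        low = [v for v in values if abs(v - low_center) <= abs(v - high_center)]
        high = [v for v in values if abs(v - low_center) > abs(v - high_center)]
        if not high:  # all values identical, so there is nothing to split
            break
        low_center = sum(low) / len(low)
        high_center = sum(high) / len(high)
    return low_center, high_center


def best_churn_threshold(customers):
    """Supervised: pick the usage cutoff that best separates known churners."""
    minutes = sorted({m for m, _ in customers})
    candidates = [(a + b) / 2 for a, b in zip(minutes, minutes[1:])]

    def errors(threshold):
        # Rule being scored: predict churn when usage is below the threshold.
        return sum((m < threshold) != churned for m, churned in customers)

    return min(candidates, key=errors)


# Hypothetical customers: (monthly minutes, churned after contract expiry?)
customers = [(50, True), (60, True), (70, True), (80, False),
             (400, False), (420, False), (450, False)]
print(two_means_1d([m for m, _ in customers]))
print(best_churn_threshold(customers))
```

The unsupervised grouping places the 80-minute customer with the other low-usage customers, because it sees only usage. The supervised cutoff falls at 75 minutes, because the target tells it that this customer did not churn.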
A note on the terms: Supervised and unsupervised learning
The terms supervised and unsupervised were inherited from the field
of machine learning. Metaphorically, a teacher “supervises” the learn‐
er by carefully providing target information along with a set of ex‐
amples. An unsupervised learning task might involve the same set of
examples but would not include the target information. The learner
would be given no information about the purpose of the learning, but
would be left to form its own conclusions about what the examples
have in common.
The difference between