Paper Section 1: Reflection and Literature Review
Using Microsoft Word and professional APA format, prepare a professional written paper, supported by three sources of research, that details what you have learned from Chapters 3 and 4. This section of the paper should be a minimum of two pages.
Paper Section 2: Applied Learning Exercises
In this section of the professional paper, apply what you have learned from Chapters 3 and 4 to descriptively address and answer the problems below. Important note: do not type the actual written problems within the paper itself.
Paper Section 3: Conclusions
After addressing the problems, conclude your paper with details on how you will use this knowledge and these skills to support your professional and/or academic goals. This section of the paper should be around one page and should include a customized, original process flow or flow diagram that visually represents how you will apply this knowledge going forward. This process flow or flow diagram can be created using the "SmartArt" tools in Microsoft Word.
Paper Section 4: APA Reference Page
The three or more sources of research used to support this overall paper should be included in proper APA format in the final section of the paper.
Chapter 3:
Data Warehousing
Business Intelligence and Analytics: Systems for Decision Support
(10th Edition)
Learning Objectives
(Continued…)
Understand the basic definitions and concepts of data warehouses
Learn the different types of data warehousing architectures and their comparative advantages and disadvantages
Describe the processes used in developing and managing data warehouses
Explain data warehousing operations
…
Learning Objectives
Explain the role of data warehouses in decision support
Explain data integration and the extraction, transformation, and load (ETL) processes
Describe real-time (a.k.a. right-time and/or active) data warehousing
Understand data warehouse administration and security issues
Opening Vignette…
“Isle of Capri Casinos Is Winning with Enterprise Data Warehouse”
Company background
Problem description
Proposed solution
Results
Answer & discuss the case questions.
Questions for the Opening Vignette
Why is it important for Isle to have an EDW?
What were the business challenges or opportunities that Isle was facing?
What was the process Isle followed to realize EDW? Comment on the potential challenges Isle might have had going through the process of EDW development.
What were the benefits of implementing an EDW at Isle? Can you think of other potential benefits that were not listed in the case?
Do you think large enterprises like Isle in the gaming industry can succeed without a capable data warehouse/business intelligence infrastructure? Why or why not?
Main Data Warehousing Topics
DW definition
Characteristics of DW
Data Marts
ODS, EDW, Metadata
DW Framework
DW Architecture & ETL Process
DW Development
DW Issues
What is a Data Warehouse?
A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format
“The data warehouse is a collection of integrated, subject-oriented databases designed to support DSS functions, where each unit of data is non-volatile and relevant to some moment in time”
A Historical Perspective to Data Warehousing
Characteristics of DWs
Subject oriented
Integrated
Time-variant (time series)
Nonvolatile
Summarized
Not normalized
Metadata
Web based, relational/multi-dimensional
Client/server, real-time/right-time/active…
Data Mart
A departmental small-scale “DW” that stores only limited/relevant data
Dependent data mart
A subset that is created directly from a data warehouse
Independent data mart
A small data warehouse designed for a strategic business unit or a department
Other DW Components
Operational data stores (ODS)
A type of database often used as an interim area for a data warehouse
Oper marts – an operational data mart.
Enterprise data warehouse (EDW)
A data warehouse for the enterprise.
Metadata: Data about data.
In a data warehouse, metadata describe the contents of a data warehouse and the manner of its acquisition and use
Application Case 3.1
A Better Data Plan: Well-Established TELCOs Leverage Data Warehousing and Analytics to Stay on Top in a Competitive Industry
Questions for Discussion
What are the main challenges for TELCOs?
How can data warehousing and data analytics help TELCOs in overcoming their challenges?
Why do you think TELCOs are well suited to take full advantage of data analytics?
A Generic DW Framework
Application Case 3.2
Data Warehousing Helps MultiCare Save More Lives
Questions for Discussion
What do you think is the role of data warehousing in healthcare systems?
How did MultiCare use data warehousing to improve health outcomes?
DW Architecture
Three-tier architecture
Data acquisition software (back-end)
The data warehouse that contains the data & software
Client (front-end) software that allows users to access and analyze data from the warehouse
Two-tier architecture
The first two tiers of the three-tier architecture are combined into one
… sometimes there is only one tier?
DW Architectures
3-tier architecture
2-tier architecture
1-tier architecture (?)
Data Warehousing Architectures
Issues to consider when deciding which architecture to use:
Which database management system (DBMS) should be used?
Will parallel processing and/or partitioning be used?
Will data migration tools be used to load the data warehouse?
What tools will be used to support data retrieval and analysis?
A Web-Based DW Architecture
Alternative DW Architectures
Alternative DW Architectures
Each architecture has advantages and disadvantages!
Which architecture is the best?
Ten factors that potentially affect the architecture selection decision
Information interdependence between organizational units
Upper management’s information needs
Urgency of need for a data warehouse
Nature of end-user tasks
Constraints on resources
Strategic view of the data warehouse prior to implementation
Compatibility with existing systems
Perceived ability of the in-house IT staff
Technical issues
Social/political factors
Teradata Corp. DW Architecture
Data Integration and the Extraction, Transformation, and Load Process
ETL = Extract Transform Load
Data integration
Integration that comprises three major processes: data access, data federation, and change capture.
Enterprise application integration (EAI)
A technology that provides a vehicle for pushing data from source systems into a data warehouse
Enterprise information integration (EII)
An evolving tool space that promises real-time data integration from a variety of sources, such as relational or multidimensional databases, Web services, etc.
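To make the ETL idea concrete, here is a minimal sketch in Python (the file name, column names, and SQLite target are illustrative assumptions, not taken from the chapter): it extracts records from a CSV source, applies a simple cleansing/standardization transform, and loads the result into a relational table.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw sales records from a CSV source file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: cleanse and standardize records before loading."""
    cleaned = []
    for r in rows:
        if not r.get("customer_id"):                 # drop incomplete records
            continue
        cleaned.append({
            "customer_id": r["customer_id"].strip(),
            "region": r["region"].strip().upper(),   # standardize region codes
            "amount": round(float(r["amount"]), 2),  # normalize numeric format
        })
    return cleaned

def load(rows, db_path="warehouse.db"):
    """Load: insert the transformed records into the warehouse table."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS sales
                   (customer_id TEXT, region TEXT, amount REAL)""")
    con.executemany("INSERT INTO sales VALUES (:customer_id, :region, :amount)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("daily_sales.csv")))
```

Commercial ETL suites add metadata capture, scheduling, and broad source connectivity, which is what the tool-selection criteria on the next slide are probing for.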
Data Integration and the Extraction, Transformation, and Load Process
ETL (Extract, Transform, Load)
Issues affecting the purchase of an ETL tool
Data transformation tools are expensive
Data transformation tools may have a long learning curve
Important criteria in selecting an ETL tool
Ability to read from and write to an unlimited number of data sources/architectures
Automatic capturing and delivery of metadata
A history of conforming to open standards
An easy-to-use interface for the developer and the functional user
Data Warehouse Development
Data warehouse development approaches
Inmon Model: EDW approach (top-down)
Kimball Model: Data mart approach (bottom-up)
Which model is best?
Table 3.3 provides a comparative analysis between EDW and Data Mart approach
One alternative is the hosted warehouse
Application Case 3.5
Starwood Hotels & Resorts Manages Hotel Profitability with Data Warehousing
Questions for Discussion
How big and complex are the business operations of Starwood Hotels & Resorts?
How did Starwood Hotels & Resorts use data warehousing for better profitability?
What were the challenges, the proposed solution, and the obtained results?
Additional DW Considerations: Hosted Data Warehouses
Benefits:
Requires minimal investment in infrastructure
Frees up capacity on in-house systems
Frees up cash flow
Makes powerful solutions affordable
Enables solutions that provide for growth
Offers better quality equipment and software
Provides faster connections
… more in the book
Representation of Data in DW
Dimensional Modeling
A retrieval-based system that supports high-volume query access
Star schema
The most commonly used and the simplest style of dimensional modeling
Contains a fact table surrounded by and connected to several dimension tables
Snowflake schema
An extension of star schema where the diagram resembles a snowflake in shape
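As a hedged illustration of dimensional modeling (the table and column names are invented, not from the chapter), the sketch below uses Python's built-in sqlite3 module to declare a small star schema and run a typical star-join query.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # in-memory database, for illustration only

# Dimension tables describe the business context of each fact.
con.executescript("""
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    brand TEXT,
    category TEXT
);
CREATE TABLE dim_time (
    time_key INTEGER PRIMARY KEY,
    day TEXT,
    month TEXT,
    quarter TEXT
);
-- The fact table holds the measures plus foreign keys to the dimensions.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    time_key    INTEGER REFERENCES dim_time(time_key),
    units_sold  INTEGER,
    revenue     REAL
);
""")

# A typical star-join query: total revenue by brand and quarter.
query = """
SELECT p.brand, t.quarter, SUM(f.revenue) AS total_revenue
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_time    t ON f.time_key    = t.time_key
GROUP BY p.brand, t.quarter;
"""
print(con.execute(query).fetchall())
```

In a snowflake schema, dim_product would itself be normalized further, for example into separate brand and category tables.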
Multidimensionality
The ability to organize, present, and analyze data by several dimensions, such as sales by region, by product, by salesperson, and by time (four dimensions)
Multidimensional presentation
Dimensions: products, salespeople, market segments, business units, geographical locations, distribution channels, country, or industry
Measures: money, sales volume, head count, inventory profit, actual versus forecast
Time: daily, weekly, monthly, quarterly, or yearly
Star versus Snowflake Schema
Analysis of Data in DW
OLTP vs. OLAP…
OLTP (online transaction processing)
Capturing and storing data from ERP, CRM, POS, …
The main focus is on efficiency of routine tasks
OLAP (Online analytical processing)
Converting data into information for decision support
Data cubes, drill-down / rollup, slice & dice, …
Requesting ad hoc reports
Conducting statistical and other analyses
Developing multimedia-based applications
…more in the book
OLAP vs. OLTP
OLAP Operations
Slice – a subset of a multidimensional array
Dice – a slice on more than two dimensions
Drill Down/Up – navigating among levels of data ranging from the most summarized (up) to the most detailed (down)
Roll Up – computing all of the data relationships for one or more dimensions
Pivot – used to change the dimensional orientation of a report or an ad hoc query-page display
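A small, hypothetical pandas sketch of these operations on a toy sales cube (the data and column names are invented for illustration): slicing fixes one dimension, dicing restricts several, roll-up aggregates away a dimension, and pivoting reorients the report.

```python
import pandas as pd

# A tiny "cube": sales by product, region, and quarter.
sales = pd.DataFrame({
    "product": ["A", "A", "B", "B", "A", "B"],
    "region":  ["East", "West", "East", "West", "East", "West"],
    "quarter": ["Q1", "Q1", "Q1", "Q1", "Q2", "Q2"],
    "units":   [100, 80, 60, 90, 120, 70],
})

# Slice: fix one dimension (quarter = Q1).
q1_slice = sales[sales["quarter"] == "Q1"]

# Dice: restrict several dimensions at once.
dice = sales[(sales["quarter"] == "Q1") & (sales["region"] == "East")]

# Roll-up: aggregate away the region dimension (drill-down is the reverse).
rollup = sales.groupby(["product", "quarter"], as_index=False)["units"].sum()

# Pivot: change the dimensional orientation of the report.
pivot = sales.pivot_table(index="product", columns="quarter",
                          values="units", aggfunc="sum")

print(q1_slice, dice, rollup, pivot, sep="\n\n")
```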
OLAP: Slicing Operations on a Simple Three-Dimensional Data Cube
Variations of OLAP
Multidimensional OLAP (MOLAP)
OLAP implemented via a specialized multidimensional database (or data store) that summarizes transactions into multidimensional views ahead of time
Relational OLAP (ROLAP)
The implementation of an OLAP database on top of an existing relational database
Database OLAP and Web OLAP (DOLAP and WOLAP); Desktop OLAP,…
Technology Insights 3.2
Hands-On DW with MicroStrategy
A wealth of teaching and learning resources can be found at the TUN portal
www.teradatauniversitynetwork.com
The available resources include scripted demonstrations, assignments, white papers, etc.
DW Implementation Issues
Identification of data sources and governance
Data quality planning, data model design
ETL tool selection
Establishment of service-level agreements
Data transport, data conversion
Reconciliation process
End-user support
Political issues
… more in the book
Successful DW Implementation
Things to Avoid
Starting with the wrong sponsorship chain
Setting expectations that you cannot meet
Engaging in politically naive behavior
Loading the data warehouse with information just because it is available
Believing that data warehousing database design is the same as transactional database design
Choosing a data warehouse manager who is technology oriented rather than user oriented
… more in the book
Failure Factors in DW Projects
Lack of executive sponsorship
Unclear business objectives
Cultural issues being ignored
Change management
Unrealistic expectations
Inappropriate architecture
Low data quality / missing information
Loading data just because it is available
Massive DW and Scalability
Scalability
The main issues pertaining to scalability:
The amount of data in the warehouse
How quickly the warehouse is expected to grow
The number of concurrent users
The complexity of user queries
Good scalability means that queries and other data-access functions will grow linearly with the size of the warehouse
Real-Time/Active DW/BI
The use of real-time data updates for real-time analysis and real-time decision making is growing rapidly
Push vs. Pull (of data)
Concerns about real-time BI
Not all data should be updated continuously
Mismatch of reports generated minutes apart
May be cost prohibitive
May also be infeasible
Enterprise Decision Evolution and Data Warehousing
Real-Time/Active DW at Teradata
Traditional versus Active DW
DW Administration and Security
Data warehouse administrator (DWA)
DWA should…
have knowledge of high-performance software, hardware, and networking technologies
possess solid business knowledge and insight
be familiar with the decision-making processes so as to suitably design/maintain the data warehouse structure
possess excellent communications skills
Security and privacy are pressing issues in DW
Safeguarding the most valuable assets
Government regulations (HIPAA, etc.)
Must be explicitly planned and executed
The Future of DW
Sourcing…
Web, social media, and Big Data
Open source software
SaaS (software as a service)
Cloud computing
Infrastructure…
Columnar
Real-time DW
Data warehouse appliances
Data management practices/technologies
In-database & in-memory processing
New DBMS
Advanced analytics
…
Free of Charge DW Portal
for Teaching & Learning
www.TeradataStudentNetwork.com
Password to signup:
End of the Chapter
Questions, comments
[Figure: A Historical Perspective to Data Warehousing]
1970s: Mainframe computers; simple data entry; routine reporting; primitive database structures; Teradata incorporated
1980s: Mini/personal computers (PCs); business applications for PCs; distributed DBMS; relational DBMS; Teradata ships commercial DBs; "business data warehouse" coined
1990s: Centralized data storage; data warehousing was born; Inmon, Building the Data Warehouse; Kimball, The Data Warehouse Toolkit; EDW architecture design
2000s: Exponentially growing data and Web data; consolidation of the DW/BI industry; data warehouse appliances emerged; business intelligence popularized; data mining and predictive modeling; open source software; SaaS, PaaS, cloud computing
2010s: Big Data analytics; social media analytics; text and Web analytics; Hadoop, MapReduce, NoSQL; in-memory, in-database
[Figure: A Generic Data Warehousing Framework. Data sources (ERP, legacy, POS, other OLTP/Web, external data) flow through an ETL process (select, extract, transform, integrate, load) into an enterprise data warehouse with metadata and replication. Through an API/middleware access layer, data marts (engineering, marketing, finance, etc.), or a no-data-marts option, feed applications (visualization) such as data/text mining, custom-built applications, OLAP, dashboards, the Web, and routine business reporting.]
[Figure: Two-Tier and Three-Tier DW Architectures. Three-tier: Tier 1 client workstation, Tier 2 application server, Tier 3 database server. Two-tier: Tier 1 client workstation, Tier 2 combined application and database server.]
[Figure: A Web-Based Data Warehousing Architecture. A client (Web browser) connects over the Internet/intranet/extranet to a Web server; an application server works with the data warehouse and returns Web pages to the client.]
[Figure: Alternative Data Warehouse Architectures]
(a) Independent Data Marts Architecture: source systems, staging area (ETL), independent data marts (atomic/summarized data), end-user access and applications
(b) Data Mart Bus Architecture with Linked Dimensional Data Marts: source systems, staging area (ETL), dimensionalized data marts linked by conformed dimensions (atomic/summarized data), end-user access and applications
(c) Hub-and-Spoke Architecture (Corporate Information Factory): source systems, staging area (ETL), normalized relational warehouse (atomic data) with dependent data marts (summarized/some atomic data), end-user access and applications
(d) Centralized Data Warehouse Architecture: source systems, staging area (ETL), normalized relational warehouse (atomic/some summarized data), end-user access and applications
(e) Federated Architecture: existing data warehouses, data marts, and legacy systems; data mapping/metadata; logical/physical integration of common data elements; end-user access and applications
[Figure: The ETL Process. Data from packaged applications, legacy systems, other internal applications, and transient data sources are extracted, transformed, cleansed, and loaded into the data warehouse and data marts.]
[Figure: Star Schema versus Snowflake Schema]
Star schema: a SALES fact table (UnitsSold, …) connected to dimension tables TIME (Quarter, …), PEOPLE (Division, …), PRODUCT (Brand, …), and GEOGRAPHY (Country, …).
Snowflake schema: a SALES fact table (UnitsSold, …) connected to dimension tables DATE (Date, …), PEOPLE (Division, …), PRODUCT (LineItem, …), and STORE (LocID, …), which are further normalized into MONTH (M_Name, …), QUARTER (Q_Name, …), BRAND (Brand, …), CATEGORY (Category, …), and LOCATION (State, …).
[Figure: Slicing Operations on a Simple Three-Dimensional Data Cube. A three-dimensional OLAP cube with Product, Time, and Geography dimensions; cells are filled with numbers representing sales volumes. Slices show sales volumes of a specific product across variable time and region, of a specific region across variable time and products, and of a specific time across variable region and products.]
Chapter 4: Business Reporting, Visual Analytics, and Business Performance Management
Business Intelligence and Analytics: Systems for Decision Support
(10th Edition)
Learning Objectives
Define business reporting and understand its historical evolution
Recognize the need for and the power of business reporting
Understand the importance of data/information visualization
Learn different types of visualization techniques
Appreciate the value that visual analytics brings to BI/BA
…
(Continued…)
Learning Objectives
Know the capabilities and limitations of dashboards
Understand the nature of business performance management (BPM)
Learn the closed-loop BPM methodology
Describe the basic elements of balanced scorecards
Opening Vignette…
Self-Service Reporting Environment Saves Millions For Corporate Customers
Background
Business Challenge
Solution
Results
Answer & discuss the case questions.
Questions for the Opening Vignette
What does Travel and Transport, Inc., do?
Describe the complexity and the competitive nature of the business environment in which Travel and Transport, Inc., functions.
What were the main business challenges?
What was the solution? Implementation?
Why do you think a multi-vendor, multi-tool solution was implemented?
List and comment on three main benefits of the implemented system.
Business Reporting
Definitions and Concepts
Report = Information → Decision
Report?
Any communication artifact prepared to convey specific information
A report can fulfill many functions
To ensure proper departmental functioning
To provide information
To provide the results of an analysis
To persuade others to act
To create an organizational memory…
What is a Business Report?
A written document that contains information regarding business matters.
Purpose: to improve managerial decisions
Source: data from inside and outside the organization (via the use of ETL)
Format: text + tables + graphs/charts
Distribution: in-print, email, portal/intranet
Data acquisition → Information generation → Decision making → Process management
Business Reporting
Key to Any Successful Report
Clarity …
Brevity …
Completeness …
Correctness …
Report types (in terms of content and format)
Informal – a single letter or a memo
Formal – 10-100 pages; cover + summary + text
Short report – periodic, informative, investigative
Application Case 4.1
Delta Lloyd Group Ensures Accuracy and Efficiency in Financial Reporting
Questions for Discussion
How did Delta Lloyd Group improve accuracy and efficiency in financial reporting?
What were the challenges, the proposed solution, and the obtained results?
Why is it important for Delta Lloyd Group to comply with industry regulations?
Types of Business Reports
Metric Management Reports
Help manage business performance through metrics (SLAs for externals; KPIs for internals)
Can be used as part of Six Sigma and/or TQM
Dashboard-Type Reports
Graphical presentation of several performance indicators in a single page using dials/gauges
Balanced Scorecard-Type Reports
Include financial, customer, business process, and learning & growth indicators
Components of
Business Reporting Systems
Common characteristics
OLTP (online transaction processing)
ERP, POS, SCM, RFID, Sensors, Web, …
Data supply (volume, variety, velocity, …)
ETL
Data storage
Business logic
Publication medium
Assurance
Application Case 4.2
Flood of Paper Ends at FEMA
Questions for Discussion
What is FEMA and what does it do?
What are the main challenges that FEMA faces in delivering its services?
How did FEMA improve its inefficient reporting practices?
Data and Information Visualization
“The use of visual representations to explore, make sense of, and communicate data.”
Data visualization vs. Information visualization
Information = aggregation, summarization, and contextualization of data
Related to information graphics, scientific visualization, and statistical graphics
Often includes charts, graphs, illustrations, …
Application Case 4.3
Tableau Saves Blastrac Thousands of Dollars with Simplified Information Sharing
Questions for Discussion
How did Blastrac achieve significant cost saving in reporting and information sharing?
What were the challenge, the proposed solution, and the obtained results?
A Brief History of
Data Visualization
Data visualization can date back to the second century AD
Most developments have occurred in the last two and a half centuries
Until recently it was not recognized as a discipline
Today’s most popular visual forms date back a few centuries
The First Pie Chart
Created by William Playfair in 1801
William Playfair is widely credited as the inventor of the modern chart, having created the first line and pie charts.
Decimation of Napoleon’s Army During the 1812 Russian Campaign
Arguably the most popular multi-dimensional chart
By Charles Joseph Minard
A Brief History of Data Visualization
1900s –
more formal attitude toward visualization
focus on color, value scales, and labeling
Publication of the book Semiologie Graphique
2000s –
Emergence of the Internet as the medium for information visualization, raising visual literacy
Incorporate interaction, animation, 3D graphics-rendering, virtual worlds, real-time data feed
2010s and beyond – ?
Application Case 4.4
TIBCO Spotfire Provides Dana-Farber Cancer Institute with Unprecedented Insight into Cancer Vaccine Clinical Trials
Questions for Discussion
How did Dana-Farber Cancer Institute use TIBCO Spotfire to enhance information reporting and visualization?
What were the challenges, the proposed solution, and the obtained results?
Different Types of
Charts and Graphs
Which one to use? Where and when?
Specialized Charts and Graphs
Histogram
Gantt Chart
PERT Chart
Geographic Map
Bullet Graph
Heat Map / Tree Map
Highlight Table
Basic Charts and Graphs
Line Chart
Bar Chart
Pie Chart
Scatter Plot
Bubble Chart
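As a quick, hedged illustration of matching chart type to data (the numbers are made up), the matplotlib sketch below draws a line chart for a trend over time and a bar chart for a comparison across categories, two of the basic types listed above.

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [12.0, 13.5, 12.8, 15.2]           # a trend over time -> line chart
regions = ["East", "West", "North", "South"]
units = [240, 180, 210, 160]                 # a comparison across categories -> bar chart

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.plot(months, revenue, marker="o")        # line chart for the time series
ax1.set_title("Revenue trend (line chart)")
ax1.set_ylabel("Revenue ($M)")

ax2.bar(regions, units)                      # bar chart for the categorical comparison
ax2.set_title("Units by region (bar chart)")
ax2.set_ylabel("Units sold")

fig.tight_layout()
plt.show()
```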
A Gapminder Chart
Wealth and Health of Nations
See gapminder.org for interesting animated examples
The Emergence of Data Visualization And Visual Analytics
Magic Quadrant for Business Intelligence and Analytics Platforms (Source: Gartner.com)
Many data visualization companies are in the 4th quadrant
There is a move toward visualization
The Emergence of Data Visualization And Visual Analytics
Emergence of new companies
Tableau, Spotfire, QlikView, …
Increased focus by the big players
MicroStrategy improved Visual Insight
SAP launched Visual Intelligence
SAS launched Visual Analytics
Microsoft bolstered PowerPivot with Power View
IBM launched Cognos Insight
Oracle acquired Endeca
Visual Analytics
A recently coined term
Information visualization + predictive analytics
Information visualization
Descriptive, backward focused
“what happened” “what is happening”
Predictive analytics
Predictive, future focused
“what will happen” “why will it happen”
There is a strong move toward visual analytics
Visual Analytics by SAS Institute
SAS Visual Analytics Architecture
Big data + In memory + Massively parallel processing + ..
Visual Analytics by SAS Institute
At teradatauniversitynetwork.com, you can learn more about SAS VA and experiment with the tool
Performance Dashboards
Performance dashboards are commonly used in BPM software suites and BI platforms
Dashboards provide visual displays of important information that is consolidated and arranged on a single screen so that information can be digested at a single glance and easily drilled in and further explored
Performance Dashboards
Performance Dashboards
Dashboard design
The fundamental challenge of dashboard design is to display all the required information on a single screen, clearly and without distraction, in a manner that can be assimilated quickly
Three layers of information
Monitoring
Analysis
Management
Application Case 4.6
Saudi Telecom Company Excels with Information Visualization
Questions for Discussion
Why do you think telecommunication companies are among the prime users of information visualization tools?
How did Saudi Telecom use information visualization?
What were their challenges, the proposed solution, and the obtained results?
Application Case 4.6
Performance Dashboards
What to look for in a dashboard
Use of visual components to highlight data and exceptions that require action.
Transparent to the user, meaning that they require minimal training and are extremely easy to use
Combine data from a variety of systems into a single, summarized, unified view of the business
Enable drill-down or drill-through to underlying data sources or reports
Present a dynamic, real-world view with timely data
Require little coding to implement/deploy/maintain
Best Practices in
Dashboard Design
Benchmark KPIs with Industry Standards
Wrap the Metrics with Contextual Metadata
Validate the Design by a Usability Specialist
Prioritize and Rank Alerts and Exceptions
Enrich Dashboard with Business-User Comments
Present Information in Three Different Levels
Pick the Right Visual Constructs
Provide for Guided Analytics
Business Performance Management (BPM)
Business Performance Management (BPM) is…
A real-time system that alerts managers to potential opportunities, impending problems and threats, and then empowers them to react through models and collaboration.
Also called corporate performance management (CPM by Gartner Group), enterprise performance management (EPM by Oracle), strategic enterprise management (SEM by SAP)
Business Performance Management (BPM)
BPM refers to the business processes, methodologies, metrics, and technologies used by enterprises to measure, monitor, and manage business performance.
BPM encompasses three key components
A set of integrated, closed-loop management and analytic processes, supported by technology …
Tools for businesses to define strategic goals and then measure/manage performance against them
Methods and tools for monitoring key performance indicators (KPIs), linked to organizational strategy
A Closed-Loop Process to Optimize Business Performance
Process Steps
Strategize
Plan
Monitor/analyze
Act/adjust
Each with its own process steps
Strategize:
Where Do We Want to Go?
Strategic planning
Common tasks for the strategic planning process:
Conduct a current situation analysis
Determine the planning horizon
Conduct an environment scan
Identify critical success factors
Complete a gap analysis
Create a strategic vision
Develop a business strategy
Identify strategic objectives and goals
Plan:
How Do We Get There?
Operational planning
Operational plan: a plan that translates an organization's strategic objectives and goals into a set of well-defined tactics and initiatives, resource requirements, and expected results for some future time period (usually a year).
Operational planning can be
Tactic-centric (operationally focused)
Budget-centric plan (financially focused)
Monitor/Analyze:
How Are We Doing?
A comprehensive framework for monitoring performance should address two key issues:
What to monitor?
Critical success factors
Strategic goals and targets
…
How to monitor?
…
Act and Adjust: What Do We Need to Do Differently?
Success (or mere survival) depends on new projects: creating new products, entering new markets, acquiring new customers (or businesses), or streamlining some process.
Many new projects and ventures fail!
What is the chance of failure?
60% of Hollywood movies fail
70% of large IT projects fail, …
Application Case 4.7
IBM Cognos Express Helps Mace for Faster and Better Business Reporting
Questions for Discussion
What was the reporting challenge Mace was facing? Do you think this is an unusual challenge specific to Mace?
What was the approach for a potential solution?
What were the results obtained in the short term, and what were the future plans?
Performance Measurement
Performance measurement system
A system that assists managers in tracking the implementation of business strategy by comparing actual results against strategic goals and objectives
Comprises systematic comparative methods that indicate progress (or lack thereof) against goals
KPIs and Operational Metrics
Key performance indicator (KPI)
A KPI represents a strategic objective and metrics that measure performance against a goal
Distinguishing features of KPIs (see the sketch below):
Strategy
Targets
Ranges
Encodings
Time frames
Benchmarks
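A minimal sketch, assuming a hypothetical KPI with invented thresholds, of how a target, ranges, encodings, and a time frame might be captured and evaluated in Python:

```python
from dataclasses import dataclass

@dataclass
class KPI:
    name: str
    target: float           # the goal for the time frame
    red_below: float        # range boundaries used for traffic-light encoding
    yellow_below: float
    time_frame: str = "quarterly"

    def status(self, actual: float) -> str:
        """Encode actual performance against the defined ranges."""
        if actual < self.red_below:
            return "red"
        if actual < self.yellow_below:
            return "yellow"
        return "green"

new_customers = KPI(name="New customers", target=500,
                    red_below=350, yellow_below=450)
print(new_customers.status(420))   # -> 'yellow': below target but not critical
```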
Performance Measurement
Key performance indicator (KPI)
Outcome KPIs vs. Driver KPIs: outcome KPIs are lagging indicators (e.g., revenues); driver KPIs are leading indicators (e.g., sales leads)
Operational areas covered by driver KPIs
Customer performance
Service performance
Sales operations
Sales plan/forecast
Performance Measurement System
Balanced Scorecard (BSC)
A performance measurement and management methodology that helps translate an organization's financial, customer, internal process, and learning and growth objectives and targets into a set of actionable initiatives
"The Balanced Scorecard: Measures That Drive Performance" (HBR, 1992)
Balanced Scorecard
The meaning of “balance” ?
Six Sigma as a Performance Measurement System
Six Sigma
A performance management methodology aimed at reducing the number of defects in a business process to as close to zero defects per million opportunities (DPMO) as possible (see the calculation sketch below)
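The sigma level of a process is typically assessed through its defects per million opportunities; a small Python sketch of that standard DPMO calculation, using made-up counts:

```python
def dpmo(defects: int, units: int, opportunities_per_unit: int) -> float:
    """Defects per million opportunities: the core Six Sigma metric."""
    return defects / (units * opportunities_per_unit) * 1_000_000

# e.g., 25 defects found in 1,000 invoices, each with 8 error opportunities
print(round(dpmo(defects=25, units=1_000, opportunities_per_unit=8)))  # 3125
```

For reference, a true Six Sigma process corresponds to roughly 3.4 DPMO.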
Six Sigma as a Performance Measurement System
The DMAIC performance model
A closed-loop business improvement model that encompasses the steps of defining, measuring, analyzing, improving, and controlling a process
Lean Six Sigma
Lean manufacturing / lean production
Lean production versus Six Sigma?
Comparison of Balanced Scorecard and Six Sigma
Application Case 4.8
Expedia.com’s Customer Satisfaction Scorecard
Questions for Discussion
Who are the customers for Expedia.com? Why is customer satisfaction a very important part of their business?
How did Expedia.com improve customer satisfaction with scorecards?
What were the challenges, the proposed solution, and the obtained results?
End of the Chapter
Questions, comments
[Figure: The Business Reporting Process. Business functions generate transactional records and exception events (e.g., a machine failure) that are captured in data repositories; data are turned into information (reporting) for the decision maker, whose actions (decisions) feed back into the business functions.]
Nature of Data, Statistical Modeling, and Visualization
LEARNING OBJECTIVES
■ Understand the nature of data as they relate to business intelligence (BI) and analytics
■ Learn the methods used to make real-world data analytics ready
■ Describe statistical modeling and its relationship to business analytics
■ Learn about descriptive and inferential statistics
■ Define business reporting and understand its historical evolution
■ Understand the importance of data/information visualization
■ Learn different types of visualization techniques
■ Appreciate the value that visual analytics brings to business analytics
■ Know the capabilities and limitations of dashboards
In the age of Big Data and business analytics in which we are living, the importance of data is undeniable. Newly coined phrases such as "data are the oil," "data are the new bacon," "data are the new currency," and "data are the king" are further stressing the renewed importance of data. But the type of data we are talking about is obviously not just any data. The "garbage in garbage out—GIGO" concept/principle applies to today's Big Data phenomenon more so than any data definition that we have had in the past. To live up to their promise, value proposition, and ability to turn into insight, data have to be carefully created/identified, collected, integrated, cleaned, transformed, and properly contextualized for use in accurate and timely decision making.

Data are the main theme of this chapter. Accordingly, the chapter starts with a description of the nature of data: what they are, what different types and forms they can come in, and how they can be preprocessed and made ready for analytics. The first few sections of the chapter are dedicated to a deep yet necessary understanding and processing of data. The next few sections describe the statistical methods used to prepare data as input to produce both descriptive and inferential measures. Following the statistics sections are sections on reporting and visualization. A report is a communication artifact
prepared with the specific intention of converting data into information and knowledge
and relaying that information in an easily understandable/digestible format. Today, these
reports are visually oriented, often using colors and graphical icons that collectively look
like a dashboard to enhance the information content. Therefore, the latter part of the
chapter is dedicated to subsections that present the design, implementation, and best
practices regarding information visualization, storytelling, and information dashboards.
This chapter has the following sections:
3.1 Opening Vignette: SiriusXM Attracts and Engages a New Generation of Radio Consumers with Data-Driven Marketing
3.2 Nature of Data
3.3 Simple Taxonomy of Data
3.4 Art and Science of Data Preprocessing
3.5 Statistical Modeling for Business Analytics
3.6 Regression Modeling for Inferential Statistics
3.7 Business Reporting
3.8 Data Visualization
3.9 Different Types of Charts and Graphs
3.10 Emergence of Visual Analytics
3.11 Information Dashboards
3.1 OPENING VIGNETTE: SiriusXM Attracts and Engages a New Generation of Radio Consumers with Data-Driven Marketing
SiriusXM Radio is a satellite radio powerhouse, the largest radio company in the world
with $3.8 billion in annual revenues and a wide range of hugely popular music, sports,
news, talk, and entertainment stations. The company, which began broadcasting in 2001
with 50,000 subscribers, had 18.8 million subscribers in 2009, and today has nearly 29
million.
Much of SiriusXM’s growth to date is rooted in creative arrangements with automo-
bile manufacturers; today, nearly 70 percent of new cars are SiriusXM enabled. Yet the
company’s reach extends far beyond car radios in the United States to a worldwide pres-
ence on the Internet, on smartphones, and through other services and distribution chan-
nels, including SONOS, JetBlue, and Dish.
BUSINESS CHALLENGE
Despite these remarkable successes, changes in customer demographics, technology, and
a competitive landscape over the past few years have posed a new series of business
challenges and opportunities for SiriusXM. Here are some notable ones:
• As its market penetration among new cars increased, the demographics of its buy-
ers changed, skewing toward younger people with less discretionary income. How
could SiriusXM reach this new demographic?
• As new cars become used cars and change hands, how could SiriusXM identify,
engage, and convert second owners to paying customers?
• With its acquisition of the connected vehicle business from Agero—the leading pro-
vider of telematics in the U.S. car market—SiriusXM gained the ability to deliver its
service via satellite and wireless networks. How could it successfully use this acqui-
sition to capture new revenue streams?
PROPOSED SOLUTION: SHIFTING THE VISION TOWARD
DATA-DRIVEN MARKETING
SiriusXM recognized that to address these challenges, it would need to become a high-
performance, data-driven marketing organization. The company began making that shift
by establishing three fundamental tenets. First, personalized interactions—not mass
marketing—would rule the day. The company quickly understood that to conduct more
personalized marketing, it would have to draw on past history and interactions as well as
on a keen understanding of the consumer’s place in the subscription life cycle.
Second, to gain that understanding, information technology (IT) and its external tech-
nology partners would need the ability to deliver integrated data, advanced analytics,
integrated marketing platforms, and multichannel delivery systems.
And third, the company could not achieve its business goals without an integrated
and consistent point of view across the company. Most important, the technology and
business sides of SiriusXM would have to become true partners to best address the chal-
lenges involved in becoming a high-performance marketing organization that draws on
data-driven insights to speak directly with consumers in strikingly relevant ways.
Those data-driven insights, for example, would enable the company to differentiate
between consumers, owners, drivers, listeners, and account holders. The insights would help
SiriusXM to understand what other vehicles and services are part of each household and cre-
ate new opportunities for engagement. In addition, by constructing a coherent and reliable
360-degree view of all its consumers, SiriusXM could ensure that all messaging in all cam-
paigns and interactions would be tailored, relevant, and consistent across all channels. The
important bonus is that a more tailored and effective marketing is typically more cost-efficient.
IMPLEMENTATION: CREATING AND FOLLOWING THE PATH TO
HIGH-PERFORMANCE MARKETING
At the time of its decision to become a high-performance marketing company, SiriusXM
was working with a third-party marketing platform that did not have the capacity to
support SiriusXM’s ambitions. The company then made an important, forward-thinking
decision to bring its marketing capabilities in-house—and then carefully plotted what it
would need to do to make the transition successfully.
1. Improve data cleanliness through improved master data management and governance.
Although the company was understandably impatient to put ideas into action, data
hygiene was a necessary first step to create a reliable window into consumer behavior.
2. Bring marketing analytics in-house and expand the data warehouse to enable scale
and fully support integrated marketing analytics.
3. Develop new segmentation and scoring models to run in databases, eliminating la-
tency and data duplication.
4. Extend the integrated data warehouse to include marketing data and scoring, lever-
aging in-database analytics.
5. Adopt a marketing platform for campaign development.
6. Bring all of its capability together to deliver real-time offer management across all
marketing channels: call center, mobile, Web, and in-app.
Completing those steps meant finding the right technology partner. SiriusXM chose
Teradata because its strengths were a powerful match for the project and company.
Teradata offered the ability to:
• Consolidate data sources with an integrated data warehouse (IDW), advanced ana-
lytics, and powerful marketing applications.
• Solve data-latency issues.
• Significantly reduce data movement across multiple databases and applications.
• Seamlessly interact with applications and modules for all marketing areas.
• Scale and perform at very high levels for running campaigns and analytics in-database.
• Conduct real-time communications with customers.
• Provide operational support, either via the cloud or on premise.
This partnership has enabled SiriusXM to move smoothly and swiftly along its road
map, and the company is now in the midst of a transformational, five-year process.
After establishing its strong data governance process, SiriusXM began by implementing its
IDW, which allowed the company to quickly and reliably operationalize insights through-
out the organization.
Next, the company implemented Customer Interaction Manager—part of the Teradata
Integrated Marketing Cloud, which enables real-time, dialog-based customer interaction
across the full spectrum of digital and traditional communication channels. SiriusXM also
will incorporate the Teradata Digital Messaging Center.
Together, the suite of capabilities allows SiriusXM to handle direct communications
across multiple channels. This evolution will enable real-time offers, marketing messages,
and recommendations based on previous behavior.
In addition to streamlining the way it executes and optimizes outbound marketing
activities, SiriusXM is also taking control of its internal marketing operations with the
implementation of Marketing Resource Management, also part of the Teradata Integrated
Marketing Cloud. The solution will allow SiriusXM to streamline workflow, optimize mar-
keting resources, and drive efficiency through every penny of its marketing budget.
RESULTS: REAPING THE BENEFITS
As SiriusXM continues its evolution into a high-performance marketing organization, it already
is benefiting from its thoughtfully executed strategy. Household-level consumer insights and
a complete view of marketing touch strategy with each consumer enable SiriusXM to create
more targeted offers at the household, consumer, and device levels. By bringing the data
and marketing analytics capabilities in-house, SiriusXM achieved the following:
• Campaign results in near real time rather than four days, resulting in massive reduc-
tions in cycle times for campaigns and the analysts who support them.
• Closed-loop visibility, allowing the analysts to support multistage dialogs and
in-campaign modifications to increase campaign effectiveness.
• Real-time modeling and scoring to increase marketing intelligence and sharpen cam-
paign offers and responses at the speed of their business.
Finally, SiriusXM’s experience has reinforced the idea that high-performance market-
ing is a constantly evolving concept. The company has implemented both processes and
the technology that give it the capacity for continued and flexible growth.
QUESTIONS FOR THE OPENING VIGNETTE
1. What does SiriusXM do? In what type of market does it conduct its business?
2. What were its challenges? Comment on both technology and data-related
challenges.
3. What were the proposed solutions?
4. How did the company implement the proposed solutions? Did it face any
implementation challenges?
5. What were the results and benefits? Were they worth the effort/investment?
6. Can you think of other companies facing similar challenges that can potentially
benefit from similar data-driven marketing solutions?
WHAT WE CAN LEARN FROM THIS VIGNETTE
Striving to thrive in a fast-changing competitive industry, SiriusXM realized the need
for a new and improved marketing infrastructure (one that relies on data and analytics)
to effectively communicate its value proposition to its existing and potential custom-
ers. As is the case in any industry, success or mere survival in entertainment depends
on intelligently sensing the changing trends (likes and dislikes) and putting together
the right messages and policies to win new customers while retaining the existing
ones. The key is to create and manage successful marketing campaigns that resonate
with the target population of customers and have a close feedback loop to adjust and
modify the message to optimize the outcome. At the end, it was all about the preci-
sion in the way that SiriusXM conducted business: being proactive about the changing
nature of the clientele and creating and transmitting the right products and services
in a timely manner using a fact-based/data-driven holistic marketing strategy. Source
identification, source creation, access and collection, integration, cleaning, transforma-
tion, storage, and processing of relevant data played a critical role in SiriusXM’s suc-
cess in designing and implementing a marketing analytics strategy as is the case in any
analytically savvy successful company today, regardless of the industry in which they
are participating.
Sources: C. Quinn, "Data-Driven Marketing at SiriusXM," Teradata Articles & News, 2016, http://bigdata.teradata.com/US/Articles-News/Data-Driven-Marketing-At-SiriusXM/ (accessed August 2016); "SiriusXM Attracts and Engages a New Generation of Radio Consumers," http://assets.teradata.com/resourceCenter/downloads/CaseStudies/EB8597?processed=1.
3.2 NATURE OF DATA
Data are the main ingredient for any BI, data science, and business analytics initiative.
In fact, they can be viewed as the raw material for what popular decision technolo-
gies produce—information, insight, and knowledge. Without data, none of these tech-
nologies could exist and be popularized—although traditionally we have built analytics
models using expert knowledge and experience coupled with very little or no data at all;
however, those were the old days, and now data are of the essence. Once perceived as a
big challenge to collect, store, and manage, data today are widely considered among the
most valuable assets of an organization with the potential to create invaluable insight to
better understand customers, competitors, and the business processes.
Data can be small or very large. They can be structured (nicely organized for
computers to process), or they can be unstructured (e.g., text that is created for humans
and hence not readily understandable/consumable by computers). Data can come in
small batches continuously or can pour in all at once as a large batch. These are some
of the characteristics that define the inherent nature of today’s data, which we often
call Big Data. Even though these characteristics of data make them more challenging to
process and consume, they also make the data more valuable because the character-
istics enrich them beyond their conventional limits, allowing for the discovery of new
and novel knowledge. Traditional ways to manually collect data (via either surveys
or human-entered business transactions) mostly left their places to modern-day data
collection mechanisms that use Internet and/or sensor/radio frequency identification
(RFID)–based computerized networks. These automated data collection systems are not
only enabling us to collect more volumes of data but also enhancing the data quality
and integrity. Figure 3.1 illustrates a typical analytics continuum—data to analytics to
actionable information.
Although their value proposition is undeniable, to live up to their promise, data must
comply with some basic usability and quality metrics. Not all data are useful for all tasks,
obviously. That is, data must match with (have the coverage of the specifics for) the task
for which they are intended to be used. Even for a specific task, the relevant data on
hand need to comply with the quality and quantity requirements. Essentially, data have
to be analytics ready. So what does it mean to make data analytics ready? In addition to
its relevancy to the problem at hand and the quality/quantity requirements, it also has to
have a certain structure in place with key fields/variables with properly normalized val-
ues. Furthermore, there must be an organization-wide agreed-on definition for common
variables and subject matters (sometimes also called master data management), such as
how to define a customer (what characteristics of customers are used to produce a holis-
tic enough representation to analytics) and where in the business process the customer-
related information is captured, validated, stored, and updated.
Sometimes the representation of the data depends on the type of analytics being
employed. Predictive algorithms generally require a flat file with a target variable, so mak-
ing data analytics ready for prediction means that data sets must be transformed into
a flat-file format and made ready for ingestion into those predictive algorithms. It is also
imperative to match the data to the needs and wants of a specific predictive algorithm
and/or a software tool. For instance, neural network algorithms require all input variables
[FIGURE 3.1 A Data to Knowledge Continuum: business processes (ERP, CRM, SCM), Internet/social media (Facebook, Google+, LinkedIn, YouTube, Twitter, etc.), and machines/Internet of Things feed data storage and analytics, supported by cloud storage and computing and by data protection; models are built, tested, and validated to yield patterns, trends, and knowledge for applications and end users.]
to be numerically represented (even the nominal variables need to be converted into
pseudo binary numeric variables), whereas decision tree algorithms do not require such
numerical transformation—they can easily and natively handle a mix of nominal and nu-
meric variables.
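For example, here is a minimal pandas sketch (with an invented data set) of converting a nominal variable into the pseudo binary numeric columns that a neural network style algorithm expects, whereas a decision tree could consume the original column directly:

```python
import pandas as pd

# A tiny flat file with one nominal input, one numeric input, and a target variable.
df = pd.DataFrame({
    "contract_type": ["prepaid", "postpaid", "postpaid", "prepaid"],
    "monthly_usage": [120, 340, 280, 95],
    "churned":       [1, 0, 0, 1],
})

# Nominal -> pseudo binary (one-hot) columns for algorithms that need numeric input.
numeric_ready = pd.get_dummies(df, columns=["contract_type"], dtype=int)
print(numeric_ready)
# columns: monthly_usage, churned, contract_type_postpaid, contract_type_prepaid
```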
Analytics projects that overlook data-related tasks (some of the most critical steps)
often end up with the wrong answer for the right problem, and these unintentionally cre-
ated, seemingly good answers could lead to inaccurate and untimely decisions. Following
are some of the characteristics (metrics) that define the readiness level of data for an ana-
lytics study (Delen, 2015; Kock, McQueen, & Corner, 1997).
• Data source reliability. This term refers to the originality and appropriateness of
the storage medium where the data are obtained—answering the question of “Do
we have the right confidence and belief in this data source?” If at all possible, one
should always look for the original source/creator of the data to eliminate/mitigate
the possibilities of data misrepresentation and data transformation caused by the
mishandling of the data as they moved from the source to destination through one
or more steps and stops along the way. Every move of the data creates a chance to
unintentionally drop or reformat data items, which limits the integrity and perhaps
true accuracy of the data set.
• Data content accuracy. This means that data are correct and are a good match
for the analytics problem—answering the question of “Do we have the right data for
the job?” The data should represent what was intended or defined by the original
source of the data. For example, the customer’s contact information recorded within
a database should be the same as what the customer said it was. Data accuracy will
be covered in more detail in the following subsection.
• Data accessibility. This term means that the data are easily and readily obtainable—
answering the question of “Can we easily get to the data when we need to?” Access to
data can be tricky, especially if they are stored in more than one location and storage
medium and need to be merged/transformed while accessing and obtaining them. As traditional relational database management systems give way to (or coexist with) a new generation of data storage media such as data lakes and Hadoop infrastructure, the importance/criticality of data accessibility is also increasing.
• Data security and data privacy. Data security means that the data are secured
to allow only those people who have the authority and the need to access them and
to prevent anyone else from reaching them. The increasing popularity of educational degrees and certificate programs in information assurance is evidence of the criticality and the increasing urgency of this data quality metric. Any organization that
maintains health records for individual patients must have systems in place that not
only safeguard the data from unauthorized access (which is mandated by federal
laws such as the Health Insurance Portability and Accountability Act [HIPAA]) but
also accurately identify each patient to allow proper and timely access to records by
authorized users (Annas, 2003).
• Data richness. This means that all required data elements are included in the data
set. In essence, richness (or comprehensiveness) means that the available variables
portray a rich enough dimensionality of the underlying subject matter for an accurate
and worthy analytics study. It also means that the information content is complete
(or near complete) to build a predictive and/or prescriptive analytics model.
• Data consistency. This means that the data are accurately collected and com-
bined/merged. Consistent data represent the dimensional information (variables of
interest) coming from potentially disparate sources but pertaining to the same sub-
ject. If the data integration/merging is not done properly, some of the variables of different subjects could appear in the same record; for instance, two different patients' records could get mixed up while merging the demographic and clinical test result data records.
• Data currency/data timeliness. This means that the data should be up-to-date
(or as recent/new as they need to be) for a given analytics model. It also means
that the data are recorded at or near the time of the event or observation so that the
time delay–related misrepresentation (incorrectly remembering and encoding) of
the data is prevented. Because accurate analytics relies on accurate and timely data,
an essential characteristic of analytics-ready data is the timeliness of the creation
and access to data elements.
• Data granularity. This requires that the variables and data values be defined at
the lowest (or as low as required) level of detail for the intended use of the data.
If the data are aggregated, they might not contain the level of detail needed for
an analytics algorithm to learn how to discern different records/cases from one
another. For example, in a medical setting, numerical values for laboratory results
should be recorded to the appropriate decimal place as required for the meaning-
ful interpretation of test results and proper use of those values within an analytics
algorithm. Similarly, in the collection of demographic data, data elements should be
defined at a granular level to determine the differences in outcomes of care among
various subpopulations. One thing to remember is that data that are aggregated cannot be disaggregated (without access to the original source), but granular data can easily be aggregated into higher-level representations.
• Data validity. This is the term used to describe a match/mismatch between the
actual and expected data values of a given variable. As part of data definition,
the acceptable values or value ranges for each data element must be defined. For
example, a valid data definition related to gender would include three values: male,
female, and unknown.
• Data relevancy. This means that the variables in the data set are all relevant to
the study being conducted. Relevancy is not a dichotomous measure (whether a
variable is relevant or not); rather, it has a spectrum of relevancy from least relevant
to most relevant. Based on the analytics algorithms being used, one can choose to
include only the most relevant information (i.e., variables) or, if the algorithm is
capable enough to sort them out, can choose to include all the relevant ones regard-
less of their levels. One thing that analytics studies should avoid is including totally
irrelevant data into the model building because this could contaminate the informa-
tion for the algorithm, resulting in inaccurate and misleading results.
The above-listed characteristics are perhaps the most prevalent metrics to keep up with; however, true data quality and excellent analytics readiness for a specific application domain may require different levels of emphasis to be placed on these metric dimensions and perhaps the addition of more domain-specific ones to this collection. The following section will delve
into the nature of data from a taxonomical perspective to list and define different data
types as they relate to different analytics projects.
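These readiness metrics can be screened, at least partially, with a few lines of profiling code before any modeling begins. The following is a minimal Python sketch, assuming a hypothetical pandas DataFrame named customers with illustrative gender and record_timestamp columns; it spot-checks completeness (richness), validity against an agreed-on value set, currency/timeliness, and duplicate records as a rough consistency signal.

```python
import pandas as pd

def readiness_profile(df: pd.DataFrame, valid_gender=("male", "female", "unknown")):
    """Quick data-readiness screening: completeness, validity, currency, consistency."""
    report = {}
    # Completeness / richness: share of missing values per variable
    report["missing_rate"] = df.isna().mean().round(3).to_dict()
    # Validity: values outside the agreed-on domain (missing counts as invalid here)
    if "gender" in df.columns:
        report["invalid_gender_rows"] = int((~df["gender"].isin(valid_gender)).sum())
    # Currency/timeliness: age of the most recent record, in days
    if "record_timestamp" in df.columns:
        latest = pd.to_datetime(df["record_timestamp"]).max()
        report["days_since_latest_record"] = (pd.Timestamp.now() - latest).days
    # Consistency: duplicated records that may signal a bad merge
    report["duplicate_rows"] = int(df.duplicated().sum())
    return report

# Example usage with a toy data set
customers = pd.DataFrame({
    "gender": ["male", "female", "F", None],
    "record_timestamp": ["2023-01-05", "2023-03-01", "2023-02-10", "2023-03-02"],
})
print(readiness_profile(customers))
```

Such a profile does not replace domain review, but it surfaces obvious readiness problems early, before they contaminate model building.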
u SECTION 3.2 REVIEW QUESTIONS
1. How do you describe the importance of data in analytics? Can we think of analytics
without data?
2. Considering the new and broad definition of business analytics, what are the main
inputs and outputs to the analytics continuum?
3. Where do the data for business analytics come from?
4. In your opinion, what are the top three data-related challenges for better analytics?
5. What are the most common metrics that make for analytics-ready data?
3.3 SIMPLE TAXONOMY OF DATA
The term data (datum in singular form) refers to a collection of facts usually obtained
as the result of experiments, observations, transactions, or experiences. Data can consist
of numbers, letters, words, images, voice recordings, and so on, as measurements of a
set of variables (characteristics of the subject or event that we are interested in studying).
Data are often viewed as the lowest level of abstraction from which information and then
knowledge is derived.
At the highest level of abstraction, one can classify data as structured and unstruc-
tured (or semistructured). Unstructured data/semistructured data are composed of
any combination of textual, imagery, voice, and Web content. Unstructured/semistruc-
tured data will be covered in more detail in the text mining and Web mining chapter.
Structured data are what data mining algorithms use and can be classified as categori-
cal or numeric. The categorical data can be subdivided into nominal or ordinal data,
whereas numeric data can be subdivided into intervals or ratios. Figure 3.2 shows a
simple data taxonomy.
• Categorical data. These represent the labels of multiple classes used to divide
a variable into specific groups. Examples of categorical variables include race, sex,
age group, and educational level. Although the latter two variables can also be
considered in a numerical manner by using exact values for age and highest grade
completed, for example, it is often more informative to categorize such variables
into a relatively small number of ordered classes. The categorical data can also be
called discrete data, implying that they represent a finite number of values with no
continuum between them. Even if the values used for the categorical (or discrete)
variables are numeric, these numbers are nothing more than symbols and do not
imply the possibility of calculating fractional values.
• Nominal data. These contain measurements of simple codes assigned to objects
as labels, which are not measurements. For example, the variable marital status
can be generally categorized as (1) single, (2) married, and (3) divorced. Nominal
data can be represented with binomial values having two possible values (e.g., yes/no, true/false, good/bad) or multinomial values having three or more possible values (e.g., brown/green/blue, white/black/Latino/Asian, single/married/divorced).

FIGURE 3.2 A Simple Taxonomy of Data. [Figure: data in analytics are divided into structured data, which can be categorical (nominal, ordinal) or numerical (interval, ratio), and unstructured or semi-structured data, such as textual content, multimedia (image, audio, video), and XML/JSON.]
• Ordinal data. These contain codes assigned to objects or events as labels that
also represent the rank order among them. For example, the variable credit score
can be generally categorized as (1) low, (2) medium, or (3) high. Similar ordered
relationships can be seen in variables such as age group (i.e., child, young,
middle-aged, elderly) and educational level (i.e., high school, college, graduate
school). Some predictive analytic algorithms, such as ordinal multiple logistic
regression, take into account this additional rank-order information to build a
better classification model.
• Numeric data. These represent the numeric values of specific variables.
Examples of numerically valued variables include age, number of children, total
household income (in U.S. dollars), travel distance (in miles), and temperature (in
Fahrenheit degrees). Numeric values representing a variable can be integers (only
whole numbers) or real (also fractional numbers). The numeric data can also be
called continuous data, implying that the variable contains continuous measures
on a specific scale that allows insertion of interim values. Unlike a discrete vari-
able, which represents finite, countable data, a continuous variable represents scal-
able measurements, and it is possible for the data to contain an infinite number of
fractional values.
• Interval data. These are variables that can be measured on interval scales. A
common example of interval scale measurement is temperature on the Celsius scale.
In this particular scale, the unit of measurement is 1/100 of the difference between the melting temperature and the boiling temperature of water at atmospheric pressure; notably, such a scale has no absolute zero value.
• Ratio data. These include measurement variables commonly found in the physical
sciences and engineering. Mass, length, time, plane angle, energy, and electric charge
are examples of physical measures that are ratio scales. The scale type takes its name
from the fact that measurement is the estimation of the ratio between a magnitude
of a continuous quantity and a unit magnitude of the same kind. Informally, the dis-
tinguishing feature of a ratio scale is the possession of a nonarbitrary zero value. For
example, the Kelvin temperature scale has a nonarbitrary zero point of absolute zero,
which is equal to –273.15 degrees Celsius. This zero point is nonarbitrary because
the particles that comprise matter at this temperature have zero kinetic energy.
Other data types, including textual, spatial, imagery, video, and voice, need to be
converted into some form of categorical or numeric representation before they can be pro-
cessed by analytics methods (data mining algorithms; Delen, 2015). Data can also be classi-
fied as static or dynamic (i.e., temporal or time series).
Some predictive analytics (i.e., data mining) methods and machine-learning
algorithms are very selective about the type of data that they can handle. Providing them
with incompatible data types can lead to incorrect models or (more often) halt the model
development process. For example, some data mining methods need all the variables
(both input and output) represented as numerically valued variables (e.g., neural net-
works, support vector machines, logistic regression). The nominal or ordinal variables are
converted into numeric representations using some type of 1-of-N pseudo variables (e.g.,
a categorical variable with three unique values can be transformed into three pseudo
variables with binary values—1 or 0). Because this process could increase the number of
variables, one should be cautious about the effect of such representations, especially for
the categorical variables that have large numbers of unique values.
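A common way to produce such 1-of-N pseudo variables is one-hot encoding. The following is a minimal sketch in Python/pandas, using a hypothetical marital_status variable; it shows how a three-valued nominal variable expands into three binary columns, which also illustrates why categorical variables with many unique values can quickly inflate the number of inputs.

```python
import pandas as pd

# A nominal variable with three unique values
df = pd.DataFrame({"marital_status": ["single", "married", "divorced", "married"]})

# 1-of-N (one-hot) encoding: each unique value becomes a binary pseudo variable
encoded = pd.get_dummies(df, columns=["marital_status"], dtype=int)
print(encoded)
#    marital_status_divorced  marital_status_married  marital_status_single
# 0                        0                       0                      1
# 1                        0                       1                      0
# 2                        1                       0                      0
# 3                        0                       1                      0
```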
Similarly, some predictive analytics methods, such as ID3 (a classic decision tree
algorithm) and rough sets (a relatively new rule induction algorithm), need all the vari-
ables represented as categorically valued variables. Early versions of these methods re-
quired the user to discretize numeric variables into categorical representations before
they could be processed by the algorithm. The good news is that most implementa-
tions of these algorithms in widely available software tools accept a mix of numeric
and nominal variables and internally make the necessary conversions before process-
ing the data.
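When a tool does require categorical inputs, the discretization can also be done explicitly with range- or frequency-based binning. The sketch below uses a hypothetical credit_score column; the cut points and labels are illustrative assumptions. pd.cut performs range-based binning with fixed edges, and pd.qcut performs frequency-based binning into equal-sized groups.

```python
import pandas as pd

scores = pd.DataFrame({"credit_score": [505, 640, 700, 760, 820, 590]})

# Range-based binning: fixed cut points chosen from (assumed) domain knowledge
scores["score_range_bin"] = pd.cut(
    scores["credit_score"],
    bins=[300, 580, 670, 740, 850],
    labels=["poor", "fair", "good", "excellent"],
)

# Frequency-based binning: equal-sized groups (terciles here)
scores["score_freq_bin"] = pd.qcut(scores["credit_score"], q=3,
                                   labels=["low", "medium", "high"])
print(scores)
```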
Data come in many different variable types and representation schemas. Business
analytics tools are continuously improving in their ability to help data scientists in the
daunting task of data transformation and data representation so that the data requirements of specific predictive models and algorithms can be properly met. Application Case 3.1 illustrates a business scenario in which one of the largest telecommunication companies streamlined and used a wide variety of rich data sources to generate customer insights to prevent churn and to create new revenue sources.
Application Case 3.1 Verizon Answers the Call for Innovation: The Nation's Largest Network Provider uses Advanced Analytics to Bring the Future to its Customers

The Problem
In the ultra-competitive telecommunications indus-
try, staying relevant to consumers while finding new
sources of revenue is critical, especially since cur-
rent revenue sources are in decline.
For Fortune 13 powerhouse Verizon, the
secret weapon that catapulted the company into the
nation’s largest and most reliable network provider
is also guiding the business toward future success
(see the following figure for some numbers about
Verizon). The secret weapon? Data and analytics.
Because telecommunication companies are typically
rich in data, having the right analytics solution and
personnel in place can uncover critical insights that
benefit every area of the business.
The Backbone of the Company
Since its inception in 2000, Verizon has partnered
with Teradata to create a data and analytics archi-
tecture that drives innovation and science-based
decision making. The goal is to stay relevant to cus-
tomers while also identifying new business oppor-
tunities and making adjustments that result in more
cost-effective operations.
“With business intelligence, we help the
business identify new business opportunities or
make course corrections to operate the business
in a more cost-effective way,” said Grace Hwang,
executive director of Financial Performance &
Analytics, BI, for Verizon. “We support decision
makers with the most relevant information to
improve the competitive advantage of Verizon.”
By leveraging data and analytics, Verizon is
able to offer a reliable network, ensure customer
satisfaction, and develop products and services that
consumers want to buy.
“Our incubator of new products and services will
help bring the future to our customers,” Hwang said.
“We’re using our network to make breakthroughs in
interactive entertainment, digital media, the Internet of Things, and broadband services."

Verizon by the Numbers: The top-ranked wireless carrier in the U.S. has $131.6B in revenue, 177K employees, 1,700 retail locations, 112.1M retail connections, 106.5M postpaid customers, and 13M TV and internet subscribers.
Data Insights across Three Business Units
Verizon relies on advanced analytics that are exe-
cuted on the Teradata® Unified Data Architecture™
to support its business units. The analytics enable
Verizon to deliver on its promise to help customers
innovate their lifestyles and provide key insights to
support these three areas:
• Identify new revenue sources. Research and
development teams use data, analytics, and
strategic partnerships to test and develop
with the Internet of Things (IoT). The new
frontier in data is IoT, which will lead to new
revenues that in turn generate opportunities
for top-line growth. Smart cars, smart agricul-
ture, and smart IoT will all be part of this new
growth.
• Predict churn in the core mobile business.
Verizon has multiple use cases that demonstrate
how its advanced analytics enable laser-accurate
churn prediction—within a one to two percent
margin—in the mobile space. For a $131 billion
company, predicting churn with such precision
is significant. By recognizing specific patterns
in tablet data usage, Verizon can identify which
customers most often access their tablets, then
engage those who do not.
• Forecast mobile phone plans. Customer behav-
ioral analytics allow finance to better predict
earnings in fast-changing market conditions. The
U.S. wireless industry is moving from monthly
payments for both the phone and the service to
paying for the phone independently. This opens
up a new opportunity for Verizon to gain busi-
ness. The analytic environment helps Verizon
better predict churn with new plans and forecast
the impact of changes to pricing plans.
The analytics deliver what Verizon refers to as “hon-
est data” that inform various business units. “Our
mission is to be the honest voice and the indepen-
dent third-party opinion on the success or oppor-
tunities for improvement to the business,” Hwang
explains. “So my unit is viewed as the golden
source of information, and we come across with the
honest voice, and a lot of the business decisions are
through various rungs of course correction.”
Hwang adds that oftentimes, what forces a company to react is competitors effecting change in the marketplace, rather than the company
making the wrong decisions. “So we try to guide
the business through the best course of correc-
tion, wherever applicable, timely, so that we can
continue to deliver record-breaking results year
after year,” she said. “I have no doubt that the
business intelligence had led to such success in
the past.”
Disrupt and Innovate
Verizon leverages advanced analytics to optimize
marketing by sending the most relevant offers to
customers. At the same time, the company relies on
analytics to ensure they have the financial acumen
to stay number one in the U.S. mobile market. By
continuing to disrupt the industry with innovative
products and solutions, Verizon is positioned to
remain the wireless standard for the industry.
“We need the marketing vision and the sales
rigor to produce the most relevant offer to our
customers, and then at the same time we need
to have the finance rigor to ensure that whatever
we offer to the customer is also profitable to the
business so that we’re responsible to our share-
holders,” Hwang says.
In Summary—Executing the Seven Ps of
Modern Marketing
Telecommunications giant Verizon uses seven Ps
to drive its modern-day marketing efforts. The Ps,
when used in unison, help Verizon penetrate the
market in the way it predicted.
1. People: Understanding customers and their
needs to create the product.
2. Place: Where customers shop.
3. Product: The item that’s been manufactured
and is for sale.
4. Process: How customers get to the shop or place to buy the product.
5. Pricing: Working with promotions to get customers' attention.
6. Promo: Working with pricing to get customers' attention.
7. Physical evidence: The business intelligence that gives insights.

"The Aster and Hadoop environment allows us to explore things we suspect could be the reasons for breakdown in the seven Ps," says Grace Hwang, executive director of Financial Performance & Analytics, BI, for Verizon. "This goes back to providing the business value to our decision-makers. With each step in the seven Ps, we ought to be able to tell them where there are opportunities for improvement."

Questions for Case 3.1
1. What was the challenge Verizon was facing?
2. What was the data-driven solution proposed for Verizon's business units?
3. What were the results?

Source: Teradata Case Study "Verizon Answers the Call for Innovation" https://www.teradata.com/Resources/Case-Studies/Verizon-answers-the-call-for-innovation (accessed July 2018).
u SECTION 3.3 REVIEW QUESTIONS
1. What are data? How do data differ from information and knowledge?
2. What are the main categories of data? What types of data can we use for BI and
analytics?
3. Can we use the same data representation for all analytics models? Why, or
why not?
4. What is a 1-of-N data representation? Why and where is it used in analytics?
3.4 ART AND SCIENCE OF DATA PREPROCESSING
Data in their original form (i.e., the real-world data) are not usually ready to be used in
analytics tasks. They are often dirty, misaligned, overly complex, and inaccurate. A te-
dious and time-demanding process (so-called data preprocessing) is necessary to con-
vert the raw real-world data into a well-refined form for analytics algorithms (Kotsiantis,
Kanellopoulos, & Pintelas, 2006). Many analytics professionals would testify that the time
spent on data preprocessing (which is perhaps the least enjoyable phase in the whole
process) is significantly longer than the time spent on the rest of the analytics tasks
(the fun of analytics model building and assessment). Figure 3.3 shows the main steps in
the data preprocessing endeavor.
In the first step of data preprocessing, the relevant data are collected from the iden-
tified sources, the necessary records and variables are selected (based on an intimate
understanding of the data, the unnecessary information is filtered out), and the records
coming from multiple data sources are integrated/merged (again, using the intimate understanding of the data so that synonyms and homonyms are handled properly).
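In practice, this consolidation step often amounts to selecting the relevant columns from each source and merging the records on a shared key. The following is a minimal pandas sketch under assumed table and column names (a CRM extract and a billing extract keyed by customer_id); it only illustrates the select-and-integrate idea, not any particular system.

```python
import pandas as pd

# Two hypothetical source extracts sharing a customer_id key
crm = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "segment": ["retail", "retail", "business"],
    "signup_date": ["2021-05-01", "2022-01-15", "2020-11-30"],
})
billing = pd.DataFrame({
    "customer_id": [1, 2, 4],
    "monthly_charges": [39.9, 59.9, 120.0],
})

# Select only the variables judged relevant to the analytics question
crm_selected = crm[["customer_id", "segment"]]

# Integrate/merge the sources on the shared key; a left join keeps all CRM customers
consolidated = crm_selected.merge(billing, on="customer_id", how="left")
print(consolidated)
```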
In the second step of data preprocessing, the data are cleaned (this step is also
known as data scrubbing). Data in their original/raw/real-world form are usually
dirty (Hernández & Stolfo, 1998; Kim et al., 2003). In this phase, the anomalous values (e.g., missing, noisy, or inconsistent values) in the data set are identified and dealt with. In some cases, missing values are an anomaly
in the data set, in which case they need to be imputed (filled with a most probable
value) or ignored; in other cases, the missing values are a natural part of the data set
(e.g., the household income field is often left unanswered by people who are in the
top income tier). In this step, the analyst should also identify noisy values in the data
(i.e., the outliers) and smooth them out. In addition, inconsistencies (unusual values
within a variable) in the data should be handled using domain knowledge and/or
expert opinion.
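A minimal cleaning sketch along these lines is shown below for a hypothetical income variable: the missing value is imputed with the median, and outliers flagged with a simple 1.5 x IQR fence (one simple alternative to a mean-and-standard-deviation rule) are smoothed by capping rather than by deleting the records.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [42_000, 51_000, np.nan, 47_000, 1_250_000, 39_000]})

# Impute the missing value with a robust central value (the median)
df["income"] = df["income"].fillna(df["income"].median())

# Flag outliers with a simple 1.5 x IQR fence
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Smooth (cap) the outliers instead of dropping the whole record
df["income_cleaned"] = df["income"].clip(lower=lower, upper=upper)
print(df)
```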
In the third step of data preprocessing, the data are transformed for better process-
ing. For instance, in many cases, the data are normalized between a certain minimum
and maximum for all variables to mitigate the potential bias of one variable having
large numeric values (such as household income) dominating other variables (such as number of dependents or years in service, which could be more important) having smaller values.

FIGURE 3.3 Data Preprocessing Steps. [Figure: raw data from sources such as OLTP systems, legacy databases, Web data, and social data pass through data consolidation (collect, select, integrate data), data cleaning (impute values, reduce noise, eliminate duplicates), data transformation (normalize data, discretize data, create attributes), and data reduction (reduce dimension, reduce volume, balance data), with feedback between the steps, to produce well-formed data in a DW.]

Another transformation that takes place is discretization and/or aggregation. In some cases, the numeric variables are converted to categorical values (e.g., low,
medium, high); in other cases, a nominal variable’s unique value range is reduced to
a smaller set using concept hierarchies (e.g., as opposed to using the individual states
with 50 different values, one could choose to use several regions for a variable that
shows location) to have a data set that is more amenable to computer processing. Still,
in other cases, one might choose to create new variables based on the existing ones to
magnify the information found in a collection of variables in the data set. For instance,
in an organ transplantation data set, one might choose to use a single variable show-
ing the blood-type match (1: match, 0: no match) as opposed to separate multinomial values for the blood types of both the donor and the recipient. Such simplification could
increase the information content while reducing the complexity of the relationships in
the data.
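The sketch below illustrates two of these transformations on hypothetical columns: min-max normalization of a large-valued variable to the 0-to-1 range, and construction of a new binary blood_type_match attribute from assumed donor and recipient blood-type columns, in the spirit of the transplantation example.

```python
import pandas as pd

df = pd.DataFrame({
    "household_income": [35_000, 72_000, 120_000, 54_000],
    "donor_blood_type": ["A", "O", "B", "AB"],
    "recipient_blood_type": ["A", "A", "B", "O"],
})

# Min-max normalization: rescale a large-valued variable to the 0-1 range
col = df["household_income"]
df["household_income_norm"] = (col - col.min()) / (col.max() - col.min())

# Attribute construction: collapse two nominal variables into one binary match flag
df["blood_type_match"] = (df["donor_blood_type"] ==
                          df["recipient_blood_type"]).astype(int)
print(df)
```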
The final phase of data preprocessing is data reduction. Even though data scientists
(i.e., analytics professionals) like to have large data sets, too much data can also be a
problem. In the simplest sense, one can visualize the data commonly used in predictive
analytics projects as a flat file consisting of two dimensions: variables (the number of
columns) and cases/records (the number of rows). In some cases (e.g., image process-
ing and genome projects with complex microarray data), the number of variables can be
rather large, and the analyst must reduce the number to a manageable size. Because the
variables are treated as different dimensions that describe the phenomenon from differ-
ent perspectives, in predictive analytics and data mining, this process is commonly called
dimensional reduction (or variable selection). Even though there is not a single best
way to accomplish this task, one can use the findings from previously published litera-
ture; consult domain experts; run appropriate statistical tests (e.g., principal component
analysis or independent component analysis); and, more preferably, use a combination of
these techniques to successfully reduce the dimensions in the data into a more manage-
able and most relevant subset.
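As a concrete illustration of the statistical route, the following sketch applies principal component analysis with scikit-learn to a small synthetic matrix and keeps just enough components to explain 95 percent of the variance; the variable counts and the threshold are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 20))                             # 200 records, 20 variables
X[:, 10:] = X[:, :10] + 0.1 * rng.normal(size=(200, 10))   # make half of them redundant

# Standardize first so no variable dominates the components by scale alone
X_std = StandardScaler().fit_transform(X)

# Keep the smallest number of components explaining 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_std)

print(X.shape, "->", X_reduced.shape)
print("explained variance ratios:", pca.explained_variance_ratio_.round(3))
```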
With respect to the other dimension (i.e., the number of cases), some data sets can
include millions or billions of records. Even though computing power is increasing ex-
ponentially, processing such a large number of records may not be practical or feasible. In
such cases, one might need to sample a subset of the data for analysis. The underlying
assumption of sampling is that the subset of the data will contain all relevant patterns of
the complete data set. In a homogeneous data set, such an assumption could hold well,
but real-world data are hardly ever homogeneous. The analyst should be extremely careful
in selecting a subset of the data that reflects the essence of the complete data set and is
not specific to a subgroup or subcategory. The data are usually sorted on some variable,
and taking a section of the data from the top or bottom could lead to a biased data set
on specific values of the indexed variable; therefore, one should always randomly select the records for the sample set. For skewed data, straightforward random sampling might not be sufficient, and stratified sampling (in which the different subgroups in the data are proportionally represented in the sample data set) might be required. Speaking of skewed
data, it is a good practice to balance the highly skewed data by either oversampling the
less represented or undersampling the more represented classes. Research has shown
that balanced data sets tend to produce better prediction models than unbalanced ones
(Thammasiri et al., 2014).
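Both ideas can be sketched briefly with pandas and scikit-learn on a hypothetical binary target named churn: train_test_split with the stratify option draws a proportionally representative split, and a simple undersampling of the majority class produces a balanced training set. Dedicated libraries (e.g., imbalanced-learn) offer richer options; this hand-rolled version is only for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical skewed data set: ~90% "no churn" (0), ~10% "churn" (1)
df = pd.DataFrame({
    "tenure_months": range(1000),
    "churn": [1 if i % 10 == 0 else 0 for i in range(1000)],
})

# Stratified sampling: the 10% minority share is preserved in both splits
train, test = train_test_split(df, test_size=0.3, stratify=df["churn"],
                               random_state=7)

# Simple class balancing by undersampling the majority class
minority = train[train["churn"] == 1]
majority = train[train["churn"] == 0].sample(n=len(minority), random_state=7)
balanced_train = pd.concat([minority, majority]).sample(frac=1, random_state=7)

print(balanced_train["churn"].value_counts())
```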
The essence of data preprocessing is summarized in Table 3.1, which maps the
main phases (along with their problem descriptions) to a representative list of tasks and
algorithms.
TABLE 3.1 A Summary of Data Preprocessing Tasks and Potential Methods

Data consolidation
• Access and collect the data: SQL queries, software agents, Web services.
• Select and filter the data: Domain expertise, SQL queries, statistical tests.
• Integrate and unify the data: SQL queries, domain expertise, ontology-driven data mapping.

Data cleaning
• Handle missing values in the data: Fill in missing values (imputations) with most appropriate values (mean, median, min/max, mode, etc.); recode the missing values with a constant such as "ML"; remove the record of the missing value; do nothing.
• Identify and reduce noise in the data: Identify the outliers in data with simple statistical techniques (such as averages and standard deviations) or with cluster analysis; once identified, either remove the outliers or smooth them by using binning, regression, or simple averages.
• Find and eliminate erroneous data: Identify the erroneous values in data (other than outliers), such as odd values, inconsistent class labels, odd distributions; once identified, use domain expertise to correct the values or remove the records holding the erroneous values.

Data transformation
• Normalize the data: Reduce the range of values in each numerically valued variable to a standard range (e.g., 0 to 1 or -1 to +1) by using a variety of normalization or scaling techniques.
• Discretize or aggregate the data: If needed, convert the numeric variables into discrete representations using range- or frequency-based binning techniques; for categorical variables, reduce the number of values by applying proper concept hierarchies.
• Construct new attributes: Derive new and more informative variables from the existing ones using a wide range of mathematical functions (as simple as addition and multiplication or as complex as a hybrid combination of log transformations).

Data reduction
• Reduce number of attributes: Use principal component analysis, independent component analysis, chi-square testing, correlation analysis, and decision tree induction.
• Reduce number of records: Perform random sampling, stratified sampling, or expert-knowledge-driven purposeful sampling.
• Balance skewed data: Oversample the less represented or undersample the more represented classes.
It is almost impossible to overestimate the value proposition of data preprocessing. It is one of those time-demanding activities in which investment of time and effort pays off without a perceivable limit for diminishing returns. That is, the more resources
you invest in it, the more you will gain at the end. Application Case 3.2 illustrates an
interesting study that used raw, readily available academic data within an educational
organization to develop predictive models to better understand attrition and improve
freshman student retention in a large higher education institution. As the application case
clearly states, each and every data preprocessing task described in Table 3.1 was criti-
cal to the successful execution of the underlying analytics project, especially the task related to the balancing of the data set.
Application Case 3.2 Improving Student Retention with Data-Driven Analytics
Student attrition has become one of the most chal-
lenging problems for decision makers in academic
institutions. Despite all the programs and services
that are put in place to help retain students, accord-
ing to the U.S. Department of Education's National Center for Education Statistics (nces.ed.gov), only about half
of those who enter higher education actually earn
a bachelor’s degree. Enrollment management and
the retention of students have become a top priority
for administrators of colleges and universities in the
United States and other countries around the world.
High dropout of students usually results in overall
financial loss, lower graduation rates, and an inferior
school reputation in the eyes of all stakeholders. The
legislators and policy makers who oversee higher
education and allocate funds, the parents who pay
for their children’s education to prepare them for a
better future, and the students who make college
choices look for evidence of institutional quality and
reputation to guide their decision-making processes.
The Proposed Solution
To improve student retention, one should try to
understand the nontrivial reasons behind the attrition.
To be successful, one should also be able to accu-
rately identify those students who are at risk of drop-
ping out. So far, the vast majority of student attrition
research has been devoted to understanding this com-
plex, yet crucial, social phenomenon. Even though
these qualitative, behavioral, and survey-based stud-
ies revealed invaluable insight by developing and
testing a wide range of theories, they do not pro-
vide the much-needed instruments to accurately pre-
dict (and potentially improve) student attrition. The
project summarized in this case study proposed a
quantitative research approach in which the histori-
cal institutional data from student databases could
be used to develop models that are capable of pre-
dicting as well as explaining the institution-specific
nature of the attrition problem. The proposed analyt-
ics approach is shown in Figure 3.4.

FIGURE 3.4 An Analytics Approach to Predicting Student Attrition.
Although the concept is relatively new to
higher education, for more than a decade now,
similar problems in the field of marketing man-
agement have been studied using predictive data
analytics techniques under the name of “churn
analysis” where the purpose has been to identify
a sample among current customers to answer the
question, “Who among our current customers are
more likely to stop buying our products or services?”
so that some kind of mediation or intervention pro-
cess can be executed to retain them. Retaining exist-
ing customers is crucial because, as we all know
and as the related research has shown time and time
again, acquiring a new customer costs on an order
of magnitude more effort, time, and money than try-
ing to keep the one that you already have.
Data Are of the Essence
The data for this research project came from a sin-
gle institution (a comprehensive public university
located in the Midwest region of the United States)
with an average enrollment of 23,000 students, of
which roughly 80 percent are the residents of the
same state and roughly 19 percent of the students
are listed under some minority classification. There
is no significant difference between the two genders
in the enrollment numbers. The average freshman
student retention rate for the institution was about
80 percent, and the average six-year graduation rate
was about 60 percent.
The study used five years of institutional data,
which entailed 16,000+ students enrolled as fresh-
men, consolidated from various and diverse univer-
sity student databases. The data contained variables
related to students’ academic, financial, and demo-
graphic characteristics. After merging and convert-
ing the multidimensional student data into a single
flat file (a file with columns representing the vari-
ables and rows representing the student records),
the resultant file was assessed and preprocessed to
identify and remedy anomalies and unusable val-
ues. As an example, the study removed all inter-
national student records from the data set because
they did not contain information about some of the
most reputed predictors (e.g., high school GPA, SAT
scores). In the data transformation phase, some of
the variables were aggregated (e.g., “Major” and
“Concentration” variables aggregated to binary vari-
ables MajorDeclared and ConcentrationSpecified)
for better interpretation for the predictive model-
ing. In addition, some of the variables were used to
derive new variables (e.g., Earned/Registered ratio
and YearsAfterHighSchool).
Earned/Registered = EarnedHours / RegisteredHours

YearsAfterHighSchool = FreshmenEnrollmentYear - HighSchoolGraduationYear
The Earned/Registered ratio was created to have
a better representation of the students’ resiliency and
determination in their first semester of the freshman
year. Intuitively, one would expect greater values for
this variable to have a positive impact on retention/
persistence. The YearsAfterHighSchool was created
to measure the impact of the time taken between
high school graduation and initial college enrollment.
Intuitively, one would expect this variable to be a
contributor to the prediction of attrition. These aggre-
gations and derived variables are determined based
on a number of experiments conducted for a number
of logical hypotheses. The ones that made more com-
mon sense and the ones that led to better prediction
accuracy were kept in the final variable set. Reflecting
the true nature of the subpopulation (i.e., the fresh-
men students), the dependent variable (i.e., “Second
Fall Registered”) contained many more yes records
(~80%) than no records (~20%; see Figure 3.5).
Research shows that having such imbalanced
data has a negative impact on model performance.
Therefore, the study experimented with the options of
using and comparing the results of the same type of
models built with the original imbalanced data (biased
for the yes records) and the well-balanced data.
Modeling and Assessment
The study employed four popular classification meth-
ods (i.e., artificial neural networks, decision trees, sup-
port vector machines, and logistic regression) along
with three model ensemble techniques (i.e., bagging,
busting, and information fusion). The results obtained
from all model types were then compared to each
other using regular classification model assessment
methods (e.g., overall predictive accuracy, sensitivity,
specificity) on the holdout samples.
In machine-learning algorithms (some of which
will be covered in Chapter 4), sensitivity analysis
is a method for identifying the “cause-and-effect”
relationship between the inputs and outputs of
a given prediction model. The fundamental idea
behind sensitivity analysis is that it measures the
importance of predictor variables based on the
change in modeling performance that occurs if
a predictor variable is not included in the model.
This modeling and experimentation practice is also
called a leave-one-out assessment. Hence, the mea-
sure of sensitivity of a specific predictor variable is
the ratio of the error of the trained model without
the predictor variable to the error of the model that
includes this predictor variable. The more sensitive
the network is to a particular variable, the greater the performance decrease would be in the absence of that variable and therefore the greater the ratio of importance.

FIGURE 3.5 A Graphical Depiction of the Class Imbalance Problem. [Figure: input data with roughly 80% "No" (persisted) and 20% "Yes" (dropped out) records are used either as-is (imbalanced) or rebalanced to 50%/50%; models are built, tested, and validated on each version, and their confusion matrices and (accuracy, precision+, precision-) figures, such as (90%, 100%, 50%) versus (80%, 80%, 80%), are compared to ask which one is better.]

In addition to the predictive power of
the models, the study also conducted sensitivity
analyses to determine the relative importance of the
input variables.
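The leave-one-out logic described here can be approximated in a few lines. The sketch below is an illustrative stand-in (not the study's actual implementation), using a synthetic data set and a scikit-learn classifier: for each predictor, the model is retrained without that variable, and the ratio of the resulting error to the full-model error is reported, so larger ratios suggest more important variables.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=6, n_informative=3,
                           random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
base_error = 1 - model.score(X_va, y_va)  # error of the full model

# Leave-one-variable-out sensitivity: error ratio without vs. with the predictor
for j in range(X.shape[1]):
    keep = [k for k in range(X.shape[1]) if k != j]
    reduced = clone(model).fit(X_tr[:, keep], y_tr)
    reduced_error = 1 - reduced.score(X_va[:, keep], y_va)
    sensitivity = reduced_error / max(base_error, 1e-9)  # avoid division by zero
    print(f"variable {j}: sensitivity ratio = {sensitivity:.2f}")
```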
The Results
In the first set of experiments, the study used the
original imbalanced data set. Based on the 10-fold
cross-validation assessment results, the support vector
machines produced the best accuracy with an overall
prediction rate of 87.23 percent, and the decision tree
was the runner-up with an overall prediction rate of
87.16 percent, followed by artificial neural networks
and logistic regression with overall prediction rates
of 86.45 percent and 86.12 percent, respectively (see
Table 3.2). A careful examination of these results
reveals that the prediction accuracy for the “Yes” class
is significantly higher than the prediction accuracy of
the “No” class. In fact, all four model types predicted
the students who are likely to return for the second
year with better than 90 percent accuracy, but they did poorly on predicting the students who are
likely to drop out after the freshman year with less
than 50 percent accuracy. Because the prediction of
the “No” class is the main purpose of this study, less
than 50 percent accuracy for this class was deemed
not acceptable. Such a difference in prediction accu-
racy of the two classes can (and should) be attributed
to the imbalanced nature of the training data set (i.e.,
~80% “Yes” and ~20% “No” samples).
The next round of experiments used a well-
balanced data set in which the two classes are
represented nearly equally in counts. In realizing
this approach, the study took all samples from the
minority class (i.e., the “No” class herein), randomly
selected an equal number of samples from the major-
ity class (i.e., the “Yes” class herein), and repeated this
process 10 times to reduce potential bias of random
sampling. Each of these sampling processes resulted
in a data set of 7,000+ records, of which both class
labels (“Yes” and “No”) were equally represented.
Again, using a 10-fold cross-validation methodology,
the study developed and tested prediction models
for all four model types. The results of these experi-
ments are shown in Table 3.3. Based on the hold-
out sample results, support vector machines once
again generated the best overall prediction accuracy
with 81.18 percent followed by decision trees, artifi-
cial neural networks, and logistic regression with an
overall prediction accuracy of 80.65 percent, 79.85
percent, and 74.26 percent, respectively. As can be
seen in the per-class accuracy figures, the prediction
models did significantly better on predicting the “No”
class with the well-balanced data than they did with
the unbalanced data. Overall, the three machine-
learning techniques performed significantly better
than their statistical counterpart, logistic regression.
Next, another set of experiments was con-
ducted to assess the predictive ability of the three
ensemble models. Based on the 10-fold cross-
validation methodology, the information fusion–
type ensemble model produced the best results with
an overall prediction rate of 82.10 percent, followed
by the bagging-type ensembles and boosting-type
TABLE 3.2 Prediction Results for the Original/Unbalanced Data Set

                      ANN(MLP)           DT(C5)             SVM                LR
                      No       Yes       No       Yes       No       Yes       No       Yes
No                    1,494    384       1,518    304       1,478    255       1,438    376
Yes                   1,596    11,142    1,572    11,222    1,612    11,271    1,652    11,150
SUM                   3,090    11,526    3,090    11,526    3,090    11,526    3,090    11,526
Per-class accuracy    48.35%   96.67%    49.13%   97.36%    47.83%   97.79%    46.54%   96.74%
Overall accuracy      86.45%             87.16%             87.23%             86.12%

*ANN: Artificial Neural Network; MLP: Multi-Layer Perceptron; DT: Decision Tree; SVM: Support Vector Machine; LR: Logistic Regression
ensembles with overall prediction rates of 81.80 per-
cent and 80.21 percent, respectively (see Table 3.4).
Even though the prediction results are slightly better
than those of the individual models, ensembles are
known to produce more robust prediction systems
compared to a single-best prediction model (more
on this can be found in Chapter 4).
In addition to assessing the prediction accu-
racy for each model type, a sensitivity analysis was
also conducted using the developed prediction
models to identify the relative importance of the
independent variables (i.e., the predictors). In real-
izing the overall sensitivity analysis results, each of
the four individual model types generated its own
sensitivity measures, ranking all independent vari-
ables in a prioritized list. As expected, each model
type generated slightly different sensitivity rank-
ings of the independent variables. After collecting
all four sets of sensitivity numbers, the sensitivity
numbers are normalized and aggregated and plot-
ted in a horizontal bar chart (see Figure 3.6).
The Conclusions
The study showed that, given sufficient data with
the proper variables, data mining methods are
capable of predicting freshmen student attrition
with approximately 80 percent accuracy. Results
also showed that, regardless of the prediction
model employed, the balanced data set (compared
to unbalanced/original data set) produced better
prediction models for identifying the students who
are likely to drop out of the college prior to their
sophomore year. Among the four individual pre-
diction models used in this study, support vector
machines performed the best, followed by deci-
sion trees, neural networks, and logistic regres-
sion. From the usability standpoint, despite the
fact that support vector machines showed better
prediction results, one might choose to use deci-
sion trees, because compared to support vector
machines and neural networks, they portray a
more transparent model structure. Decision trees
TABLE 3.3 Prediction Results for the Balanced Data Set

                      ANN(MLP)          DT(C5)            SVM               LR
Confusion Matrix      No       Yes      No       Yes      No       Yes      No       Yes
No                    2,309    464      2,311    417      2,313    386      2,125    626
Yes                   781      2,626    779      2,673    777      2,704    965      2,464
SUM                   3,090    3,090    3,090    3,090    3,090    3,090    3,090    3,090
Per-class accuracy    74.72%   84.98%   74.79%   86.50%   74.85%   87.51%   68.77%   79.74%
Overall accuracy      79.85%            80.65%            81.18%            74.26%
TABLE 3.4 Prediction Results for the Three Ensemble Models

                      Boosting              Bagging               Information Fusion
                      (boosted trees)       (random forest)       (weighted average)
                      No        Yes         No        Yes         No        Yes
No                    2,242     375         2,327     362         2,335     351
Yes                   848       2,715       763       2,728       755       2,739
SUM                   3,090     3,090       3,090     3,090       3,090     3,090
Per-class accuracy    72.56%    87.86%      75.31%    88.28%      75.57%    88.64%
Overall accuracy      80.21%                81.80%                82.10%
explicitly show the reasoning process of different
predictions, providing a justification for a specific
outcome, whereas support vector machines and
artificial neural networks are mathematical models
that do not provide such a transparent view of
“how they do what they do.”
Questions for Case 3.2
1. What is student attrition, and why is it an impor-
tant problem in higher education?
2. What were the traditional methods to deal with
the attrition problem?
3. List and discuss the data-related challenges
within the context of this case study.
4. What was the proposed solution? What were the
results?
Sources: D. Thammasiri, D. Delen, P. Meesad, & N. Kasap, “A
Critical Assessment of Imbalanced Class Distribution Problem: The
Case of Predicting Freshmen Student Attrition,” Expert Systems with
Applications, 41(2), 2014, pp. 321–330; D. Delen, “A Comparative
Analysis of Machine Learning Techniques for Student Retention
Management,” Decision Support Systems, 49(4), 2010, pp. 498–506,
and “Predicting Student Attrition with Data Mining Methods,”
Journal of College Student Retention 13(1), 2011, pp. 17–35.
FIGURE 3.6 Sensitivity-Analysis-Based Variable Importance Results. [Figure: a horizontal bar chart showing the normalized, aggregated sensitivity-based importance of the predictors, with variables such as EarnedByRegistered, SpringStudentLoan, and FallGPA near the top of the chart and variables such as Age, Sex, and CLEPHours near the bottom.]
u SECTION 3.4 REVIEW QUESTIONS
1. Why are the original/raw data not readily usable by analytics tasks?
2. What are the main data preprocessing steps?
3. What does it mean to clean/scrub the data? What activities are performed in this
phase?
4. Why do we need data transformation? What are the commonly used data transforma-
tion tasks?
5. Data reduction can be applied to rows (sampling) and/or columns (variable selec-
tion). Which is more challenging?
3.5 STATISTICAL MODELING FOR BUSINESS ANALYTICS
Because of the increasing popularity of business analytics, the traditional statistical meth-
ods and underlying techniques are also regaining their attractiveness as enabling tools to
support evidence-based managerial decision making. Not only are they regaining atten-
tion and admiration, but this time, they are attracting business users in addition to statisti-
cians and analytics professionals.
Statistics (statistical methods and underlying techniques) is usually considered as
part of descriptive analytics (see Figure 3.7). Some of the statistical methods can also be
considered as part of predictive analytics, such as discriminant analysis, multiple regres-
sion, logistic regression, and k-means clustering. As shown in Figure 3.7, descriptive ana-
lytics has two main branches: statistics and online analytics processing (OLAP). OLAP
is the term used for analyzing, characterizing, and summarizing structured data stored in
organizational databases (often stored in a data warehouse or in a data mart) using cubes
(i.e., multidimensional data structures that are created to extract a subset of data values to
answer a specific business question). The OLAP branch of descriptive analytics has also
been called business intelligence. Statistics, on the other hand, helps to characterize the
data, either one variable at a time or multivariable, all together using either descriptive or
inferential methods.
Statistics—a collection of mathematical techniques to characterize and interpret
data—has been around for a very long time. Many methods and techniques have been
developed to address the needs of the end users and the unique characteristics of the
data being analyzed. Generally speaking, at the highest level, statistical methods can be
classified as either descriptive or inferential.

FIGURE 3.7 Relationship between Statistics and Descriptive Analytics. [Figure: business analytics comprises descriptive, predictive, and prescriptive analytics; descriptive analytics branches into OLAP and statistics, and statistics branches into descriptive and inferential methods.]

The main difference between descriptive and
inferential statistics is the data used in these methods—whereas descriptive statistics
is all about describing the sample data on hand, inferential statistics is about drawing
inferences or conclusions about the characteristics of the population. In this section, we
briefly describe descriptive statistics (because of the fact that it lays the foundation for,
and is the integral part of, descriptive analytics), and in the following section we cover
regression (both linear and logistic regression) as part of inferential statistics.
Descriptive Statistics for Descriptive Analytics
Descriptive statistics, as the name implies, describes the basic characteristics of the data at
hand, often one variable at a time. Using formulas and numerical aggregations, descrip-
tive statistics summarizes the data in such a way that often meaningful and easily under-
standable patterns emerge from the study. Although it is very useful in data analytics and
very popular among the statistical methods, descriptive statistics does not allow making
conclusions (or inferences) beyond the sample of the data being analyzed. That is, it is
simply a nice way to characterize and describe the data on hand without making conclu-
sions (inferences or extrapolations) regarding the population or related hypotheses we might have in mind.
In business analytics, descriptive statistics plays a critical role—it allows us to un-
derstand and explain/present our data in a meaningful manner using aggregated num-
bers, data tables, or charts/graphs. In essence, descriptive statistics helps us convert our
numbers and symbols into meaningful representations for anyone to understand and use.
Such an understanding helps not only business users in their decision-making processes
but also analytics professionals and data scientists to characterize and validate the data for
other more sophisticated analytics tasks. Descriptive statistics allows analysts to identify
data concentration, unusually large or small values (i.e., outliers), and unexpectedly distributed data values for numeric variables. Therefore, the methods in descriptive statistics can be classified as either measures of central tendency or measures of dispersion. In
the following section, we use a simple description and mathematical formulation/repre-
sentation of these measures. In mathematical representation, we will use x1, x2, . . . , xn to
represent individual values (observations) of the variable (measure) that we are interested
in characterizing.
Measures of Central Tendency (Also Called Measures of Location or Centrality)
Measures of centrality are the mathematical methods by which we estimate or describe
central positioning of a given variable of interest. A measure of central tendency is a
single numerical value that aims to describe a set of data by simply identifying or estimat-
ing the central position within the data. The mean (often called the arithmetic mean or
the simple average) is the most commonly used measure of central tendency. In addition
to mean, you could also see median or mode being used to describe the centrality of a
given variable. Although the mean, median, and mode are all valid measures of central
tendency, under different circumstances, one of these measures of centrality becomes
more appropriate than the others. What follows are short descriptions of these measures,
including how to calculate them mathematically and pointers on the circumstances in
which they are the most appropriate measure to use.
Arithmetic Mean
The arithmetic mean (or simply mean or average) is the sum of all the values/observa-
tions divided by the number of observations in the data set. It is by far the most popular
and most commonly used measure of central tendency. It is used with continuous or
discrete numeric data. For a given variable x, if we happen to have n values/observations
(x_1, x_2, ..., x_n), we can write the arithmetic mean of the data sample (\bar{x}, pronounced as "x-bar") as follows:

\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} \quad \text{or, equivalently,} \quad \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
The mean has several unique characteristics. For instance, the sum of the deviations (differences between the observations and the mean) above the mean is the same, in absolute value, as the sum of the deviations below the mean, balancing the values on either side of it.
That said, it does not suggest, however, that half the observations are above and the other
half are below the mean (a common misconception among those who do not know basic
statistics). Also, the mean is unique for every data set and is meaningful and calculable
for both interval- and ratio-type numeric data. One major downside is that the mean can
be affected by outliers (observations that are considerably larger or smaller than the rest
of the data points). Outliers can pull the mean toward their direction and, hence, bias the
centrality representation. Therefore, if there are outliers or if the data are erratically dis-
persed and skewed, one should either avoid using the mean as the measure of centrality
or augment it with other central tendency measures, such as median and mode.
Median
The median is the measure of center value in a given data set. It is the number in the
middle of a given set of data that has been arranged/sorted in order of magnitude (either
ascending or descending). If the number of observations is an odd number, identifying
the median is very easy—just sort the observations based on their values and pick the
value right in the middle. If the number of observations is an even number, identify the
two middle values, and then take the simple average of these two values. The median is
meaningful and calculable for ratio, interval, and ordinal data types. Once determined,
one-half of the data points in the data are above and the other half are below the median. In contrast to the mean, the median is not affected by outliers or skewed data.
Mode
The mode is the observation that occurs most frequently (the most frequent value in our
data set). On a histogram, it represents the highest bar in a bar chart, and, hence, it can
be considered as the most popular option/value. The mode is most useful for data sets
that contain a relatively small number of unique values. That is, it could be useless if the
data have too many unique values (as is the case in many engineering measurements
that capture high precision with a large number of decimal places), leaving each value with a frequency of only one or a very small count. Although it is a
useful measure (especially for nominal data), mode is not a very good representation of
centrality, and therefore, it should not be used as the only measure of central tendency
for a given data set.
In summary, which central tendency measure is the best? Although there is not a
clear answer to this question, here are a few hints—use the mean when the data are not
prone to outliers and there is no significant level of skewness; use the median when the
data have outliers and/or it is ordinal in nature; use the mode when the data are nominal.
Perhaps the best practice is to use all three together so that the central tendency of the
data set can be captured and represented from three perspectives. Because “average” is such a familiar and frequently used concept in everyday life, managers (as well as some scientists and journalists) often use centrality measures (especially the mean) inappropriately when other statistical information should be considered along with the centrality. It is a better practice to present descriptive statistics as a
package—a combination of centrality and dispersion measures—as opposed to a single
measure such as mean.
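The calculations above are straightforward to reproduce in software. The following is a minimal sketch in Python (not from this chapter's examples); the data values are made up, with one deliberately large observation to show how the mean reacts to an outlier while the median and mode do not.

# Central tendency measures with Python's standard library; the data are hypothetical.
import statistics

data = [23, 25, 25, 27, 29, 31, 31, 31, 35, 80]  # the 80 acts as an outlier

print("Mean  :", statistics.mean(data))    # pulled upward by the outlier
print("Median:", statistics.median(data))  # middle value; robust to the outlier
print("Mode  :", statistics.mode(data))    # most frequent value (31)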
Measures of Dispersion (Also Called Measures of Spread
or Decentrality)
Measures of dispersion are the mathematical methods used to estimate or describe the
degree of variation in a given variable of interest. They represent the numerical spread
(compactness or lack thereof) of a given data set. To describe this dispersion, a number
of statistical measures have been developed; the most notable ones are range, variance, and
standard deviation (and also quartiles and absolute deviation). One of the main reasons
why the measures of dispersion/spread of data values are important is that they provide a framework within which we can judge the central tendency—an indication of how well the mean (or another centrality measure) represents the sample data. If
the dispersion of values in the data set is large, the mean is not deemed to be a very good
representation of the data. This is because a large dispersion measure indicates large dif-
ferences between individual scores. Also, in research, it is often perceived as a positive
sign to see a small variation within each data sample, as it may indicate homogeneity,
similarity, and robustness within the collected data.
Range
The range is perhaps the simplest measure of dispersion. It is the difference between
the largest and the smallest values in a given data set (i.e., variables). So we calculate
range by simply identifying the smallest value in the data set (minimum), identifying the
largest value in the data set (maximum), and calculating the difference between them
(range = maximum – minimum).
Variance
A more comprehensive and sophisticated measure of dispersion is the variance. It is
a method used to calculate the deviation of all data points in a given data set from the
mean. The larger the variance, the more the data are spread out from the mean and the
more variability one can observe in the data sample. To prevent the offsetting of negative
and positive differences, the variance takes into account the square of the distances from
the mean. The formula for a data sample can be written as
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}

where n is the number of samples, x̄ is the mean of the sample, and xi is the ith value in the data set. The larger values of variance indicate more dispersion, whereas
smaller values indicate compression in the overall data set. Because the differences are squared, larger deviations from the mean contribute disproportionately to the value of the variance. For the same reason, the variance is expressed in squared units and is therefore hard to interpret directly (instead of a dollar difference, you are given a squared-dollar difference). Therefore, instead of variance, in
many business applications, we use a more meaningful dispersion measure, called
standard deviation.
Standard Deviation
The standard deviation is also a measure of the spread of values within a set of data.
It is calculated by simply taking the square root of the variance. The
following formula shows the calculation of standard deviation from a given sample of
data points.
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
Mean Absolute Deviation
In addition to variance and standard deviation, sometimes we also use mean absolute
deviation to measure dispersion in a data set. It is a simpler way to calculate the overall
deviation from the mean. Specifically, the mean absolute deviation is calculated by taking the absolute values of the differences between each data point and the mean and then averaging them. This process provides a measure of spread without being specific
about the data point being lower or higher than the mean. The following formula shows
the calculation of the mean absolute deviation:
MAD = \frac{\sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert}{n}
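As a hedged illustration (not part of the original text), the dispersion measures described above can be computed with a few lines of plain Python; the sample values are hypothetical and chosen only to keep the arithmetic visible.

# Range, sample variance, standard deviation, and mean absolute deviation.
data = [23, 25, 25, 27, 29, 31, 31, 31, 35, 80]   # hypothetical sample
n = len(data)
mean = sum(data) / n

data_range = max(data) - min(data)                        # range = maximum - minimum
variance = sum((x - mean) ** 2 for x in data) / (n - 1)   # sample variance, s squared
std_dev = variance ** 0.5                                 # standard deviation, s
mad = sum(abs(x - mean) for x in data) / n                # mean absolute deviation

print(data_range, variance, std_dev, mad)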
Quartiles and Interquartile Range
Quartiles help us identify spread within a subset of the data. A quartile is a quarter of
the number of data points given in a data set. Quartiles are determined by first sorting
the data and then splitting the sorted data into four disjoint smaller data sets. Quartiles
are a useful measure of dispersion because they are much less affected by outliers or skewness than the equivalent measures computed over the whole data set. Quartiles are often reported along with the median as the best choice of measures of dispersion and central tendency, respectively, when dealing with skewed data and/or data with outliers.
A common way of expressing quartiles is as an interquartile range, which describes the
difference between the third quartile (Q3) and the first quartile (Q1), telling us about the
range of the middle half of the scores in the distribution. The quartile-driven descriptive
measures (both centrality and dispersion) are best explained with a popular plot called a
box-and-whiskers plot (or box plot).
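Before turning to the box plot itself, here is a minimal sketch (an illustration, not from the textbook) of computing quartiles and the interquartile range in Python; it assumes the NumPy library is installed, and the data are hypothetical.

# Quartiles and interquartile range with NumPy.
import numpy as np

data = np.array([23, 25, 25, 27, 29, 31, 31, 31, 35, 80])
q1, median, q3 = np.percentile(data, [25, 50, 75])   # first quartile, median, third quartile
iqr = q3 - q1                                        # spread of the middle half of the data
print("Q1:", q1, "Median:", median, "Q3:", q3, "IQR:", iqr)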
Box-and-Whiskers Plot
The box-and-whiskers plot (or simply a box plot) is a graphical illustration of several
descriptive statistics about a given data set. They can be either horizontal or vertical, but
vertical is the most common representation, especially in modern-day analytics software
products. It is credited to John W. Tukey, who first created and presented it in 1969. The box plot is often used to illustrate both the centrality and the dispersion of a given data set (i.e., the distribution of the sample data) in an easy-to-understand graphical notation. Figure 3.8
shows two box plots side by side, sharing the same y-axis. As shown therein, a single
chart can have one or more box plots for visual comparison purposes. In such cases,
the y-axis would be the common measure of magnitude (the numerical value of the
variable), with the x-axis showing different classes/subsets such as different time dimen-
sions (e.g., descriptive statistics for annual Medicare expenses in 2015 versus 2016) or
different categories (e.g., descriptive statistics for marketing expenses versus total sales).
Although, historically speaking, the box plot has not been used widely and often enough (especially in areas outside of statistics), with the emerging popularity of business analytics, it is gaining traction in less technical areas of the business world. Its information richness and ease of interpretation are largely responsible for its recent popularity.
The box plot shows the centrality (median and sometimes also mean) as well as
the dispersion (the density of the data within the middle half—drawn as a box between
the first and third quartiles), the minimum and maximum ranges (shown as lines extending from the box, looking like whiskers, that reach up to 1.5 times the interquartile range beyond the upper or lower edge of the box), and the outliers that fall beyond the limits of the whiskers. A box plot also shows whether the data are symmetrically distributed with respect to the mean or sway one way or another. The relative position of the median versus the mean and the lengths of the whiskers on both sides of the box give a good indication of the potential skewness in the data.
FIGURE 3.8 Understanding the Specifics about Box-and-Whiskers Plots. [The figure annotates two vertical box plots (Variable 1 and Variable 2) sharing one y-axis: points beyond 1.5 times the upper or lower quartile are marked as outliers; the whisker ends mark the largest and smallest values excluding outliers; the box edges mark the upper quartile (25% of the data are larger) and the lower quartile (25% are smaller); the line inside the box is the median (the middle of the data set), and the x marks the mean (the simple average).]
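A box plot similar in spirit to Figure 3.8 can be produced with a few lines of code. The sketch below is illustrative only (it is not how the figure in the book was generated); it assumes the matplotlib and NumPy libraries are installed and uses synthetic, randomly generated data for the two variables.

# Side-by-side box plots for two synthetic variables.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
variable_1 = rng.normal(loc=50, scale=10, size=200)   # synthetic data for illustration
variable_2 = rng.normal(loc=60, scale=20, size=200)

plt.boxplot([variable_1, variable_2], labels=["Variable 1", "Variable 2"],
            showmeans=True)   # whiskers default to 1.5 times the IQR; means shown as markers
plt.ylabel("Value")
plt.show()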
Shape of a Distribution
Although not as commonly used as the centrality and dispersion measures, the shape of the data distribution is also a useful descriptive statistic. Before delving into the shape of
the distribution, we first need to define the distribution itself. Simply put, distribution is
the frequency of data points counted and plotted over a small number of class labels or
numerical ranges (i.e., bins). In a graphical illustration of distribution, the y-axis shows
the frequency (count or percentage), and the x-axis shows the individual classes or bins
in a rank-ordered fashion. A very well-known distribution is called normal distribution,
which is perfectly symmetric on both sides of the mean and has numerous well-founded
mathematical properties that make it a very useful tool for research and practice. As the
dispersion of a data set increases, so does the standard deviation, and the shape of the
distribution looks wider. A graphic illustration of the relationship between dispersion and
distribution shape (in the context of normal distribution) is shown in Figure 3.9.
There are two commonly used measures to calculate the shape characteristics of a
distribution: skewness and kurtosis. A histogram (frequency plot) is often used to visually
illustrate both skewness and kurtosis.
Skewness is a measure of asymmetry (sway) in a distribution of the data that por-
trays a unimodal structure—only one peak exists in the distribution of the data. Because
normal distribution is a perfectly symmetric unimodal distribution, it does not have
skewness; that is, its skewness measure (i.e., the value of the coefficient of skewness) is equal to zero. The skewness measure/value can be either positive or negative. If the distribution sways left (i.e., the tail is on the right side and the mean is larger than the median), then it produces a positive skewness measure; if the distribution sways right (i.e., the tail is on the left side and the mean is smaller than the median), then it produces a negative skewness measure. In Figure 3.9, (c) represents a positively skewed distribution whereas (d) represents a negatively skewed distribution. In the same figure, both (a) and (b) represent perfect symmetry and hence a zero measure of skewness.

FIGURE 3.9 Relationship between Dispersion and Distribution Shape Properties. [Panels (a) and (b) show symmetric distributions with different dispersion; (c) shows a positively skewed distribution and (d) a negatively skewed distribution.]
\text{Skewness} = S = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{(n - 1)\, s^3}
where s is the standard deviation and n is the number of samples.
Kurtosis is another measure to use in characterizing the shape of a unimodal dis-
tribution. As opposed to the sway in shape, kurtosis focuses more on characterizing the
peak/tall/skinny nature of the distribution. Specifically, kurtosis measures the degree to
which a distribution is more or less peaked than a normal distribution. Whereas a posi-
tive kurtosis indicates a relatively peaked/tall distribution, a negative kurtosis indicates a
relatively flat/short distribution. As a reference point, a normal distribution has a kurtosis of 3; once 3 is subtracted, as in the formula below, this corresponds to an excess kurtosis of 0. The formula for kurtosis (in its excess form) can be written as
\text{Kurtosis} = K = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n\, s^4} - 3
Descriptive statistics (as well as inferential statistics) can easily be calculated using commercially available statistical software packages (e.g., SAS, SPSS, Minitab, JMP, Statistica) or
free/open source tools (e.g., R). Perhaps the most convenient way to calculate descriptive
and some of the inferential statistics is to use Excel. Technology Insights 3.1 describes in
detail how to use Microsoft Excel to calculate descriptive statistics.
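For readers working outside Excel, the same descriptive statistics (including the shape measures defined above) can be obtained with open source Python libraries. The sketch below is illustrative only; it assumes pandas and SciPy are installed, and the Expense values are hypothetical numbers loosely echoing the column name used in the Excel example.

# Descriptive statistics, skewness, and excess kurtosis in Python.
import pandas as pd
from scipy import stats

expense = pd.Series([120, 135, 150, 160, 180, 210, 260, 400])  # hypothetical values

print(expense.describe())                    # count, mean, std, min, quartiles, max
print("Skewness:", stats.skew(expense))      # positive here: long right tail
print("Kurtosis:", stats.kurtosis(expense))  # excess kurtosis (0 for a normal distribution)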
TECHNOLOGY INSIGHTS 3.1 How to Calculate Descriptive Statistics
in Microsoft Excel
Excel, arguably the most popular data analysis tool in the world, can easily be used for descriptive
statistics. Although the base configuration of Excel does not seem to have the statistics function
readily available for end users, those functions come with the Excel installation and can be acti-
vated (turned on) with only a few mouse clicks. Figure 3.10 shows how these statistics functions
(as part of the Analysis ToolPak) can be activated in Microsoft Excel 2016.
Once activated, the Analysis ToolPak will appear in the Data menu option under the
name of Data Analysis. When you click on Data Analysis in the Analysis group under the Data
tab in the Excel menu bar, you will see Descriptive Statistics as one of the options within the
list of data analysis tools (see Figure 3.11, steps 1, 2); click on OK, and the Descriptive Statistics
dialog box will appear (see the middle of Figure 3.11). In this dialog box, you need to enter
the range of the data, which can be one or more numerical columns, along with the preference
check boxes, and click OK (see Figure 3.11, steps 3, 4). If the selection includes more than one
numeric column, the tool treats each column as a separate data set and provides descriptive
statistics for each column separately.
As a simple example, we selected two columns (labeled as Expense and Demand) and
executed the Descriptive Statistics option. The bottom section of Figure 3.11 shows the output
created by Excel. As can be seen, Excel produced all descriptive statistics that are covered in
the previous section and added a few more to the list. In Excel 2016, it is also very easy (a few
FIGURE 3.10 Activating Statistics Function in Excel 2016. [Screenshot showing the four numbered steps for enabling the Analysis ToolPak add-in.]
FIGURE 3.11 Obtaining Descriptive Statistics in Excel. [Screenshot showing the four numbered steps of the Data Analysis dialog and the resulting output table.]
FIGURE 3.12 Creating a Box-and-Whiskers Plot in Excel 2016. [Screenshot showing the three numbered steps of the chart insertion.]
mouse clicks) to create a box-and-whiskers plot. Figure 3.12 shows the simple three-step pro-
cess of creating a box-and-whiskers plot in Excel.
Although Analysis ToolPak is a very useful tool in Excel, one should be aware of an im-
portant point related to the results that it generates, which have a different behavior than other
ordinary Excel functions: Although Excel functions dynamically change as the underlying data in
the spreadsheet are changed, the results generated by the Analysis ToolPak do not. For example, if
you change the values in either or both of these columns, the Descriptive Statistics results produced
by the Analysis ToolPak will stay the same. However, the same is not true for ordinary Excel func-
tions. If you were to calculate the mean value of a given column (using “=AVERAGE(A1:A121)”)
and then change the values within the data range, the mean value would automatically change. In
summary, the results produced by Analysis ToolPak do not have a dynamic link to the underlying
data, and if the data change, the analysis needs to be redone using the dialog box.
Successful applications of data analytics cover a wide range of business and organizational
settings, addressing problems once thought unsolvable. Application Case 3.3 is an excellent il-
lustration of those success stories in which a small municipality administration adopted a data
analytics approach to intelligently detect and solve problems by continuously analyzing demand
and consumption patterns.
SECTION 3.5 REVIEW QUESTIONS
1. What is the relationship between statistics and business analytics?
2. What are the main differences between descriptive and inferential statistics?
3. List and briefly define the central tendency measures of descriptive statistics.
4. List and briefly define the dispersion measures of descriptive statistics.
5. What is a box-and-whiskers plot? What types of statistical information does it represent?
6. What are the two most commonly used shape characteristics to describe a data distribution?
Application Case 3.3 Town of Cary Uses Analytics to Analyze Data from Sensors, Assess Demand, and Detect Problems

A leaky faucet. A malfunctioning dishwasher. A cracked
sprinkler head. These are more than just a headache
for a home owner or business to fix. They can be
costly, unpredictable, and, unfortunately, hard to pin-
point. Through a combination of wireless water meters
and a data-analytics-driven, customer-accessible portal,
the Town of Cary, North Carolina, is making it much
easier to find and fix water loss issues. In the process,
the town has gained a big-picture view of water usage
critical to planning future water plant expansions and
promoting targeted conservation efforts.
When the town of Cary installed wireless
meters for 60,000 customers in 2010, it knew the
new technology wouldn’t just save money by
eliminating manual monthly readings; the town
also realized it would get more accurate and
timely information about water consumption.
The Aquastar wireless system reads meters once
an hour—that is 8,760 data points per customer
each year instead of 12 monthly readings. The data
had tremendous potential if they could be easily
consumed.
“Monthly readings are like having a gallon of
water’s worth of data. Hourly meter readings are
more like an Olympic-size pool of data,” says Karen
Mills, finance director for Cary. “SAS helps us man-
age the volume of that data nicely.” In fact, the solu-
tion enables the town to analyze half a billion data
points on water usage and make them available to
and easily consumable by all customers.
The ability to visually look at data by house-
hold or commercial customer by the hour has led to
some very practical applications:
• The town can notify customers of potential
leaks within days.
• Customers can set alerts that notify them with-
in hours if there is a spike in water usage.
• Customers can track their water usage online,
helping them to be more proactive in conserv-
ing water.
Through the online portal, one business in the
town saw a spike in water consumption on weekends
when employees are away. This seemed odd, and
the unusual reading helped the company learn that a
commercial dishwasher was malfunctioning, running
continuously over weekends. Without the wireless
water-meter data and the customer-accessible portal,
this problem could have gone unnoticed, continuing
to waste water and money.
The town has a much more accurate picture
of daily water usage per person, critical for planning
future water plant expansions. Perhaps the most
interesting perk is that the town was able to verify a
hunch that has far-reaching cost ramifications: Cary
residents are very economical in their use of water.
“We calculate that with modern high-efficiency appli-
ances, indoor water use could be as low as 35 gal-
lons per person per day. Cary residents average 45
gallons, which is still phenomenally low,” explains
town Water Resource Manager Leila Goodwin. Why
is this important? The town was spending money
to encourage water efficiency—rebates on low-flow
toilets or discounts on rain barrels. Now it can take
a more targeted approach, helping specific consum-
ers understand and manage both their indoor and
outdoor water use.
SAS was critical not just for enabling residents
to understand their water use but also for working
behind the scenes to link two disparate databases.
“We have a billing database and the meter-reading
database. We needed to bring that together and
make it presentable,” Mills says.
The town estimates that by just removing the
need for manual readings, the Aquastar system will
save more than $10 million above the cost of the
project. But the analytics component could provide
even bigger savings. Already, both the town and
individual citizens have saved money by catch-
ing water leaks early. As Cary continues to plan its
future infrastructure needs, having accurate infor-
mation on water usage will help it invest in the
right amount of infrastructure at the right time. In
addition, understanding water usage will help the
town if it experiences something detrimental like a
drought.
“We went through a drought in 2007,” says
Goodwin. “If we go through another, we have a
plan in place to use Aquastar data to see exactly how
much water we are using on a day-by-day basis and
communicate with customers. We can show ‘here’s
what’s happening, and here is how much you can
use because our supply is low.’ Hopefully, we’ll
never have to use it, but we’re prepared.”
Questions for Case 3.3
1. What were the challenges the Town of Cary was
facing?
2. What was the proposed solution?
3. What were the results?
4. What other problems and data analytics solutions
do you foresee for towns like Cary?
Source: “Municipality Puts Wireless Water Meter-Reading Data To
Work (SAS’ Analytics)—The Town of Cary, North Carolina Uses
SAS Analytics to Analyze Data from Wireless Water Meters, Assess
Demand, Detect Problems and Engage Customers.” Copyright ©
2016 SAS Institute Inc., Cary, NC, USA. Reprinted with permis-
sion. All rights reserved.
3.6 REGRESSION MODELING FOR INFERENTIAL STATISTICS
Regression, especially linear regression, is perhaps the most widely known and used
analytics technique in statistics. Historically speaking, the roots of regression date back
to the late 1800s and the work on the inherited characteristics of sweet peas by Sir Francis Galton, subsequently extended by Karl Pearson. Since then, regression has become
the statistical technique for characterization of relationships between explanatory (input)
variable(s) and response (output) variable(s).
As popular as it is, regression essentially is a relatively simple statistical technique
to model the dependence of a variable (response or output variable) on one (or more)
explanatory (input) variables. Once identified, this relationship between the variables can
be formally represented as a linear/additive function/equation. As is the case with many
other modeling techniques, regression aims to capture the functional relationship be-
tween and among the characteristics of the real world and describe this relationship with
a mathematical model, which can then be used to discover and understand the complexi-
ties of reality—explore and explain relationships or forecast future occurrences.
Regression can be used for one of two purposes: hypothesis testing—investigating
potential relationships between different variables—and prediction/forecasting— estimating
values of a response variable based on one or more explanatory variables. These two uses
are not mutually exclusive. The explanatory power of regression is also the foundation of
its predictive ability. In hypothesis testing (theory building), regression analysis can reveal
the existence/strength and the directions of relationships between a number of explanatory
variables (often represented with xi) and the response variable (often represented with y).
In prediction, regression identifies additive mathematical relationships (in the form of an
equation) between one or more explanatory variables and a response variable. Once deter-
mined, this equation can be used to forecast the values of the response variable for a given
set of values of the explanatory variables.
CORRELATION VERSUS REGRESSION Because regression analysis originated from cor-
relation studies, and because both methods attempt to describe the association between
two (or more) variables, these two terms are often confused by professionals and even
by scientists. Correlation makes no a priori assumption of whether one variable is de-
pendent on the other(s) and is not concerned with the relationship between variables;
instead it gives an estimate on the degree of association between the variables. On the
other hand, regression attempts to describe the dependence of a response variable on
one (or more) explanatory variables where it implicitly assumes that there is a one-
way causal effect from the explanatory variable(s) to the response variable, regardless of
whether the path of effect is direct or indirect. Also, although correlation is interested in
the low-level relationships between two variables, regression is concerned with the rela-
tionships between all explanatory variables and the response variable.
SIMPLE VERSUS MULTIPLE REGRESSION If the regression equation is built between one
response variable and one explanatory variable, then it is called simple regression. For
instance, the regression equation built to predict/explain the relationship between the
height of a person (explanatory variable) and the weight of a person (response variable)
is a good example of simple regression. Multiple regression is the extension of simple
regression when the explanatory variables are more than one. For instance, in the pre-
vious example, if we were to include not only the height of the person but also other
personal characteristics (e.g., BMI, gender, ethnicity) to predict the person’s weight, then
we would be performing multiple regression analysis. In both cases, the relationship
between the response variable and the explanatory variable(s) is linear and additive in
nature. If the relationships are not linear, then we might want to use one of many other
nonlinear regression methods to better capture the relationships between the input and
output variables.
How Do We Develop the Linear Regression Model?
To understand the relationship between two variables, the simplest thing that one can do
is to draw a scatter plot where the y-axis represents the values of the response variable
and the x-axis represents the values of the explanatory variable (see Figure 3.13). A scat-
ter plot would show the changes in the response variable as a function of the changes in
the explanatory variable. In the case shown in Figure 3.13, there seems to be a positive
relationship between the two; as the explanatory variable values increase, so does the
response variable.
Simple regression analysis aims to find a mathematical representation of this rela-
tionship. In reality, it tries to find the signature of a straight line passing right through the middle of the plotted dots (representing the observation/historical data) in such a way
that it minimizes the distance between the dots and the line (the predicted values on the
theoretical regression line).

FIGURE 3.13 A Scatter Plot and a Linear Regression Line. [A scatter plot of observations (xi, yi) with the response variable y on the y-axis and the explanatory variable x on the x-axis; a regression line with intercept b0 and slope b1 passes through the points.]

Even though there are several methods/algorithms proposed
to identify the regression line, the one that is most commonly used is called the ordinary
least squares (OLS) method. The OLS method aims to minimize the sum of squared
residuals (squared vertical distances between the observation and the regression point)
and leads to mathematical expressions for the estimated values of the regression coefficients (known as the b parameters). For simple linear regression, the aforementioned relationship between the response variable (y) and the explanatory variable (x) can
be shown as a simple equation as follows:
y = b0 + b1x
In this equation, b0 is called the intercept, and b1 is called the slope. Once OLS deter-
mines the values of these two coefficients, the simple equation can be used to forecast
the values of y for given values of x. The sign and the value of b1 also reveal the direc-
tion and the strengths of relationship between the two variables.
If the model is of a multiple linear regression type, then there would be more coef-
ficients to be determined, one for each additional explanatory variable. As the following
formula shows, the additional explanatory variable would be multiplied with the new
bi coefficients and summed together to establish a linear additive representation of the
response variable.
y = b0 + b1x1 + b2x2 + b3x3 + … + bnxn
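As a minimal, hedged sketch (not from the textbook), the b coefficients of a simple linear regression can be estimated by ordinary least squares with NumPy; the height and weight values below are made up to echo the earlier example, and a multiple regression would simply add more columns to the design matrix X.

# OLS estimation of b0 (intercept) and b1 (slope) for y = b0 + b1*x.
import numpy as np

height = np.array([150, 160, 165, 170, 175, 180, 185, 190], dtype=float)  # explanatory variable x
weight = np.array([52, 58, 63, 68, 72, 77, 82, 88], dtype=float)          # response variable y

X = np.column_stack([np.ones_like(height), height])   # column of 1s lets lstsq estimate the intercept
coeffs, *_ = np.linalg.lstsq(X, weight, rcond=None)   # least squares solution
b0, b1 = coeffs
print(f"weight = {b0:.2f} + {b1:.2f} * height")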
How Do We Know If the Model Is Good Enough?
For a variety of reasons, models, as representations of reality, sometimes do not prove to be good. Regardless of the number of explanatory variables included, there
is always a possibility of not having a good model, and therefore the linear regression
model needs to be assessed for its fit (the degree to which it represents the response
variable). In the simplest sense, a well-fitting regression model results in predicted values
close to the observed data values. For the numerical assessment, three statistical measures
are often used in evaluating the fit of a regression model: R² (R-squared), the overall
F-test, and the root mean square error (RMSE). All three of these measures are based on
the sums of the square errors (how far the data are from the mean and how far the data
are from the model’s predicted values). Different combinations of these two values pro-
vide different information about how the regression model compares to the mean model.
Of the three, R² has the most useful and understandable meaning because of its intuitive scale. The value of R² ranges from 0 to 1 (corresponding to the amount of variability explained, in percentage terms), with 0 indicating that the relationship and the prediction power of the proposed model are not good, and 1 indicating that the proposed model is a perfect fit that produces exact predictions (which is almost never the case). Good R² values would usually come close to one, and the closeness is a matter of the phenomenon being modeled—whereas an R² value of 0.3 for a linear regression model in the social sciences can be considered good enough, an R² value of 0.7 in engineering might be considered not a good enough fit. Improvement in the regression model can be achieved by adding more explanatory variables or using different data transformation techniques, which would result in comparable increases in the R² value. Figure 3.14 shows the process flow of developing regression models. As can be seen in the process flow, the model development task is followed by the model assessment task in which not only is the fit of the model assessed, but, because of the restrictive assumptions with which linear models have to comply, the validity of the model also needs to be put under the microscope.

FIGURE 3.14 A Process Flow for Developing Regression Models. [Tabulated data feed into data assessment (scatter plots, correlations), followed by model fitting (transform data, estimate parameters), model assessment (test assumptions, assess model fit), and deployment (one-time or recurrent use).]
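The fit statistics discussed above are easy to compute once a model has produced predictions. The short sketch below (an illustration, not the textbook's code) shows R-squared and RMSE for a set of hypothetical observed and predicted values.

# R-squared and RMSE from observed values (y) and model predictions (y_hat).
import numpy as np

y = np.array([52, 58, 63, 68, 72, 77, 82, 88], dtype=float)       # observed responses
y_hat = np.array([53, 57, 62, 67, 72, 76, 81, 86], dtype=float)   # hypothetical predictions

ss_res = np.sum((y - y_hat) ** 2)           # error left unexplained by the model
ss_tot = np.sum((y - y.mean()) ** 2)        # total variability around the mean
r_squared = 1 - ss_res / ss_tot             # share of variability explained (0 to 1)
rmse = np.sqrt(np.mean((y - y_hat) ** 2))   # typical prediction error in the units of y
print(f"R^2 = {r_squared:.3f}, RMSE = {rmse:.2f}")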
What Are the Most Important Assumptions in Linear Regression?
Even though they are still the choice of many for data analyses (both for explanatory
and for predictive modeling purposes), linear regression models suffer from several
highly restrictive assumptions. The validity of the linear model built depends on its
ability to comply with these assumptions. Here are the most commonly pronounced
assumptions:
1. Linearity. This assumption states that the relationship between the response
variable and the explanatory variables is linear. That is, the expected value of the
response variable is a straight-line function of each explanatory variable while
holding all other explanatory variables fixed. Also, the slope of the line does not
depend on the values of the other variables. It also implies that the effects of dif-
ferent explanatory variables on the expected value of the response variable are
additive in nature.
2. Independence (of errors). This assumption states that the errors of the response
variable are uncorrelated with each other. This independence of the errors is weaker
than actual statistical independence, which is a stronger condition and is often not
needed for linear regression analysis.
3. Normality (of errors). This assumption states that the errors of the response vari-
able are normally distributed. That is, they are supposed to be totally random and
should not represent any nonrandom patterns.
4. Constant variance (of errors). This assumption, also called homoscedasticity,
states that the response variables have the same variance in their error regardless of
the values of the explanatory variables. In practice, this assumption is often violated when the
response variable varies over a wide enough range/scale.
5. Multicollinearity. This assumption states that the explanatory variables are not
correlated (i.e., do not replicate the same but provide a different perspective of the
information needed for the model). Multicollinearity can be triggered by having two
or more perfectly correlated explanatory variables presented to the model (e.g., if the same explanatory variable is mistakenly included in the model twice, the second time with a slight transformation). A correlation-based data assessment
usually catches this error.
There are statistical techniques developed to identify the violation of these assump-
tions and techniques to mitigate them. The most important part for a modeler is to be
aware of their existence and to put in place the means to assess the models to make sure
that they are compliant with the assumptions they are built on.
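Formal diagnostic tests exist for each assumption, but two quick, informal checks can be sketched with NumPy alone (this is an illustration with hypothetical numbers, not a substitute for proper diagnostics such as residual plots or variance inflation factors).

# Informal checks: residual behavior (errors) and correlation between predictors.
import numpy as np

residuals = np.array([1.2, -0.8, 0.3, -1.1, 0.9, -0.4, 0.6, -0.7])        # hypothetical model residuals
print("Residual mean:", residuals.mean(), "std:", residuals.std(ddof=1))  # mean should be near zero

x1 = np.array([150, 160, 165, 170, 175, 180, 185, 190], dtype=float)
x2 = 0.39 * x1 + np.array([0.1, -0.2, 0.3, 0.0, -0.1, 0.2, -0.3, 0.1])    # nearly a copy of x1
print("corr(x1, x2):", np.corrcoef(x1, x2)[0, 1])   # a value near 1 signals multicollinearity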
Logistic Regression
Logistic regression is a very popular, statistically sound, probability-based classifica-
tion algorithm that employs supervised learning. It was developed in the 1940s as a
complement to linear regression and linear discriminant analysis methods. It has been
used extensively in numerous disciplines, including the medical and social sciences
fields. Logistic regression is similar to linear regression in that it also aims to regress
to a mathematical function that explains the relationship between the response vari-
able and the explanatory variables using a sample of past observations (training data).
Logistic regression differs from linear regression with one major point: its output (re-
sponse variable) is a class as opposed to a numerical variable. That is, whereas linear
regression is used to estimate a continuous numerical variable, logistic regression is
used to classify a categorical variable. Even though the original form of logistic regres-
sion was developed for a binary output variable (e.g., 1/0, yes/no, pass/fail, accept/
reject), the present-day modified version is capable of predicting multiclass output
variables (i.e., multinomial logistic regression). If there is only one predictor variable
and one predicted variable, the method is called simple logistic regression (similar to
calling linear regression models with only one independent variable simple linear
regression).
In predictive analytics, logistic regression models are used to develop probabilis-
tic models between one or more explanatory/predictor variables (which can be a mix
of both continuous and categorical in nature) and a class/response variable (which can
be binomial/binary or multinomial/multiclass). Unlike ordinary linear regression, logis-
tic regression is used for predicting categorical (often binary) outcomes of the response
variable—treating the response variable as the outcome of a Bernoulli trial. Therefore,
logistic regression takes the natural logarithm of the odds of the response variable to
create a continuous criterion as a transformed version of the response variable. Thus, the
logit transformation is referred to as the link function in logistic regression—even though
the response variable in logistic regression is categorical or binomial, the logit is the con-
tinuous criterion on which linear regression is conducted. Figure 3.15 shows a logistic
regression function where the log odds (a linear function of the independent variables, b0 + b1x) are represented on the x-axis, whereas the probabilistic outcome is shown on the y-axis (i.e.,
response variable values change between 0 and 1).
The logistic function, f(y), shown in Figure 3.15 is the core of logistic regression; it can take values only between 0 and 1. The following equation is a simple mathematical representation of this function:

f(y) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}

FIGURE 3.15 The Logistic Function. [An S-shaped curve: f(y) on the y-axis rises from 0 through 0.5 toward 1 as b0 + b1x on the x-axis increases from -6 to 6.]
The logistic regression coefficients (the bs) are usually estimated using the maximum
likelihood estimation method. Unlike linear regression with normally distributed residu-
als, it is not possible to find a closed-form expression for the coefficient values that maxi-
mizes the likelihood function, so an iterative process must be used instead. This process
begins with a tentative starting solution, then revises the parameters slightly to see if the
solution can be improved, and repeats this iterative revision until the improvement is zero or very minimal, at which point the process is said to have converged.
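A minimal sketch of fitting a binary logistic regression in Python follows; it is illustrative only (it assumes the scikit-learn library is installed, and the hours-studied/pass data are made up), but it shows the pieces discussed above: the coefficients estimated by maximum likelihood and the probabilistic output between 0 and 1.

# Binary logistic regression with scikit-learn on hypothetical data.
import numpy as np
from sklearn.linear_model import LogisticRegression

hours_studied = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])   # explanatory variable
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])                          # binary class labels

model = LogisticRegression()
model.fit(hours_studied, passed)                    # coefficients found iteratively (maximum likelihood)
print("b0:", model.intercept_[0], "b1:", model.coef_[0][0])
print("P(pass | 4.5 hours):", model.predict_proba([[4.5]])[0][1])    # probability of class 1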
Sports analytics—use of data and statistical/analytics techniques to better manage
sports teams/organizations—has been gaining tremendous popularity. Use of data-driven
analytics techniques has become mainstream for not only professional teams but also col-
lege and amateur sports. Application Case 3.4 is an example of how existing and readily
available public data sources can be used to predict college football bowl game outcomes
using both classification and regression-type prediction models.
Time-Series Forecasting
Sometimes the variable that we are interested in (i.e., the response variable) might not
have distinctly identifiable explanatory variables, or there might be too many of them in a
highly complex relationship. In such cases, if the data are available in a desired format, a
prediction model, the so-called time series, can be developed. A time series is a sequence
of data points of the variable of interest, measured and represented at successive points
in time spaced at uniform time intervals. Examples of time series include monthly rain
volumes in a geographic area, the daily closing value of the stock market indexes, and daily sales totals for a grocery store.
Predicting the outcome of a college football game (or
any sports game, for that matter) is an interesting and
challenging problem. Therefore, challenge-seeking
researchers from both academics and industry have
spent a great deal of effort on forecasting the out-
come of sporting events. Large amounts of historic
data exist in different media outlets (often publicly
available) regarding the structure and outcomes of
sporting events in the form of a variety of numeri-
cally or symbolically represented factors that are
assumed to contribute to those outcomes.
The end-of-season bowl games are very impor-
tant to colleges in terms of both finance (bring-
ing in millions of dollars of additional revenue) and
reputation—for recruiting quality students and highly
regarded high school athletes for their athletic pro-
grams (Freeman & Brewer, 2016). Teams that are
selected to compete in a given bowl game split a
purse, the size of which depends on the specific bowl
(some bowls are more prestigious and have higher
payouts for the two teams), and therefore securing
an invitation to a bowl game is the main goal of any
division I-A college football program. The decision
makers of the bowl games are given the authority
to select and invite bowl-eligible (a team that has six
wins against its Division I-A opponents in that season)
successful teams (as per the ratings and rankings) that
will play in an exciting and competitive game, attract
fans of both schools, and keep the remaining fans
tuned in via a variety of media outlets for advertising.
In a recent data mining study, Delen et al.
(2012) used eight years of bowl game data along
with three popular data mining techniques (decision
trees, neural networks, and support vector machines)
to predict both the classification-type outcome of a
game (win versus loss) and the regression-type out-
come (projected point difference between the scores
of the two opponents). What follows is a shorthand
description of their study.
The Methodology
In this research, Delen and his colleagues followed
a popular data mining methodology, CRISP-DM
(Cross-Industry Standard Process for Data Mining),
which is a six-step process. This popular meth-
odology, which is covered in detail in Chapter 4,
provided them with a systematic and structured way
to conduct the underlying data mining study and
hence improved the likelihood of obtaining accurate
Application Case 3.4 Predicting NCAA Bowl Game Outcomes
and reliable results. To objectively assess the pre-
diction power of the different model types, they
used a cross-validation methodology, k-fold cross-validation. Details on k-fold cross-validation can be found in Chapter 4. Figure 3.16 graphically illustrates the methodology employed by the researchers.

FIGURE 3.16 The Graphical Illustration of the Methodology Employed in the Study. [The flow proceeds from raw data sources through data collection, organization, cleaning, and transformation into two parallel streams: classification modeling (binary win/loss output) and regression modeling (integer point-difference output). Each stream builds models with classification and regression trees, neural networks, and support vector machines, tests them with 10-fold cross-validation, and tabulates and compares the prediction results.]
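The study's own code is not reproduced here, but the stratified 10-fold cross-validation step can be sketched generically in Python with scikit-learn; the feature matrix and labels below are random placeholders standing in for the 244 games and 28 input variables described next.

# Generic stratified 10-fold cross-validation sketch (placeholder data, not the study's data).
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(244, 28))        # placeholder: 244 games, 28 input variables
y = rng.integers(0, 2, size=244)      # placeholder win/loss labels

accuracies = []
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    accuracies.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
print("Mean 10-fold accuracy:", np.mean(accuracies))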
Data Acquisition and Data Preprocessing
The sample data for this study are collected from
a variety of sports databases available on the Web,
including jhowel.net, ESPN.com, Covers.com,
ncaa.org, and rauzulusstreet.com. The data set
included 244 bowl games representing a com-
plete set of eight seasons of college football bowl
games played between 2002 and 2009. Delen et
al. also included an out-of-sample data set (2010–
2011 bowl games) for additional validation pur-
poses. Exercising one of the popular data mining
rules of thumb, they included as much relevant
information in the model as possible. Therefore,
after an in-depth variable identification and
collection process, they ended up with a data set
that included 36 variables, of which the first 6 were
the identifying variables (i.e., name and the year
of the bowl game, home and away team names,
and their athletic conferences—see variables 1–6 in
Table 3.5), followed by 28 input variables (which
included variables delineating a team’s seasonal sta-
tistics on offense and defense, game outcomes, team
composition characteristics, athletic conference char-
acteristics, and how they fared against the odds—see
TABLE 3.5 Description of Variables Used in the Study
No Cat Variable Name Description
1 ID1 YEAR Year of the bowl game
2 ID BOWLGAME Name of the bowl game
3 ID HOMETEAM Home team (as listed by the bowl organizers)
4 ID AWAYTEAM Away team (as listed by the bowl organizers)
5 ID HOMECONFERENCE Conference of the home team
6 ID AWAYCONFERENCE Conference of the away team
7 I12 DEFPTPGM Defensive points per game
8 I1 DEFRYDPGM Defensive rush yards per game
9 I1 DEFYDPGM Defensive yards per game
10 I1 PPG Average number of points a given team scored per game
11 I1 PYDPGM Average total pass yards per game
12 I1 RYDPGM Team’s average total rush yards per game
13 I1 YRDPGM Average total offensive yards per game
14 I2 HMWIN% Home winning percentage
15 I2 LAST7 How many games the team won out of their last 7 games
16 I2 MARGOVIC Average margin of victory
17 I2 NCTW Nonconference team winning percentage
18 I2 PREVAPP Did the team appear in a bowl game previous year
19 I2 RDWIN% Road winning percentage
20 I2 SEASTW Winning percentage for the year
21 I2 TOP25 Winning percentage against AP top 25 teams for the year
22 I3 TSOS Strength of schedule for the year
23 I3 FR% Percentage of games played by freshmen class players for the year
24 I3 SO% Percentage of games played by sophomore class players for the year
25 I3 JR% Percentage of games played by junior class players for the year
26 I3 SR% Percentage of games played by senior class players for the year
27 I4 SEASOvUn% Percentage of times a team went over the O/U3 in the current season
28 I4 ATSCOV% Against the spread cover percentage of the team in previous bowl games
29 I4 UNDER% Percentage of times a team went under in previous bowl games
30 I4 OVER% Percentage of times a team went over in previous bowl games
31 I4 SEASATS% Percentage of covering against the spread for the current season
32 I5 CONCH Did the team win their respective conference championship game
33 I5 CONFSOS Conference strength of schedule
34 I5 CONFWIN% Conference winning percentage
35 O1 ScoreDiff4 Score difference (HomeTeamScore – AwayTeamScore)
36 O2 WinLoss4 Whether the home team wins or loses the game
¹ ID: identifier variables; O1: output variable for regression models; O2: output variable for classification models.
² I1: offense/defense; I2: game outcome; I3: team configuration; I4: against the odds; I5: conference stats.
³ Over/Under—whether or not a team will go over or under the expected score difference.
⁴ Output variables—ScoreDiff for regression models and WinLoss for binary classification models.
variables 7–34 in Table 3.5), and finally the last two
were the output variables (i.e., ScoreDiff—the score
difference between the home team and the away
team represented with an integer number—and
WinLoss—whether the home team won or lost the
bowl game represented with a nominal label).
In the formulation of the data set, each row
(a.k.a. tuple, case, sample, example, etc.) represented
a bowl game, and each column stood for a variable
(i.e., identifier/input or output type). To represent
the game-related comparative characteristics of the
two opponent teams in the input variables, Delen
et al. calculated and used the differences between
the measures of the home and away teams. All these
variable values are calculated from the home team’s
perspective. For instance, the variable PPG (average
number of points a team scored per game) repre-
sents the difference between the home team’s PPG
and away team’s PPG. The output variables repre-
sent whether the home team wins or loses the bowl
game. That is, if the ScoreDiff variable takes a posi-
tive integer number, then the home team is expected
to win the game by that margin; otherwise (if the
ScoreDiff variable takes a negative integer number),
the home team is expected to lose the game by that
margin. In the case of WinLoss, the value of the out-
put variable is a binary label, “Win” or “Loss,” indi-
cating the outcome of the game for the home team.
The Results and Evaluation
In this study, three popular prediction techniques are
used to build models (and to compare them to each
other): artificial neural networks, decision trees, and
support vector machines. These prediction techniques
are selected based on their capability of modeling both
classification and regression-type prediction problems
and their popularity in recently published data mining
literature. More details about these popular data min-
ing methods can be found in Chapter 4.
To compare predictive accuracy of all models
to one another, the researchers used a stratified k-fold
cross-validation methodology. In a stratified version of
k-fold cross-validation, the folds are created in a way
that they contain approximately the same proportion
of predictor labels (i.e., classes) as the original data set.
In this study, the value of k is set to 10 (i.e., the com-
plete set of 244 samples are split into 10 subsets, each
having about 25 samples), which is a common prac-
tice in predictive data mining applications. A graphical
depiction of the 10-fold cross-validations was shown
earlier in this chapter. To compare the prediction mod-
els that were developed using the aforementioned
three data mining techniques, the researchers chose to
use three common performance criteria: accuracy, sen-
sitivity, and specificity. The simple formulas for these
metrics were also explained earlier in this chapter.
The prediction results of the three model-
ing techniques are presented in Tables 3.6 and
3.7. Table 3.6 presents the 10-fold cross-validation
results of the classification methodology in which
the three data mining techniques are formulated
to have a binary-nominal output variable (i.e.,
WinLoss). Table 3.7 presents the 10-fold cross-
validation results of the regression-based classifica-
tion methodology in which the three data mining
techniques are formulated to have a numerical out-
put variable (i.e., ScoreDiff). In the regression-based
classification prediction, the numerical output of the
models is converted to a classification type by label-
ing the positive WinLoss numbers with a “Win” and
negative WinLoss numbers with a “Loss” and then
tabulating them in the confusion matrixes. Using the
confusion matrices, the overall prediction accuracy,
sensitivity, and specificity of each model type are
calculated and presented in Tables 3.6 and 3.7. As
the results indicate, the classification-type prediction
methods performed better than regression-based
classification-type prediction methodology. Among
the three data mining technologies, classification
and regression trees produced better prediction
accuracy in both prediction methodologies. Overall,
classification and regression tree classification mod-
els produced a 10-fold cross-validation accuracy of
86.48 percent followed by support vector machines
TABLE 3.6 Prediction Results for the Direct Classification Methodology
Prediction Method
(classification1)
Confusion
Matrix
Accuracy2 (in %)
Sensitivity (in %)
Specificity (in %)
Win Loss
ANN (MLP) Win 92 42 75.00 68.66 82.73
Loss 19 91
SVM (RBF) Win 105 29 79.51 78.36 80.91
Loss 21 89
DT (C&RT) Win 113 21 86.48 84.33 89.09
Loss 12 98
1The output variable is a binary categorical variable (Win or Loss).
2Differences were significant.
TABLE 3.7 Prediction Results for the Regression-Based Classification Methodology
Prediction Method
(regression based1)
Confusion
Matrix
Accuracy2
Sensitivity
Specificity
Win Loss
ANN (MLP) Win 94 40 72.54 70.15 75.45
Loss 27 83
SVM (RBF) Win 100 34 74.59 74.63 74.55
Loss 28 82
DT (C&RT) Win 106 28 77.87 76.36 79.10
Loss 26 84
1The output variable is a numerical/integer variable (point-diff).
²Differences were significant, p < 0.01.
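The reported metrics can be reproduced directly from a confusion matrix. The short worked example below (not the study's code) uses the decision tree confusion matrix from Table 3.6: 113 wins and 98 losses classified correctly, 21 wins misclassified as losses, and 12 losses misclassified as wins.

# Accuracy, sensitivity, and specificity from the C&RT confusion matrix in Table 3.6.
tp, fn = 113, 21   # actual Win:  predicted Win / predicted Loss
fp, tn = 12, 98    # actual Loss: predicted Win / predicted Loss

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 211 / 244 = 0.8648 (86.48%)
sensitivity = tp / (tp + fn)                 # 113 / 134 = 0.8433 (84.33%)
specificity = tn / (tn + fp)                 # 98 / 110  = 0.8909 (89.09%)
print(accuracy, sensitivity, specificity)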
Often, time series are visualized using a line chart. Figure 3.17 shows an example time series of sales volumes for the years 2008 through 2012 on a quarterly basis.

FIGURE 3.17 A Sample Time Series of Data on Quarterly Sales Volumes. [A line chart of quarterly product sales (in millions), ranging from about 0 to 10, for Q1 2008 through Q4 2012, showing random variation, a trend, and seasonal cycles.]
Time-series forecasting is the use of mathematical modeling to predict future
values of the variable of interest based on previously observed values. The time-series
plots/charts look and feel very similar to simple linear regression in that, as was the case
in simple linear regression, in time series there are two variables: the response variable
and the time variable presented in a scatter plot. Beyond this appearance similarity, there
is hardly any other commonality between the two. Although regression analysis is often
employed in testing theories to see if current values of one or more explanatory variables
explain (and hence predict) the response variable, the time-series models are focused on
extrapolating on their time-varying behavior to estimate the future values.
Time-series forecasting assumes that all of the explanatory variables are aggregated
into the response variable as a time-variant behavior. Therefore, capturing the time- variant
behavior is the way to predict the future values of the response variable. To do that, the
pattern is analyzed and decomposed into its main components: random variations, time
trends, and seasonal cycles. The time-series example shown in Figure 3.17 illustrates all
of these distinct patterns.
The techniques used to develop time-series forecasts range from very simple (the
naïve forecast that suggests today’s forecast is the same as yesterday’s actual) to very
complex like ARIMA (a method that combines autoregressive and moving average pat-
terns in data). Most popular techniques are perhaps the averaging methods that include
simple average, moving average, weighted moving average, and exponential smoothing.
Many of these techniques also have advanced versions when seasonality and trend can
also be taken into account for better and more accurate forecasting. The accuracy of a
method is usually assessed by computing its error (calculated deviation between actuals
and forecasts for the past observations) via mean absolute error (MAE), mean squared
error (MSE), or mean absolute percent error (MAPE). Even though they all use the same core error measure, these three assessment methods emphasize different aspects of the error, some penalizing larger errors more than others.
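As a small, hedged illustration (not from the textbook), the simpler techniques and the MAPE error measure can be sketched in a few lines of Python; the quarterly sales figures are invented for the example.

# Naive forecast, 3-period moving average, and MAPE on hypothetical quarterly sales.
sales = [5.2, 6.1, 6.8, 5.0, 5.9, 6.7, 7.4, 5.6]   # hypothetical values, in millions

naive = sales[:-1]                                                   # forecast for t is the actual at t-1
moving_avg = [sum(sales[i - 3:i]) / 3 for i in range(3, len(sales))]

def mape(actuals, forecasts):
    # mean absolute percent error over matching actual/forecast pairs
    return 100 * sum(abs((a - f) / a) for a, f in zip(actuals, forecasts)) / len(forecasts)

print("Naive MAPE         :", round(mape(sales[1:], naive), 1))
print("Moving-average MAPE:", round(mape(sales[3:], moving_avg), 1))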
(with a 10-fold cross-validation accuracy of 79.51
percent) and neural networks (with a 10-fold cross-
validation accuracy of 75.00 percent). Using a t-test,
researchers found that these accuracy values were
significantly different at 0.05 alpha level; that is, the
decision tree is a significantly better predictor of this
domain than the neural network and support vec-
tor machine, and the support vector machine is a
significantly better predictor than neural networks.
The results of the study showed that the
classification-type models predict the game out-
comes better than regression-based classification
models. Even though these results are specific to the
application domain and the data used in this study
and therefore should not be generalized beyond the
scope of the study, they are exciting because deci-
sion trees are not only the best predictors but also
the best in understanding and deployment, com-
pared to the other two machine-learning techniques
employed in this study. More details about this study
can be found in Delen et al. (2012).
Questions for Case 3.4
1. What are the foreseeable challenges in predicting
sporting event outcomes (e.g., college bowl games)?
2. How did the researchers formulate/design the
prediction problem (i.e., what were the inputs
and output, and what was the representation of
a single sample—row of data)?
3. How successful were the prediction results?
What else can they do to improve the accuracy?
Sources: D. Delen, D. Cogdell, and N. Kasap, “A Comparative
Analysis of Data Mining Methods in Predicting NCAA Bowl
Outcomes,” International Journal of Forecasting, 28, 2012,
pp. 543–552; K. M. Freeman, and R. M. Brewer, “The Politics
of American College Football,” Journal of Applied Business and
Economics, 18(2), 2016, pp. 97–101.
SECTION 3.6 REVIEW QUESTIONS
1. What is regression, and what statistical purpose does it serve?
2. What are the commonalities and differences between regression and correlation?
3. What is OLS? How does OLS determine the linear regression line?
4. List and describe the main steps to follow in developing a linear regression model.
5. What are the most commonly pronounced assumptions for linear regression?
6. What is logistic regression? How does it differ from linear regression?
7. What is a time series? What are the main forecasting techniques for time-series data?
3.7 BUSINESS REPORTING
Decision makers need information to make accurate and timely decisions. Information
is essentially the contextualization of data. In addition to statistical means that were ex-
plained in the previous section, information (descriptive analytics) can also be obtained
using OLTP systems (see the simple taxonomy of descriptive analytics in Figure 3.7). The
information is usually provided to decision makers in the form of a written report (digital
or on paper), although it can also be provided orally. Simply put, a report is any com-
munication artifact prepared with the specific intention of conveying information in a di-
gestible form to whoever needs it whenever and wherever. It is typically a document that
contains information (usually driven from data) organized in a narrative, graphic, and/or
tabular form, prepared periodically (recurring) or on an as-needed (ad hoc) basis, refer-
ring to specific time periods, events, occurrences, or subjects. Business reports can fulfill
many different (but often related) functions. Here are a few of the most prevailing ones:
• To ensure that all departments are functioning properly.
• To provide information.
• To provide the results of an analysis.
• To persuade others to act.
• To create an organizational memory (as part of a knowledge management system).

FIGURE 3.17 A Sample Time Series of Data on Quarterly Sales Volumes (quarterly product sales in millions, 2008–2012).
Business reporting (also called OLAP or BI) is an essential part of the larger drive to-
ward improved, evidence-based, optimal managerial decision making. The foundation of
these business reports is various sources of data coming from both inside and outside
the organization (OLTP systems). Creation of these reports involves extract, transform,
and load (ETL) procedures in coordination with a data warehouse and then using one or
more reporting tools.
Due to the rapid expansion of IT coupled with the need for improved competitive-
ness in business, there has been an increase in the use of computing power to produce
unified reports that join different views of the enterprise in one place. Usually, this report-
ing process involves querying structured data sources, most of which were created using
different logical data models and data dictionaries, to produce a human-readable, easily
digestible report. These types of business reports allow managers and coworkers to stay
informed and involved, review options and alternatives, and make informed decisions.
Figure 3.18 shows the continuous cycle of data acquisition → information generation → decision making → business process management. Perhaps the most critical task in this
cyclical process is the reporting (i.e., information generation)—converting data from dif-
ferent sources into actionable information.
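As a rough illustration of this conversion step, the sketch below (hypothetical transaction records and column names; the pandas library is assumed to be installed) aggregates raw transactional data into a small tabular report a decision maker could act on:

# A minimal sketch: turn raw transactional records into a simple tabular report.
# The records and column names are hypothetical; pandas is assumed to be installed.
import pandas as pd

transactions = pd.DataFrame(
    {
        "region":  ["East", "East", "West", "West", "South"],
        "product": ["A", "B", "A", "A", "B"],
        "revenue": [1200.0, 850.0, 990.0, 1430.0, 760.0],
    }
)

# Transform/aggregate the data, then present it as a human-readable report.
report = (
    transactions.groupby(["region", "product"], as_index=False)["revenue"]
    .sum()
    .sort_values("revenue", ascending=False)
)

print("Revenue by Region and Product")
print(report.to_string(index=False))

In a real setting, the ETL step would pull from several operational sources and a data warehouse rather than a single in-memory table, but the data-to-information pattern is the same.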
Key to any successful report are clarity, brevity, completeness, and correctness.
The nature of the report and the level of importance of these success factors change
significantly based on for whom the report is created. Most of the research in effective
reporting is dedicated to internal reports that inform stakeholders and decision makers
within the organization. There are also external reports between businesses and the
government (e.g., for tax purposes or for regular filings to the Securities and Exchange
Commission). Even though there is a wide variety of business reports, the ones that
are often used for managerial purposes can be grouped into three major categories
(Hill, 2016).
FIGURE 3.18 The Role of Information Reporting in Managerial Decision Making.
METRIC MANAGEMENT REPORTS In many organizations, business performance is man-
aged through outcome-oriented metrics. For external groups, these are service-level
agreements. For internal management, they are key performance indicators (KPIs).
Typically, there are enterprise-wide agreed upon targets to be tracked against over a pe-
riod of time. They can be used as part of other management strategies such as Six Sigma
or total quality management.
DASHBOARD-TYPE REPORTS A popular idea in business reporting in recent years has
been to present a range of different performance indicators on one page like a dashboard
in a car. Typically, dashboard vendors would provide a set of predefined reports with
static elements and a fixed structure but also allow for customization of the dashboard widgets and views and for setting targets for various metrics. It is common to have color-coded traffic
lights defined for performance (red, orange, green) to draw management’s attention to
particular areas. A more detailed description of dashboards can be found in a later section
of this chapter.
BALANCED SCORECARD–TYPE REPORTS This is a method developed by Kaplan and
Norton that attempts to present an integrated view of success in an organization. In
addition to financial performance, balanced scorecard–type reports also include cus-
tomer, business process, and learning and growth perspectives. More details on balanced
scorecards are provided in a later section in this chapter.
Application Case 3.5 is an example to illustrate the power and the utility of auto-
mated report generation for a large (and, at a time of natural crisis, somewhat chaotic)
organization such as the Federal Emergency Management Agency.
Application Case 3.5 Flood of Paper Ends at FEMA
Staff at the Federal Emergency Management Agency
(FEMA), the U.S. federal agency that coordinates
disaster response when the president declares a
national disaster, always got two floods at once.
First, water covered the land. Next, a flood of
paper required to administer the National Flood
Insurance Program (NFIP) covered their desks—
pallets and pallets of green-striped reports poured
off a mainframe printer and into their offices.
Individual reports were sometimes 18 inches thick
with a nugget of information about insurance
claims, premiums, or payments buried in them
somewhere.
Bill Barton and Mike Miles do not claim to
be able to do anything about the weather, but the
project manager and computer scientist, respec-
tively, from Computer Sciences Corporation (CSC)
have used WebFOCUS software from Information
Builders to turn back the flood of paper generated
by the NFIP. The program allows the government
to work with national insurance companies to col-
lect flood insurance premiums and pay claims for
flooding in communities that adopt flood control
measures. As a result of CSC’s work, FEMA staffs no
longer leaf through paper reports to find the data
they need. Instead, they browse insurance data
posted on NFIP’s BureauNet intranet site, select
just the information they want to see, and get an
on-screen report or download the data as a spread-
sheet. And that is only the start of the savings that
WebFOCUS has provided. The number of times that
NFIP staff ask CSC for special reports has dropped
in half because NFIP staff can generate many of the
special reports they need without calling on a pro-
grammer to develop them. Then there is the cost
of creating BureauNet in the first place. Barton esti-
mates that using conventional Web and database
software to export data from FEMA’s mainframe,
store it in a new database, and link that to a Web
server would have cost about 100 times as much—
more than $500,000—and taken about two years
to complete compared with the few months Miles
spent on the WebFOCUS solution.
When Tropical Storm Allison, a huge slug of
sodden, swirling cloud, moved out of the Gulf of
Mexico onto the Texas and Louisiana coastline in
June 2001, it killed 34 people, most from drowning;
damaged or destroyed 16,000 homes and businesses;
and displaced more than 10,000 families. President
George W. Bush declared 28 Texas counties disaster
areas, and FEMA moved in to help. This was the first
serious test for BureauNet, and it delivered. This first
comprehensive use of BureauNet resulted in FEMA
field staff readily accessing what they needed when
they needed it and asking for many new types of
reports. Fortunately, Miles and WebFOCUS were
up to the task. In some cases, Barton says, “FEMA
would ask for a new type of report one day, and
Miles would have it on BureauNet the next day,
thanks to the speed with which he could create new
reports in WebFOCUS.”
The sudden demand on the system had little
impact on its performance, noted Barton. “It han-
dled the demand just fine,” he says. “We had no
problems with it at all. And it made a huge differ-
ence to FEMA and the job they had to do. They
had never had that level of access before, never had
been able to just click on their desktop and generate
such detailed and specific reports.”
Questions for Case 3.5
1. What is FEMA, and what does it do?
2. What are the main challenges that FEMA faces?
3. How did FEMA improve its inefficient reporting
practices?
Source: Used with permission from Information Builders. Useful information flows at disaster response agency. informationbuilders.com/applications/fema (accessed July 2018); and fema.gov.
SECTION 3.7 REVIEW QUESTIONS
1. What is a report? What are reports used for?
2. What is a business report? What are the main characteristics of a good business report?
3. Describe the cyclic process of management, and comment on the role of business reports.
4. List and describe the three major categories of business reports.
5. What are the main components of a business reporting system?

3.8 DATA VISUALIZATION
Data visualization (or more appropriately, information visualization) has been defined as "the use of visual representations to explore, make sense of, and communicate data" (Few, 2007). Although the name that is commonly used is data visualization, usually what this means is information visualization. Because information is the aggregation, summarization, and contextualization of data (raw facts), what is portrayed in visualizations is the information, not the data. However, because the two terms data visualization and information visualization are used interchangeably and synonymously, in this chapter we will follow suit.
Data visualization is closely related to the fields of information graphics, information visualization, scientific visualization, and statistical graphics. Until recently, the major forms of data visualization available in both BI and analytics applications have included charts and graphs as well as the other types of visual elements used to create scorecards and dashboards.
To better understand the current and future trends in the field of data visualization, it helps to begin with some historical context.
Brief History of Data Visualization
Despite the fact that predecessors to data visualization date back to the second century AD,
most developments have occurred in the last two and a half centuries, predominantly during
the last 30 years (Few, 2007). Although visualization has not been widely recognized as a
discipline until fairly recently, today’s most popular visual forms date back a few centuries.
Geographical exploration, mathematics, and popularized history spurred the creation of early
maps, graphs, and timelines as far back as the 1600s, but William Playfair is widely credited
as the inventor of the modern chart, having created the first widely distributed line and bar
charts in his Commercial and Political Atlas of 1786 and what is generally considered to be
the first time-series line chart in his Statistical Breviary published in 1801 (see Figure 3.19).

FIGURE 3.19 The First Time-Series Line Chart, Created by William Playfair in 1801.
Perhaps the most notable innovator of information graphics during this period was
Charles Joseph Minard, who graphically portrayed the losses suffered by Napoleon’s army in
the Russian campaign of 1812 (see Figure 3.20). Beginning at the Polish–Russian border, the
thick band shows the size of the army at each position. The path of Napoleon’s retreat from
Moscow in the bitterly cold winter is depicted by the dark lower band, which is tied to tem-
perature and time scales. Popular visualization expert, author, and critic Edward Tufte says
that this “may well be the best statistical graphic ever drawn.” In this graphic, Minard man-
aged to simultaneously represent several data dimensions (the size of the army, direction
of movement, geographic locations, outside temperature, etc.) in an artistic and informative
manner. Many more excellent visualizations were created in the 1800s, and most of them are
chronicled on Tufte's Web site (edwardtufte.com) and his visualization books.

FIGURE 3.20 Decimation of Napoleon's Army during the 1812 Russian Campaign.
The 1900s saw the rise of a more formal, empirical attitude toward visualization,
which tended to focus on aspects such as color, value scales, and labeling. In the mid-
1900s, cartographer and theorist Jacques Bertin published his Semiologie Graphique,
which some say serves as the theoretical foundation of modern information visualization.
Although most of his patterns are either outdated by more recent research or completely
inapplicable to digital media, many are still very relevant.
In the 2000s, the Internet emerged as a new medium for visualization and brought
with it many new tricks and capabilities. Not only has the worldwide, digital distribution of
both data and visualization made them more accessible to a broader audience (raising visual
literacy along the way), but also it has spurred the design of new forms that incorporate
interaction, animation, and graphics-rendering technology unique to screen media and real-
time data feeds to create immersive environments for communicating and consuming data.
Companies and individuals are, seemingly all of a sudden, interested in data; that in-
terest has in turn sparked a need for visual tools that help them understand it. Cheap hard-
ware sensors and do-it-yourself frameworks for building your own system are driving down
the costs of collecting and processing data. Countless other applications, software tools,
and low-level code libraries are springing up to help people collect, organize, manipulate,
visualize, and understand data from practically any source. The Internet has also served as a
fantastic distribution channel for visualizations; a diverse community of designers, program-
mers, cartographers, tinkerers, and data wonks has assembled to disseminate all sorts of
new ideas and tools for working with data in both visual and nonvisual forms.
Google Maps has also single-handedly democratized both the interface conven-
tions (click to pan, double-click to zoom) and the technology (256-pixel square map tiles
with predictable file names) for displaying interactive geography online to the extent that
most people just know what to do when they are presented with a map online. Flash has
served well as a cross-browser platform on which to design and develop rich, beautiful
Internet applications incorporating interactive data visualization and maps; now, new
browser-native technologies such as canvas and SVG (sometimes collectively included
under the umbrella of HTML5) are emerging to challenge Flash’s supremacy and extend
the reach of dynamic visualization interfaces to mobile devices.
The future of data/information visualization is very hard to predict. We can only
extrapolate from what has already been invented: more three-dimensional visualization,
immersive experience with multidimensional data in a virtual reality environment, and
holographic visualization of information. There is a pretty good chance that we will see
something that we have never seen in the information visualization realm invented be-
fore the end of this decade. Application Case 3.6 shows how visual analytics/reporting
tools such as Tableau can help facilitate effective and efficient decision making through
information/insight creation and sharing.
Application Case 3.6 Macfarlan Smith Improves Operational Performance Insight with Tableau Online
The Background
Macfarlan Smith has earned its place in medical history.
The company held a royal appointment to provide
medicine to Her Majesty Queen Victoria and supplied
groundbreaking obstetrician Sir James Simpson with
chloroform for his experiments in pain relief during
labor and delivery. Today, Macfarlan Smith is a sub-
sidiary of the Fine Chemical and Catalysts division of
Johnson Matthey plc. The pharmaceutical manufac-
turer is the world’s leading manufacturer of opiate
narcotics such as codeine and morphine.
Every day, Macfarlan Smith is making decisions
based on its data. The company collects and ana-
lyzes manufacturing operational data, for example,
to allow it to meet continuous improvement goals.
Sales, marketing, and finance rely on data to identify
new pharmaceutical business opportunities, grow
revenues, and satisfy customer needs. Additionally,
the company’s manufacturing facility in Edinburgh
needs to monitor, trend, and report quality data to
ensure the identity, quality, and purity of its phar-
maceutical ingredients for customers and regulatory
authorities such as the U.S. FDA and others as part
of current good manufacturing practice (CGMP).
Challenges: Multiple Sources of Truth and
Slow, Onerous Reporting Processes
The process of gathering that data, making decisions,
and reporting was not easy, though. The data were
scattered across the business including in the compa-
ny’s bespoke enterprise resource planning (ERP) plat-
form, inside legacy departmental databases such as
SQL, Access databases, and stand-alone spreadsheets.
When those data were needed for decision mak-
ing, excessive time and resources were devoted to
extracting the data, integrating them, and presenting
them in a spreadsheet or other presentation outlet.
Data quality was another concern. Because
teams relied on their own individual sources of data,
there were multiple versions of the truth and con-
flicts between the data. And it was sometimes hard
to tell which version of the data was correct and
which was not.
It didn’t stop there. Even once the data had
been gathered and presented, making changes “on
the fly” was slow and difficult. In fact, whenever a
member of the Macfarlan Smith team wanted to per-
form trend or other analysis, the changes to the data
needed to be approved. The end result was that the
data were frequently out of date by the time they
were used for decision making.
Liam Mills, Head of Continuous Improvement
at Macfarlan Smith, highlights a typical reporting
scenario:
One of our main reporting processes is the
“Corrective Action and Preventive Action,” or
CAPA, which is an analysis of Macfarlan Smith’s
manufacturing processes taken to eliminate
causes of non-conformities or other unde-
sirable situations. Hundreds of hours every
month were devoted to pulling data together
for CAPA—and it took days to produce each
report. Trend analysis was tricky too, because
the data was static. In other reporting scenar-
ios, we often had to wait for spreadsheet pivot
table analysis; which was then presented on
a graph, printed out, and pinned to a wall for
everyone to review.
Slow, labor-intensive reporting processes, dif-
ferent versions of the truth, and static data were all
catalysts for change. “Many people were frustrated
because they believed they didn’t have a complete
picture of the business,” says Mills. “We were having
more and more discussions about issues we faced—
when we should have been talking about business
intelligence reporting.”
The Solution: Interactive Data
Visualizations
One of the Macfarlan Smith team had previous expe-
rience in using Tableau and recommended Mills
explore the solution further. A free trial of Tableau
Online quickly convinced Mills that the hosted inter-
active data visualization solution could conquer the
data battles the company was facing.
“I was won over almost immediately,” he says.
“The ease of use, the functionality and the breadth
of data visualizations are all very impressive. And of
course being a software-as-a-service (SaaS)-based
solution, there’s no technology infrastructure invest-
ment, we can be live almost immediately, and we
have the flexibility to add users whenever we need.”
One of the key questions that needed to be
answered concerned the security of the online data.
“Our parent company Johnson Matthey has a cloud-
first strategy, but has to be certain that any hosted
solution is completely secure. Tableau Online fea-
tures like single sign-on and allowing only autho-
rized users to interact with the data provide that
watertight security and confidence.”
The other security question that Macfarlan
Smith and Johnson Matthey wanted answered was
this: Where are the data physically stored? Mills
again: “We are satisfied Tableau Online meets our
criteria for data security and privacy. The data and
workbooks are all hosted in Tableau’s new Dublin
data center, so it never leaves Europe.”
Following a six-week trial, the Tableau sales
manager worked with Mills and his team to build a
business case for Tableau Online. The management
team approved it almost straight away, and a pilot
program involving 10 users began. The pilot involved
a manufacturing quality improvement initiative: look-
ing at deviations from the norm, such as when a heat-
ing device used in the opiate narcotics manufacturing
process exceeds a temperature threshold. From this,
a “quality operations” dashboard was created to track
and measure deviations and put in place measures to
improve operational quality and performance.
“That dashboard immediately signaled where
deviations might be. We weren’t ploughing through
rows of data—we reached answers straight away,”
says Mills.
Throughout this initial trial and pilot, the team
used Tableau training aids, such as the free training
videos, product walk-throughs, and live online train-
ing. They also participated in a two-day “fundamen-
tals training” event in London. According to Mills,
“The training was expert, precise and pitched just
at the right level. It demonstrated to everyone just
how intuitive Tableau Online is. We can visualize 10
years’ worth of data in just a few clicks.” The com-
pany now has five Tableau Desktop users and up to
200 Tableau Online licensed users.
Mills and his team particularly like the Tableau
Union feature in Version 9.3, which allows them to
piece together data that have been split into little files.
“It’s sometimes hard to bring together the data we
use for analysis. The Union feature lets us work with
data spread across multiple tabs or files, reducing the
time we spend on prepping the data,” he says.
The Results: Cloud Analytics Transform
Decision Making and Reporting
By standardizing on Tableau Online, Macfarlan Smith
has transformed the speed and accuracy of its deci-
sion making and business reporting. This includes:
• New interactive dashboards can be produced
within one hour. Previously, it used to take
days to integrate and present data in a static
spreadsheet.
• The CAPA manufacturing process report,
which used to absorb hundreds of man-hours
every month and days to produce, can now be
produced in minutes—with insights shared in
the cloud.
• Reports can be changed and interrogated “on
the fly” quickly and easily, without technical
intervention. Macfarlan Smith has the flexibility
to publish dashboards with Tableau Desktop
and share them with colleagues, partners, or
customers.
• The company has one, single, trusted version
of the truth.
• Macfarlan Smith is now having discussions
about its data—not about the issues surround-
ing data integration and data quality.
• New users can be brought online almost
instantly—and there’s no technical infrastruc-
ture to manage.
Following this initial success, Macfarlan Smith
is now extending Tableau Online to financial report-
ing, supply chain analytics, and sales forecasting. Mills
concludes, “Our business strategy is now based on
data-driven decisions, not opinions. The interactive
visualizations enable us to spot trends instantly, identify
process improvements and take business intelligence
to the next level. I’ll define my career by Tableau.”
Questions for Case 3.6
1. What were the data and reporting related chal-
lenges that Macfarlan Smith faced?
2. What were the solution and the obtained results/
benefits?
Source: Tableau Customer Case Study, "Macfarlan Smith improves operational performance insight with Tableau Online," http://www.tableau.com/stories/customer/macfarlan-smith-improves-operational-performance-insight-tableau-online (accessed June 2018). Used with permission from Tableau Software, Inc.
SECTION 3.8 REVIEW QUESTIONS
1. What is data visualization? Why is it needed?
2. What are the historical roots of data visualization?
3. Carefully analyze Charles Joseph Minard's graphical portrayal of Napoleon's march. Identify and comment on all the information dimensions captured in this classic diagram.
4. Who is Edward Tufte? Why do you think we should know about his work?
5. What do you think is the "next big thing" in data visualization?

3.9 DIFFERENT TYPES OF CHARTS AND GRAPHS
Often end users of business analytics systems are not sure what type of chart or graph to use for a specific purpose. Some charts or graphs are better at answering certain types of questions. Some look better than others. Some are simple; some are rather complex and crowded. What follows is a short description of the types of charts and/or graphs commonly found in most business analytics tools and the types of questions they are better at answering/analyzing. This material is compiled from several published articles and other literature (Abela, 2008; Hardin et al., 2012; SAS, 2014).

Basic Charts and Graphs
What follows are the basic charts and graphs that are commonly used for information visualization.

LINE CHART The line chart is perhaps the most frequently used graphical visual for time-series data. Line charts (or line graphs) show the relationship between two variables; they are most often used to track changes or trends over time (having one of the variables set to time on the x-axis). Line charts sequentially connect individual data points to help infer changing trends over a period of time. Line charts are often used to show time-dependent changes in the values of some measure, such as changes in a specific
stock price over a five-year period or changes in the number of daily customer service
calls over a month.
BAR CHART Bar charts are among the most basic visuals used for data representation. They are effective when you have nominal data or numerical data that split nicely into
different categories so you can quickly see comparative results and trends within your
data. Bar charts are often used to compare data across multiple categories such as the
percentage of advertising spending by departments or by product categories. Bar charts
can be vertically or horizontally oriented. They can also be stacked on top of each other
to show multiple dimensions in a single chart.
PIE CHART Pie charts are, as the name implies, pie-looking charts that are visually appealing.
Because they are so visually attractive, they are often incorrectly used. Pie charts should
be used only to illustrate relative proportions of a specific measure. For instance, they
can be used to show the relative percentage of an advertising budget spent on differ-
ent product lines, or they can show relative proportions of majors declared by college
students in their sophomore year. If the number of categories to show is more than just
a few (say more than four), one should seriously consider using a bar chart instead of a
pie chart.
SCATTER PLOT The scatter plot is often used to explore the relationship between two
or three variables (in 2D or 3D visuals). Because scatter plots are visual exploration tools,
translating more than three variables into more than three dimensions is not easily achiev-
able. Scatter plots are an effective way to explore the existence of trends, concentrations,
and outliers. For instance, in a two-variable (two-axis) graph, a scatter plot can be used to
illustrate the co-relationship between age and weight of heart disease patients, or it can
illustrate the relationship between the number of customer care representatives and the
number of open customer service claims. Often, a trend line is superimposed on a two-
dimensional scatter plot to illustrate the nature of the relationship.
BUBBLE CHART The bubble chart is often an enhanced version of scatter plots. Bubble
charts, though, are not a new visualization type; instead, they should be viewed as a tech-
nique to enrich data illustrated in scatter plots (or even geographic maps). By varying the
size and/or color of the circles, one can add additional data dimensions, offering more
enriched meaning about the data. For instance, a bubble chart can be used to show a
competitive view of college-level class attendance by major and by time of the day, and it
can be used to show profit margin by product type and by geographic region.
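Assuming a standard plotting library such as matplotlib is available, the short sketch below (with invented sample data) produces the four basic chart types just described: a line chart, a bar chart, a pie chart, and a bubble-style scatter plot.

# A minimal sketch of the basic chart types, using invented sample data.
# matplotlib is assumed to be installed (pip install matplotlib).
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Line chart: a measure tracked over time.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [4.1, 4.6, 5.2, 4.9, 5.8, 6.3]
axes[0, 0].plot(months, sales, marker="o")
axes[0, 0].set_title("Line: Monthly Sales (millions)")

# Bar chart: comparison across categories.
depts = ["Marketing", "Sales", "Support", "R&D"]
spend = [320, 410, 180, 260]
axes[0, 1].bar(depts, spend)
axes[0, 1].set_title("Bar: Spending by Department")

# Pie chart: relative proportions of a single measure.
axes[1, 0].pie(spend, labels=depts, autopct="%1.0f%%")
axes[1, 0].set_title("Pie: Share of Budget")

# Bubble chart: a scatter plot with point size encoding a third dimension.
age = [25, 34, 41, 52, 60]
weight = [68, 75, 82, 88, 79]
claims = [40, 90, 160, 260, 180]  # bubble size encodes a third measure
axes[1, 1].scatter(age, weight, s=claims, alpha=0.5)
axes[1, 1].set_title("Bubble: Age vs. Weight (size = claims)")

fig.tight_layout()
plt.show()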
Specialized Charts and Graphs
The graphs and charts that we review in this section are either derived from the basic
charts as special cases or they are relatively new and are specific to a problem type and/
or an application area.
HISTOGRAM Graphically speaking, a histogram looks just like a bar chart. The dif-
ference between histograms and generic bar charts is the information that is portrayed.
Histograms are used to show the frequency distribution of one variable or several vari-
ables. In a histogram, the x-axis is often used to show the categories or ranges, and the
y-axis is used to show the measures/values/frequencies. Histograms show the distribu-
tional shape of the data. That way, one can visually examine whether the data are nor-
mally or exponentially distributed. For instance, one can use a histogram to illustrate the
exam performance of a class, to show distribution of the grades as well as comparative
analysis of individual results, or to show the age distribution of the customer base.
GANTT CHART A Gantt chart is a special case of horizontal bar charts used to portray
project timelines, project tasks/activity durations, and overlap among the tasks/activities.
By showing start and end dates/times of tasks/activities and the overlapping relation-
ships, Gantt charts provide an invaluable aid for management and control of projects. For
instance, Gantt charts are often used to show project timelines, task overlaps, relative task
completions (a partial bar illustrating the completion percentage inside a bar that shows
the actual task duration), resources assigned to each task, milestones, and deliverables.
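Although Gantt charts are usually produced by project management software, the underlying idea is simply a horizontal bar chart with start offsets; the minimal sketch below (task names and dates are invented, matplotlib assumed) shows task durations and overlaps.

# A minimal Gantt-style chart: horizontal bars with start offsets and durations.
# Task names and dates are invented; matplotlib is assumed to be installed.
import matplotlib.pyplot as plt

tasks = ["Requirements", "Design", "Build", "Test", "Deploy"]
starts = [0, 2, 5, 10, 13]      # start day of each task
durations = [3, 4, 6, 4, 1]     # task length in days

fig, ax = plt.subplots(figsize=(8, 3))
ax.barh(tasks, durations, left=starts)
ax.set_xlabel("Project day")
ax.set_title("Gantt Chart: Task Durations and Overlaps")
ax.invert_yaxis()               # list tasks top-down in the order they start
plt.tight_layout()
plt.show()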
PERT CHART The PERT chart (also called a network diagram) is developed primarily to
simplify the planning and scheduling of large and complex projects. A PERT chart shows
precedence relationships among project activities/tasks. It is composed of nodes (rep-
resented as circles or rectangles) and edges (represented with directed arrows). Based
on the selected PERT chart convention, either nodes or the edges can be used to repre-
sent the project activities/tasks (activity-on-node versus activity-on-arrow representation
schema).
GEOGRAPHIC MAP When the data set includes any kind of location data (e.g., physical
addresses, postal codes, state names or abbreviations, country names, latitude/longitude,
or some type of custom geographic encoding), it is better and more informative to see the
data on a map. Maps usually are used in conjunction with other charts and graphs rather
than by themselves. For instance, one can use maps to show the distribution of customer
service requests by product type (depicted in pie charts) by geographic locations. Often
a large variety of information (e.g., age distribution, income distribution, education, eco-
nomic growth, population changes) can be portrayed in a geographic map to help decide
where to open a new restaurant or a new service station. These types of systems are often
called geographic information systems (GIS).
BULLET A bullet graph is often used to show progress toward a goal. This graph is
essentially a variation of a bar chart. Often bullet graphs are used in place of gauges,
meters, and thermometers in a dashboard to more intuitively convey the meaning within
a much smaller space. Bullet graphs compare a primary measure (e.g., year-to-date rev-
enue) to one or more other measures (e.g., annual revenue target) and present this in the
context of defined performance metrics (e.g., sales quotas). A bullet graph can intuitively
illustrate how the primary measure is performing against overall goals (e.g., how close a
sales representative is to achieving his or her annual quota).
HEAT MAP The heat map is a great visual to illustrate the comparison of continuous
values across two categories using color. The goal is to help the user quickly see where
the intersection of the categories is strongest and weakest in terms of numerical values
of the measure being analyzed. For instance, one can use a heat map to show segmenta-
tion analysis of target markets, where the measure (shown as the color gradient) is the purchase amount and the dimensions are age and income.
HIGHLIGHT TABLE The highlight table is intended to take heat maps one step further. In
addition to showing how data intersect by using color, highlight tables add a number on
top to provide additional detail. That is, they are two-dimensional tables with cells popu-
lated with numerical values and gradients of colors. For instance, one can show sales
representatives’ performance by product type and by sales volume.
TREE MAP A tree map displays hierarchical (tree-structured) data as a set of nested
rectangles. Each branch of the tree is given a rectangle, which is then tiled with smaller
rectangles representing subbranches. A leaf node’s rectangle has an area proportional to
a specified dimension on the data. Often the leaf nodes are colored to show a separate
dimension of the data. When the color and size dimensions are correlated in some way
with the tree structure, one can often easily see patterns that would be difficult to spot
in other ways, such as a certain color that is particularly relevant. A second advantage of
tree maps is that, by construction, they make efficient use of space. As a result, they can
legibly display thousands of items on the screen simultaneously.
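The specialized types follow the same pattern as the basic ones; a minimal sketch for a histogram and a heat map (again with invented data, and with matplotlib and numpy assumed to be installed) is shown below.

# A minimal sketch of a histogram and a heat map, using invented data.
# matplotlib and numpy are assumed to be installed.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: frequency distribution of a single variable (e.g., exam scores).
scores = rng.normal(loc=72, scale=10, size=200)
ax1.hist(scores, bins=15)
ax1.set_title("Histogram: Exam Score Distribution")
ax1.set_xlabel("Score")
ax1.set_ylabel("Frequency")

# Heat map: a continuous measure across two categories
# (e.g., purchase amounts by age group and income band).
purchases = rng.integers(10, 100, size=(4, 5))
im = ax2.imshow(purchases, cmap="viridis")
ax2.set_xticks(range(5))
ax2.set_xticklabels(["<30k", "30-50k", "50-75k", "75-100k", ">100k"])
ax2.set_yticks(range(4))
ax2.set_yticklabels(["18-29", "30-44", "45-59", "60+"])
ax2.set_title("Heat Map: Purchases by Age and Income")
fig.colorbar(im, ax=ax2, label="Purchase amount")

fig.tight_layout()
plt.show()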
Which Chart or Graph Should You Use?
Which chart or graph that we explained in the previous section is the best? The answer
is rather easy: There is not one best chart or graph because if there were, we would not
have so many chart and graph types. They all have somewhat different data representa-
tion “skills.” Therefore, the right question should be, “Which chart or graph is the best
for a given task?” The capabilities of the charts given in the previous section can help
in selecting and using the proper chart/graph for a specific task, but the choice is still not always easy. Several different chart/graph types can be used for the same visualiza-
tion task. One rule of thumb is to select and use the simplest one from the alternatives to
make it easy for the intended audience to understand and digest.
Although there is not a widely accepted, all-encompassing chart selection algorithm
or chart/graph taxonomy, Figure 3.21 presents a rather comprehensive and highly logical
organization of chart/graph types in a taxonomy-like structure (the original version was published in Abela, 2008). The taxonomic structure is organized around the question "What would you like to show in your chart or graph?"—that is, what the purpose of the chart or graph will be. At that level, the taxonomy divides the purpose into four different types—relationship, comparison, distribution, and composition—and further divides the branches into subcategories based on the number of variables involved and the time dependency of the visualization.

FIGURE 3.21 A Taxonomy of Charts and Graphs. Source: Adapted from Abela, A. (2008). Advanced Presentations by Design: Creating Communication That Drives Action. New York: Wiley.
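A greatly simplified version of this decision logic can be written down directly. The sketch below captures only the top branch of the taxonomy (purpose plus a couple of qualifiers) and is meant as an illustration, not a complete reproduction of Figure 3.21.

# A simplified, illustrative chart selector inspired by the purpose-driven taxonomy.
# It covers only a few branches; the full taxonomy has many more.

def suggest_chart(purpose: str, variables: int = 2, over_time: bool = False) -> str:
    purpose = purpose.lower()
    if purpose == "relationship":
        return "scatter plot" if variables <= 2 else "bubble chart"
    if purpose == "comparison":
        return "line chart" if over_time else "bar chart"
    if purpose == "distribution":
        return "histogram" if variables == 1 else "scatter plot"
    if purpose == "composition":
        return "stacked bar/area chart" if over_time else "pie chart"
    return "unknown purpose: choose relationship, comparison, distribution, or composition"

print(suggest_chart("comparison", over_time=True))   # line chart
print(suggest_chart("composition"))                  # pie chart
print(suggest_chart("relationship", variables=3))    # bubble chart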
Even though these charts and graphs cover a major part of what is commonly
used in information visualization, they by no means cover all. Today, one can find many
other specialized graphs and charts that serve a specific purpose. Furthermore, the cur-
rent trend is to combine/hybridize and animate these charts for better-looking and more
intuitive visualization of today’s complex and volatile data sources. For instance, the
interactive, animated, bubble charts available at the Gapminder Web site (gapminder.
org) provide an intriguing way of exploring world health, wealth, and population data
from a multidimensional perspective. Figure 3.22 depicts the types of displays available
at that site. In this graph, population size, life expectancy, and per capita income at the
continent level are shown; also given is a time-varying animation that shows how these
variables change over time.
FIGURE 3.22 A Gapminder Chart That Shows the Wealth and Health of Nations. Source: gapminder.org.
SECTION 3.9 REVIEW QUESTIONS
1. Why do you think there are many different types of charts and graphs?
2. What are the main differences among line, bar, and pie charts? When should you use
one over the others?
3. Why would you use a geographic map? What other types of charts can be combined
with it?
4. Find and explain the role of two types of charts that are not covered in this section.
3.10 EMERGENCE OF VISUAL ANALYTICS
As Seth Grimes (2009a, b) has noted, there is a “growing palate” of data visualization tech-
niques and tools that enable the users of business analytics and BI systems to better “commu-
nicate relationships, add historical context, uncover hidden correlations, and tell persuasive
stories that clarify and call to action.” The latest Magic Quadrant for Business Intelligence
and Analytics Platforms released by Gartner in February 2016 further emphasizes the impor-
tance of data visualization in BI and analytics. As the chart in Figure 3.23 shows, all solution
providers in the Leaders and Visionaries quadrants are either relatively recently founded
information visualization companies (e.g., Tableau Software, QlikTech) or well-established
large analytics companies (e.g., Microsoft, SAS, IBM, SAP, MicroStrategy, Alteryx) that are
increasingly focusing their efforts on information visualization and visual analytics. More de-
tails on Gartner's latest Magic Quadrant are given in Technology Insights 3.2.

FIGURE 3.23 Magic Quadrant for Business Intelligence and Analytics Platforms. Source: Used with permission from Gartner Inc.
In BI and analytics, the key challenges for visualization have revolved around the
intuitive representation of large, complex data sets with multiple dimensions and mea-
sures. For the most part, the typical charts, graphs, and other visual elements used in
these applications usually involve two dimensions, sometimes three, and fairly small sub-
sets of data sets. In contrast, the data in these systems reside in a data warehouse. At a
minimum, these warehouses involve a range of dimensions (e.g., product, location, organizational structure, time), a range of measures, and millions of cells of data. In an effort to address these challenges, a number of researchers have developed a variety of new visualization techniques.

TECHNOLOGY INSIGHTS 3.2 Gartner Magic Quadrant for Business Intelligence and Analytics Platforms
Gartner, Inc., the creator of Magic Quadrants, is the leading IT research and advisory company, publicly traded in the United States, with over $2 billion in annual revenue in 2015. Founded in 1979, Gartner has 7,600 associates, including 1,600 research analysts and consultants, and numerous clients in 90 countries.
Magic Quadrant is a research method designed and implemented by Gartner to monitor
and evaluate the progress and positions of companies in a specific, technology-based market.
By applying a graphical treatment and a uniform set of evaluation criteria, Magic Quadrant helps
users to understand how technology providers are positioned within a market.
Gartner changed the name of this Magic Quadrant from “Business Intelligence Platforms”
to “Business Intelligence and Analytics Platforms” to emphasize the growing importance of ana-
lytics capabilities to the information systems that organizations are now building. Gartner defines
the BI and analytics platform market as a software platform that delivers 15 capabilities across
three categories: integration, information delivery, and analysis. These capabilities enable orga-
nizations to build precise systems of classification and measurement to support decision making
and improve performance.
Figure 3.23 illustrates the latest Magic Quadrant for Business Intelligence and Analytics
Platforms. Magic Quadrant places providers in four groups (niche players, challengers, visionar-
ies, and leaders) along two dimensions: completeness of vision (x-axis) and ability to execute
(y-axis). As the quadrant clearly shows, most of the well-known BI/BA (business analytics)
providers are positioned in the “leaders” category while many of the less known, relatively new,
emerging providers are positioned in the “niche players” category.
The BI and analytics platform market’s multiyear shift from IT-led enterprise reporting to
business-led self-service analytics seems to have passed the tipping point. Most new buying is of
modern, business-user-centric visual analytics platforms forcing a new market perspective, sig-
nificantly reordering the vendor landscape. Most of the activity in the BI and analytics platform
market is from organizations that are trying to mature their visualization capabilities and to move
from descriptive to predictive and prescriptive analytics echelons. The vendors in the market
have overwhelmingly concentrated on meeting this user demand. If there were a single market
theme in 2015, it would be that data discovery/visualization became a mainstream architec-
ture. While data discovery/visualization vendors such as Tableau, Qlik, and Microsoft are solidi-
fying their position in the Leaders quadrant, others (both emerging and large, well-established
tool/solution providers) are trying to move out of Visionaries into the Leaders quadrant.
This emphasis on data discovery/visualization from most of the leaders and visionar-
ies in the market—which are now promoting tools with business-user-friendly data integration
coupled with embedded storage and computing layers and unfettered drilling—continues to
accelerate the trend toward decentralization and user empowerment of BI and analytics and
greatly enables organizations’ ability to perform diagnostic analytics.
Source: Gartner Magic Quadrant, released on February 4, 2016, gartner.com (accessed August 2016). Used
with permission from Gartner Inc.
Visual Analytics
Visual analytics is a recently coined term that is often used loosely to mean nothing
more than information visualization. What is meant by visual analytics is the combi-
nation of visualization and predictive analytics. Whereas information visualization is
aimed at answering “What happened?” and “What is happening?” and is closely associ-
ated with BI (routine reports, scorecards, and dashboards), visual analytics is aimed
at answering “Why is it happening?” and “What is more likely to happen?” and is usu-
ally associated with business analytics (forecasting, segmentation, correlation analysis).
Many of the information visualization vendors are adding the capabilities to call them-
selves visual analytics solution providers. One of the top, long-time analytics solution
providers, SAS Institute, is approaching it from another direction. It is embedding its
analytics capabilities into a high-performance data visualization environment that it
calls visual analytics.
Visual or not visual, automated or manual, online or paper based, business report-
ing is not much different than telling a story. Technology Insights 3.3 provides a different,
unorthodox viewpoint on better business reporting.
TECHNOLOGY INSIGHTS 3.3 Telling Great Stories with Data
and Visualization
Everyone who has data to analyze has stories to tell, whether it’s diagnosing the reasons for
manufacturing defects, selling a new idea in a way that captures the imagination of your
target audience, or informing colleagues about a particular customer service improvement
program. And when it’s telling the story behind a big strategic choice so that you and your
senior management team can make a solid decision, providing a fact-based story can be es-
pecially challenging. In all cases, it’s a big job. You want to be interesting and memorable;
you know you need to keep it simple for your busy executives and colleagues. Yet you also
know you have to be factual, detail oriented, and data driven, especially in today’s metric-
centric world.
It’s tempting to present just the data and facts, but when colleagues and senior manage-
ment are overwhelmed by data and facts without context, you lose. We have all experienced
presentations with large slide decks only to find that the audience is so overwhelmed with data
that they don’t know what to think, or they are so completely tuned out that they take away only
a fraction of the key points.
Start engaging your executive team and explaining your strategies and results more
powerfully by approaching your assignment as a story. You will need the “what” of your
story (the facts and data), but you also need the “Who?” “How?” “Why?” and the often-missed
“So what?” It’s these story elements that will make your data relevant and tangible for your
audience. Creating a good story can aid you and senior management in focusing on what is
important.
Why Story?
Stories bring life to data and facts. They can help you make sense and order out of a disparate
collection of facts. They make it easier to remember key points and can paint a vivid picture of
what the future can look like. Stories also create interactivity—people put themselves into stories
and can relate to the situation.
Cultures have long used storytelling to pass on knowledge and content. In some
cultures, storytelling is critical to their identity. For example, in New Zealand, some of the
Maori people tattoo their faces with mokus. A moku is a facial tattoo containing a story
about ancestors—the family tribe. A man may have a tattoo design on his face that shows
features of a hammerhead to highlight unique qualities about his lineage. The design he
chooses signifies what is part of his “true self” and his ancestral home.
Likewise, when we are trying to understand a story, the storyteller navigates to finding the
“true north.” If senior management is looking to discuss how they will respond to a competitive
change, a good story can make sense and order out of a lot of noise. For example, you may
have facts and data from two studies, one including results from an advertising study and one
from a product satisfaction study. Developing a story for what you measured across both studies
can help people see the whole where there were disparate parts. For rallying your distributors
around a new product, you can employ a story to give vision to what the future can look like.
Most important, storytelling is interactive—typically, the presenter uses words and pictures that
audience members can put themselves into. As a result, they become more engaged and better
understand the information.
So What Is a Good Story?
Most people can easily rattle off their favorite film or book. Or they remember a funny story that
a colleague recently shared. Why do people remember these stories? Because they contain cer-
tain characteristics. First, a good story has great characters. In some cases, the reader or viewer
has a vicarious experience where they become involved with the character. The character then
has to be faced with a challenge that is difficult but believable. There must be hurdles that the
character overcomes. And finally, the outcome or prognosis is clear by the end of the story. The
situation may not be resolved—but the story has a clear endpoint.
Think of Your Analysis as a Story—Use a Story Structure
When crafting a data-rich story, the first objective is to find the story. Who are the characters?
What is the drama or challenge? What hurdles have to be overcome? And at the end of your
story, what do you want your audience to do as a result?
Once you know the core story, craft your other story elements: define your characters,
understand the challenge, identify the hurdles, and crystallize the outcome or decision question.
Make sure you are clear with what you want people to do as a result. This will shape how your
audience will recall your story. With the story elements in place, write out the storyboard, which
represents the structure and form of your story. Although it’s tempting to skip this step, it is bet-
ter first to understand the story you are telling and then to focus on the presentation structure
and form. Once the storyboard is in place, the other elements will fall into place. The storyboard
will help you think about the best analogies or metaphors, clearly set up challenge or oppor-
tunity, and finally see the flow and transitions needed. The storyboard also helps you focus on
key visuals (graphs, charts, and graphics) that you need your executives to recall. Figure 3.24
shows a storyline for the impact of small loans in a worldwide view within the Tableau visual
analytics environment.

FIGURE 3.24 A Storyline Visualization in Tableau Software. Source: Used with permission from Tableau Software, Inc.
In summary, do not be afraid to use data to tell great stories. Being factual, detail oriented,
and data driven is critical in today’s metric-centric world, but it does not have to mean being
boring and lengthy. In fact, by finding the real stories in your data and following the best prac-
tices, you can get people to focus on your message—and thus on what’s important. Here are
those best practices:
1. Think of your analysis as a story—use a story structure.
2. Be authentic—your story will flow.
3. Be visual—think of yourself as a film editor.
4. Make it easy for your audience and you.
5. Invite and direct discussion.
Source: Fink, E., & Moore, S. J. (2012). “Five Best Practices for Telling Great Stories with Data.” White paper
by Tableau Software, Inc., www.tableau.com/whitepapers/telling-data-stories (accessed May 2016).
High-Powered Visual Analytics Environments
Due to the increasing demand for visual analytics coupled with fast-growing data volumes,
there is an exponential movement toward investing in highly efficient visualization sys-
tems. With its latest move into visual analytics, the statistical software giant SAS Institute is
now among those who are leading this wave. Its new product, SAS Visual Analytics, is a
very high-performance, in-memory computing solution for exploring massive amounts
of data in a very short time (almost instantaneously). It empowers users to spot patterns,
identify opportunities for further analysis, and convey visual results via Web reports or
mobile platforms such as tablets and smartphones. Figure 3.25 shows the high-level architecture of the SAS Visual Analytics platform. On one end of the architecture, there are
universal data builder and administrator capabilities, leading into explorer, report designer,
and mobile BI modules, collectively providing an end-to-end visual analytics solution.
Some of the key benefits proposed by the SAS analytics platform (see Figure 3.25)
are the following:
• Empowers all users with data exploration techniques and approachable analytics
to drive improved decision making. SAS Visual Analytics enables different types of
users to conduct fast, thorough explorations on all available data. Sampling to re-
duce the data is not required and not preferred.
• Has easy-to-use, interactive Web interfaces that broaden the audience for analyt-
ics, enabling everyone to glean new insights. Users can look at additional options,
make more precise decisions, and drive success even faster than before.
• Answers complex questions faster, enhancing the contributions from your analytic
talent. SAS Visual Analytics augments the data discovery and exploration process by
providing extremely fast results to enable better, more focused analysis. Analytically
savvy users can identify areas of opportunity or concern from vast amounts of data
so further investigation can take place quickly.
• Improves information sharing and collaboration. Large numbers of users, including
those with limited analytical skills, can quickly view and interact with reports and
charts via the Web, Adobe PDF files, and iPad mobile devices while IT maintains
control of the underlying data and security. SAS Visual Analytics provides the right
information to the right person at the right time to improve productivity and orga-
nizational knowledge.
• Liberates IT by giving users a new way to access the information they need. Frees
IT from the constant barrage of demands from users who need access to different
amounts of data, different data views, ad hoc reports, and one-off requests for infor-
mation. SAS Visual Analytics enables IT to easily load and prepare data for multiple
users. Once data are loaded and available, users can dynamically explore data, cre-
ate reports, and share information on their own.
• Provides room to grow at a self-determined pace. SAS Visual Analytics provides the
option of using commodity hardware or database appliances from EMC Greenplum
and Teradata. It is designed from the ground up for performance optimization and
scalability to meet the needs of any size organization.
FIGURE 3.25 An Overview of SAS Visual Analytics Architecture. Source: Copyright © SAS Institute, Inc. Used with permission.

Figure 3.26 shows a screenshot of the SAS Visual Analytics platform where a time-series forecast and the confidence interval around the forecast are depicted.

FIGURE 3.26 A Screenshot from SAS Visual Analytics. Source: Copyright © SAS Institute, Inc. Used with permission.
SECTION 3.10 REVIEW QUESTIONS
1. What are the main reasons for the recent emergence of visual analytics?
2. Look at Gartner’s Magic Quadrant for Business Intelligence and Analytics Platforms.
What do you see? Discuss and justify your observations.
3. What is the difference between information visualization and visual analytics?
4. Why should storytelling be a part of your reporting and data visualization?
5. What is a high-powered visual analytics environment? Why do we need it?
3.11 INFORMATION DASHBOARDS
Information dashboards are common components of most, if not all, BI or business ana-
lytics platforms, business performance management systems, and performance measure-
ment software suites. Dashboards provide visual displays of important information that
is consolidated and arranged on a single screen so that the information can be digested
at a single glance and easily drilled in and further explored. A typical dashboard is
shown in Figure 3.27. This particular executive dashboard displays a variety of key per-
formance indicators (KPIs) for a hypothetical software company called Sonatica (selling
audio tools). This executive dashboard shows a high-level view of the different functional
groups surrounding the products, starting from a general overview to the marketing ef-
forts, sales, finance, and support departments. All of this is intended to give executive
decision makers a quick and accurate idea of what is going on within the organization.
On the left side of the dashboard, we can see (in a time-series fashion) the quarterly
changes in revenues, expenses, and margins as well as the comparison of those figures
to previous years’ monthly numbers. On the upper-right side are two dials with color-
coded regions showing the amount of monthly expenses for support services (dial on
the left) and the amount of other expenses (dial on the right). As the color coding in-
dicates, although the monthly support expenses are well within the normal ranges, the
other expenses are in the red region, indicating excessive values. The geographic map
on the bottom right shows the distribution of sales at the country level throughout the
world. Behind these graphical icons are various mathematical functions that aggregate
numerous data points into meaningful summary figures. By clicking on these
graphical icons, the consumer of this information can drill down to more granular levels
of information and data.
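In code terms, this aggregate-then-drill-down behavior can be mimicked with a few lines of data manipulation. The following minimal Python sketch (using the pandas library; the column names and figures are hypothetical and are not taken from the Sonatica dashboard) rolls transaction-level records up to top-level figures and then breaks the same data out at a more granular level:

import pandas as pd

# Hypothetical transaction-level records sitting behind a dashboard widget
sales = pd.DataFrame({
    "country":  ["USA", "USA", "Germany", "Germany", "Japan"],
    "quarter":  ["Q1", "Q2", "Q1", "Q2", "Q1"],
    "revenue":  [120_000, 135_000, 80_000, 95_000, 60_000],
    "expenses": [90_000, 100_000, 70_000, 72_000, 55_000],
})

# Top-level KPIs: one aggregated figure per measure (what the executive sees first)
top_level = sales[["revenue", "expenses"]].sum()
top_level["margin"] = top_level["revenue"] - top_level["expenses"]
print(top_level)

# Drill-down: the same data broken out by country and quarter
drill_down = sales.groupby(["country", "quarter"])[["revenue", "expenses"]].sum()
print(drill_down)

A production dashboard wires such aggregations to its graphical widgets so that clicking an icon simply triggers the next, more granular query.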
Dashboards are used in a wide variety of businesses for a wide variety of reasons.
For instance, in Application Case 3.7, you will find the summary of a successful imple-
mentation of information dashboards by the Dallas Cowboys football team.
FIGURE 3.27 A Sample Executive Dashboard. Source: A Sample Executive Dashboard from Dundas Data Visualization,
Inc., www.dundas.com, reprinted with permission.
Application Case 3.7 Dallas Cowboys Score Big with Tableau and Teknion
Founded in 1960, the Dallas Cowboys are a pro-
fessional American football team headquartered in
Irving, Texas. The team has a large national follow-
ing, which is perhaps best represented by their NFL
record for number of consecutive games at sold-out
stadiums.
The Challenge
Bill Priakos, chief operating officer (COO) of the
Dallas Cowboys Merchandising Division, and his
team needed more visibility into their data so they
could run the merchandising business more profitably. Microsoft was selected
as the baseline platform for this upgrade as well as
a number of other sales, logistics, and e-commerce
applications. The Cowboys expected that
this new information architecture would provide the
needed analytics and reporting. Unfortunately, this
was not the case, and the search began for a robust
dashboarding, analytics, and reporting tool to fill
this gap.
The Solution and Results
Tableau and Teknion together provided real-time
reporting and dashboard capabilities that exceeded
the Cowboys’ requirements. Systematically and
methodically, the Teknion team worked side by side
with data owners and data users within the Dallas
Cowboys to deliver all required functionality on
time and under budget. “Early in the process, we
were able to get a clear understanding of what it
would take to run a more profitable operation for
the Cowboys,” said Teknion Vice President Bill
Luisi. “This process step is a key step in Teknion’s
approach with any client, and it always pays huge
dividends as the implementation plan progresses.”
Added Luisi, “Of course, Tableau worked very
closely with us and the Cowboys during the entire
project. Together, we made sure that the Cowboys
could achieve their reporting and analytical goals in
record time.”
Now, for the first time, the Dallas Cowboys are
able to monitor their complete merchandising activi-
ties from manufacture to end customer and not only
see what is happening across the life cycle but also
drill down even further into why it is happening.
Today, this BI solution is used to report and
analyze the business activities of the Merchandising
Division, which is responsible for all of the Dallas
Cowboys’ brand sales. Industry estimates say that
the Cowboys generate 20 percent of all NFL mer-
chandise sales, which reflects the fact that they are
the most recognized sports franchise in the world.
According to Eric Lai, a ComputerWorld
reporter, Tony Romo and the rest of the Dallas
Cowboys may have been only average on the foot-
ball field in the last few years, but off the field,
especially in the merchandising arena, they remain
America’s team.
Questions for Case 3.7
1. How did the Dallas Cowboys use information
visualization?
2. What were the challenge, the proposed solution,
and the obtained results?
Sources: Lai, E. (2009, October 8). “BI Visualization Tool Helps
Dallas Cowboys Sell More Tony Romo Jerseys,” ComputerWorld.
Tableau case study. tableau.com/blog/computerworld-dallas-
cowboys-business-intelligence (accessed July 2018).
Dashboard Design
Dashboards are not a new concept. Their roots can be traced at least to the executive
information system of the 1980s. Today, dashboards are ubiquitous. For example, a few
years back, Forrester Research estimated that over 40 percent of the largest 2,000 com-
panies in the world used the technology (Ante & McGregor, 2006). Since then, one can
safely assume that this number has gone up quite significantly. In fact, today it would be
rather unusual to see a large company using a BI system that does not employ some sort
of performance dashboards. The Dashboard Spy Web site (dashboardspy.com/about)
provides further evidence of their ubiquity. The site contains descriptions and screenshots
of thousands of BI dashboards, scorecards, and BI interfaces used by businesses of all
sizes and industries, nonprofits, and government agencies.
According to Eckerson (2006), a well-known expert on BI in general and dash-
boards in particular, the most distinctive feature of a dashboard is its three layers of
information:
1. Monitoring: Graphical, abstracted data to monitor key performance metrics.
2. Analysis: Summarized dimensional data to analyze the root cause of problems.
3. Management: Detailed operational data that identify what actions to take to re-
solve a problem.
Because of these layers, dashboards pack a large amount of information into a sin-
gle screen. According to Few (2005), “The fundamental challenge of dashboard design is
to display all the required information on a single screen, clearly and without distraction,
in a manner that can be assimilated quickly.” To speed assimilation of the numbers, they
need to be placed in context. This can be done by comparing the numbers of interest to
other baseline or target numbers, by indicating whether the numbers are good or bad,
by denoting whether a trend is better or worse, and by using specialized display widgets
or components to set the comparative and evaluative context. Some of the common
comparisons that are typically made in BI systems include comparisons against past val-
ues, forecasted values, targeted values, benchmark or average values, multiple instances
of the same measure, and the values of other measures (e.g., revenues versus costs).
Even with comparative measures, it is important to specifically point out whether a
particular number is good or bad and whether it is trending in the right direction. Without
these types of evaluative designations, it can be time consuming to determine the status
of a particular number or result. Typically, either specialized visual objects (e.g., traffic
lights, dials, and gauges) or visual attributes (e.g., color coding) are used to set the evalu-
ative context. An interactive dashboard-driven reporting data exploration solution built by
an energy company is featured in Application Case 3.8.
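In code terms, setting the evaluative context amounts to comparing each KPI's actual value with a target (or past or benchmark) value and mapping the deviation to a traffic-light status. The following minimal Python sketch illustrates the idea; the KPI names, figures, and thresholds are hypothetical, and commercial BI tools expose this logic through configurable rules rather than code:

def kpi_status(actual, target, higher_is_better=True, warn_band=0.05):
    """Map a KPI's deviation from its target to a traffic-light color."""
    deviation = (actual - target) / target
    if not higher_is_better:      # e.g., expenses: smaller is better
        deviation = -deviation
    if deviation >= 0:
        return "green"
    if deviation >= -warn_band:   # within 5 percent of target: caution
        return "yellow"
    return "red"

# Hypothetical monthly figures: (name, actual, target, higher_is_better)
kpis = [
    ("revenue",          410_000, 400_000, True),
    ("support_expenses",  48_000,  50_000, False),
    ("other_expenses",    61_000,  45_000, False),
]

for name, actual, target, hib in kpis:
    print(f"{name}: {kpi_status(actual, target, hib)} "
          f"(actual {actual:,} vs. target {target:,})")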
Application Case 3.8 Visual Analytics Helps Energy Supplier Make Better Connections
Energy markets all around the world are going
through a significant change and transformation,
creating ample opportunities along with significant
challenges. As is the case in any industry, oppor-
tunities are attracting more players in the market-
place, increasing the competition, and reducing the
tolerances for less-than-optimal business decision
making. Success requires creating and disseminat-
ing accurate and timely information to whoever
needs it, whenever it is needed. For instance, if you need to
easily track marketing budgets, balance employee
workloads, and target customers with tailored mar-
keting messages, you would typically need three different
reporting solutions. Electrabel GDF SUEZ is doing
all of that for its marketing and sales business unit
with the SAS Visual Analytics platform.
The one-solution approach is a great time-saver
for marketing professionals in an industry that is
undergoing tremendous change. “It is a huge chal-
lenge to stabilize our market position in the energy
market. That includes volume, prices, and margins
for both retail and business customers,” notes Danny
Noppe, manager of Reporting Architecture and
Development in the Electrabel Marketing and Sales
business unit. The company is the largest supplier of
electricity in Belgium and the largest producer of elec-
tricity for Belgium and the Netherlands. Noppe says it
is critical that Electrabel increase the efficiency of its
customer communications as it explores new digital
channels and develops new energy-related services.
“The better we know the customer, the bet-
ter our likelihood of success,” he says. “That is why
we combine information from various sources—
phone traffic with the customer, online questions,
text messages, and mail campaigns. This enhanced
knowledge of our customer and prospect base will
be an additional advantage within our competitive
market.”
One Version of the Truth
Electrabel was using various platforms and tools for
reporting purposes. This sometimes led to ambigu-
ity in the reported figures. The utility also had per-
formance issues in processing large data volumes.
SAS Visual Analytics with in-memory technology
removes the ambiguity and the performance issues.
“We have the autonomy and flexibility to respond to
the need for customer insight and data visualization
internally,” Noppe says. “After all, fast reporting is
an essential requirement for action-oriented depart-
ments such as sales and marketing.”
Working More Efficiently at a Lower Cost
SAS Visual Analytics automates the process of
updating information in reports. Instead of building
a report that is out of date by the time it is com-
pleted, the data are refreshed for all the reports once
a week and are available on dashboards. In deploying
the solution, Electrabel chose a phased approach,
starting with simple reports and moving on to more
complex ones. The first report took a few weeks
to build, and the rest came quickly. The successes
include the following:
• Reduction of data preparation from two days
to only two hours.
• Clear graphic insight into the invoicing and
composition of invoices for business-to-busi-
ness (B2B) customers.
• A workload management report by the op-
erational teams. Managers can evaluate team
workloads on a weekly or long-term basis and
can make adjustments accordingly.
“We have significantly improved our effi-
ciency and can deliver quality data and reports
more frequently, and at a significantly lower cost,”
says Noppe. And if the company needs to combine
data from multiple sources, the process is equally
easy. “Building visual reports, based on these data
marts, can be achieved in a few days, or even a few
hours.”
Noppe says the company plans to continue
broadening its insight into the digital behavior of
its customers, combining data from Web analytics,
e-mail, and social media with data from back-end
systems. “Eventually, we want to replace all labor-
intensive reporting with SAS Visual Analytics,” he
says, adding that the flexibility of SAS Visual Analytics
is critical for his department. “This will give us more
time to tackle other challenges. We also want to
make this tool available on our mobile devices. This
will allow our account managers to use up-to-date,
insightful, and adaptable reports when visiting cus-
tomers. We’ve got a future-oriented reporting plat-
form to do all we need.”
Questions for Case 3.8
1. Why do you think energy supply companies are
among the prime users of information visualiza-
tion tools?
2. How did Electrabel use information visualization
for the single version of the truth?
3. What were their challenges, the proposed solu-
tion, and the obtained results?
Source: SAS Customer Story, “Visual Analytics Helps Energy
Supplier Make Better Connections.” http://www.sas.com/
en_us/customers/electrabel-be.html (accessed July 2018).
Copyright © 2018 SAS Institute Inc., Cary, NC, United States.
Reprinted with permission. All rights reserved.
What to Look for in a Dashboard
Although performance dashboards and other information visualization frameworks differ,
they all share some common design characteristics. First, they all fit within the larger BI
and/or performance measurement system. This means that their underlying architecture
is the BI or performance management architecture of the larger system. Second, all well-
designed dashboards and other information visualizations possess the following charac-
teristics (Novell, 2009):
• They use visual components (e.g., charts, performance bars, sparklines, gauges,
meters, stoplights) to highlight, at a glance, the data and exceptions that require
action.
• They are transparent to the user, meaning that they require minimal training and are
extremely easy to use.
• They combine data from a variety of systems into a single, summarized, unified
view of the business.
• They enable drill-down or drill-through to underlying data sources or reports, pro-
viding more detail about the underlying comparative and evaluative context.
• They present a dynamic, real-world view with timely data refreshes, enabling the
end user to stay up-to-date with any recent changes in the business.
• They require little, if any, customized coding to implement, deploy, and maintain.
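As a small illustration of the "combine data from a variety of systems into a single, summarized, unified view" characteristic, the following Python sketch merges two hypothetical source extracts (a CRM table and a billing table; the names and figures are assumptions made for the example) on a shared key and summarizes the result the way a dashboard widget would:

import pandas as pd

# Hypothetical extracts from two separate source systems
crm = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "segment": ["B2B", "B2C", "B2B"],
})
billing = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "invoice_amount": [500, 700, 120, 900, 300],
})

# Join the sources on the shared key, then summarize for the unified view
unified = billing.merge(crm, on="customer_id", how="left")
summary = unified.groupby("segment")["invoice_amount"].agg(total="sum", average="mean")
print(summary)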
Best Practices in Dashboard Design
The real estate saying “location, location, location” makes it obvious that the most im-
portant attribute of a property is where it is located. For dashboards,
it is “data, data, data.” Often overlooked, data are considered one of the most important
things to focus on in designing dashboards (Carotenuto, 2007). Even if a dashboard’s ap-
pearance looks professional, is aesthetically pleasing, and includes graphs and tables cre-
ated according to accepted visual design standards, it is also important to ask about the
data: Are they reliable? Are they timely? Are any data missing? Are they consistent across
all dashboards? Here are some of the experience-driven best practices in dashboard de-
sign (Radha, 2008).
Benchmark Key Performance Indicators with Industry Standards
Many customers, at some point in time, want to know if the metrics they are measuring
are the right metrics to monitor. Sometimes customers have found that the metrics they
are tracking are not the right ones to track. Doing a gap assessment with industry bench-
marks aligns you with industry best practices.
Wrap the Dashboard Metrics with Contextual Metadata
Often when a report or a visual dashboard/scorecard is presented to business users,
questions remain unanswered. The following are some examples:
• Where did you source these data?
• While loading the data warehouse, what percentage of the data was rejected/
encountered data quality problems?
• Is the dashboard presenting “fresh” information or “stale” information?
• When was the data warehouse last refreshed?
• When is it going to be refreshed next?
• Were any high-value transactions that would skew the overall trends rejected as a
part of the loading process?
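One lightweight way to keep such contextual metadata from getting lost is to carry the answers to these questions alongside the metric itself. The sketch below is only an illustration of that idea; the field names are assumptions rather than a standard schema:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class DashboardMetric:
    """A KPI value packaged with the contextual metadata business users ask about."""
    name: str
    value: float
    source_system: str
    last_refreshed: datetime
    next_refresh: datetime
    pct_rows_rejected: float  # share of source rows rejected during loading
    notes: str = ""

metric = DashboardMetric(
    name="monthly_revenue",
    value=1_250_000.0,
    source_system="sales_dw",                   # hypothetical warehouse name
    last_refreshed=datetime(2024, 5, 1, 6, 0),
    next_refresh=datetime(2024, 5, 8, 6, 0),
    pct_rows_rejected=0.8,
    notes="Two high-value transactions were quarantined during the load.",
)
print(metric)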
Validate the Dashboard Design by a Usability Specialist
In most dashboard environments, the dashboard is designed by a tool specialist without
giving consideration to usability principles. Even when the underlying data warehouse is
well engineered and performs well, many business users do not use the dashboard because
it is perceived as not being user friendly, leading to poor adoption of the infrastructure
and change management issues. Up-front validation of the dashboard design by a usabil-
ity specialist can mitigate this risk.
Prioritize and Rank Alerts/Exceptions Streamed to the Dashboard
Because there are tons of raw data, having a mechanism by which important exceptions/
behaviors are proactively pushed to the information consumers is essential. A business
rule can be codified, which detects the alert pattern of interest. It can be coded into a pro-
gram, using database-stored procedures, which can crawl through the fact tables and detect
patterns that need immediate attention. This way, information finds the business user as
opposed to the business user polling the fact tables for the occurrence of critical patterns.
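Although the chapter mentions database-stored procedures, the same rule-codification idea can be sketched in a few lines of Python; the fact table, threshold, and severity measure below are hypothetical:

import pandas as pd

# Hypothetical slice of a daily sales fact table
fact_sales = pd.DataFrame({
    "store_id":     [101, 102, 103, 104],
    "daily_sales":  [5200, 1100, 4800, 300],
    "daily_target": [5000, 4500, 5000, 4000],
})

# Codified business rule: alert when sales fall below 50 percent of target
ALERT_THRESHOLD = 0.50  # illustrative threshold

alerts = fact_sales[
    fact_sales["daily_sales"] < ALERT_THRESHOLD * fact_sales["daily_target"]
].copy()

# Rank exceptions by severity so the most urgent ones surface first
alerts["shortfall_pct"] = 1 - alerts["daily_sales"] / alerts["daily_target"]
alerts = alerts.sort_values("shortfall_pct", ascending=False)

# In a real deployment these rows would be pushed to the dashboard or to the
# responsible users (e.g., by e-mail) rather than simply printed
print(alerts[["store_id", "shortfall_pct"]])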
Enrich the Dashboard with Business-User Comments
When the same dashboard information is presented to multiple business users, a small
text box can be provided that can capture the comments from an end user’s perspective.
This can often be tagged to the dashboard to put the information in context, adding per-
spective to the structured KPIs being rendered.
Present Information in Three Different Levels
Information can be presented in three layers depending on the granularity of the infor-
mation: the visual dashboard level, the static report level, and the self-service cube level.
When a user navigates the dashboard, a simple set of 8 to 12 KPIs can be presented,
which would give a sense of what is going well and what is not.
Pick the Right Visual Construct Using Dashboard Design Principles
In presenting information in a dashboard, some information is presented best with
bar charts and some with time-series line graphs, and when presenting correlations, a
scatter plot is useful. Sometimes merely rendering the information as a simple table is effective. Once
the dashboard design principles are explicitly documented, all the developers working
on the front end can adhere to the same principles while rendering the reports and
dashboard.
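Such documented principles can even be captured as a small shared helper so that every developer applies the same defaults. The sketch below is illustrative only: the selection rules, the sample figures, and the use of the matplotlib library are all assumptions made for the example.

import matplotlib.pyplot as plt

# Illustrative encoding of documented chart-selection principles
def pick_chart(purpose):
    rules = {
        "compare categories": "bar chart",
        "show a trend over time": "time-series line graph",
        "show a correlation": "scatter plot",
        "show exact values": "simple table",
    }
    return rules.get(purpose, "start with a simple table")

print(pick_chart("show a correlation"))  # -> scatter plot

# Following the rule for a trend: render a time-series line graph
months = ["Jan", "Feb", "Mar", "Apr", "May"]
revenue = [100, 110, 108, 125, 140]      # hypothetical figures, in $K
plt.plot(months, revenue, marker="o")
plt.title("Monthly Revenue (illustrative)")
plt.ylabel("Revenue ($K)")
plt.show()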
Provide for Guided Analytics
In a typical organization, business users can be at various levels of analytical maturity.
The capability of the dashboard can be used to guide the “average” business user to
follow the same navigational path as that of an analytically savvy business user.
SECTION 3.11 REVIEW QUESTIONS
1. What is an information dashboard? Why is it so popular?
2. What are the graphical widgets commonly used in dashboards? Why?
3. List and describe the three layers of information portrayed on dashboards.
4. What are the common characteristics of dashboards and other information visuals?
5. What are the best practices in dashboard design?
Chapter Highlights
• Data have become one of the most valuable
assets of today’s organizations.
• Data are the main ingredient for any BI, data
science, and business analytics initiative.
• Although its value proposition is undeniable, to
live up to its promise, the data must comply with
some basic usability and quality metrics.
• The term data (datum in singular form) refers
to a collection of facts usually obtained as the
result of experiments, observations, transactions,
or experiences.
• At the highest level of abstraction, data can be
classified as structured and unstructured.
• Data in original/raw form are not usually ready to
be useful in analytics tasks.
• Data preprocessing is a tedious, time-demanding,
yet crucial task in business analytics.
• Statistics is a collection of mathematical tech-
niques to characterize and interpret data.
• Statistical methods can be classified as either de-
scriptive or inferential.
• Statistics in general, as well as descriptive statistics
in particular, is a critical part of BI and business
analytics.
• Descriptive statistics methods can be used to mea-
sure central tendency, dispersion, or the shape of
a given data set.
• Regression, especially linear regression, is per-
haps the most widely known and used analytics
technique in statistics.
• Linear regression and logistic regression are the
two major regression types in statistics.
• Logistic regression is a probability-based classifi-
cation algorithm.
• Time series is a sequence of data points of a vari-
able, measured and recorded at successive points
in time spaced at uniform time intervals.
• A report is any communication artifact prepared
with the specific intention of conveying informa-
tion in a presentable form.
• A business report is a written document that con-
tains information regarding business matters.
• The key to any successful business report is clar-
ity, brevity, completeness, and correctness.
• Data visualization is the use of visual representations
to explore, make sense of, and communicate data.
• Perhaps the most notable information graphic
of the past was developed by Charles J. Minard,
who graphically portrayed the losses suffered by
Napoleon’s army in the Russian campaign of 1812.
• Basic chart types include line, bar, and pie chart.
• Specialized charts are often derived from the
basic charts as exceptional cases.
• Data visualization techniques and tools make the
users of business analytics and BI systems better
information consumers.
• Visual analytics is the combination of visualiza-
tion and predictive analytics.
• Increasing demand for visual analytics coupled
with fast-growing data volumes has led to exponen-
tial growth in investment in highly efficient visu-
alization systems.
• Dashboards provide visual displays of important
information that is consolidated and arranged
on a single screen so that information can be di-
gested at a single glance and easily drilled in and
further explored.
Key terms
analytics ready
arithmetic mean
box-and-whiskers plot
box plot
bubble chart
business report
categorical data
centrality
correlation
dashboards
data preprocessing
data quality
data security
data taxonomy
data visualization
datum
descriptive statistics
dimensional reduction
dispersion
high-performance computing
histogram
inferential statistics
key performance indicator
(KPI)
knowledge
kurtosis
learning
linear regression
logistic regression
mean absolute deviation
median
mode
nominal data
online analytics processing
(OLAP)
ordinal data
ordinary least squares (OLS)
pie chart
quartile
range
ratio data
regression
report
scatter plot
skewness
standard deviation
statistics
storytelling
structured data
time-series forecasting
unstructured data
variable selection
variance
visual analytics
Questions for Discussion
1. How do you describe the importance of data in analyt-
ics? Can we think of analytics without data? Explain.
2. Considering the new and broad definition of business
analytics, what are the main inputs and outputs to the
analytics continuum?
3. Where do the data for business analytics come from?
What are the sources and the nature of those incoming
data?
4. What are the most common metrics that make for
analytics-ready data?
5. What are the main categories of data? What types of
data can we use for BI and analytics?
6. Can we use the same data representation for all
analytics models (i.e., do different analytics models
require different data representation schema)? Why,
or why not?
7. Why are the original/raw data not readily usable by ana-
lytics tasks?
8. What are the main data preprocessing steps? List and
explain their importance in analytics.
9. What does it mean to clean/scrub the data? What activi-
ties are performed in this phase?
10. Data reduction can be applied to rows (sampling) and/
or columns (variable selection). Which is more chal-
lenging? Explain.
11. What is the relationship between statistics and business
analytics? (Consider the placement of statistics in a busi-
ness analytics taxonomy.)
12. What are the main differences between descriptive and
inferential statistics?
13. What is a box-and-whiskers plot? What types of statisti-
cal information does it represent?
14. What are the two most commonly used shape character-
istics to describe a data distribution?
15. List and briefly define the central tendency measures of
descriptive statistics.
16. What are the commonalities and differences between
regression and correlation?
17. List and describe the main steps to follow in developing
a linear regression model.
18. What are the most commonly pronounced assumptions
for linear regression? Why is it crucial to check regression
models against these assumptions?
19. What are the commonalities and differences between
linear regression and logistic regression?
20. What is time series? What are the main forecasting tech-
niques for time-series data?
21. What is a business report? Why is it needed?
22. What are the best practices in business reporting? How
can we make our reports stand out?
23. Describe the cyclic process of management, and com-
ment on the role of business reports.
24. List and describe the three major categories of business
reports.
25. Why has information visualization become a center-
piece in BI and business analytics? Is there a difference
between information visualization and visual analytics?
26. What are the main types of charts/graphs? Why are
there so many of them?
27. How do you determine the right chart for a job? Explain
and defend your reasoning.
28. What is the difference between information visualiza-
tion and visual analytics?
29. Why should storytelling be a part of your reporting and
data visualization?
30. What is an information dashboard? What does it present?
31. What are the best practices in designing highly informa-
tive dashboards?
32. Do you think information/performance dashboards are
here to stay? Or are they about to be outdated? What do
you think will be the next big wave in BI and business
analytics in terms of data/information visualization?
Exercises
Teradata University and Other Hands-on
Exercises
1. Download the “Voting Behavior” data and the brief
data description from the book’s Web site. This is a
data set manually compiled from counties all around
the United States. The data are partially processed,
that is, some derived variables have been created.
Your task is to thoroughly preprocess the data by
identifying the errors and anomalies and proposing
remedies and solutions. At the end, you should have
an analytics-ready version of these data. Once the pre-
processing is completed, pull these data into Tableau
(or into some other data visualization software tool)
to extract useful visual information from it. To do so,
conceptualize relevant questions and hypotheses
(come up with at least three of them) and create
proper visualizations that address those questions or
“tests” of those hypotheses.
2. Download Tableau (at tableau.com, following aca-
demic free software download instructions on the site).
Using the Visualization_MFG_Sample data set (available
as an Excel file on this book’s Web site), answer the fol-
lowing questions:
a. What is the relationship between gross box office
revenue and other movie-related parameters given
in the data set?
b. How does this relationship vary across different years?
Prepare a professional-looking written report that is
enhanced with screenshots of your graphic findings.
3. Go to teradatauniversitynetwork.com. Look for an
article that deals with the nature of data, management of
data, and/or governance of data as it relates to BI and
analytics, and critically analyze the content of the article.
4. Go to UCI data repository (archive.ics.uci.edu/ml/
datasets.html) and identify a large data set that con-
tains both numeric and nominal values. Using Microsoft
Excel or any other statistical software:
a. Calculate and interpret central tendency measures
for each and every variable.
b. Calculate and interpret the dispersion/spread mea-
sures for each and every variable.
5. Go to UCI data repository (archive.ics.uci.edu/ml/
datasets.html) and identify two data sets, one for
estimation/regression and one for classification. Using
Microsoft Excel or any other statistical software:
a. Develop and interpret a linear regression model.
b. Develop and interpret a logistic regression model.
6. Go to KDnuggets.com and become familiar with the
range of analytics resources available on this portal.
Then identify an article, a white paper, or an interview
script that deals with the nature of data, management of
data, and/or governance of data as they relate to BI and
business analytics, and critically analyze the content of
the article.
7. Go to Stephen Few’s blog, “The Perceptual Edge”
(perceptualedge.com). Go to the section of
“Examples.” In this section, he provides critiques of
various dashboard examples. Read a handful of these
examples. Now go to dundas.com. Select the “Gallery”
section of the site. Once there, click the “Digital
Dashboard” selection. You will be shown a variety of
different dashboard demos. Run a couple of them.
a. What types of information and metrics are shown on
the demos? What types of actions can you take?
b. Using some of the basic concepts from Few’s cri-
tiques, describe some of the good design points and
bad design points of the demos.
8. Download an information visualization tool, such as
Tableau, QlikView, or Spotfire. If your school does not
have an educational agreement with these companies,
a trial version would be sufficient for this exercise. Use
your own data (if you have any) or use one of the data
sets that comes with the tool (such tools usually have
one or more data sets for demonstration purposes).
Study the data, come up with several business prob-
lems, and use data visualization to analyze, visualize,
and potentially solve those problems.
9. Go to teradatauniversitynetwork.com. Find the
“Tableau Software Project.” Read the description, exe-
cute the tasks, and answer the questions.
10. Go to teradatauniversitynetwork.com. Find the
assignments for SAS Visual Analytics. Using the infor-
mation and step-by-step instructions provided in the
assignment, execute the analysis on the SAS Visual
Analytics tool (which is a Web-enabled system that does
not require any local installation). Answer the questions
posed in the assignment.
11. Find at least two articles (one journal article and one
white paper) that talk about storytelling, especially
within the context of analytics (i.e., data-driven storytell-
ing). Read and critically analyze the article and paper,
and write a report to reflect your understanding and
opinions about the importance of storytelling in BI and
business analytics.
12. Go to data.gov—a U.S. government–sponsored data
portal that has a very large number of data sets on a
wide variety of topics ranging from healthcare to edu-
cation, climate to public safety. Pick a topic that you
are most passionate about. Go through the topic-specific
information and explanation provided on the site.
Explore the possibilities of downloading the data, and
use your favorite data visualization tool to create your
own meaningful information and visualizations.
Team Assignments and Role-Playing Projects
1. Analytics starts with data. Identifying, accessing, obtain-
ing, and processing of relevant data is the most essential
task in any analytics study. As a team, you are tasked
to find a large enough real-world data set (either from your
own organization, which is the most preferred, or from
the Internet that can start with a simple search, or from
the data links posted on KDnuggets.com), one that
has tens of thousands of rows and more than 20 vari-
ables to go through, and document a thorough data
preprocessing project. In your processing of the data,
identify anomalies and discrepancies using descriptive
statistics methods and measures, and make the data an-
alytics ready. List and justify your preprocessing steps
and decisions in a comprehensive report.
2. Go to a well-known information dashboard provider
Web site (dundas.com, idashboards.com, enterprise-
dashboard.com). These sites provide a number of exam-
ples of executive dashboards. As a team, select a particular
industry (e.g., healthcare, banking, airline). Locate a hand-
ful of example dashboards for that industry. Describe the
types of metrics found on the dashboards. What types of
displays are used to provide the information? Using what
you know about dashboard design, provide a paper pro-
totype of a dashboard for this information.
3. Go to teradatauniversitynetwork.com. From there,
go to University of Arkansas data sources. Choose one
of the large data sets, and download a large number of
records (this could require you to write an SQL state-
ment that creates the variables that you want to include
in the data set). Come up with at least 10 questions that
can be addressed with information visualization. Using
your favorite data visualization tool (e.g., Tableau), ana-
lyze the data, and prepare a detailed report that includes
screenshots and other visuals.
References
Abela, A. (2008). Advanced Presentations by Design: Creating
Communication That Drives Action. New York, NY: Wiley.
Annas, G. (2003). “HIPAA Regulations—A New Era of
Medical-Record Privacy?” New England Journal of
Medicine, 348(15), 1486–1490.
Ante, S., & J. McGregor. (2006). “Giving the Boss the Big Pic-
ture: A Dashboard Pulls Up Everything the CEO Needs to
Run the Show.” Business Week, 43–51.
Carotenuto, D. (2007). “Business Intelligence Best Practices for Dashboard Design.” WebFOCUS. www.datawarehouse.inf.br/papers/information_builders_dashboard_best_practices (accessed August 2016).
Dell Customer Case Study. “Medical Device Company Ensures Product Quality While Saving Hundreds of Thousands of Dollars.” https://software.dell.com/documents/instrumentation-laboratory-medical-device-companyensures-product-quality-while-saving-hundreds-ofthousands-of-dollars-case-study-80048 (accessed August 2016).
Delen, D. (2010). “A Comparative Analysis of Machine Learn-
ing Techniques for Student Retention Management.” Deci-
sion Support Systems, 49(4), 498–506.
Delen, D. (2011). “Predicting Student Attrition with Data Min-
ing Methods.” Journal of College Student Retention 13(1),
17–35.
Delen, D. (2015). Real-World Data Mining: Applied Business
Analytics and Decision Making. Upper Saddle River, NJ:
Financial Times Press (A Pearson Company).
Delen, D., D. Cogdell, & N. Kasap. (2012). “A Comparative
Analysis of Data Mining Methods in Predicting NCAA
Bowl Outcomes.” International Journal of Forecasting, 28,
543–552.
Eckerson, W. (2006). Performance Dashboards. New York:
Wiley.
Few, S. (2005, Winter). “Dashboard Design: Beyond Meters,
Gauges, and Traffic Lights.” Business Intelligence Journal,
10(1).
Few, S. (2007). “Data Visualization: Past, Present and Future.”
Perceptualedge.com/articles/Whitepapers/Data_
Visualization (accessed July 2016).
Fink, E., & S. J. Moore. (2012). “Five Best Practices for Telling Great Stories with Data.” Tableau Software, Inc. www.tableau.com/whitepapers/telling-data-stories (accessed May 2016).
Freeman, K., & R. M. Brewer. (2016). “The Politics of Ameri-
can College Football.” Journal of Applied Business and
Economics, 18(2), 97–101.
Gartner Magic Quadrant. (2016, February 4). gartner.com
(accessed August 2016).
Grimes, S. (2009a, May 2). “Seeing Connections: Visualizations Makes Sense of Data.” Intelligent Enterprise. i.cmpnet.com/intelligententerprise/next-era-business-intelligence/Intelligent_Enterprise_Next_Era_BI_Visualization (accessed January 2010).
Grimes, S. (2009b). “Text Analytics 2009: User Perspectives on Solutions and Providers.” Alta Plana. altaplana.com/TextAnalyticsPerspectives2009 (accessed July 2016).
Hardin, M., D. Hom, R. Perez, & L. Williams. (2012). “Which Chart or Graph Is Right for You?” Tableau Software. http://www.tableau.com/sites/default/files/media/which_chart_v6_final_0 (accessed August 2016).
Hernández, M., & S. J. Stolfo. (1998, January). “Real-World
Data Is Dirty: Data Cleansing and the Merge/Purge
Problem.” Data Mining and Knowledge Discovery, 2(1),
9–37.
Hill, G. (2016). “A Guide to Enterprise Reporting.” Ghill.
customer.netspace.net.au/reporting/definition.html
(accessed July 2016).
Kim, W., B. J. Choi, E. K. Hong, S. K. Kim, & D. Lee. (2003).
“A Taxonomy of Dirty Data.” Data Mining and Knowledge
Discovery, 7(1), 81–99.
Kock, N. F., R. J. McQueen, & J. L. Corner. (1997). “The Na-
ture of Data, Information and Knowledge Exchanges in
Business Processes: Implications for Process Improvement
and Organizational Learning.” The Learning Organization,
4(2), 70–80.
Kotsiantis, S., D. Kanellopoulos, & P. E. Pintelas. (2006). “Data
Preprocessing for Supervised Leaning.” International Jour-
nal of Computer Science, 1(2), 111–117.
Lai, E. (2009, October 8). “BI Visualization Tool Helps
Dallas Cowboys Sell More Tony Romo Jerseys.” Com-
puterWorld.
Quinn, C. (2016). “Data-Driven Marketing at SiriusXM.” Teradata Articles & News. http://bigdata.teradata.com/US/Articles-News/Data-Driven-Marketing-At-SiriusXM/ (accessed August 2016); “SiriusXM Attracts and Engages a New Generation of Radio Consumers.” http://assets.teradata.com/resourceCenter/downloads/CaseStudies/EB8597?processed=1 (accessed August 2018).
Novell. (2009, April). “Executive Dashboards Elements of Success.” Novell white paper. www.novell.com/docrep/documents/3rkw3etfc3/Executive%20Dashboards_Elements_of_Success_White_Paper_en (accessed June 2016).
Radha, R. (2008). “Eight Best Practices in Dashboard Design.”
Information Management. www.information-
management.com/news/columns/-10001129-1.html
(accessed July 2016).
SAS. (2014). “Data Visualization Techniques: From Basics
to Big Data.” http://www.sas.com/content/dam/
SAS/en_us/doc/whitepaper1/data-visualization-
techniques-106006 (accessed July 2016).
Thammasiri, D., D. Delen, P. Meesad, & N. Kasap. (2014). “A
Critical Assessment of Imbalanced Class Distribution Prob-
lem: The Case of Predicting Freshmen Student Attrition.”
Expert Systems with Applications, 41(2), 321–330.
PART II
Predictive Analytics/Machine Learning
CHAPTER 4
Data Mining Process, Methods, and Algorithms
LEARNING OBJECTIVES
■■ Define data mining as an enabling technology for
business analytics
■■ Understand the objectives and benefits of data
mining
■■ Become familiar with the wide range of applications
of data mining
■■ Learn the standardized data mining processes
■■ Learn different methods and algorithms of
data mining
■■ Build awareness of existing data mining
software tools
■■ Understand the privacy issues, pitfalls, and myths
of data mining
Generally speaking, data mining is a way to develop intelligence (i.e., actionable information or knowledge) from data that an organization collects, organizes, and stores. A wide range of data mining techniques is being used by organiza-
tions to gain a better understanding of their customers and their operations and to solve
complex organizational problems. In this chapter, we study data mining as an enabling
technology for business analytics and predictive analytics; learn about the standard pro-
cesses of conducting data mining projects; understand and build expertise in the use of
major data mining techniques; develop awareness of the existing software tools; and
explore privacy issues, common myths, and pitfalls that are often associated with data
mining.
4.1 Opening Vignette: Miami-Dade Police Department Is Using Predictive Analytics
to Foresee and Fight Crime
4.2 Data Mining Concepts
4.3 Data Mining Applications
4.4 Data Mining Process
4.5 Data Mining Methods
4.6 Data Mining Software Tools
4.7 Data Mining Privacy Issues, Myths, and Blunders
4.1 OPENING VIGNETTE: Miami-Dade Police Department Is
Using Predictive Analytics to Foresee and Fight Crime
Predictive analytics and data mining have become an integral part of many law enforce-
ment agencies including the Miami-Dade Police Department whose mission is not only
to protect the safety of Florida’s largest county, with 2.5 million citizens (making it the
seventh largest in the United States), but also to provide a safe and inviting climate for
the millions of tourists who come from around the world to enjoy the county’s natural
beauty, warm climate, and stunning beaches. With tourists spending nearly US$20 billion
every year and generating nearly one-third of Florida’s sales taxes, it is hard to overstate
the importance of tourism to the region’s economy. So although few of the county’s
police officers would likely list economic development in their job description, nearly all
grasp the vital link between safe streets and the region’s tourist-driven prosperity.
That connection is paramount for Lieutenant Arnold Palmer, currently supervising
the Robbery Investigations Section and a former supervisor of the department’s Robbery
Intervention Detail. This specialized team of detectives is focused on intensely policing
the county’s robbery hot spots and worst repeat offenders. He and the team occupy mod-
est offices on the second floor of a modern-looking concrete building set back from a
palm-lined street on the western edge of Miami. In his 10 years in the unit and 23 in total
on the force, Palmer has seen many changes—not just in policing practices like the way
his team used to mark street crime hot spots with colored pushpins on a map.
POLICING WITH LESS
Palmer and the team have also seen the impact of a growing population, shifting demo-
graphics, and a changing economy on the streets they patrol. Like any good police force,
they have continually adapted their methods and practices to meet a policing
challenge that has grown in scope and complexity. But like nearly all branches of the
county’s government, intensifying budget pressures have placed the department in a
squeeze between rising demands and shrinking resources.
Palmer, who sees detectives as front-line fighters against a rising tide of street crime
and the looming prospect of ever-tightening resources, put it this way: “Our basic chal-
lenge was how to cut street crime even as tighter resources have reduced the number of
cops on the street.” Over the years, the team has been open to trying new tools, the most
notable of which is a program called “analysis-driven enforcement,” which used crime
history data as the basis for positioning teams of detectives. “We’ve evolved a lot since
then in our ability to predict where robberies are likely to occur, both through the use of
analysis and our own collective experience.”
NEW THINKING ON COLD CASES
The more confounding challenge for Palmer and his team of investigators, one shared
with the police of all major urban areas, is in closing the hardest cases whose leads, wit-
nesses, video—any facts or evidence that can help solve a case—are lacking. This is not
surprising, explains Palmer, because “the standard practices we used to generate leads,
like talking to informants or to the community or to patrol officers, haven’t changed
much, if at all. That kind of an approach works okay, but it relies a lot on the experience
our detectives carry in their head. When the detectives retire or move on, that experience
goes with them.”
Palmer’s conundrum was that turnover resulting from the retirement of many of his
most experienced detectives was on an upward trend. True, he saw the infusion of young
blood as an inherently good thing, especially given this group’s increased comfort with
the new types of information—from e-mails, social media, and traffic cameras, to name
a few—to which his team had access. But as Palmer recounts, the problem came when
the handful of new detectives coming into the unit turned to look for guidance from the
senior officers “and it’s just not there. We knew at that point we needed a different way
to fill the experience gap going forward.”
His ad hoc efforts to come up with a solution led to blue-sky speculation. What if
new detectives on the squad could pose the same questions to a computer database as
they would to a veteran detective? That speculation planted a seed in Palmer’s mind that
wouldn’t go away.
THE BIG PICTURE STARTS SMALL
What was taking shape within the robbery unit demonstrated how big ideas can come
from small places. But more importantly, it showed that for these ideas to reach frui-
tion, the “right” conditions need to be in alignment at the right time. On a leadership
level, this means a driving figure in the organization who knows what it takes to nurture
top-down support as well as crucial bottom-up buy-in from the ranks while keeping
the department’s information technology (IT) personnel on the same page. That person
was Palmer. At the organizational level, the robbery unit served as a particularly good
launching point for lead modeling because of the prevalence of repeat offenders among
perpetrators. Ultimately, the department’s ability to unleash the broader transformative
potential of lead modeling would hinge in large part on the team’s ability to deliver
results on a small scale.
When early tests and demos proved encouraging—with the model yielding accu-
rate results when the details of solved cases were fed into it—the team started gaining
attention. The initiative received a critical boost when the robbery bureau’s unit major
and captain voiced their support for the direction of the project, telling Palmer that “if
you can make this work, run with it.” But more important than the encouragement,
Palmer explains, was their willingness to advocate for the project among the department’s
higher-ups. “I can’t get it off the ground if the brass doesn’t buy in,” says Palmer. “So their
support was crucial.”
SUCCESS BRINGS CREDIBILITY
Having been appointed the official liaison between IT and the robbery unit, Palmer set
out to strengthen the case for the lead modeling tool—now officially called Blue PALMS
(for Predictive Analytics Lead Modeling Software)—by building a series of successes. His
constituency was not only the department brass but also the detectives whose support
would be critical to its successful adoption as a robbery-solving tool. In his attempts
to introduce Blue PALMS, resistance was predictably stronger among veteran detectives
who saw no reason to give up their long-standing practices. Palmer knew that dictates or
coercion would not win their hearts and minds. He would need to build a beachhead of
credibility.
Palmer found that opportunity in one of his best and most experienced detectives.
Early in a robbery investigation, the detective indicated to Palmer that he had a strong
hunch who the perpetrator was and wanted, in essence, to test the Blue PALMS system.
At the detective’s request, the department analyst fed key details of the crime into the sys-
tem, including the modus operandi (MO). The system’s statistical models compared these
details to a database of historical data, looking for important correlations and similarities
in the crime’s signature. The report that came out of the process included a list of 20
suspects ranked in order of match strength, or likelihood. When the analyst handed the
detective the report, his “hunch” suspect was listed in the top five. Soon after his arrest,
the suspect confessed, and Palmer had gained a solid convert.
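The case does not disclose how Blue PALMS is implemented. Purely as an illustration of the general idea described above (comparing a new crime's recorded attributes against a database of past cases and ranking candidates by match strength), the following Python sketch one-hot encodes a few modus operandi attributes and ranks offenders by cosine similarity. All names, attributes, and figures are hypothetical, and this is only one of many ways such a ranking could be produced:

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical offender history: one row per known offender, with categorical
# modus operandi (MO) attributes
history = pd.DataFrame({
    "offender": ["A", "B", "C", "D"],
    "weapon":   ["handgun", "knife", "handgun", "none"],
    "approach": ["blitz", "con", "blitz", "con"],
    "target":   ["store", "pedestrian", "pedestrian", "store"],
})

# Recorded details of the new, unsolved case
new_case = pd.DataFrame({
    "weapon": ["handgun"], "approach": ["blitz"], "target": ["pedestrian"],
})

features = ["weapon", "approach", "target"]
# One-hot encode history and the new case together so their vectors share columns
encoded = pd.get_dummies(pd.concat([history[features], new_case], ignore_index=True))
history_vectors, case_vector = encoded.iloc[:-1], encoded.iloc[[-1]]

# Rank known offenders by how closely their MO signature matches the new case
history["match_strength"] = cosine_similarity(history_vectors, case_vector).ravel()
print(history.sort_values("match_strength", ascending=False)[["offender", "match_strength"]])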
Although it was a useful exercise, Palmer realized that the true test was not in con-
firming hunches but in breaking cases that had come to a dead end. Such was the situ-
ation in a carjacking that had, in Palmer’s words, “no witnesses, no video, and no crime
scene—nothing to go on.” When the senior detective on the stalled case went on leave
after three months, the junior detective to whom it was assigned requested a Blue PALMS
report. Shown photographs of the top people on the suspect list, the victim made a posi-
tive identification of the suspect leading to the successful conclusion of the case. That
suspect was number one on the list.
JUST THE FACTS
The success that Blue PALMS continues to build has been a major factor in Palmer’s get-
ting his detectives on board. But if there is a part of his message that resonates even more
with his detectives, it is the fact that Blue PALMS is designed not to change the basics of
policing practices but to enhance them by giving them a second chance of cracking the
case. “Police work is at the core about human relations—about talking to witnesses, to
victims, to the community—and we’re not out to change that,” says Palmer. “Our aim is
to give investigators factual insights from information we already have that might make a
difference, so even if we’re successful 5 percent of the time, we’re going to take a lot of
offenders off the street.”
The growing list of cold cases solved has helped Palmer in his efforts to reinforce the
merits of Blue PALMS. But, in showing where his loyalty lies, he sees the detectives who
have closed these cold cases—not the program—as most deserving of the spotlight, and
that approach has gone over well. At his chief’s request, Palmer is beginning to use his liai-
son role as a platform for reaching out to other areas in the Miami-Dade Police Department.
SAFER STREETS FOR A SMARTER CITY
When he speaks of the impact of tourism, a thread that runs through Miami-Dade’s
Smarter Cities vision, Palmer sees Blue PALMS as an important tool to protect one of the
county’s greatest assets. “The threat to tourism posed by rising street crime was a big rea-
son the unit was established,” says Palmer. “The fact that we’re able to use analytics and
intelligence to help us close more cases and keep more criminals off the street is good
news for our citizens and our tourist industry.”
QUESTIONS FOR THE OPENING VIGNETTE
1. Why do law enforcement agencies and departments like the Miami-Dade Police
Department embrace advanced analytics and data mining?
2. What are the top challenges for law enforcement agencies and departments like the
Miami-Dade Police Department? Can you think of other challenges (not mentioned
in this case) that can benefit from data mining?
3. What are the sources of data that law enforcement agencies and departments like
the Miami-Dade Police Department use for their predictive modeling and data
mining projects?
4. What type of analytics do law enforcement agencies and departments like the
Miami-Dade Police Department use to fight crime?
5. What does “the big picture starts small” mean in this case? Explain.
WHAT WE CAN LEARN FROM THIS VIGNETTE
Law enforcement agencies and departments are under tremendous pressure to carry out
their mission of safeguarding people with limited resources. The environment within
which they perform their duties is becoming increasingly more challenging so that they
have to constantly adapt and perhaps stay a few steps ahead to reduce the likelihood
of catastrophes. Understanding the changing nature of crime and criminals is an ongo-
ing challenge. In the midst of these challenges, what works in favor of these agencies
is the availability of the data and analytics technologies to better analyze past occur-
rences and to foresee future events. More data are available now than ever before. Applying advanced analytics and data mining tools (i.e., knowledge discovery
techniques) to these large and rich data sources provides them with the insight that they
need to better prepare and act on their duties. Therefore, law enforcement agencies are
becoming one of the leading users of the new face of analytics. Data mining is a prime
candidate for better understanding and management of these mission-critical tasks with
a high level of accuracy and timeliness. The study described here clearly illustrates the
power of analytics and data mining to create a holistic view of the world of crime and
criminals for better and faster reaction and management. In this chapter, you will see a
wide variety of data mining applications solving complex problems in a variety of indus-
tries and organizational settings where the data are used to discover actionable insight to
improve mission readiness, operational efficiency, and competitive advantage.
Sources: “Miami-Dade Police Department: Predictive Modeling Pinpoints Likely Suspects Based on Common Crime Signatures of Previous Crimes,” IBM Customer Case Studies, www-03.ibm.com/software/businesscasestudies/om/en/corp?synkey=C894638H25952N07; “Law Enforcement Analytics: Intelligence-Led and Predictive Policing,” Information Builders, www.informationbuilders.com/solutions/gov-lea.
4.2 DATA MINING CONCEPTS
Data mining, a relatively new and exciting technology, has become a common practice for
a vast majority of organizations. In an interview with Computerworld magazine in January
1999, Dr. Arno Penzias (Nobel laureate and former chief scientist of Bell Labs) identified
data mining from organizational databases as a key application for corporations of the near
future. In response to Computerworld’s age-old question of “What will be the killer applica-
tions in the corporation?” Dr. Penzias replied, “Data mining.” He then added, “Data mining
will become much more important and companies will throw away nothing about their
customers because it will be so valuable. If you’re not doing this, you’re out of business.”
Similarly, in an article in Harvard Business Review, Thomas Davenport (2006) argued that
the latest strategic weapon for companies is analytical decision making, providing examples
of companies such as Amazon.com, Capital One, Marriott International, and others that
have used analytics to better understand their customers and optimize their extended sup-
ply chains to maximize their returns on investment while providing the best customer ser-
vice. This level of success is highly dependent on a company’s thorough understanding of
its customers, vendors, business processes, and the extended supply chain.
A large portion of “understanding the customer” can come from analyzing the vast
amount of data that a company collects. The cost of storing and processing data has decreased
dramatically in the recent past, and, as a result, the amount of data stored in electronic form
has grown at an explosive rate. With the creation of large databases, the possibility of ana-
lyzing the data stored in them has emerged. The term data mining was originally used to
describe the process through which previously unknown patterns in data were discovered.
This definition has since been stretched beyond those limits by some software vendors to
include most forms of data analysis in order to boost sales by capitalizing on the popularity of the data
mining label. In this chapter, we accept the original definition of data mining.
Although the term data mining is relatively new, the ideas behind it are not. Many
of the techniques used in data mining have their roots in traditional statistical analysis
and artificial intelligence work done since the early part of the 1980s. Why, then, has it
suddenly gained the attention of the business world? Following are some of the most
important reasons:
• More intense competition at the global scale driven by customers’ ever-changing
needs and wants in an increasingly saturated marketplace.
• General recognition of the untapped value hidden in large data sources.
• Consolidation and integration of database records, which enables a single view of
customers, vendors, transactions, and so on.
• Consolidation of databases and other data repositories into a single location in the
form of a data warehouse.
• The exponential increase in data processing and storage technologies.
• Significant reduction in the cost of hardware and software for data storage and
processing.
• Movement toward the demassification (conversion of information resources into
nonphysical form) of business practices.
Data generated by the Internet are increasing rapidly in both volume and complex-
ity. Large amounts of genomic data are being generated and accumulated all over the
world. Disciplines such as astronomy and nuclear physics create huge quantities of data
on a regular basis. Medical and pharmaceutical researchers constantly generate and store
data that can then be used in data mining applications to identify better ways to accu-
rately diagnose and treat illnesses and to discover new and improved drugs.
On the commercial side, perhaps the most common use of data mining has been
in the finance, retail, and healthcare sectors. Data mining is used to detect and reduce
fraudulent activities, especially in insurance claims and credit card use (Chan et al., 1999);
to identify customer buying patterns (Hoffman, 1999); to reclaim profitable customers
(Hoffman, 1998); to identify trading rules from historical data; and to aid in increased
profitability using market-basket analysis. Data mining is already widely used to bet-
ter target clients, and with the widespread development of e-commerce, this can only
become more imperative with time. See Application Case 4.1 for information on how
Visa has used predictive analytics and data mining to improve customer service,
combat fraud, and increase profit.
Application Case 4.1 Visa Is Enhancing the Customer Experience while Reducing Fraud with Predictive Analytics and Data Mining
When card issuers first started using automated busi-
ness rules software to counter debit and credit card
fraud, the limits on that technology were quickly
evident: Customers reported frustrating payment
rejections on dream vacations or critical business
trips. Visa works with its clients to improve cus-
tomer experience by providing cutting-edge fraud
risk tools and consulting services that make its
strategies more effective. Through this approach,
Visa enhances customer experience and minimizes
invalid transaction declines.
The company’s global network connects
thousands of financial institutions with millions of
merchants and cardholders every day. It has been
a pioneer in cashless payments for more than 50
years. By using SAS® Analytics, Visa is supporting
financial institutions to reduce fraud without upset-
ting customers with unnecessary payment rejections.
Whenever it processes a transaction, Visa analyzes
up to 500 unique variables in real time to assess the
risk of that transaction. Using vast data sets, includ-
ing global fraud hot spots and transactional patterns,
the company can more accurately assess whether
you’re buying escargot in Paris or someone who
stole your credit card is.
“What that means is that if you are likely to
travel we know it, and we tell your financial insti-
tution so you’re not declined at the point of sale,”
says Nathan Falkenborg, head of Visa Performance
Solutions for North Asia. “We also will assist your
bank in developing the right strategies for using the
Visa tools and scoring systems,” he adds. Visa esti-
mates that Big Data analytics works; state-of-the-art
models and scoring systems have the potential to
prevent an incremental $2 billion of fraudulent pay-
ment volume annually.
A globally recognized name, Visa facilitates
electronic funds transfer through branded products
that are issued by its thousands of financial institu-
tion partners. The company processed 64.9 billion
transactions in 2014, and $4.7 trillion in purchases
were made with Visa cards in the same year.
Visa has the computing capability to process
56,000 transaction messages per second, which is
more than four times the actual peak transaction rate
to date. Visa does not just process and compute—it
is continually using analytics to share strategic and
operational insights with its partner financial insti-
tutions and assist them in improving performance.
This business goal is supported by a robust data
management system. Visa also assists its clients in
improving performance by developing and deliver-
ing deep analytical insight.
“We understand patterns of behavior by per-
forming clustering and segmentation at a granular
level, and we provide this insight to our financial
institution partners,” says Falkenborg. “It’s an effec-
tive way to help our clients communicate better and
deepen their understanding of the customer.”
As an example of marketing support, Visa has
assisted clients globally in identifying segments of
customers that should be offered a different Visa
product. “Understanding the customer lifecycle is
incredibly important, and Visa provides information
to clients that help them take action and offer the
right product to the right customer before a value
proposition becomes stale,” says Falkenborg.
How Can Using In-Memory Analytics
Make a Difference?
In a recent proof of concept, Visa used a high-
performance solution from SAS that relies on
in-memory computing to power statistical and
machine-learning algorithms and then present the
information visually. In-memory analytics reduces
the need to move data and perform additional model
iterations, making it both faster and more accurate.
Falkenborg describes the solution as like
having the information memorized versus having
to get up and go to a filing cabinet to retrieve it.
“In-memory analytics is just taking your brain and
making it bigger. Everything is instantly accessible.”
Ultimately, solid analytics helps the company
do more than just process payments. “We can deepen
the client conversation and serve our clients even
better with our incredible big data set and expertise
in mining transaction data,” says Falkenborg. “We
use our consulting and analytics capabilities to assist
our clients in tackling business challenges and pro-
tect the payment ecosystem. And that’s what we do
with high-performance analytics.”
Falkenborg elaborates,
The challenge that we have, as with any com-
pany managing and using massive data sets,
is how we use all necessary information to
solve a business challenge—whether that is
improving our fraud models, or assisting a cli-
ent to more effectively communicate with its
customers. In-memory analytics enables us to
be more nimble; with a 100x analytical system
processing speed improvement, our data and
decision scientists can iterate much faster.
Fast and accurate predictive analytics allows
Visa to better serve clients with tailored consult-
ing services, helping them succeed in today’s fast-
changing payments industry.
Questions for Case 4.1
1. What challenges were Visa and the rest of the
credit card industry facing?
2. How did Visa improve customer service while also reducing fraud?
3. What is in-memory analytics, and why was it
necessary?
Source: “Enhancing the Customer Experience While Reducing
Fraud (SAS® Analytics) High-Performance Analytics Empowers
Visa to Enhance Customer Experience While Reducing Debit and
Credit Card Fraud.” Copyright © 2018 SAS Institute Inc., Cary, NC,
USA. Reprinted with permission. All rights reserved.
Definitions, Characteristics, and Benefits
Simply defined, data mining is a term used to describe discovering or “mining” knowl-
edge from large amounts of data. When considered by analogy, one can easily realize
that the term data mining is a misnomer; that is, mining of gold from within rocks or dirt
is referred to as “gold” mining rather than “rock” or “dirt” mining. Therefore, data min-
ing perhaps should have been named “knowledge mining” or “knowledge discovery.”
Despite the mismatch between the term and its meaning, data mining has become the
choice of the community. Many other names that are associated with data mining include
knowledge extraction, pattern analysis, data archaeology, information harvesting, pattern
searching, and data dredging.
Technically speaking, data mining is a process that uses statistical, mathematical,
and artificial intelligence techniques to extract and identify useful information and subse-
quent knowledge (or patterns) from large sets of data. These patterns can be in the form
of business rules, affinities, correlations, trends, or prediction models (see Nemati and
Barko, 2001). Most literature defines data mining as “the nontrivial process of identifying
valid, novel, potentially useful, and ultimately understandable patterns in data stored in
structured databases,” where the data are organized in records structured by categori-
cal, ordinal, and continuous variables (Fayyad, Piatetsky-Shapiro, and Smyth, 1996, pp.
40–41). In this definition, the meanings of the key terms are as follows:
• Process implies that data mining comprises many iterative steps.
• Nontrivial means that some experimental type search or inference is involved; that
is, it is not as straightforward as a computation of predefined quantities.
• Valid means that the discovered patterns should hold true on new data with a suf-
ficient degree of certainty.
• Novel means that the patterns are not previously known to the user within the con-
text of the system being analyzed.
• Potentially useful means that the discovered patterns should lead to some benefit to
the user or task.
• Ultimately understandable means that the pattern should make business sense that
leads to the user saying, “Mmm! It makes sense; why didn’t I think of that,” if not
immediately, at least after some postprocessing.
Data mining is not a new discipline but rather a new definition for the use of
many disciplines. Data mining is tightly positioned at the intersection of many disci-
plines, including statistics, artificial intelligence, machine learning, management sci-
ence, information systems (IS), and databases (see Figure 4.1). Using advances in all of
these disciplines, data mining strives to make progress in extracting useful information
and knowledge from large databases. It is an emerging field that has attracted much
attention in a very short time.
The following are the major characteristics and objectives of data mining:
• Data are often buried deep within very large databases, which sometimes contain
data from several years. In many cases, the data are cleansed and consolidated into
a data warehouse. Data can be presented in a variety of formats (see Chapter 3 for
a brief taxonomy of data).
• The data mining environment is usually a client/server architecture or a Web-based
IS architecture.
• Sophisticated new tools, including advanced visualization tools, help remove the
information ore buried in corporate files or archival public records. Finding it involves
massaging and synchronizing the data to get the right results. Cutting-edge data min-
ers are also exploring the usefulness of soft data (i.e., unstructured text stored in such
places as Lotus Notes databases, text files on the Internet, or enterprisewide intranets).
• The miner is often an end user empowered by data drills and other powerful query tools
to ask ad hoc questions and obtain answers quickly with little or no programming skill.
• “Striking it rich” often involves finding an unexpected result and requires end users to
think creatively throughout the process, including the interpretation of the findings.
• Data mining tools are readily combined with spreadsheets and other software devel-
opment tools. Thus, the mined data can be analyzed and deployed quickly and easily.
• Because of the large amounts of data and massive search efforts, it is sometimes
necessary to use parallel processing for data mining.
A company that effectively leverages data mining tools and technologies can ac-
quire and maintain a strategic competitive advantage. Data mining offers organizations
an indispensable decision-enhancing environment to exploit new opportunities by trans-
forming data into a strategic weapon. See Nemati and Barko (2001) for a more detailed
discussion on the strategic benefits of data mining.
How Data Mining Works
Using existing and relevant data obtained from within and outside the organization, data
mining builds models to discover patterns among the attributes presented in the data
set. Models are the mathematical representations (simple linear relationships/affinities
and/or complex and highly nonlinear relationships) that identify the patterns among the
attributes of the things (e.g., customers, events) described within the data set. Some of
these patterns are explanatory (explaining the interrelationships and affinities among the
attributes), whereas others are predictive (foretelling future values of certain attributes).
In general, data mining seeks to identify four major types of patterns:
1. Associations find the commonly co-occurring groupings of things, such as beer and
diapers going together in market-basket analysis.
2. Predictions tell the nature of future occurrences of certain events based on what has
happened in the past, such as predicting the winner of the Super Bowl or forecasting
the absolute temperature of a particular day.
3. Clusters identify natural groupings of things based on their known characteristics,
such as assigning customers in different segments based on their demographics and
past purchase behaviors.
4. Sequential relationships discover time-ordered events, such as predicting that an
existing banking customer who already has a checking account will open a savings
account followed by an investment account within a year.
[Figure 4.1 Data Mining Is a Blend of Multiple Disciplines: statistics, artificial intelligence, machine learning and pattern recognition, management science and information systems, database management and data warehousing, and information visualization, with data mining (knowledge discovery) at their intersection.]
Application Case 4.2 shows how American Honda uses data mining (a critical
component of advanced analytics tools) to enhance its understanding of warranty
claims, forecast future part and resource needs, and better understand customer needs,
wants, and opinions.
Application Case 4.2 American Honda Uses Advanced Analytics to Improve Warranty Claims
Background
When a car or truck owner brings a vehicle into
an Acura or Honda dealership in the United States,
there’s more to the visit than a repair or a service
check. During each visit, the service technicians gen-
erate data on the repairs, including any warranty
claims to American Honda Motor Co., Inc., that feed
directly into its database. This includes what type of
work was performed, what the customer paid, ser-
vice advisor comments, and many other data points.
Now, multiply this process by dozens of visits
a day at over 1,200 dealerships nationwide, and it’s
clear—American Honda has big data. It’s up to peo-
ple like Kendrick Kau, assistant manager of American
Honda’s Advanced Analytics group, to draw insights
from this data and turn it into a useful asset.
Examining Warranty Data to Make
Maintenance More Efficient
Like any other major automobile distributor,
American Honda works with a network of dealer-
ships that perform warrantied repair work on its
vehicles. This can be a significant cost for the com-
pany, so American Honda uses analytics to make
sure that warranty claims are complete and accurate
upon submission.
In the case of warranty claims, Kau’s team
helps empower dealers to understand the appro-
priate warranty processes by providing them with
useful information via an online report. To support
a goal of reducing inappropriate warranty costs,
Kau and his team must sift through information on
repairs, parts, customers, and other details. They
chose a visual approach to business intelligence and
analytics, powered by SAS, to identify cost reduction
opportunities.
To decrease warranty expense, the Advanced
Analytics team used SAS Analytics to create a propri-
etary process to surface suspicious warranty claims
for scrutiny on a daily basis to make sure they are
in compliance with existing guidelines. The effort to
identify and scrutinize claims was once fairly man-
ual, tedious, and time-intensive.
“Before SAS, it took one of our staff members
one week out of each month to aggregate and report
warranty data within Microsoft Excel spreadsheets,”
Kau says. “Now, with SAS, we populate those same
reports on an easily accessible online dashboard
automatically, and we recovered a week of man-
power that we could put on other projects.”
By applying SAS Analytics to warranty data, the
Advanced Analytics group gave the Claims group
and field personnel the ability to quickly and accu-
rately identify claims that were incomplete, inaccu-
rate, or noncompliant. The results were impressive.
“Initially, it took our examiners over three min-
utes on average to identify a potentially noncompli-
ant claim, and even then, they were only finding a
truly noncompliant claim 35 percent of the time,”
Kau says. “Now, with SAS, it takes less than a minute
to identify a suspicious claim. And in that time, they
are finding a noncompliant claim 76 percent of the
time.”
The effort to increase warranty compliance has
paid off for American Honda. Through more com-
plete analysis of warranty claims—and more edu-
cation at the dealerships—American Honda saw a
reduction in labor costs for 52 percent of its available
labor codes.
Using Service Data to Forecast
Future Needs
The American Honda Advanced Analytics team also
uses service and parts data to develop stronger bonds
with customers by ensuring dealers have in-demand
parts available for customer repairs. Having the right
parts available—at the right time—is paramount, so
vehicle repairs data feed directly into American Honda’s
marketing and customer retention efforts.
“For the marketing team, we provide strate-
gic insight to help shape their programs that are
designed to drive customers to the dealers, and ulti-
mately, keep them loyal to our brand,” Kau says.
“The goal of Honda is lifetime owner loyalty. We
want our customers to have a good experience, and
one of the ways to do that is through exceptional
service.”
American Honda uses SAS Forecast Server to
assist with business planning to ensure adequate
resources are available to meet future demands
for services. Using historical information on repair
orders and certifications, they developed a time
series using years of previous repairs. By combining
time series information with sales data, Kau’s team
can project where the company’s greatest opportu-
nities are in the years ahead.
“Our goal is to forecast the number of vehi-
cles in operation in order to predict the volume of
customers coming into the dealerships,” Kau says.
“And that translates to how many parts we should
have on hand and helps us to plan staffing to meet
customer demands. Looking backward on a year-by-
year basis, we’ve been within 1 percent of where we
forecast to be. That’s extremely good for a forecast,
and I attribute much of that to the abilities of the
SAS software.”
Customer Feedback that Drives
the Business
Another way American Honda uses analytics is to
quickly evaluate customer survey data. Using SAS,
the Advanced Analytics team mines survey data
to gain insight into how vehicles are being used
and identify design changes that are most likely to
improve customer satisfaction.
On a weekly basis, the analytics team exam-
ines customer survey data. Kau’s team uses SAS to
flag emerging trends that may require the attention
of design, manufacturing, engineering, or other
groups. With SAS technology, users can drill down
from high-level issues to more specific responses to
understand a potential root cause.
“We can look into the data and see what the
customers are saying,” Kau says. “And that leads
to a number of questions that we can tackle. Is a
component designed in the most optimal way? Is
it a customer education issue? Is it something that
we should address at the manufacturing process?
Because of SAS, these are critical questions that we
can now identify using our data.”
Questions for Case 4.2
1. How does American Honda use analytics to
improve warranty claims?
2. In addition to warranty claims, for what other
purposes does American Honda use advanced
analytics methods?
3. Can you think of other uses of advanced analyt-
ics in the automotive industry? You can search
the Web to find some answers to this question.
Source: SAS Case Study, “American Honda Motor Co., Inc. Uses SAS Advanced Analytics to Improve Warranty Claims,” https://www.sas.com/en_us/customers/american-honda.html (accessed June 2018).
These types of patterns have been manually extracted from data by humans for
centuries, but the increasing volume of data in modern times has created the need for
more automatic approaches. As data sets have grown in size and complexity, direct
manual data analysis has increasingly been augmented with indirect, automatic data-
processing tools that use sophisticated methodologies, methods, and algorithms. The
manifestation of such evolution of automated and semiautomated means of processing
large data sets is now commonly referred to as data mining.
Generally speaking, data mining tasks can be classified into three main categories:
prediction, association, and clustering. Based on the way in which the patterns are ex-
tracted from the historical data, the learning algorithms of data mining methods can be
classified as either supervised or unsupervised. With supervised learning algorithms, the
training data includes both the descriptive attribute (i.e., independent variables or deci-
sion variables) and the class attribute (i.e., output variable or result variable). In contrast,
with unsupervised learning, the training includes only descriptive attributes. Figure 4.2
shows a simple taxonomy for data mining tasks along with the learning methods and
popular algorithms for each of the data mining tasks.
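To make the distinction concrete, the following minimal Python sketch (assuming the open-source scikit-learn library and a fabricated six-customer data set, neither of which comes from the text) contrasts the two learning types: the supervised learner is fit on both the descriptive attributes and the class attribute, whereas the unsupervised learner is fit on the descriptive attributes alone.

# Supervised vs. unsupervised learning in miniature (scikit-learn assumed;
# the customer data are fabricated for illustration only).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X = np.array([[25, 40000], [47, 82000], [52, 91000],
              [23, 38000], [45, 78000], [56, 99000]])   # descriptive attributes: age, income
y = np.array([0, 1, 1, 0, 1, 1])                        # class attribute: 1 = responded to offer

clf = DecisionTreeClassifier(random_state=0).fit(X, y)  # supervised: trained on X and y
print(clf.predict([[30, 45000]]))                       # predicted class label

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # unsupervised: trained on X only
print(km.labels_)                                            # discovered cluster memberships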
PREDICTION Prediction is commonly referred to as the act of telling about the future. It
differs from simple guessing by taking into account the experiences, opinions, and other
relevant information in conducting the task of foretelling. A term that is commonly associ-
ated with prediction is forecasting. Even though many believe that these two terms are
synonymous, there is a subtle but critical difference between the two. Whereas prediction
is largely experience and opinion based, forecasting is data and model based. That is, in
order of increasing reliability, one might list the relevant terms as guessing, predicting,
and forecasting, respectively. In data mining terminology, prediction and forecasting are
used synonymously, and the term prediction is used as the common representation of the
act. Depending on the nature of what is being predicted, prediction can be named more
specifically as classification (where the predicted thing, such as tomorrow’s forecast, is a
class label such as “rainy” or “sunny”) or regression (where the predicted thing, such as
tomorrow’s temperature, is a real number, such as “65°F”).
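A brief, hedged Python sketch of this distinction follows; it assumes scikit-learn and an invented weather-like data set, and is meant only to show that the same descriptive attributes can feed either a classifier (predicting a class label such as "rainy" or "sunny") or a regressor (predicting a real number such as a temperature).

# Classification vs. regression on the chapter's weather example (scikit-learn
# assumed; the tiny data set is hypothetical and for illustration only).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Descriptive attributes: [humidity (%), pressure (hPa)] for past days.
X = np.array([[85, 1005], [40, 1022], [90, 1001], [35, 1025], [70, 1010]])

outlook = np.array(["rainy", "sunny", "rainy", "sunny", "rainy"])  # class labels
temp_f = np.array([58.0, 71.0, 55.0, 74.0, 63.0])                  # real-valued target

clf = DecisionTreeClassifier().fit(X, outlook)   # classification: predicts a class label
reg = DecisionTreeRegressor().fit(X, temp_f)     # regression: predicts a real number

print(clf.predict([[80, 1007]]))   # e.g., ['rainy']
print(reg.predict([[80, 1007]]))   # e.g., a temperature in degrees Fahrenheit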
CLASSIFICATION Classification, or supervised induction, is perhaps the most common of
all data mining tasks. The objective of classification is to analyze historical data stored in a
database and automatically generate a model that can predict future behavior. This induced
model consists of generalizations over the records of a training data set, which help distin-
guish predefined classes. The hope is that the model can then be used to predict the classes
of other unclassified records and, more importantly, to accurately predict actual future events.
Common classification tools include neural networks and decision trees (from
machine learning), logistic regression and discriminant analysis (from traditional statis-
tics), and emerging tools such as rough sets, support vector machines (SVMs), and ge-
netic algorithms. Statistics-based classification techniques (e.g., logistic regression and
discriminant analysis) have received their share of criticism—that they make unrealistic
assumptions about the data, such as independence and normality—which limit their use
in classification-type data mining projects.
Neural networks involve the development of mathematical structures (somewhat
resembling the biological neural networks in the human brain) that have the capability
to learn from past experiences presented in the form of well-structured data sets. They
tend to be more effective when the number of variables involved is rather large and the
relationships among them are complex and imprecise. Neural networks have disadvan-
tages as well as advantages. For example, providing a good rationale for the predictions
made by a neural network is usually very difficult. Also, training neural networks usually
takes a considerable amount of time. Unfortunately, the time needed for training tends to
increase exponentially as the volume of data increases, and in general, neural networks
cannot be trained on very large databases. These and other factors have limited the
applicability of neural networks in data-rich domains.
[Figure 4.2 Simple Taxonomy for Data Mining Tasks, Methods, and Algorithms. Prediction: Classification (supervised; decision trees, neural networks, support vector machines, kNN, Naïve Bayes, GA), Regression (supervised; linear/nonlinear regression, ANN, regression trees, SVM, kNN, GA), and Time series (autoregressive methods, averaging methods, exponential smoothing, ARIMA). Association: Market-Basket (unsupervised; Apriori, OneR, ZeroR, Eclat, GA), Link Analysis (unsupervised; expectation maximization, Apriori algorithm, graph-based matching), and Sequence Analysis (unsupervised; Apriori algorithm, FP-Growth, graph-based matching). Segmentation: Clustering and Outlier Analysis (unsupervised; k-means, expectation maximization (EM)).]
Decision trees classify data into a finite number of classes based on the values of
the input variables. Decision trees are essentially a hierarchy of if-then statements and are
thus significantly faster than neural networks. They are most appropriate for categorical
data and interval data. Therefore, incorporating continuous variables into a decision
tree framework requires discretization, that is, converting continuous valued numerical
variables to ranges and categories.
A related category of classification tools is rule induction. Unlike with a decision
tree, with rule induction the if-then statements are induced from the training data directly,
and they need not be hierarchical in nature. Other, more recent techniques such as SVM,
rough sets, and genetic algorithms are gradually finding their way into the arsenal of clas-
sification algorithms.
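Because the induced model is a hierarchy of if-then statements, it can be inspected directly. The short sketch below is one possible illustration, assuming scikit-learn and a fabricated loan-application data set; the export_text helper prints the learned rules.

# Inducing and inspecting a decision tree's if-then rules (scikit-learn assumed;
# the loan-application data below are fabricated for illustration).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Descriptive attributes: [annual income (k$), debt ratio]; class: 1 = default.
X = np.array([[35, 0.6], [80, 0.2], [45, 0.5], [95, 0.1],
              [30, 0.7], [70, 0.3], [55, 0.4], [85, 0.15]])
y = np.array([1, 0, 1, 0, 1, 0, 0, 0])

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the induced hierarchy of if-then rules.
print(export_text(tree, feature_names=["income_k", "debt_ratio"]))

Printing the rules in this way also illustrates why decision trees are often preferred when the rationale behind a prediction must be explained, something that is much harder to do with a neural network.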
CLUSTERING Clustering partitions a collection of things (e.g., objects, events, presented
in a structured data set) into segments (or natural groupings) whose members share simi-
lar characteristics. Unlike in classification, in clustering, the class labels are unknown. As
the selected algorithm goes through the data set, identifying the commonalities of things
based on their characteristics, the clusters are established. Because the clusters are de-
termined using a heuristic-type algorithm and because different algorithms could end up
with different sets of clusters for the same data set, before the results of clustering tech-
niques are put to actual use, it could be necessary for an expert to interpret, and poten-
tially modify, the suggested clusters. After reasonable clusters have been identified, they
can be used to classify and interpret new data.
Not surprisingly, clustering techniques include optimization. The goal of clustering
is to create groups so that the members within each group have maximum similarity and
the members across groups have minimum similarity. The most commonly used cluster-
ing techniques include k-means (from statistics) and self-organizing maps (from machine
learning), which is a unique neural network architecture developed by Kohonen (1982).
Firms often effectively use their data mining systems to perform market segmenta-
tion with cluster analysis. Cluster analysis is a means of identifying classes of items so that
items in a cluster have more in common with each other than with items in other clusters.
Cluster analysis can be used in segmenting customers and directing appropriate market-
ing products to the segments at the right time in the right format at the right price. Cluster
analysis is also used to identify natural groupings of events or objects so that a common
set of characteristics of these groups can be identified to describe them.
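A minimal customer-segmentation sketch along these lines, assuming scikit-learn and fabricated customer attributes, might look as follows; the attributes are standardized first so that no single attribute dominates the distance calculation.

# Market segmentation with k-means clustering (scikit-learn assumed; the
# customer attributes below are fabricated for illustration).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Descriptive attributes: [age, annual spend ($), visits per month].
customers = np.array([[23, 1200, 2], [25, 1500, 3], [47, 8200, 9],
                      [52, 9100, 11], [31, 2100, 4], [49, 7800, 10]])

X = StandardScaler().fit_transform(customers)        # put attributes on a common scale
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.labels_)           # cluster membership of each customer
print(km.cluster_centers_)  # segment profiles (in standardized units)

In practice, an analyst would try several values of k and inspect (and possibly relabel) the resulting segments before putting them to use, as noted above.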
ASSOCIATIONS Associations, or association rule learning in data mining, is a popular
and well-researched technique for discovering interesting relationships among variables
in large databases. Thanks to automated data-gathering technologies such as bar code
scanners, the use of association rules for discovering regularities among products in large-
scale transactions recorded by point-of-sale systems in supermarkets has become a com-
mon knowledge discovery task in the retail industry. In the context of the retail industry,
association rule mining is often called market-basket analysis.
Two commonly used derivatives of association rule mining are link analysis and
sequence mining. With link analysis, the linkage among many objects of interest is dis-
covered automatically, such as the link between Web pages and referential relationships
among groups of academic publication authors. With sequence mining, relationships
are examined in terms of their order of occurrence to identify associations over time.
Algorithms used in association rule mining include the popular Apriori (where frequent
itemsets are identified) and FP-Growth, OneR, ZeroR, and Eclat.
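The essential calculations behind market-basket analysis, support (how often items co-occur) and confidence (how often the consequent appears when the antecedent does), can be illustrated with a few lines of plain Python; the transactions below are invented, and a real project would rely on a full Apriori or FP-Growth implementation from an open-source package rather than this pairwise sketch.

# Bare-bones support/confidence calculation for the classic "beer and diapers"
# example (transactions are made up; a production system would use a complete
# Apriori or FP-Growth implementation instead).
from itertools import combinations
from collections import Counter

transactions = [{"beer", "diapers", "chips"},
                {"beer", "diapers"},
                {"milk", "bread"},
                {"beer", "chips"},
                {"diapers", "milk", "beer"}]
n = len(transactions)

item_counts = Counter()
pair_counts = Counter()
for basket in transactions:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

for (a, b), count in pair_counts.items():
    support = count / n                   # fraction of baskets containing both items
    confidence = count / item_counts[a]   # how often b appears when a does
    if support >= 0.4:
        print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")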
VISUALIZATION AND TIME-SERIES FORECASTING Two techniques often associated with
data mining are visualization and time-series forecasting. Visualization can be used in
conjunction with other data mining techniques to gain a clearer understanding of underly-
ing relationships. As the importance of visualization has increased in recent years, a new
term, visual analytics, has emerged. The idea is to combine analytics and visualization in a
single environment for easier and faster knowledge creation. Visual analytics is covered in
detail in Chapter 3. In time-series forecasting, the data consist of values of the same vari-
able that are captured and stored over time in regular intervals. These data are then used
to develop forecasting models to extrapolate the future values of the same variable.
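As a hedged illustration, the sketch below fits a simple exponential-smoothing model to an invented monthly demand series and extrapolates it three periods ahead; it assumes the open-source statsmodels library, whose API details can vary by version.

# Time-series forecasting by extrapolating past values of the same variable
# (statsmodels assumed; the monthly demand figures are invented).
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

demand = pd.Series(
    [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118],
    index=pd.date_range("2023-01-01", periods=12, freq="MS"),
)

# Fit an exponential-smoothing model with an additive trend and forecast 3 months ahead.
model = ExponentialSmoothing(demand, trend="add").fit()
print(model.forecast(3))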
Data Mining Versus Statistics
Data mining and statistics have a lot in common. They both look for relationships within
data. Most people call statistics the “foundation of data mining.” The main difference
between the two is that statistics starts with a well-defined proposition and hypothesis
whereas data mining starts with a loosely defined discovery statement. Statistics collects
sample data (i.e., primary data) to test the hypothesis whereas data mining and analyt-
ics use all the existing data (i.e., often observational, secondary data) to discover novel
patterns and relationships. Another difference comes from the size of data that they use.
Data mining looks for data sets that are as “big” as possible, whereas statistics looks for
the right size of data (if the data are larger than what is needed/required for the statisti-
cal analysis, a sample of them is used). The meaning of “large data” is rather different
between statistics and data mining. A few hundred to a thousand data points are large
enough to a statistician, but several million to a few billion data points are considered
large for data mining studies.
SECTION 4.2 REVIEW QUESTIONS
1. Define data mining. Why are there many different names and definitions for data
mining?
2. What recent factors have increased the popularity of data mining?
3. Is data mining a new discipline? Explain.
4. What are some major data mining methods and algorithms?
5. What are the key differences between the major data mining tasks?
4.3 DATA MINING APPLICATIONS
Data mining has become a popular tool in addressing many complex business problems
and opportunities. It has been proven to be very successful and helpful in many areas,
some of which are shown by the following representative examples. The goal of many of
these business data mining applications is to solve a pressing problem or to explore an
emerging business opportunity to create a sustainable competitive advantage.
• Customer relationship management. CRM is the extension of traditional mar-
keting. The goal of CRM is to create one-on-one relationships with customers by
developing an intimate understanding of their needs and wants. As businesses build
relationships with their customers over time through a variety of interactions (e.g.,
product inquiries, sales, service requests, warranty calls, product reviews, social
media connections), they accumulate tremendous amounts of data. When combined
with demographic and socioeconomic attributes, this information-rich data can be
used to (1) identify most likely responders/buyers of new products/services (i.e.,
customer profiling), (2) understand the root causes of customer attrition to improve
customer retention (i.e., churn analysis), (3) discover time-variant associations be-
tween products and services to maximize sales and customer value, and (4) identify
the most profitable customers and their preferential needs to strengthen relation-
ships and to maximize sales.
• Banking. Data mining can help banks with the following: (1) automating the
loan application process by accurately predicting the most probable defaulters,
(2) detecting fraudulent credit card and online banking transactions, (3) identifying
ways to maximize value for customers by selling them products and services that
they are most likely to buy, and (4) optimizing the cash return by accurately fore-
casting the cash flow on banking entities (e.g., ATM machines, banking branches).
• Retailing and logistics. In the retailing industry, data mining can be used to (1)
predict accurate sales volumes at specific retail locations to determine correct inven-
tory levels, (2) identify sales relationships between different products (with market-
basket analysis) to improve the store layout and optimize sales promotions, (3) forecast
consumption levels of different product types (based on seasonal and environmental
conditions) to optimize logistics and, hence, maximize sales, and (4) discover interest-
ing patterns in the movement of products (especially for products that have a limited
shelf life because they are prone to expiration, perishability, and contamination) in
a supply chain by analyzing sensory and radio-frequency identification (RFID) data.
• Manufacturing and production. Manufacturers can use data mining to (1) pre-
dict machinery failures before they occur through the use of sensory data (enabling
what is called condition-based maintenance), (2) identify anomalies and common-
alities in production systems to optimize manufacturing capacity, and (3) discover
novel patterns to identify and improve product quality.
• Brokerage and securities trading. Brokers and traders use data mining to
(1) predict when and how much certain bond prices will change, (2) forecast the
range and direction of stock fluctuations, (3) assess the effect of particular issues
and events on overall market movements, and (4) identify and prevent fraudulent
activities in securities trading.
• Insurance. The insurance industry uses data mining techniques to (1) forecast
claim amounts for property and medical coverage costs for better business planning,
(2) determine optimal rate plans based on the analysis of claims and customer data,
(3) predict which customers are more likely to buy new policies with special fea-
tures, and (4) identify and prevent incorrect claim payments and fraudulent activities.
• Computer hardware and software. Data mining can be used to (1) predict disk
drive failures well before they actually occur, (2) identify and filter unwanted Web
content and e-mail messages, (3) detect and prevent computer network security
breaches, and (4) identify potentially unsecure software products.
• Government and defense. Data mining also has a number of military appli-
cations. It can be used to (1) forecast the cost of moving military personnel and
equipment, (2) predict an adversary’s moves and, hence, develop more successful
strategies for military engagements, (3) predict resource consumption for better
planning and budgeting, and (4) identify classes of unique experiences, strategies,
and lessons learned from military operations for better knowledge sharing through-
out the organization.
• Travel industry (airlines, hotels/resorts, rental car companies). Data mining
has a variety of uses in the travel industry. It is successfully used to (1) predict sales
of different services (seat types in airplanes, room types in hotels/resorts, car types in
rental car companies) in order to optimally price services to maximize revenues as a
function of time-varying transactions (commonly referred to as yield management),
(2) forecast demand at different locations to better allocate limited organizational
resources, (3) identify the most profitable customers and provide them with person-
alized services to maintain their repeat business, and (4) retain valuable employees
by identifying and acting on the root causes for attrition.
• Healthcare. Data mining has a number of healthcare applications. It can be used
to (1) identify people without health insurance and the factors underlying this unde-
sired phenomenon, (2) identify novel cost–benefit relationships between different
treatments to develop more effective strategies, (3) forecast the level and the time of
demand at different service locations to optimally allocate organizational resources,
and (4) understand the underlying reasons for customer and employee attrition.
• Medicine. Use of data mining in medicine should be viewed as an invaluable
complement to traditional medical research, which is mainly clinical and biological
in nature. Data mining analyses can (1) identify novel patterns to improve surviv-
ability of patients with cancer, (2) predict success rates of organ transplantation
patients to develop better organ donor matching policies, (3) identify the functions
of different genes in the human chromosome (known as genomics), and (4) dis-
cover the relationships between symptoms and illnesses (as well as illnesses and
successful treatments) to help medical professionals make informed and correct
decisions in a timely manner.
• Entertainment industry. Data mining is successfully used by the entertainment
industry to (1) analyze viewer data to decide what programs to show during prime
time and how to maximize returns by knowing where to insert advertisements,
(2) predict the financial success of movies before they are produced to make invest-
ment decisions and to optimize the returns, (3) forecast the demand at different loca-
tions and different times to better schedule entertainment events and to optimally
allocate resources, and (4) develop optimal pricing policies to maximize revenues.
• Homeland security and law enforcement. Data mining has a number of home-
land security and law enforcement applications. It is often used to (1) identify patterns
of terrorist behaviors (see Application Case 4.3 for an example of the use of data min-
ing to track funding of terrorists’ activities), (2) discover crime patterns (e.g., locations,
timings, criminal behaviors, and other related attributes) to help solve criminal cases
in a timely manner, (3) predict and eliminate potential biological and chemical attacks
to the nation’s critical infrastructure by analyzing special-purpose sensory data, and
(4) identify and stop malicious attacks on critical information infrastructures (often
called information warfare).
Application Case 4.3 Predictive Analytics and Data Mining Help Stop Terrorist Funding
The terrorist attack on the World Trade Center on
September 11, 2001, underlined the importance of
open source intelligence. The USA PATRIOT Act and
the creation of the U.S. Department of Homeland
Security heralded the potential application of infor-
mation technology and data mining techniques to
detect money laundering and other forms of terror-
ist financing. Law enforcement agencies had been
focusing on money laundering activities via normal
transactions through banks and other financial ser-
vice organizations.
Law enforcement agencies are now focusing
on international trade pricing as a terrorism fund-
ing tool. Money launderers have used international
trade to move money silently out of a country with-
out attracting government attention. They achieve
this transfer by overvaluing imports and undervalu-
ing exports. For example, a domestic importer and
foreign exporter could form a partnership and over-
value imports, thereby transferring money from the
home country, resulting in crimes related to customs
fraud, income tax evasion, and money laundering.
The foreign exporter could be a member of a terror-
ist organization.
Data mining techniques focus on analysis of
data on import and export transactions from the
U.S. Department of Commerce and commerce-
related entities. Import prices that exceed the upper
quartile of import prices and export prices that are
lower than the lower quartile of export prices are
tracked. The focus is on abnormal transfer prices
between corporations that might result in shifting
taxable income and taxes out of the United States.
An observed price deviation could be related to
income tax avoidance/evasion, money laundering,
or terrorist financing. The observed price devia-
tion could also be due to an error in the U.S. trade
database.
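The quartile rule described above can be illustrated with a small, hypothetical pandas sketch; the transaction records and thresholds are fabricated and are not drawn from the actual U.S. trade database.

# Flagging abnormal trade prices with the quartile rule described above
# (pandas assumed; the transaction records are fabricated for illustration).
import pandas as pd

trades = pd.DataFrame({
    "direction":  ["import", "import", "import", "export", "export", "export"],
    "commodity":  ["widgets"] * 6,
    "unit_price": [10.0, 11.5, 95.0, 9.8, 10.2, 1.1],
})

imports = trades[trades["direction"] == "import"]
exports = trades[trades["direction"] == "export"]

upper_quartile = imports["unit_price"].quantile(0.75)   # threshold for overvalued imports
lower_quartile = exports["unit_price"].quantile(0.25)   # threshold for undervalued exports

flagged = pd.concat([
    imports[imports["unit_price"] > upper_quartile],
    exports[exports["unit_price"] < lower_quartile],
])
print(flagged)   # abnormal transfer prices to be investigated further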
Data mining will result in efficient evalua-
tion of data, which, in turn, will aid in the fight
against terrorism. The application of information
technology and data mining techniques to financial
transactions can contribute to better intelligence
information.
Questions for Case 4.3
1. How can data mining be used to fight terrorism?
Comment on what else can be done beyond
what is covered in this short application case.
2. Do you think data mining, although essential for
fighting terrorist cells, also jeopardizes individu-
als’ rights of privacy?
Sources: J. S. Zdanowicz, “Detecting Money Laundering and
Terrorist Financing via Data Mining,” Communications of the ACM,
47(5), May 2004, p. 53; R. J. Bolton, “Statistical Fraud Detection: A
Review,” Statistical Science, 17(3), January 2002, p. 235.
• Sports. Data mining was used to improve the performance of National Basketball
Association (NBA) teams in the United States. Major League Baseball teams are into
predictive analytics and data mining to optimally utilize their limited resources for
a winning season. In fact, most, if not all, professional sports today employ data
crunchers and use data mining to increase their chances of winning. Data mining
applications are not limited to professional sports. In an article, Delen et al. (2012)
developed data mining models to predict National Collegiate Athletic Association
(NCAA) Bowl Game outcomes using a wide range of variables about the two op-
posing teams’ previous games statistics (more details about this case study are pro-
vided in Chapter 3). Wright (2012) used a variety of predictors for examination of
the NCAA men’s basketball championship (a.k.a. March Madness) bracket.
SECTION 4.3 REVIEW QUESTIONS
1. What are the major application areas for data mining?
2. Identify at least five specific applications of data mining and list five common charac-
teristics of these applications.
3. What do you think is the most prominent application area for data mining? Why?
4. Can you think of other application areas for data mining not discussed in this section?
Explain.
4.4 DATA MINING PROCESS
To systematically carry out data mining projects, a general process is usually followed.
Based on best practices, data mining researchers and practitioners have proposed sev-
eral processes (workflows or simple step-by-step approaches) to maximize the chances
of success in conducting data mining projects. These efforts have led to several standard-
ized processes, some of which (a few of the most popular ones) are described in this
section.
One such standardized process, arguably the most popular one, the Cross-Industry
Standard Process for Data Mining—CRISP-DM—was proposed in the mid-1990s by a
European consortium of companies to serve as a nonproprietary standard methodology
for data mining (CRISP-DM, 2013). Figure 4.3 illustrates this proposed process, which is
a sequence of six steps that starts with a good understanding of the business and the
need for the data mining project (i.e., the application domain) and ends with the deploy-
ment of the solution that satisfies the specific business need. Even though these steps are
sequential in nature, there is usually a great deal of backtracking. Because data mining
is driven by experience and experimentation, depending on the problem situation and
the knowledge/experience of the analyst, the whole process can be very iterative (i.e.,
one should expect to go back and forth through the steps quite a few times) and time
consuming. Because later steps are built on the outcomes of the former ones, one should
pay extra attention to the earlier steps in order to not put the whole study on an incorrect
path from the onset.
Step 1: Business Understanding
The key element of any data mining study is to know what the study is for. Determining
this begins with a thorough understanding of the managerial need for new knowledge
and an explicit specification of the business objective regarding the study to be con-
ducted. Specific goals answering questions such as “What are the common characteristics
of the customers we have lost to our competitors recently?” or “What are typical profiles
of our customers, and how much value does each of them provide to us?” are needed.
Then a project plan for finding such knowledge is developed that specifies the people
responsible for collecting the data, analyzing the data, and reporting the findings. At this
early stage, a budget to support the study should also be established at least at a high
level with rough numbers.
[Figure 4.3 The Six-Step CRISP-DM Data Mining Process: (1) Business Understanding, (2) Data Understanding, (3) Data Preparation, (4) Model Building, (5) Testing and Evaluation, and (6) Deployment, all revolving around the data.]
Step 2: Data Understanding
A data mining study is specific to addressing a well-defined business task, and different
business tasks require different sets of data. Following the business understanding
step, the main activity of the data mining process is to identify the relevant data from
many available databases. Some key points must be considered in the data iden-
tification and selection phase. First and foremost, the analyst should be clear and
concise about the description of the data mining task so that the most relevant data
can be identified. For example, a retail data mining project could seek to identify
spending behaviors of female shoppers who purchase seasonal clothes based on their
demographics, credit card transactions, and socioeconomic attributes. Furthermore,
the analyst should build an intimate understanding of the data sources (e.g., where
the relevant data are stored and in what form; what the process of collecting the data
is—automated versus manual; who the collectors of the data are and how often the
data are updated) and the variables (e.g., What are the most relevant variables? Are
there any synonymous and/or homonymous variables? Are the variables independent
of each other—do they stand as a complete information source without overlapping or
conflicting information?).
To better understand the data, the analyst often uses a variety of statistical and
graphical techniques, such as simple statistical summaries of each variable (e.g., for nu-
meric variables, the average, minimum/maximum, median, and standard deviation are
among the calculated measures whereas for categorical variables, the mode and fre-
quency tables are calculated), and correlation analysis, scatterplots, histograms, and box
plots can be used. A careful identification and selection of data sources and the most
relevant variables can make it easier for data mining algorithms to quickly discover useful
knowledge patterns.
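As an illustration, the short pandas sketch below (the customer table is fabricated) produces the kinds of summaries described above: descriptive statistics for numeric variables, a frequency table for a categorical variable, a correlation matrix, and histograms.

# Typical data-understanding summaries (pandas assumed; the small customer
# table is fabricated for illustration).
import pandas as pd

df = pd.DataFrame({
    "age":     [23, 47, 52, 31, 49, 38],
    "income":  [38000, 82000, 91000, 45000, 78000, 56000],
    "segment": ["bronze", "gold", "gold", "silver", "gold", "silver"],
})

print(df.describe())                 # average, min/max, quartiles, std. dev. for numeric variables
print(df["segment"].value_counts())  # mode/frequency table for a categorical variable
print(df[["age", "income"]].corr())  # correlation analysis
df[["age", "income"]].hist()         # histograms (requires matplotlib to display)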
Data sources for data selection can vary. Traditionally, data sources for business
applications include demographic data (such as income, education, number of house-
holds, and age), sociographic data (such as hobby, club membership, and entertain-
ment), transactional data (sales record, credit card spending, issued checks), and so on.
Today, data sources also use external (open or commercial) data repositories, social
media, and machine-generated data.
Data can be categorized as quantitative and qualitative. Quantitative data are mea-
sured using numeric values, or numeric data. They can be discrete (such as integers)
or continuous (such as real numbers). Qualitative data, also known as categorical data,
contain both nominal and ordinal data. Nominal data have finite nonordered values
(e.g., gender data, which have two values: male and female). Ordinal data have finite
ordered values. For example, customer credit ratings are considered ordinal data because
the ratings can be excellent, fair, and bad. A simple taxonomy of data (i.e., the nature of
data) is provided in Chapter 3.
Quantitative data can be readily represented by some sort of probability distribu-
tion. A probability distribution describes how the data are dispersed and shaped. For
instance, normally distributed data are symmetric and are commonly referred to as being
a bell-shaped curve. Qualitative data can be coded to numbers and then described by
frequency distributions. Once the relevant data are selected according to the data mining
business objective, data preprocessing should be pursued.
Step 3: Data Preparation
The purpose of data preparation (more commonly called data preprocessing) is to take
the data identified in the previous step and prepare it for analysis by data mining meth-
ods. Compared to the other steps in CRISP-DM, data preprocessing consumes the most
time and effort; most people believe that this step accounts for roughly 80 percent of
the total time spent on a data mining project. The reason for such an enormous effort
spent on this step is the fact that real-world data are generally incomplete (lacking at-
tribute values, lacking certain attributes of interest, or containing only aggregate data),
noisy (containing errors or outliers), and inconsistent (containing discrepancies in codes
or names). The nature of the data and the issues related to the preprocessing of data for
analytics are explained in detail in Chapter 3.
Step 4: Model Building
In this step, various modeling techniques are selected and applied to an already prepared
data set to address the specific business need. The model-building step also encompasses
the assessment and comparative analysis of the various models built. Because there is not
a universally known best method or algorithm for a data mining task, one should use a
variety of viable model types along with a well-defined experimentation and assessment
strategy to identify the “best” method for a given purpose. Even for a single method or
algorithm, a number of parameters need to be calibrated to obtain optimal results. Some
methods could have specific requirements in the way that the data are to be formatted;
thus, stepping back to the data preparation step is often necessary. Application Case 4.4
presents a research study in which a number of model types are developed and com-
pared to each other.
Application Case 4.4 Data Mining Helps in Cancer Research

According to the American Cancer Society, half of all
men and one-third of all women in the United States
will develop cancer during their lifetimes; approxi-
mately 1.5 million new cancer cases were expected
to be diagnosed in 2013. Cancer is the second most
common cause of death in the United States and
in the world, exceeded only by cardiovascular dis-
ease. This year, more than 500,000 Americans are
expected to die of cancer—more than 1,300 peo-
ple a day—accounting for nearly one of every four
deaths.
Cancer is a group of diseases generally char-
acterized by uncontrolled growth and spread of
abnormal cells. If the growth and/or spread are not
controlled, cancer can result in death. Even though
the exact reasons are not known, cancer is believed
to be caused by both external factors (e.g., tobacco,
infectious organisms, chemicals, and radiation) and
internal factors (e.g., inherited mutations, hormones,
immune conditions, and mutations that occur from
metabolism). These causal factors can act together
or in sequence to initiate or promote carcinogenesis.
Cancer is treated with surgery, radiation, chemo-
therapy, hormone therapy, biological therapy, and
targeted therapy. Survival statistics vary greatly by
cancer type and stage at diagnosis.
The five-year relative survival rate for all can-
cers is improving, and the decline in cancer mortal-
ity had reached 20 percent in 2013, translating into
the avoidance of about 1.2 million deaths from can-
cer since 1991. That’s more than 400 lives saved per
day! The improvement in survival reflects progress
in diagnosing certain cancers at an earlier stage and
improvements in treatment. Further improvements
are needed to prevent and treat cancer.
Even though cancer research has traditionally
been clinical and biological in nature, in recent years,
data-driven analytic studies have become a common
complement. In medical domains where data- and
analytics-driven research has been applied success-
fully, novel research directions have been identified
to further advance the clinical and biological stud-
ies. Using various types of data, including molecular,
clinical, literature-based, and clinical trial data, along
with suitable data mining tools and techniques,
researchers have been able to identify novel pat-
terns, paving the road toward a cancer-free society.
In one study, Delen (2009) used three popular
data mining techniques (decision trees, artificial neu-
ral networks, and SVMs) in conjunction with logistic
regression to develop prediction models for prostate
cancer survivability. The data set contained around
120,000 records and 77 variables. A k-fold cross-
validation methodology was used in model build-
ing, evaluation, and comparison. The results showed
that support vector models are the most accurate
predictor (with a test set accuracy of 92.85%) for
this domain followed by artificial neural networks
and decision trees. Furthermore, using a sensitivity–
analysis-based evaluation method, the study also
revealed novel patterns related to prognostic factors
of prostate cancer.
In a related study, Delen, Walker, and Kadam
(2005) used two data mining algorithms (artificial
neural networks and decision trees) and logistic
regression to develop prediction models for breast
cancer survival using a large data set (more than
200,000 cases). Using a 10-fold cross-validation
method to measure the unbiased estimate of the
prediction models for performance comparison pur-
poses, the researchers determined that the results
indicated that the decision tree (C5 algorithm) was
the best predictor with 93.6 percent accuracy on
the holdout sample (which was the best prediction
accuracy reported in the literature) followed by arti-
ficial neural networks with 91.2 percent accuracy,
and logistic regression, with 89.2 percent accu-
racy. Further analysis of prediction models revealed
prioritized importance of the prognostic factors,
which can then be used as a basis for further clinical
and biological research studies.
In the most recent study, Zolbanin et al. (2015)
studied the impact of comorbidity in cancer sur-
vivability. Although prior research has shown that
diagnostic and treatment recommendations might
be altered based on the severity of comorbidities,
chronic diseases are still being investigated in isola-
tion from one another in most cases. To illustrate
the significance of concurrent chronic diseases
in the course of treatment, their study used the
Surveillance, Epidemiology, and End Results (SEER)
Program’s cancer data to create two comorbid data
sets: one for breast and female genital cancers and
another for prostate and urinary cancers. Several pop-
ular machine-learning techniques are then applied
to the resultant data sets to build predictive mod-
els (see Figure 4.4). Comparison of the results has
shown that having more information about comor-
bid conditions of patients can improve models’ pre-
dictive power, which in turn can help practitioners
make better diagnostic and treatment decisions.
Therefore, the study suggested that proper identi-
fication, recording, and use of patients’ comorbidity
status can potentially lower treatment costs and ease
the healthcare-related economic challenges.
These examples (among many others in the
medical literature) show that advanced data min-
ing techniques can be used to develop models
that possess a high degree of predictive as well as
explanatory power. Although data mining meth-
ods are capable of extracting patterns and relation-
ships hidden deep in large and complex medical
databases, without the cooperation and feedback
from medical experts, their results are not of much
use. The patterns found via data mining methods
should be evaluated by medical professionals who
have years of experience in the problem domain
to decide whether they are logical, actionable, and
novel enough to warrant new research directions.
In short, data mining is not meant to replace medi-
cal professionals and researchers but to comple-
ment their invaluable efforts to provide data-driven
new research directions and to ultimately save more
human lives.
Questions for Case 4.4
1. How can data mining be used for ultimately cur-
ing illnesses like cancer?
2. What do you think are the promises and major
challenges for data miners in contributing to
medical and biological research endeavors?
Sources: H. M. Zolbanin, D. Delen, & A. H. Zadeh, “Predicting
Overall Survivability in Comorbidity of Cancers: A Data Mining
Approach,” Decision Support Systems, 74, 2015, pp. 150–161;
D. Delen, “Analysis of Cancer Data: A Data Mining Approach,”
Expert Systems, 26(1), 2009, pp. 100–112; J. Thongkam, G. Xu,
Y. Zhang, & F. Huang, “Toward Breast Cancer Survivability
Prediction Models Through Improving Training Space,” Expert
Systems with Applications, 36(10), 2009, pp. 12200–12209;
D. Delen, G. Walker, & A. Kadam, “Predicting Breast Cancer
Survivability: A Comparison of Three Data Mining Methods,”
Artificial Intelligence in Medicine, 34(2), 2005, pp. 113–127.
FIGURE 4.4 Data Mining Methodology for Investigation of Comorbidity in Cancer Survivability. (Several cancer databases are combined into a single data set, preprocessed through cleaning, selecting, and transforming, and partitioned into training and testing sets; artificial neural network, logistic regression, and random forest models are each trained, calibrated, and tested; the tabulated test results report accuracy, sensitivity, and specificity, and an assessment of variable importance yields tabulated relative variable importance results.)
Depending on the business need, the data mining task can be of a prediction (either
classification or regression), an association, or a clustering type. Each of these data min-
ing tasks can use a variety of data mining methods and algorithms. Some of these data
mining methods were explained earlier in this chapter, and some of the most popular
algorithms, including decision trees for classification, k-means for clustering, and the
Apriori algorithm for association rule mining, are described later in this chapter.
Step 5: Testing and Evaluation
In step 5, the developed models are assessed and evaluated for their accuracy and gen-
erality. This step assesses the degree to which the selected model (or models) meets
the business objectives and, if so, to what extent (i.e., Do more models need to be
developed and assessed?). Another option is to test the developed model(s) in a real-
world scenario if time and budget constraints permit. Even though the outcome of the
developed models is expected to relate to the original business objectives, other find-
ings that are not necessarily related to the original business objectives but that might
also unveil additional information or hints for future directions often are discovered.
The testing and evaluation step is a critical and challenging task. No value is added
by the data mining task until the business value obtained from discovered knowledge
patterns is identified and recognized. Determining the business value from discovered
knowledge patterns is somewhat similar to playing with puzzles. The extracted knowledge
patterns are pieces of the puzzle that need to be put together in the context of the specific
business purpose. The success of this identification operation depends on the interaction
among data analysts, business analysts, and decision makers (such as business managers).
Because data analysts might not have the full understanding of the data mining objectives
and what they mean to the business and the business analysts, and decision makers might
not have the technical knowledge to interpret the results of sophisticated mathematical
solutions, interaction among them is necessary. To properly interpret knowledge patterns,
it is often necessary to use a variety of tabulation and visualization techniques (e.g., pivot
tables, cross-tabulation of findings, pie charts, histograms, box plots, scatterplots).
Step 6: Deployment
Development and assessment of the models is not the end of the data mining project.
Even if the purpose of the model is to have a simple exploration of the data, the knowl-
edge gained from such exploration will need to be organized and presented in a way that
the end user can understand and benefit from. Depending on the requirements, the de-
ployment phase can be as simple as generating a report or as complex as implementing
a repeatable data mining process across the enterprise. In many cases, it is the customer,
not the data analyst, who carries out the deployment steps. However, even if the analyst
will not carry out the deployment effort, it is important for the customer to understand
up front what actions need to be carried out to actually make use of the created models.
The deployment step can also include maintenance activities for the deployed models.
Because everything about the business is constantly changing, the data that reflect the busi-
ness activities also are changing. Over time, the models (and the patterns embedded within
them) built on the old data can become obsolete, irrelevant, or misleading. Therefore, moni-
toring and maintenance of the models are important if the data mining results are to become
a part of the day-to-day business and its environment. A careful preparation of a maintenance
strategy helps avoid unnecessarily long periods of incorrect usage of data mining results. To
monitor the deployment of the data mining result(s), the project needs a detailed plan on the
monitoring process, which might not be a trivial task for complex data mining models.
Other Data Mining Standardized Processes and Methodologies
To be applied successfully, a data mining study must be viewed as a process that follows
a standardized methodology rather than as a set of automated software tools and tech-
niques. In addition to CRISP-DM, there is another well-known methodology developed
by the SAS Institute, called SEMMA (2009). The acronym SEMMA stands for “sample,
explore, modify, model, and assess.”
Beginning with a statistically representative sample of the data, SEMMA makes it
easy to apply exploratory statistical and visualization techniques, select and transform
the most significant predictive variables, model the variables to predict outcomes, and
confirm a model’s accuracy. A pictorial representation of SEMMA is given in Figure 4.5.
By assessing the outcome of each stage in the SEMMA process, the model devel-
oper can determine how to model new questions raised by the previous results and
thus proceed back to the exploration phase for additional refinement of the data; that
is, as with CRISP-DM, SEMMA is driven by a highly iterative experimentation cycle.
The main difference between CRISP-DM and SEMMA is that CRISP-DM takes a more
comprehensive approach—including understanding of the business and the relevant
data—to data mining projects whereas SEMMA implicitly assumes that the data mining
project’s goals and objectives along with the appropriate data sources have been identi-
fied and understood.
Some practitioners commonly use the term knowledge discovery in databases
(KDD) as a synonym for data mining. Fayyad et al. (1996) defined knowledge discov-
ery in databases as a process of using data mining methods to find useful information
and patterns in the data as opposed to data mining, which involves using algorithms
to identify patterns in data derived through the KDD process (see Figure 4.6). KDD is
a comprehensive process that encompasses data mining. The input to the KDD pro-
cess consists of organizational data. The enterprise data warehouse enables KDD to
be implemented efficiently because it provides a single source for data to be mined.
Dunham (2003) summarized the KDD process as consisting of the following steps:
data selection, data preprocessing, data transformation, data mining, and interpretation/
evaluation.
Figure 4.7 shows the polling results for the question, “What main methodology are
you using for data mining?” (conducted by KDnuggets.com in August 2007).
FIGURE 4.5 SEMMA Data Mining Process. (A cyclical process with feedback loops: Sample, generate a representative sample of the data; Explore, visualize and provide a basic description of the data; Modify, select variables and transform variable representations; Model, use a variety of statistical and machine-learning models; Assess, evaluate the accuracy and usefulness of the models.)
FIGURE 4.6 KDD (Knowledge Discovery in Databases) Process. (Sources of raw data feed five feedback-linked steps: (1) data selection produces the target data, (2) data cleaning produces the preprocessed data, (3) data transformation produces the transformed data, (4) data mining extracts patterns, and (5) internalization turns the extracted patterns into knowledge, or "actionable insight.")
FIGURE 4.7 Ranking of Data Mining Methodologies/Processes (poll responses for CRISP-DM, the respondent's own methodology, SEMMA, the KDD process, domain-specific methodologies, no methodology, other non-domain-specific methodologies, and the respondent's organization's methodology). Source: Used with permission from KDnuggets.com.
uSECTION 4.4 REVIEW QUESTIONS
1. What are the major data mining processes?
2. Why do you think the early phases (understanding of the business and understand-
ing of the data) take the longest amount of time in data mining projects?
3. List and briefly define the phases in the CRISP-DM process.
4. What are the main data-preprocessing steps? Briefly describe each step and provide
relevant examples.
5. How does CRISP-DM differ from SEMMA?
4.5 DATA MINING METHODS
Various methods are available for performing data mining studies, including classification,
regression, clustering, and association. Most data mining software tools employ more
than one technique (or algorithm) for each of these methods. This section describes the
most popular data mining methods and explains their representative techniques.
Classification
Classification is perhaps the most frequently used data mining method for real-world
problems. As a popular member of the machine-learning family of techniques, classifica-
tion learns patterns from past data (a set of information—traits, variables, features—on
characteristics of the previously labeled items, objects, or events) to place new instances
(with unknown labels) into their respective groups or classes. For example, one could use
classification to predict whether the weather on a particular day will be “sunny,” “rainy,”
or “cloudy.” Popular classification tasks include credit approval (i.e., good or bad credit
risk), store location (e.g., good, moderate, bad), target marketing (e.g., likely customer,
no hope), fraud detection (i.e., yes/no), and telecommunication (e.g., likely to turn to
another phone company, yes/no). If what is being predicted is a class label (e.g., “sunny,”
“rainy,” or “cloudy”), the prediction problem is called a classification; if it is a numeric
value (e.g., temperature, such as 68°F), the prediction problem is called a regression.
Even though clustering (another popular data mining method) can also be used
to determine groups (or class memberships) of things, there is a significant difference
between the two. Classification learns the function between the characteristics of things
(i.e., independent variables) and their membership (i.e., output variable) through a su-
pervised learning process in which both types (input and output) of variables are pre-
sented to the algorithm; in clustering, the membership of the objects is learned through
an unsupervised learning process by which only the input variables are presented to the
algorithm. Unlike classification, clustering does not have a supervising (or controlling)
mechanism that enforces the learning process; instead, clustering algorithms use one or
more heuristics (e.g., multidimensional distance measure) to discover natural groupings
of objects.
The most common two-step methodology of classification-type prediction involves
model development/training and model testing/deployment. In the model development
phase, a collection of input data, including the actual class labels, is used. After a model
has been trained, the model is tested against the holdout sample for accuracy assessment
and eventually deployed for actual use where it is to predict classes of new data instances
(where the class label is unknown). Several factors are considered in assessing the model,
including the following.
• Predictive accuracy. The model’s ability to correctly predict the class label of
new or previously unseen data. Prediction accuracy is the most commonly used
assessment factor for classification models. To compute this measure, actual class
labels of a test data set are matched against the class labels predicted by the model.
The accuracy can then be computed as the accuracy rate, which is the percentage
of test data set samples correctly classified by the model (more on this topic is pro-
vided later in the chapter).
• Speed. The computational costs involved in generating and using the model
where faster is deemed to be better.
• Robustness. The model’s ability to make reasonably accurate predictions given
noisy data or data with missing and erroneous values.
• Scalability. The ability to construct a prediction model efficiently given a rather
large amount of data.
• Interpretability. The level of understanding and insight provided by the model
(e.g., how and/or what the model concludes on certain predictions).
Estimating the True Accuracy of Classification Models
In classification problems, the primary source for accuracy estimation is the confusion
matrix (also called a classification matrix or a contingency table). Figure 4.8 shows a
confusion matrix for a two-class classification problem. The numbers along the diagonal
from the upper left to the lower right represent correct decisions, and the numbers out-
side this diagonal represent the errors.

FIGURE 4.8 Simple Confusion Matrix for Tabulation of Two-Class Classification Results.

                              True/Observed Class
                              Positive                   Negative
Predicted   Positive          True Positive (TP) Count   False Positive (FP) Count
Class       Negative          False Negative (FN) Count  True Negative (TN) Count
Table 4.1 provides equations for common accuracy metrics for classification models.
When the classification problem is not binary, the confusion matrix gets bigger (a
square matrix with the size of the unique number of class labels), and accuracy metrics
become limited to per class accuracy rates and the overall classifier accuracy.
(True Classification Rate)i = (True Classification)i / [(True Classification)i + (False Classification)i]

(Overall Classifier Accuracy) = Σ_{i=1}^{n} (True Classification)i / (Total Number of Cases)

where (True Classification)i and (False Classification)i are the numbers of correctly and incorrectly classified cases of class i, and n is the number of classes.

TABLE 4.1 Common Accuracy Metrics for Classification Models
• Accuracy = (TP + TN) / (TP + TN + FP + FN): the ratio of correctly classified instances (positives and negatives) to the total number of instances.
• True Positive Rate = TP / (TP + FN): also known as sensitivity, hit rate, or recall; the ratio of correctly classified positives to the total number of actual positives.
• True Negative Rate = TN / (TN + FP): also known as specificity; the ratio of correctly classified negatives to the total number of actual negatives (i.e., 1 minus the false alarm rate).
• Precision = TP / (TP + FP): the ratio of correctly classified positives to all instances predicted as positive.
• Recall = TP / (TP + FN): the ratio of correctly classified positives to all actual positives (identical to the true positive rate).
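The per-class rates above and the metrics in Table 4.1 can be reproduced with a few lines of Python. A minimal sketch using scikit-learn follows; the tiny label vectors reuse the weather classes mentioned earlier in the chapter and are fabricated purely for illustration.

```python
# Per-class true classification rates and overall accuracy from a confusion matrix.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = ["sunny", "rainy", "cloudy", "sunny", "rainy", "sunny"]   # actual class labels
y_pred = ["sunny", "rainy", "sunny",  "sunny", "cloudy", "sunny"]  # predicted class labels
labels = ["sunny", "rainy", "cloudy"]

cm = confusion_matrix(y_true, y_pred, labels=labels)  # rows = true class, columns = predicted class

for i, label in enumerate(labels):
    true_i = cm[i, i]          # correctly classified cases of class i
    total_i = cm[i, :].sum()   # all cases that actually belong to class i
    print(f"true classification rate for {label}: {true_i / total_i:.2f}")

print("overall classifier accuracy:", accuracy_score(y_true, y_pred))
```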
Estimating the accuracy of a classification model (or classifier) induced by a super-
vised learning algorithm is important for the following two reasons: First, it can be used
to estimate its future prediction accuracy, which could imply the level of confidence one
should have in the classifier’s output in the prediction system. Second, it can be used for
choosing a classifier from a given set (identifying the “best” classification model among
the many trained). The following are among the most popular estimation methodologies
used for classification-type data mining models.
SIMPLE SPLIT The simple split (or holdout or test sample estimation) partitions the data
into two mutually exclusive subsets called a training set and a test set (or holdout set). It
is common to designate two-thirds of the data as the training set and the remaining one-
third as the test set. The training set is used by the inducer (model builder), and the built
classifier is then tested on the test set. An exception to this rule occurs when the classifier
is an artificial neural network. In this case, the data are partitioned into three mutually ex-
clusive subsets: training, validation, and testing. The validation set is used during model
building to prevent overfitting. Figure 4.9 shows the simple split methodology.
FIGURE 4.9 Simple Random Data Splitting. (The preprocessed data are randomly partitioned into a two-thirds training set used for model development and a one-third testing set used for model assessment, or scoring, which yields the trained classifier and its prediction accuracy.)

The main criticism of this method is that it makes the assumption that the data in the two subsets are of the same kind (i.e., have the exact same properties). Because this is a simple random partitioning, in most realistic data sets where the data are skewed on the classification variable, such an assumption might not hold true. To improve this
situation, stratified sampling is suggested, where the strata become the output variable.
Even though this is an improvement over the simple split, it still has a bias associated
from the single random partitioning.
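A minimal scikit-learn sketch of the simple split, and of its stratified variant, is shown below. The breast cancer data set bundled with scikit-learn is used only as a stand-in for any preprocessed data.

```python
# Simple (holdout) split and stratified split of a preprocessed data set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Plain random partitioning: two-thirds training, one-third testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=1)

# Stratified partitioning: class proportions are preserved in both subsets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=1/3, random_state=1, stratify=y)
```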
K-FOLD CROSS-VALIDATION To minimize the bias associated with the random sampling
of the training and holdout data samples in comparing the predictive accuracy of two
or more methods, one can use a methodology called k-fold cross-validation. In k-fold
cross-validation, also called rotation estimation, the complete data set is randomly split
into k mutually exclusive subsets of approximately equal size. The classification model
is trained and tested k times. Each time it is trained on all but one fold and then tested
on the remaining single fold. The cross-validation estimate of the overall accuracy of a
model is calculated by simply averaging the k individual accuracy measures, as shown in
the following equation:
CVA = (1/k) Σ_{i=1}^{k} Ai
where CVA stands for cross-validation accuracy, k is the number of folds used, and A is
the accuracy measure (e.g., hit rate, sensitivity, specificity) of each fold. Figure 4.10 shows
a graphical illustration of k-fold cross-validation where k is set to 10.
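The same cross-validation accuracy (CVA) calculation can be sketched in a few lines of scikit-learn code; the data set and the decision tree model below are arbitrary choices made only to keep the example self-contained.

```python
# 10-fold cross-validation accuracy (CVA).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=1)

fold_accuracies = cross_val_score(model, X, y, cv=10, scoring="accuracy")  # A_1 ... A_k
cva = fold_accuracies.mean()  # CVA = (1/k) * sum of the individual fold accuracies
print(f"10-fold cross-validation accuracy: {cva:.3f}")
```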
FIGURE 4.10 Graphical Depiction of k-Fold Cross-Validation (the data set is split into 10 folds of roughly 10 percent each, and training and testing are repeated for all 10 folds).

ADDITIONAL CLASSIFICATION ASSESSMENT METHODOLOGIES Other popular assessment methodologies include the following:
• Leave one out. The leave-one-out method is similar to k-fold cross-validation where k is set to the number of records in the data set so that each fold contains a single sample; that is, every data point is used for testing exactly once, and as many models are developed as there are data points. This is a time-consuming methodology, but for small data sets, it is sometimes a viable option.
• Bootstrapping. With bootstrapping, a fixed number of instances from the original data are sampled (with replacement) for training, and the rest of the data set is used for testing. This process is repeated as many times as desired.
• Jackknifing. Though similar to the leave-one-out methodology, with jackknifing, the accuracy is calculated by leaving one sample out at each iteration of the estimation process.
• Area under the ROC curve. The area under the ROC curve is a graphical assessment technique that plots the true positive rate on the y-axis and the false positive rate on the x-axis. The area under the ROC curve determines the accuracy
measure of a classifier: A value of 1 indicates a perfect classifier; 0.5 indicates no
better than random chance; in reality, the values would range between the two
extreme cases. For example, in Figure 4.11, A has a better classification performance
than B, whereas C is not any better than the random chance of flipping a coin.

FIGURE 4.11 Sample ROC Curve (true positive rate, or sensitivity, on the y-axis versus false positive rate, or 1 − specificity, on the x-axis for three hypothetical classifiers A, B, and C).
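A brief sketch of computing the ROC curve and the area under it with scikit-learn follows; the data set and the logistic regression classifier are illustrative stand-ins, not choices made by the text.

```python
# ROC curve and area under the curve (AUC) for a two-class classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=1)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]      # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)   # points of the ROC curve
print("AUC:", roc_auc_score(y_test, scores))       # 1.0 = perfect, 0.5 = random guessing
```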
Estimating the Relative Importance of Predictor Variables
Data mining methods (i.e., machine-learning algorithms) are really good at capturing
complex relationships between input and output variables (producing very accurate
prediction models) but are not nearly as good at explaining how they do what they do
(i.e., model transparency). To mitigate this deficiency (also called the black-box syn-
drome), the machine-learning community proposed several methods, most of which are
characterized as sensitivity analysis. In the context of predictive modeling, sensitivity
analysis refers to an exclusive experimentation process aimed at discovering the cause-
and-effect relationship between the input variables and output variable. Some of the
variable importance methods are algorithm specific (e.g., applicable only to decision trees) and
some are algorithm agnostic. Here are the most commonly used variable importance
methods employed in machine learning and predictive modeling:
1. Developing and observing a well-trained decision tree model to see the relative dis-
cernibility of the input variables—the closer to the root of the tree a variable is used
to split, the greater is its importance/relative-contribution to the prediction model.
2. Developing and observing a rich and large random forest model and assessing the
variable split statistics. The larger the ratio of the number of times a variable is selected as the top (level-0) splitter to the number of times it is randomly picked as one of the split candidates, the greater is its importance/relative contribution.
3. Sensitivity analysis based on input value perturbation by which the input variables
are gradually changed/perturbed one at a time and the relative change in the output
is observed—the larger the change in the output, the greater the importance of the
perturbed variable. This method is often used in feed-forward neural network mod-
eling when all of the variables are numeric and standardized/normalized. Because
this method is covered in Chapter 6 within the context of deep learning and deep
neural networks, it is not explained here.
4. Sensitivity analysis based on leave-one-out methodology. This method can be used
for any type of predictive analytics method and therefore is further explained as
follows.
The sensitivity analysis (based on leave-one-out methodology) relies on the experi-
mental process of systematically removing input variables, one at a time, from the input
variable set, developing and testing a model, and observing the impact of the absence of
this variable on the predictive performance of the machine-learning model. The model is
trained and tested (often using a k-fold cross validation) for each input variable (i.e., its
absence in the input variable collection) to measure its contribution/importance to the
model. A graphical depiction of the process is shown in Figure 4.12.

FIGURE 4.12 Graphical Depiction of the Sensitivity Analysis Process. (Inputs to a trained machine-learning model, the "black box," are systematically perturbed, added, or removed, and the observed changes in the outputs are tabulated as relative variable importance.)
This method is often used for support vector machines, decision trees, logistic re-
gression, and artificial neural networks. In his sensitivity analysis book, Saltelli (2002)
formalized the algebraic representation of this measurement process:
Si = Vi / V(Ft) = V(E(Ft|Xi)) / V(Ft)
In the denominator of the equation, V(Ft) refers to the variance in the output variable. In the numerator, V(E(Ft|Xi)), the expectation operator E calls for an integral over all input variables except Xi (i.e., the expected output conditional on Xi); the variance operator V then applies a further integral over Xi. The variable contribution (i.e., importance), represented as Si for the ith variable, is calculated as this normalized sensitivity measure. In a later study, Saltelli et al. (2004) proved that this equation is the most probable
measure of model sensitivity that is capable of ranking input variables (i.e., the predic-
tors) in the order of importance for any combination of interactions including the non-
orthogonal relationships among the input variables. To properly combine the sensitivity
analysis results for several prediction methods, one can use an information fusion–based
methodology, particularly by modifying the preceding equation in such a way that the
sensitivity measure of an input variable n is obtained based on the information combined (i.e., fused) from m prediction models. The following equation represents this
weighted summation function.
Sn(fused) = Σ_{i=1}^{m} vi Sin = v1S1n + v2S2n + … + vmSmn
In this equation, vi represents the normalized contribution/weight for each predic-
tion model in which the level of contribution/weight of a model is calculated as a func-
tion of its relative predictive power—the larger the prediction power (i.e., accuracy) is,
the higher is the value of v.
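A minimal sketch of the leave-one-out sensitivity analysis described above is given below: one input variable is removed at a time, the model is retrained and cross-validated, and the resulting drop in accuracy is treated as that variable's importance. The data set and the random forest model are illustrative assumptions only.

```python
# Leave-one-variable-out sensitivity analysis for variable importance.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = load_breast_cancer()
X, y = data.data, data.target
model = RandomForestClassifier(n_estimators=100, random_state=1)

baseline = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

importance = {}
for i, name in enumerate(data.feature_names):
    X_reduced = np.delete(X, i, axis=1)                       # remove one input variable
    acc = cross_val_score(model, X_reduced, y, cv=5).mean()   # retrain and test without it
    importance[name] = baseline - acc                         # larger drop = more important

for name, score in sorted(importance.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{name}: {score:.4f}")
```

To fuse the results of several model types, as in the weighted summation above, the same loop could be repeated per model and the normalized importance scores combined using each model's cross-validated accuracy as its weight.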
CLASSIFICATION TECHNIQUES A number of techniques (or algorithms) are used for clas-
sification modeling, including the following:
• Decision tree analysis. Decision tree analysis (a machine-learning technique)
is arguably the most popular classification technique in the data mining arena. A
detailed description of this technique is given in the following section.
• Statistical analysis. Statistical techniques were the primary classification algo-
rithm for many years until the emergence of machine-learning techniques. Statistical
classification techniques include logistic regression and discriminant analysis, both
of which make the assumptions that the relationships between the input and output
variables are linear in nature, the data are normally distributed, and the variables are
not correlated and are independent of each other. The questionable nature of these
assumptions has led to the shift toward machine-learning techniques.
• Neural networks. These are among the most popular machine-learning tech-
niques that can be used for classification-type problems.
• Case-based reasoning. This approach uses historical cases to recognize com-
monalities to assign a new case into the most probable category.
• Bayesian classifiers. This approach uses probability theory to build classifi-
cation models based on the past occurrences that are capable of placing a new
instance into a most probable class (or category).
• Genetic algorithms. This is the use of the analogy of natural evolution to build
directed-search-based mechanisms to classify data samples.
• Rough sets. This method takes into account the partial membership of class
labels to predefined categories in building models (collection of rules) for classifica-
tion problems.
A complete description of all of these classification techniques is beyond the scope
of this book; thus, only several of the most popular ones are presented here.
DECISION TREES Before describing the details of decision trees, we need to discuss
some simple terminology. First, decision trees include many input variables that might
have an impact on the classification of different patterns. These input variables are usu-
ally called attributes. For example, if we were to build a model to classify loan risks on
the basis of just two characteristics—income and a credit rating—these two characteristics
would be the attributes, and the resulting output would be the class label (e.g., low, me-
dium, or high risk). Second, a tree consists of branches and nodes. A branch represents
the outcome of a test to classify a pattern using one of the attributes. A leaf node at the
end represents the final class choice for a pattern (a chain of branches from the root node
to the leaf node, which can be represented as a complex if-then statement).
The basic idea behind a decision tree is that it recursively divides a training set until
each division consists entirely or primarily of examples from one class. Each nonleaf
node of the tree contains a split point, which is a test of one or more attributes and deter-
mines how the data are to be divided further. Decision tree algorithms, in general, build
an initial tree from training data such that each leaf node is pure, and they then prune the
tree to increase its generalization, and hence, the prediction accuracy of test data.
In the growth phase, the tree is built by recursively dividing the data until each divi-
sion is either pure (i.e., contains members of the same class) or relatively small. The basic
idea is to ask questions whose answers would provide the most information, similar to
what we do when playing the game “Twenty Questions.”
The split used to partition the data depends on the type of the attribute used in the split. For a continuous attribute A, splits are of the form value(A) < x, where x is some “optimal” split value of A. For example, the split based on income could be “Income < 50,000.” For the categorical attribute A, splits are of the form value(A) belongs to x, where x is a subset of A. As an example, the split could be on the basis of gender: “Male versus Female.”
A general algorithm for building a decision tree is as follows:
1. Create a root node and assign all of the training data to it.
2. Select the best splitting attribute.
3. Add a branch to the root node for each value of the split. Split the data into mutually
exclusive (nonoverlapping) subsets along the lines of the specific split and move to
the branches.
4. Repeat steps 2 and 3 for each and every leaf node until a stopping criterion is
reached (e.g., the node is dominated by a single class label).
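Off-the-shelf libraries automate this recursive procedure. The brief sketch below uses scikit-learn's DecisionTreeClassifier, a CART-style implementation that supports both the Gini index and entropy-based splitting discussed next; the data set and parameter values are illustrative only.

```python
# Building, pruning (via a stopping criterion), and inspecting a decision tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=1/3, random_state=1)

tree = DecisionTreeClassifier(criterion="gini",  # or "entropy" for information gain
                              max_depth=3,       # stopping criterion (pre-pruning)
                              random_state=1)
tree.fit(X_train, y_train)

print("holdout accuracy:", tree.score(X_test, y_test))
print(export_text(tree, feature_names=list(data.feature_names)))  # chains of if-then splits
```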
Many different algorithms have been proposed for creating decision trees. These
algorithms differ primarily in terms of the way in which they determine the splitting attri-
bute (and its split values), the order of splitting the attributes (splitting the same attribute
only once or many times), the number of splits at each node (binary versus ternary), the
stopping criteria, and the pruning of the tree (pre- versus postpruning). Some of the most
well-known algorithms are ID3 (followed by C4.5 and C5 as the improved versions of
ID3) from machine learning, classification and regression trees (CART) from statistics, and
the chi-squared automatic interaction detector (CHAID) from pattern recognition.
When building a decision tree, the goal at each node is to determine the attribute
and the split point of that attribute that best divides the training records to purify the class
representation at that node. To evaluate the goodness of the split, some splitting indices
have been proposed. Two of the most common ones are the Gini index and information
gain. The Gini index is used in CART and Scalable PaRallelizable INduction of Decision
Trees (SPRINT) algorithms. Versions of information gain are used in ID3 (and its newer
versions, C4.5 and C5).
The Gini index has been used in economics to measure the diversity of a popu-
lation. The same concept can be used to determine the purity of a specific class as the
result of a decision to branch along a particular attribute or variable. The best split is the
one that increases the purity of the sets resulting from a proposed split. Let us briefly look
into a simple calculation of the Gini index.
If a data set S contains examples from n classes, the Gini index is defined as
gini(S) = 1 − Σ_{j=1}^{n} pj²

where pj is the relative frequency of class j in S. If the data set S is split into two subsets, S1 and S2, with sizes N1 and N2, respectively (where N = N1 + N2), the Gini index of the split data is defined as

ginisplit(S) = (N1/N) gini(S1) + (N2/N) gini(S2)
The attribute/split combination that provides the smallest ginisplit(S) is chosen to
split the node. In such a determination, one should enumerate all possible splitting points
for each attribute.
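The arithmetic can be traced with a short Python sketch; the class counts below are invented purely to illustrate how a candidate split is scored, and do not come from the text.

```python
# Gini index of a set and of a candidate split (illustrative class counts).
def gini(class_counts):
    n = sum(class_counts)
    return 1 - sum((c / n) ** 2 for c in class_counts)

def gini_split(subset_counts):
    n = sum(sum(counts) for counts in subset_counts)
    return sum(sum(counts) / n * gini(counts) for counts in subset_counts)

S = [6, 6]                # e.g., 6 low-risk and 6 high-risk loan applicants
split = [[5, 1], [1, 5]]  # candidate split such as Income < 50,000 versus Income >= 50,000
print(gini(S))            # 0.5: maximally impure for two equally sized classes
print(gini_split(split))  # about 0.278: the split increases purity, so it is a good candidate
```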
Information gain is the splitting mechanism used in ID3, which is perhaps the
most widely known decision tree algorithm. It was developed by Ross Quinlan in 1986,
and since then, he has evolved this algorithm into the C4.5 and C5 algorithms. The basic
idea behind ID3 (and its variants) is to use a concept called entropy in place of the Gini
index. Entropy measures the extent of uncertainty or randomness in a data set. If all the
data in a subset belong to just one class, there is no uncertainty or randomness in that
data set, so the entropy is zero. The objective of this approach is to build subtrees so that
the entropy of each final subset is zero (or close to zero). Let us also look at the calcula-
tion of the information gain.
Assume that there are two classes: P (positive) and N (negative). Let the set of ex-
amples S contain p counts of class P and n counts of class N. The amount of information
needed to decide if an arbitrary example in S belongs to P or N is defined as
I(p, n) = −[p/(p + n)] log2[p/(p + n)] − [n/(p + n)] log2[n/(p + n)]
Assume that using attribute A, the set S will be partitioned into sets {S1, S2, . . . , Sv}. If Si contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is

E(A) = Σ_{i=1}^{v} [(pi + ni)/(p + n)] I(pi, ni)
Then the information that would be gained by branching on attribute A would be
Gain(A) = I(p,n) – E(A)
These calculations are repeated for each and every attribute, and the one with the highest
information gain is selected as the splitting attribute. The basic ideas behind these splitting
indices are rather similar, but the specific algorithmic details vary. A detailed definition of
the ID3 algorithm and its splitting mechanism can be found in Quinlan (1986).
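The information gain calculation can likewise be written out in a few lines; the counts in the example call are fabricated for illustration.

```python
# Entropy-based information gain for a candidate attribute (illustrative counts).
from math import log2

def info(p, n):
    """I(p, n): information needed to classify a set with p positives and n negatives."""
    total = p + n
    return sum(-c / total * log2(c / total) for c in (p, n) if c > 0)

def gain(parent, subsets):
    """Gain(A) = I(p, n) - E(A) for a split of the parent set into the given subsets."""
    p, n = parent
    e_a = sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in subsets)
    return info(p, n) - e_a

# Attribute A splits a set of 9 positives and 5 negatives into (6P, 1N) and (3P, 4N)
print(gain((9, 5), [(6, 1), (3, 4)]))
```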
Application Case 4.5 illustrates how significant the gains can be if the right data min-
ing techniques are used for a well-defined business problem.
Cluster Analysis for Data Mining
Cluster analysis is an essential data mining method for classifying items, events, or con-
cepts into common groupings called clusters. The method is commonly used in biology,
medicine, genetics, social network analysis, anthropology, archaeology, astronomy, char-
acter recognition, and even in management information systems (MIS) development. As
data mining has increased in popularity, its underlying techniques have been applied to
business, especially to marketing. Cluster analysis has been used extensively for fraud detection (both credit card and e-commerce) and market segmentation of customers in contemporary CRM systems. More applications in business continue to be developed as the strength of cluster analysis is recognized and used.

Cluster analysis is an exploratory data analysis tool for solving classification problems. The objective is to sort cases (e.g., people, things, events) into groups, or clusters, so that the degree of association is strong among members of the same cluster and weak among members of different clusters. Each cluster describes the class to which its members belong. An obvious one-dimensional example of cluster analysis is to establish score ranges into which to assign class grades for a college class. This is similar to the cluster analysis problem that the U.S. Treasury faced when establishing new tax brackets in the 1980s. A fictional example of clustering occurs in J. K. Rowling’s Harry Potter books. The Sorting Hat determines to which House (e.g., dormitory) to assign first-year students at the Hogwarts School. Another example involves determining how to seat guests at a wedding. As far as data mining goes, the importance of cluster analysis is that it can reveal associations and structures in data that were not previously apparent but are sensible and useful once found.

Application Case 4.5 Influence Health Uses Advanced Predictive Analytics to Focus on the Factors That Really Influence People’s Healthcare Decisions
Influence Health, Inc. provides the healthcare
industry’s only integrated digital consumer engage-
ment and activation platform. It enables providers,
employers, and payers to positively influence con-
sumer decision making and health behaviors well
beyond the physical care setting through personal-
ized and interactive multichannel engagement. Since
1996, the Birmingham, Alabama–based company
has helped more than 1,100 provider organizations
influence consumers in a way that transforms finan-
cial and quality outcomes.
Healthcare is a personal business. Each
patient’s needs are different and require an indi-
vidual response. On the other hand—as the cost
of providing healthcare services continues to rise—
hospitals and health systems increasingly need to
harness economies of scale by catering to larger
and larger populations. The challenge then becomes
to provide a personalized approach while operat-
ing on a large scale. Influence Health specializes in
helping its healthcare sector clients solve this chal-
lenge by getting to know their existing and poten-
tial patients better and targeting each individual with
the appropriate health services at the right time.
Advanced predictive analytics technology from IBM
allows Influence Health to help its clients discover
the factors that have the most influence on patients’
healthcare decisions. By assessing the propensity of
hundreds of millions of prospects to require spe-
cific healthcare services, Influence Health is able to
boost revenues and response rates for healthcare
campaigns, improving outcomes for its clients and
their patients alike.
Targeting the Savvy Consumer
Today’s healthcare industry is becoming more com-
petitive than ever before. If the use of an organiza-
tion’s services drops, so do its profits. Rather than
simply seeking out the nearest hospital or clinic,
consumers are now more likely to make positive
choices among healthcare providers. Paralleling
efforts that are common in other industries, health-
care organizations must make increased efforts to
market themselves effectively to both existing and
potential patients, building long-term engagement
and loyalty.
The keys to successful healthcare marketing
are timeliness and relevance. If you can predict
what kind of health services an individual prospect
might need, you can engage and influence her or
him much more effectively for wellness care.
Venky Ravirala, chief analytics officer at
Influence Health, explains, “Healthcare organiza-
tions risk losing people’s attention if they bombard
them with irrelevant messaging. We help our clients
avoid this risk by using analytics to segment their
existing and potential prospects and market to them
in a much more personal and relevant way.”
Faster and More Flexible Analytics
As its client base has expanded, the total volume
of data in Influence Health’s analytics systems has
grown to include over 195 million patient records
with a detailed disease encounter history for several
million patients. Ravirala comments, “With so much
data to analyze, our existing method of scoring data
was becoming too complex and time-consuming.
We wanted to be able to extract insights at greater
speed and accuracy.”
By leveraging predictive analytics software from
IBM, Influence Health is now able to develop models
that calculate how likely each patient is to require
particular services and express this likelihood as a
percentage score. Microsegmentation and numerous
disease-specific models draw on demographic, socio-
economic, geographical, behavioral, disease history,
and census data and examine different aspects of
each patient’s predicted healthcare needs.
“The IBM solution allows us to combine all
these models using an ensemble technique, which
helps to overcome the limitations of individual mod-
els and provide more accurate results,” comments
Venky Ravirala, chief analytics officer at Influence
Health. “It gives us the flexibility to apply multiple
techniques to solve a problem and arrive at the best
solution. It also automates much of the analytics
process, enabling us to respond to clients’ requests
faster than before, and often give them a much
deeper level of insight into their patient population.”
For example, Influence Health decided to find
out how disease prevalence and risk vary between
different cohorts within the general population. By
using very sophisticated cluster analysis techniques,
the company was able to discover new comorbidity
patterns that improve risk predictability for over 100
common diseases by up to 800 percent.
This helps to reliably differentiate between
high-risk and very high-risk patients—making it eas-
ier to target campaigns at the patients and prospects
who need them most. With insights like these in
hand, Influence Health is able to use its healthcare
marketing expertise to advise its clients on how best
to allocate marketing resources.
“Our clients make significant budgeting deci-
sions based on the guidance we give them,” states
Ravirala. “We help them maximize the impact of
one-off campaigns—such as health insurance mar-
ketplace campaigns when Obamacare began—as
well as their long-term strategic plans and ongoing
marketing communications.”
Reaching the Right Audience
By enabling its clients to target their marketing activi-
ties more effectively, Influence Health is helping to
drive increased revenue and enhance population
health. “Working with us, clients have been able to
achieve return on investment of up to 12 to 1 through
better targeted marketing,” elaborates Ravirala. “And
it’s not just about revenues: by ensuring that vital
healthcare information gets sent to the people who
need it, we are helping our clients improve general
health levels in the communities they serve.”
Influence Health continues to refine its modeling
techniques, gaining an ever-deeper understanding
of the critical attributes that influence healthcare deci-
sions. With a flexible analytics toolset at its fingertips,
the company is well equipped to keep improving its
service to clients. Ravirala explains, “In the future,
we want to take our understanding of patient and
prospect data to the next level, identifying patterns
in behavior and incorporating analysis with machine-
learning libraries. IBM SPSS has already given us the
ability to apply and combine multiple models without
writing a single line of code. We’re eager to further
leverage this IBM solution as we expand our health-
care analytics to support clinical outcomes and popu-
lation health management services.”
“We are achieving analytics on an unprec-
edented scale. Today, we can analyze 195 million
records with 35 different models in less than two
days—a task which was simply not possible for us
in the past,” says Ravirala.
Questions for Case 4.5
1. What does Influence Health do?
2. What were the company’s challenges, proposed
solutions, and obtained results?
3. How can data mining help companies in the
healthcare industry (in ways other than the ones
mentioned in this case)?
Source: Reprint Courtesy of International Business Machines
Corporation, © (2018) International Business Machines Corporation.
Cluster analysis results can be used to
• Identify a classification scheme (e.g., types of customers).
• Suggest statistical models to describe populations.
• Indicate rules for assigning new cases to classes for identification, targeting, and
diagnostic purposes.
• Provide measures of definition, size, and change in what were previously broad
concepts.
• Find typical cases to label and represent classes.
• Decrease the size and complexity of the problem space for other data mining
methods.
• Identify outliers in a specific domain (e.g., rare-event detection).
DETERMINING THE OPTIMAL NUMBER OF CLUSTERS Clustering algorithms usually re-
quire one to specify the number of clusters to find. If this number is not known from
prior knowledge, it should be chosen in some way. Unfortunately, there is not an op-
timal way to calculate what this number is supposed to be. Therefore, several different
heuristic methods have been proposed. The following are among the most commonly
referenced ones:
• Look at the percentage of variance explained as a function of the number of clus-
ters; that is, choose a number of clusters so that adding another cluster would not
give much better modeling of the data. Specifically, if one graphs the percentage of
variance explained by the clusters, there is a point at which the marginal gain will
drop (giving an angle in the graph), indicating the number of clusters to be chosen (see the sketch after this list).
Set the number of clusters to (n/2)^(1/2), where n is the number of data points.
• Use the Akaike information criterion (AIC), which is a measure of the goodness of
fit (based on the concept of entropy), to determine the number of clusters.
• Use Bayesian information criterion, a model-selection criterion (based on maximum
likelihood estimation), to determine the number of clusters.
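The first heuristic in the list, often called the elbow method, can be sketched as follows; the synthetic blobs data and the range of k values are arbitrary illustrative choices.

```python
# "Elbow" heuristic: watch how the within-cluster variation drops as k grows.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=1)  # synthetic data

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    # inertia_ is the within-cluster sum of squares; the point where it stops
    # dropping sharply (the "angle" in the plot) suggests the number of clusters.
    print(k, round(km.inertia_, 1))
```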
ANALYSIS METHODS Cluster analysis might be based on one or more of the following
general methods:
• Statistical methods (including both hierarchical and nonhierarchical), such as k-
means or k-modes.
• Neural networks (with the architecture called self-organizing map).
• Fuzzy logic (e.g., fuzzy c-means algorithm).
• Genetic algorithms.
Each of these methods generally works with one of two general method classes:
• Divisive. With divisive classes, all items start in one cluster and are broken apart.
• Agglomerative. With agglomerative classes, all items start in individual clusters,
and the clusters are joined together.
Most cluster analysis methods involve the use of a distance measure to calculate
the closeness between pairs of items. Popular distance measures include Euclidean distance (the ordinary distance between two points that one would measure with a ruler) and Manhattan distance (also called the rectilinear distance or taxicab distance) between two points. Often, they are based on true distances that are measured, but this need not
be so, as is typically the case in IS development. Weighted averages can be used to es-
tablish these distances. For example, in an IS development project, individual modules of
the system can be related by the similarity between their inputs, outputs, processes, and
the specific data used. These factors are then aggregated, pairwise by item, into a single
distance measure.
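For two items represented as numeric vectors, these two distance measures can be computed directly; the vectors below are arbitrary examples.

```python
# Euclidean ("ruler") and Manhattan (rectilinear/taxicab) distances between two items.
import numpy as np

a = np.array([3.0, 4.0, 1.0])
b = np.array([0.0, 0.0, 2.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))
manhattan = np.sum(np.abs(a - b))
print(euclidean, manhattan)
```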
K-MEANS CLUSTERING ALGORITHM The k-means algorithm (where k stands for the pre-
determined number of clusters) is arguably the most referenced clustering algorithm. It has
its roots in traditional statistical analysis. As the name implies, the algorithm assigns each
data point (customer, event, object, etc.) to the cluster whose center (also called the cen-
troid) is the nearest. The center is calculated as the average of all the points in the cluster;
that is, its coordinates are the arithmetic mean for each dimension separately over all the
points in the cluster. The algorithm steps follow and are shown graphically in Figure 4.13:
Initialization step: Choose the number of clusters (i.e., the value of K ).
Step 1: Randomly generate k random points as initial cluster centers.
Step 2: Assign each point to the nearest cluster center.
Step 3: Recompute the new cluster centers.
Repetition step: Repeat steps 2 and 3 until some convergence criterion is met (usually that the assignment of points to clusters becomes stable).

FIGURE 4.13 Graphical Illustration of the Steps in the k-Means Algorithm.
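A bare-bones NumPy sketch of these steps follows. It is an illustration of the algorithm rather than a production implementation (in practice a library routine such as scikit-learn's KMeans would be used), and it initializes the centers by picking k existing data points, a common variant of Step 1.

```python
# Minimal k-means following the steps listed above.
import numpy as np

def k_means(X, k, n_iter=100, seed=1):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # Step 1: initial cluster centers
    for _ in range(n_iter):
        # Step 2: assign each point to the nearest cluster center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute the new cluster centers (keep the old center if a cluster is empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):                # repetition step: stop on convergence
            break
        centers = new_centers
    return labels, centers

X = np.random.default_rng(0).normal(size=(300, 2))           # made-up two-dimensional data
labels, centers = k_means(X, k=3)
print(centers)
```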
Association Rule Mining
Association rule mining (also known as affinity analysis or market-basket analysis) is a
popular data mining method that is commonly used as an example to explain what data
mining is and what it can do to a technologically less-savvy audience. Most of you might
have heard the famous (or infamous, depending on how you look at it) relationship dis-
covered between the sales of beer and diapers at grocery stores. As the story goes, a large
supermarket chain (maybe Walmart, maybe not; there is no consensus on which super-
market chain it was) did an analysis of customers’ buying habits and found a statistically
significant correlation between purchases of beer and purchases of diapers. It was theo-
rized that the reason for this was that fathers (presumably young men) were stopping off
at the supermarket to buy diapers for their babies (especially on Thursdays), and because
they could no longer go down to the sports bar as often, would buy beer as well. As a
result of this finding, the supermarket chain is alleged to have placed the diapers next to
the beer, resulting in increased sales of both.
In essence, association rule mining aims to find interesting relationships (affinities)
between variables (items) in large databases. Because of its successful application to
retail business problems, it is commonly called market-basket analysis. The main idea
FIGURE 4.13 Graphical Illustration of the Steps in the k-Means Algorithm (panels: Step 1, Step 2, Step 3).
in market-basket analysis is to identify strong relationships among different products (or
services) that are usually purchased together (show up in the same basket together, ei-
ther a physical basket at a grocery store or a virtual basket at an e-commerce Web site).
For example, 65 percent of those who buy comprehensive automobile insurance also
buy health insurance; 80 percent of those who buy books online also buy music online;
60 percent of those who have high blood pressure and are overweight have high cho-
lesterol; 70 percent of the customers who buy a laptop computer and virus protection
software also buy extended service plans.
The input to market-basket analysis is simple point-of-sale transaction data, where the products and/or services purchased together (just like the content
of a purchase receipt) are tabulated under a single transaction instance. The outcome of
the analysis is invaluable information that can be used to better understand customer-
purchase behavior to maximize the profit from business transactions. A business can take
advantage of such knowledge by (1) putting the items next to each other to make it more
convenient for the customers to pick them up together and not forget to buy one when
buying the others (increasing sales volume), (2) promoting the items as a package—do
not put one on sale if the other(s) are on sale, and (3) placing them apart from each other so that the customer has to walk the aisles to search for them and, by doing so, potentially sees and buys other items.
Applications of market-basket analysis include cross-marketing, cross-selling, store
design, catalog design, e-commerce site design, optimization of online advertising, prod-
uct pricing, and sales/promotion configuration. In essence, market-basket analysis helps
businesses infer customer needs and preferences from their purchase patterns. Outside
the business realm, association rules are successfully used to discover relationships
between symptoms and illnesses, diagnosis and patient characteristics and treatments
(which can be used in a medical decision support system), and genes and their functions
(which can be used in genomics projects), among others. Here are a few common areas
and uses for association rule mining:
• Sales transactions: Combinations of retail products purchased together can be
used to improve product placement on the sales floor (placing products that go to-
gether in close proximity) and promotional pricing of products (not having promo-
tions on both products that are often purchased together).
• Credit card transactions: Items purchased with a credit card provide insight
into other products the customer is likely to purchase or fraudulent use of credit
card numbers.
• Banking services: The sequential patterns of services used by customers (check-
ing account followed by savings account) can be used to identify other services they
might be interested in (investment account).
• Insurance service products: Bundles of insurance products bought by custom-
ers (car insurance followed by home insurance) can be used to propose additional
insurance products (life insurance), or unusual combinations of insurance claims
can be a sign of fraud.
• Telecommunication services: Commonly purchased groups of options (e.g.,
call waiting, caller ID, three-way calling) help better structure product bundles to
maximize revenue; the same is also applicable to multichannel telecom providers
with phone, television, and Internet service offerings.
• Medical records: Certain combinations of conditions can indicate increased risk
of various complications; or, certain treatment procedures at certain medical facili-
ties can be tied to certain types of infections.
A good question to ask with respect to the patterns/relationships that association
rule mining can discover is “Are all association rules interesting and useful?” To answer
such a question, association rule mining uses three common metrics: support, confidence, and lift. Before defining these terms, let’s get a little technical by showing what
an association rule looks like:
X ⇒ Y [Supp(%), Conf(%)]
{Laptop Computer, Antivirus Software} ⇒ {Extended Service Plan} [30%, 70%]
Here, X (products and/or service—called the left-hand side, LHS, or antecedent)
is associated with Y (products and/or service—called the right-hand side, RHS, or con-
sequent). S is the support, and C is the confidence for this particular rule. Here are the
simple formulas for Supp, Conf, and Lift.
Support = Supp(X ⇒ Y) = (Number of baskets that contain both X and Y) / (Total number of baskets)
Confidence = Conf(X ⇒ Y) = Supp(X ⇒ Y) / Supp(X)
Lift(X ⇒ Y) = Conf(X ⇒ Y) / Expected Conf(X ⇒ Y) = [S(X ⇒ Y) / S(X)] / [S(X) * S(Y) / S(X)] = S(X ⇒ Y) / (S(X) * S(Y))
The support (S) of a collection of products is the measure of how often these
products and/or services (i.e., LHS + RHS = Laptop Computer, Antivirus Software, and
Extended Service Plan) appear together in the same transaction; that is, the proportion
of transactions in the data set that contain all of the products and/or services mentioned
in a specific rule. In this example, 30 percent of all transactions in the hypothetical store
database had all three products present in a single sales ticket. The confidence of a rule
is the measure of how often the products and/or services on the RHS (consequent) go
together with the products and/or services on the LHS (antecedent), that is, the propor-
tion of transactions that include LHS while also including the RHS. In other words, it is
the conditional probability of finding the RHS of the rule present in transactions where
the LHS of the rule already exists. The lift value of an association rule is the ratio of the
confidence of the rule and the expected confidence of the rule. The expected confidence
of a rule is defined as the product of the support values of the LHS and the RHS divided
by the support of the LHS.
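To make these definitions concrete, the short Python sketch below (illustrative only; the toy baskets and item names are made up) computes support, confidence, and lift for a rule resembling the laptop/antivirus/extended-service-plan example above.

```python
# Illustrative sketch only: support, confidence, and lift on a toy set of baskets.
baskets = [
    {"laptop", "antivirus", "service_plan"},
    {"laptop", "antivirus", "service_plan", "mouse"},
    {"laptop", "mouse"},
    {"laptop", "antivirus"},
    {"printer", "paper"},
]

lhs = {"laptop", "antivirus"}      # antecedent (X)
rhs = {"service_plan"}             # consequent (Y)

n = len(baskets)
supp_lhs = sum(lhs <= b for b in baskets) / n            # support of X
supp_rhs = sum(rhs <= b for b in baskets) / n            # support of Y
supp_rule = sum((lhs | rhs) <= b for b in baskets) / n   # support of X => Y

confidence = supp_rule / supp_lhs
lift = confidence / supp_rhs   # equivalently supp_rule / (supp_lhs * supp_rhs)

print(f"support={supp_rule:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```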
Several algorithms are available for discovering association rules. Some well-known
algorithms include Apriori, Eclat, and FP-Growth. These algorithms only do half the job,
which is to identify the frequent itemsets in the database. Once the frequent itemsets are
identified, they need to be converted into rules with antecedent and consequent parts.
Determination of the rules from frequent itemsets is a straightforward matching process,
but the process can be time consuming with large transaction databases. Even though
there can be many items on each section of the rule, in practice the consequent part usu-
ally contains a single item. In the following section, one of the most popular algorithms
for identification of frequent itemsets is explained.
APRIORI ALGORITHM The Apriori algorithm is the most commonly used algorithm
to discover association rules. Given a set of itemsets (e.g., sets of retail transactions,
each listing individual items purchased), the algorithm attempts to find subsets that are
common to at least a minimum number of the itemsets (i.e., complies with a minimum
support). Apriori uses a bottom-up approach by which frequent subsets are extended
one item at a time (a method known as candidate generation by which the size of
frequent subsets increases from one-item subsets to two-item subsets, then three-item
subsets, etc.), and groups of candidates at each level are tested against the data for
minimum support. The algorithm terminates when no further successful extensions are
found.
As an illustrative example, consider the following. A grocery store tracks sales trans-
actions by SKU (stock-keeping unit) and thus knows which items are typically purchased
together. The database of transactions along with the subsequent steps in identifying
the frequent itemsets is shown in Figure 4.14. Each SKU in the transaction database cor-
responds to a product, such as “1 = butter,” “2 = bread,” “3 = water,” and so on. The
first step in Apriori is to count the frequencies (i.e., the supports) of each item (one-item
itemsets). For this overly simplified example, let us set the minimum support to 3 (or 50%,
meaning an itemset is considered to be a frequent itemset if it shows up in at least 3 of 6
transactions in the database). Because all the one-item itemsets have at least 3 in the sup-
port column, they are all considered frequent itemsets. However, had any of the one-item
itemsets not been frequent, they would not have been included as a possible member of
possible two-item pairs. In this way, Apriori prunes the tree of all possible itemsets. As
Figure 4.14 shows, using one-item itemsets, all possible two-item itemsets are generated
and the transaction database is used to calculate their support values. Because the two-
item itemset {1, 3} has a support less than 3, it should not be included in the frequent
itemsets that will be used to generate the next-level itemsets (three-item itemsets). The
algorithm seems deceptively simple, but only for small data sets. In much larger data sets, especially those with huge numbers of items that appear in low quantities and small numbers of items that appear in large quantities, the search and calculation become a computationally intensive process.
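The following Python sketch (not the book's code) walks through the same toy example level by level: it counts supports, keeps itemsets that meet the minimum support of 3, and extends the survivors one item at a time. For brevity it checks candidate support directly instead of first pruning candidates that contain an infrequent subset, as a full Apriori implementation would.

```python
# Illustrative sketch only: frequent-itemset discovery on the Figure 4.14 transactions.
transactions = [
    {1, 2, 3, 4}, {2, 3, 4}, {2, 3}, {1, 2, 4}, {1, 2, 3, 4}, {2, 4},
]
min_support = 3

def support(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions)

# Level 1: frequent one-item itemsets
items = sorted({x for t in transactions for x in t})
level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

while level:
    print([(sorted(s), support(s)) for s in level])
    # Candidate generation: union pairs of frequent k-item sets into (k + 1)-item
    # candidates, then keep only those meeting the minimum support.
    size = len(next(iter(level))) + 1
    candidates = {a | b for a in level for b in level if len(a | b) == size}
    level = [c for c in candidates if support(c) >= min_support]
```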
uSECTION 4.5 REVIEW QUESTIONS
1. Identify at least three of the main data mining methods.
2. Give examples of situations in which classification would be an appropriate data
mining technique. Give examples of situations in which regression would be an
appropriate data mining technique.
3. List and briefly define at least two classification techniques.
4. What are some of the criteria for comparing and selecting the best classification
technique?
FIGURE 4.14 Identification of Frequent Itemsets in the Apriori Algorithm.
Raw Transaction Data (Transaction No: SKUs/item no.): 1001234: 1, 2, 3, 4; 1001235: 2, 3, 4; 1001236: 2, 3; 1001237: 1, 2, 4; 1001238: 1, 2, 3, 4; 1001239: 2, 4.
One-Item Itemsets (Itemset: Support): {1}: 3; {2}: 6; {3}: 4; {4}: 5.
Two-Item Itemsets (Itemset: Support): {1, 2}: 3; {1, 3}: 2; {1, 4}: 3; {2, 3}: 4; {2, 4}: 5; {3, 4}: 3.
Three-Item Itemsets (Itemset: Support): {1, 2, 4}: 3; {2, 3, 4}: 3.
5. Briefly describe the general algorithm used in decision trees.
6. Define Gini index. What does it measure?
7. Give examples of situations in which cluster analysis would be an appropriate data
mining technique.
8. What is the major difference between cluster analysis and classification?
9. What are some of the methods for cluster analysis?
10. Give examples of situations in which association would be an appropriate data min-
ing technique.
4.6 DATA MINING SOFTWARE TOOLS
Many software vendors provide powerful data mining tools. Examples of these ven-
dors include IBM (IBM SPSS Modeler, formerly known as SPSS PASW Modeler and
Clementine), SAS (Enterprise Miner), Dell (Statistica, formerly known as StatSoft
Statistica Data Miner), SAP (Infinite Insight, formerly known as KXEN Infinite Insight),
Salford Systems (CART, MARS, TreeNet, RandomForest), Angoss (KnowledgeSTUDIO,
KnowledgeSEEKER), and Megaputer (PolyAnalyst). Noticeably but not surprisingly, the
most popular data mining tools are developed by the well-established statistical soft-
ware companies (SAS, SPSS, and StatSoft)—largely because statistics is the foundation of
data mining, and these companies have the means to cost-effectively develop them into
full-scale data mining systems. Most of the business intelligence tool vendors (e.g., IBM
Cognos, Oracle Hyperion, SAP Business Objects, Tableau, Tibco, Qlik, MicroStrategy,
Teradata, and Microsoft) also have some level of data mining capabilities integrated into
their software offerings. These BI tools are still primarily focused on multidimensional
modeling and data visualization and are not considered to be direct competitors of the
data mining tool vendors.
In addition to these commercial tools, several open source and/or free data mining
software tools are available online. Traditionally, especially in educational circles, the
most popular free and open source data mining tool is Weka, which was developed by
a number of researchers from the University of Waikato in New Zealand (the tool can be
downloaded from cs.waikato.ac.nz/ml/weka). Weka includes a large number of algo-
rithms for different data mining tasks and has an intuitive user interface. Recently, a num-
ber of free open source, highly capable data mining tools emerged: leading the pack are
KNIME (knime.org) and RapidMiner (rapidminer.com). Their graphically enhanced
user interfaces, employment of a rather large number of algorithms, and incorporation of
a variety of data visualization features set them apart from the rest of the free tools. These
two free software tools are also platform agnostic (i.e., can natively run on both Windows
and Mac operating systems). With a recent change in its offerings, RapidMiner has created a scaled-down, free version of its analytics tool (i.e., a community edition) while offering the full version as a commercial product. Therefore, although once listed under the free/open source tools category, RapidMiner today is often listed under commercial tools. The main difference be-
tween commercial tools, such as SAS Enterprise Miner, IBM SPSS Modeler, and Statistica,
and free tools, such as Weka, RapidMiner (community edition), and KNIME, is the com-
putational efficiency. The same data mining task involving a rather large and feature-rich
data set can take much longer to complete with the free software tools, and for some
algorithms, the job might not even be completed (i.e., crashing due to the inefficient use
of computer memory). Table 4.2 lists a few of the major products and their Web sites.
A suite of business intelligence and analytics capabilities that has become increas-
ingly more popular for data mining studies is Microsoft’s SQL Server (it has included
increasingly more analytics capabilities, such as BI and predictive modeling modules,
starting with the SQL Server 2012 version) where data and the models are stored in the
same relational database environment, making model management a considerably easier
task. Microsoft Enterprise Consortium serves as the worldwide source for access to
Microsoft’s SQL Server software suite for academic purposes—teaching and research. The
consortium has been established to enable universities around the world to access en-
terprise technology without having to maintain the necessary hardware and software on
their own campus. The consortium provides a wide range of business intelligence devel-
opment tools (e.g., data mining, cube building, business reporting) as well as a number
of large, realistic data sets from Sam’s Club, Dillard’s, and Tyson Foods. The Microsoft
Enterprise Consortium is free of charge and can be used only for academic purposes. The
Sam M. Walton College of Business at the University of Arkansas hosts the enterprise sys-
tem and allows consortium members and their students to access these resources using a
simple remote desktop connection. The details about becoming a part of the consortium
along with easy-to-follow tutorials and examples can be found at walton.uark.edu/
enterprise/.
In May 2016, KDnuggets.com conducted the 13th Annual Software Poll on the
following question: “What software have you used for Analytics, Data Mining, Data
Science, Machine Learning projects in the past 12 months?” The poll received remarkable
participation from the analytics and data science community and vendors, attracting 2,895
TABLE 4.2 Selected Data Mining Software
Product Name Web Site (URL)
IBM SPSS Modeler www-01.ibm.com/software/analytics/spss/products/modeler/
IBM Watson Analytics ibm.com/analytics/watson-analytics/
SAS Enterprise Miner sas.com/en_id/software/analytics/enterprise-miner.html
Dell Statistica statsoft.com/products/statistica/product-index
PolyAnalyst megaputer.com/site/polyanalyst.php
CART, RandomForest salford-systems.com
Insightful Miner solutionmetrics.com.au/products/iminer/default.html
XLMiner solver.com/xlminer-data-mining
SAP InfiniteInsight (KXEN) help.sap.com/ii
GhostMiner qs.pl/ghostminer
SQL Server Data Mining msdn.microsoft.com/en-us/library/bb510516.aspx
Knowledge Miner knowledgeminer.com
Teradata Warehouse Miner teradata.com/products-and-services/teradata-warehouse-miner/
Oracle Data Mining (ODM) oracle.com/technetwork/database/options/odm/
FICO Decision Management fico.com/en/analytics/decision-management-suite/
Orange Data Mining Tool orange.biolab.si/
Zementis Predictive Analytics zementis.com
voters, who chose from a record number of 102 different tools. Here are some of the
interesting findings that came from the poll:
• R remains the leading tool, with 49 percent shares (up from 46.9% in 2015), but
Python usage grew faster and almost caught up to R with 45.8 percent shares (up
from 30.3%).
• RapidMiner remains the most popular general platform for data mining/data science
with about 33 percent shares. Notable tools with the most growth in popularity include Dataiku, MLlib, H2O, Amazon Machine Learning, scikit-learn, and IBM Watson.
• The increased choice of tools is reflected in wider usage. The average number of
tools used was 6.0 (versus 4.8 in May 2015).
• The usage of Hadoop/Big Data tools increased to 39 percent up from 29 percent
in 2015 (and 17% in 2014) driven by Apache Spark, MLlib (Spark Machine Learning
Library), and H2O.
• The participation by region was United States/Canada (40%), Europe (39%), Asia
(9.4%), Latin America (5.8%), Africa/MidEast (2.9%), and Australia/New Zealand (2.2%).
• This year, 86 percent of voters used commercial software, and 75 percent used free
software. About 25 percent used only commercial software, and 13 percent used
only open source/free software. A majority of 61 percent used both free and com-
mercial software, similar to 64 percent in 2015.
• For the second year, KDnuggets.com’s poll included Deep Learning tools. This
year, 18 percent of voters used Deep Learning tools, doubling the 9 percent in
2015—Google Tensorflow jumped to first place, displacing last year’s leader,
Theano/Pylearn2 ecosystem.
• In the programming languages category, Python, Java, Unix tools, and Scala grew in
popularity, while C/C++, Perl, Julia, F#, Clojure, and Lisp declined.
To reduce bias through multiple voting, in this poll KDnuggets.com used e-mail
verification and, by doing so, aimed to make results more representative of the reality in
the analytics world. The results for the top 40 software tools (as per total number of votes
received) are shown in Figure 4.15. The horizontal bar chart also makes a differentiation
among free/open source, commercial, and Big Data/Hadoop tools using a color-coding
schema.
Application Case 4.6 is about a research study in which a number of software tools
and data mining techniques were used to build data mining models to predict financial
success (box-office receipts) of Hollywood movies while they are nothing more than
ideas.
uSECTION 4.6 REVIEW QUESTIONS
1. What are the most popular commercial data mining tools?
2. Why do you think the most popular tools are developed by statistics-based
companies?
3. What are the most popular free data mining tools? Why are they gaining overwhelm-
ing popularity (especially R)?
4. What are the main differences between commercial and free data mining software
tools?
5. What would be your top five selection criteria for a data mining tool? Explain.
http://KDnuggets.com
http://KDnuggets.com
FIGURE 4.15 Popular Data Mining Software Tools (Poll Results). Horizontal bar chart of vote counts for the top 40 tools (led by R, Python, SQL, Excel, and RapidMiner), color-coded as Free/Open Source tools, Commercial tools, and Hadoop/Big Data tools. Source: Used with permission from KDnuggets.com.
Application Case 4.6 Data Mining Goes to Hollywood: Predicting Financial Success of Movies
Predicting box-office receipts (i.e., financial success)
of a particular motion picture is an interesting and
challenging problem. According to some domain
experts, the movie industry is the “land of hunches
and wild guesses” due to the difficulty associated
with forecasting product demand, making the movie
business in Hollywood a risky endeavor. In sup-
port of such observations, Jack Valenti (the longtime
president and CEO of the Motion Picture Association
of America) once mentioned that “no one can
tell you how a movie is going to do in the market-
place . . . not until the film opens in darkened the-
atre and sparks fly up between the screen and the
audience.” Entertainment industry trade journals and
magazines have been full of examples, statements,
and experiences that support such a claim.
Like many other researchers who have attempted
to shed light on this challenging real-world problem,
Ramesh Sharda and Dursun Delen have been explor-
ing the use of data mining to predict the financial per-
formance of a motion picture at the box office before
it even enters production (while the movie is nothing
more than a conceptual idea). In their highly publi-
cized prediction models, they convert the forecasting
(or regression) problem into a classification problem;
that is, rather than forecasting the point estimate of
box-office receipts, they classify a movie based on
its box-office receipts in one of nine categories, rang-
ing from “flop” to “blockbuster,” making the problem
a multinomial classification problem. Table 4.3 illus-
trates the definition of the nine classes in terms of the
range of box-office receipts.
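As a simple illustration of this conversion, the Python sketch below (not the authors' code; pandas is assumed and the receipt values are hypothetical) bins continuous box-office receipts into the nine classes using the boundaries of Table 4.3.

```python
# Illustrative sketch only: turning a continuous target into a 9-class label.
import pandas as pd

receipts = pd.Series([0.4, 7.5, 18.0, 35.0, 62.0, 95.0, 140.0, 180.0, 250.0])  # $M
bins = [0, 1, 10, 20, 40, 65, 100, 150, 200, float("inf")]   # Table 4.3 boundaries
labels = list(range(1, 10))   # class 1 = "flop" ... class 9 = "blockbuster"

receipt_class = pd.cut(receipts, bins=bins, labels=labels)
print(receipt_class.tolist())
```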
Data
Data were collected from a variety of movie-related
databases (e.g., ShowBiz, IMDb, IMSDb, AllMovie,
BoxofficeMojo) and consolidated into a single data
set. The data set for the most recently developed
models contained 2,632 movies released between
1998 and 2006. A summary of the independent vari-
ables along with their specifications is provided in
Table 4.4. For more descriptive details and justifi-
cation for inclusion of these independent variables,
the reader is referred to Sharda and Delen (2006).
The Methodology
Using a variety of data mining methods, including
neural networks, decision trees, SVMs, and three
types of ensembles, Sharda and Delen (2006) devel-
oped the prediction models. The data from 1998 to
2005 were used as training data to build the pre-
diction models, and the data from 2006 were used
as the test data to assess and compare the models’
prediction accuracy. Figure 4.16 shows a screenshot
of IBM SPSS Modeler (formerly Clementine data
mining tool) depicting the process map employed
for the prediction problem. The upper-left side of
the process map shows the model development
TABLE 4.3 Movie Classification Based on Receipts (range in millions of dollars)
Class 1 (Flop): less than 1
Class 2: more than 1 and less than 10
Class 3: more than 10 and less than 20
Class 4: more than 20 and less than 40
Class 5: more than 40 and less than 65
Class 6: more than 65 and less than 100
Class 7: more than 100 and less than 150
Class 8: more than 150 and less than 200
Class 9 (Blockbuster): more than 200
TABLE 4.4 Summary of Independent Variables
Independent Variable Number of Values Possible Values
MPAA Rating 5 G, PG, PG-13, R, NR
Competition 3 High, medium, low
Star value 3 High, medium, low
Genre 10 Sci-Fi, Historic Epic Drama, Modern Drama, Politically Related,
Thriller, Horror, Comedy, Cartoon, Action, Documentary
Special effects 3 High, medium, low
Sequel 2 Yes, no
Number of screens 1 A positive integer between 1 and 3,876
process, and the lower-right corner of the process
map shows the model assessment (i.e., testing or
scoring) process (more details on the IBM SPSS
Modeler tool and its usage can be found on the
book’s Web site).
The Results
Table 4.5 provides the prediction results of all three
data mining methods as well as the results of the three
different ensembles. The first performance measure is
the percentage of correct classification rate, which is
called Bingo. Also reported in the table is the 1-Away
correct classification rate (i.e., within one category).
The results indicate that SVM performed the best
among the individual prediction models followed by
ANN; the worst of the three was the CART decision
tree algorithm. In general, the ensemble models performed better than the individual prediction models, and among the ensembles the fusion (average) model performed the best.
What is probably more important to decision makers
and standing out in the results table is the significantly
low standard deviation obtained from the ensembles
compared to the individual models.
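Both accuracy measures are easy to compute once actual and predicted classes are available; the Python sketch below (illustrative only, with hypothetical labels) shows how Bingo and 1-Away rates could be derived.

```python
# Illustrative sketch only: exact (Bingo) and within-one-class (1-Away) accuracy.
import numpy as np

actual = np.array([1, 3, 5, 7, 9, 2, 4, 6, 8, 5])      # hypothetical true classes
predicted = np.array([1, 4, 5, 6, 9, 2, 2, 6, 8, 7])   # hypothetical predictions

bingo = np.mean(actual == predicted)                 # exact class match
one_away = np.mean(np.abs(actual - predicted) <= 1)  # off by at most one class

print(f"Bingo accuracy: {bingo:.0%}, 1-Away accuracy: {one_away:.0%}")
```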
The Conclusion
The researchers claim that these prediction results are
better than any reported in the published literature for
this problem domain. Beyond the attractive accuracy
of their prediction results of the box-office receipts,
these models could also be used to further analyze
(and potentially optimize) the decision variables to
maximize the financial return. Specifically, the param-
eters used for modeling could be altered using the
already trained prediction models to better understand
the impact of different parameters on the end results.
During this process, which is commonly referred to as
sensitivity analysis, the decision maker of a given enter-
tainment firm could find out, with a fairly high accuracy
level, how much value a specific actor (or a specific
release date, or the addition of more technical effects,
etc.) brings to the financial success of a film, making
the underlying system an invaluable decision aid.
FIGURE 4.16 Process Flow Screenshot for the Box-Office Prediction System (IBM SPSS Modeler process map showing the model development branch, which feeds the training data into SVM, CART decision tree, and neural net nodes, and the model assessment branch, which scores the 2006 test data). Source: Reprint Courtesy of International Business Machines Corporation, © International Business Machines Corporation.
Questions for Case 4.6
1. Why is it important for many Hollywood pro-
fessionals to predict the financial success of
movies?
2. How can data mining be used for predicting
financial success of movies before the start of
their production process?
3. How do you think Hollywood performed, and
perhaps still is performing, this task without the
help of data mining tools and techniques?
Sources: R. Sharda & D. Delen, “Predicting Box-Office Success
of Motion Pictures with Neural Networks,” Expert Systems with
Applications, 30, 2006, pp. 243–254; D. Delen, R. Sharda, &
P. Kumar, “Movie Forecast Guru: A Web-Based DSS for Hollywood
Managers,” Decision Support Systems, 43(4), 2007, pp. 1151–1170.
4.7 DATA MINING PRIVACY ISSUES, MYTHS, AND BLUNDERS
Data that are collected, stored, and analyzed in data mining often contain information about
real people. Such information can include identification data (name, address, Social Security
number, driver’s license number, employee number, etc.), demographic data (e.g., age,
sex, ethnicity, marital status, number of children), financial data (e.g., salary, gross family
income, checking or savings account balance, home ownership, mortgage or loan account
specifics, credit card limits and balances, investment account specifics), purchase history
(i.e., what is bought from where and when—either from vendor’s transaction records or
from credit card transaction specifics), and other personal data (e.g., anniversary, preg-
nancy, illness, loss in the family, bankruptcy filings). Most of these data can be accessed
through some third-party data providers. The main question here is the privacy of the per-
son to whom the data belong. To maintain the privacy and protection of individuals’ rights,
data mining professionals have ethical (and often legal) obligations. One way to accomplish
this is the process of de-identification of the customer records prior to applying data mining
applications so that the records cannot be traced to an individual. Many publicly available
data sources (e.g., CDC data, SEER data, UNOS data) are already de-identified. Prior to ac-
cessing these data sources, users are often asked to consent that under no circumstances
will they try to identify the individuals behind those figures.
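As a simple illustration of what de-identification can look like in practice, the Python sketch below (not from the book; the field names and salt value are hypothetical) replaces a direct identifier with a salted hash so that records remain linkable for analysis without exposing the underlying identity. Real de-identification programs also have to address quasi-identifiers (e.g., generalizing age or ZIP code), which this sketch does not.

```python
# Illustrative sketch only: pseudonymizing a direct identifier before analysis.
import hashlib

SALT = "keep-this-secret-and-separate-from-the-data"   # hypothetical secret salt

def pseudonymize(identifier: str) -> str:
    """Return a stable surrogate key derived from (but not revealing) the identifier."""
    return hashlib.sha256((SALT + identifier).encode("utf-8")).hexdigest()[:16]

record = {"ssn": "123-45-6789", "name": "Jane Doe", "age": 34, "balance": 1820.50}

deidentified = {
    "customer_key": pseudonymize(record["ssn"]),   # replaces SSN/name
    "age": record["age"],                          # quasi-identifiers may still need
    "balance": record["balance"],                  # generalization or suppression
}
print(deidentified)
```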
There have been a number of instances in the recent past when companies shared
their customer data with others without seeking the explicit consent of their customers. For
instance, as most of you might recall, in 2003, JetBlue Airlines provided more than 1 million
passenger records of customers to Torch Concepts, a U.S. government contractor. Torch
TABLE 4.5 Tabulated Prediction Results for Individual and Ensemble Models
                         Individual Models               Ensemble Models
Performance Measure      SVM      ANN      CART          Random Forest   Boosted Tree   Fusion (average)
Count (Bingo)            192      182      140           189             187            194
Count (1-Away)           104      120      126           121             104            120
Accuracy (% Bingo)       55.49%   52.60%   40.46%        54.62%          54.05%         56.07%
Accuracy (% 1-Away)      85.55%   87.28%   76.88%        89.60%          84.10%         90.75%
Standard deviation       0.93     0.87     1.05          0.76            0.84           0.63
then subsequently augmented the passenger data with additional information such as fam-
ily sizes and Social Security numbers—information purchased from the data broker Acxiom.
The consolidated personal database was intended to be used for a data mining project to
develop potential terrorist profiles. All of this was done without notification or consent of
passengers. When news of the activities got out, however, dozens of privacy lawsuits were
filed against JetBlue, Torch, and Acxiom, and several U.S. senators called for an investiga-
tion into the incident (Wald, 2004). Similar, but not as dramatic, privacy-related news was
reported in the recent past about popular social network companies that allegedly were
selling customer-specific data to other companies for personalized target marketing.
Another peculiar story about privacy concerns made it to the headlines in 2012.
In this instance, the company, Target, did not even use any private and/or personal
data. Legally speaking, there was no violation of any laws. The story is summarized in
Application Case 4.7.
Application Case 4.7 Predicting Customer Buying Patterns—The Target Story
In early 2012, an infamous story appeared concern-
ing Target’s practice of predictive analytics. The story
was about a teenage girl who was being sent adver-
tising flyers and coupons by Target for the kinds of
things that a mother-to-be would buy from a store like
Target. The story goes like this: An angry man went
into a Target outside of Minneapolis, demanding to
talk to a manager: “My daughter got this in the mail!”
he said. “She’s still in high school, and you’re sending
her coupons for baby clothes and cribs? Are you trying
to encourage her to get pregnant?” The manager had
no idea what the man was talking about. He looked at
the mailer. Sure enough, it was addressed to the man’s
daughter and contained advertisements for maternity
clothing, nursery furniture, and pictures of smiling
infants. The manager apologized and then called a few
days later to apologize again. On the phone, though,
the father was somewhat abashed. “I had a talk with
my daughter,” he said. “It turns out there’s been some
activities in my house I haven’t been completely aware
of. She’s due in August. I owe you an apology.”
As it turns out, Target figured out a teen girl
was pregnant before her father did! Here is how
the company did it. Target assigns every customer a
Guest ID number (tied to his or her credit card, name,
or e-mail address) that becomes a placeholder that
keeps a history of everything the person has bought.
Target augments these data with any demographic
information that it had collected from the customer
or had bought from other information sources. Using
this information, Target looked at historical buying
data for all the females who had signed up for Target
baby registries in the past. They analyzed the data
from all directions, and soon enough, some useful
patterns emerged. For example, lotions and special
vitamins were among the products with interesting
purchase patterns. Lots of people buy lotion, but
what an analyst noticed was that women on the baby
registry were buying larger quantities of unscented
lotion around the beginning of their second trimester.
Another analyst noted that sometime in the first 20
weeks, pregnant women loaded up on supplements
like calcium, magnesium, and zinc. Many shoppers
purchase soap and cotton balls, but when someone
suddenly starts buying lots of scent-free soap and
extra-large bags of cotton balls, in addition to hand
sanitizers and washcloths, it signals that they could
be getting close to their delivery date. In the end, the
analysts were able to identify about 25 products that,
when analyzed together, allowed them to assign each
shopper a “pregnancy prediction” score. More impor-
tant, they could also estimate a woman’s due date to
within a small window, so Target could send cou-
pons timed to very specific stages of her pregnancy.
If you look at this practice from a legal perspec-
tive, you would conclude that Target did not use any
information that violates customer privacy; rather,
they used transactional data that almost every other
retail chain is collecting and storing (and perhaps
analyzing) about their customers. What was disturb-
ing in this scenario was perhaps the targeted con-
cept: pregnancy. Certain events or concepts should
be off limits or treated extremely cautiously, such as terminal disease, divorce, and bankruptcy.
Questions for Case 4.7
1. What do you think about data mining and its implication for privacy? What is the threshold between discovery of knowledge and infringement of privacy?
2. Did Target go too far? Did it do anything illegal? What do you think Target should have done? What do you think Target should do next (quit these types of practices)?
Sources: K. Hill, “How Target Figured Out a Teen Girl Was Pregnant Before Her Father Did,” Forbes, February 16, 2012; R. Nolan, “Behind the Cover Story: How Much Does Target Know?”, February 21, 2012, NYTimes.com.
Data Mining Myths and Blunders
Data mining is a powerful analytical tool that enables business executives to advance from
describing the nature of the past (looking at a rearview mirror) to predicting the future
(looking ahead) to better manage their business operations (making accurate and timely
decisions). Data mining helps marketers find patterns that unlock the mysteries of customer
behavior. The results of data mining can be used to increase revenue and reduce cost by
identifying fraud and discovering business opportunities, offering a whole new realm of
competitive advantage. As an evolving and maturing field, data mining is often associated
with a number of myths, including those listed in Table 4.6 (Delen, 2014; Zaima, 2003).
Data mining visionaries have gained enormous competitive advantage by under-
standing that these myths are just that: myths.
Although the value proposition and therefore its necessity are obvious to anyone,
those who carry out data mining projects (from novice to seasoned data scientist) some-
times make mistakes that result in projects with less-than-desirable outcomes. The follow-
ing 16 data mining mistakes (also called blunders, pitfalls, or bloopers) are often made
in practice (Nisbet et al., 2009; Shultz, 2004; Skalak, 2001), and data scientists should be
aware of them and, to the extent that is possible, do their best to avoid them:
1. Selecting the wrong problem for data mining. Not every business problem can be
solved with data mining (i.e., the magic bullet syndrome). When there are no represen-
tative data (large and feature rich), there cannot be a practicable data mining project.
2. Ignoring what your sponsor thinks data mining is and what it really can and cannot
do. Expectation management is the key for successful data mining projects.
TABLE 4.6 Data Mining Myths
Myth: Data mining provides instant, crystal-ball-like predictions. Reality: Data mining is a multistep process that requires deliberate, proactive design and use.
Myth: Data mining is not yet viable for mainstream business applications. Reality: The current state of the art is ready for almost any business type and/or size.
Myth: Data mining requires a separate, dedicated database. Reality: Because of the advances in database technology, a dedicated database is not required.
Myth: Only those with advanced degrees can do data mining. Reality: Newer Web-based tools enable managers of all educational levels to do data mining.
Myth: Data mining is only for large firms that have lots of customer data. Reality: If the data accurately reflect the business or its customers, any company can use data mining.
3. Beginning without the end in mind. Although data mining is a process of knowledge
discovery, one should have a goal/objective (a stated business problem) in mind to
succeed. Because, as the saying goes, “If you don’t know where you are going, you
will never get there.”
4. Defining the project around a foundation that your data cannot support. Data mining
is all about data; that is, the biggest constraint that you have in a data mining project
is the richness of the data. Knowing what the limitations of data are helps you craft
feasible projects that deliver results and meet expectations.
5. Leaving insufficient time for data preparation. It takes more effort than is generally
understood. The common knowledge suggests that up to one-third of the total proj-
ect time is spent on data acquisition, understanding, and preparation tasks. To suc-
ceed, avoid proceeding into modeling until after your data are properly processed
(aggregated, cleaned, and transformed).
6. Looking only at aggregated results, not at individual records. Data mining is at its
best when the data are at a granular representation. Try to avoid unnecessarily ag-
gregating and overly simplifying data to help data mining algorithms—they don’t
really need your help; they are more than capable of figuring it out themselves.
7. Being sloppy about keeping track of the data mining procedure and results. Because
data mining is a discovery process that involves many iterations and experimenta-
tions, its user is highly likely to lose track of the findings. Success requires a system-
atic and orderly planning, execution, and tracking/recording of all data mining tasks.
8. Using data from the future to predict the future. Because of the lack of description and
understanding of the data, oftentimes analysts include variables that are unknown at the
time when the prediction is supposed to be made. By doing so, their prediction models
produce unbelievably accurate results (a phenomenon that is often called fool’s gold).
If your prediction results are too good to be true, they usually are; in that case, the first
thing that you need to look for is the incorrect use of a variable from the future.
9. Ignoring suspicious findings and quickly moving on. The unexpected findings are
often the indicators of real novelties in data mining projects. Proper investigation of
such oddities can lead to surprisingly pleasing discoveries.
10. Starting with a high-profile complex project that will make you a superstar. Data
mining projects often fail if they are not thought out carefully from start to end.
Success often comes with a systematic and orderly progression of projects from
smaller/simpler to larger/complex ones. The goal should be to show incremental
and continuous value added as opposed to taking on a large project that will con-
sume resources without producing any valuable outcomes.
11. Running data mining algorithms repeatedly and blindly. Although today’s data mining
tools are capable of consuming data and setting algorithmic parameters to produce
results, one should know how to transform the data and set the proper parameter
values to obtain the best possible results. Each algorithm has its own unique way to
process data, and knowing that is necessary to get the most out of each model type.
12. Ignoring the subject matter experts. Understanding the problem domain and the
related data requires a highly involved collaboration between the data mining and
the domain experts. Working together helps the data mining expert to go beyond
the syntactic representation and to obtain semantic nature (i.e., the true meaning
of the variables) of the data.
13. Believing everything you are told about the data. Although it is necessary to talk to
domain experts to better understand the data and the business problem, the data
scientist should not take anything for granted. Validation and verification through a
critical analysis is the key to intimate understanding and processing of the data.
14. Assuming that the keepers of the data will be fully on board with cooperation. Many
data mining projects fail because the data mining expert did not know/understand
the organizational politics. One of the biggest obstacles in data mining projects can
be the people who own and control the data. Understanding and managing the
politics is a key to identify, access, and properly understand the data to produce a
successful data mining project.
15. Measuring your results differently from the way your sponsor measures them. The
results should talk/appeal to the end user (manager/decision maker) who will be
using them. Therefore, producing the results in a measure and format that appeals
to the end user tremendously increases the likelihood of true understanding and
proper use of the data mining outcomes.
16. Following the advice in the well-known quote, “If you build it, they will come,” and not worrying about how to serve it up. Usually, data mining experts think they have finished once
they build models that meet and hopefully exceed the needs/wants/expectations of
the end user (i.e., the customer). Without a proper deployment, the value deliverance
of data mining outcomes is rather limited. Therefore, deployment is a necessary last
step in the data mining process in which models are integrated into the organizational
decision support infrastructure for enablement of better and faster decision making.
uSECTION 4.7 REVIEW QUESTIONS
1. What are the privacy issues in data mining?
2. How do you think the discussion between privacy and data mining will progress? Why?
3. What are the most common myths about data mining?
4. What do you think are the reasons for these myths about data mining?
5. What are the most common data mining mistakes/blunders? How can they be allevi-
ated or completely eliminated?
Chapter Highlights
• Data mining is the process of discovering new
knowledge from databases.
• Data mining can use simple flat files as data sources,
or it can be performed on data in data warehouses.
• There are many alternative names and definitions
for data mining.
• Data mining is at the intersection of many dis-
ciplines, including statistics, artificial intelligence,
and mathematical modeling.
• Companies use data mining to better understand
their customers and optimize their operations.
• Data mining applications can be found in virtually
every area of business and government, including
healthcare, finance, marketing, and homeland security.
• Three broad categories of data mining tasks are
prediction (classification or regression), cluster-
ing, and association.
• Similar to other IS initiatives, a data mining proj-
ect must follow a systematic project management
process to be successful.
• Several data mining processes have been pro-
posed: CRISP-DM, SEMMA, KDD, for example.
• CRISP-DM provides a systematic and orderly way
to conduct data mining projects.
• The earlier steps in data mining projects (i.e., un-
derstanding the domain and the relevant data)
consume most of the total project time (often
more than 80% of the total time).
• Data preprocessing is essential to any successful
data mining study. Good data lead to good infor-
mation; good information leads to good decisions.
• Data preprocessing includes four main steps: data
consolidation, data cleaning, data transformation,
and data reduction.
• Classification methods learn from previous ex-
amples containing inputs and the resulting class
labels, and once properly trained, they are able to
classify future cases.
• Clustering partitions pattern records into natural
segments or clusters. Each segment’s members
share similar characteristics.
• A number of different algorithms are commonly
used for classification. Commercial implementa-
tions include ID3, C4.5, C5, CART, CHAID, and
SPRINT.
• Decision trees partition data by branching along
different attributes so that each leaf node has all
the patterns of one class.
• The Gini index and information gain (entropy)
are two popular ways to determine branching
choices in a decision tree.
• The Gini index measures the purity of a sample.
If everything in a sample belongs to one class,
the Gini index value is zero.
• Several assessment techniques can measure the
prediction accuracy of classification models, in-
cluding simple split, k-fold cross-validation, boot-
strapping, and the area under the ROC curve.
• There are a number of methods to assess the vari-
able importance of data mining models. Some of
these methods are model type specific, some are
model type agnostic.
• Cluster algorithms are used when data records do
not have predefined class identifiers (i.e., it is not
known to what class a particular record belongs).
• Cluster algorithms compute measures of similarity
in order to group similar cases into clusters.
• The most commonly used similarity measure in
cluster analysis is a distance measure.
• The most commonly used clustering algorithms
are k-means and self-organizing maps.
• Association rule mining is used to discover two or
more items (or events or concepts) that go together.
• Association rule mining is commonly referred to
as market-basket analysis.
• The most commonly used association algorithm
is Apriori by which frequent itemsets are identi-
fied through a bottom-up approach.
• Association rules are assessed based on their sup-
port and confidence measures.
• Many commercial and free data mining tools are
available.
• The most popular commercial data mining tools
are IBM SPSS Modeler and SAS Enterprise Miner.
• The most popular free data mining tools are
KNIME, RapidMiner, and Weka.
Key Terms
Apriori algorithm
area under the ROC curve
association
bootstrapping
categorical data
classification
clustering
confidence
CRISP-DM
data mining
decision tree
distance measure
ensemble
entropy
Gini index
information gain
interval data
k-fold cross-
validation
KNIME
knowledge discovery in
databases (KDD)
lift
link analysis
Microsoft Enterprise
Consortium
Microsoft SQL Server
nominal data
numeric data
ordinal data
prediction
RapidMiner
regression
SEMMA
sensitivity analysis
sequence mining
simple split
support
Weka
Questions for Discussion
1. Define data mining. Why are there many names and
definitions for data mining?
2. What are the main reasons for the recent popularity of
data mining?
3. Discuss what an organization should consider before
making a decision to purchase data mining software.
4. Distinguish data mining from other analytical tools and
techniques.
5. Discuss the main data mining methods. What are the
fundamental differences among them?
6. What are the main data mining application areas?
Discuss the commonalities of these areas that make
them a prospect for data mining studies.
7. Why do we need a standardized data mining pro-
cess? What are the most commonly used data mining
processes?
8. Discuss the differences between the two most com-
monly used data mining processes.
9. Are data mining processes a mere sequential set of
activities? Explain.
10. Why do we need data preprocessing? What are the main
tasks and relevant techniques used in data preprocessing?
11. Discuss the reasoning behind the assessment of clas-
sification models.
12. What is the main difference between classification and
clustering? Explain using concrete examples.
13. Moving beyond the chapter discussion, where else can
association be used?
14. What are the privacy issues with data mining? Do you
think they are substantiated?
15. What are the most common myths and mistakes about
data mining?
Exercises
Teradata University Network (TUN) and Other Hands-On
Exercises
1. Visit teradatauniversitynetwork.com. Identify case
studies and white papers about data mining. Describe
recent developments in the field of data mining and
predictive modeling.
2. Go to teradatauniversitynetwork.com. Locate Web
seminars related to data mining. In particular, locate
and watch a seminar given by C. Imhoff and T. Zouqes.
Then answer the following questions:
a. What are some of the interesting applications of data
mining?
b. What types of payoffs and costs can organizations
expect from data mining initiatives?
3. For this exercise, your goal is to build a model to iden-
tify inputs or predictors that differentiate risky custom-
ers from others (based on patterns pertaining to previous
customers) and then use those inputs to predict new risky
customers. This sample case is typical for this domain.
The sample data to be used in this exercise are in
Online File W4.1 in the file CreditRisk.xlsx. The data set
has 425 cases and 15 variables pertaining to past and
current customers who have borrowed from a bank for
various reasons. The data set contains customer-related
information such as financial standing, reason for the
loan, employment, demographic information, and the
outcome or dependent variable for credit standing, clas-
sifying each case as good or bad based on the institu-
tion’s past experience.
Take 400 of the cases as training cases and set
aside the other 25 for testing. Build a decision tree mod-
el to learn the characteristics of the problem. Test its per-
formance on the other 25 cases. Report on your model’s
learning and testing performance. Prepare a report that
identifies the decision tree model and training param-
eters as well as the resulting performance on the test set.
Use any decision tree software. (This exercise is cour-
tesy of StatSoft, Inc., based on a German data set from
ftp.ics.uc,i.edu/pub/machine-learning-databases/
statlog/german renamed CreditRisk and altered.)
4. For this exercise, you will replicate (on a smaller
scale) the box-office prediction modeling explained in
Application Case 4.6. Download the training data set
from Online File W4.2, MovieTrain.xlsx, which is in
Microsoft Excel format. Use the data description given
in Application Case 4.6 to understand the domain and
the problem you are trying to solve. Pick and choose
your independent variables. Develop at least three clas-
sification models (e.g., decision tree, logistic regression,
neural networks). Compare the accuracy results using
10-fold cross-validation and percentage split techniques,
use confusion matrices, and comment on the outcome.
Test the models you have developed on the test set (see
Online File W4.3, MovieTest.xlsx). Analyze the results
with different models, and find the best classification
model, supporting it with your results.
5. This exercise introduces you to association rule min-
ing. The Excel data set baskets1ntrans.xlsx has around
2,800 observations/records of supermarket transaction
products data. Each record contains the customer’s ID
and products that they have purchased. Use this data
set to understand the relationships among products (i.e.,
which products are purchased together). Look for inter-
esting relationships and add screenshots of any subtle
association patterns that you might find. More specifi-
cally, answer the following questions.
• Which association rules do you think are most important?
• Based on some of the association rules you found,
make at least three business recommendations that
might be beneficial to the company. These recom-
mendations can include ideas about shelf organiza-
tion, up-selling, or cross-selling products. (Bonus
points will be given to new/innovative ideas.)
• What are the Support, Confidence, and Lift values for
the following rule?
Wine, Canned Veg ⇒ Frozen Meal
6. In this assignment, you will use a free/open source
data mining tool, KNIME (knime.org), to build pre-
dictive models for a relatively small Customer Churn
Analysis data set. You are to analyze the given data
set (about the customer retention/attrition behavior
for 1,000 customers) to develop and compare at least
three prediction (i.e., classification) models. For exam-
ple, you can include decision trees, neural networks,
SVM, k nearest neighbor, and/or logistic regression
models in your comparison. Here are the specifics for
this assignment:
• Install and use the KNIME software tool from
(knime.org).
• You can also use MS Excel to preprocess the data (if
you need to/want to).
• Download CustomerChurnData.csv data file from the
book’s Web site.
• The data are given in comma-separated value (CSV)
format. This format is the most common flat-file for-
mat that many software tools can easily open/handle
(including KNIME and MS Excel).
• Present your results in a well-organized professional
document.
• Include a cover page (with proper information about
you and the assignment).
• Make sure to nicely integrate figures (graphs, charts,
tables, screenshots) within your textual description in
a professional manner. The report should have six
main sections (resembling CRISP-DM phases).
• Try not to exceed 15 pages in total, including the cover (use a 12-point Times New Roman font and 1.5-line spacing).
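The assignment itself should be completed in KNIME, but a quick cross-check of the three models outside KNIME can help validate your workflow. The sketch below uses Python with scikit-learn; the target column name ("Churn") and the 70/30 percentage split are assumptions about CustomerChurnData.csv.

# A minimal sketch (Python + scikit-learn) for cross-checking the churn models
# built in KNIME. The "Churn" target column name is an assumption.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

data = pd.read_csv("CustomerChurnData.csv")
X = pd.get_dummies(data.drop(columns=["Churn"]))   # assumed target column
y = data["Churn"]

# A 70/30 percentage split; stratify keeps the churn/no-churn ratio intact.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

models = {
    "decision tree":       DecisionTreeClassifier(max_depth=5, random_state=1),
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "k nearest neighbor":  make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7)),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, "hold-out accuracy: %.3f" % model.score(X_te, y_te))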
Team Assignments and Role-Playing Projects
1. Examine how new data capture devices such as RFID
tags help organizations accurately identify and segment
their customers for activities such as targeted marketing.
Many of these applications involve data mining. Scan the
literature and the Web and then propose five potential
new data mining applications that can use the data cre-
ated with RFID technology. What issues could arise if a
country’s laws required such devices to be embedded
in everyone’s body for a national identification system?
2. Interview administrators in your college or executives in
your organization to determine how data mining, data
warehousing, online analytical processing (OLAP), and
visualization tools could assist them in their work. Write
a proposal describing your findings. Include cost esti-
mates and benefits in your report.
3. A very good repository of data that has been used to
test the performance of many data mining algorithms is
available at ics.uci.edu/~mlearn/MLRepository.html.
Some of the data sets are meant to test the limits of cur-
rent machine-learning algorithms and to compare their
performance with new approaches to learning. However,
some of the smaller data sets can be useful for exploring
the functionality of any data mining software, such as
RapidMiner or KNIME. Download at least one data set
from this repository (e.g., Credit Screening Databases,
Housing Database) and apply decision tree or clustering
methods, as appropriate. Prepare a report based on your
results. (Some of these exercises, especially the ones that involve large or challenging data sets and problems, may be used as semester-long term projects.) A minimal clustering sketch follows this assignment.
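For the clustering option, a minimal Python/scikit-learn sketch is shown below. The file name housing.data and its whitespace-delimited, headerless format are assumptions about one particular UCI download; adapt the loading step to whichever data set you choose.

# A minimal clustering sketch (Python + scikit-learn) for a downloaded UCI data set.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Assumed: a whitespace-delimited, headerless file downloaded from the UCI repository.
data = pd.read_csv("housing.data", sep=r"\s+", header=None)
X = StandardScaler().fit_transform(data.select_dtypes("number"))

# Try several cluster counts and compare silhouette scores (higher is better).
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}  silhouette={silhouette_score(X, labels):.3f}")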
4. Large and feature-rich data sets are made available by
the U.S. government or its subsidiaries on the Internet.
For instance, see a large collection of government data
sets (data.gov), the Centers for Disease Control and
Prevention data sets (www.cdc.gov/DataStatistics),
the National Cancer Institute's Surveillance, Epidemiology, and End Results (SEER) data sets (http://seer.cancer.gov/data), and the
Department of Transportation’s Fatality Analysis Reporting
System crash data sets (www.nhtsa.gov/FARS). These
data sets are not preprocessed for data mining, which
makes them a great resource to experience the complete
data mining process. Another rich source for a collec-
tion of analytics data sets is listed on KDnuggets.com
(KDnuggets.com/datasets/index.html).
5. Consider the following data set, which includes three
attributes and a classification for admission decisions
into an MBA program:
GMAT Score   GPA    Quantitative GMAT (percentile)   Decision
650          2.75   35                               No
580          3.50   70                               No
600          3.50   75                               Yes
450          2.95   80                               No
700          3.25   90                               Yes
590          3.50   80                               Yes
400          3.85   45                               No
640          3.50   75                               Yes
540          3.00   60                               ?
690          2.85   80                               ?
490          4.00   65                               ?
a. Using the data shown, develop your own manual
expert rules for decision making.
b. Use the Gini index to build a decision tree. You can
use manual calculations or a spreadsheet to perform
the basic calculations.
c. Use an automated decision tree software program to
build a tree for the same data.
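For part (b), the sketch below (Python) computes the Gini impurity of the eight labeled records and the weighted Gini of two illustrative candidate splits; the thresholds shown are examples rather than the optimal split, and the rows with unknown decisions are excluded from the calculation.

# A minimal sketch for part (b): Gini index calculations on the labeled MBA
# admission records above. Split thresholds are illustrative choices only.

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_gini(rows, attr_index, threshold):
    """Weighted Gini impurity of splitting rows on attribute >= threshold."""
    left  = [r[-1] for r in rows if r[attr_index] >= threshold]
    right = [r[-1] for r in rows if r[attr_index] <  threshold]
    n = len(rows)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# (GMAT, GPA, Quantitative percentile, Decision) -- labeled rows only
data = [
    (650, 2.75, 35, "No"),  (580, 3.50, 70, "No"),
    (600, 3.50, 75, "Yes"), (450, 2.95, 80, "No"),
    (700, 3.25, 90, "Yes"), (590, 3.50, 80, "Yes"),
    (400, 3.85, 45, "No"),  (640, 3.50, 75, "Yes"),
]

print("Gini before split:", gini([r[-1] for r in data]))   # 0.5 (4 Yes, 4 No)
print("Gini after GMAT >= 590:", split_gini(data, 0, 590))
print("Gini after GPA  >= 3.25:", split_gini(data, 1, 3.25))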
Internet Exercises
1. Visit the AI Exploratorium at cs.ualberta.ca/~aixplore.
Click the Decision Tree link. Read the narrative on bas-
ketball game statistics. Examine the data, and then build
a decision tree. Report your impressions of its accuracy.
Also explore the effects of different algorithms.
2. Survey some data mining tools and vendors. Start with
fico.com and egain.com. Consult dmreview.com,
and identify some data mining products and service
providers that are not mentioned in this chapter.
3. Find recent cases of successful data mining applications.
Visit the Web sites of some data mining vendors, and
look for cases or success stories. Prepare a report sum-
marizing five new case studies.
4. Go to vendor Web sites (especially those of SAS, SPSS,
Cognos, Teradata, StatSoft, and Fair Isaac) and look at
success stories for BI (OLAP and data mining) tools.
What do the various success stories have in common?
How do they differ?
5. Go to statsoft.com (now a Dell company). Download
at least three white papers on applications. Which of
these applications might have used the data/text/Web
mining techniques discussed in this chapter?
6. Go to sas.com. Download at least three white papers
on applications. Which of these applications could have
used the data/text/Web mining techniques discussed in
this chapter?
7. Go to spss.com (an IBM company). Download at least
three white papers on applications. Which of these
applications could have used the data/text/Web mining
techniques discussed in this chapter?
8. Go to teradata.com. Download at least three white
papers on applications. Which of these applications
could have used the data/text/Web mining techniques
discussed in this chapter?
9. Go to fico.com. Download at least three white papers
on applications. Which of these applications could have
used the data/text/Web mining techniques discussed in
this chapter?
10. Go to salfordsystems.com. Download at least three
white papers on applications. Which of these applica-
tions could have used the data/text/Web mining tech-
niques discussed in this chapter?
11. Go to rulequest.com. Download at least three white
papers on applications. Which of these applications
could have used the data/text/Web mining techniques
discussed in this chapter?
12. Go to KDnuggets.com. Explore the sections on
applications as well as software. Find names of at
least three additional packages for data mining and
text mining.
References
Chan, P., Phan, W., Prodromidis, A., & Stolfo, S. (1999). “Dis-
tributed Data Mining in Credit Card Fraud Detection.” IEEE
Intelligent Systems, 14(6), 67–74.
CRISP-DM. (2013). “Cross-Industry Standard Process for Data Mining (CRISP-DM).” http://crisp-dm.org; www.the-modeling-agency.com/crisp-dm (accessed February 2, 2013).
Davenport, T. (2006, January). “Competing on Analytics.”
Harvard Business Review, 99–107.
Delen, D. (2009). “Analysis of Cancer Data: A Data Mining
Approach.” Expert Systems, 26(1), 100–112.
Delen, D. (2014). Real-World Data Mining: Applied Business
Analytics and Decision Making. Upper Saddle River, NJ:
Pearson.
Delen, D., Cogdell, D., & Kasap, N. (2012). “A Comparative
Analysis of Data Mining Methods in Predicting NCAA
Bowl Outcomes.” International Journal of Forecasting,
28, 543–552.
Delen, D., Sharda, R., & Kumar, P. (2007). “Movie Forecast
Guru: A Web-Based DSS for Hollywood Managers.” Deci-
sion Support Systems, 43(4), 1151–1170.
Delen, D., Walker, G., & Kadam, A. (2005). “Predicting Breast
Cancer Survivability: A Comparison of Three Data Mining
Methods.” Artificial Intelligence in Medicine, 34(2), 113–127.
Dunham, M. (2003). Data Mining: Introductory and Ad-
vanced Topics. Upper Saddle River, NJ: Prentice Hall.
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). “From Data Mining to Knowledge Discovery in Databases.” AI Magazine, 17(3), 37–54.
Hoffman, T. (1998, December 7). “Banks Turn to IT to Re-
claim Most Profitable Customers.” Computerworld.
Hoffman, T. (1999, April 19). “Insurers Mine for Age-
Appropriate Offering.” Computerworld.
Kohonen, T. (1982). “Self-Organized Formation of Topologically
Correct Feature Maps.” Biological Cybernetics, 43(1), 59–69.
Nemati, H., & Barko, C. (2001). “Issues in Organizational Data
Mining: A Survey of Current Practices.” Journal of Data
Warehousing, 6(1), 25–36.
Nisbet, R., Miner, G., & Elder IV, J. (2009). “Top 10 Data Min-
ing Mistakes.” Handbook of Statistical Analysis and Data
Mining Applications. Academic Press, pp. 733–754.
Quinlan, J. (1986). “Induction of Decision Trees.” Machine
Learning, 1, 81–106.
Saltelli, A. (2002). “Making Best Use of Model Evaluations to
Compute Sensitivity Indices,” Computer Physics Communi-
cations, 145, 280–297.
Saltelli, A., Tarantola, S., Campolongo, F., & Ratto, M. (2004).
Sensitivity Analysis in Practice – A Guide to Assessing Sci-
entific Models. Hoboken, NJ: John Wiley.
SEMMA. (2009). “SAS’s Data Mining Process: Sample, Explore,
Modify, Model, Assess.” sas.com/offices/europe/uk/
technologies/analytics/datamining/miner/semma.
html (accessed August 2009).
Sharda, R., & Delen, D. (2006). “Predicting Box-Office Success
of Motion Pictures with Neural Networks.” Expert Systems
with Applications, 30, 243–254.
Shultz, R. (2004, December 7). “Live from NCDM: Tales of
Database Buffoonery.” directmag.com/news/ncdm-12-
07-04/index.html (accessed April 2009).
Skalak, D. (2001). “Data Mining Blunders Exposed!” DB2
Magazine, 6(2), 10–13.
Thongkam, J., Xu, G., Zhang, Y., & Huang, F. (2009). “Toward
Breast Cancer Survivability Prediction Models Through Im-
proving Training Space.” Expert Systems with Applications,
36(10), 12200–12209.
Wald, M. (2004, February 21). “U.S. Calls Release of JetBlue
Data Improper.” The New York Times.
Wright, C. (2012). “Statistical Predictors of March Mad-
ness: An Examination of the NCAA Men’s Basketball
Championship.” http://economics-files.pomona.
edu/GarySmith/Econ190/Wright%20March%20
Madness%20Final%20Paper (accessed February
2, 2013).
Zaima, A. (2003). “The Five Myths of Data Mining.” What
Works: Best Practices in Business Intelligence and Data
Warehousing, Vol. 15. Chatsworth, CA: The Data Ware-
housing Institute, pp. 42–43.
Zolbanin, H., Delen, D., & Zadeh, A. (2015). “Predicting Over-
all Survivability in Comorbidity of Cancers: A Data Mining
Approach.” Decision Support Systems, 74, 150–161.