SOLUTION: San Jose State University Continuous and Categorical Attributes Discussion

INTRODUCTION TO DATA MINING
SECOND EDITION
PANG-NING TAN
Michigan State University
MICHAEL STEINBACH
University of Minnesota
ANUJ KARPATNE
University of Minnesota
VIPIN KUMAR
University of Minnesota
330 Hudson Street, NY NY 10013
Director, Portfolio Management: Engineering, Computer Science & Global
Editions: Julian Partridge
Specialist, Higher Ed Portfolio Management: Matt Goldstein
Portfolio Management Assistant: Meghan Jacoby
Managing Content Producer: Scott Disanno
Content Producer: Carole Snyder
Web Developer: Steve Wright
Rights and Permissions Manager: Ben Ferrini
Manufacturing Buyer, Higher Ed, Lake Side Communications Inc (LSC):
Maura Zaldivar-Garcia
Inventory Manager: Ann Lam
Product Marketing Manager: Yvonne Vannatta
Field Marketing Manager: Demetrius Hall
Marketing Assistant: Jon Bryant
Cover Designer: Joyce Wells, jWellsDesign
Full-Service Project Management: Chandrasekar Subramanian, SPi Global
Copyright ©2019 Pearson Education, Inc. All rights reserved. Manufactured in
the United States of America. This publication is protected by Copyright, and
permission should be obtained from the publisher prior to any prohibited
reproduction, storage in a retrieval system, or transmission in any form or by
any means, electronic, mechanical, photocopying, recording, or likewise. For
information regarding permissions, request forms and the appropriate
contacts within the Pearson Education Global Rights & Permissions
department, please visit www.pearsonhighed.com/permissions/.
Many of the designations by manufacturers and sellers to distinguish their
products are claimed as trademarks. Where those designations appear in this
book, and the publisher was aware of a trademark claim, the designations
have been printed in initial caps or all caps.
Library of Congress Cataloging-in-Publication Data on File
Names: Tan, Pang-Ning, author. | Steinbach, Michael, author. | Karpatne,
Anuj, author. | Kumar, Vipin, 1956- author.
Title: Introduction to Data Mining / Pang-Ning Tan, Michigan State University,
Michael Steinbach, University of Minnesota, Anuj Karpatne, University of
Minnesota, Vipin Kumar, University of Minnesota.
Description: Second edition. | New York, NY : Pearson Education, [2019] |
Includes bibliographical references and index.
Identifiers: LCCN 2017048641 | ISBN 9780133128901 | ISBN 0133128903
Subjects: LCSH: Data mining.
Classification: LCC QA76.9.D343 T35 2019 | DDC 006.3/12–dc23 LC record
available at https://lccn.loc.gov/2017048641
ISBN-10: 0133128903
ISBN-13: 9780133128901
To our families …
Preface to the Second Edition
Since the first edition, roughly 12 years ago, much has changed in the field of
data analysis. The volume and variety of data being collected continue to
increase, as does the rate (velocity) at which it is being collected and used to
make decisions. Indeed, the term, Big Data, has been used to refer to the
massive and diverse data sets now available. In addition, the term data
science has been coined to describe an emerging area that applies tools and
techniques from various fields, such as data mining, machine learning,
statistics, and many others, to extract actionable insights from data, often big
data.
The growth in data has created numerous opportunities for all areas of data
analysis. The most dramatic developments have been in the area of predictive
modeling, across a wide range of application domains. For instance, recent
advances in neural networks, known as deep learning, have shown
impressive results in a number of challenging areas, such as image
classification, speech recognition, as well as text categorization and
understanding. Although less dramatic, other areas, such as clustering,
association analysis, and anomaly detection, have also continued to advance.
This new edition is in response to those advances.
Overview
As with the first edition, the second edition of the book provides a
comprehensive introduction to data mining and is designed to be accessible
and useful to students, instructors, researchers, and professionals. Areas
covered include data preprocessing, predictive modeling, association
analysis, cluster analysis, anomaly detection, and avoiding false discoveries.
The goal is to present fundamental concepts and algorithms for each topic,
thus providing the reader with the necessary background for the application of
data mining to real problems. As before, classification, association analysis,
and cluster analysis are each covered in a pair of chapters. The introductory
chapter covers basic concepts, representative algorithms, and evaluation
techniques, while the following chapter discusses more advanced concepts
and algorithms. As before, our objective is to provide the reader with a sound
understanding of the foundations of data mining, while still covering many
important advanced topics. Because of this approach, the book is useful both
as a learning tool and as a reference.
To help readers better understand the concepts that have been presented, we
provide an extensive set of examples, figures, and exercises. The solutions to
the original exercises, which are already circulating on the web, will be made
public. The exercises are mostly unchanged from the last edition, with the
exception of new exercises in the chapter on avoiding false discoveries. New
exercises for the other chapters and their solutions will be available to
instructors via the web. Bibliographic notes are included at the end of each
chapter for readers who are interested in more advanced topics, historically
important papers, and recent trends. These have also been significantly
updated. The book also contains a comprehensive subject and author index.
What is New in the Second Edition?
Some of the most significant improvements in the text have been in the two
chapters on classification. The introductory chapter uses the decision tree
classifier for illustration, but the discussion on many topics—those that apply
across all classification approaches—has been greatly expanded and
clarified, including topics such as overfitting, underfitting, the impact of training
size, model complexity, model selection, and common pitfalls in model
evaluation. Almost every section of the advanced classification chapter has
been significantly updated. The material on Bayesian networks, support vector
machines, and artificial neural networks has been significantly expanded. We
have added a separate section on deep networks to address the current
developments in this area. The discussion of evaluation, which occurs in the
section on imbalanced classes, has also been updated and improved.
The changes in association analysis are more localized. We have completely
reworked the section on the evaluation of association patterns (introductory
chapter), as well as the sections on sequence and graph mining (advanced
chapter). Changes to cluster analysis are also localized. The introductory
chapter added the K-means initialization technique and an updated
discussion of cluster evaluation. The advanced clustering chapter adds a new
section on spectral graph clustering. Anomaly detection has been greatly
revised and expanded. Existing approaches—statistical, nearest
neighbor/density-based, and clustering-based—have been retained and
updated, while new approaches have been added: reconstruction-based,
one-class classification, and information-theoretic. The reconstruction-based
approach is illustrated using autoencoder networks that are part of the deep
learning paradigm. The data chapter has been updated to include discussions
of mutual information and kernel-based techniques.
The last chapter, which discusses how to avoid false discoveries and produce
valid results, is completely new, and is novel among other contemporary
textbooks on data mining. It supplements the discussions in the other
chapters with a discussion of the statistical concepts (statistical significance,
p-values, false discovery rate, permutation testing, etc.) relevant to avoiding
spurious results, and then illustrates these concepts in the context of data
mining techniques. This chapter addresses the increasing concern over the
validity and reproducibility of results obtained from data analysis. The addition
of this last chapter is a recognition of the importance of this topic and an
acknowledgment that a deeper understanding of this area is needed for those
analyzing data.
The data exploration chapter and the appendices have been removed from
the print edition of the book but will remain available on the web. A new
appendix provides a brief discussion of scalability in the context of big data.
To the Instructor
As a textbook, this book is suitable for a wide range of students at the
advanced undergraduate or graduate level. Since students come to this
subject with diverse backgrounds that may not include extensive knowledge of
statistics or databases, our book requires minimal prerequisites. No database
knowledge is needed, and we assume only a modest background in statistics
or mathematics, although such a background will make for easier going in
some sections. As before, the book, and more specifically, the chapters
covering major data mining topics, are designed to be as self-contained as
possible. Thus, the order in which topics can be covered is quite flexible. The
core material is covered in chapters 2 (data), 3 (classification), 5 (association
analysis), 7 (clustering), and 9 (anomaly detection). We recommend at least a
cursory coverage of Chapter 10 (Avoiding False Discoveries) to instill in
students some caution when interpreting the results of their data analysis.
Although the introductory data chapter (2) should be covered first, the basic
classification (3), association analysis (5), and clustering chapters (7), can be
covered in any order. Because of the relationship of anomaly detection (9) to
classification (3) and clustering (7), these chapters should precede Chapter 9.
Various topics can be selected from the advanced classification, association
analysis, and clustering chapters (4, 6, and 8, respectively) to fit the schedule
and interests of the instructor and students. We also advise that the lectures
be augmented by projects or practical exercises in data mining. Although they
are time consuming, such hands-on assignments greatly enhance the value of
the course.
Support Materials
Support materials for all readers of this book are available at
http://www-users.cs.umn.edu/~kumar/dmbook.
PowerPoint lecture slides
Suggestions for student projects
Data mining resources, such as algorithms and data sets
Online tutorials that give step-by-step examples for selected data mining
techniques described in the book using actual data sets and data analysis
software
Additional support materials, including solutions to exercises, are available
only to instructors adopting this textbook for classroom use. The book’s
resources will be mirrored at www.pearsonhighered.com/cs-resources.
Comments and suggestions, as well as reports of errors, can be sent to the
authors through dmbook@cs.umn.edu.
Acknowledgments
Many people contributed to the first and second editions of the book. We
begin by acknowledging our families to whom this book is dedicated. Without
their patience and support, this project would have been impossible.
We would like to thank the current and former students of our data mining
groups at the University of Minnesota and Michigan State for their
contributions. Eui-Hong (Sam) Han and Mahesh Joshi helped with the initial
data mining classes. Some of the exercises and presentation slides that they
created can be found in the book and its accompanying slides. Students in our
data mining groups who provided comments on drafts of the book or who
contributed in other ways include Shyam Boriah, Haibin Cheng, Varun
Chandola, Eric Eilertson, Levent Ertöz, Jing Gao, Rohit Gupta, Sridhar Iyer,
Jung-Eun Lee, Benjamin Mayer, Aysel Ozgur, Uygar Oztekin, Gaurav Pandey,
Kashif Riaz, Jerry Scripps, Gyorgy Simon, Hui Xiong, Jieping Ye, and
Pusheng Zhang. We would also like to thank the students of our data mining
classes at the University of Minnesota and Michigan State University who
worked with early drafts of the book and provided invaluable feedback. We
specifically note the helpful suggestions of Bernardo Craemer, Arifin Ruslim,
Jamshid Vayghan, and Yu Wei.
Joydeep Ghosh (University of Texas) and Sanjay Ranka (University of Florida)
class tested early versions of the book. We also received many useful
suggestions directly from the following UT students: Pankaj Adhikari, Rajiv
Bhatia, Frederic Bosche, Arindam Chakraborty, Meghana Deodhar, Chris
Everson, David Gardner, Saad Godil, Todd Hay, Clint Jones, Ajay Joshi,
Joonsoo Lee, Yue Luo, Anuj Nanavati, Tyler Olsen, Sunyoung Park, Aashish
Phansalkar, Geoff Prewett, Michael Ryoo, Daryl Shannon, and Mei Yang.
Ronald Kostoff (ONR) read an early version of the clustering chapter and
offered numerous suggestions. George Karypis provided invaluable LaTeX
assistance in creating an author index. Irene Moulitsas also provided
assistance with LaTeX and reviewed some of the appendices. Musetta
Steinbach was very helpful in finding errors in the figures.
We would like to acknowledge our colleagues at the University of Minnesota
and Michigan State who have helped create a positive environment for data
mining research. They include Arindam Banerjee, Dan Boley, Joyce Chai, Anil
Jain, Ravi Janardan, Rong Jin, George Karypis, Claudia Neuhauser, Haesun
Park, William F. Punch, György Simon, Shashi Shekhar, and Jaideep
Srivastava. The collaborators on our many data mining projects, who also
have our gratitude, include Ramesh Agrawal, Maneesh Bhargava, Steve
Cannon, Alok Choudhary, Imme Ebert-Uphoff, Auroop Ganguly, Piet C. de
Groen, Fran Hill, Yongdae Kim, Steve Klooster, Kerry Long, Nihar Mahapatra,
Rama Nemani, Nikunj Oza, Chris Potter, Lisiane Pruinelli, Nagiza Samatova,
Jonathan Shapiro, Kevin Silverstein, Brian Van Ness, Bonnie Westra, Nevin
Young, and Zhi-Li Zhang.
The departments of Computer Science and Engineering at the University of
Minnesota and Michigan State University provided computing resources and a
supportive environment for this project. ARDA, ARL, ARO, DOE, NASA,
NOAA, and NSF provided research support for Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, and Vipin Kumar. In particular, Kamal Abdali, Mitra
Basu, Dick Brackney, Jagdish Chandra, Joe Coughlan, Michael Coyle,
Stephen Davis, Frederica Darema, Richard Hirsch, Chandrika Kamath,
Tsengdar Lee, Raju Namburu, N. Radhakrishnan, James Sidoran, Sylvia
Spengler, Bhavani Thuraisingham, Walt Tiernin, Maria Zemankova, Aidong
Zhang, and Xiaodong Zhang have been supportive of our research in data
mining and high-performance computing.
It was a pleasure working with the helpful staff at Pearson Education. In
particular, we would like to thank Matt Goldstein, Kathy Smith, Carole Snyder,
and Joyce Wells. We would also like to thank George Nichols, who helped
with the artwork, and Paul Anagnostopoulos, who provided LaTeX support.
We are grateful to the following Pearson reviewers: Leman Akoglu (Carnegie
Mellon University), Chien-Chung Chan (University of Akron), Zhengxin Chen
(University of Nebraska at Omaha), Chris Clifton (Purdue University), Joydeep Ghosh (University of Texas, Austin), Nazli Goharian (Illinois Institute of
Technology), J. Michael Hardin (University of Alabama), Jingrui He (Arizona
State University), James Hearne (Western Washington University), Hillol
Kargupta (University of Maryland, Baltimore County and Agnik, LLC), Eamonn
Keogh (University of California-Riverside), Bing Liu (University of Illinois at
Chicago), Mariofanna Milanova (University of Arkansas at Little Rock),
Srinivasan Parthasarathy (Ohio State University), Zbigniew W. Ras (University
of North Carolina at Charlotte), Xintao Wu (University of North Carolina at
Charlotte), and Mohammed J. Zaki (Rensselaer Polytechnic Institute).
Over the years since the first edition, we have also received numerous
comments from readers and students who have pointed out typos and various
other issues. We are unable to mention these individuals by name, but their
input is much appreciated and has been taken into account for the second
edition.
Contents
Preface to the Second Edition
1 Introduction
1.1 What Is Data Mining?
1.2 Motivating Challenges
1.3 The Origins of Data Mining
1.4 Data Mining Tasks
1.5 Scope and Organization of the Book
1.6 Bibliographic Notes
1.7 Exercises
2 Data
2.1 Types of Data
2.1.1 Attributes and Measurement
2.1.2 Types of Data Sets
2.2 Data Quality
2.2.1 Measurement and Data Collection Issues
2.2.2 Issues Related to Applications
2.3 Data Preprocessing
2.3.1 Aggregation
2.3.2 Sampling
2.3.3 Dimensionality Reduction
2.3.4 Feature Subset Selection
2.3.5 Feature Creation
2.3.6 Discretization and Binarization
2.3.7 Variable Transformation
2.4 Measures of Similarity and Dissimilarity
2.4.1 Basics
2.4.2 Similarity and Dissimilarity between Simple Attributes
2.4.3 Dissimilarities between Data Objects
2.4.4 Similarities between Data Objects
2.4.5 Examples of Proximity Measures
2.4.6 Mutual Information
2.4.7 Kernel Functions*
2.4.8 Bregman Divergence*
2.4.9 Issues in Proximity Calculation
2.4.10 Selecting the Right Proximity Measure
2.5 Bibliographic Notes
2.6 Exercises
3 Classification: Basic Concepts and Techniques
3.1 Basic Concepts
3.2 General Framework for Classification
3.3 Decision Tree Classifier
3.3.1 A Basic Algorithm to Build a Decision Tree
3.3.2 Methods for Expressing Attribute Test Conditions
3.3.3 Measures for Selecting an Attribute Test Condition
3.3.4 Algorithm for Decision Tree Induction
3.3.5 Example Application: Web Robot Detection
3.3.6 Characteristics of Decision Tree Classifiers
3.4 Model Overfitting
3.4.1 Reasons for Model Overfitting
3.5 Model Selection
3.5.1 Using a Validation Set
3.5.2 Incorporating Model Complexity
3.5.3 Estimating Statistical Bounds
3.5.4 Model Selection for Decision Trees
3.6 Model Evaluation
3.6.1 Holdout Method
3.6.2 Cross-Validation
3.7 Presence of Hyper-parameters
3.7.1 Hyper-parameter Selection
3.7.2 Nested Cross-Validation
3.8 Pitfalls of Model Selection and Evaluation
3.8.1 Overlap between Training and Test Sets
3.8.2 Use of Validation Error as Generalization Error
3.9 Model Comparison*
3.9.1 Estimating the Confidence Interval for Accuracy
3.9.2 Comparing the Performance of Two Models
3.10 Bibliographic Notes
3.11 Exercises
4 Classification: Alternative Techniques
4.1 Types of Classifiers
4.2 Rule-Based Classifier
4.2.1 How a Rule-Based Classifier Works
4.2.2 Properties of a Rule Set
4.2.3 Direct Methods for Rule Extraction
4.2.4 Indirect Methods for Rule Extraction
4.2.5 Characteristics of Rule-Based Classifiers
4.3 Nearest Neighbor Classifiers
4.3.1 Algorithm
4.3.2 Characteristics of Nearest Neighbor Classifiers
4.4 Naïve Bayes Classifier
4.4.1 Basics of Probability Theory
4.4.2 Naïve Bayes Assumption
4.5 Bayesian Networks
4.5.1 Graphical Representation
4.5.2 Inference and Learning
4.5.3 Characteristics of Bayesian Networks
4.6 Logistic Regression
4.6.1 Logistic Regression as a Generalized Linear Model
4.6.2 Learning Model Parameters
4.6.3 Characteristics of Logistic Regression
4.7 Artificial Neural Network (ANN)
4.7.1 Perceptron
4.7.2 Multi-layer Neural Network
4.7.3 Characteristics of ANN
4.8 Deep Learning
4.8.1 Using Synergistic Loss Functions
4.8.2 Using Responsive Activation Functions
4.8.3 Regularization
4.8.4 Initialization of Model Parameters
4.8.5 Characteristi…