Title:
Machine Learning with Spark and Python : Essential Techniques for Predictive Analytics
Personal Author:
Bowles, Michael
Edition:
Second Edition
Physical Description:
xxvii, 340 pages : illustrations ; 24 cm.
ISBN:
9781119561934
Abstract:
Machine Learning with Spark and Python: Essential Techniques for Predictive Analytics, Second Edition simplifies machine learning for practical uses by focusing on two key algorithm families. This second edition adds Spark, a machine learning framework from the Apache Software Foundation. With Spark, students of machine learning can easily process much larger data sets and call Spark's algorithms from ordinary Python code. Machine Learning with Spark and Python focuses on two algorithm families (linear methods and ensemble methods) that effectively predict outcomes. This class of problem covers many use cases, such as deciding which ad to place on a web page, predicting prices in securities markets, or detecting credit card fraud. The focus on two families leaves room for full descriptions of the mechanisms at work in the algorithms, and the code examples illustrate that machinery with specific, hackable code.
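The penalized linear methods at the center of the book can be previewed with a minimal ridge-regression sketch. This is a generic illustration in plain NumPy (not code from the book): the closed-form solution w = (XᵀX + λI)⁻¹ Xᵀy, where the penalty λ shrinks coefficients toward zero to control overfitting.

```python
import numpy as np

# Tiny synthetic regression problem: y depends linearly on two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: solve (X^T X + lam*I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

w = ridge_fit(X, y, lam=1.0)
print(w)  # coefficients near the true (3, -2), slightly shrunk toward zero
```

Increasing `lam` trades a little bias for lower variance; at `lam=0` the formula reduces to ordinary least squares, the baseline the book's Chapter 4 builds on.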

Available:

Item Barcode: 30000010371675
Call Number: Q325.5 B69 2019
Material Type: Open Access Book
Item Category 1: Book




Author Notes

Michael Bowles teaches machine learning at UC Berkeley, the University of New Haven, and Hacker Dojo in Silicon Valley. He consults on machine learning projects and is involved in a number of startups in areas such as semiconductor inspection, drug design and optimization, and trading in the financial markets. Following an assistant professorship at MIT, Michael founded and ran two Silicon Valley startups, both of which went public. His courses are consistently popular and receive great feedback from participants.


Table of Contents

Introduction (p. xxi)
Chapter 1 The Two Essential Algorithms for Making Predictions (p. 1)
Why Are These Two Algorithms So Useful? (p. 2)
What Are Penalized Regression Methods? (p. 7)
What Are Ensemble Methods? (p. 9)
How to Decide Which Algorithm to Use (p. 11)
The Process Steps for Building a Predictive Model (p. 13)
Framing a Machine Learning Problem (p. 15)
Feature Extraction and Feature Engineering (p. 17)
Determining Performance of a Trained Model (p. 18)
Chapter Contents and Dependencies (p. 18)
Summary (p. 20)
Chapter 2 Understand the Problem by Understanding the Data (p. 23)
The Anatomy of a New Problem (p. 24)
Different Types of Attributes and Labels Drive Modeling Choices (p. 26)
Things to Notice about Your New Data Set (p. 27)
Classification Problems: Detecting Unexploded Mines Using Sonar (p. 28)
Physical Characteristics of the Rocks Versus Mines Data Set (p. 29)
Statistical Summaries of the Rocks Versus Mines Data Set (p. 32)
Visualization of Outliers Using a Quantile-Quantile Plot (p. 34)
Statistical Characterization of Categorical Attributes (p. 35)
How to Use Python Pandas to Summarize the Rocks Versus Mines Data Set (p. 36)
Visualizing Properties of the Rocks Versus Mines Data Set (p. 39)
Visualizing with Parallel Coordinates Plots (p. 39)
Visualizing Interrelationships between Attributes and Labels (p. 41)
Visualizing Attribute and Label Correlations Using a Heat Map (p. 48)
Summarizing the Process for Understanding the Rocks Versus Mines Data Set (p. 50)
Real-Valued Predictions with Factor Variables: How Old Is Your Abalone? (p. 50)
Parallel Coordinates for Regression Problems: Visualize Variable Relationships for the Abalone Problem (p. 55)
How to Use a Correlation Heat Map for Regression: Visualize Pair-Wise Correlations for the Abalone Problem (p. 59)
Real-Valued Predictions Using Real-Valued Attributes: Calculate How Your Wine Tastes (p. 61)
Multiclass Classification Problem: What Type of Glass Is That? (p. 67)
Using PySpark to Understand Large Data Sets (p. 72)
Summary (p. 75)
Chapter 3 Predictive Model Building: Balancing Performance, Complexity, and Big Data (p. 77)
The Basic Problem: Understanding Function Approximation (p. 78)
Working with Training Data (p. 79)
Assessing Performance of Predictive Models (p. 81)
Factors Driving Algorithm Choices and Performance: Complexity and Data (p. 82)
Contrast between a Simple Problem and a Complex Problem (p. 82)
Contrast between a Simple Model and a Complex Model (p. 85)
Factors Driving Predictive Algorithm Performance (p. 89)
Choosing an Algorithm: Linear or Nonlinear? (p. 90)
Measuring the Performance of Predictive Models (p. 91)
Performance Measures for Different Types of Problems (p. 91)
Simulating Performance of Deployed Models (p. 105)
Achieving Harmony between Model and Data (p. 107)
Choosing a Model to Balance Problem Complexity, Model Complexity, and Data Set Size (p. 107)
Using Forward Stepwise Regression to Control Overfitting (p. 109)
Evaluating and Understanding Your Predictive Model (p. 114)
Control Overfitting by Penalizing Regression Coefficients: Ridge Regression (p. 116)
Using PySpark for Training Penalized Regression Models on Extremely Large Data Sets (p. 124)
Summary (p. 127)
Chapter 4 Penalized Linear Regression (p. 129)
Why Penalized Linear Regression Methods Are So Useful (p. 130)
Extremely Fast Coefficient Estimation (p. 130)
Variable Importance Information (p. 131)
Extremely Fast Evaluation When Deployed (p. 131)
Reliable Performance (p. 131)
Sparse Solutions (p. 132)
Problem May Require Linear Model (p. 132)
When to Use Ensemble Methods (p. 132)
Penalized Linear Regression: Regulating Linear Regression for Optimum Performance (p. 132)
Training Linear Models: Minimizing Errors and More (p. 135)
Adding a Coefficient Penalty to the OLS Formulation (p. 136)
Other Useful Coefficient Penalties: Manhattan and ElasticNet (p. 137)
Why Lasso Penalty Leads to Sparse Coefficient Vectors (p. 138)
ElasticNet Penalty Includes Both Lasso and Ridge (p. 140)
Solving the Penalized Linear Regression Problem (p. 141)
Understanding Least Angle Regression and Its Relationship to Forward Stepwise Regression (p. 141)
How LARS Generates Hundreds of Models of Varying Complexity (p. 145)
Choosing the Best Model from the Hundreds LARS Generates (p. 147)
Using Glmnet: Very Fast and Very General (p. 152)
Comparison of the Mechanics of Glmnet and LARS Algorithms (p. 153)
Initializing and Iterating the Glmnet Algorithm (p. 153)
Extension of Linear Regression to Classification Problems (p. 157)
Solving Classification Problems with Penalized Regression (p. 157)
Working with Classification Problems Having More Than Two Outcomes (p. 161)
Understanding Basis Expansion: Using Linear Methods on Nonlinear Problems (p. 161)
Incorporating Non-Numeric Attributes into Linear Methods (p. 163)
Summary (p. 166)
Chapter 5 Building Predictive Models Using Penalized Linear Methods (p. 169)
Python Packages for Penalized Linear Regression (p. 170)
Multivariable Regression: Predicting Wine Taste (p. 171)
Building and Testing a Model to Predict Wine Taste (p. 172)
Training on the Whole Data Set before Deployment (p. 175)
Basis Expansion: Improving Performance by Creating New Variables from Old Ones (p. 179)
Binary Classification: Using Penalized Linear Regression to Detect Unexploded Mines (p. 182)
Build a Rocks Versus Mines Classifier for Deployment (p. 191)
Multiclass Classification: Classifying Crime Scene Glass Samples (p. 200)
Linear Regression and Classification Using PySpark (p. 203)
Using PySpark to Predict Wine Taste (p. 204)
Logistic Regression with PySpark: Rocks Versus Mines (p. 208)
Incorporating Categorical Variables in a PySpark Model: Predicting Abalone Rings (p. 213)