Availability
Library | Item Barcode | Call Number | Material Type | Item Category 1 | Status |
---|---|---|---|---|---|
 | 30000010371675 | Q325.5 B69 2019 | Open Access Book | Book | On Order |
Summary
Machine Learning with Spark and Python: Essential Techniques for Predictive Analytics, Second Edition simplifies machine learning for practical uses by focusing on two key algorithms. This new second edition improves on the first with the addition of Spark, a machine learning framework from the Apache Software Foundation. With Spark, machine learning students can easily process much larger data sets and call Spark's algorithms using ordinary Python code.
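As a minimal sketch of that workflow (assuming a local Spark installation; the file name ratings.csv is hypothetical), a data set too large for one machine's memory can be loaded and summarized from ordinary Python:

```python
from pyspark.sql import SparkSession

# Start a local Spark session; the same code scales out to a
# cluster by changing only the deployment configuration.
spark = SparkSession.builder.appName("explore").getOrCreate()

# Hypothetical file name for illustration; Spark partitions the
# read across workers, so the file need not fit in local memory.
df = spark.read.csv("ratings.csv", header=True, inferSchema=True)

df.describe().show()  # column-wise statistical summary
spark.stop()
```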
Machine Learning with Spark and Python focuses on two algorithm families (linear methods and ensemble methods) that effectively predict outcomes. This type of problem covers many use cases, such as deciding which ad to place on a web page, predicting prices in securities markets, or detecting credit card fraud. The focus on two families leaves enough room for full descriptions of the mechanisms at work in each algorithm, and the code examples then illustrate that machinery with specific, hackable code.
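Both families are available in PySpark's ml library; the sketch below is illustrative only (toy data, and parameter values not taken from the book), showing a penalized linear model and a tree ensemble trained side by side:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression, GBTRegressor
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("two-families").getOrCreate()

# Toy data: (x1, x2, label). In practice this would be read from
# storage as in the previous sketch.
rows = [(1.0, 2.0, 3.5), (2.0, 1.0, 2.5), (3.0, 4.0, 7.0), (4.0, 3.0, 6.0)]
df = spark.createDataFrame(rows, ["x1", "x2", "y"])
train = VectorAssembler(inputCols=["x1", "x2"],
                        outputCol="features").transform(df)

# Linear family: elastic-net penalized linear regression.
linear = LinearRegression(featuresCol="features", labelCol="y",
                          regParam=0.1, elasticNetParam=0.5).fit(train)

# Ensemble family: gradient-boosted trees.
ensemble = GBTRegressor(featuresCol="features", labelCol="y").fit(train)

linear.transform(train).select("y", "prediction").show()
ensemble.transform(train).select("y", "prediction").show()
spark.stop()
```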
Author Notes
Michael Bowles teaches machine learning at UC Berkeley, the University of New Haven, and Hacker Dojo in Silicon Valley; consults on machine learning projects; and is involved in a number of startups in areas such as semiconductor inspection, drug design and optimization, and trading in the financial markets. Following an assistant professorship at MIT, Michael went on to found and run two Silicon Valley startups, both of which went public. His courses are always popular and receive great feedback from participants.
Table of Contents
Introduction | p. xxi |
Chapter 1 The Two Essential Algorithms for Making Predictions | p. 1 |
Why Are These Two Algorithms So Useful? | p. 2 |
What Are Penalized Regression Methods? | p. 7 |
What Are Ensemble Methods? | p. 9 |
How to Decide Which Algorithm to Use | p. 11 |
The Process Steps for Building a Predictive Model | p. 13 |
Framing a Machine Learning Problem | p. 15 |
Feature Extraction and Feature Engineering | p. 17 |
Determining Performance of a Trained Model | p. 18 |
Chapter Contents and Dependencies | p. 18 |
Summary | p. 20 |
Chapter 2 Understand the Problem by Understanding the Data | p. 23 |
The Anatomy of a New Problem | p. 24 |
Different Types of Attributes and Labels Drive Modeling Choices | p. 26 |
Things to Notice about Your New Data Set | p. 27 |
Classification Problems: Detecting Unexploded Mines Using Sonar | p. 28 |
Physical Characteristics of the Rocks Versus Mines Data Set | p. 29 |
Statistical Summaries of the Rocks Versus Mines Data Set | p. 32 |
Visualization of Outliers Using a Quantile-Quantile Plot | p. 34 |
Statistical Characterization of Categorical Attributes | p. 35 |
How to Use Python Pandas to Summarize the Rocks Versus Mines Data Set | p. 36 |
Visualizing Properties of the Rocks Versus Mines Data Set | p. 39 |
Visualizing with Parallel Coordinates Plots | p. 39 |
Visualizing Interrelationships between Attributes and Labels | p. 41 |
Visualizing Attribute and Label Correlations Using a Heat Map | p. 48 |
Summarizing the Process for Understanding the Rocks Versus Mines Data Set | p. 50 |
Real-Valued Predictions with Factor Variables: How Old Is Your Abalone? | p. 50 |
Parallel Coordinates for Regression Problems-Visualize Variable Relationships for the Abalone Problem | p. 55 |
How to Use a Correlation Heat Map for Regression-Visualize Pair-Wise Correlations for the Abalone Problem | p. 59 |
Real-Valued Predictions Using Real-Valued Attributes: Calculate How Your Wine Tastes | p. 61 |
Multiclass Classification Problem: What Type of Glass Is That? | p. 67 |
Using PySpark to Understand Large Data Sets | p. 72 |
Summary | p. 75 |
Chapter 3 Predictive Model Building: Balancing Performance, Complexity, and Big Data | p. 77 |
The Basic Problem: Understanding Function Approximation | p. 78 |
Working with Training Data | p. 79 |
Assessing Performance of Predictive Models | p. 81 |
Factors Driving Algorithm Choices and Performance-Complexity and Data | p. 82 |
Contrast between a Simple Problem and a Complex Problem | p. 82 |
Contrast between a Simple Model and a Complex Model | p. 85 |
Factors Driving Predictive Algorithm Performance | p. 89 |
Choosing an Algorithm: Linear or Nonlinear? | p. 90 |
Measuring the Performance of Predictive Models | p. 91 |
Performance Measures for Different Types of Problems | p. 91 |
Simulating Performance of Deployed Models | p. 105 |
Achieving Harmony between Model and Data | p. 107 |
Choosing a Model to Balance Problem Complexity, Model Complexity, and Data Set Size | p. 107 |
Using Forward Stepwise Regression to Control Overfitting | p. 109 |
Evaluating and Understanding Your Predictive Model | p. 114 |
Control Overfitting by Penalizing Regression Coefficients-Ridge Regression | p. 116 |
Using PySpark for Training Penalized Regression Models on Extremely Large Data Sets | p. 124 |
Summary | p. 127 |
Chapter 4 Penalized Linear Regression | p. 129 |
Why Penalized Linear Regression Methods Are So Useful | p. 130 |
Extremely Fast Coefficient Estimation | p. 130 |
Variable Importance Information | p. 131 |
Extremely Fast Evaluation When Deployed | p. 131 |
Reliable Performance | p. 131 |
Sparse Solutions | p. 132 |
Problem May Require Linear Model | p. 132 |
When to Use Ensemble Methods | p. 132 |
Penalized Linear Regression: Regulating Linear Regression for Optimum Performance | p. 132 |
Training Linear Models: Minimizing Errors and More | p. 135 |
Adding a Coefficient Penalty to the OLS Formulation | p. 136 |
Other Useful Coefficient Penalties-Manhattan and ElasticNet | p. 137 |
Why Lasso Penalty Leads to Sparse Coefficient Vectors | p. 138 |
ElasticNet Penalty Includes Both Lasso and Ridge | p. 140 |
Solving the Penalized Linear Regression Problem | p. 141 |
Understanding Least Angle Regression and Its Relationship to Forward Stepwise Regression | p. 141 |
How LARS Generates Hundreds of Models of Varying Complexity | p. 145 |
Choosing the Best Model from the Hundreds LARS Generates | p. 147 |
Using Glmnet: Very Fast and Very General | p. 152 |
Comparison of the Mechanics of Glmnet and LARS Algorithms | p. 153 |
Initializing and Iterating the Glmnet Algorithm | p. 153 |
Extension of Linear Regression to Classification Problems | p. 157 |
Solving Classification Problems with Penalized Regression | p. 157 |
Working with Classification Problems Having More Than Two Outcomes | p. 161 |
Understanding Basis Expansion: Using Linear Methods on Nonlinear Problems | p. 161 |
Incorporating Non-Numeric Attributes into Linear Methods | p. 163 |
Summary | p. 166 |
Chapter 5 Building Predictive Models Using Penalized Linear Methods | p. 169 |
Python Packages for Penalized Linear Regression | p. 170 |
Multivariable Regression: Predicting Wine Taste | p. 171 |
Building and Testing a Model to Predict Wine Taste | p. 172 |
Training on the Whole Data Set before Deployment | p. 175 |
Basis Expansion: Improving Performance by Creating New Variables from Old Ones | p. 179 |
Binary Classification: Using Penalized Linear Regression to Detect Unexploded Mines | p. 182 |
Build a Rocks Versus Mines Classifier for Deployment | p. 191 |
Multiclass Classification: Classifying Crime Scene Glass Samples | p. 200 |
Linear Regression and Classification Using PySpark | p. 203 |
Using PySpark to Predict Wine Taste | p. 204 |
Logistic Regression with PySpark: Rocks Versus Mines | p. 208 |
Incorporating Categorical Variables in a PySpark Model: Predicting Abalone Rings | p. 213 |