Practical Text Mining with Perl

This book is devoted to the fundamentals of text mining using Perl, an open-source programming tool that is freely available via the Internet (www.perl.org). It covers mining ideas from several perspectives--statistics, data mining, linguistics, and information retrieval--and provides readers with the means to successfully complete text mining tasks on their own.

The book begins with an introduction to regular expressions, a text pattern methodology, and quantitative text summaries, all of which are fundamental tools for analyzing text. Then, it builds upon this foundation to explore:

Probability and texts, including the bag-of-words model

Information retrieval techniques such as the TF-IDF similarity measure

Concordance lines and corpus linguistics

Multivariate techniques such as correlation, principal components analysis, and clustering

Perl modules, German, and permutation tests

Each chapter is devoted to a single key topic, and the author carefully and thoughtfully introduces mathematical concepts as they arise, allowing readers to learn as they go without having to refer to additional books. The inclusion of numerous exercises and worked-out examples further complements the book's student-friendly format.

Practical Text Mining with Perl is ideal as a textbook for undergraduate and graduate courses in text mining and as a reference for a variety of professionals who are interested in extracting information from text documents.

Author Notes

Roger Bilisoly, PhD, is an Assistant Professor of Statistics at Central Connecticut State University, where he developed and teaches a new graduate-level course in text mining for the school's data mining program.

Reviews 1

Choice Review

This is clearly one of the best programming books that this reviewer has seen recently. Perl is a language that is perfectly suited for processing text, or more generally, strings. More importantly, Perl is a useful language for learning the concepts of computer science. Further, text processing and text mining are excellent platforms to illustrate computer science topics. Bilisoly (statistics, Central Connecticut State Univ.) clearly discusses all the basics of text processing--pattern extraction, probabilistic text sampling, information retrieval, corpus linguistics, etc. The code examples are excellent because the author very clearly presents relationships between the data, mathematical formulas, and the code. Students just starting to learn about text mining as well as audiences interested in more advanced text mining issues will find this book of value. This reviewer hopes that the author will think about writing a second volume. More advanced topics related to knowledge management or natural language understanding would make a valuable continuation. Excellent topic, excellent treatment. Summing Up: Highly recommended. Upper-division undergraduates through professionals; informed general readers. J. Brzezinski, DePaul University

Preface

Acknowledgments

1 Introduction

1.1 Overview of this Book

1.2 Text Mining and Related Fields

1.2.1 Chapter 2 Pattern Matching

1.2.2 Chapter 3 Data Structures

1.2.3 Chapter 4 Probability

1.2.4 Chapter 5 Information Retrieval

1.2.5 Chapter 6 Corpus Linguistics

1.2.6 Chapter 7 Multivariate Statistics

1.2.7 Chapter 8 Clustering

1.2.8 Chapter 9 Three Additional Topics

1.3 Advice for Reading this Book

2 Text Patterns

2.1 Introduction

2.2 Regular Expressions

2.2.1 First Regex: Finding the Word "Cat"

2.2.2 Character Ranges and Finding Telephone Numbers

2.2.3 Testing Regexes with Perl

2.3 Finding Words in a Text

2.3.1 Regex Summary

2.3.2 Nineteenth Century Literature

2.3.3 Perl Variables and the Function split

2.3.4 Match Variables

2.4 Decomposing Poe's "The Tell-Tale Heart" into Words

2.4.1 Dashes and String Substitutions

2.4.2 Hyphens

2.4.3 Apostrophes

2.5 A Simple Concordance

2.5.1 Command Line Arguments

2.5.2 Writing to Files

2.6 First Attempt at Extracting Sentences

2.6.1 Sentence Segmentation Preliminaries

2.6.2 Sentence Segmentation for "A Christmas Carol"

2.6.3 Leftmost Greediness and Sentence Segmentation

2.7 Regex Odds and Ends

2.7.1 Match Variables and Backreferences

2.7.2 Regular Expression Operators and Their Output

2.7.3 Lookaround

2.8 References

Problems

3 Quantitative Text Summaries

3.1 Introduction

3.2 Scalars, Interpolation and Context in Perl

3.3 Arrays and Context in Perl

3.4 Word Length Application

3.5 Arrays and Functions

3.5.1 Adding and Removing Entries from Arrays

3.5.2 Selecting Subsets of an Array

3.5.3 Sorting an Array

3.6 Hashes

3.6.1 Using a Hash

3.7 Two Text Applications

3.7.1 Zipf's Law

3.7.2 Perl for Word Games

3.7.2.1 An Aid to Crossword Puzzles

3.7.2.2 Word Anagrams

3.7.2.3 Finding Words in a Set of Letters

3.8 Complex Data Structures

3.8.1 References and Pointers

3.8.2 Arrays of Arrays and Beyond

3.8.3 Application: Comparing the Words in Two Poe Stories

3.9 References

3.10 First Transition

Problems

4 Probability and Texts

4.1 Introduction

4.2 Probability

4.2.1 Probability and Coin Flipping

4.2.2 Probabilities and Texts

4.2.2.1 Estimating Letter Probabilities

4.2.2.2 Estimating Letter Bigram Probabilities

4.3 Conditional Probability

4.3.1 Independence

4.4 Mean and Variance of Random Variables

4.4.1 Sampling and Error Estimates

4.5 The Bag-of-Words Model

4.6 The Effect of Sample Size

4.6.1 Tokens vs. Types

4.7 References

Problems

5 Applying Information Retrieval to Text Mining

5.1 Introduction

5.2 Text Counts and Vectors

5.2.1 Counting Words with Perl

5.2.2 Pronouns

5.3 Text Counts and Vectors

5.3.1 Vectors and Angles

5.3.2 Computing Angles between Vectors

5.3.2.1 Subroutines in Perl

5.3.2.2 Computing the Angle between Vectors

5.4 The Term-Document Matrix

5.5 Matrix Multiplication

5.5.1 A Text Application of Matrix Multiplication

5.6 Functions of Counts

5.7 Document Similarity

5.7.1 Inverse Document Frequency

5.7.2 Poe Story Angles Revisited

5.8 References

Problems

6 Concordance Lines and Corpus Linguistics

6.1 Introduction

6.2 Sampling

6.2.1 Statistical Survey Sampling

6.2.2 Text Sampling

6.3 Corpus as Baseline

6.3.1 Function vs. Content Words

6.4 Concordancing

6.4.1 Sorting Concordance Lines

6.4.1.1 Code for Sorting Concordance Lines

6.4.2 Application: Word Usage

6.4
