A comprehensive overview of data science covering the analytics, programming, and business skills necessary to master the discipline
Finding a good data scientist has been likened to hunting for a unicorn: the required combination of technical skills is simply very hard to find in one person. In addition, good data science is not just rote application of trainable skill sets; it requires the ability to think flexibly about all these areas and understand the connections between them. This book provides a crash course in data science, combining all the necessary skills into a unified discipline.
Unlike many analytics books, this one gives extensive coverage to computer science and software engineering, since they play such a central role in the daily work of a data scientist. The author also describes classic machine learning algorithms, from their mathematical foundations to real-world applications. Visualization tools are reviewed, and their central importance in data science is highlighted. Classical statistics is addressed to help readers think critically about the interpretation of data and its common pitfalls. The clear communication of technical results, which is perhaps the most undertrained of data science skills, is given its own chapter, and all topics are explained in the context of solving real-world data problems. The book also features:
Extensive sample code and tutorials using Python along with its technical libraries
Core technologies of "Big Data," including their strengths and limitations and how they can be used to solve real-world problems
Coverage of the practical realities of the tools, keeping theory to a minimum; however, when theory is presented, it is done in an intuitive way to encourage critical thinking and creativity
A wide variety of case studies from industry
Practical advice on the realities of being a data scientist today, including the overall workflow, where time is spent, the types of datasets worked on, and the skill sets needed
The Data Science Handbook is an ideal resource for data analysis methodology and big data software tools. The book is appropriate for people who want to practice data science, but lack the required skill sets. This includes software professionals who need to better understand analytics and statisticians who need to understand software. Modern data science is a unified discipline, and it is presented as such. This book is also an appropriate reference for researchers and entry-level graduate students who need to learn real-world analytics and expand their skill set.
FIELD CADY is the data scientist at the Allen Institute for Artificial Intelligence, where he develops tools that use machine learning to mine scientific literature. He has also worked at Google and several Big Data startups. He has a BS in physics and math from Stanford University, and an MS in computer science from Carnegie Mellon.
Table of Contents
Preface
1 Introduction: Becoming a Unicorn
1.1 Aren't Data Scientists Just Overpaid Statisticians?
1.2 How Is This Book Organized?
1.3 How to Use This Book?
1.4 Why Is It All in Python, Anyway?
1.5 Example Code and Datasets
1.6 Parting Words
Part I The Stuff You'll Always Use
2 The Data Science Road Map
2.1 Frame the Problem
2.2 Understand the Data: Basic Questions
2.3 Understand the Data: Data Wrangling
2.4 Understand the Data: Exploratory Analysis
2.5 Extract Features
2.6 Model
2.7 Present Results
2.8 Deploy Code
2.9 Iterating
2.10 Glossary
3 Programming Languages
3.1 Why Use a Programming Language? What Are the Other Options?
3.2 A Survey of Programming Languages for Data Science
3.3 Python Crash Course
3.4 Strings
3.5 Defining Functions
3.6 Python's Technical Libraries
3.7 Other Python Resources
3.8 Further Reading
3.9 Glossary
3a Interlude: My Personal Toolkit
4 Data Munging: String Manipulation, Regular Expressions, and Data Cleaning
4.1 The Worst Dataset in the World
4.2 How to Identify Pathologies
4.3 Problems with Data Content
4.4 Formatting Issues
4.5 Example Formatting Script
4.6 Regular Expressions
4.7 Life in the Trenches
4.8 Glossary
5 Visualizations and Simple Metrics
5.1 A Note on Python's Visualization Tools
5.2 Example Code
5.3 Pie Charts
5.4 Bar Charts
5.5 Histograms
5.6 Means, Standard Deviations, Medians, and Quantiles
5.7 Boxplots
5.8 Scatterplots
5.9 Scatterplots with Logarithmic Axes
5.10 Scatter Matrices
5.11 Heatmaps
5.12 Correlations
5.13 Anscombe's Quartet and the Limits of Numbers
5.14 Time Series
5.15 Further Reading
5.16 Glossary
6 Machine Learning Overview
6.1 Historical Context
6.2 Supervised versus Unsupervised
6.3 Training Data, Testing Data, and the Great Boogeyman of Overfitting
6.4 Further Reading
6.5 Glossary
7 Interlude: Feature Extraction Ideas
7.1 Standard Features
7.2 Features That Involve Grouping
7.3 Preview of More Sophisticated Features
7.4 Defining the Feature You Want to Predict
8 Machine Learning Classification
8.1 What Is a Classifier, and What Can You Do with It?
8.2 A Few Practical Concerns
8.3 Binary versus Multiclass
8.4 Example Script
8.5 Specific Classifiers
8.6 Evaluating Classifiers
8.7 Selecting Classification Cutoffs
8.8 Further Reading
8.9 Glossary
9 Technical Communication and Documentation
9.1 Several Guiding Principles
9.2 Slide Decks
9.3 Written Reports
9.4 Speaking: What Has Worked for Me
9.5 Code Documentation
9.6 Further Reading
9.7 Glossary
Part II Stuff You Still Need to Know
10 Unsupervised Learning: Clustering and Dimensionality Reduction
10.1 The Curse of Dimensionality
10.2 Example: Eigenfaces for Dimensionality Reduction
10.3 Principal Component Analysis and Factor Analysis
10.4 Skree Plots and Understanding Dimensionality
10.5 Factor Analysis
10.6 Limitations of PCA
10.7 Clustering
10.8 Further Reading
10.9 Glossary
11 Regression
11.1 Example: Predicting Diabetes Progression
11.2 Least Squares
11.3 Fitting Nonlinear Curves
11.4 Goodness of Fit: R² and Correlation
11.5 Correlation of Residuals
11.6 Linear Regression
11.7 LASSO Regression and Feature Selection
11.8 Further Reading
11.9 Glossary
12 Data Encodings and File Formats
12.1 Typical File Format Categories
12.2 CSV Files
12.3 JSON Files
12.4 XML Files
12.5 HTML Files
12.6 Tar Files
12.7 GZip Files
12.8 Zip Files