
Title: BLUEPRINTS FOR TEXT ANALYTICS USING PYTHON
Author: ALBRECHT, J
Publisher: O'REILLY
Year of publication: 2020
ISBN: 978-1-4920-7408-3
Pages: 424
Price: 67,50 €

Synopsis

Turning text into valuable information is essential for businesses looking to gain a competitive advantage. With recent improvements in natural language processing (NLP), users now have many options for solving complex challenges. But it's not always clear which NLP tools or libraries will work for a business's needs, or which techniques to use and in what order.

This practical book provides data scientists and developers with blueprints for best-practice solutions to common tasks in text analytics and natural language processing. Authors Jens Albrecht, Sidharth Ramachandran, and Christian Winkler provide real-world case studies and detailed code examples in Python to help you get started quickly.

Extract data from APIs and web pages
Prepare textual data for statistical analysis and machine learning
Use machine learning for classification, topic modeling, and summarization
Explain AI models and classification results
Explore and visualize semantic similarities with word embeddings
Identify customer sentiment in product reviews (see the sketch after this list)
Create a knowledge graph based on named entities and their relations
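
To give a flavor of these blueprints, here is a minimal sentiment-classification sketch, assuming scikit-learn is installed. It is not code from the book; the tiny review set, labels, and test sentence are invented for illustration.

    # Minimal sentiment classification: TF-IDF features + a linear classifier.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    reviews = [
        "Great battery life, works perfectly",
        "Terrible build quality, broke after a week",
        "Absolutely love it, highly recommended",
        "Waste of money, very disappointed",
    ]
    labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (invented toy data)

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(reviews, labels)
    print(model.predict(["broke immediately, do not buy"]))  # likely [0]

A real pipeline would train on thousands of labeled reviews and evaluate on held-out data, but the structure stays the same.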



Table of contents

Preface
Approach of the Book
Prerequisites
Some Important Libraries to Know
Books to Read
Conventions Used in This Book
Using Code Examples
O'Reilly Online Learning
How to Contact Us
Acknowledgments
1. Gaining Early Insights from Textual Data
What You'll Learn and What We'll Build
Exploratory Data Analysis
Introducing the Dataset
Blueprint: Getting an Overview of the Data with Pandas
Calculating Summary Statistics for Columns
Checking for Missing Data
Plotting Value Distributions
Comparing Value Distributions Across Categories
Visualizing Developments Over Time
Blueprint: Building a Simple Text Preprocessing Pipeline
Performing Tokenization with Regular Expressions
Treating Stop Words
Processing a Pipeline with One Line of Code
Blueprints for Word Frequency Analysis
Blueprint: Counting Words with a Counter
Blueprint: Creating a Frequency Diagram
Blueprint: Creating Word Clouds
Blueprint: Ranking with TF-IDF
Blueprint: Finding a Keyword-in-Context
Blueprint: Analyzing N-Grams
Blueprint: Comparing Frequencies Across Time Intervals and Categories
Creating Frequency Timelines
Creating Frequency Heatmaps
Closing Remarks
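
Chapter 1's frequency blueprints boil down to tokenizing text and counting tokens. A minimal sketch of that idea, not taken from the book, using only the standard library (the sample text and stop-word list are made up):

    import re
    from collections import Counter

    text = ("Text analytics turns raw text into insight. "
            "Text analytics starts with counting.")
    # Crude regex tokenizer: lowercase, then keep runs of word characters.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    stopwords = {"a", "the", "into", "with", "on"}
    counts = Counter(t for t in tokens if t not in stopwords)
    print(counts.most_common(3))  # [('text', 3), ('analytics', 2), ('turns', 1)]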
2. Extracting Textual Insights with APIs
What You'll Learn and What We'll Build
Application Programming Interfaces
Blueprint: Extracting Data from an API Using the Requests Module
Pagination
Rate Limiting
Blueprint: Extracting Twitter Data with Tweepy
Obtaining Credentials
Installing and Configuring Tweepy
Extracting Data from the Search API
Extracting Data from a User's Timeline
Extracting Data from the Streaming API
Closing Remarks
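
Chapter 2's first blueprint pulls JSON data from a web API with the Requests module. The sketch below is not the book's code; it uses the public GitHub search API as a stand-in endpoint to show the basic request-and-parse loop:

    import requests

    # Query a public JSON API; 'q' and 'per_page' are GitHub search parameters.
    response = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": "text analytics", "per_page": 5},
        timeout=10,
    )
    response.raise_for_status()  # fail loudly on HTTP errors
    for repo in response.json()["items"]:
        print(repo["full_name"])

Pagination and rate limiting, which the chapter covers, amount to looping over page parameters and backing off when the API signals a limit.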
3. Scraping Websites and Extracting Data
What You'll Learn and What We'll Build
Scraping and Data Extraction
Introducing the Reuters News Archive
URL Generation
Blueprint: Downloading and Interpreting robots.txt
Blueprint: Finding URLs from sitemap.xml
Blueprint: Finding URLs from RSS
Downloading Data
Blueprint: Downloading HTML Pages with Python
Blueprint: Downloading HTML Pages with wget
Extracting Semistructured Data
Blueprint: Extracting Data with Regular Expressions
Blueprint: Using an HTML Parser for Extraction
Blueprint: Spidering
Introducing the Use Case
Error Handling and Production-Quality Software
Density-Based Text Extraction
Extracting Reuters Content with Readability
Summary Density-Based Text Extraction
All-in-One Approach
Blueprint: Scraping the Reuters Archive with Scrapy
Possible Problems with Scraping
Closing Remarks and Recommendation
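
Chapter 3 moves from downloading pages to extracting content with an HTML parser. A minimal sketch along those lines, assuming requests and Beautiful Soup are installed (example.com is just a placeholder target):

    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com", timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    print(soup.title.get_text())         # page title
    for p in soup.find_all("p"):         # all paragraph elements
        print(p.get_text(strip=True))

Before scraping a real site such as the Reuters archive, check its robots.txt, as the chapter's first blueprint does.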
4. Preparing Textual Data for Statistics and Machine Learning
What You'll Learn and What We'll Build
A Data Preprocessing Pipeline
Introducing the Dataset: Reddit Self-Posts
Loading Data Into Pandas
Blueprint: Standardizing Attribute Names
Saving and Loading a DataFrame
Cleaning Text Data
Blueprint: Identify Noise with Regular Expressions
Blueprint: Removing Noise with Regular Expressions
Blueprint: Character Normalization with textacy
Blueprint: Pattern-Based Data Masking with textacy
Tokenization
Blueprint: Tokenization with Regular Expressions
Tokenization with NLTK
Recommendations for Tokenization
Linguistic Processing with spaCy
Instantiating a Pipeline
Processing Text
Blueprint: Customizing Tokenization
Blueprint: Working with Stop Words
Blueprint: Extracting Lemmas Based on Part of Speech
Blueprint: Extracting Noun Phrases
Blueprint: Extracting Named Entities
Feature Extraction on a Large Dataset
Blueprint: Creating One Function to Get It All
Blueprint: Using spaCy on a Large Dataset
Persisting the Result
A Note on Execution Time
There Is More
Language Detection
Spell-Checking
Token Normalization
Closing Remarks and Recommendations
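
Chapter 4 pairs regex-based cleaning with linguistic processing in spaCy. A minimal sketch of that combination, not from the book, assuming spaCy and its small English model are installed (python -m spacy download en_core_web_sm); the sample string is invented:

    import re
    import spacy

    raw = "Check out https://example.com   -- great NLP libraries!!"
    clean = re.sub(r"https?://\S+", " ", raw)    # drop URLs (noise)
    clean = re.sub(r"\s+", " ", clean).strip()   # normalize whitespace

    nlp = spacy.load("en_core_web_sm")
    doc = nlp(clean)
    lemmas = [t.lemma_ for t in doc if t.is_alpha and not t.is_stop]
    print(lemmas)                  # e.g. ['check', 'great', 'NLP', 'library']
    print(list(doc.noun_chunks))   # e.g. [great NLP libraries]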
5. Feature Engineering and Syntactic Similarity
What You'll Learn and What We'll Build
A Toy Dataset for Experimentation
Blueprint: Building Your Own Vectorizer
Enumerating the Vocabulary
Vectorizing Documents
The Document-Term Matrix
The Similarity Matrix
Bag-of-Words Models
Blueprint: Using scikit-learn's CountVectorizer
Blueprint: Calculating Similarities
TF-IDF Models
Optimized Document Vectors with TfidfTransformer
Introducing the ABC Dataset
Blueprint: Reducing Feature Dimensions
Blueprint: Improving Features by Making Them More Specific
Blueprint: Using Lemmas Instead of Words for Vectorizing Documents
Blueprint: Limit Word Types
Blueprint: Remove Most Common Words
Blueprint: Adding Context via N-Grams
Syntactic Similarity in the ABC Dataset
Blueprint: Finding Most Similar Headlines to a Made-up Headline
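
Chapter 5 culminates in finding the headline most similar to a made-up one. A minimal sketch of that blueprint with scikit-learn; the headlines here are invented rather than drawn from the ABC dataset the book uses:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    headlines = [
        "police investigate fatal crash on highway",
        "new vaccine trial shows promising results",
        "local team wins championship after penalty shootout",
    ]
    vectorizer = TfidfVectorizer(stop_words="english")
    dtm = vectorizer.fit_transform(headlines)    # document-term matrix

    query = vectorizer.transform(["vaccine trial results announced"])
    scores = cosine_similarity(query, dtm)[0]
    print(headlines[scores.argmax()])  # -> the vaccine trial headline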