Machine Learning Guide

By OCDevel

Machine learning audio course, teaching the fundamentals of machine learning and artificial intelligence. It covers intuition, models (shallow and deep), math, languages, frameworks, etc. Where your o... more

4.9

759 ratings

Download on the App Store


Get it on Google Play

FAQs about Machine Learning Guide:

How many episodes does Machine Learning Guide have?

The podcast currently has 59 episodes available.

Machine Learning Guide episodes:

November 08, 2020
MLG 032 Cartesian Similarity Metrics
Try a walking desk to stay healthy while you study or work!

Show notes at ocdevel.com/mlg/32.

L1/L2 norm, Manhattan, Euclidean, cosine distances, dot product

Normed distances

A norm is a function that assigns a non-negative length to each vector in a vector space; the length is strictly positive for every vector except the zero vector.

Minkowski distance is the general form: (sum(abs(xi-yi)^p))^(1/p). Choosing "p" (1, 2, ..) gives the cases below.

L1: Manhattan/city-block/taxicab. abs(x2-x1)+abs(y2-y1). Grid-like distance (triangle legs). Preferred for high-dim space.

L2: Euclidean. sqrt((x2-x1)^2+(y2-y1)^2). Equivalently, sqrt of the dot product of the difference vector with itself. Straight-line distance; min distance (Pythagorean triangle edge)

Others: Mahalanobis, Chebyshev (p=inf), etc

Dot product

A type of inner product. Outer-product: lies outside the involved planes. Inner-product: dot product lies inside the planes/axes involved. Dot product: inner product on a finite dimensional Euclidean space

Cosine (normalized dot)
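
The metrics listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the episode; the helper names (minkowski, cosine_similarity) and the sample vectors are made up for the example.

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p between two 1-D vectors (hypothetical helper)."""
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

def cosine_similarity(x, y):
    """Dot product of the two vectors after scaling each to unit length."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

manhattan = minkowski(a, b, 1)             # L1: sum of the triangle legs
euclidean = minkowski(a, b, 2)             # L2: straight-line distance
chebyshev = float(np.max(np.abs(a - b)))   # limit of Minkowski as p goes to infinity

# Euclidean distance is also the square root of the difference vector dotted with itself.
diff = b - a
euclidean_via_dot = float(np.sqrt(np.dot(diff, diff)))

u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])
cos_uv = cosine_similarity(u, v)           # cosine of the 45-degree angle between u and v
```

Cosine similarity ignores vector length and only compares direction, which is why it is described above as a normalized dot product.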
42min
November 08, 2020
MLA 011 Practical Clustering Tools
Primary clustering tools for practical applications include K-means using scikit-learn or Faiss, agglomerative clustering leveraging cosine similarity with scikit-learn, and density-based methods like DBSCAN or HDBSCAN. For determining the optimal number of clusters, silhouette score is generally preferred over inertia-based visual heuristics, and it natively supports pre-computed distance matrices.
Links

Notes and resources at ocdevel.com/mlg/mla-11


K-means Clustering

K-means is the most widely used clustering algorithm and is typically the first method to try for general clustering tasks.

The scikit-learn KMeans implementation is suitable for small to medium-sized datasets, while Faiss's kmeans is more efficient and accurate for very large datasets.

K-means requires the number of clusters to be specified in advance and relies on the Euclidean distance metric, which performs poorly in high-dimensional spaces.

When document embeddings have high dimensionality (e.g., 768 dimensions from sentence transformers), K-means becomes less effective due to the limitations of Euclidean distance in such spaces.

Alternatives to K-means for High Dimensions

For text embeddings with high dimensionality, agglomerative (hierarchical) clustering methods are preferable, particularly because they allow the use of different similarity metrics.

Agglomerative clustering in scikit-learn accepts a pre-computed distance matrix (for example, 1 minus the cosine similarity), which makes cosine-based comparison available and is more appropriate for natural language processing.

Constructing the pre-computed distance (or similarity) matrix involves normalizing vectors and computing dot products, which can be efficiently achieved with linear algebra libraries like PyTorch.

Hierarchical algorithms do not use inertia in the same way as K-means and instead rely on external metrics, such as silhouette score.

Other clustering algorithms exist, including spectral, mean shift, and affinity propagation, which are not covered in this episode.
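
The pre-computed matrix approach described above might look like the following sketch. The embeddings are random stand-ins rather than real sentence-transformer output, NumPy is used in place of PyTorch for the linear algebra, and average linkage is an assumption made for the example. scikit-learn renamed the affinity argument to metric in newer releases, so the code tries both.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Hypothetical stand-in for document embeddings: two groups pointing in different directions.
group_a = rng.normal(loc=[1.0, 0.0, 0.0, 0.0], scale=0.05, size=(10, 4))
group_b = rng.normal(loc=[0.0, 1.0, 0.0, 0.0], scale=0.05, size=(10, 4))
embeddings = np.vstack([group_a, group_b])

# Normalize each row to unit length; dot products are then cosine similarities.
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
cos_sim = unit @ unit.T

# Turn similarity into a distance and clean up floating-point noise.
cos_dist = np.clip(1.0 - cos_sim, 0.0, None)
np.fill_diagonal(cos_dist, 0.0)

try:
    # Newer scikit-learn releases use the "metric" argument.
    model = AgglomerativeClustering(n_clusters=2, metric="precomputed", linkage="average")
except TypeError:
    # Older releases call the same option "affinity".
    model = AgglomerativeClustering(n_clusters=2, affinity="precomputed", linkage="average")

labels = model.fit_predict(cos_dist)
```

Ward linkage is not usable here because it assumes Euclidean distances, which is why the sketch picks average linkage.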

Semantic Search and Vector Indexing

Libraries such as Faiss, Annoy, and HNSWlib provide approximate nearest neighbor search for efficient semantic search on large-scale vector data.

These systems create an index of your embeddings to enable rapid similarity search, often with the ability to specify cosine similarity as the metric.

Sample code using these libraries with sentence transformers can be found in the UKP Lab sentence-transformers examples directory.
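
To make the idea of an index concrete, here is an exact (brute-force) cosine search in NumPy. It is not how Faiss, Annoy, or HNSWlib work internally; those libraries approximate this result so that it stays fast on millions of vectors. The function names and the tiny corpus are hypothetical.

```python
import numpy as np

def build_index(embeddings):
    """Store unit-length copies of the embeddings (hypothetical helper)."""
    return embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

def search(index, query, k=3):
    """Return the ids and cosine scores of the k most similar rows."""
    q = query / np.linalg.norm(query)
    scores = index @ q                 # cosine similarity against every row
    top = np.argsort(-scores)[:k]      # highest scores first
    return top, scores[top]

corpus = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])
index = build_index(corpus)
ids, scores = search(index, np.array([1.0, 0.2]), k=2)
```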

Determining the Optimal Number of Clusters

Both K-means and agglomerative clustering require a predefined number of clusters, but this is often unknown beforehand.

The "elbow" method involves running the clustering algorithm with varying cluster counts and plotting the inertia (sum of squared distances within clusters) to visually identify the point of diminishing returns; see kmeans.inertia_.

The kneed package can automatically detect the "elbow" or "knee" in the inertia plot, eliminating subjective human judgment.

The silhouette score, calculated via silhouette_score, considers both inter- and intra-cluster distances and allows for direct selection of the number of clusters with the maximum score.

The silhouette score can be computed using a pre-computed distance matrix (such as from cosine similarities), making it well-suited for applications involving non-Euclidean metrics and hierarchical clustering.
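
The selection loop described above can be sketched as follows, using synthetic blobs as an assumed stand-in for real data. It records both inertia (for an elbow plot) and the silhouette score, then picks the cluster count with the highest silhouette.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic data: three well-separated blobs (an assumption for illustration).
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in centers])

inertias = {}
scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                    # what an elbow plot would show
    scores[k] = silhouette_score(X, km.labels_)  # higher is better

best_k = max(scores, key=scores.get)
```

When working from a distance matrix instead of raw features, silhouette_score takes the matrix in place of X together with metric="precomputed".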

Density-Based Clustering: DBSCAN and HDBSCAN

DBSCAN is a density-based clustering method that does not require specifying the number of clusters, instead discovering clusters based on data density.

HDBSCAN is a hierarchical extension of DBSCAN that is more popular and versatile, capable of handling various types of data without significant parameter tuning.

DBSCAN and HDBSCAN can be preferable to K-means or agglomerative clustering when automatic determination of cluster count or robustness to noise is important.

However, these algorithms may not perform well with all types of high-dimensional embedding data, as illustrated by the challenges faced when clustering 768-dimensional text embeddings.
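
A small DBSCAN sketch with scikit-learn shows both properties mentioned above: the cluster count is discovered rather than supplied, and isolated points are labeled as noise (-1). The data and the eps and min_samples values are assumptions chosen for this toy example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
blob_a = rng.normal([0.0, 0.0], 0.2, size=(40, 2))
blob_b = rng.normal([5.0, 5.0], 0.2, size=(40, 2))
outlier = np.array([[20.0, -20.0]])   # a single point far from everything else
X = np.vstack([blob_a, blob_b, outlier])

# eps is the neighborhood radius; min_samples is the density threshold.
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

n_clusters = len(set(labels.tolist()) - {-1})   # -1 marks noise, not a cluster
```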

Summary Recommendations and Links

For small- to medium-sized, low-dimensional data, use K-means with silhouette score to choose the optimal number of clusters: scikit-learn KMeans, silhouette_score.

For very large data or vector search, use Faiss.kmeans.

For high-dimensional data using cosine similarity, use Agglomerative Clustering with a pre-computed square matrix of cosine distances (1 minus cosine similarity).

For density-based clustering, consider DBSCAN or HDBSCAN.

Exploratory code and further examples can be found in the UKP Lab sentence-transformers examples.
35min
October 28, 2020
MLA 010 NLP packages: transformers, spaCy, Gensim, NLTK
The landscape of Python natural language processing tools has evolved from broad libraries like NLTK toward more specialized packages such as Gensim for topic modeling, SpaCy for linguistic analysis, and Hugging Face Transformers for advanced tasks, with Sentence Transformers extending transformer models to enable efficient semantic search and clustering. Each library occupies a distinct place in the NLP workflow, from fundamental text preprocessing to semantic document comparison and large-scale language understanding.
Links

Notes and resources at ocdevel.com/mlg/mla-10


Historical Foundation: NLTK

NLTK ("Natural Language Toolkit") was one of the earliest and most popular Python libraries for natural language processing, covering tasks from tokenization and stemming to document classification and syntax parsing.

NLTK remains a catch-all "Swiss Army knife" for NLP, but many of its functions have been supplemented or superseded by newer tools tailored to specific tasks.

Specialized Topic Modeling and Phrase Analysis: Gensim

Gensim emerged as the leading library for topic modeling in Python, most notably via its LDA Topic Modeling implementation, which groups documents according to topic distributions.

Topic modeling workflows often use NLTK for initial preprocessing (tokenization, stop word removal, lemmatization), then vectorize with scikit-learn’s TF-IDF, and finally model topics with Gensim’s LDA.

Gensim also provides effective Bigrams/Trigrams, allowing the detection and combination of commonly-used word pairs or triplets (n-grams) to enhance analysis accuracy.
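
The vectorization step of the workflow above can be sketched with scikit-learn's TfidfVectorizer. The three documents are invented, and the built-in English stop-word list stands in for the NLTK preprocessing step; the Gensim LDA step is not shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example documents.
docs = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "Stocks fell as the market closed.",
]

# stop_words="english" drops common words such as "the" and "on".
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)   # sparse matrix: one row per document

vocabulary = vectorizer.vocabulary_      # maps each kept term to its column index
```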

Linguistic Structure and Manipulation: SpaCy and Related Tools

spaCy is a deep-learning-based library for high-performance linguistic analysis, focusing on tasks such as part-of-speech tagging, named entity recognition, and syntactic parsing.

SpaCy supports integrated sentence and word tokenization, stop word removal, and lemmatization, but for advanced lemmatization and inflection, LemmInflect can be used to derive proper inflections for part-of-speech tags.

For even more accurate (but slower) linguistic tasks, consider Stanford CoreNLP via SpaCy integration as spacy-stanza.

SpaCy can examine parse trees to identify sentence components, enabling sophisticated NLP applications like grammatical corrections and intent detection in conversation agents.

High-Level NLP Tasks: Hugging Face Transformers

huggingface/transformers provides interfaces to transformer-based models (like BERT and its successors) capable of advanced NLP tasks including question answering, summarization, translation, and sentiment analysis.

Its Pipelines allow users to accomplish over ten major NLP applications with minimal code.

The library’s model repository hosts a vast collection of pre-trained models that can be used for both research and production.

Semantic Search and Clustering: Sentence Transformers

UKPLab/sentence-transformers extends the transformer approach to create dense document embeddings, enabling semantic search, clustering, and similarity comparison via cosine distance or similar metrics.

Example applications include finding the most similar documents, clustering user entries, or summarizing clusters of text.

The repository offers application examples for tasks such as semantic search and clustering, often using cosine similarity.

For very large-scale semantic search (such as across Wikipedia), approximate nearest neighbor (ANN) libraries like Annoy, FAISS, and hnswlib enable rapid similarity search with embeddings; practical examples are provided in the Sentence Transformers documentation.

Additional Resources and Library Landscape

For a comparative overview and discovery of further libraries, see Analytics Steps Top 10 NLP Libraries in Python, which reviews several packages beyond those discussed here.

Summary of Library Roles and Use Cases

NLTK: Foundational and comprehensive for most classic NLP needs; still covers a broad range of preprocessing and basic analytic tasks.

Gensim: Best for topic modeling and phrase extraction (bigrams/trigrams); especially useful in workflows relying on document grouping and label generation.

SpaCy: Leading tool for syntactic, linguistic, and grammatical analysis; supports integration with advanced lemmatizers and external tools like Stanford CoreNLP.

Hugging Face Transformers: The standard for modern, high-level NLP tasks and quick prototyping, featuring simple pipelines and an extensive model hub.

Sentence Transformers: The main approach for embedding text for semantic search, clustering, and large-scale document comparison, supporting ANN methodologies via companion libraries.
27min
November 06, 2018
MLA 009 Charting and Visualization Tools for Data Science
This episode covers Python charting libraries (Matplotlib, Seaborn, and Bokeh), explaining their strengths from quick EDA to interactive, HTML-exported visualizations, and clarifies where D3.js fits as a JavaScript alternative for end-user applications. It also evaluates major software solutions like Tableau, Power BI, QlikView, and Excel, detailing how modern BI tools now integrate drag-and-drop analytics with embedded machine learning, potentially allowing business users to automate entire workflows without coding.
Links

Notes and resources at ocdevel.com/mlg/mla-9


Core Phases in Data Science Visualization

Exploratory Data Analysis (EDA):

EDA occupies an early stage in the Business Intelligence (BI) pipeline, positioned just before or sometimes merged with the data cleaning (“munging”) phase.

The outputs of EDA (e.g., correlation matrices, histograms) often serve as inputs to subsequent machine learning steps.

Python Visualization Libraries

1. Matplotlib

The foundational plotting library in Python, supporting static, basic chart types.

Requires substantial boilerplate code for custom visualizations.

Serves as the core engine for many higher-level visualization tools.

Common EDA plots (such as the .hist() and .plot.scatter() methods on pandas DataFrames) depend on Matplotlib under the hood; .corr() only computes the correlation table, which is then plotted separately.

2. Pandas Plotting

Pandas integrates tightly with Matplotlib and exposes simple, one-line commands for common plots (e.g., df.hist(), df.plot.scatter()), alongside one-line summaries such as df.corr().

Designed to make quick EDA accessible without requiring detailed knowledge of Matplotlib’s verbose syntax.
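
A rough sketch of those pandas one-liners, using a small invented housing table. The Agg backend is selected only so the example runs without a display; in a notebook the plots would render inline.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import pandas as pd

# Invented data for illustration.
df = pd.DataFrame({
    "sqft": [800, 1200, 1500, 2000, 2600],
    "bedrooms": [1, 2, 3, 3, 4],
    "price": [150, 210, 260, 330, 420],
})

hist_axes = df.hist(bins=5)                            # one histogram per numeric column
scatter_ax = df.plot.scatter(x="sqft", y="price")      # Matplotlib Axes under the hood
corr = df.corr()                                       # a table of correlations, not a plot
```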

3. Seaborn

A high-level wrapper around Matplotlib, analogous to how Keras wraps TensorFlow.

Sets sensible defaults for chart styles, fonts, colors, and sizes, improving aesthetics with minimal effort.

In Seaborn versions before 0.8, simply importing the library restyled all Matplotlib plots; in current versions, calling seaborn.set_theme() applies the same global styling, even without direct usage of Seaborn's plotting functions.

4. Bokeh

A powerful library for creating interactive, web-ready plots from Python.

Enables user interactions such as hovering, zooming, and panning within rendered plots.

Exports visualizations as standalone HTML files or can operate as a server-linked app for live data exploration.

Supports advanced features like cross-filtering, allowing dynamic slicing and dicing of data across multiple axes or columns.

More suited for creating reusable, interactive dashboards rather than quick, one-off EDA visuals.

5. D3.js

Unlike previous libraries, D3.js is a JavaScript framework for creating complex, highly customized data visualizations for web and mobile apps.

Used predominantly on the client-side to build interactive front-end graphics for end users, not as an EDA tool for analysts.

Common in production-grade web apps, but not typically part of a Python-based data science workflow.

Dedicated Visualization and BI Software

Tableau

Leading commercial drag-and-drop BI tool for data visualization and dashboarding.

Connects to diverse data sources (CSV, Excel, databases), auto-detects column types, and suggests default chart types.

Users can interactively build visualizations, cross-filter data, and switch chart types without coding.

Power BI

Microsoft’s BI suite, similar to Tableau, supporting end-to-end data analysis and visualization.

Integrates data preparation, visualization, and increasingly, built-in machine learning workflows.

Focused on empowering business users or analysts to run the BI pipeline without programming.

QlikView

Another major BI offering is QlikView, emphasizing interactive dashboards and data exploration.

Excel

Still widely used for basic EDA and visualizations directly on spreadsheets.

Offers limited but accessible charting tools for histograms, scatter plots, and simple summary statistics.

Data often originates from Excel/CSV files before being ingested for further analysis in Python/pandas.

Trends & Insights

Workflow Integration: Modern BI tools are converging, adding both classic EDA capabilities and basic machine learning modeling, often through a code-free interface.

Automation Risks and Opportunities: As drag-and-drop BI tools increase in capabilities (including model training and selection), some data science coding work traditionally required for BI pipelines may become accessible to non-programmers.

Distinctions in Use:

Python libraries (Matplotlib, Seaborn, Bokeh) excel in automating and scripting EDA, report generation, and static analysis as part of data pipelines.

BI software (Tableau, Power BI, QlikView) shines for interactive exploration and democratized analytics, integrated from ingestion to reporting.

D3.js stands out for tailored, production-level, end-user app visualizations, rarely leveraged by data scientists for EDA.

Key Takeaways

For quick, code-based EDA: Use Pandas’ built-in plotters (wrapping Matplotlib).

For pre-styled, pretty plots: Use Seaborn (with or without direct API calls).

For interactive, shareable dashboards: Use Bokeh for Python or BI tools for no-code operation.

For enterprise, end-user-facing dashboards: Choose BI software like Tableau or build custom apps using D3.js for total control.
25min
October 26, 2018
MLA 008 Exploratory Data Analysis (EDA)
Exploratory data analysis (EDA) sits at the critical pre-modeling stage of the data science pipeline, focusing on uncovering missing values, detecting outliers, and understanding feature distributions through both statistical summaries and visualizations, such as Pandas' info(), describe(), histograms, and box plots. Visualization tools like Matplotlib, along with processes including imputation and feature correlation analysis, allow practitioners to decide how best to prepare, clean, or transform data before it enters a machine learning model.
Links

Notes and resources at ocdevel.com/mlg/mla-8


EDA in the Data Science Pipeline

Position in Pipeline: EDA is an essential pre-processing step in the business intelligence (BI) or data science pipeline, occurring after data acquisition but before model training.

Purpose: The goal of EDA is to understand the data by identifying:

Missing values (nulls)

Outliers

Feature distributions

Relationships or correlations between variables

Data Acquisition and Initial Inspection

Data Sources: Data may arrive from various streams (e.g., Twitter, sensors) and is typically stored in structured formats such as databases or spreadsheets.

Loading Data: In Python, data is often loaded into a Pandas DataFrame using commands like pd.read_csv('filename.csv').

Initial Review:

df.info(): Displays data types and counts of non-null entries by column, quickly highlighting missing values.

df.describe(): Provides summary statistics for each column, including count, mean, standard deviation, min/max, and quartiles.
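
The loading and initial review steps above can be sketched like this. An in-memory CSV string (invented data) replaces a real file so the example is self-contained.

```python
import io
import pandas as pd

# Invented CSV content with two missing cells.
csv_text = "age,income\n34,52000\n,61000\n45,\n29,48000\n"
df = pd.read_csv(io.StringIO(csv_text))

df.info()                      # prints dtypes and non-null counts per column
summary = df.describe()        # count, mean, std, min, quartiles, max
missing = df.isna().sum()      # explicit count of nulls per column
```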

Handling Missing Data and Outliers

Imputation:

Missing values must often be filled (imputed), as most machine learning algorithms cannot handle nulls.

Common strategies: impute with mean, median, or another context-appropriate value.

For example, missing ages can be filled with the column's average rather than zero, to avoid introducing skew.

Outlier Strategy:

Outliers can be removed, replaced (e.g., by nulls and subsequently imputed), or left as-is if legitimate.

Treatment depends on whether outliers represent true data points or data errors.
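
One way to combine the two strategies above is to replace suspected errors with nulls and then impute. The 1.5 times IQR rule used to flag outliers here is a common convention chosen for the sketch, not something prescribed by the episode, and the ages are invented.

```python
import numpy as np
import pandas as pd

# Invented ages: one missing value and one obvious data-entry error (350).
ages = pd.Series([25.0, 31.0, np.nan, 40.0, 29.0, 350.0])

# Flag values far outside the interquartile range (assumed rule of thumb).
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
is_outlier = (ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)

cleaned = ages.mask(is_outlier)               # outliers become NaN
imputed = cleaned.fillna(cleaned.median())    # fill all NaNs with the median
```

The median is used instead of the mean because it is less affected by any extreme values that remain.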

Visualization Techniques

Purpose: Visualizations help reveal data distributions, outliers, and relationships that may not be apparent from raw statistics.

Common Visualization Tools:

Matplotlib: The primary Python library for static data visualizations.

Visualization Methods:

Histogram: Ideal for visualizing the distribution of a single variable (e.g., age), making outliers visible as isolated bars.

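Box Plot: Summarizes the median and quartiles, with 'whiskers' showing the range of the non-outlying data (in Matplotlib's default, up to 1.5 times the interquartile range beyond the box) and points past the whiskers drawn individually; useful for spotting outliers and understanding data spread.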

Line Chart: Used for time-series data, highlighting trends and anomalies (e.g., sudden spikes in stock price).

Correlation Matrix: Visual grid (often of scatterplots) comparing each feature against every other, helping to detect strong or weak linear relationships between features.

Feature Correlation and Dimensionality

Correlation Plot:

Generated with df.corr() in Pandas to assess linear relationships between features.

High correlation between features may suggest redundancy (e.g., number of bedrooms and square footage) and inform feature selection or removal.

Limitations:

While correlation plots provide intuition, automated approaches like Principal Component Analysis (PCA) or autoencoders are typically superior for feature reduction and target prediction tasks.
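
As a rough illustration of using df.corr() to spot redundant features, the sketch below builds synthetic housing data in which bedrooms is derived from square footage, then lists feature pairs whose absolute correlation exceeds a threshold. The data, the 0.8 threshold, and the column names are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
sqft = rng.uniform(500, 3000, size=200)
bedrooms = np.round(sqft / 600 + rng.normal(0, 0.3, size=200))  # tied to sqft by construction
unrelated = rng.normal(0, 1, size=200)                          # independent noise feature

df = pd.DataFrame({"sqft": sqft, "bedrooms": bedrooms, "unrelated": unrelated})
corr = df.corr()

threshold = 0.8  # assumed cut-off for "highly correlated"
cols = list(corr.columns)
redundant_pairs = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
```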

Data Transformation Prior to Modeling

Scaling:

Machine learning models, especially neural networks, often require input features to be scaled (normalized or standardized).

StandardScaler (from scikit-learn): Standardizes features, but is sensitive to outliers.

RobustScaler: A variant that centers on the median and scales by the interquartile range, so outliers have far less influence on the scaling than they do with StandardScaler, simplifying preprocessing steps.
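
The difference between the two scalers shows up clearly on a tiny invented column with one extreme value:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature with a single extreme value (invented data).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(X)   # uses mean and standard deviation
robust = RobustScaler().fit_transform(X)       # uses median and interquartile range

# Spread of the four ordinary points after scaling.
standard_spread = float(standard[:4].max() - standard[:4].min())
robust_spread = float(robust[:4].max() - robust[:4].min())
```

With StandardScaler the outlier inflates the standard deviation, so the four ordinary points end up squeezed close together. RobustScaler keeps them spread out, and the outlier itself stays large rather than being clipped.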

Summary of EDA Workflow

Initial Steps:

Load data into a DataFrame.

Examine data types and missing values with df.info().

Review summary statistics with df.describe().

Visualization:

Use histograms and box plots to explore feature distributions and detect anomalies.

Leverage correlation matrices to identify related features.

Data Preparation:

Impute missing values thoughtfully (e.g., with means or medians).

Decide on treatment for outliers: removal, imputation, or scaling with tools like RobustScaler.

Outcome:

Proper EDA ensures that data is cleaned, features are well-understood, and inputs are suitable for effective machine learning model training.
26min
October 16, 2018
MLA 007 Jupyter Notebooks
Jupyter Notebooks, originally conceived as IPython Notebooks, enable data scientists to combine code, documentation, and visual outputs in an interactive, browser-based environment supporting multiple languages like Python, Julia, and R. This episode details how Jupyter Notebooks structure workflows into executable cells - mixing markdown explanations and inline charts - which is essential for documenting, demonstrating, and sharing data analysis and machine learning pipelines step by step.
Links

Notes and resources at ocdevel.com/mlg/mla-7


Overview of Jupyter Notebooks

Historical Context and Scope

Jupyter Notebooks began as IPython Notebooks focused solely on Python.

The project was renamed Jupyter to support additional languages - namely Julia ("JU"), Python ("PY"), and R ("R") - broadening its applicability for data science and machine learning across multiple languages.

Interactive, Narrative-Driven Coding

Jupyter Notebooks allow for the mixing of executable code, markdown documentation, and rich media outputs within a browser-based interface.

The coding environment is structured as a sequence of cells where each cell can independently run code and display its output directly underneath.

Unlike traditional Python scripts, which output results linearly and impermanently, Jupyter Notebooks preserve the stepwise development process and its outputs for later review or publication.

Typical Workflow Example

Stepwise Data Science Pipeline Construction

Import necessary libraries: Each new notebook usually starts with a cell for imports (e.g., matplotlib, scikit-learn, keras, pandas).

Data ingestion phase: Read data into a pandas DataFrame via read_csv for CSVs or read_sql for databases.

Exploratory analysis steps: Use DataFrame methods like .info() and .describe() to inspect the dataset; results are rendered below the respective cell.

Model development: Train a machine learning model - for example using Keras - and output performance metrics such as loss, mean squared error, or classification accuracy directly beneath the executed cell.

Data visualization: Leverage charting libraries like matplotlib to produce inline plots (e.g., histograms, correlation matrices), which remain visible as part of the notebook for later reference.

Publishing and Documentation Features

Markdown Support and Storytelling

Markdown cells enable the inclusion of formatted explanations, section headings, bullet points, and even inline images and videos, allowing for clear documentation and instructional content interleaved with code.

This format makes it simple to delineate different phases of a pipeline (e.g., "Data Ingestion", "Data Cleaning", "Model Evaluation") with descriptive context.

Inline Visual Outputs

Outputs from code cells, such as tables, charts, and model training logs, are preserved within the notebook interface, making it easy to communicate findings and reasoning steps alongside the code.

Visualization libraries (like matplotlib) can render charts directly in the notebook without the need to generate separate files.

Reproducibility and Sharing

Notebooks can be published to platforms like GitHub, where the full code, markdown, and most recent cell outputs are viewable in-browser.

This enables transparent workflow documentation and facilitates tutorials, blog posts, and collaborative analysis.

Practical Considerations and Limitations

Cell-based Execution Flexibility

Each cell can be run independently, so developers can repeatedly rerun specific steps (e.g., re-trying a modeling cell after code fixes) without needing to rerun the entire notebook.

This is especially useful for iterative experimentation with large or slow-to-load datasets.

Primary Use Cases

Jupyter Notebooks excel at "storytelling" - presenting an analytical or modeling process along with its rationale and findings, primarily for publication or demonstration.

For regular development, many practitioners prefer traditional editors or IDEs (like PyCharm or Vim) due to advanced features such as debugging, code navigation, and project organization.

Summary
Jupyter Notebooks serve as a central tool for documenting, presenting, and sharing the entirety of a machine learning or data analysis pipeline - combining code, output, narrative, and visualizations into a single, comprehensible document ideally suited for tutorials, reports, and reproducible workflows.
17min
July 19, 2018
MLA 006 Salaries for Data Science & Machine Learning
O'Reilly's 2017 Data Science Salary Survey finds that location is the most significant salary determinant for data professionals, with median salaries ranging from $134,000 in California to under $30,000 in Eastern Europe, and highlights that negotiation skills can lead to salary differences as high as $45,000. Other key factors impacting earnings include company age and size, job title, industry, and education, while popular tools and languages—such as Python, SQL, and Spark—do not strongly influence salary despite widespread use.
Links

Notes and resources at ocdevel.com/mlg/mla-6


Global and Regional Salary Differences

Median Global Salary: $90,000 USD, up from $85,000 the previous year.

Regional Breakdown:

United States: $112,000 median; California leads at $134,000.

Western Europe: $57,000—about half the US median.

Australia & New Zealand: Second after the US.

Eastern Europe: Below $30,000.

Asia: Wide interquartile salary range, indicating high variability.

Demographic and Personal Factors

Gender: Women's median salaries are $8,000 lower than men's. Women make up 20% of respondents but are increasing in number.

Age & Experience: Higher age/experience correlates with higher salaries, but the proportion of older professionals declines.

Education: Nearly all respondents have at least a master's; PhD holders earn only about $5,000 more than those with a master’s.

Negotiation Skills: Self-reported strong salary negotiation skills are linked to $45,000 higher median salaries (from $70,000 for lowest to $115,000 for highest bargaining skill).

Industry, Company, and Role

Industry Impact:

Highest salaries found in search/social networking and media/entertainment.

Education and non-profit offer the lowest pay.

Company Age & Size:

Companies aged 2–5 years offer higher than average pay; less than 2 years old offer much lower salaries (~$40,000).

Large organizations generally pay more.

Job Title:

"Data scientist" and "data analyst" titles carry higher medians than "engineer" titles by around $7,000.

Executive titles (CTO, VP, Director) see the highest pay, with CTOs at $150,000 median.

Tools, Languages, and Technologies

Operating Systems:

Windows: 67% usage, but declining.

Linux: 55%; Unix: 18%; macOS: 46%; Unix-based systems are rising in use.

Programming Languages:

SQL: 64% (most used for database querying).

Python: 63% (most popular procedural language).

R: 54%.

Others (Java, Scala, C/C++, C#): Each less than 20%.

Salary difference across languages is minor; C/C++ users earn more but not enough to outweigh the difficulty.

Databases:

MySQL (37%), MS SQL Server (30%), PostgreSQL (28%).

Popularity of the database has little impact on pay.

Big Data and Search Tools:

Spark: Most popular big data platform, especially for large-scale data processing.

Elasticsearch: Most common search engine, but Solr pays more.

Machine Learning Libraries:

Scikit-learn (37%) and Spark MLlib (16%) are most used.

Visualization Tools:

R’s ggplot2 and Python’s matplotlib are leading choices.

Key Salary Differentiators (per Machine Learning Analysis)

Top Predictors (explaining ~60% of salary variance):

World/US region

Experience

Gender

Company size

Education (but amounting to only ~$5,000 difference)

Job title

Industry

Lesser Impact: Specific tools, languages, and databases do not meaningfully affect salary.

Summary Takeaways

The greatest leverage for a higher salary comes from geography and individual negotiation capability, with up to $45,000 differences possible.

Role/title selection, industry, company age, and size are also significant, while mastering the most commonly used tools is essential but does not strongly differentiate pay.

For aspiring data professionals: focus on developing negotiation skills and, where possible, optimize for location and title to maximize earning potential.
20min
June 09, 2018
MLA 005 Shapes and Sizes: Tensors and NDArrays
This episode explains the fundamental differences between tensor dimensions, size, and shape, clarifying frequent misconceptions (such as the distinction between the number of features, or "columns", and true data dimensions) while also demystifying reshaping operations like expand_dims, squeeze, and transpose in NumPy. Through practical examples from images and natural language processing, listeners learn how to manipulate tensors to match model requirements, including scenarios like adding dummy dimensions for grayscale images or reordering axes for sequence data.
Links

Notes and resources at ocdevel.com/mlg/mla-5


Definitions

Tensor: A general term for an array of any number of dimensions.

0D Tensor (Scalar): A single number (e.g., 5).

1D Tensor (Vector): A simple list of numbers.

2D Tensor (Matrix): A grid of numbers (rows and columns).

3D+ Tensors: Higher-dimensional arrays, such as images or batches of images.

NDArray (NumPy): Stands for "N-dimensional array," the foundational array type in NumPy, synonymous with "tensor."

Tensor Properties

Dimensions

Number of nested levels in the array (e.g., a matrix has two dimensions: rows and columns).

Access in NumPy: Via .ndim property (e.g., array.ndim).

Size

Total number of elements in the tensor.

Examples:

Scalar: size = 1

Vector: size equals number of elements (e.g., 5 for [1, 2, 3, 4, 5])

Matrix: size = rows × columns (e.g., 10×10 = 100)

Access in NumPy: Via .size property.

Shape

Tuple listing the number of elements per dimension.

Example: An image with 256×256 pixels and 3 color channels has shape = (256, 256, 3).
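
The three properties can be checked directly in NumPy. The arrays below mirror the examples given above; the variable names are only for illustration.

```python
import numpy as np

scalar = np.array(5)                  # 0D tensor
vector = np.array([1, 2, 3, 4, 5])    # 1D tensor
matrix = np.zeros((10, 10))           # 2D tensor
image = np.zeros((256, 256, 3))       # 3D tensor: 256x256 pixels, 3 color channels

properties = {
    name: (arr.ndim, arr.size, arr.shape)
    for name, arr in [("scalar", scalar), ("vector", vector),
                      ("matrix", matrix), ("image", image)]
}
```

Note that size is always the product of the entries in shape, and ndim is the length of the shape tuple.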

Common Scenarios & Examples

Data Structures in Practice

CSV/Spreadsheet Example: Dataset with 1 million housing examples and 50 features:

Shape: (1_000_000, 50)

Size: 50,000,000

Image Example (RGB): 256×256 pixel image:

Shape: (256, 256, 3)

Dimensions: 3 (width, height, channels)

Batching for Models:

For a convolutional neural network, shape might become (batch_size, width, height, channels), e.g., (32, 256, 256, 3).

Conceptual Clarifications

The term "dimensions" in data science often refers to features (columns), but technically in tensors it means the number of structural axes.

The "curse of dimensionality" often uses "dimensions" to refer to features, not tensor axes.

Reshaping and Manipulation in NumPy

Reshaping Tensors

Adding Dimensions:

Useful when a model expects higher-dimensional input than currently available (e.g., converting grayscale image from shape (256, 256) to (256, 256, 1)).

Use np.expand_dims or array.reshape.

Removing Singleton Dimensions:

Needed when, for example, a model's output has shape (N, 1) and the singleton axis should be dropped to yield (N,).

Use np.squeeze or array.reshape.

Wildcard with -1:

In reshaping, -1 is a placeholder for NumPy to infer the correct size, useful when batch size or another dimension is variable.

Flattening:

Use np.ravel to turn a multi-dimensional tensor into a contiguous 1D array.
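The reshaping operations above, sketched on small placeholder arrays. The shapes (an 8-row prediction column, a batch of four 28x28 arrays) are assumptions chosen for illustration:

```python
import numpy as np

# Adding a dimension: grayscale image (256, 256) -> (256, 256, 1)
gray = np.zeros((256, 256))
with_channel = np.expand_dims(gray, axis=-1)
same_via_reshape = gray.reshape(256, 256, 1)

# Removing a singleton dimension: hypothetical model output (N, 1) -> (N,)
preds = np.arange(8).reshape(8, 1)
flat_preds = np.squeeze(preds, axis=1)

# Wildcard -1: NumPy infers the remaining size (here 28 * 28 = 784)
batch = np.zeros((4, 28, 28))
per_example = batch.reshape(4, -1)

# Flattening everything into one 1D array
all_flat = np.ravel(batch)

print(with_channel.shape, flat_preds.shape, per_example.shape, all_flat.shape)
```

Passing axis to np.squeeze is optional, but naming the axis guards against accidentally dropping a batch dimension that happens to have size 1.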

Axis Reordering

Transposing Axes:

Needed when model input or output expects axes in a different order (e.g., sequence length and embedding dimensions in NLP).

Use np.transpose for general axis permutations.

Use np.swapaxes to swap two specific axes, but prefer transpose for clarity and flexibility.

Practical Example

In NLP sequence models:

3D tensor with (batch_size, sequence_length, embedding_dim) might need to be reordered to (batch_size, embedding_dim, sequence_length) for certain models.

Achieved using: array.transpose(0, 2, 1)
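A small sketch of that reordering, with made-up sizes (batch of 2, sequence length 5, embedding size 3). It also shows that np.swapaxes gives the same result when only two axes trade places:

```python
import numpy as np

batch_size, seq_len, emb_dim = 2, 5, 3
x = np.arange(batch_size * seq_len * emb_dim).reshape(batch_size, seq_len, emb_dim)

# (batch_size, sequence_length, embedding_dim) -> (batch_size, embedding_dim, sequence_length)
y = x.transpose(0, 2, 1)

# Equivalent here, since only axes 1 and 2 are exchanged
z = np.swapaxes(x, 1, 2)

print(x.shape, "->", y.shape)
```

Unlike reshape, transpose moves whole axes, so the element at x[b, s, e] ends up at y[b, e, s].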

Core NumPy Functions for Manipulation

reshape: General function for changing the shape of a tensor, including adding or removing dimensions.

expand_dims: Adds a new axis with size 1.

squeeze: Removes axes with size 1.

ravel: Flattens to 1D.

transpose: Changes the order of axes.

swapaxes: Swaps specified axes (less general than transpose).

Summary Table of Operations

Operation          NumPy Function    Purpose
Add dimension      np.expand_dims    Convert (256,256) to (256,256,1)
Remove dimension   np.squeeze        Convert (N,1) to (N,)
General reshape    np.reshape        Any change matching total size
Flatten            np.ravel          Convert (a,b) to (a*b,)
Swap axes          np.swapaxes       Exchange positions of two axes
Permute axes       np.transpose      Reorder any sequence of axes

Closing Notes

A deep understanding of tensor structure - dimensions, size, and shape - is vital for preparing data for machine learning models.

Reshaping, expanding, squeezing, and transposing tensors are everyday tasks in model development, especially for adapting standard datasets and models to each other.
28min
May 24, 2018
MLA 003 Storage: HDF, Pickle, Postgres
Practical workflow of loading, cleaning, and storing large datasets for machine learning, moving from ingesting raw CSVs or JSON files with pandas to saving processed datasets and neural network weights using HDF5 for efficient numerical storage. It clearly distinguishes among storage options—explaining when to use HDF5, pickle files, or SQL databases—while highlighting how libraries like pandas, TensorFlow, and Keras interact with these formats and why these choices matter for production pipelines.
Links

Notes and resources at ocdevel.com/mlg/mla-3

Data Ingestion and Preprocessing

Data Sources and Formats:

Datasets commonly originate as CSV (comma-separated values), TSV (tab-separated values), fixed-width files (FWF), JSON from APIs, or directly from databases.

Typical applications include structured data (e.g., real estate features) or unstructured data (e.g., natural language corpora for sentiment analysis).

Pandas as the Core Ingestion Tool:

Pandas provides versatile functions such as read_csv, read_json, and others to load various file formats with robust options for handling edge cases (e.g., file encodings, missing values).

After loading, data cleaning is performed using pandas: dropping or imputing missing values, converting booleans and categorical columns to numeric form.

Data Encoding for Machine Learning:

All features must be numerical before being supplied to models built with frameworks like TensorFlow or Keras.

Categorical data is one-hot encoded using pandas.get_dummies, converting strings to binary indicator columns.

The underlying NumPy array of a DataFrame is accessed via df.values (or df.to_numpy() in newer pandas versions) for direct integration with modeling libraries.
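The cleaning and encoding steps above might look like the following sketch. The tiny real-estate table, its column names, and the choice of mean imputation are all hypothetical; df.to_numpy() is used as the currently recommended equivalent of df.values.

```python
import pandas as pd

# Hypothetical raw data with a missing value, a boolean, and a categorical column
df = pd.DataFrame({
    "sqft": [850.0, 1200.0, None, 2000.0],
    "has_garage": [True, False, True, True],
    "city": ["Austin", "Boise", "Austin", "Denver"],
})

# Impute the missing value (mean imputation is just one option)
df["sqft"] = df["sqft"].fillna(df["sqft"].mean())

# Convert booleans to 0/1
df["has_garage"] = df["has_garage"].astype(int)

# One-hot encode the categorical column into indicator columns
df = pd.get_dummies(df, columns=["city"], dtype=float)

# Extract the underlying numeric array for a modeling library
X = df.to_numpy(dtype=float)
print(df.columns.tolist())
print(X.shape)
```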

Numerical Data Storage Options

HDF5 for Storing Processed Arrays:

HDF5 (Hierarchical Data Format version 5) enables efficient storage of large multidimensional NumPy arrays.

Libraries like h5py and built-in pandas functions (to_hdf) allow seamless saving and retrieval of arrays or DataFrames.

Keras (including tf.keras in TensorFlow) has commonly stored neural network weights in HDF5 (.h5) files as multi-dimensional arrays; these files back model checkpointing and early stopping, so training can be recovered or rolled back.
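A minimal sketch of saving and reloading a processed array with h5py, assuming h5py is installed. The file name, dataset name "features", and the random data are placeholders; gzip is one of the compression filters HDF5 supports.

```python
import os
import tempfile

import h5py
import numpy as np

# Placeholder for a processed feature matrix
features = np.random.default_rng(0).random((1000, 50))
path = os.path.join(tempfile.mkdtemp(), "housing.h5")

# Write the array as a named, compressed dataset inside the HDF5 file
with h5py.File(path, "w") as f:
    f.create_dataset("features", data=features, compression="gzip")

# Read it back; slicing with [:] loads the dataset into a NumPy array
with h5py.File(path, "r") as f:
    restored = f["features"][:]

print(restored.shape, restored.dtype)
```

Because datasets are addressed by name, several arrays (features, labels, splits) can live in one file, which is the "hierarchical" part of the format.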

Pickle for Python Objects:

Python's pickle protocol serializes arbitrary objects, including machine learning models and arrays, into files for later retrieval.

While convenient for quick iterations or heterogeneous data, pickle is less efficient than HDF5 for large ndarrays, applies no compression on its own, and is a security risk: unpickling a file from an untrusted source can execute arbitrary code.
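For comparison, a sketch of pickling a mixed Python object. The "experiment" dictionary and its keys are hypothetical, standing in for whatever heterogeneous state a quick experiment needs to keep.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical mix of strings, nested dicts, and an array
experiment = {
    "name": "baseline",
    "params": {"lr": 0.01, "epochs": 5},
    "weights": np.arange(6.0).reshape(2, 3),
}
path = os.path.join(tempfile.mkdtemp(), "experiment.pkl")

with open(path, "wb") as f:
    pickle.dump(experiment, f, protocol=pickle.HIGHEST_PROTOCOL)

# Only unpickle files you created or fully trust
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored["name"], restored["weights"].shape)
```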

SQL Databases and Spreadsheets:

For mixed or heterogeneous data, or when producing results for sharing and collaboration, relational databases like PostgreSQL or spreadsheets such as CSVs are used.

Databases serve as the endpoint for production systems, where model outputs—such as generated recommendations or reports—are published for downstream use.
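Publishing model output to a relational table can be sketched with pandas. To stay self-contained, this example uses Python's built-in sqlite3 with an in-memory database as a stand-in for PostgreSQL; the table name, columns, and scores are invented for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical model output: recommended items per user
recs = pd.DataFrame({
    "user_id": [1, 1, 2],
    "item_id": [10, 42, 7],
    "score": [0.91, 0.85, 0.77],
})

conn = sqlite3.connect(":memory:")  # stand-in for a production database
recs.to_sql("recommendations", conn, index=False, if_exists="replace")

# A downstream consumer can now query the published results with plain SQL
top = pd.read_sql(
    "SELECT item_id, score FROM recommendations "
    "WHERE user_id = 1 ORDER BY score DESC",
    conn,
)
conn.close()
print(top)
```

With PostgreSQL the pandas calls stay the same, but the connection would typically come from a database driver or SQLAlchemy engine instead of sqlite3.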

Storage Workflow in Machine Learning Pipelines

Typical Process:

Data is initially loaded and processed with pandas, then converted to numerical arrays suitable for model training.

Intermediate states and model weights are saved using HDF5 during model development and training, ensuring recovery from interruptions and facilitating early stopping.

Final outputs, especially those requiring sharing or production use, are published to SQL databases or shared as spreadsheet files.

Best Practices and Progression:

Quick project starts may involve pickle for accessible storage during early experimentation.

For large-scale, high-performance applications, migration to HDF5 for numerical data and SQL for production-grade results is recommended.

Alternative options like Feather and PyTables (an interface on top of HDF5) exist for specialized needs.

Summary

HDF5 is optimal for numerical array storage due to its efficiency, built-in compression, and integration with major machine learning frameworks.

Pickle accommodates arbitrary Python objects but is a poor fit for persisting large numerical data and is unsafe to load from untrusted sources.

SQL databases and spreadsheets are used for disseminating results, especially when human consumption or application integration is required.

The selection of a storage format is determined by data type, pipeline stage, and end-use requirements within machine learning workflows.
18min
May 24, 2018
MLA 002 Numpy & Pandas
NumPy enables efficient storage and vectorized computation on large numerical datasets in RAM by leveraging contiguous memory allocation and low-level C/Fortran libraries, drastically reducing memory footprint compared to native Python lists. Pandas, built on top of NumPy, introduces labelled, flexible tabular data manipulation—facilitating intuitive row and column operations, powerful indexing, and seamless handling of missing data through tools like alignment, reindexing, and imputation.
Links

Notes and resources at ocdevel.com/mlg/mla-2

NumPy: Efficient Numerical Arrays and Vectorized Computation

Purpose and Design

NumPy ("Numerical Python") is the foundational library for handling large numerical datasets in RAM.

It introduces the ndarray (n-dimensional array), which is synonymous with a tensor—enabling storage of vectors, matrices, or higher-dimensional data.

Memory Efficiency

NumPy arrays are homogeneous: all elements share a consistent data type (e.g., float64, int32, bool).

This data type awareness enables allocation of tightly-packed, contiguous memory blocks, optimizing both RAM usage and data access speed.

Memory footprint can be several times lower than that of equivalent native Python lists; for example, tasks that exhausted 32GB of RAM using Python lists could drop to just 6GB with NumPy structures.
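A rough way to see the difference is to compare byte counts for the same 100,000 floats. This sketch assumes CPython, where a list stores pointers to separately allocated float objects; exact list numbers vary by Python version, so only the array figure is fixed.

```python
import sys

import numpy as np

n = 100_000
py_list = [float(i) for i in range(n)]
arr = np.arange(n, dtype=np.float64)

# A list holds references; each float is its own Python object with overhead
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# An ndarray stores raw 8-byte values in one contiguous block
array_bytes = arr.nbytes

print("list :", list_bytes, "bytes")
print("array:", array_bytes, "bytes")
```

Choosing a smaller dtype such as float32 or uint8 shrinks the array further, which is not an option for a plain list.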

Vectorized Operations

NumPy supports vectorized calculations: operations (such as squaring all elements) are applied across entire arrays in a single step, without explicit Python loops.

These operations are exposed through overloaded operators (e.g., arr ** 2) and executed by delegating to low-level, highly optimized C or Fortran routines, delivering significant speed gains over explicit Python loops.

Conditional operations and masking, such as zeroing out negative numbers (akin to a ReLU activation), can be done efficiently with Boolean masks.
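Both ideas in a few lines, using a small made-up array: squaring every element without a loop, then zeroing negatives with a Boolean mask (np.maximum is shown as an equivalent one-step alternative).

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.5, 3.0])

# Vectorized: the ** operator is applied to every element at once
squared = x ** 2

# Boolean mask: True wherever the condition holds
mask = x < 0
relu = x.copy()
relu[mask] = 0.0          # zero out the negatives, like a ReLU

# Same result without an explicit mask
relu_alt = np.maximum(x, 0.0)

print(squared)
print(relu)
```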

Pandas: Advanced Tabular Data Manipulation

Relationship to NumPy

Pandas builds upon NumPy, leveraging its underlying optimized array storage and computation for numerical columns in its data structures.

Pandas also supports additional types, such as strings, for the non-numeric data common in real-world datasets.

2D Data Handling and Directional Operations

The core Pandas structure is the DataFrame, which handles labelled rows and columns, analogous to a spreadsheet or SQL table.

Operations are equally intuitive row-wise and column-wise, facilitating both SQL-like ("row-oriented") and "columnar" manipulations.

This dual-orientation enables many complex data transformations to be succinct one-liners instead of lengthy Python code.

Indexing and Alignment

Pandas uses flexible and powerful indexing, enabling functions such as joining disparate datasets via a shared index (e.g., timestamp alignment in financial time series).

When merging DataFrames (e.g., two stocks with differing trading days), Pandas automatically aligns data on the index, introducing NaN (null) values for unmatched dates.
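Index alignment can be seen with two short, invented price series that trade on different days. pd.concat along the column axis is one way to combine them; the explicit sort_index call just makes the row order unambiguous.

```python
import pandas as pd

# Hypothetical closing prices on partially overlapping dates
stock_a = pd.Series(
    [10.0, 10.5, 11.0],
    index=pd.to_datetime(["2018-05-21", "2018-05-22", "2018-05-23"]),
    name="A",
)
stock_b = pd.Series(
    [20.0, 21.0],
    index=pd.to_datetime(["2018-05-22", "2018-05-24"]),
    name="B",
)

# Rows are matched on the date index; unmatched dates become NaN
prices = pd.concat([stock_a, stock_b], axis=1).sort_index()
print(prices)
```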

Handling Missing Data (Imputation)

Pandas includes robust features for detecting and filling missing values, known as imputation.

Options include forward filling, backfilling, or interpolating missing values based on surrounding data.

Datasets can be reindexed against standardized sequences, such as all valid trading days, to enforce consistent time frames and further identify or fill data gaps.
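A sketch of reindexing and the three fill strategies on a two-point invented series. Weekday business days (freq="B") are used here as a simple stand-in for a real trading calendar, which would also exclude market holidays.

```python
import pandas as pd

# Hypothetical observations on only two days of the week
observed = pd.Series(
    [100.0, 104.0],
    index=pd.to_datetime(["2018-05-21", "2018-05-24"]),
)

# Standardized sequence of dates: Monday through Friday of that week
calendar = pd.date_range("2018-05-21", "2018-05-25", freq="B")

# Reindexing exposes the gaps as NaN
full = observed.reindex(calendar)

forward = full.ffill()        # carry the last known value forward
backward = full.bfill()       # pull the next known value backward
linear = full.interpolate()   # straight line between known neighbors

print(pd.DataFrame({"raw": full, "ffill": forward, "bfill": backward, "interp": linear}))
```

Which strategy is appropriate depends on the data: forward filling avoids using future information, which usually matters for time series fed to a model.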

Use Cases and Integration

Pandas simplifies ETL (extract, transform, load) for CSV and database-derived data, merging NumPy’s computation power with tools for advanced data cleaning and integration.

When preparing data for machine learning frameworks (e.g., TensorFlow or Keras), Pandas DataFrames can be converted back into NumPy arrays for computation, maintaining tight integration across the data science stack.

Summary: NumPy underpins high-speed numerical operations and memory efficiency, while Pandas extends these capabilities to powerful, flexible, and intuitive manipulation of labelled tabular data. Together they form the backbone of data analysis and preparation in Python machine learning workflows.
19min

