🎵 Python Music Information Retrieval (MIR) - Complete Guide #

Comprehensive toolkit for analyzing, understanding, and processing musical audio with Python

🎯 What is Music Information Retrieval (MIR)? #

Music Information Retrieval is an interdisciplinary field that combines computer science, signal processing, machine learning, musicology, and psychology to automatically extract meaningful information from musical audio signals. MIR enables computers to “understand” music through computational analysis.

🌟 Core MIR Tasks #

🎼 Pitch & Melody: Fundamental frequency estimation, melody extraction, pitch tracking
🥁 Rhythm & Beat: Beat tracking, tempo estimation, meter detection, onset detection
🎨 Harmony & Chords: Chord recognition, key detection, harmonic analysis
🎭 Structure & Form: Music segmentation, structural analysis, repetition detection
🏷️ Classification: Genre classification, mood detection, instrument recognition
🔍 Similarity & Retrieval: Music recommendation, cover song identification, audio fingerprinting
🎚️ Source Separation: Isolating instruments, vocals, or harmonic/percussive components
📊 Feature Extraction: Spectral, temporal, and perceptual audio features

📦 Essential MIR Libraries #

🎼 librosa - Audio Analysis Foundation #

# Installation (shell commands; use either pip or conda)
pip install librosa
conda install -c conda-forge librosa

# Core usage
import librosa
import numpy as np

# Load audio file
y, sr = librosa.load('audio.wav', sr=22050)

# Basic feature extraction
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
spectral_centroids = librosa.feature.spectral_centroid(y=y, sr=sr)

# Beat and tempo analysis
tempo, beats = librosa.beat.beat_track(y=y, sr=sr)

🎛️ Essentia - Comprehensive MIR Toolkit #

# Installation (shell commands; install one of the two packages)
pip install essentia-tensorflow  # With TensorFlow models
pip install essentia             # Standard version

# Core usage
import essentia
import essentia.standard as es

# Load audio
loader = es.MonoLoader(filename='audio.wav')
audio = loader()

# Feature extraction
windowing = es.Windowing(type='hann')
spectrum = es.Spectrum()
mfcc = es.MFCC(inputSize=513)  # 1024-sample frames give 1024 / 2 + 1 = 513 spectrum bins

# Extract MFCC features
frames = []
for frame in es.FrameGenerator(audio, frameSize=1024, hopSize=512):
    mfcc_bands, mfcc_coeffs = mfcc(spectrum(windowing(frame)))
    frames.append(mfcc_coeffs)
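
To make the frame-wise loop above more concrete, here is a minimal NumPy-only sketch of what the windowing and spectrum steps compute for each frame. It does not use Essentia, and Essentia's own frame generator may treat the signal edges differently; the function name `frame_spectra` and the synthetic test tone are assumptions for illustration.

```python
import numpy as np

def frame_spectra(audio, frame_size=1024, hop_size=512):
    """Magnitude spectra of Hann-windowed frames, shape (n_frames, frame_size // 2 + 1).

    Simplified sketch: only complete frames are used, so a trailing
    partial frame is dropped instead of being zero-padded.
    """
    window = np.hanning(frame_size)
    n_frames = 1 + (len(audio) - frame_size) // hop_size
    spectra = []
    for i in range(n_frames):
        start = i * hop_size
        frame = audio[start:start + frame_size]
        spectra.append(np.abs(np.fft.rfft(frame * window)))
    return np.array(spectra)

# One second of a 440 Hz tone at 44.1 kHz (synthetic stand-in for a loaded file)
sr = 44100
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)

spectra = frame_spectra(tone)
peak_bin = int(spectra[0].argmax())
peak_hz = peak_bin * sr / 1024  # bin index -> frequency in Hz
print(spectra.shape, round(peak_hz, 1))
```

Each 1024-sample frame yields 513 frequency bins, and the strongest bin lies within one bin width (about 43 Hz here) of the 440 Hz tone.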

🥁 madmom - Beat Tracking & Onset Detection #

# Installation
pip install madmom

# Beat tracking
from madmom.features.beats import RNNBeatProcessor, DBNBeatTrackingProcessor

# Process audio for beat tracking
proc = DBNBeatTrackingProcessor(fps=100)
act = RNNBeatProcessor()('audio.wav')
beats = proc(act)

# Onset detection
from madmom.features.onsets import OnsetPeakPickingProcessor, RNNOnsetProcessor

# Detect onsets
rnn = RNNOnsetProcessor()
onsets_proc = OnsetPeakPickingProcessor(fps=100)
onsets = onsets_proc(rnn('audio.wav'))
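
The peak-picking stage turns a frame-wise activation function into event times. The sketch below shows the basic idea with a plain threshold and local-maximum test. It is a simplified stand-in rather than madmom's algorithm, which offers further options such as smoothing; `pick_peaks` and the toy activation curve are assumptions for illustration.

```python
import numpy as np

def pick_peaks(activation, fps=100, threshold=0.5):
    """Return times (seconds) of local maxima at or above `threshold`."""
    activation = np.asarray(activation, dtype=float)
    times = []
    for i in range(1, len(activation) - 1):
        is_local_max = activation[i] > activation[i - 1] and activation[i] >= activation[i + 1]
        if is_local_max and activation[i] >= threshold:
            times.append(i / fps)  # frame index -> seconds
    return times

# Toy activation: two strong peaks and one weak peak below the threshold
activation = np.zeros(300)
activation[50] = 0.9
activation[120] = 0.3
activation[200] = 0.8

onset_times = pick_peaks(activation, fps=100, threshold=0.5)
print(onset_times)
```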

📊 mir_eval - Evaluation Metrics #

# Installation
pip install mir_eval

# Beat tracking evaluation
import mir_eval

# Load reference and estimated beats
ref_beats = mir_eval.io.load_events('reference_beats.txt')
est_beats = mir_eval.io.load_events('estimated_beats.txt')

# Evaluate beat tracking performance
scores = mir_eval.beat.evaluate(ref_beats, est_beats)
print(f"F-measure: {scores['F-measure']:.3f}")

# Chord evaluation
ref_intervals, ref_labels = mir_eval.io.load_labeled_intervals('ref_chords.txt')
est_intervals, est_labels = mir_eval.io.load_labeled_intervals('est_chords.txt')
chord_scores = mir_eval.chord.evaluate(ref_intervals, ref_labels, 
                                       est_intervals, est_labels)

🔬 Core MIR Techniques #

🎼 Pitch & Melody Analysis #

Fundamental Frequency Estimation #

import librosa
import numpy as np

# Load audio
y, sr = librosa.load('vocal.wav')

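# Pitch tracking with the pYIN algorithm (pass sr so it matches the loaded audio)
f0, voiced_flag, voiced_probs = librosa.pyin(y,
                                             fmin=librosa.note_to_hz('C2'),
                                             fmax=librosa.note_to_hz('C7'),
                                             sr=sr)

# Convert to MIDI notes (unvoiced frames are NaN in f0 and stay NaN here)
midi_notes = librosa.hz_to_midi(f0)

# Rough melody estimate: take the strongest pitch candidate in each frame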
pitches, magnitudes = librosa.piptrack(y=y, sr=sr, threshold=0.1)
melody = []
for t in range(pitches.shape[1]):
    index = magnitudes[:, t].argmax()
    pitch = pitches[index, t]
    melody.append(pitch)
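
The Hz-to-MIDI conversion used above follows the standard equal-temperament mapping, with A4 = 440 Hz = MIDI note 69. As a minimal sketch of that formula in plain NumPy (the helper name `hz_to_midi_sketch` is hypothetical, and unvoiced frames are assumed to be marked as NaN):

```python
import numpy as np

def hz_to_midi_sketch(freqs_hz):
    """Map frequencies in Hz to fractional MIDI note numbers (A4 = 440 Hz = 69)."""
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    midi = np.full(freqs_hz.shape, np.nan)
    voiced = np.isfinite(freqs_hz) & (freqs_hz > 0)  # skip NaN / non-positive values
    midi[voiced] = 69.0 + 12.0 * np.log2(freqs_hz[voiced] / 440.0)
    return midi

# A4, an unvoiced frame, roughly middle C, and A5
f0_example = np.array([440.0, np.nan, 261.63, 880.0])
midi_example = hz_to_midi_sketch(f0_example)

# Round to the nearest note; keep unvoiced frames as None
rounded = [None if np.isnan(m) else int(round(m)) for m in midi_example]
print(rounded)
```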

Chroma Features for Harmony Analysis #

# Chroma features (pitch class profiles)
chroma_stft = librosa.feature.chroma_stft(y=y, sr=sr)
chroma_cqt = librosa.feature.chroma_cqt(y=y, sr=sr)

# Enhanced chroma with harmonic-percussive separation
y_harmonic, y_percussive = librosa.effects.hpss(y)
chroma_harmonic = librosa.feature.chroma_cqt(y=y_harmonic, sr=sr)

# Key detection using chroma (major keys only; Krumhansl-Kessler major profile)
key_profile = np.mean(chroma_harmonic, axis=1)
major_profile = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
key_correlations = []
for shift in range(12):
    shifted_profile = np.roll(major_profile, shift)
    correlation = np.corrcoef(key_profile, shifted_profile)[0, 1]
    key_correlations.append(correlation)

estimated_key = np.argmax(key_correlations)  # pitch class of the tonic: 0 = C, 1 = C#, ..., 11 = B
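
The snippet above correlates the averaged chroma against the major profile only. A possible extension, sketched here under the assumption that the Krumhansl-Kessler minor profile is used alongside the major one, tests all 24 major and minor keys. The function name `estimate_key` is hypothetical, and the input is a synthetic chroma vector rather than real audio.

```python
import numpy as np

# Krumhansl-Kessler key profiles, index 0 = tonic
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
NOTE_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def estimate_key(mean_chroma):
    """Return (key_name, correlation) for the best of 24 major/minor keys."""
    best_name, best_r = None, -np.inf
    for mode, profile in (('major', MAJOR), ('minor', MINOR)):
        for tonic in range(12):
            # Rolling by `tonic` moves the profile's tonic weight to that pitch class
            r = np.corrcoef(mean_chroma, np.roll(profile, tonic))[0, 1]
            if r > best_r:
                best_name, best_r = f"{NOTE_NAMES[tonic]} {mode}", r
    return best_name, float(best_r)

# Synthetic chroma vectors shaped like the profiles themselves
g_major_name, _ = estimate_key(np.roll(MAJOR, 7))
a_minor_name, _ = estimate_key(np.roll(MINOR, 9))
print(g_major_name, '|', a_minor_name)
```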

🥁 Rhythm & Beat Analysis #

Beat Tracking and Tempo Estimation #

# Beat tracking (beat positions returned in seconds)
tempo, beats = librosa.beat.beat_track(y=y, sr=sr, units='time')

# Onset detection for rhythm analysis (onset positions in seconds)
onset_times = librosa.onset.onset_detect(y=y, sr=sr, units='time')
onset_strength = librosa.onset.onset_strength(y=y, sr=sr)

# Global tempo estimate (librosa >= 0.10; older releases expose this as librosa.beat.tempo)
tempo_static = librosa.feature.tempo(onset_envelope=onset_strength, sr=sr)[0]

# Advanced rhythm features
tempogram = librosa.feature.tempogram(onset_envelope=onset_strength, sr=sr)
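
Once beat times are available, a quick sanity check on the reported tempo is to derive beats per minute from the inter-beat intervals. The following is a minimal sketch of that arithmetic, not how librosa estimates tempo internally; `tempo_from_beats` and the example beat grid are assumptions for illustration.

```python
import numpy as np

def tempo_from_beats(beat_times):
    """Estimate tempo in BPM as 60 / median inter-beat interval (seconds)."""
    beat_times = np.asarray(beat_times, dtype=float)
    if beat_times.size < 2:
        return float('nan')  # at least two beats are needed for an interval
    intervals = np.diff(beat_times)
    return 60.0 / float(np.median(intervals))

# Beats every half second correspond to 120 BPM
example_beats = np.arange(0.0, 10.0, 0.5)
bpm = tempo_from_beats(example_beats)
print(bpm)
```

The median is used so that a few missed or spurious beats do not skew the estimate.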

🎨 Spectral Feature Extraction #

Comprehensive Spectral Analysis #

# Spectral features suite
spectral_centroids = librosa.feature.spectral_centroid(y=y, sr=sr)
spectral_rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
spectral_bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)
spectral_contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
spectral_flatness = librosa.feature.spectral_flatness(y=y)

# Zero crossing rate for texture analysis
zcr = librosa.feature.zero_crossing_rate(y)

# Mel-frequency cepstral coefficients
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
delta_mfccs = librosa.feature.delta(mfccs)
delta2_mfccs = librosa.feature.delta(mfccs, order=2)

# Tonnetz (harmonic network) features
tonnetz = librosa.feature.tonnetz(y=y_harmonic, sr=sr)
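
To illustrate what one of these descriptors measures, the spectral centroid is the magnitude-weighted mean frequency of a spectrum. Here is a minimal single-frame sketch in NumPy; it applies no window and processes only one frame, so values will differ somewhat from librosa's frame-wise output. The name `spectral_centroid_frame` is hypothetical.

```python
import numpy as np

def spectral_centroid_frame(frame, sr):
    """Magnitude-weighted mean frequency (Hz) of a single frame."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    total = mag.sum()
    if total == 0:
        return 0.0  # silent frame: no meaningful centroid
    return float((freqs * mag).sum() / total)

# A 1000 Hz tone that falls exactly on an FFT bin (1000 * 1024 / 8000 = 128)
sr = 8000
n = 1024
t = np.arange(n) / sr
tone = np.sin(2 * np.pi * 1000 * t)

centroid = spectral_centroid_frame(tone, sr)
print(round(centroid, 1))
```

For a pure tone the centroid sits at the tone's frequency; brighter, noisier sounds push it upward.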

🤖 Machine Learning for MIR #

🧠 Deep Learning Models #

CNN for Music Classification #

import tensorflow as tf
from tensorflow.keras import layers, models

def create_music_cnn(input_shape, num_classes):
    """CNN for music genre classification"""
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation='softmax')
    ])
    return model

# Prepare spectrogram data
def audio_to_spectrogram(y, sr):
    """Convert audio to mel-spectrogram for CNN input"""
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    S_db = librosa.power_to_db(S, ref=np.max)
    return S_db
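
A CNN with a fixed `input_shape` needs every spectrogram to have the same number of frames, whereas mel-spectrograms grow with clip duration. One common way to handle this, sketched below as an assumption rather than as part of the original pipeline, is to center-crop or pad along time, rescale to [0, 1], and add a channel axis. The helper `to_cnn_input` and the random stand-in spectrograms are hypothetical.

```python
import numpy as np

def to_cnn_input(S_db, n_frames=128):
    """Crop or pad a (n_mels, time) dB spectrogram to (n_mels, n_frames, 1)."""
    n_mels, t = S_db.shape
    if t < n_frames:
        # Pad on the right with the quietest value in the clip
        pad = np.full((n_mels, n_frames - t), S_db.min())
        S_db = np.concatenate([S_db, pad], axis=1)
    else:
        # Keep the central n_frames columns
        start = (t - n_frames) // 2
        S_db = S_db[:, start:start + n_frames]
    lo, hi = S_db.min(), S_db.max()
    scaled = (S_db - lo) / (hi - lo) if hi > lo else np.zeros_like(S_db)
    return scaled[..., np.newaxis].astype(np.float32)  # add channel axis

# Random stand-ins for dB-scaled mel-spectrograms of different lengths
rng = np.random.default_rng(0)
x_long = to_cnn_input(rng.uniform(-80, 0, size=(128, 200)))
x_short = to_cnn_input(rng.uniform(-80, 0, size=(128, 50)))
print(x_long.shape, x_short.shape)
```

With these settings the matching call would be `create_music_cnn(input_shape=(128, 128, 1), num_classes=...)`.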

RNN for Sequential Music Analysis #

def create_music_rnn(sequence_length, num_features, num_classes):
    """RNN for music sequence analysis"""
    model = models.Sequential([
        layers.LSTM(128, return_sequences=True, input_shape=(sequence_length, num_features)),
        layers.Dropout(0.3),
        layers.LSTM(64, return_sequences=False),
        layers.Dropout(0.3),
        layers.Dense(32, activation='relu'),
        layers.Dense(num_classes, activation='softmax')
    ])
    return model

🎯 Traditional ML Approaches #

Feature-Based Classification #

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

def extract_comprehensive_features(y, sr):
    """Extract comprehensive feature set for traditional ML"""
    features = []
    
    # Spectral features
    spectral_centroids = np.mean(librosa.feature.spectral_centroid(y=y, sr=sr))
    spectral_rolloff = np.mean(librosa.feature.spectral_rolloff(y=y, sr=sr))
    spectral_bandwidth = np.mean(librosa.feature.spectral_bandwidth(y=y, sr=sr))
    zcr = np.mean(librosa.feature.zero_crossing_rate(y))
    
    # MFCC features
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    mfcc_means = np.mean(mfccs, axis=1)
    mfcc_stds = np.std(mfccs, axis=1)
    
    # Chroma features
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    chroma_means = np.mean(chroma, axis=1)
    
    # Combine all features
    features.extend([spectral_centroids, spectral_rolloff, spectral_bandwidth, zcr])
    features.extend(mfcc_means)
    features.extend(mfcc_stds)
    features.extend(chroma_means)
    
    return np.array(features)
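
The vector returned above has 4 + 13 + 13 + 12 = 42 entries on very different scales (a spectral centroid is in the thousands of Hz, a zero-crossing rate is well below 1), which is why `StandardScaler` is imported. As a rough NumPy sketch of the z-score scaling such a scaler applies, fitted on training data only (the helper names and toy data are assumptions):

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute per-feature mean and standard deviation from training data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled instead of dividing by zero
    return mean, std

def apply_standardizer(X, mean, std):
    """Scale features to zero mean and unit variance using fitted statistics."""
    return (X - mean) / std

# Toy features on very different scales (e.g. centroid in Hz, ZCR, an MFCC)
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[2000.0, 0.1, -5.0], scale=[500.0, 0.02, 3.0], size=(50, 3))

mean, std = fit_standardizer(X_train)
X_scaled = apply_standardizer(X_train, mean, std)
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

The same `mean` and `std` should be reused for validation and test data so that no information leaks from them into training.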

🎼 Advanced MIR Applications #

🎵 Music Recommendation System #

import numpy as np
import pandas as pd
import librosa
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

class MusicRecommendationSystem:
    """Content-based music recommendation using audio features"""
    
    def __init__(self):
        self.features_db = None
        self.track_metadata = None
        self.scaler = StandardScaler()
        
    def extract_audio_features(self, y, sr):
        """Extract audio features for recommendation"""
        # Spectral features
        spectral_features = [
            np.mean(librosa.feature.spectral_centroid(y=y, sr=sr)),
            np.mean(librosa.feature.spectral_rolloff(y=y, sr=sr)),
            np.mean(librosa.feature.spectral_bandwidth(y=y, sr=sr)),
            np.mean(librosa.feature.zero_crossing_rate(y))
        ]
        
        # Rhythm features
        tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
        # Recent librosa versions return tempo as an array; keep a plain float
        rhythm_features = [float(np.atleast_1d(tempo)[0])]
        
        # Timbral features (MFCCs)
        mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        mfcc_features = list(np.mean(mfccs, axis=1))
        
        # Harmonic features (Chroma)
        chroma = librosa.feature.chroma_stft(y=y, sr=sr)
        chroma_features = list(np.mean(chroma, axis=1))
        
        return spectral_features + rhythm_features + mfcc_features + chroma_features
    
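    def recommend_similar_tracks(self, query_audio, n_recommendations=5):
        """Recommend similar tracks based on audio content.

        Assumes self.scaler has been fitted and that self.features_db (scaled
        feature rows) and self.track_metadata (a DataFrame) have been populated.
        """
        if self.features_db is None or self.track_metadata is None:
            raise RuntimeError("Populate features_db and track_metadata before requesting recommendations")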
        # Extract features from query audio
        y, sr = librosa.load(query_audio)
        query_features = self.extract_audio_features(y, sr)
        query_features_scaled = self.scaler.transform([query_features])
        
        # Calculate similarities
        similarities = cosine_similarity(query_features_scaled, self.features_db)[0]
        
        # Get top recommendations
        top_indices = np.argsort(similarities)[::-1][:n_recommendations]
        
        recommendations = []
        for idx in top_indices:
            recommendations.append({
                'track': self.track_metadata.iloc[idx]['title'],
                'artist': self.track_metadata.iloc[idx]['artist'],
                'similarity': similarities[idx]
            })
        
        return recommendations
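
The ranking step in `recommend_similar_tracks` boils down to cosine similarity between a query vector and every row of a feature database. Here is a minimal NumPy sketch of that step on toy data; `top_k_similar` is a hypothetical helper, the vectors are made up, and all rows are assumed to be non-zero.

```python
import numpy as np

def top_k_similar(query, features_db, k=3):
    """Return [(row_index, cosine_similarity)] for the k most similar rows."""
    db_unit = features_db / np.linalg.norm(features_db, axis=1, keepdims=True)
    q_unit = query / np.linalg.norm(query)
    sims = db_unit @ q_unit               # cosine similarity per database row
    order = np.argsort(sims)[::-1][:k]    # highest similarity first
    return [(int(i), float(sims[i])) for i in order]

# Toy database of (already scaled) feature vectors, one row per track
features_db = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
])
query = np.array([1.0, 0.05, 0.0])

ranking = top_k_similar(query, features_db, k=2)
print(ranking)
```

In the class above, `features_db` would hold the output of `extract_audio_features` for each catalogue track after scaling with the fitted scaler.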

🎭 Chord Recognition System #

class ChordRecognizer:
    """Template-based chord recognition (offline, processed frame by frame)"""
    
    def __init__(self):
        self.chord_templates = self.create_chord_templates()
        
    def create_chord_templates(self):
        """Create chord templates based on music theory"""
        templates = {}
        
        # Chord intervals (semitones from root)
        chord_types = {
            'major': [0, 4, 7],    # Root, major third, perfect fifth
            'minor': [0, 3, 7],    # Root, minor third, perfect fifth
            'dim': [0, 3, 6],      # Diminished
            'aug': [0, 4, 8],      # Augmented
        }
        
        note_names = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
        
        for root in range(12):
            for chord_type, intervals in chord_types.items():
                chord_name = f"{note_names[root]}{chord_type if chord_type != 'major' else ''}"
                template = np.zeros(12)
                
                for interval in intervals:
                    template[(root + interval) % 12] = 1
                
                templates[chord_name] = template
        
        return templates
    
    def recognize_chords(self, y, sr, hop_length=512):
        """Recognize chord progression in audio"""
        # Use harmonic component for better chord recognition
        y_harmonic, _ = librosa.effects.hpss(y)
        
        # Extract chroma features
        chroma = librosa.feature.chroma_cqt(y=y_harmonic, sr=sr, hop_length=hop_length)
        chroma_norm = librosa.util.normalize(chroma, axis=0)
        
        chord_progression = []
        times = librosa.frames_to_time(np.arange(chroma.shape[1]), 
                                       sr=sr, hop_length=hop_length)
        
        for i in range(chroma.shape[1]):
            chroma_frame = chroma_norm[:, i]
            
            # Find best matching chord template
            best_chord = None
            best_score = -1
            
            for chord_name, template in self.chord_templates.items():
                score = np.corrcoef(chroma_frame, template)[0, 1]
                
                if not np.isnan(score) and score > best_score:
                    best_score = score
                    best_chord = chord_name
            
            chord_progression.append({
                'time': times[i],
                'chord': best_chord,
                'confidence': best_score
            })
        
        return chord_progression
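
The template-matching idea can be checked without audio by feeding in hand-made chroma vectors. The sketch below restricts itself to major and minor triads and adds a no-chord label `'N'` for flat frames, where correlation is undefined. Both the `'N'` label and the `'m'` suffix are assumptions that differ from the naming used in the class above.

```python
import numpy as np

NOTE_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
TRIADS = (('', (0, 4, 7)), ('m', (0, 3, 7)))  # major and minor intervals in semitones

def best_chord(chroma_frame):
    """Return (chord_name, correlation) for the best-matching triad template."""
    chroma_frame = np.asarray(chroma_frame, dtype=float)
    if np.std(chroma_frame) == 0:
        return 'N', 0.0  # flat frame: correlation is undefined, report no chord
    best_name, best_score = 'N', -np.inf
    for root in range(12):
        for suffix, intervals in TRIADS:
            template = np.zeros(12)
            template[[(root + i) % 12 for i in intervals]] = 1.0
            score = np.corrcoef(chroma_frame, template)[0, 1]
            if score > best_score:
                best_name, best_score = NOTE_NAMES[root] + suffix, score
    return best_name, float(best_score)

# Hand-made chroma frames: a slightly noisy C major triad and a clean A minor triad
c_major_frame = [1.0, 0.1, 0.0, 0.0, 0.9, 0.0, 0.0, 0.8, 0.0, 0.1, 0.0, 0.0]
a_minor_frame = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]

print(best_chord(c_major_frame)[0], best_chord(a_minor_frame)[0], best_chord(np.zeros(12))[0])
```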

🎚️ Source Separation & Audio Enhancement #

🎵 Harmonic-Percussive Separation #

def advanced_source_separation(y, sr):
    """Advanced source separation techniques"""
    
    # Basic harmonic-percussive separation
    y_harmonic, y_percussive = librosa.effects.hpss(y)
    
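    # Separation with asymmetric median-filter sizes.
    # kernel_size is (harmonic_width, percussive_width): the harmonic filter
    # runs along time, the percussive filter along frequency.
    D = librosa.stft(y)
    
    # Harmonic part with a longer time-axis (horizontal) filter
    D_harmonic = librosa.decompose.hpss(D, kernel_size=(61, 31))[0]
    
    # Percussive part with a longer frequency-axis (vertical) filter
    D_percussive = librosa.decompose.hpss(D, kernel_size=(31, 61))[1]
    
    # Convert back to time domain, matching the input length
    y_harmonic_adv = librosa.istft(D_harmonic, length=len(y))
    y_percussive_adv = librosa.istft(D_percussive, length=len(y))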
    
    return {
        'harmonic': y_harmonic_adv,
        'percussive': y_percussive_adv,
        'original': y
    }

# Non-negative Matrix Factorization (NMF)
def nmf_source_separation(y, sr, n_components=8):
    """Source separation using NMF"""
    # Compute the STFT once and keep both magnitude and phase
    D = librosa.stft(y)
    S = np.abs(D)
    phase = np.angle(D)
    
    # Apply NMF: S is approximated by W @ H
    W, H = librosa.decompose.decompose(S, n_components=n_components, sort=True)
    
    # Reconstruct sources
    sources = []
    for i in range(n_components):
        # Magnitude contribution of component i
        S_component = np.outer(W[:, i], H[i, :])
        
        # Reuse the mixture phase and return to the time domain
        D_component = S_component * np.exp(1j * phase)
        y_component = librosa.istft(D_component, length=len(y))
        sources.append(y_component)
    
    return sources
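
For intuition about what the NMF step does, the factorization S ≈ W @ H can be sketched with the classic multiplicative update rules for the Frobenius-norm objective. This is an illustrative NumPy version on a toy matrix, not the solver that `librosa.decompose.decompose` uses; the function name, the iteration count, and the toy "spectrogram" are assumptions.

```python
import numpy as np

def nmf_multiplicative(V, n_components, n_iter=200, seed=0, eps=1e-10):
    """Factor non-negative V (bins x frames) into W (bins x k) and H (k x frames)."""
    rng = np.random.default_rng(seed)
    n_bins, n_frames = V.shape
    W = rng.random((n_bins, n_components)) + eps
    H = rng.random((n_components, n_frames)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def relative_error(V, W, H):
    """Frobenius reconstruction error relative to the norm of V."""
    return float(np.linalg.norm(V - W @ H) / np.linalg.norm(V))

# Toy "spectrogram": two spectral shapes switching on and off over five frames
templates = np.array([[1.0, 0.0], [0.5, 0.0], [0.0, 1.0], [0.0, 0.8]])
activations = np.array([[1.0, 1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 1.0, 1.0]])
V = templates @ activations

W0, H0 = nmf_multiplicative(V, n_components=2, n_iter=0)   # random starting point
W, H = nmf_multiplicative(V, n_components=2, n_iter=200)   # after the updates
print(relative_error(V, W0, H0), relative_error(V, W, H))
```

Columns of `W` play the role of spectral templates and rows of `H` their activations over time, which is what the reconstruction loop above relies on.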

📊 Evaluation & Benchmarking #

🎯 MIR Evaluation Metrics #

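import mir_eval

# Beat tracking evaluation
def evaluate_beat_tracking(reference_beats, estimated_beats):
    """Comprehensive beat tracking evaluation"""
    scores = mir_eval.beat.evaluate(reference_beats, estimated_beats)
    
    return {
        'F-measure': scores['F-measure'],
        'Cemgil': scores['Cemgil'],
        'Goto': scores['Goto'],
        'P-score': scores['P-score'],
        # Continuity-based scores are reported under these mir_eval keys
        'CMLc': scores['Correct Metric Level Continuous'],
        'CMLt': scores['Correct Metric Level Total'],
        'AMLc': scores['Any Metric Level Continuous'],
        'AMLt': scores['Any Metric Level Total']
    }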

# Onset detection evaluation
def evaluate_onset_detection(reference_onsets, estimated_onsets, tolerance=0.05):
    """Evaluate onset detection performance"""
    scores = mir_eval.onset.evaluate(reference_onsets, estimated_onsets, 
                                     window=tolerance)
    
    return {
        'Precision': scores['Precision'],
        'Recall': scores['Recall'],
        'F-measure': scores['F-measure']
    }

# Chord recognition evaluation
def evaluate_chord_recognition(ref_intervals, ref_labels, est_intervals, est_labels):
    """Evaluate chord recognition performance"""
    scores = mir_eval.chord.evaluate(ref_intervals, ref_labels, 
                                     est_intervals, est_labels)
    
    return {
        'Root': scores['root'],
        'Majmin': scores['majmin'],
        'Sevenths': scores['sevenths'],
        # mir_eval returns no 'Weighted_score' key; 'mirex' is its MIREX-style metric
        'MIREX': scores['mirex']
    }
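
The precision, recall, and F-measure figures for onsets and beats all rest on matching estimated events to reference events within a tolerance window. The pure-Python sketch below shows that computation with a simple two-pointer matcher. It is a simplified illustration and not mir_eval's implementation; the function name and the example times are assumptions.

```python
def f_measure_with_tolerance(reference, estimated, window=0.05):
    """Return (precision, recall, f_measure) for events matched within +/- window seconds."""
    reference = sorted(reference)
    estimated = sorted(estimated)
    matched = 0
    i = j = 0
    while i < len(reference) and j < len(estimated):
        diff = estimated[j] - reference[i]
        if abs(diff) <= window:
            matched += 1      # each event can be matched at most once
            i += 1
            j += 1
        elif diff < 0:
            j += 1            # estimate is too early for this reference event
        else:
            i += 1            # reference event was missed
    if matched == 0:
        return 0.0, 0.0, 0.0
    precision = matched / len(estimated)
    recall = matched / len(reference)
    return precision, recall, 2 * precision * recall / (precision + recall)

reference_onsets = [1.0, 2.0, 3.0, 4.0]
estimated_onsets = [1.02, 2.2, 3.01, 3.9, 5.0]
precision, recall, f_measure = f_measure_with_tolerance(reference_onsets, estimated_onsets)
print(precision, recall, round(f_measure, 3))
```

Here two of the five estimates fall within 50 ms of a reference onset, giving a precision of 0.4 and a recall of 0.5.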

🚀 Real-World MIR Applications #

🎵 Music Streaming Services #

Spotify: Audio feature analysis for recommendations and playlist generation
Shazam: Audio fingerprinting for music identification
Apple Music: Genre classification and mood detection
YouTube: Content ID for copyright detection

🎼 Music Production Tools #

Auto-Tune: Real-time pitch correction
Beat Detective: Automatic beat alignment
Chord Detection: DAW plugins for harmonic analysis
Stem Separation: AI-powered source separation

🎓 Music Education #

Transcription Tools: Automatic music notation generation
Practice Apps: Real-time feedback for instrument learning
Music Theory: Chord progression analysis and suggestions
Ear Training: Interval and chord recognition games

🔬 Musicology Research #

Corpus Studies: Large-scale musical analysis
Cultural Analysis: Cross-cultural music comparison
Historical Studies: Evolution of musical styles
Computational Musicology: Data-driven music research

🛠️ Best Practices & Tips #

⚡ Performance Optimization #

# Efficient audio loading
y, sr = librosa.load('audio.wav', sr=22050, mono=True)  # Downsample for speed

# Batch processing
def process_audio_batch(audio_files):
    """Process multiple audio files efficiently"""
    results = []
    
    for audio_file in audio_files:
        y, sr = librosa.load(audio_file, sr=22050)
        
        # Extract features in one pass
        features = {
            'mfcc': librosa.feature.mfcc(y=y, sr=sr),
            'chroma': librosa.feature.chroma_stft(y=y, sr=sr),
            'spectral_centroid': librosa.feature.spectral_centroid(y=y, sr=sr)
        }
        
        results.append(features)
    
    return results

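# Memory-efficient processing for long audio
def process_long_audio(audio_file, chunk_size=30, sr=22050):
    """Process a long audio file in chunk_size-second pieces, loading one piece at a time"""
    # `path=` requires librosa >= 0.10 (older releases use `filename=`)
    total_seconds = librosa.get_duration(path=audio_file)
    
    features = []
    for offset in np.arange(0, total_seconds, chunk_size):
        chunk, _ = librosa.load(audio_file, sr=sr, offset=offset, duration=chunk_size)
        if len(chunk) < 2048:
            continue  # shorter than the default 2048-sample analysis window
        chunk_features = extract_comprehensive_features(chunk, sr)
        features.append(chunk_features)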
    
    return np.array(features)
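
When chunking, it helps to make the chunk boundaries explicit and to decide what happens to a very short final piece. As a small sketch (the generator name `chunk_bounds` and the one-second minimum are assumptions), the boundaries can be computed in samples like this:

```python
def chunk_bounds(n_samples, sr, chunk_seconds=30, min_seconds=1.0):
    """Yield (start, end) sample indices, skipping a final chunk shorter than min_seconds."""
    chunk_len = int(chunk_seconds * sr)
    min_len = int(min_seconds * sr)
    for start in range(0, n_samples, chunk_len):
        end = min(start + chunk_len, n_samples)
        if end - start >= min_len:
            yield start, end

# A 65-second signal at 22050 Hz: two full chunks plus a 5-second remainder
bounds = list(chunk_bounds(n_samples=22050 * 65, sr=22050))
print(bounds)

# A 100-sample remainder is below the minimum length and gets dropped
bounds_short_tail = list(chunk_bounds(n_samples=22050 * 60 + 100, sr=22050))
print(bounds_short_tail)
```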

🎯 Common Pitfalls & Solutions #

Audio Loading Issues #

# Handle different sample rates
def safe_audio_load(filename, target_sr=22050):
    """Safely load audio with error handling"""
    try:
        y, sr = librosa.load(filename, sr=target_sr)
        
        # Check for empty audio
        if len(y) == 0:
            raise ValueError("Empty audio file")
        
        # Normalize audio
        y = librosa.util.normalize(y)
        
        return y, sr
    except Exception as e:
        print(f"Error loading {filename}: {e}")
        return None, None

Feature Extraction Robustness #

def robust_feature_extraction(y, sr):
    """Extract features with error handling"""
    features = {}
    
    try:
        # MFCC with error handling
        mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        if not np.any(np.isnan(mfccs)):
            features['mfcc'] = np.mean(mfccs, axis=1)
        else:
            features['mfcc'] = np.zeros(13)
    except Exception:
        features['mfcc'] = np.zeros(13)
    
    try:
        # Chroma with error handling
        chroma = librosa.feature.chroma_stft(y=y, sr=sr)
        if not np.any(np.isnan(chroma)):
            features['chroma'] = np.mean(chroma, axis=1)
        else:
            features['chroma'] = np.zeros(12)
    except Exception:
        features['chroma'] = np.zeros(12)
    
    return features
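
The function above falls back to zeros as soon as a single value is NaN. A gentler alternative, sketched here as one possible design rather than a recommendation from the original guide, is to drop only the affected frames before averaging. The helper name `safe_frame_mean` is hypothetical.

```python
import numpy as np

def safe_frame_mean(feature_matrix, n_expected):
    """Average a (n_features, n_frames) matrix over time, ignoring non-finite frames."""
    feature_matrix = np.asarray(feature_matrix, dtype=float)
    if feature_matrix.ndim != 2 or feature_matrix.shape[0] != n_expected:
        return np.zeros(n_expected)  # unexpected shape: fall back to zeros
    good = np.all(np.isfinite(feature_matrix), axis=0)  # frames without NaN or inf
    if not good.any():
        return np.zeros(n_expected)
    return feature_matrix[:, good].mean(axis=1)

# The middle frame contains a NaN and is left out of the average
mfcc_like = np.array([[1.0, np.nan, 3.0],
                      [2.0, 5.0, 4.0]])
summary = safe_frame_mean(mfcc_like, n_expected=2)
all_bad = safe_frame_mean(np.full((2, 3), np.nan), n_expected=2)
print(summary, all_bad)
```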

📚 Learning Resources #

📖 Essential Books #

“Fundamentals of Music Processing” by Meinard Müller
“An Introduction to Audio Content Analysis” by Alexander Lerch
“Music Information Retrieval” by Downie, West, Ehmann, Vincent

🌐 Online Courses & Tutorials #

CCRMA MIR Workshop (Stanford University)
Audio Signal Processing for Music Applications (Coursera)
musicinformationretrieval.com - Interactive notebooks

🔬 Research Conferences #

ISMIR - International Society for Music Information Retrieval
ICASSP - International Conference on Acoustics, Speech and Signal Processing
DAFx - International Conference on Digital Audio Effects

💻 Code Repositories #

librosa/librosa - Core audio analysis library
MTG/essentia - Comprehensive MIR toolkit
CPJKU/madmom - Beat tracking and onset detection
mir-evaluation/mir_eval - Evaluation metrics

🎵 Conclusion #

Music Information Retrieval represents the fascinating intersection of technology and artistry, enabling computers to understand and analyze the rich complexity of musical audio. With powerful Python libraries like librosa, essentia, madmom, and mir_eval, researchers and developers can build sophisticated systems for music analysis, recommendation, and understanding.

The field continues to evolve rapidly with advances in deep learning, transformer models, and self-supervised learning, opening new possibilities for musical AI applications. Whether you’re building the next music streaming service, developing educational tools, or conducting musicological research, Python’s MIR ecosystem provides the foundation for innovation.

Ready to dive into the world of Music Information Retrieval? Start with librosa, explore the techniques, and build amazing musical AI applications! 🎶✨