Data Health v1

Generated 2026-05-04 20:40 UTC

Pipeline Summary
2,237,153 done
1 running
0 pending
38 error
64 in-plan cells · 1 in progress · 22 planned

Artist Coverage Grid

Rank tiers · 50K artists per tier
Cell encoding: complete · in progress · planned (gap to fill) · out of scope
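
The four-state legend, together with the summary counts (64 in-plan cells, 1 in progress, 22 planned), can be modelled with a small enum. This is a minimal sketch, not the dashboard's actual code: the `CellState` names and the `tally` helper are hypothetical, and it assumes the three in-plan states partition the in-plan cells, which would imply 41 complete cells.

```python
from collections import Counter
from enum import Enum


class CellState(Enum):
    """Four-state cell encoding from the grid legend (names are hypothetical)."""
    COMPLETE = "complete"
    IN_PROGRESS = "in progress"
    PLANNED = "planned"            # gap to fill
    OUT_OF_SCOPE = "out of scope"


def tally(cells):
    """Count cells per state and report how many are in plan."""
    counts = Counter(cells)
    in_plan = sum(n for state, n in counts.items()
                  if state is not CellState.OUT_OF_SCOPE)
    return counts, in_plan


# Illustrative grid built from the summary counts. 41 = 64 - 1 - 22 is an
# assumption (complete = in-plan minus in-progress minus planned).
cells = ([CellState.COMPLETE] * 41
         + [CellState.IN_PROGRESS] * 1
         + [CellState.PLANNED] * 22)
counts, in_plan = tally(cells)
print(in_plan, counts[CellState.COMPLETE])  # prints: 64 41
```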

Project Timeline

One-off pulls, incidents, and (eventually) scheduled ETLs. 17 entries · 1,489,381 total entities · 39.5 GB approximate footprint
Date | Kind | Title | Entities | Size | Description
2026-05-03 | one-off | Data Health Dashboard v1 | 0 | 110 KB | Coverage grid (rank tier × source) with 4-state encoding + appendable scan_log + this wave history ledger. Unified inventory view across all pipelines.
2026-04-27 → 2026-04-28 | one-off | April 27 discovery refresh | 181,055 | 1.8 GB | Bulk-pass that doubled the roster from ~190K to 389K. Added cm_stats + profile metadata for 181,055 newly discovered artists. NO daily history pulled — these artists are snapshot-only (the yellow band on the coverage grid).
2026-04-23 → 2026-04-30 | one-off | Instagram daily history | 181,055 | 9.0 GB | Full IG history 2017–now for all 181,055 artists with IG presence per cmStats. 125,882 actually wrote a parquet (others had API gaps). Unlocks social-first vs streaming-first typology + IG-as-leading-indicator analysis.
2026-04-15 → 2026-04-16 | incident | Disk-fill incident & recovery | 41 | - | macOS code_sign_clone bloat filled root partition to 100%, causing 41 silent parquet corruptions and 13 errored entities. Documented in RESUME_AFTER_REBOOT.md; reset to pending and resumed.
2026-04-11 → 2026-05-03 | one-off | Roster extension (5K–20K follower band) | 75,000 | 6.0 GB | Discovery + cmStats + metadata + full daily for ~75K artists in the 5K–20K sp_followers band — fills the gap below Wave 1's 15.7K floor. The 'developing artists' negative-example cohort from the strategic risks doc.
2026-04-01 → 2026-04-12 | one-off | Track full history (pop>=30/50) | 122,608 | 3.2 GB | Full 2017–now daily on 8 platforms for tracks with sp_popularity >= 30 (~91K initial, expanded to 122,608). Async client with platform pre-filter.
2026-03-25 → 2026-04-15 | one-off | Wave 4 — Full artist history (priority sources) | 99,826 | 8.0 GB | Full daily 2017–now for 100K artists × 5 priority sources. Interrupted by disk-fill on Apr 15; recovered via RESUME_AFTER_REBOOT.md playbook (41 silent corruptions, 13 errors reset).
2026-03-16 → 2026-03-25 | one-off | Wave 3 — Track daily (730d lookback) | 188,194 | 2.2 GB | 188,194 tracks × 8 platforms × 730d. First use of async client (aiohttp + 4 req/sec sliding window) with track-level platform pre-filter (~47% call savings, avg 4.2/8 platforms per track).
2026-03-14 → 2026-03-18 | one-off | Wave 3 — Playlist registry + daily snapshots | 116,859 | 493 MB | 100 editorial Spotify playlists; daily snapshots back to Jan 2023. 116,859 (playlist × date) snapshots.
2026-03-09 → 2026-03-11 | one-off | DJ Universe — Chartmetric name search | 5,995 | 19 MB | SequenceMatcher + genre bonus matched 5,995 of 6,402 unmatched DJ names (93.6% hit rate). 4 parallel Sonnet agents vetted 199 suspect matches; 118 rejects applied to master.
2026-03-03 → 2026-03-04 | one-off | Wave 2.7 — Artist profile metadata | 99,826 | 700 MB | 59-col extraction from /artist/:id (genres, career_stage, hometown, label, etc.). Filled the gap left by Wave 1, which collected cmStats but not profile metadata.
2026-03-01 → 2026-03-03 | one-off | Wave 2.5 — Track universe + metadata | 188,194 | 650 MB | Track universe = union of chart-discovered + filter-discovered + per-artist catalog tracks. 188,194 tracks × 51-field cm_statistics snapshot.
2026-03-01 → 2026-03-05 | one-off | Wave 2.6 — Artist 2yr daily lookback | 99,826 | 7.0 GB | 99,826 artists × 5 priority sources (Spotify, YouTube channel/artist, TikTok, Wikipedia) over Mar 2024–Mar 2026. cmStats source pre-filter saved ~30% of calls.
2026-02-28 → 2026-03-02 | one-off | Wave 2.0 — Chart snapshots (7 series, 2017–2026) | 23,429 | 54 MB | Snapshot-based ingestion across 7 chart series back to 2017. ~270x fewer API calls than per-artist chart history.
2026-02-27 → 2026-02-28 | one-off | Wave 1 — 100K artist roster + cmStats | 99,826 | 320 MB | Bulk pagination discovery yielded 99,826 artists. cmStats enrichment added 341 columns per artist (76 base + 130 weekly diff + 130 monthly diff + pct variants).
2026-02-26 | one-off | Wave 0 — Seed artists | 20 | 10 MB | 20 hand-picked top artists (Bad Bunny, Drake, Taylor Swift…) ingested at full depth — metadata, daily metrics across all 12 sources, track catalogs. Pipeline validation.
2026-02-25 → 2026-03-08 | one-off | DJ Universe — Discovery + ID resolution | 7,627 | 30 MB | Editorial-seeded DJ/electronic artist discovery from 12 sources. 22K raw → 13.5K cleansed → 7,627 ID-matched (4,435 Chartmetric + 3,767 Resident Advisor profiles).
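
The Wave 3 track entry above mentions an async client throttled by a 4 req/sec sliding window, plus a per-track platform pre-filter. A minimal sketch of both ideas follows. It is not the pipeline's code: `SlidingWindowLimiter`, `platforms_to_fetch`, `fetch_all`, and the platform names are hypothetical, and plain `asyncio` stands in for real aiohttp requests so the example runs standalone.

```python
import asyncio
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` acquisitions in any rolling `window` seconds."""

    def __init__(self, max_calls=4, window=1.0):
        self.max_calls = max_calls
        self.window = window
        self._stamps = deque()        # monotonic times of recent acquisitions
        self._lock = asyncio.Lock()   # serialises waiters

    async def acquire(self):
        async with self._lock:
            while True:
                now = time.monotonic()
                # Drop timestamps that have left the window.
                while self._stamps and now - self._stamps[0] >= self.window:
                    self._stamps.popleft()
                if len(self._stamps) < self.max_calls:
                    self._stamps.append(now)
                    return
                # Sleep until the oldest call falls out of the window.
                await asyncio.sleep(self.window - (now - self._stamps[0]))


def platforms_to_fetch(available, wanted):
    """Pre-filter: only request platforms the track is known to be on."""
    return [p for p in wanted if p in available]


async def fetch_all(jobs, max_calls=4, window=1.0):
    """Run placeholder 'requests' for each job under the rate limit."""
    limiter = SlidingWindowLimiter(max_calls, window)

    async def one(job):
        await limiter.acquire()
        return job   # a real client would issue the HTTP request here

    return await asyncio.gather(*(one(j) for j in jobs))


# Hypothetical track "t1" that is present on two of four wanted platforms.
wanted = ["spotify", "youtube", "tiktok", "shazam"]
jobs = [("t1", p) for p in platforms_to_fetch({"spotify", "tiktok"}, wanted)]
results = asyncio.run(fetch_all(jobs))
```

Skipping platforms a track is not on removes calls before they reach the limiter, which is where a saving like the reported ~47% would come from. The limiter only paces the calls that remain.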
Source: data/raw/chartmetric/* + data/ingestion_progress.db. Snapshot log: projects/data_health/data/snapshots/scan_log.parquet. Wave history: projects/data_health/data/wave_history.jsonl.
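
The ledger header figures (entry count and total entities) could be recomputed from the wave history file named above. This is a sketch under assumptions: the field names `kind` and `entities` are hypothetical, since the JSONL schema is not shown here, and the sample records reuse the date, kind, and entity count of three timeline entries.

```python
import json
from collections import Counter


def summarize_ledger(lines):
    """Return (entry count, total entities, entries per kind) for JSONL lines."""
    n_entries, total, kinds = 0, 0, Counter()
    for line in lines:
        line = line.strip()
        if not line:               # tolerate blank lines in an appendable file
            continue
        record = json.loads(line)
        n_entries += 1
        total += int(record.get("entities", 0))
        kinds[record.get("kind", "unknown")] += 1
    return n_entries, total, kinds


# Three sample records with assumed field names.
sample = [
    '{"date": "2026-02-26", "kind": "one-off", "entities": 20}',
    '{"date": "2026-04-15", "kind": "incident", "entities": 41}',
    '{"date": "2026-05-03", "kind": "one-off", "entities": 0}',
]
n_entries, total, kinds = summarize_ledger(sample)
print(n_entries, total, dict(kinds))
```

Because the function accepts any iterable of lines, an open file handle on wave_history.jsonl could be passed in directly, provided the real records carry equivalent fields.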