Methodology
Last updated: April 2026 · Pipeline v1.1.0
Overview
GeekPeak discovers which programming books real developers recommend by scanning every public article on DEV.to. We don't rely on bestseller lists, publisher data, or expert panels — we listen to what working developers actually write about.
1.27M
Articles scanned
12,568
Book articles found
657
Books ranked
4,717
Mentions tracked
Data Pipeline
Our pipeline has five stages, each with measurable quality metrics.
Corpus Collection
We crawled all 2.42M article IDs on DEV.to and retrieved 1,271,389 existing articles; the remaining IDs belong to deleted or draft articles. Every public article was saved, a 100% recovery rate.
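The ID sweep described above can be sketched roughly as follows. This is a minimal illustration, not the actual crawler: `fetch_article` is a hypothetical callable that returns an article dict for a live ID and `None` for a deleted or draft one, and the toy fetcher stands in for the real network call.

```python
from typing import Callable, Dict, Iterable, Optional


def crawl_ids(
    ids: Iterable[int],
    fetch_article: Callable[[int], Optional[dict]],
) -> Dict[int, dict]:
    """Try every candidate ID and keep only the articles that exist."""
    found: Dict[int, dict] = {}
    for article_id in ids:
        article = fetch_article(article_id)  # hypothetical fetcher
        if article is not None:
            found[article_id] = article
    return found


# Toy stand-in for the network call: only even IDs "exist".
def fake_fetch(article_id: int) -> Optional[dict]:
    return {"id": article_id} if article_id % 2 == 0 else None


corpus = crawl_ids(range(1, 11), fake_fetch)
print(len(corpus), "of 10 IDs resolved to articles")
```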
Book Article Detection
A multi-layer detector identifies articles that recommend books. It looks for Amazon links, ISBNs, publisher URLs, recommendation phrases, and known book titles.
Layer 1: Deterministic
Amazon ASINs, ISBNs, publisher links
Layer 2: Heuristic
Title patterns, recommendation phrases
Layer 3: Lexical
Known book title dictionary
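A minimal sketch of how the three layers might be checked, assuming simple regular expressions and a tiny stand-in title dictionary. The patterns, the `detect_layers` name, and the choice to report every layer that fires are illustrative assumptions, not the production detector; publisher links are omitted for brevity.

```python
import re

# Layer 1: deterministic signals (illustrative patterns, not the real ones).
ASIN_RE = re.compile(r"amazon\.[a-z.]+/(?:[\w-]+/)?(?:dp|gp/product)/([A-Z0-9]{10})")
ISBN_RE = re.compile(r"\bISBN(?:-1[03])?[:\s]+([\d-]{9,16}[\dXx])\b")

# Layer 2: heuristic recommendation phrases (assumed examples).
PHRASE_RE = re.compile(
    r"\b(must[- ]read|recommend(?:ed)?\s+(?:this\s+)?book|books?\s+every\s+developer)\b",
    re.IGNORECASE,
)

# Layer 3: lexical lookup against a known-title dictionary (tiny stand-in).
KNOWN_TITLES = {"clean code", "the pragmatic programmer"}


def detect_layers(text, known_titles=KNOWN_TITLES):
    """Return the names of the layers that fire for this article text."""
    layers = []
    if ASIN_RE.search(text) or ISBN_RE.search(text):
        layers.append("deterministic")
    if PHRASE_RE.search(text):
        layers.append("heuristic")
    lowered = text.lower()
    if any(title in lowered for title in known_titles):
        layers.append("lexical")
    return layers


print(detect_layers("I recommend this book: Clean Code"))
```

How the layers are weighted or combined into a final decision is not specified here, so the sketch only reports which ones matched.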
Book Extraction & Deduplication
From each detected article, we extract individual book references (via ASINs, ISBNs, Markdown links, and text patterns), then merge duplicates using fuzzy title matching and 100+ manual merge rules.
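The merge step can be illustrated with the standard library's `difflib`. This is a sketch under stated assumptions: the normalization, the 0.9 similarity threshold, the greedy grouping, and the single entry in `MERGE_RULES` are all hypothetical stand-ins for the real fuzzy matcher and the 100+ manual rules.

```python
from difflib import SequenceMatcher

# Hypothetical manual merge rule: alias -> canonical title.
MERGE_RULES = {"poeaa": "patterns of enterprise application architecture"}


def normalize(title):
    """Lowercase, drop any subtitle after ':', and collapse whitespace."""
    base = title.lower().split(":")[0]
    return " ".join(base.split())


def canonical(title):
    """Apply manual merge rules on top of the normalized title."""
    key = normalize(title)
    return MERGE_RULES.get(key, key)


def merge_titles(titles, threshold=0.9):
    """Greedy grouping: attach each title to the first group it resembles."""
    groups = {}  # canonical key -> list of raw titles
    for raw in titles:
        key = canonical(raw)
        for existing in groups:
            if SequenceMatcher(None, key, existing).ratio() >= threshold:
                groups[existing].append(raw)
                break
        else:
            groups[key] = [raw]
    return groups


groups = merge_titles([
    "Clean Code",
    "Clean Code: A Handbook of Agile Software Craftsmanship",
    "clean  code",
    "PoEAA",
    "Patterns of Enterprise Application Architecture",
])
for key, raws in groups.items():
    print(key, "<-", len(raws), "references")
```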
Metadata Enrichment
We fill in authors, publication years, and cover images using the Google Books API and the Open Library API, and manually verify the metadata for the top books.
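As a rough sketch of the enrichment step, the snippet below builds a Google Books volumes query by ISBN, parses a response shaped like that API's output, and falls back to an Open Library cover URL. The fallback order, the function names, and the hand-written `sample` payload are assumptions for illustration; no network call is made.

```python
from urllib.parse import urlencode

GOOGLE_BOOKS = "https://www.googleapis.com/books/v1/volumes"


def google_books_url(isbn):
    """Build a volumes search URL for one ISBN."""
    return f"{GOOGLE_BOOKS}?{urlencode({'q': f'isbn:{isbn}'})}"


def open_library_cover_url(isbn, size="M"):
    """Build an Open Library cover image URL (size S, M, or L)."""
    return f"https://covers.openlibrary.org/b/isbn/{isbn}-{size}.jpg"


def parse_volume(payload):
    """Pull title, authors, year, and cover from a volumes-style response."""
    items = payload.get("items") or []
    if not items:
        return {}
    info = items[0].get("volumeInfo", {})
    published = info.get("publishedDate", "")
    return {
        "title": info.get("title"),
        "authors": info.get("authors", []),
        "year": int(published[:4]) if published[:4].isdigit() else None,
        "cover": info.get("imageLinks", {}).get("thumbnail"),
    }


# Hand-written payload in the API's shape, not a real response.
isbn = "9780132350884"
sample = {"items": [{"volumeInfo": {
    "title": "Clean Code",
    "authors": ["Robert C. Martin"],
    "publishedDate": "2008-08-01",
}}]}

meta = parse_volume(sample)
cover = meta["cover"] or open_library_cover_url(isbn)  # assumed fallback order
print(google_books_url(isbn))
print(meta["year"], cover)
```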
Scoring & Ranking
Each book gets a score based on how many articles mention it, how many different authors recommend it, and how recent the recommendations are.
Scoring Formula
score = (unique_article_mentions × 1.0)
+ (unique_authors × 1.5)
+ (recency_boost × 0.8)
- (duplicate_author_penalty)
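The formula translates directly into code. In this sketch the weights come from the formula above, while the `BookStats` container and the example numbers are made up for illustration; how `recency_boost` and `duplicate_author_penalty` are computed is not specified here, so they are treated as plain inputs.

```python
from dataclasses import dataclass


@dataclass
class BookStats:
    unique_article_mentions: int
    unique_authors: int
    recency_boost: float             # definition not given; passed in as-is
    duplicate_author_penalty: float = 0.0  # likewise an opaque input


def score(stats):
    """Weighted sum from the scoring formula: 1.0, 1.5, 0.8, minus penalty."""
    return (
        stats.unique_article_mentions * 1.0
        + stats.unique_authors * 1.5
        + stats.recency_boost * 0.8
        - stats.duplicate_author_penalty
    )


# Made-up example: 10 articles, 8 distinct authors, boost 2.5, penalty 1.0.
example = BookStats(10, 8, 2.5, 1.0)
print(score(example))  # 10 + 12 + 2 - 1 = 23.0
```

Because unique authors carry the largest weight, a book recommended by many different people outranks one mentioned repeatedly by the same author.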
Accuracy Metrics
We measure three key quality metrics using sampling (systematic, stratified, or random, depending on the metric) and manual review.
99.7%
Book Precision
Of 657 ranked books, 99.7% are verified real published books with correct metadata.
99.0%
Article Recall
In a 100-article sample of articles not flagged as book articles, only 1 (1%) actually contained book recommendations.
97.6%
Extraction Recall
Of books present in detected articles, 97.6% were successfully extracted and counted.
How we measured these numbers
Book Precision: We sampled 98 books (every 7th from the score-sorted list) and manually verified that each is a real published book with the correct title and author. We found 2 non-books and 10 minor issues. After a full audit of all 684 books, we removed 27 non-book entries, leaving the 657 ranked books, and corrected 345 title/author issues.
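Taking every 7th entry from a 684-item list does yield 98 books, as this small sketch shows. The starting offset of 0 and the placeholder book names are assumptions.

```python
def systematic_sample(ranked, step=7, start=0):
    """Take every `step`-th item from an already sorted list."""
    return ranked[start::step]


# Placeholder for the 684 books in score order before the audit.
ranked = [f"book-{i}" for i in range(684)]
sample = systematic_sample(ranked)
print(len(sample))  # 98
```

Sampling at a fixed interval from the score-sorted list spreads the checks across high-, mid-, and low-ranked books instead of clustering them at the top.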
Article Recall: We sampled 100 articles from the 1.26M non-detected articles, stratified by engagement (25 each from the 0–4, 5–19, 20–99, and 100+ reaction buckets). Only 1 article was a clear miss: it summarized Martin Fowler's Patterns of Enterprise Application Architecture (PoEAA) without using typical recommendation language.
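A stratified draw like the one described can be sketched as follows. The bucket boundaries come from the text; the `reactions` field name, the fixed seed, and the synthetic article list are assumptions.

```python
import random

# Engagement buckets as (low, high) reaction counts; None means open-ended.
BUCKETS = [(0, 4), (5, 19), (20, 99), (100, None)]


def bucket_of(reactions):
    """Return the bucket a reaction count falls into."""
    for low, high in BUCKETS:
        if reactions >= low and (high is None or reactions <= high):
            return (low, high)
    raise ValueError(f"negative reaction count: {reactions}")


def stratified_sample(articles, per_bucket=25, seed=0):
    """Draw up to `per_bucket` articles at random from each bucket."""
    rng = random.Random(seed)
    by_bucket = {bucket: [] for bucket in BUCKETS}
    for article in articles:
        by_bucket[bucket_of(article["reactions"])].append(article)
    sample = []
    for bucket in BUCKETS:
        pool = by_bucket[bucket]
        sample.extend(rng.sample(pool, min(per_bucket, len(pool))))
    return sample


# Synthetic stand-in for the non-detected articles.
articles = [{"id": i, "reactions": i % 150} for i in range(3000)]
picked = stratified_sample(articles)
print(len(picked))  # 100
```

Equal-sized buckets deliberately over-represent high-engagement articles, which are rare but matter most if a book recommendation is missed.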
Extraction Recall: We sampled 20 detected articles and compared extractor output against all books actually present in the text. Of 41 total books, 40 were found. The one miss was a book title mentioned in prose without any link or formatting.
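The headline percentages follow from these sample counts. A short check, using only the numbers reported above (the `rate` helper is just for illustration):

```python
def rate(hits, total):
    """Fraction of hits in a sample; rejects empty samples."""
    if total <= 0:
        raise ValueError("total must be positive")
    return hits / total


extraction_recall = rate(40, 41)   # 40 of 41 books extracted
article_miss_rate = rate(1, 100)   # 1 of 100 sampled articles was a miss

print(f"Extraction recall: {extraction_recall:.1%}")      # 97.6%
print(f"Article recall:    {1 - article_miss_rate:.1%}")  # 99.0%
```

With samples this small, each figure is an estimate: one additional miss in the 41-book sample would move extraction recall to about 95.1%.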
Quality Assurance
Known Limitations
How We Improve
We continuously refine our pipeline:
- New detection patterns are added as we discover missed book formats
- Deduplication rules grow as new book aliases appear
- Accuracy metrics are re-measured with each major pipeline update
- Additional data sources (Hashnode, etc.) are planned
Found an issue with our data?
If you notice a wrong book, missing title, or data error, please let us know at hello@geekpeak.dev