Methodology

Last updated: April 2026 · Pipeline v1.1.0

Overview

GeekPeak discovers which programming books real developers recommend by scanning every public article on DEV.to. We don't rely on bestseller lists, publisher data, or expert panels — we listen to what working developers actually write about.

1.27M articles scanned · 12,568 book articles found · 657 books ranked · 4,717 mentions tracked

Data Pipeline

Our pipeline has five stages, each with measurable quality metrics.

1. Corpus Collection

We crawled all 2.42M article IDs on DEV.to and retrieved 1,271,389 existing articles; the remaining IDs belong to deleted or draft articles. Every public article was saved, a 100% recovery rate.
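The ID sweep described above can be sketched roughly as follows. This is a minimal illustration, not the production crawler: the `fetch` callable, the `crawl_ids` helper, and the toy `fake_fetch` stand-in are all hypothetical names, and a real run would wrap an HTTP client with rate limiting and retries.

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical fetcher type: takes an article ID and returns the article as a
# dict, or None when the ID is deleted, a draft, or otherwise unavailable.
Fetcher = Callable[[int], Optional[dict]]


def crawl_ids(ids: Iterable[int], fetch: Fetcher) -> Dict[str, object]:
    """Try every ID once and keep whatever the fetcher returns."""
    saved = {}
    checked = 0
    for article_id in ids:
        checked += 1
        article = fetch(article_id)
        if article is not None:
            saved[article_id] = article
    return {"checked": checked, "saved": saved}


# Stand-in for a real HTTP client: only even IDs "exist" in this toy corpus.
def fake_fetch(article_id: int) -> Optional[dict]:
    if article_id % 2 == 0:
        return {"id": article_id, "title": f"Article {article_id}"}
    return None


result = crawl_ids(range(1, 11), fake_fetch)
print(result["checked"], len(result["saved"]))
```

Injecting the fetcher keeps the counting logic testable without network access.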

2.42M IDs checked · 1.27M articles saved · 3.1 GB corpus

2. Book Article Detection

A multi-layer detector identifies articles that recommend books. It looks for Amazon links, ISBNs, publisher URLs, recommendation phrases, and known book titles.

Layer 1 (Deterministic): Amazon ASINs, ISBNs, publisher links
Layer 2 (Heuristic): Title patterns, recommendation phrases
Layer 3 (Lexical): Known book title dictionary
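A layered detector of this kind might look like the sketch below. Every pattern, phrase, and title in it is an illustrative assumption rather than one of the actual production rules, and the `detect` function name is hypothetical.

```python
import re

# All patterns and lists below are illustrative assumptions, not the
# production rules.
ASIN_LINK = re.compile(
    r"amazon\.[a-z.]+/(?:[^\s/]+/)?(?:dp|gp/product)/([A-Z0-9]{10})", re.I
)
ISBN13 = re.compile(r"\b97[89][- ]?(?:\d[- ]?){9}\d\b")
RECOMMEND = re.compile(
    r"\b(?:i recommend|must[- ]read|favorite books?)\b", re.I
)
KNOWN_TITLES = {"clean code", "the pragmatic programmer"}


def detect(text: str) -> list:
    """Return the names of the layers that fired, in layer order."""
    layers = []
    # Layer 1: hard identifiers such as Amazon product links or ISBN-13s.
    if ASIN_LINK.search(text) or ISBN13.search(text):
        layers.append("deterministic")
    # Layer 2: recommendation phrasing.
    if RECOMMEND.search(text):
        layers.append("heuristic")
    # Layer 3: substring lookup against a dictionary of known titles.
    lowered = text.lower()
    if any(title in lowered for title in KNOWN_TITLES):
        layers.append("lexical")
    return layers


print(detect("I recommend Clean Code to every junior dev."))
```

Returning the list of layers that fired, rather than a single boolean, makes it easier to audit why an article was flagged.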

12,568 book articles detected · 0.99% of all articles

3. Book Extraction & Deduplication

From each detected article, we extract individual book references (via ASINs, ISBNs, Markdown links, and text patterns), then merge duplicates using fuzzy title matching and 100+ manual merge rules.
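As a rough sketch of the merge step, manual rules can be checked first and fuzzy matching used as a fallback. The rule table, the 0.9 similarity threshold, and the helper names here are assumptions for illustration; the fuzzy matcher is Python's standard `difflib.SequenceMatcher`, which may differ from what the pipeline uses.

```python
from difflib import SequenceMatcher

# Hypothetical manual merge rules: normalized alias -> canonical title.
MERGE_RULES = {"ddia": "Designing Data-Intensive Applications"}


def normalize(title: str) -> str:
    """Lowercase, treat hyphens as spaces, and collapse whitespace."""
    return " ".join(title.lower().replace("-", " ").split())


def canonicalize(title: str, known: list, threshold: float = 0.9) -> str:
    """Map a raw title to a canonical one, or return it unchanged if new."""
    key = normalize(title)
    if key in MERGE_RULES:
        return MERGE_RULES[key]
    for canonical in known:
        ratio = SequenceMatcher(None, key, normalize(canonical)).ratio()
        if ratio >= threshold:
            return canonical
    return title


def deduplicate(raw_titles: list) -> list:
    """Collapse raw title candidates into a list of unique books."""
    unique = []
    for raw in raw_titles:
        canonical = canonicalize(raw, unique)
        if canonical not in unique:
            unique.append(canonical)
    return unique


books = deduplicate([
    "Designing Data-Intensive Applications",
    "DDIA",
    "designing data intensive applications",
    "Clean Code",
])
print(books)
```

Checking the manual rules before fuzzy matching matters for abbreviations like "DDIA", which share too few characters with the full title for a similarity score to catch.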

74,734 raw candidates · 43,617 after noise removal · 2,983 articles with valid mentions · 657 unique books

4. Metadata Enrichment

We fill in authors, publication years, and cover images using Google Books API and Open Library API, with manual verification for the top books.
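One plausible way to combine two metadata sources is to take each field from a primary source and fall back to the second when it is missing. The sketch below assumes both API responses have already been parsed into plain dicts; the field names, the `merge_metadata` and `coverage` helpers, and the sample records are all hypothetical.

```python
# Hypothetical field names for a parsed metadata record.
FIELDS = ("authors", "pub_year", "cover_url")


def merge_metadata(primary: dict, fallback: dict) -> dict:
    """Take each field from `primary`, falling back when it is missing."""
    return {f: primary.get(f) or fallback.get(f) for f in FIELDS}


def coverage(books: list, field: str) -> float:
    """Percentage of books with a non-empty value for `field`."""
    known = sum(1 for b in books if b.get(field))
    return round(100 * known / len(books), 1)


# Placeholder records standing in for two already-parsed API responses.
from_source_a = {"authors": ["Example Author"], "pub_year": None}
from_source_b = {"pub_year": 2008, "cover_url": None}

merged = merge_metadata(from_source_a, from_source_b)
print(merged)
print(coverage([merged, {"authors": None}], "authors"))
```

A per-field coverage figure like the one `coverage` returns is the kind of number the stage's quality metrics report.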

Authors known: 99.6% · Pub year known: 99.1% · Cover images: 99.6%

5. Scoring & Ranking

Each book gets a score based on how many articles mention it, how many different authors recommend it, and how recent the recommendations are.

Scoring Formula

score = (unique_article_mentions × 1.0)
      + (unique_authors × 1.5)
      + (recency_boost × 0.8)
      - (duplicate_author_penalty)

A: Article mentions (×1.0). The number of distinct articles that recommend this book. More articles mean a stronger community signal.
U: Unique authors (×1.5). Weighted higher because diverse recommendations are more meaningful than one person mentioning a book repeatedly.
R: Recency boost (×0.8). Recent mentions (last 90 days) receive a bonus, so trending books surface naturally.
D: Duplicate penalty (−0.5). When the same author mentions a book in multiple articles, extra mentions are discounted to prevent gaming.
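The formula can be sketched in code under two stated assumptions that the text above does not spell out: that recency_boost counts distinct articles published in the last 90 days, and that the duplicate penalty is 0.5 for each article beyond an author's first. The `score_book` function and the mention tuple layout are hypothetical.

```python
from datetime import date, timedelta


def score_book(mentions: list, today: date) -> float:
    """Score one book from (article_id, author, published_date) tuples."""
    articles = {article_id for article_id, _, _ in mentions}
    authors = {author for _, author, _ in mentions}

    # Assumption: recency_boost = distinct articles from the last 90 days.
    cutoff = today - timedelta(days=90)
    recent = {a for a, _, published in mentions if published >= cutoff}

    # Assumption: each article beyond an author's first costs 0.5.
    by_author = {}
    for article_id, author, _ in mentions:
        by_author.setdefault(author, set()).add(article_id)
    extra = sum(len(ids) - 1 for ids in by_author.values())

    score = (
        len(articles) * 1.0
        + len(authors) * 1.5
        + len(recent) * 0.8
        - extra * 0.5
    )
    return round(score, 2)


mentions = [
    (1, "ana", date(2026, 3, 20)),  # recent article
    (2, "ana", date(2025, 1, 5)),   # second article by the same author
    (3, "bo", date(2024, 6, 1)),
]
score = score_book(mentions, today=date(2026, 4, 1))
print(score)
```

With these inputs the terms are 3 articles, 2 authors, 1 recent article, and 1 extra mention, so the repeat author adds less than a new author would.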

Accuracy Metrics

We measure three key quality metrics using sample-based manual review.

Book Precision: 99.7%
Of 657 ranked books, 99.7% are verified real published books with correct metadata.

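Article Recall: 99.0%
In a 100-article sample drawn from articles not flagged as book articles, 99 contained no book recommendation and 1 did. The figure is the share of non-flagged articles that were correctly excluded.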

Extraction Recall: 97.6%
Of books present in detected articles, 97.6% were successfully extracted and counted.

How we measured these numbers

Book Precision: We sampled 98 books (every 7th entry of the score-sorted list) and manually verified that each is a real published book with the correct title and author. We found 2 non-books and 10 minor issues. After a full audit of all 684 books, we removed 27 non-book entries and corrected 345 title/author issues.
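Taking every 7th entry from a sorted list is systematic sampling. A minimal sketch, assuming the sample starts at the first (top-scored) entry, with placeholder book names and a hypothetical `systematic_sample` helper:

```python
def systematic_sample(items: list, step: int, start: int = 0) -> list:
    """Return every `step`-th item, beginning at index `start`."""
    return items[start::step]


# Placeholder for the 684 books in score order before the audit.
ranked = [f"book-{i}" for i in range(684)]
sample = systematic_sample(ranked, step=7)
print(len(sample))
```

Starting at index 0 with a step of 7 over 684 entries yields 98 books, spread evenly from the top of the ranking to the bottom.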

Article Recall: We sampled 100 articles from the 1.26M non-detected articles (stratified by engagement: 25 each from 0–4, 5–19, 20–99, 100+ reactions). Only 1 article was a clear miss: an article summarizing Fowler's Patterns of Enterprise Application Architecture (PoEAA) without using typical recommendation language.
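A stratified draw like this one can be sketched as below. The four bins match the reaction ranges quoted above; the `reactions` field name, the fixed seed, and the synthetic corpus are assumptions for illustration.

```python
import random

# Reaction-count bins from the text; None means no upper bound.
BINS = [(0, 4), (5, 19), (20, 99), (100, None)]


def in_bin(reactions: int, low: int, high) -> bool:
    return reactions >= low and (high is None or reactions <= high)


def stratified_sample(articles: list, per_bin: int = 25, seed: int = 0) -> list:
    """Draw up to `per_bin` random articles from each engagement bin."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sample = []
    for low, high in BINS:
        stratum = [a for a in articles if in_bin(a["reactions"], low, high)]
        sample.extend(rng.sample(stratum, min(per_bin, len(stratum))))
    return sample


# Synthetic stand-in for the non-detected articles.
corpus = [{"id": i, "reactions": i % 150} for i in range(3000)]
picked = stratified_sample(corpus)
print(len(picked))
```

Sampling equal counts per bin over-represents high-engagement articles relative to their share of the corpus, which is useful when misses among popular articles matter most.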

Extraction Recall: We sampled 20 detected articles and compared extractor output against all books actually present in the text. Of 41 total books, 40 were found. The one miss was a book title mentioned in prose without any link or formatting.
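The headline percentages follow directly from the sample counts quoted above. A small worked check, using a hypothetical `rate` helper:

```python
def rate(hits: int, total: int) -> float:
    """Percentage of hits, rounded to one decimal place."""
    return round(100 * hits / total, 1)


# 40 of the 41 books present in 20 detected articles were extracted.
extraction_recall = rate(40, 41)

# 99 of the 100 sampled non-detected articles held no book recommendation.
non_detected_clean = rate(99, 100)

print(extraction_recall, non_detected_clean)
```

Both figures rest on small samples (41 books and 100 articles), so they are best read as estimates rather than exact rates.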

Quality Assurance

Full manual audit — All 684 books were individually reviewed. 27 non-book entries (courses, novels, duplicates) were removed. 345 title/author corrections were applied.
Non-book filtering — Physical products (keyboards, monitors), GitHub repositories, video courses, and spam are excluded using 70+ filter patterns.
Deduplication — 100+ manual merge rules handle common variants (e.g., "DDIA" and "Designing Data-Intensive Applications" are the same book).
Source transparency — Every book's detail page links to the actual articles that recommended it, so you can verify the data yourself.

Known Limitations

DEV.to only — We currently scan DEV.to articles. Hashnode, Medium, and personal blogs are not yet included. This means some books recommended on other platforms may be underrepresented.
Pattern-based detection — Our detector uses regular expressions and heuristics, not AI/LLM. Books mentioned without any structural signal (no link, no bold, no "I recommend") may be missed.
English articles only — Non-English articles may contain book recommendations that our patterns don't capture well.
Popularity bias — Widely-known books get mentioned more often. Excellent niche books with smaller audiences may rank lower than their quality deserves.

How We Improve

We continuously refine our pipeline:

  • New detection patterns are added as we discover missed book formats
  • Deduplication rules grow as new book aliases appear
  • Accuracy metrics are re-measured with each major pipeline update
  • Additional data sources (Hashnode, etc.) are planned

Found an issue with our data?

If you notice a wrong book, missing title, or data error, please let us know at hello@geekpeak.dev