AI Catalog & Listing Engine
Ecommerce companies sit on catalogs full of fragmented product data — different SKUs for the same product across customers, missing specs, inconsistent formatting, no marketplace-ready listings. I built a three-stage pipeline that transforms raw SKU data into a unified catalog with publish-ready listings.
Stage 1: Product classification and attributes
The foundation. A 16-step pipeline classifies 192K SKUs into a unified merchandising hierarchy (department → category → subcategory) and extracts 47 attribute types (brand, model, storage, color, CPU, screen size, and more).
Classification uses a tiered approach:
- Amazon catalog data — Highest confidence. If a SKU maps to an ASIN, use Amazon’s own hierarchy
- Structured SKU decoding — Parse encoded SKUs like `PH-AP-IP14PM-256-GD` into brand, model, storage, and color using lookup tables
- TF-IDF keyword scoring — The primary engine. A pre-built keyword table (25,000 keywords derived from classified titles) scores unclassified SKUs against all subcategories. Keywords unique to one subcategory get high weight; generic terms get low weight. This tier handles 84% of all classifications (sketched below)
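Here is a minimal sketch of the keyword-scoring idea. The function names and toy data are illustrative rather than the production table, which is built inside the SQL pipeline; this just shows how subcategory-unique keywords end up dominating the score.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(title: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", title.lower()))

def build_keyword_weights(classified_titles: list[tuple[str, str]]) -> dict:
    """Derive keyword -> {subcategory: weight} from already-classified titles.
    Keywords unique to one subcategory get high weight (IDF-style);
    generic terms appearing across many subcategories get low weight."""
    per_keyword = defaultdict(Counter)
    for title, subcat in classified_titles:
        for token in tokenize(title):
            per_keyword[token][subcat] += 1

    n_subcats = len({s for _, s in classified_titles})
    weights = defaultdict(dict)
    for token, counts in per_keyword.items():
        # Fewer subcategories containing the token -> higher weight.
        idf = math.log(n_subcats / len(counts)) + 1.0
        for subcat, tf in counts.items():
            weights[token][subcat] = tf * idf
    return weights

def classify(title: str, weights: dict) -> tuple[str, float] | None:
    """Score an unclassified SKU title against every subcategory."""
    scores = Counter()
    for token in tokenize(title):
        for subcat, w in weights.get(token, {}).items():
            scores[subcat] += w
    return scores.most_common(1)[0] if scores else None
```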
I used AI to identify and build 600+ classification patterns across 250+ subcategories, define which attributes constitute product identity vs. variants for each category, and architect the pipeline itself. The patterns were converted to seed keywords that anchor the TF-IDF scoring.
The system deduplicates 192K SKUs down to 108K unique products, achieves 94% hierarchy coverage, and cross-pollinates attributes across sibling SKUs — if one SKU of an iPhone 14 has a memory spec extracted, all SKUs for that product inherit it.
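The cross-pollination step is easiest to see in miniature. This pandas version is illustrative (the production step runs as SQL in BigQuery); it fills missing attributes across SKUs that resolve to the same product identity:

```python
import pandas as pd

# Two raw SKUs that deduplicate to the same unique product.
skus = pd.DataFrame({
    "sku":        ["PH-AP-IP14-128-BK", "APPLE-IP14-128-BLK"],
    "product_id": ["apple-iphone-14-128gb-black"] * 2,
    "storage":    ["128GB", None],    # extracted from the first SKU only
    "color":      [None, "Black"],    # extracted from the second SKU only
})

# Within each product group, propagate known attribute values to
# sibling SKUs that are missing them.
attrs = ["storage", "color"]
skus[attrs] = skus.groupby("product_id")[attrs].transform(
    lambda col: col.ffill().bfill()
)
```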
Stage 2: Catalog enrichment
The classification handles the bulk. This stage handles the exceptions. Upload the remaining products, and the system searches the web for specifications, extracts structured attributes using Claude, and standardizes everything through a multi-pass quality pipeline:
- Preparation — Categorizes each product and generates optimized search queries
- Extraction — Web search + structured data extraction
- Standardization — Normalizes casing, units, dimensions, and technical terms
- Validation — A separate LLM pass catches logical inconsistencies like swapped dimensions or impossible weights
- Correction — Rule-based fixes for common conversion errors
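As a condensed sketch of the validation pass, here is roughly what a second LLM call checking for swapped dimensions might look like. The prompt, model id, and JSON shape are illustrative, and a production version would parse the response more defensively:

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

VALIDATION_PROMPT = """Review these product attributes for logical errors:
swapped dimensions (e.g. depth larger than height for a laptop),
impossible weights, or units that don't match the value.
Return JSON: {{"issues": [{{"field": ..., "problem": ..., "suggestion": ...}}]}}

Attributes:
{attributes}"""

def validate_attributes(attributes: dict) -> list[dict]:
    """Second LLM pass that flags inconsistencies the extractor missed."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",   # illustrative model id
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": VALIDATION_PROMPT.format(
                attributes=json.dumps(attributes, indent=2)
            ),
        }],
    )
    # Assumes the model returns bare JSON; production parses defensively.
    return json.loads(response.content[0].text)["issues"]
```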
The attribute schema is pulled dynamically from the data warehouse, so the system stays in sync with the catalog structure without code changes.
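A sketch of what that dynamic schema pull might look like; the table and column names here are hypothetical:

```python
from google.cloud import bigquery

def load_attribute_schema(client: bigquery.Client) -> dict[str, list[str]]:
    """Fetch per-category attribute definitions from the warehouse so the
    enrichment prompts always match the current catalog structure."""
    query = """
        SELECT category, attribute_name
        FROM `catalog.attribute_schema`   -- hypothetical table
        ORDER BY category, attribute_name
    """
    schema: dict[str, list[str]] = {}
    for row in client.query(query).result():
        schema.setdefault(row.category, []).append(row.attribute_name)
    return schema
```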
Stage 3: Listing generation
The final stage takes enriched product data and generates marketplace-ready listings. Each marketplace has different rules — Amazon allows 200-character titles, eBay caps at 80, Walmart requires specific prefixes. Bullet point counts, description lengths, and condition labeling all vary.
A configuration-driven architecture handles the complexity. Marketplace specs and product category specs are defined in JSON. Prompt templates are composed dynamically based on the selected marketplace and category. Upload a batch of products, select the target marketplace, and download ready-to-publish listings.
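To give a feel for the shape of that configuration, here is a toy version. The field names and the Walmart values are placeholders of my own; only the Amazon and eBay title limits come from the constraints above:

```python
import json

# Marketplace specs live in JSON, one entry per channel (illustrative values).
MARKETPLACE_SPECS = json.loads("""
{
  "amazon":  {"title_max": 200, "bullets": 5, "title_prefix": ""},
  "ebay":    {"title_max": 80,  "bullets": 0, "title_prefix": ""},
  "walmart": {"title_max": 100, "bullets": 3, "title_prefix": "{brand} "}
}
""")

def compose_prompt(product: dict, marketplace: str, category_spec: dict) -> str:
    """Assemble a listing-generation prompt from marketplace + category rules."""
    spec = MARKETPLACE_SPECS[marketplace]
    rules = [f"Title: at most {spec['title_max']} characters."]
    if spec["title_prefix"]:
        rules.append(
            f"Title must start with {spec['title_prefix'].format(**product)!r}."
        )
    if spec["bullets"]:
        rules.append(f"Write exactly {spec['bullets']} bullet points.")
    rules.append(
        f"Cover these {category_spec['name']} attributes: "
        + ", ".join(category_spec["required_attributes"])
    )
    return (
        f"Write a {marketplace} listing for this product.\n"
        + "\n".join(rules)
        + f"\n\nProduct data: {json.dumps(product)}"
    )
```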
The full picture
Raw SKU data enters one end. Classified, deduplicated, attribute-rich products with marketplace-optimized listings come out the other. A workflow that previously required hours of manual research and copywriting per product now runs at catalog scale.
Tech stack
Python, BigQuery, Claude API, Streamlit, Google Cloud. The classification pipeline runs as orchestrated SQL steps. The enrichment and listing tools support parallel processing with rate limiting and retry logic for production throughput.
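As a rough sketch of that throughput pattern, a semaphore bounds concurrency while exponential backoff absorbs transient API errors. The worker function and error class below are stand-ins for the real enrichment call:

```python
import asyncio
import random

class TransientAPIError(Exception):
    """Stand-in for a retryable failure (rate limit, overload)."""

async def enrich_product(product: dict) -> dict:
    """Placeholder for the real web-search + Claude extraction call."""
    await asyncio.sleep(0.1)
    return {**product, "enriched": True}

MAX_CONCURRENCY = 8   # parallel requests in flight
MAX_RETRIES = 4

async def enrich_with_retry(product: dict, sem: asyncio.Semaphore) -> dict:
    """Run one enrichment call with bounded concurrency and retry on failure."""
    async with sem:
        for attempt in range(MAX_RETRIES):
            try:
                return await enrich_product(product)
            except TransientAPIError:
                # Exponential backoff with jitter before the next attempt.
                await asyncio.sleep(2 ** attempt + random.random())
        raise RuntimeError(f"gave up on {product['sku']} after {MAX_RETRIES} tries")

async def run_batch(products: list[dict]) -> list[dict]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    return await asyncio.gather(*(enrich_with_retry(p, sem) for p in products))
```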