AI Catalog & Listing Engine
Ecommerce companies sit on catalogs full of fragmented product data — different SKUs for the same product across customers, missing specs, inconsistent formatting, no marketplace-ready listings. I built a three-stage pipeline that transforms raw SKU data into a unified catalog with publish-ready listings.
Stage 1: Product classification and attributes
The foundation. A 16-step pipeline classifies 192K SKUs into a unified merchandising hierarchy (department → category → subcategory) and extracts 47 attribute types (brand, model, storage, color, CPU, screen size, and more).
Classification uses a tiered approach:
- Amazon catalog data — Highest confidence. If a SKU maps to an ASIN, use Amazon’s own hierarchy
- Structured SKU decoding — Parse encoded SKUs like `PH-AP-IP14PM-256-GD` into brand, model, storage, and color using lookup tables
- TF-IDF keyword scoring — The primary engine. A pre-built keyword table (25,000 keywords derived from classified titles) scores unclassified SKUs against all subcategories. Keywords unique to one subcategory get high weight; generic terms get low weight. This tier handles 84% of all classifications (sketched below)
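Here is a minimal sketch of the keyword-scoring idea. The function names and toy data are illustrative rather than the production table, which is built inside the SQL pipeline; this just shows how subcategory-unique keywords end up dominating the score.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(title: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", title.lower()))

def build_keyword_weights(classified_titles: list[tuple[str, str]]) -> dict:
    """Derive keyword -> {subcategory: weight} from already-classified titles.
    Keywords unique to one subcategory get high weight (IDF-style);
    generic terms appearing across many subcategories get low weight."""
    per_keyword = defaultdict(Counter)
    for title, subcat in classified_titles:
        for token in tokenize(title):
            per_keyword[token][subcat] += 1

    n_subcats = len({s for _, s in classified_titles})
    weights = defaultdict(dict)
    for token, counts in per_keyword.items():
        # Fewer subcategories containing the token -> higher weight.
        idf = math.log(n_subcats / len(counts)) + 1.0
        for subcat, tf in counts.items():
            weights[token][subcat] = tf * idf
    return weights

def classify(title: str, weights: dict) -> tuple[str, float] | None:
    """Score an unclassified SKU title against every subcategory."""
    scores = Counter()
    for token in tokenize(title):
        for subcat, w in weights.get(token, {}).items():
            scores[subcat] += w
    return scores.most_common(1)[0] if scores else None
```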
I used AI to identify and build 600+ classification patterns across 250+ subcategories, define which attributes constitute product identity vs. variants for each category, and architect the pipeline itself. The patterns were converted to seed keywords that anchor the TF-IDF scoring.
The system deduplicates 192K SKUs down to 108K unique products, achieves 94% hierarchy coverage, and cross-pollinates attributes across sibling SKUs — if one SKU of an iPhone 14 has a memory spec extracted, all SKUs for that product inherit it.
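The cross-pollination step is easiest to see in miniature. This pandas version is illustrative (the production step runs as SQL in BigQuery); it fills missing attributes across SKUs that resolve to the same product identity:

```python
import pandas as pd

# Two raw SKUs that deduplicate to the same unique product.
skus = pd.DataFrame({
    "sku":        ["PH-AP-IP14-128-BK", "APPLE-IP14-128-BLK"],
    "product_id": ["apple-iphone-14-128gb-black"] * 2,
    "storage":    ["128GB", None],    # extracted from the first SKU only
    "color":      [None, "Black"],    # extracted from the second SKU only
})

# Within each product group, propagate known attribute values to
# sibling SKUs that are missing them.
attrs = ["storage", "color"]
skus[attrs] = skus.groupby("product_id")[attrs].transform(
    lambda col: col.ffill().bfill()
)
```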
Stage 2: Catalog enrichment
The classification handles the bulk. This stage handles the exceptions. Upload the remaining products, and the system searches the web for specifications, extracts structured attributes using Claude, and standardizes everything through a multi-pass quality pipeline:
- Preparation — Categorizes each product and generates optimized search queries
- Extraction — Web search + structured data extraction
- Standardization — Normalizes casing, units, dimensions, and technical terms
- Validation — A separate LLM pass catches logical inconsistencies like swapped dimensions or impossible weights
- Correction — Rule-based fixes for common conversion errors
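As a condensed sketch of the validation pass, here is roughly what a second LLM call checking for swapped dimensions might look like. The prompt, model id, and JSON shape are illustrative, and a production version would parse the response more defensively:

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

VALIDATION_PROMPT = """Review these product attributes for logical errors:
swapped dimensions (e.g. depth larger than height for a laptop),
impossible weights, or units that don't match the value.
Return JSON: {{"issues": [{{"field": ..., "problem": ..., "suggestion": ...}}]}}

Attributes:
{attributes}"""

def validate_attributes(attributes: dict) -> list[dict]:
    """Second LLM pass that flags inconsistencies the extractor missed."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",   # illustrative model id
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": VALIDATION_PROMPT.format(
                attributes=json.dumps(attributes, indent=2)
            ),
        }],
    )
    # Assumes the model returns bare JSON; production parses defensively.
    return json.loads(response.content[0].text)["issues"]
```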
The attribute schema is pulled dynamically from the data warehouse, so the system stays in sync with the catalog structure without code changes.
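A sketch of what that dynamic schema pull might look like; the table and column names here are hypothetical:

```python
from google.cloud import bigquery

def load_attribute_schema(client: bigquery.Client) -> dict[str, list[str]]:
    """Fetch per-category attribute definitions from the warehouse so the
    enrichment prompts always match the current catalog structure."""
    query = """
        SELECT category, attribute_name
        FROM `catalog.attribute_schema`   -- hypothetical table
        ORDER BY category, attribute_name
    """
    schema: dict[str, list[str]] = {}
    for row in client.query(query).result():
        schema.setdefault(row.category, []).append(row.attribute_name)
    return schema
```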
Stage 3: Listing generation
The final stage takes enriched product data and generates marketplace-ready listings. Each marketplace has different rules — Amazon allows 200-character titles, eBay caps at 80, Walmart requires specific prefixes. Bullet point counts, description lengths, and condition labeling all vary.
A configuration-driven architecture handles the complexity. Marketplace specs and product category specs are defined in JSON. Prompt templates are composed dynamically based on the selected marketplace and category. Upload a batch of products, select the target marketplace, and download ready-to-publish listings.
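To give a feel for the shape of that configuration, here is a toy version. The field names and the Walmart values are placeholders of my own; only the Amazon and eBay title limits come from the constraints above:

```python
import json

# Marketplace specs live in JSON, one entry per channel (illustrative values).
MARKETPLACE_SPECS = json.loads("""
{
  "amazon":  {"title_max": 200, "bullets": 5, "title_prefix": ""},
  "ebay":    {"title_max": 80,  "bullets": 0, "title_prefix": ""},
  "walmart": {"title_max": 100, "bullets": 3, "title_prefix": "{brand} "}
}
""")

def compose_prompt(product: dict, marketplace: str, category_spec: dict) -> str:
    """Assemble a listing-generation prompt from marketplace + category rules."""
    spec = MARKETPLACE_SPECS[marketplace]
    rules = [f"Title: at most {spec['title_max']} characters."]
    if spec["title_prefix"]:
        rules.append(
            f"Title must start with {spec['title_prefix'].format(**product)!r}."
        )
    if spec["bullets"]:
        rules.append(f"Write exactly {spec['bullets']} bullet points.")
    rules.append(
        f"Cover these {category_spec['name']} attributes: "
        + ", ".join(category_spec["required_attributes"])
    )
    return (
        f"Write a {marketplace} listing for this product.\n"
        + "\n".join(rules)
        + f"\n\nProduct data: {json.dumps(product)}"
    )
```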
The full picture
Raw SKU data enters one end. Classified, deduplicated, attribute-rich products with marketplace-optimized listings come out the other. A workflow that previously required hours of manual research and copywriting per product now runs at catalog scale.
Tech stack
Python, BigQuery, Claude API, Streamlit, Google Cloud. The classification pipeline runs as orchestrated SQL steps. The enrichment and listing tools support parallel processing with rate limiting and retry logic for production throughput.
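As a rough sketch of that throughput pattern, a semaphore bounds concurrency while exponential backoff absorbs transient API errors. The worker function and error class below are stand-ins for the real enrichment call:

```python
import asyncio
import random

class TransientAPIError(Exception):
    """Stand-in for a retryable failure (rate limit, overload)."""

async def enrich_product(product: dict) -> dict:
    """Placeholder for the real web-search + Claude extraction call."""
    await asyncio.sleep(0.1)
    return {**product, "enriched": True}

MAX_CONCURRENCY = 8   # parallel requests in flight
MAX_RETRIES = 4

async def enrich_with_retry(product: dict, sem: asyncio.Semaphore) -> dict:
    """Run one enrichment call with bounded concurrency and retry on failure."""
    async with sem:
        for attempt in range(MAX_RETRIES):
            try:
                return await enrich_product(product)
            except TransientAPIError:
                # Exponential backoff with jitter before the next attempt.
                await asyncio.sleep(2 ** attempt + random.random())
        raise RuntimeError(f"gave up on {product['sku']} after {MAX_RETRIES} tries")

async def run_batch(products: list[dict]) -> list[dict]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    return await asyncio.gather(*(enrich_with_retry(p, sem) for p in products))
```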