Saudi Arabic Conversations Dataset
50,000 synthetic customer service conversations in authentic Saudi Arabic dialects. Built for fine-tuning Arabic LLMs, chatbot training, and NLP research.
{
"id": "uuid",
"status": "completed",
"metadata": { "dialect": "Najdi", "sector": "Fintech", "sentiment": "Angry", "topic": "Transfer Failed" },
"conversation": [ { "role": "user", "content": "..." }, { "role": "agent", "content": "..." } ],
"slug": "transfer-failed-a1b2c3"
}
Visitors can browse real completed conversations and download only the first 500 examples in public preview format.
Public preview currently exposes 100 completed conversations · download is capped at the first 500 rows.
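The sample record above can be checked in code before use. The following is a minimal sketch, assuming the preview file is JSONL with one record per line and that every record carries the same keys as the sample. The function name `parse_record` and the two key sets are illustrative, not part of any published tooling.

```python
import json

# Keys taken from the sample record on this page; treating them all as
# required is an assumption.
REQUIRED_KEYS = {"id", "status", "metadata", "conversation", "slug"}
METADATA_KEYS = {"dialect", "sector", "sentiment", "topic"}
KNOWN_ROLES = {"user", "agent"}


def parse_record(line: str) -> dict:
    """Parse one JSONL line and check it has the fields shown in the sample."""
    record = json.loads(line)
    missing = REQUIRED_KEYS - set(record)
    if missing:
        raise ValueError(f"missing top-level keys: {sorted(missing)}")
    missing_meta = METADATA_KEYS - set(record["metadata"])
    if missing_meta:
        raise ValueError(f"missing metadata keys: {sorted(missing_meta)}")
    for turn in record["conversation"]:
        if turn.get("role") not in KNOWN_ROLES:
            raise ValueError(f"unexpected role: {turn.get('role')!r}")
    return record


# A stand-in line that mirrors the sample record's shape.
sample_line = json.dumps({
    "id": "uuid",
    "status": "completed",
    "metadata": {"dialect": "Najdi", "sector": "Fintech",
                 "sentiment": "Angry", "topic": "Transfer Failed"},
    "conversation": [{"role": "user", "content": "..."},
                     {"role": "agent", "content": "..."}],
    "slug": "transfer-failed-a1b2c3",
})

record = parse_record(sample_line)
print(record["metadata"]["dialect"], len(record["conversation"]))  # Najdi 2
```

Failing loudly on a missing key is a deliberate choice here: a silently skipped field in training data is harder to notice later than an exception at load time.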
Each conversation includes rich metadata, authentic dialect markers, brand-specific vocabulary, and realistic resolution patterns — not template-generated filler.
OTP failures, unknown charges, bill disputes, missing orders, account locks, transfer errors, SIM replacements, appointment issues, and more.
Angry/Frustrated, Urgent/Panic, Confused/Inquiring, Neutral/Polite — each with distinct opening styles and escalation patterns.
Not every case gets magically resolved. 40% full resolution, 30% partial fix, 20% escalation, 10% unresolved — mirroring real call center data.
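To make the resolution mix concrete, the sketch below converts the stated shares into row counts for the full 50,000 conversations and measures how far a sample drifts from the target. The sample record on this page does not show a resolution field, so the label names (`full`, `partial`, `escalation`, `unresolved`) and the idea of one label per row are assumptions made for illustration.

```python
from collections import Counter

# Target shares in percent, as stated on this page. The label names are
# hypothetical; the dataset may name or store outcomes differently.
TARGET_SHARES = {"full": 40, "partial": 30, "escalation": 20, "unresolved": 10}


def expected_counts(total: int) -> dict:
    """Number of rows per outcome if the target shares hold exactly."""
    return {label: total * pct // 100 for label, pct in TARGET_SHARES.items()}


def share_drift(labels: list) -> dict:
    """Observed share minus target share, in percentage points, per outcome."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    total = len(labels)
    return {label: 100 * counts[label] / total - pct
            for label, pct in TARGET_SHARES.items()}


print(expected_counts(50_000))
# {'full': 20000, 'partial': 15000, 'escalation': 10000, 'unresolved': 5000}
```

A drift check like this is useful when working from a subset: a small random sample will rarely match 40/30/20/10 exactly, so compare against a tolerance rather than demanding equality.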
Fictional but authentic-sounding brands across fintech wallets, telecom providers, food delivery apps, and e-government platforms — each with sector-accurate capabilities and limitations.
Drop the JSONL directly into your training pipeline. Format-ready for Hugging Face, Axolotl, and LLaMA-Factory.
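Depending on trainer configuration, some chat fine-tuning setups expect a `messages` list with `user` and `assistant` roles rather than `user` and `agent`. The following is a minimal sketch of that conversion using only the standard library. The role mapping, the output key `messages`, and the choice to skip rows whose `status` is not `completed` are assumptions; check your trainer's documentation for the exact schema it wants.

```python
import json

# Assumed mapping from this dataset's roles to the common chat convention.
ROLE_MAP = {"user": "user", "agent": "assistant"}


def to_chat_example(record: dict) -> dict:
    """Turn one dataset record into a {"messages": [...]} training example."""
    messages = [
        {"role": ROLE_MAP[turn["role"]], "content": turn["content"]}
        for turn in record["conversation"]
    ]
    return {"messages": messages}


def convert_jsonl(lines):
    """Yield converted JSONL lines, skipping blanks and non-completed rows."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("status") != "completed":
            continue
        # ensure_ascii=False keeps Arabic text readable in the output file.
        yield json.dumps(to_chat_example(record), ensure_ascii=False)
```

Typical usage would read the downloaded file line by line, pass the file object to `convert_jsonl`, and write each yielded string as one line of a new JSONL file.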
Build Saudi customer service bots that actually sound local. Real dialect vocabulary, not translated Modern Standard Arabic (MSA).
Sentiment analysis, dialect classification, named-entity extraction. Labeled metadata included per row.
We don't just generate — we validate. Every conversation goes through a multi-layer quality gate before it enters the dataset.
Levantine, Egyptian, and Maghrebi contamination is auto-rejected. Only authentic Saudi vocabulary passes.
18 Saudi brands with enforced capability rules. Agents can't offer services their brand doesn't provide.
Template phrases like "هل يمكنني مساعدتك" ("Can I help you?") are banned. Every agent sounds like a real Saudi CSR.
Not every case gets a magic fix. The system enforces realistic escalations, partial fixes, and honest limitations.
Dialect markers are frequency-capped. No conversation uses يا خوي ("my brother") five times. That would be caricature, not data.
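A frequency cap of this kind can be approximated by counting marker occurrences per conversation. This is a rough sketch, not the pipeline's actual check: the page does not state the real cap or the full marker list, so the cap of 2 and the single-entry `MARKER_CAPS` table are hypothetical.

```python
# Hypothetical cap table: marker phrase -> maximum uses per conversation.
# Only the marker itself comes from this page; the value 2 is an assumption.
MARKER_CAPS = {"يا خوي": 2}


def marker_counts(conversation: list, markers) -> dict:
    """Count each marker across all turns, one turn at a time."""
    return {
        marker: sum(turn["content"].count(marker) for turn in conversation)
        for marker in markers
    }


def over_cap(conversation: list, caps: dict = MARKER_CAPS) -> list:
    """Return the markers used more often than their cap allows."""
    counts = marker_counts(conversation, caps)
    return [marker for marker, n in counts.items() if n > caps[marker]]


# Made-up test strings, not rows from the dataset.
caricature = [{"role": "agent", "content": "يا خوي " * 5}]
natural = [{"role": "agent", "content": "يا خوي"}]
print(len(over_cap(caricature)), len(over_cap(natural)))  # 1 0
```

Counting per turn rather than over the joined text avoids a false match when a phrase would only appear by gluing the end of one turn to the start of the next.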
Turn order, turn count, verification flow, and brand mention — all validated before a row is marked complete.
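Structural checks like these are straightforward to express as a function that returns a list of problems. The sketch below assumes turns alternate strictly, starting with the user, that the turn count must fall within hypothetical bounds, and that the brand name must appear in at least one agent turn. None of these exact rules are given on this page, and the verification-flow check is omitted because its rules are not described.

```python
def validate_structure(conversation: list, brand: str,
                       min_turns: int = 4, max_turns: int = 20) -> list:
    """Return a list of structural problems; an empty list means the row passes.

    min_turns and max_turns are hypothetical defaults, not documented limits.
    """
    problems = []

    # Turn count: must sit inside the assumed bounds.
    if not (min_turns <= len(conversation) <= max_turns):
        problems.append(f"turn count {len(conversation)} out of range")

    # Turn order: assumed strict user/agent alternation, user first.
    for i, turn in enumerate(conversation):
        expected = "user" if i % 2 == 0 else "agent"
        if turn["role"] != expected:
            problems.append(f"turn {i}: expected {expected}, got {turn['role']}")
            break  # report only the first ordering error

    # Brand mention: the agent should name the brand at least once.
    agent_text = [t["content"] for t in conversation if t["role"] == "agent"]
    if not any(brand in text for text in agent_text):
        problems.append("brand never mentioned by agent")

    return problems
```

Returning every problem found, instead of stopping at the first failed gate, makes rejected rows easier to debug in bulk.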
Message us on WhatsApp — we'll confirm and send the file directly.