TamizhGen - SLM Architecture
1️⃣ Core Model & Processing
TamizhGen is a Small Language Model (SLM) built specifically for Tamil.
- Uses a GRU-based Seq2Seq architecture for AI-generated Tamil responses.
- Integrates multi-head attention to improve contextual understanding.
- Combines Retrieval-Augmented Generation (RAG) with FIASS so that responses are grounded in retrieved Tamil text.
- Input: English ✅
- Output: Tamil AI-generated response ✅ (Even if English is present, Tamil follows).
2️⃣ AI-Generated Tamil Responses (Not Just Translation!)
TamizhGen does not merely translate the input; it generates an original Tamil response.
Example:
- Input: "Tell me about Chennai"
- Process: AI generates a detailed Tamil response based on RAG + FIASS retrieval.
- Output: Tamil AI-generated text about Chennai (not just a translation).
3️⃣ AI-Generated Content Types
- ✅ Letters – Love letters, formal letters, business letters.
- ✅ Stories – Short stories, folktales, historical fiction.
- ✅ Poems – Romantic, inspirational, and classical Tamil poetry.
- ✅ Literature – Essays, philosophical texts, and classic Tamil-style writing.
- ✅ Hallucination Enabled – AI generates creative content even if exact data is unavailable.
4️⃣ Retrieval System (AI-Generated with FIASS)
The retrieval system also uses AI generation to enhance responses.
FIASS RAG stands for Fact-Informed Augmented Small-Scale Retrieval-Augmented Generation.
FIASS (inspired by FAISS, Facebook AI Similarity Search) is combined with RAG to retrieve high-relevance Tamil text.
Example Queries:
- Input: "Give me Bharathiyar's poem 'Achamillai Achamillai'."
- Process: AI retrieves and reconstructs the poem using FIASS & RAG.
- Output: Exact poem, AI-retrieved, with structured Tamil formatting.
- Input: "Tell me Thirukkural 1st Kural with meaning."
- Output: AI-retrieved exact Kural + AI-generated explanation.
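The retrieve-then-generate flow described above can be illustrated with a minimal sketch. This is not the FIASS implementation and it does not use the FAISS library; it is a brute-force cosine-similarity search over a tiny in-memory index, and the 3-dimensional vectors, the placeholder passages, and the names `cosine` and `retrieve` are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, index, k=2):
    """Return the k passages whose vectors are most similar to query_vec."""
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Hypothetical embeddings; a real system would obtain them from a trained encoder.
index = [
    ("passage about Chennai", [0.9, 0.1, 0.0]),
    ("passage about Thirukkural", [0.0, 0.2, 0.9]),
    ("passage about Bharathiyar", [0.1, 0.9, 0.2]),
]

# Hypothetical query embedding that lies close to the Thirukkural passage.
top = retrieve([0.0, 0.1, 1.0], index, k=2)

# The retrieved passages are prepended to the question before generation.
prompt = "Context: " + " | ".join(top) + "\nQuestion: Tell me Thirukkural 1st Kural with meaning."
```

A similarity-search library replaces the exhaustive loop in `retrieve` with an index structure that scales to large collections; the surrounding flow (embed the query, fetch the nearest passages, pass them to the generator) stays the same.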
5️⃣ Built-in Translation for Tamil Script
Reason: Many users type in English due to keyboard limitations.
Process: AI first generates a Tamil response → Converts it into Tamil script if needed.
Example:
- Input: "Tell me about Chennai"
- Process: AI generates Tamil response → Converts to Tamil script (if necessary).
6️⃣ Sequence Processing
- Processes input left to right, updating GRU hidden states at each token.
- Multi-head attention helps retain context across long sequences.
- RAG + FIASS retrieves contextually similar Tamil text before AI generation.
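The left-to-right hidden-state update can be sketched in plain Python. The block below uses one common formulation of the GRU (update gate `z`, reset gate `r`, candidate state `n`); the parameter names, the sizes, and the constant weights are assumptions for illustration, not TamizhGen's trained parameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def gru_step(x, h, p):
    """One GRU update: new hidden state from input x and previous state h."""
    z = [sigmoid(a + b + c) for a, b, c in zip(matvec(p["Wz"], x), matvec(p["Uz"], h), p["bz"])]
    r = [sigmoid(a + b + c) for a, b, c in zip(matvec(p["Wr"], x), matvec(p["Ur"], h), p["br"])]
    uh = matvec(p["Un"], h)
    # The reset gate scales how much of the previous state feeds the candidate.
    n = [math.tanh(a + ri * b + c) for a, ri, b, c in zip(matvec(p["Wn"], x), r, uh, p["bn"])]
    # The update gate interpolates between the candidate and the previous state.
    return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

def encode(token_vectors, p, hidden_size):
    """Process tokens left to right, collecting the hidden state after each one."""
    h = [0.0] * hidden_size
    states = []
    for x in token_vectors:
        h = gru_step(x, h, p)
        states.append(h)
    return states

# Hypothetical parameters: input size 2, hidden size 2, constant weights.
params = {name: [[0.1, 0.1], [0.1, 0.1]] for name in ("Wz", "Uz", "Wr", "Ur", "Wn", "Un")}
params.update({name: [0.0, 0.0] for name in ("bz", "br", "bn")})

states = encode([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], params, hidden_size=2)
```

In a Seq2Seq model with attention, the per-token states collected by `encode` are what the decoder attends over when producing each output token.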
7️⃣ Decoding, Training & Optimization
- Decoding: Beam Search for better sentence fluency.
- Loss Function: Cross-Entropy Loss (for token-level accuracy).
- Optimization: Adam optimizer + Learning Rate Scheduling.
- Training Method: Teacher forcing (the decoder is fed the ground-truth previous token during training) to improve token prediction accuracy.
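To show why beam search can produce better sequences than picking the single most likely token at each step, here is a minimal sketch over a toy next-token table. The table, the token names, and the function `beam_search` are hypothetical; in the actual model the probabilities would come from the decoder, and length normalization (omitted here) is often added.

```python
import math

def beam_search(next_probs, start, end, beam_width=2, max_len=5):
    """Keep the beam_width highest log-probability partial sequences at each step."""
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, prob in next_probs(seq).items():
                candidates.append((seq + [token], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            # Sequences that emitted the end token leave the beam.
            (finished if seq[-1] == end else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # sequences cut off at max_len
    return max(finished, key=lambda c: c[1])[0]

# Toy model: the next-token distribution depends only on the last token.
table = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "x": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
}

def toy_model(seq):
    return table[seq[-1]]

greedy = beam_search(toy_model, "<s>", "</s>", beam_width=1)
best = beam_search(toy_model, "<s>", "</s>", beam_width=2)
```

With a beam width of 1 the search is greedy and commits to `a` (probability 0.6), ending with a sequence of total probability 0.3. With a beam width of 2 it keeps `b` alive and finds the sequence through `z`, whose total probability is 0.36.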
8️⃣ Tokenization & Data Handling
- Custom Tokenizer: Converts raw input into token IDs.
- Handles: Sequence padding, truncation, and special tokens.
- Data Pipeline:
  - Custom Dataset Class – Efficient data loading, tokenization, and batching.
  - Dataloader – Manages batch processing and shuffling for training.
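The tokenization and batching steps listed above can be sketched as follows. This is an illustrative whitespace tokenizer, not TamizhGen's custom tokenizer; the special-token strings, the class `WhitespaceTokenizer`, and the helper `batches` are assumptions.

```python
import random

PAD, UNK, BOS, EOS = "<pad>", "<unk>", "<bos>", "<eos>"

class WhitespaceTokenizer:
    """Maps lower-cased, whitespace-separated words to integer IDs."""

    def __init__(self, corpus):
        # Special tokens take the first four IDs.
        self.vocab = {tok: i for i, tok in enumerate([PAD, UNK, BOS, EOS])}
        for sentence in corpus:
            for word in sentence.lower().split():
                self.vocab.setdefault(word, len(self.vocab))

    def encode(self, text, max_len):
        """Return exactly max_len IDs (max_len must be at least 2)."""
        ids = [self.vocab.get(w, self.vocab[UNK]) for w in text.lower().split()]
        # Truncate so that the start and end markers still fit.
        ids = [self.vocab[BOS]] + ids[: max_len - 2] + [self.vocab[EOS]]
        # Pad on the right up to max_len.
        return ids + [self.vocab[PAD]] * (max_len - len(ids))

def batches(texts, tokenizer, max_len, batch_size, shuffle=False, seed=0):
    """Yield lists of encoded sequences, optionally in shuffled order."""
    order = list(range(len(texts)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for i in range(0, len(order), batch_size):
        yield [tokenizer.encode(texts[j], max_len) for j in order[i:i + batch_size]]

tokenizer = WhitespaceTokenizer(["tell me about chennai"])
padded = tokenizer.encode("Tell me about Madurai", max_len=8)
truncated = tokenizer.encode("tell me about chennai", max_len=4)
all_batches = list(batches(["tell me", "about chennai", "tell"], tokenizer, 5, batch_size=2))
```

Because every sequence in a batch has the same length after padding and truncation, a batch can be stacked directly into a single tensor for training.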
9️⃣ Device Compatibility
Runs on both CPU and GPU (CUDA-enabled when available).
The best available hardware is selected automatically for optimized performance.
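Automatic device selection is commonly written as below. This sketch assumes the model is built on PyTorch, which the source does not state; the helper name `pick_device` is hypothetical, and the function falls back to the CPU when PyTorch is not installed.

```python
def pick_device():
    """Return "cuda" when a CUDA-capable GPU is usable, otherwise "cpu"."""
    try:
        import torch  # assumption: the model is implemented in PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
```

Under the PyTorch assumption, the returned string can be passed to `torch.device(...)` or to a module's `.to(...)` method so that the model and its input tensors live on the same device.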