TamizhGen - SLM Architecture

1️⃣ Core Model & Processing

TamizhGen is a Small Language Model (SLM) built specifically for Tamil.

Uses a GRU-based Seq2Seq architecture to generate Tamil responses.
Integrates multi-head attention to improve contextual understanding.
Retrieval-Augmented Generation (RAG) + FIASS retrieves relevant Tamil text to ground response generation.
Input: English ✅
Output: Tamil AI-generated response ✅ (the response is in Tamil even when the input contains English).
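
The attention component named above can be illustrated with a minimal sketch. It assumes the attention runs between a decoder state (the query) and the encoder states (keys and values), which this document does not spell out. The learned projection matrices of a full multi-head layer are left out, and every function and variable name is hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single head."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Weighted average of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def multihead_attend(query, keys, values, num_heads):
    """Split each vector into num_heads chunks, attend per chunk, concatenate.

    Assumption: learned input/output projections are omitted for brevity.
    """
    head_dim = len(query) // num_heads
    out = []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        out.extend(attend(query[lo:hi],
                          [k[lo:hi] for k in keys],
                          [v[lo:hi] for v in values]))
    return out

# Toy data: one decoder state attending over three encoder states.
decoder_state = [1.0, 0.0, 0.0, 1.0]
encoder_states = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.5, 0.5, 0.5, 0.5],
]
context = multihead_attend(decoder_state, encoder_states, encoder_states, num_heads=2)
```

Each head sees only its own slice of the vectors, so different heads can weight the encoder states differently before the results are concatenated into one context vector.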

2️⃣ AI-Generated Tamil Responses (Not Just Translation!)

TamizhGen generates a Tamil response to the input rather than simply translating the input into Tamil.

Example:

Input: "Tell me about Chennai"
Process: AI generates a detailed Tamil response based on RAG + FIASS retrieval.
Output: Tamil AI-generated text about Chennai (not just a translation).

3️⃣ AI-Generated Content Types

✅ Letters – Love letters, formal letters, business letters.
✅ Stories – Short stories, folktales, historical fiction.
✅ Poems – Romantic, inspirational, and classical Tamil poetry.
✅ Literature – Essays, philosophical texts, and classic Tamil-style writing.
✅ Hallucination Enabled – AI generates creative content even if exact data is unavailable.

4️⃣ Retrieval System (AI-Generated with FIASS)

The retrieval system also uses AI generation to enhance responses.

FIASS RAG stands for Fact-Informed Augmented Small-Scale Retrieval-Augmented Generation.

FIASS (inspired by FAISS, Facebook AI Similarity Search) + RAG retrieves high-relevance Tamil text.

Example Queries:

Input: "Give me Bharathiyar's poem 'Achamillai Achamillai'."
Process: AI retrieves and reconstructs the poem using FIASS & RAG.
Output: Exact poem, AI-retrieved, with structured Tamil formatting.
Input: "Tell me Thirukkural 1st Kural with meaning."
Output: AI-retrieved exact Kural + AI-generated explanation.
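
The retrieval step behind these queries can be sketched as a similarity search over indexed passages. This is an illustration only. The word-count "embedding", the passage labels, and all function names are hypothetical stand-ins. The document does not describe how TamizhGen embeds or indexes its text, and a real system would use learned vectors with an approximate nearest-neighbour index.

```python
import math
import re
from collections import Counter

def embed(text):
    """Stand-in embedding: lowercase word counts (assumption, not the real encoder)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, corpus, k=1):
    """Return the k corpus entries most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

# Hypothetical index entries; real entries would point at stored Tamil passages.
corpus = [
    "thirukkural kural 1 with meaning",
    "bharathiyar poem achamillai achamillai",
    "chennai city history and culture",
]

kural_hit = retrieve("Tell me Thirukkural 1st Kural with meaning.", corpus)
poem_hit = retrieve("Give me Bharathiyar's poem 'Achamillai Achamillai'.", corpus)
```

In a RAG setup the retrieved passage would then be handed to the generator, which is how the output can combine an exact quoted text with a generated explanation.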

5️⃣ Built-in Translation for Tamil Script

Reason: Many users type in English because a Tamil keyboard is not available to them.

Process: AI first generates a Tamil response → Converts it into Tamil script if needed.

Example:

Input: "Tell me about Chennai"
Process: AI generates Tamil response → Converts to Tamil script (if necessary).
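
The "if necessary" check could be as simple as testing whether the response already contains Tamil script. The sketch below relies only on the Unicode Tamil block (U+0B80 to U+0BFF). The function names are hypothetical, and the conversion step itself is not shown because this document does not describe it.

```python
def contains_tamil_script(text):
    """True if any character falls in the Unicode Tamil block (U+0B80 to U+0BFF)."""
    return any("\u0b80" <= ch <= "\u0bff" for ch in text)

def needs_script_conversion(response):
    """Assumption: a response with no Tamil-block characters still needs converting."""
    return not contains_tamil_script(response)

already_tamil = needs_script_conversion("சென்னை")
latin_only = needs_script_conversion("Chennai")
```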

6️⃣ Sequence Processing

Processes input left to right, updating GRU hidden states at each token.
Multi-head attention helps preserve context across long sequences.
RAG + FIASS retrieves contextually similar Tamil text before AI generation.
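
The left-to-right hidden-state update can be sketched with a hand-written GRU cell. This is a minimal illustration, not TamizhGen's code. It uses the gate convention found in common deep learning libraries (new state = (1 - z) * candidate + z * previous state), omits biases, and uses random weights. The sizes, parameter names, and one-hot inputs are all hypothetical.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, p):
    """One GRU update for input x and previous hidden state h (biases omitted)."""
    # Update gate: how much of the old state to keep.
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    # Reset gate: how much of the old state feeds the candidate.
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    # Candidate state.
    n = [math.tanh(a + ri * b)
         for a, ri, b in zip(matvec(p["Wn"], x), r, matvec(p["Un"], h))]
    return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

def encode(token_vectors, p, hidden_size):
    """Process tokens left to right, collecting the hidden state after each one."""
    h = [0.0] * hidden_size
    states = []
    for x in token_vectors:
        h = gru_step(x, h, p)
        states.append(h)
    return states

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

input_size, hidden_size = 3, 4
params = {
    name: rand_matrix(hidden_size, input_size if name[0] == "W" else hidden_size)
    for name in ("Wz", "Uz", "Wr", "Ur", "Wn", "Un")
}
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # toy one-hot inputs
states = encode(tokens, params, hidden_size)
```

The list `states` holds one hidden vector per input token. These are the encoder states an attention layer would later attend over.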

7️⃣ Decoding, Training & Optimization

Decoding: Beam Search for better sentence fluency.
Loss Function: Cross-Entropy Loss (for token-level accuracy).
Optimization: Adam optimizer + Learning Rate Scheduling.
Training Method: Uses Teacher Forcing (the decoder is fed the ground-truth previous token during training) to improve token prediction accuracy.
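
To show why beam search can produce better sequences than greedy decoding, here is a minimal sketch over a toy next-token table. The table, the token names, and the `beam_search` signature are hypothetical. A real decoder would get its log-probabilities from the model, and this sketch applies no length normalization.

```python
import math

# Toy next-token probabilities keyed by the last token (hypothetical values).
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "</s>": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
}

def next_log_probs(seq):
    """Log-probabilities of each possible next token given the sequence so far."""
    return {tok: math.log(p) for tok, p in TABLE[seq[-1]].items()}

def beam_search(step_fn, start, end, beam_width=2, max_len=5):
    """Keep the beam_width best partial sequences at every step."""
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, lp in step_fn(seq).items():
                candidates.append((seq + [token], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            # Sequences that emitted the end token leave the beam.
            (finished if seq[-1] == end else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # keep anything still unfinished at max_len
    return max(finished, key=lambda c: c[1])[0]

beam_result = beam_search(next_log_probs, "<s>", "</s>", beam_width=2)
greedy_result = beam_search(next_log_probs, "<s>", "</s>", beam_width=1)
```

With a beam width of 1 the search is greedy and commits to "a" because it is the likelier first token. With a width of 2 it also keeps "b" alive and finds the sequence with the higher overall probability.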

8️⃣ Tokenization & Data Handling

Custom Tokenizer: Converts raw input into token IDs.
Handles: Sequence padding, truncation, and special tokens.
Data Pipeline:

Custom Dataset Class – Efficient data loading, tokenization, and batching.
Dataloader – Manages batch processing and shuffling for training.
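
The tokenization and batching steps above can be sketched as follows. This is a simplified illustration that assumes whitespace word-level tokens and four special tokens. The class name, the token strings, and the `batches` helper are hypothetical, the real tokenizer may work at a different granularity, and shuffling is left out.

```python
PAD, SOS, EOS, UNK = "<pad>", "<sos>", "<eos>", "<unk>"

class ToyTokenizer:
    """Word-level tokenizer with special tokens (assumption: whitespace splitting)."""

    def __init__(self, texts):
        # Special tokens take the first IDs; words are added in order of appearance.
        self.vocab = {tok: i for i, tok in enumerate([PAD, SOS, EOS, UNK])}
        for text in texts:
            for word in text.lower().split():
                self.vocab.setdefault(word, len(self.vocab))

    def encode(self, text, max_len):
        """Convert text to exactly max_len token IDs (max_len must be at least 2)."""
        ids = [self.vocab.get(w, self.vocab[UNK]) for w in text.lower().split()]
        # Truncate so the start and end tokens always fit.
        ids = [self.vocab[SOS]] + ids[: max_len - 2] + [self.vocab[EOS]]
        # Pad on the right up to max_len.
        return ids + [self.vocab[PAD]] * (max_len - len(ids))

def batches(texts, tokenizer, max_len, batch_size):
    """Yield fixed-size lists of encoded sequences."""
    for i in range(0, len(texts), batch_size):
        yield [tokenizer.encode(t, max_len) for t in texts[i : i + batch_size]]

tokenizer = ToyTokenizer(["tell me about chennai"])
padded = tokenizer.encode("Tell me about Chennai", max_len=8)
truncated = tokenizer.encode("tell me about chennai", max_len=4)
with_unknown = tokenizer.encode("tell us", max_len=5)
all_batches = list(batches(["tell me", "about chennai", "tell"], tokenizer, 4, 2))
```

Because every sequence comes out at the same length, a batch can be stacked into a single rectangular tensor before it reaches the model.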

9️⃣ Device Compatibility

Runs on both CPU and GPU (CUDA-enabled when available).

Automatically selects the GPU when CUDA is available and falls back to the CPU otherwise.
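
Hardware selection of this kind is often a one-line check. The sketch below assumes the project is built on PyTorch, which this document does not state. `select_device` is a hypothetical helper that falls back to the CPU when PyTorch is not installed.

```python
def select_device():
    """Return "cuda" when a CUDA GPU is usable, otherwise "cpu"."""
    try:
        import torch  # assumption: PyTorch is the underlying framework
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = select_device()
```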