AI4Bhārat’s Post

🚨🚨🚨 Excited to share our latest work: "Pralekha: An Indic Document Alignment Evaluation Benchmark", focusing on document-level alignment across 11 Indic languages.

🔍 What problem are we solving?
Document alignment, identifying semantically equivalent text across languages, is critical for NLP tasks like machine translation. Existing sentence-based methods often fall short for document-level challenges, especially in Indic languages.

🌟 Introducing PRALEKHA
PRALEKHA is a large-scale benchmark for evaluating document-level alignment techniques. It includes 2M+ documents, covering 11 Indic languages and English, with a balanced mix of aligned and unaligned pairs.

💡 Our Contributions:
1) Benchmark dataset: Robust evaluation of document alignment techniques.
2) Novel alignment approach: Document Alignment Coefficient (DAC)

📊 Evaluation & Results:
We analyzed embedding models, granularity levels (sentence, chunk, document), and alignment algorithms across noisy and clean data scenarios. DAC outperformed baseline methods, achieving 20–30% higher precision and 15–20% higher F1 scores.

PRALEKHA enables evaluation of cross-lingual document alignment and lays the groundwork for mining high-quality parallel documents to power long-context cross-lingual NMT.

Paper 📄: https://lnkd.in/g_xWqkgm
Code 💻: https://lnkd.in/gnvsk6yq
Hugging Face 🤗: https://lnkd.in/g-9dFV2x

Work done by: Sanjay Suryanarayanan, Haiyue Song, Mohammed Safi Ur Rahman Khan, Mitesh Khapra, Anoop Kunchukuttan, Raj Dabre

#AI4Bharat #NLP #AI #IndianLanguages #Benchmark #Evaluation #MachineTranslation #Multilingual #ParallelCorpora #Dataset
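For readers wondering how precision and F1 are typically scored for document alignment, here is a minimal sketch. It is not taken from the paper or its code: it assumes an aligner outputs (source document, target document) pairs and that a prediction counts as correct only when it exactly matches a gold pair. The function name `alignment_scores` and the document IDs are hypothetical, chosen for illustration only.

```python
def alignment_scores(predicted, gold):
    """Precision, recall, and F1 over sets of (source_id, target_id) pairs.

    Assumption: a predicted pair is correct only if it exactly matches
    a gold pair. This is a generic scoring convention, not necessarily
    the one used in the PRALEKHA evaluation.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical document IDs for an English-Hindi pair set.
gold = {("en_1", "hi_1"), ("en_2", "hi_2"), ("en_3", "hi_3")}
predicted = {("en_1", "hi_1"), ("en_2", "hi_7")}

precision, recall, f1 = alignment_scores(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case one of the two predicted pairs is correct and one of the three gold pairs is recovered, so precision is higher than recall. A method that predicts fewer but more reliable pairs raises precision, which is the kind of trade-off the reported precision and F1 gains summarize.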

Pralekha: An Indic Document Alignment Evaluation Benchmark


arxiv.org
