---
datasets:
- PrompTart/PTT_advanced_en_ko
language:
- en
- ko
base_model:
- facebook/m2m100_418M
library_name: transformers
---

# M2M100 Fine-Tuned on Parenthetical Terminology Translation (PTT) Dataset

## Model Overview

This is an **M2M100** model fine-tuned on the [**Parenthetical Terminology Translation (PTT)**](https://arxiv.org/abs/2410.00683) dataset. [The PTT dataset](https://huggingface.co/datasets/PrompTart/PTT_advanced_en_ko) focuses on translating technical terms accurately by placing the original English term in parentheses right after its Korean translation, which improves clarity and precision in specialized fields. This fine-tuned model is optimized for handling technical terminology in the **Artificial Intelligence (AI)** domain.

## Example Usage

Here's how to use this fine-tuned model with the Hugging Face `transformers` library:

*<span style="color:red">Note:</span> `M2M100Tokenizer` depends on <span style="color:blue">sentencepiece</span>, so make sure to install it before running the example. To install `sentencepiece`, run `pip install sentencepiece`.*

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "PrompTart/m2m100_418M_PTT_en_ko"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

# Example input: two sentences separated by ". " so they can be split below
text = (
    "The model was fine-tuned using knowledge distillation techniques. "
    "The training dataset was created using a collaborative multi-agent framework powered by large language models."
)

# Tokenize one sentence per batch row (source language: English)
tokenizer.src_lang = "en"
encoded = tokenizer(text.split('. '), return_tensors="pt", padding=True)

# Force Korean as the target language and generate the translations
generated_tokens = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("ko"))
outputs = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print('\n'.join(outputs))
# => "이 모델은 지식 증류 기법(knowledge distillation techniques)을 사용하여 미세 조정되었습니다.
# 훈련 데이터셋(training dataset)은 대형 언어 모델(large language models)을 기반으로 한 협업 다중 에이전트 프레임워크(collaborative multi-agent framework)를 사용하여 생성되었습니다."
```

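The usage example splits the input into sentences before batching them through the tokenizer. As a minimal sketch of a slightly more careful splitter, the snippet below breaks text after sentence-ending punctuation and keeps that punctuation on each sentence. `split_sentences` is a hypothetical helper introduced here for illustration; it is not part of `transformers` or of this model's API, and it assumes plain English prose.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split text after '.', '!' or '?' followed by whitespace, keeping the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


sentences = split_sentences(
    "The model was fine-tuned using knowledge distillation techniques. "
    "The training dataset was created using a collaborative multi-agent "
    "framework powered by large language models."
)
print(len(sentences))  # 2
```

The resulting list can be passed to the tokenizer in place of `text.split('. ')`. This naive rule also splits after abbreviations such as "e.g.", so a dedicated sentence segmenter may be a better fit for real documents.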
## Limitations

- **Out-of-Domain Accuracy**: While the model generalizes to some extent, accuracy may vary in domains that were not part of the training set.
- **Incomplete Parenthetical Annotation**: Not all technical terms are consistently displayed in parentheses; in some cases, terms may be omitted or not annotated as expected.

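Because parenthetical annotation can be incomplete, it may help to check a translation for the terms you expected to see annotated. The following is a rough sketch, not part of the model or the paper's evaluation: `annotated_terms` and `missing_terms` are hypothetical helpers, and the regular expression assumes that annotated terms consist of Latin letters, digits, spaces, and hyphens.

```python
import re

# Assumption: an annotation looks like "(english term)" directly in the Korean text.
PAREN_TERM = re.compile(r"\(([A-Za-z][A-Za-z0-9 \-]*)\)")


def annotated_terms(translation: str) -> list[str]:
    """Return the English terms that appear in parentheses in a translation."""
    return PAREN_TERM.findall(translation)


def missing_terms(translation: str, expected_terms: list[str]) -> list[str]:
    """Return the expected terms that were not annotated (case-insensitive match)."""
    found = {term.lower() for term in annotated_terms(translation)}
    return [term for term in expected_terms if term.lower() not in found]


translation = "이 모델은 지식 증류 기법(knowledge distillation techniques)을 사용하여 미세 조정되었습니다."
print(annotated_terms(translation))
# ['knowledge distillation techniques']
print(missing_terms(translation, ["knowledge distillation techniques", "fine-tuning"]))
# ['fine-tuning']
```

A non-empty result from `missing_terms` only flags a term for manual review; it does not mean the translation itself is wrong.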
## Citation

If you use this model in your research, please cite the original dataset and paper:

```tex
@misc{myung2024efficienttechnicaltermtranslation,
      title={Efficient Technical Term Translation: A Knowledge Distillation Approach for Parenthetical Terminology Translation},
      author={Jiyoon Myung and Jihyeon Park and Jungki Son and Kyungro Lee and Joohyung Han},
      year={2024},
      eprint={2410.00683},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.00683},
}
```

## Contact

For questions or feedback, please contact [jiyoon0424@gmail.com](mailto:jiyoon0424@gmail.com).