KoBART-dialect

WIP

  • Will be released once the performance is satisfactory

Data preparation

  • First, download the files below from aihub and arrange them as follows.
.
├── data/
│   ├── 한국어 방언 발화 데이터(강원도)
│   │   ├── Training/[라벨]강원도_학습데이터_1.zip
│   │   └── Validation/[라벨]강원도_학습데이터_2.zip
│   ├── 한국어 방언 발화 데이터(경상도)
│   │   ├── Training/[라벨]경상도_학습데이터_1.zip
│   │   └── Validation/[라벨]경상도_학습데이터_2.zip
│   ├── 한국어 방언 발화 데이터(전라도)
│   │   ├── Training/[라벨]전라도_학습데이터_1.zip
│   │   └── Validation/[라벨]전라도_학습데이터_2.zip
│   ├── 한국어 방언 발화 데이터(제주도)
│   │   ├── Training/[라벨]제주도_학습데이터_1.zip
│   │   └── Validation/[라벨]제주도_학습데이터_3.zip
│   └── 한국어 방언 발화 데이터(충청도)
│       ├── Training/[라벨]충청도_학습데이터_1.zip
│       └── Validation/[라벨]충청도_학습데이터_2.zip
├── kodialect/..
├── .gitignore
├── LICENSE
└── README.md
  • Second, unzip the files.
$ sh unzip.sh
  • Third, run prepare_data.py.
    • The json data provided by aihub may itself contain errors. If the script fails, refer to the issue, fix the affected file by hand, and then run the Python script again.
$ python prepare_data.py
  • After these steps, the data folder should look like this:
.
├── data/
│   ├── chungcheongdo/..
│   ├── gangwondo/..
│   ├── gyeongsangdo/..
│   ├── jejudo/..
│   ├── jeollado/..
│   ├── style_classification/..
│   ├── style_transfer/..
│   ├── train_dialect.json
│   └── valid_dialect.json
├── kodialect/..
├── .gitignore
├── LICENSE
└── README.md
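
Since the third step notes that some of the json files from aihub may be malformed, it can help to locate them before running prepare_data.py. The following is a minimal sketch, not part of this repository: the helper name `find_invalid_json` is hypothetical, and it assumes the unzipped files are UTF-8 `.json` files somewhere under `data/`. It only reports files that fail to parse; fixing them is still a manual step.

```python
import json
from pathlib import Path


def find_invalid_json(root):
    """Return (path, error message) pairs for .json files under root that fail to parse."""
    problems = []
    for path in sorted(Path(root).rglob("*.json")):
        try:
            # Assumption: files are UTF-8; utf-8-sig also tolerates a leading byte-order mark.
            with open(path, encoding="utf-8-sig") as f:
                json.load(f)
        except (json.JSONDecodeError, UnicodeDecodeError) as err:
            problems.append((path, str(err)))
    return problems


if __name__ == "__main__":
    # Assumption: run from the repository root, where the data/ folder lives.
    for path, message in find_invalid_json("data"):
        print(f"{path}: {message}")
```

Each reported line gives the file path and the parser's message (including line and column for syntax errors), which should make it easier to find the spot to edit by hand.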

Citations

@inproceedings{lai-etal-2021-thank,
    title = "Thank you {BART}! Rewarding Pre-Trained Models Improves Formality Style Transfer",
    author = "Lai, Huiyuan and Toral, Antonio and Nissim, Malvina",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-short.62",
    doi = "10.18653/v1/2021.acl-short.62",
    pages = "484--494",
}