m-1-x models 🔰 (Seq2Seq with BART)
m-1-x versions are primarily meant as a demonstration, or a pilot, of the tools I'll be building. The "1" means that the architecture does not change from that of a vanilla BART. This family of models does not regard idioms as a single entity.
m-1-1
The very first baseline of Idiomify. This model is trained on only the first 146 entries of the PIE dataset.
m-1-2
The scaled-up version of the previous model. No significant change has been made; m-1-2 is simply trained on all entries of the PIE dataset (train=0.8). This is also the first version that is deployed to the web via Streamlit & Hugging Face.
m-1-3
You have to search every single word to see where the change is!
This is rather inconvenient. We need some way of signalling to the user "here is the part that has been changed". m-1-3 is a new version for doing exactly that. It is trained on the same dataset as the previous version, but two special tokens are now added before and after idioms: <idiom> & </idiom>.
- d-1-3: PIE dataset - annotate the idioms with special tokens and add their definitions to idioms artifacts #5
- t-1-1: saving a pre-trained BartTokenizer with the special tokens (<idiom>, </idiom>) #7
- m-1-3: the same as m-1-2, except that it prints out the special tokens before and after the idioms #9
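For reference, here is a minimal sketch of how the two special tokens could be registered with a pre-trained BartTokenizer via Hugging Face transformers (the facebook/bart-base checkpoint is an assumption, not necessarily the checkpoint used here):

```python
from transformers import BartForConditionalGeneration, BartTokenizer

# load a pre-trained tokenizer & model (facebook/bart-base is an assumption)
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# register <idiom> and </idiom> so they are kept as single tokens, never split into subwords
tokenizer.add_special_tokens({"additional_special_tokens": ["<idiom>", "</idiom>"]})

# grow the embedding table to cover the two new token ids
model.resize_token_embeddings(len(tokenizer))
```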
m-1-4 - just don't split the sentences
Why am I going back to using BART? This may not be absolutely terrible yet.
- main_infer.py: don't split sentences #27
- experiment - does it perform better than before (m-1-3)?
- experiment - does it perform better than v3.0.1?
m-1-5 - just don't include the special tokens and treat this as a simple seq2seq problem
Why is only one idiom suggested? Could it be because of the special tokens?
- use difflib to highlight what has been changed (see the sketch below)
- experiment - does it perform better than before (m-1-4 & m-1-3)?
- experiment - does it perform better than GPT-3?
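And a minimal sketch of the difflib idea, with made-up sentences; the actual highlighting would live in the front-end:

```python
from difflib import SequenceMatcher

literal = "You were not direct when I first interviewed you last time"
idiomatic = "You were beating around the bush when I first interviewed you last time"

# collect the word spans of the idiomatic sentence that differ from the literal one
matcher = SequenceMatcher(a=literal.split(), b=idiomatic.split())
changed = [
    " ".join(idiomatic.split()[j1:j2])
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag in ("replace", "insert")
]
print(changed)  # ['beating around the bush']
```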
m-2-x models 🏷 (NER with BERT)
m-1-x models demonstrated some potential, but not without problems. Some of these problems stem from the nature of a seq2seq approach to solving Idiomify.
First, the model occasionally distorts the input sentence. We hope that the model will learn to "copy", and it largely does, but we can never be entirely sure of this with a seq2seq approach. With an NER approach, you can be entirely sure that the source sentence is preserved, because we label the sentence rather than transform it.
Second, normalising variations of idioms into their lemma is like forcing a square peg into a round hole. Sure, you could output something like You were <idiom> beat around the bush </idiom> when I first interviewed you last time, where beating around the bush is the correct form of the idiom. But if recommending a normalised form is what you want to do at the end of the day, then the task is more of an NER task than a seq2seq task, where each idiom is a named entity.
So, what could be better is an NER system rather than a translation system. Granted, it does not explicitly "idiomify" sentences, but it can recommend which idioms to use for which parts of the sentence. I'm not sure whether this will turn out to perform better than seq2seq, but one thing we can guarantee is that NER won't distort the source sentence.
m-2-1
This is the first version of the m-2-x models. As for the labels, we just follow the IOB convention. As for the tokenizer, we just use the pre-trained one; we need no additional tokens.
As for the entities:
- d-1-4: define the entities #12
- d-1-4: Preprocess PIE dataset to build NER labels for Idiomify task #11
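To make the labelling scheme concrete, here is a toy sketch of IOB tags over an idiom span (B-IDIOM / I-IDIOM are my own placeholder tag names; the actual entity inventory is what d-1-4 is for):

```python
# a toy example of IOB labels over an idiom span;
# B-IDIOM / I-IDIOM / O are placeholder tag names, not necessarily the final entity set
tokens = ["You", "were", "beating", "around", "the", "bush", "when", "I", "interviewed", "you"]
labels = ["O", "O", "B-IDIOM", "I-IDIOM", "I-IDIOM", "I-IDIOM", "O", "O", "O", "O"]

for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```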
v3.0 Idiomify with GPT-3
TL;DR - use GPT-3 rather than BERT.
But why a sudden switch from NER with BERT to seq2seq with GPT-3? This is for the following two reasons:
First, the few-shot performance of GPT-3 is far better than I thought. Just have a look at the example below.
An example of few-shot Idiomify. The proof is in the pudding!
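To make the setup concrete, here is a minimal sketch of what such a few-shot call could look like with the legacy Completion API; the prompt wording and the example pairs below are made up, not the curated list behind the example above:

```python
import openai

openai.api_key = "sk-..."  # your own key

# a couple of hand-written literal -> idiomatic pairs, followed by the query
prompt = (
    "Rewrite the sentence with an appropriate idiom.\n"
    "Literal: Stop avoiding the question and answer me.\n"
    "Idiomatic: Stop beating around the bush and answer me.\n"
    "Literal: He revealed the secret by accident.\n"
    "Idiomatic: He let the cat out of the bag by accident.\n"
    "Literal: I decided to start working on my thesis.\n"
    "Idiomatic:"
)

response = openai.Completion.create(
    model="text-davinci-002",  # whichever GPT-3 engine is available
    prompt=prompt,
    max_tokens=64,
    temperature=0.5,
    stop=["\n"],
)
print(response["choices"][0]["text"].strip())
```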
Woah, and that is a result I got with only a handful of carefully curated examples, which is perfectly doable within a few hours. Yes, GPT-3 is expensive, and I'd never use it if I were in industry; the ROI of a GPT-3-based application would be stupidly low unless you charged customers 100 dollars a month. But hey, I'm an academic; all I need is for it to work on a few dozen personal statements. It's okay to stop being an NLP engineer for a few months and just embrace the world of prompt engineering, especially if the performance gain is big enough to justify the price.
But then what would be the point of your research, you may ask. Surely, merely presenting a use case of GPT-3 is by no means research in the field of NLP; that is just another interesting NLP project, because it does not improve the inductive bias of anything. Then what justifies my switch to GPT-3? Technically, I am not an NLP researcher. I'm an SLA researcher. That is, the aim of my research should be (and frankly, should have been all along) coming up with and justifying better ways of teaching a second language to EFL learners.
And that is the second reason for the switch to GPT-3. The top priority of my research should not be designing a better inductive bias. Rather, I should just use the best tools out there to build the feedback system as soon as possible, and focus on asking the right questions and answering them with scientific methods.
So, here are the two reasons, reiterated:
- The Idiomify performance of GPT-3 is far better than I expected
- Suggesting a better inductive bias is not my top priority
And so it begins, the world of prompt engineering.
to-do's:
- Idiomifier class with OpenAI's GPT-3 API #16
- deploy v3.0 to streamlit cloud
v3.0.1 - prompt design with a password check
The fine-tuning approach does not seem to work very well, for reasons I don't yet understand, but I must come up with a complete version by this Friday, so I should have a back-up plan.
This version is a minor upgrade from v3.0: I keep the prompt design of v3.0 but add the password check from v3.1. I'm doing this just in case I end up going back to this prompt design for my research.
v3.0.2 - pay-your-own-request version
Rather than allowing access only to those who know the master key, it is better to open the web app to anyone but ask them to register their own API key. Since OpenAI gives away 30 dollars' worth of API credit, that should cover enough requests as far as my research participants are concerned.
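A minimal sketch of the pay-your-own-request flow in Streamlit; the widget labels, engine name, and prompt are placeholders:

```python
import openai
import streamlit as st

st.title("Idiomify")

# each participant pastes their own key, so requests are billed to their free credit
api_key = st.text_input("Your OpenAI API key", type="password")
text = st.text_area("Paste a sentence to Idiomify")

if st.button("Idiomify") and api_key and text:
    openai.api_key = api_key
    response = openai.Completion.create(
        model="text-davinci-002",  # placeholder engine
        prompt=f"Rewrite the sentence with an appropriate idiom.\nLiteral: {text}\nIdiomatic:",
        max_tokens=128,
        temperature=0.5,
    )
    st.write(response["choices"][0]["text"].strip())
```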
v3.0.3 - preparing for automating the research
- a script for generating a fake alias for each participant (see the sketch below)
- detailed instructions for signing up to OpenAI
- a script for auto-generating Cloze tests (just recalling the definitions in Korean)
- deploy!
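A minimal sketch of the alias-generation script; the alias format and participant list are made up for illustration:

```python
import csv
import secrets

participants = ["participant_01@example.com", "participant_02@example.com"]  # placeholder entries

# assign each participant a short random alias so their submissions stay pseudonymous
with open("aliases.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["email", "alias"])
    for email in participants:
        writer.writerow([email, f"user-{secrets.token_hex(3)}"])
```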
v3.1 fine-tune Davinci with more quality examples
Approaching this with few-shot learning is not sustainable, as the API calls are just too expensive. You must fine-tune a model to build this successfully.
They say to aim for up to 500 examples.
to-do's:
- readme.md
- literal2idiomatic dataset (at least three exemplar usages for each idiom) #21 (see the sketch below)
- experiment - does v3.1 Idiomify more than one phrase, given a long paragraph?
- experiment - does v3.1 Idiomify give more natural suggestions? (Does it no longer "square a peg in a hole"?)
some time in the next version:
- you might want to evaluate your fine-tuned model with an extrinsic measure
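For the fine-tuning itself, here is a minimal sketch of how literal2idiomatic pairs could be converted into the prompt/completion JSONL format that the legacy Davinci fine-tuning endpoint expects; the example pairs, separator, and stop-word conventions are assumptions:

```python
import json

# (literal, idiomatic) pairs drawn from the literal2idiomatic dataset (made-up examples here)
pairs = [
    ("Stop avoiding the question and answer me.",
     "Stop beating around the bush and answer me."),
    ("He revealed the secret by accident.",
     "He let the cat out of the bag by accident."),
]

# legacy fine-tuning expects one {"prompt": ..., "completion": ...} object per line;
# the trailing "\n\n###\n\n" separator and leading space are common conventions
with open("literal2idiomatic.jsonl", "w") as fh:
    for literal, idiomatic in pairs:
        fh.write(json.dumps({
            "prompt": f"{literal}\n\n###\n\n",
            "completion": f" {idiomatic} END",
        }) + "\n")

# the file can then be passed to the legacy CLI, e.g.:
#   openai api fine_tunes.create -t literal2idiomatic.jsonl -m davinci
```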