diff --git a/README.rst b/README.rst
index 640f01c3..a817424a 100644
--- a/README.rst
+++ b/README.rst
@@ -75,13 +75,18 @@ Features
 - **Spell Correction**
 
   Using local Malaysia NLP researches to auto-correct any bahasa words.
-- Stemmer
+- **Stemmer**
+
+  Use Character LSTM Seq2Seq with state-of-the-art attention to do Bahasa stemming.
 - **Subjectivity Analysis**
 
   From fine-tuning BERT, Attention-Recurrent model, Sparse Tensorflow and Self-Attention to build deep subjectivity analysis models.
+- **Similarity**
+
+  Use deep LSTM siamese, deep Dilated CNN siamese, deep Self-Attention siamese, Doc2Vec and BERT to build deep semantic similarity models.
 - **Summarization**
 
-  Using skip-thought with attention state-of-art to give precise unsupervised summarization.
+  Using skip-thought and residual-network with state-of-the-art attention, LDA, LSA and Doc2Vec to give precise unsupervised summarization, with TextRank as the scoring algorithm.
 - **Topic Modelling**
 
   Provide LDA2Vec, LDA, NMF and LSA interface for easy topic modelling with topics visualization.
diff --git a/accuracy/models-accuracy.ipynb b/accuracy/models-accuracy.ipynb
index 0dd43829..05dafd18 100644
--- a/accuracy/models-accuracy.ipynb
+++ b/accuracy/models-accuracy.ipynb
@@ -1111,6 +1111,103 @@
     "```"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Similarity\n",
+    "\n",
+    "Trained on 80% of dataset, tested on 20% of dataset. All training sessions stored in [session/similarity](https://github.com/huseinzol05/Malaya/tree/master/session/similarity)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAuUAAAHeCAYAAAA8b3UNAA ... [base64-encoded PNG data truncated; only the head and tail of the string are retained] ...
3XHHHe7VvxrIqW64qkpp9M2LLrrIBT1qZKwGyKqvq1J0JgSqS+DEE0909cX379/vkqBnUyXmTz75pOudSt0iqgtPVfVTWwja31RXTnFdBBCIh0BaPE7KOUsLaJALjSCn17JMCCRSQD2mqMtD1cFVf87qoULVqVRafvfdd7vBVvTmRs9m69atCXISmTlcq4yA6oTrR9VUVBKuSQ051RBe1VXU1ayeZyYEEEAgGQWoU56Muco9hVJAQ4rPmDHD3nnnHTekuKoAqDvOF1980Vq0aOEadaqkUb0AzZs3z0aOHOnq6apfZwKdUD4ygbxpPb96NhWUK0BXt7ETJ050XSOqZyAmBBBAIFkFCMqTNWe5r9AJPP74467+rfoSV68V6tVHQYx6rdCrftUnr1GjhutKTjiqV86EQBAFPvroI9czlQZSU+m4elxhQgABBJJdgKA82XOY+wuFgOqCa6AU9UqhwajUB7mGIVef+OrSUN3HDR061A0IpIZy6rUiNzc3FDbcJAIIIIAAAseDAEH58ZBLpBGBoxC49957TX3ja2revLkbcVM9V8yZM8eNzKlqAVqvrhA1GiITAggggAACCARHgKA8OHlBShCoksCRI0dcCXmDBg1cKbjqmJ9//vn2xhtvmHqxYEIAAQQQQACB4ArQ+0pw84aUIVApAfXZrF5VVB9XDTc12EqnTp3c4CuVOhE7I4AAAggggEDCBeinPOHkXBCB+AmoX/zFixe77g4bNWrk+nemL+f4eXNmBBBAAAEE/BKg+opfkpwHAQQQQAABBBBAAIFjFKCk/BjhOAwBBBBAAAEEEEAAAb8ECMr9kuQ8CCCAAAIIIIAAAggcowBB+THCcRgCCCCAAAIIIIAAAn4JEJT7Jcl5EEAAAQQQQAABBBA4RgGC8mOE4zAEEEAAAQQQQAABBPwSICj3S5LzIIAAAggggAACCCBwjAIE5ccIx2EIIIAAAggggAACCPglQFDulyTnQQABBBBAAAEEEEDgGAUIyo8RjsMQQAABBBBAAAEEEPBLgKDcL0nOgwACCCCAAAIIIIDAMQoQlB8jHIchgAACCCCAAAIIIOCXAEG5X5KcBwEEEEAAAQQQQACBYxT4/y+ewAycXRF6AAAAAElFTkSuQmCC\n", + "text/plain": [ + "" + ] + }, + "metadata": { + "image/png": { + "width": 500 + } + }, + "output_type": "display_data" + } + ], + "source": [ + "display(Image('similarity-accuracy.png', width=500))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### bahdanau\n", + "\n", + "```text\n", + " precision recall f1-score support\n", + "\n", + "not similar 0.83 0.83 0.83 31524\n", + " similar 0.71 0.71 0.71 18476\n", + "\n", + "avg / total 0.79 0.79 0.79 50000\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### self-attention\n", + "\n", + "```text\n", + " precision recall f1-score support\n", + "\n", + "not similar 0.81 0.83 0.82 31524\n", + " similar 0.70 0.67 0.68 18476\n", + "\n", + "avg / total 0.77 0.77 0.77 50000\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### dilated-cnn\n", + "\n", + "```text\n", + " precision recall f1-score support\n", + "\n", + "not similar 0.82 0.82 0.82 31524\n", + " similar 0.69 0.69 0.69 18476\n", + "\n", + "avg / total 0.77 0.77 0.77 50000\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### bert\n", + "\n", + "```text\n", + " precision recall f1-score support\n", + "\n", + "not similar 0.86 0.86 0.86 50757\n", + " similar 0.77 0.76 0.76 30010\n", + "\n", + "avg / total 0.83 0.83 0.83 80767\n", + "```" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -1122,7 +1219,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 9, "metadata": {}, "outputs": [ { diff --git a/accuracy/models-accuracy.rst b/accuracy/models-accuracy.rst index 2ddbc210..2e9c0fae 100644 --- a/accuracy/models-accuracy.rst +++ b/accuracy/models-accuracy.rst @@ -824,6 +824,71 @@ BERT avg / total 0.88 0.87 0.86 84104 +Similarity +---------- + +Trained on 80% of dataset, tested on 20% of dataset. All training +sessions stored in +`session/similarity `__ + +.. code:: ipython3 + + display(Image('similarity-accuracy.png', width=500)) + + + +.. image:: models-accuracy_files/models-accuracy_58_0.png + :width: 500px + + +bahdanau +^^^^^^^^ + +.. code:: text + + precision recall f1-score support + + not similar 0.83 0.83 0.83 31524 + similar 0.71 0.71 0.71 18476 + + avg / total 0.79 0.79 0.79 50000 + +self-attention +^^^^^^^^^^^^^^ + +.. 
code:: text + + precision recall f1-score support + + not similar 0.81 0.83 0.82 31524 + similar 0.70 0.67 0.68 18476 + + avg / total 0.77 0.77 0.77 50000 + +dilated-cnn +^^^^^^^^^^^ + +.. code:: text + + precision recall f1-score support + + not similar 0.82 0.82 0.82 31524 + similar 0.69 0.69 0.69 18476 + + avg / total 0.77 0.77 0.77 50000 + +bert +^^^^ + +.. code:: text + + precision recall f1-score support + + not similar 0.86 0.86 0.86 50757 + similar 0.77 0.76 0.76 30010 + + avg / total 0.83 0.83 0.83 80767 + Dependency parsing ------------------ @@ -837,7 +902,7 @@ sessions stored in -.. image:: models-accuracy_files/models-accuracy_58_0.png +.. image:: models-accuracy_files/models-accuracy_64_0.png :width: 500px diff --git a/accuracy/models-accuracy_files/models-accuracy_58_0.png b/accuracy/models-accuracy_files/models-accuracy_58_0.png index c79647b9..2ab5798d 100644 Binary files a/accuracy/models-accuracy_files/models-accuracy_58_0.png and b/accuracy/models-accuracy_files/models-accuracy_58_0.png differ diff --git a/accuracy/models-accuracy_files/models-accuracy_64_0.png b/accuracy/models-accuracy_files/models-accuracy_64_0.png new file mode 100644 index 00000000..c79647b9 Binary files /dev/null and b/accuracy/models-accuracy_files/models-accuracy_64_0.png differ diff --git a/accuracy/similarity-accuracy.png b/accuracy/similarity-accuracy.png new file mode 100644 index 00000000..2ab5798d Binary files /dev/null and b/accuracy/similarity-accuracy.png differ diff --git a/accuracy/similarity-template.js b/accuracy/similarity-template.js new file mode 100644 index 00000000..e1ddc223 --- /dev/null +++ b/accuracy/similarity-template.js @@ -0,0 +1,26 @@ +option = { + xAxis: { + type: 'category', + axisLabel: { + interval: 0, + rotate: 30 + }, + data: ['bahdanau','self-attention', 'dilated-cnn', 'BERT'] + }, + yAxis: { + type: 'value', + min:0.76, + max:0.83 + }, + backgroundColor:'rgb(252,252,252)', + series: [{ + data: [0.79, 0.77, 0.77, 0.83], + type: 'bar', + label: { + normal: { + show: true, + position: 'top' + } + }, + }] +}; diff --git a/docs/Api.rst b/docs/Api.rst index 8eccc823..e6f05168 100644 --- a/docs/Api.rst +++ b/docs/Api.rst @@ -198,3 +198,75 @@ malaya.word2vec .. autoclass:: malaya.word2vec.word2vec() :members: + +malaya._models._sklearn_model +--------------------------------- + +.. autoclass:: malaya._models._sklearn_model.CRF() + :members: + +.. autoclass:: malaya._models._sklearn_model.DEPENDENCY() + :members: + +.. autoclass:: malaya._models._sklearn_model.BINARY_XGB() + :members: + +.. autoclass:: malaya._models._sklearn_model.BINARY_BAYES() + :members: + +.. autoclass:: malaya._models._sklearn_model.MULTICLASS_XGB() + :members: + +.. autoclass:: malaya._models._sklearn_model.MULTICLASS_BAYES() + :members: + +.. autoclass:: malaya._models._sklearn_model.TOXIC() + :members: + +.. autoclass:: malaya._models._sklearn_model.LANGUAGE_DETECTION() + :members: + +malaya._models._tensorflow_model +--------------------------------- + +.. autoclass:: malaya._models._tensorflow_model.DEPENDENCY() + :members: + +.. autoclass:: malaya._models._tensorflow_model.TAGGING() + :members: + +.. autoclass:: malaya._models._tensorflow_model.BINARY_BERT() + :members: + +.. autoclass:: malaya._models._tensorflow_model.MULTICLASS_BERT() + :members: + +.. autoclass:: malaya._models._tensorflow_model.SIGMOID_BERT() + :members: + +.. autoclass:: malaya._models._tensorflow_model.SOFTMAX() + :members: + +.. autoclass:: malaya._models._tensorflow_model.BINARY_SOFTMAX() + :members: + +.. 
autoclass:: malaya._models._tensorflow_model.MULTICLASS_SOFTMAX()
+   :members:
+
+.. autoclass:: malaya._models._tensorflow_model.SIGMOID()
+   :members:
+
+.. autoclass:: malaya._models._tensorflow_model.DEEP_LANG()
+   :members:
+
+.. autoclass:: malaya._models._tensorflow_model.SPARSE_SOFTMAX()
+   :members:
+
+.. autoclass:: malaya._models._tensorflow_model.SPARSE_SIGMOID()
+   :members:
+
+.. autoclass:: malaya._models._tensorflow_model.SIAMESE()
+   :members:
+
+.. autoclass:: malaya._models._tensorflow_model.SIAMESE_BERT()
+   :members:
diff --git a/docs/README.rst b/docs/README.rst
index 640f01c3..a817424a 100644
--- a/docs/README.rst
+++ b/docs/README.rst
@@ -75,13 +75,18 @@ Features
 - **Spell Correction**
 
   Using local Malaysia NLP researches to auto-correct any bahasa words.
-- Stemmer
+- **Stemmer**
+
+  Use Character LSTM Seq2Seq with state-of-the-art attention to do Bahasa stemming.
 - **Subjectivity Analysis**
 
   From fine-tuning BERT, Attention-Recurrent model, Sparse Tensorflow and Self-Attention to build deep subjectivity analysis models.
+- **Similarity**
+
+  Use deep LSTM siamese, deep Dilated CNN siamese, deep Self-Attention siamese, Doc2Vec and BERT to build deep semantic similarity models.
 - **Summarization**
 
-  Using skip-thought with attention state-of-art to give precise unsupervised summarization.
+  Using skip-thought and residual-network with state-of-the-art attention, LDA, LSA and Doc2Vec to give precise unsupervised summarization, with TextRank as the scoring algorithm.
 - **Topic Modelling**
 
   Provide LDA2Vec, LDA, NMF and LSA interface for easy topic modelling with topics visualization.
diff --git a/docs/conf.py b/docs/conf.py
index 1a9b7ec4..446b50e9 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -76,6 +76,7 @@ def __getattr__(cls, name):
     'sklearn.neighbors',
     'pulp',
     'ftfy',
+    'networkx',
 ]
 
 sys.modules.update((mod_name, Mock()) for mod_name in MOCK_MODULES)
diff --git a/docs/load-similarity.rst b/docs/load-similarity.rst
index e4649f51..67263a44 100644
--- a/docs/load-similarity.rst
+++ b/docs/load-similarity.rst
@@ -7,10 +7,296 @@
 
 .. parsed-literal::
 
-    CPU times: user 10.7 s, sys: 906 ms, total: 11.6 s
-    Wall time: 12 s
+    CPU times: user 12.5 s, sys: 1.77 s, total: 14.3 s
+    Wall time: 19.5 s
 
 
+Deep Siamese network
+--------------------
+
+The purpose of a deep Siamese network is to study semantic similarity
+between 2 strings; a score near 1.0 means more similar. Deep Siamese
+networks leverage the power of word vectors, and we also implemented
+BERT to study semantic similarity, leveraging the power of attention!
+
+List deep siamese models
+------------------------
+
+.. code:: python
+
+    malaya.similarity.available_deep_siamese()
+
+
+
+
+.. parsed-literal::
+
+    ['self-attention', 'bahdanau', 'dilated-cnn']
+
+
+
+- ``'self-attention'`` - Fast-text architecture, embedded and logits
+  layers only with self attention.
+- ``'bahdanau'`` - LSTM with bahdanau attention architecture.
+- ``'dilated-cnn'`` - Pyramid Dilated CNN architecture.
+
+Load deep siamese models
+------------------------
+
+.. code:: python
+
+    string1 = 'Pemuda mogok lapar desak kerajaan prihatin isu iklim'
+    string2 = 'Perbincangan isu pembalakan perlu babit kerajaan negeri'
+    string3 = 'kerajaan perlu kisah isu iklim, pemuda mogok lapar'
+    string4 = 'Kerajaan dicadang tubuh jawatankuasa khas tangani isu alam sekitar'
+
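+For a quick side-by-side comparison, we can loop over every available
+deep siamese model on the same pair of strings. This is only a sketch,
+and assumes you have enough memory to load all three models in one
+session; ``predict`` itself is explained below.
+
+.. code:: python
+
+    # sketch: score the same pair with every deep siamese architecture
+    for name in malaya.similarity.available_deep_siamese():
+        model = malaya.similarity.deep_siamese(name)
+        print(name, model.predict(string1, string3))
+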
+Load bahdanau model
+^^^^^^^^^^^^^^^^^^^
+
+.. code:: python
+
+    model = malaya.similarity.deep_siamese('bahdanau')
+
+Calculate similarity between 2 strings
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+``predict`` needs 2 strings, a left and a right string.
+
+.. code:: python
+
+    model.predict(string1, string2)
+
+
+
+
+.. parsed-literal::
+
+    0.4267301
+
+
+
+.. code:: python
+
+    model.predict(string1, string3)
+
+
+
+
+.. parsed-literal::
+
+    0.28711933
+
+
+
+Calculate similarity more than 2 strings
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+``predict_batch`` needs 2 lists of strings, left and right strings.
+
+.. code:: python
+
+    model.predict_batch([string1, string2], [string3, string4])
+
+
+
+
+.. parsed-literal::
+
+    array([0.39504164, 0.33375728], dtype=float32)
+
+
+
+Load self-attention model
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: python
+
+    model = malaya.similarity.deep_siamese('self-attention')
+
+.. code:: python
+
+    model.predict_batch([string1, string2], [string3, string4])
+
+
+
+
+.. parsed-literal::
+
+    array([0.08130383, 0.09907728], dtype=float32)
+
+
+
+Load dilated-cnn model
+^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: python
+
+    model = malaya.similarity.deep_siamese('dilated-cnn')
+
+.. code:: python
+
+    model.predict_batch([string1, string2], [string3, string4])
+
+
+
+
+.. parsed-literal::
+
+    array([0.1886251 , 0.00937402], dtype=float32)
+
+
+
+Calculate similarity using doc2vec
+----------------------------------
+
+We need to load a word vector provided by Malaya first.
+
+Important parameters,
+
+1. ``aggregation``, aggregation function to accumulate word vectors.
+   Default is ``mean``.
+
+   - ``'mean'`` - mean.
+   - ``'min'`` - min.
+   - ``'max'`` - max.
+   - ``'sum'`` - sum.
+   - ``'sqrt'`` - square root.
+
+2. ``similarity``, distance function to calculate similarity. Default is
+   ``cosine``.
+
+   - ``'cosine'`` - cosine similarity.
+   - ``'euclidean'`` - euclidean similarity.
+   - ``'manhattan'`` - manhattan similarity.
+
+Using word2vec
+^^^^^^^^^^^^^^
+
+I will use ``load_news``; word2vec from wikipedia takes a very long time
+to load, although it is much more accurate.
+
+.. code:: python
+
+    embedded_news = malaya.word2vec.load_news(64)
+    w2v_wiki = malaya.word2vec.word2vec(embedded_news['nce_weights'],
+                                        embedded_news['dictionary'])
+
+.. code:: python
+
+    malaya.similarity.doc2vec(w2v_wiki, string1, string2)
+
+
+
+
+.. parsed-literal::
+
+    0.9181415736675262
+
+
+
+.. code:: python
+
+    malaya.similarity.doc2vec(w2v_wiki, string1, string4)
+
+
+
+
+.. parsed-literal::
+
+    0.9550771713256836
+
+
+
+.. code:: python
+
+    malaya.similarity.doc2vec(w2v_wiki, string1, string4, similarity = 'euclidean')
+
+
+
+
+.. parsed-literal::
+
+    0.4642694249990522
+
+
+
+Different similarity functions give different percentages.
+
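+Since ``aggregation`` is listed as an important parameter alongside
+``similarity``, a minimal sketch combining both (assuming ``aggregation``
+is passed as a keyword the same way as ``similarity``, and reusing the
+``w2v_wiki`` object loaded above):
+
+.. code:: python
+
+    # assumed usage: max-pool the word vectors instead of averaging,
+    # then compare with manhattan distance
+    malaya.similarity.doc2vec(w2v_wiki, string1, string4,
+                              aggregation = 'max', similarity = 'manhattan')
+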
+**So you can also try fast-text and ELMo to do the similarity study.**
+
+Calculate similarity using summarizer
+-------------------------------------
+
+We can use the extractive summarization model
+``malaya.summarize.deep_extractive()`` to embed strings and
+calculate similarity between the vectors.
+
+.. code:: python
+
+    deep_summary = malaya.summarize.deep_extractive(model = 'skip-thought')
+
+.. code:: python
+
+    malaya.similarity.summarizer(deep_summary, string1, string3)
+
+
+
+
+.. parsed-literal::
+
+    0.8722701370716095
+
+
+
+BERT model
+----------
+
+BERT is the best similarity model in terms of accuracy; you can check the
+similarity accuracy here,
+https://malaya.readthedocs.io/en/latest/Accuracy.html#similarity. But be
+warned, the model size is 700MB! Make sure you have enough resources to
+use BERT, and install bert-tensorflow first.
+
+.. code:: python
+
+    model = malaya.similarity.bert()
+
+.. code:: python
+
+    model.predict(string1, string3)
+
+
+
+
+.. parsed-literal::
+
+    0.97767043
+
+
+
+.. code:: python
+
+    model.predict_batch([string1, string2], [string3, string4])
+
+
+
+
+.. parsed-literal::
+
+    array([0.9253927, 0.0317315], dtype=float32)
+
+
+
+**BERT is the best!**
+
+Topics similarity
+-----------------
+
+If you are interested in searching for multiple topics inside a string,
+given a set of topics to supervise, Malaya provides an interface and a
+set of topics related to the political landscape in Malaysia.
 
 .. code:: python
 
     news = 'najib razak dan mahathir mengalami masalah air di kemamam terengganu'
diff --git a/docs/load-summarization.rst b/docs/load-summarization.rst
index 08fe73a7..8dcaa9e5 100644
--- a/docs/load-summarization.rst
+++ b/docs/load-summarization.rst
@@ -7,8 +7,8 @@
 
 .. parsed-literal::
 
-    CPU times: user 11.9 s, sys: 1.46 s, total: 13.4 s
-    Wall time: 17 s
+    CPU times: user 12.3 s, sys: 1.53 s, total: 13.8 s
+    Wall time: 17.8 s
 
 
 .. code:: python
@@ -36,12 +36,40 @@
 We also can give a string, Malaya will always split a string into
 multiple sentences.
 
-Load Pretrained News summarization deep learning
-------------------------------------------------
+Important parameters,
+
+1. ``top_k``, number of summarized strings.
+2. ``important_words``, number of important words.
+
+List available deep extractive models
+-------------------------------------
+
+.. code:: python
+
+    malaya.summarize.available_deep_extractive()
+
+
+
+
+.. parsed-literal::
+
+    ['skip-thought', 'residual-network']
+
+
+
+- ``'skip-thought'`` - skip-thought summarization deep learning model
+  trained on a news dataset. Hopefully we can train on a wikipedia
+  dataset too.
+- ``'residual-network'`` - residual network with Bahdanau Attention
+  summarization deep learning model trained on a wikipedia dataset.
+
+We use TextRank as the scoring algorithm.
+
+Load Pretrained extractive skip-thought summarization
+-----------------------------------------------------
 
 .. code:: python
 
-    deep_summary = malaya.summarize.deep_model_news()
+    deep_summary = malaya.summarize.deep_extractive(model = 'skip-thought')
 
 .. code:: python
 
@@ -52,7 +80,7 @@
 
 .. parsed-literal::
 
-    {'summary': 'Namun, ada satu persamaan yang mengeratkan hubungan mereka kerana sama-sama mencintai bidang muzik sejak dulu. "Kami pernah terbabit dengan showcase dan majlis korporat sebelum ini. "Sedangkan artis juga menyanyi untuk kerjaya dan ia juga punca pendapatan bagi menyara hidup," katanya.',
+    {'summary': 'Bagi Sheila pula, dia memang ada terbabit dengan beberapa persembahan bersama Zainal cuma tiada publisiti ketika itu. "Sebab itu, saya sukar menolak untuk bekerjasama dengannya dalam Festival KL Jamm yang dianjurkan buat julung kali dan berkongsi pentas dalam satu konsert bertaraf antarabangsa," katanya. 
"Saya bersama Sheila serta Datuk Afdlin Shauki akan terbabit dalam satu segmen yang ditetapkan.', 'top-words': ['dumex', 'unchallenged', 'yussoffkaunsel', @@ -63,16 +91,16 @@ Load Pretrained News summarization deep learning 'kepulangan', 'mandat', 'kelembaban'], - 'cluster-top-words': ['kelembaban', - 'merotan', - 'pancaroba', + 'cluster-top-words': ['unchallenged', + 'kelembaban', 'yussoffkaunsel', 'dumex', - 'unchallenged', - 'vienna', - 'mandat', 'sharmini', - 'kepulangan']} + 'merotan', + 'pancaroba', + 'kepulangan', + 'mandat', + 'vienna']} @@ -85,7 +113,7 @@ Load Pretrained News summarization deep learning .. parsed-literal:: - {'summary': '"Kedua UMNO sebagai sebuah parti sangat menghormati dan memahami keperluan sekolah vernakular di Malaysia. Kenyataan kontroversi Setiausaha Agung Barisan Nasional (BN), Datuk Seri Mohamed Nazri Aziz berhubung sekolah vernakular merupakan pandangan peribadi beliau. Pertama pendirian beliau tersebut adalah pandangan peribadi yang tidak mewakili pendirian dan pandangan UMNO.', + {'summary': 'Pertama pendirian beliau tersebut adalah pandangan peribadi yang tidak mewakili pendirian dan pandangan UMNO. UMNO berpendirian sekolah jenis ini perlu terus wujud di negara kita," katanya dalam satu kenyataan akhbar malam ini. "Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN," katanya.', 'top-words': ['bersabdabarangsiapa', 'kepulangan', 'seliakekurangan', @@ -96,16 +124,16 @@ Load Pretrained News summarization deep learning 'chusus', 'mempunya', 'diharap'], - 'cluster-top-words': ['seliakekurangan', - 'bersabdabarangsiapa', + 'cluster-top-words': ['bersabdabarangsiapa', + 'sharmini', 'poupart', - 'chusus', - 'sakailah', + 'diharap', + 'kepulangan', 'pembikin', - 'sharmini', + 'seliakekurangan', + 'sakailah', 'mempunya', - 'kepulangan', - 'diharap']} + 'chusus']} @@ -127,28 +155,55 @@ You also can change sentences to vector representation using .. code:: python - deep_summary.vectorize(isu_kerajaan).shape + deep_summary.vectorize(isu_string).shape .. parsed-literal:: - (12, 128) + (34, 128) -Load Pretrained Wikipedia summarization deep learning ------------------------------------------------------ +Load Pretrained extractive residual-network summarization +--------------------------------------------------------- + +.. code:: python + + deep_summary = malaya.summarize.deep_extractive(model = 'residual-network') .. code:: python - deep_summary = malaya.summarize.deep_model_wiki() + deep_summary.summarize(isu_string,important_words=10) + + .. parsed-literal:: - WARNING: this model is using convolutional based, Tensorflow-GPU above 1.10 may got a problem. Please downgrade to Tensorflow-GPU v1.8 if got any cuDNN error. + {'summary': "Manakala, artis antarabangsa pula membabitkan J Arie (Hong Kong), NCT Dream (Korea Selatan) dan DJ Sura (Korea Selatan). DUA legenda hebat dan 'The living legend' ini sudah memartabatkan bidang muzik sejak lebih tiga dekad lalu. Bagi Sheila pula, dia memang ada terbabit dengan beberapa persembahan bersama Zainal cuma tiada publisiti ketika itu.", + 'top-words': ['jagaannya', + 'ferdy', + 'hoe', + 'laksmi', + 'zulkifli', + 'televisyen', + 'lanun', + 'ongr', + 'sharidake', + 'kawan'], + 'cluster-top-words': ['sharidake', + 'hoe', + 'ferdy', + 'lanun', + 'zulkifli', + 'laksmi', + 'televisyen', + 'ongr', + 'jagaannya', + 'kawan']} + .. 
code:: python @@ -160,27 +215,27 @@ Load Pretrained Wikipedia summarization deep learning .. parsed-literal:: - {'summary': 'Mohamed Nazri semalam menjelaskan, kenyataannya mengenai sekolah jenis kebangsaan Cina dan Tamil baru-baru ini disalah petik pihak media. "Kedua UMNO sebagai sebuah parti sangat menghormati dan memahami keperluan sekolah vernakular di Malaysia. "Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN," katanya.', + {'summary': 'Timbalan Presiden UMNO, Datuk Seri Mohamad Hasan berkata, kenyataan tersebut tidak mewakili pendirian serta pandangan UMNO kerana parti itu menghormati serta memahami keperluan sekolah vernakular dalam negara. "Saya ingin menegaskan dua perkara penting. "Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN," katanya.', 'top-words': ['jagaannya', 'ferdy', 'hoe', 'zulkifli', - 'televisyen', 'lanun', 'laksmi', 'ongr', + 'televisyen', 'kawan', - 'diimbau'], - 'cluster-top-words': ['televisyen', - 'jagaannya', - 'diimbau', - 'zulkifli', + 'sharidake'], + 'cluster-top-words': ['sharidake', + 'hoe', + 'ferdy', 'lanun', + 'zulkifli', 'laksmi', - 'kawan', + 'televisyen', 'ongr', - 'hoe', - 'ferdy']} + 'jagaannya', + 'kawan']} @@ -202,80 +257,74 @@ You also can change sentences to vector representation using .. code:: python - deep_summary.vectorize(isu_kerajaan).shape + deep_summary.vectorize(isu_string).shape .. parsed-literal:: - (12, 64) + (34, 64) -Train skip-thought summarization deep learning model ----------------------------------------------------- - -.. code:: python +Train LSA model +--------------- - deep_summary = malaya.summarize.train_skip_thought(isu_kerajaan, batch_size = 2) +Important parameters, +1. ``vectorizer``, vectorizer technique. Allowed values: -.. parsed-literal:: - - minibatch loop: 100%|██████████| 5/5 [00:01<00:00, 2.94it/s, cost=9.45] - minibatch loop: 100%|██████████| 5/5 [00:01<00:00, 4.56it/s, cost=7.99] - minibatch loop: 100%|██████████| 5/5 [00:01<00:00, 4.67it/s, cost=6.61] - minibatch loop: 100%|██████████| 5/5 [00:01<00:00, 4.62it/s, cost=5.34] - minibatch loop: 100%|██████████| 5/5 [00:01<00:00, 4.55it/s, cost=4.17] + - ``'bow'`` - Bag of Word. + - ``'tfidf'`` - Term frequency inverse Document Frequency. + - ``'skip-gram'`` - Bag of Word with skipping certain n-grams. +2. ``ngram``, n-grams size to train a corpus. +3. ``important_words``, number of important words. +4. ``top_k``, number of summarized strings. .. code:: python - deep_summary.summarize(isu_kerajaan,important_words=10) + malaya.summarize.lsa(isu_kerajaan,important_words=10) .. parsed-literal:: - {'summary': 'Pertama pendirian beliau tersebut adalah pandangan peribadi yang tidak mewakili pendirian dan pandangan UMNO. Kenyataan kontroversi Setiausaha Agung Barisan Nasional (BN), Datuk Seri Mohamed Nazri Aziz berhubung sekolah vernakular merupakan pandangan peribadi beliau. 
Kata beliau, komitmen UMNO dan BN berhubung perkara itu dapat dilihat dengan jelas dalam bentuk sokongan infrastruktur, pengiktirafan dan pemberian peruntukan yang diperlukan.', - 'top-words': ['vernakular', - 'bentuk', - 'parti', - 'jelas', - 'pertama', - 'disalah', - 'adalah', - 'kekuatan', - 'bahawa', - 'penting'], - 'cluster-top-words': ['adalah', - 'penting', - 'bentuk', - 'pertama', - 'bahawa', - 'parti', - 'disalah', - 'kekuatan', - 'jelas', - 'vernakular']} - - - -Train LSA model ---------------- + {'summary': 'Timbalan Presiden UMNO, Datuk Seri Mohamad Hasan berkata, kenyataan tersebut tidak mewakili pendirian serta pandangan UMNO kerana parti itu menghormati serta memahami keperluan sekolah vernakular dalam negara. "Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN," katanya. UMNO berpendirian sekolah jenis ini perlu terus wujud di negara kita," katanya dalam satu kenyataan akhbar malam ini.', + 'top-words': ['umno', + 'nyata', + 'sekolah', + 'pandang', + 'vernakular', + 'hormat', + 'sekolah vernakular', + 'nazri', + 'hormat paham', + 'hak'], + 'cluster-top-words': ['hak', + 'pandang', + 'sekolah vernakular', + 'hormat paham', + 'umno', + 'nazri', + 'nyata']} + + + +We can use ``tfidf`` as vectorizer. .. code:: python - malaya.summarize.lsa(isu_kerajaan,important_words=10) + malaya.summarize.lsa(isu_kerajaan,important_words=10, ngram = (1,3), vectorizer = 'tfidf') .. parsed-literal:: - {'summary': 'Menurut beliau, persefahaman dan keupayaan meraikan kepelbagaian itu menjadi kelebihan dan kekuatan UMNO dan BN selama ini. Kata beliau, komitmen UMNO dan BN berhubung perkara itu dapat dilihat dengan jelas dalam bentuk sokongan infrastruktur, pengiktirafan dan pemberian peruntukan yang diperlukan. "Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN," katanya.', + {'summary': 'Timbalan Presiden UMNO, Datuk Seri Mohamad Hasan berkata, kenyataan tersebut tidak mewakili pendirian serta pandangan UMNO kerana parti itu menghormati serta memahami keperluan sekolah vernakular dalam negara. Mohamad yang menjalankan tugas-tugas Presiden UMNO berkata, UMNO konsisten dengan pendirian itu dalam mengiktiraf kepelbagaian bangsa dan etnik termasuk hak untuk beragama serta mendapat pendidikan. UMNO berpendirian sekolah jenis ini perlu terus wujud di negara kita," katanya dalam satu kenyataan akhbar malam ini.', 'top-words': ['wakil pandang umno', 'mohamed', 'paham sekolah vernakular', @@ -286,74 +335,77 @@ Train LSA model 'mohamed nazri', 'mohamad', 'pandang peribadi'], - 'cluster-top-words': ['negara', - 'mohamad', - 'pandang peribadi', + 'cluster-top-words': ['pandang peribadi', 'wakil pandang umno', - 'mohamed nazri', 'nazri nyata', - 'paham sekolah vernakular']} + 'negara', + 'paham sekolah vernakular', + 'mohamad', + 'mohamed nazri']} + +We can use ``skip-gram`` as vectorizer, and can override ``skip`` value. .. code:: python - malaya.summarize.lsa(isu_string,important_words=10) + malaya.summarize.lsa(isu_kerajaan,important_words=10, ngram = (1,3), vectorizer = 'skip-gram', skip = 3) .. parsed-literal:: - {'summary': "KL Jamm dianjurkan Music Unlimited International Sdn Bhd dan bakal menggabungkan pelbagai genre muzik seperti rock, hip hop, jazz dan pop dengan lebih 100 persembahan, 20 'showcase', pameran dan perdagangan berkaitan. 
Festival tiga hari itu bakal berlangsung di Pusat Pameran dan Perdagangan Antarabangsa Malaysia (MITEC), Kuala Lumpur pada 26 hingga 28 April ini. Maklumat mengenai pembelian tiket dan keterangan lanjut boleh melayari www.kljamm.com.", - 'top-words': ['zaman', - 'jamm anjur', - 'genre muzik rock', - 'hip', - 'hip hop', - 'hip hop jazz', - 'hop', - 'hop jazz', - 'hop jazz pop', - 'jazz pop'], - 'cluster-top-words': ['hip hop jazz', - 'genre muzik rock', - 'hop jazz pop', - 'jamm anjur', - 'zaman']} - - - -Train NMF model ---------------- + {'summary': 'Mohamed Nazri semalam menjelaskan, kenyataannya mengenai sekolah jenis kebangsaan Cina dan Tamil baru-baru ini disalah petik pihak media. UMNO berpendirian sekolah jenis ini perlu terus wujud di negara kita," katanya dalam satu kenyataan akhbar malam ini. Kata Nazri dalam kenyataannya itu, beliau menekankan bahawa semua pihak perlu menghormati hak orang Melayu dan bumiputera.', + 'top-words': ['umno', + 'sekolah', + 'nyata', + 'pandang', + 'nazri', + 'hormat', + 'vernakular', + 'pandang umno', + 'sekolah vernakular', + 'presiden umno'], + 'cluster-top-words': ['pandang umno', + 'sekolah vernakular', + 'nazri', + 'nyata', + 'presiden umno', + 'hormat']} + + .. code:: python - malaya.summarize.nmf(isu_kerajaan,important_words=10) + malaya.summarize.lsa(isu_string,important_words=10) .. parsed-literal:: - {'summary': 'Menurut beliau, persefahaman dan keupayaan meraikan kepelbagaian itu menjadi kelebihan dan kekuatan UMNO dan BN selama ini. Kata beliau, komitmen UMNO dan BN berhubung perkara itu dapat dilihat dengan jelas dalam bentuk sokongan infrastruktur, pengiktirafan dan pemberian peruntukan yang diperlukan. "Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN," katanya.', - 'top-words': ['wakil pandang umno', - 'mohamed', - 'paham sekolah vernakular', - 'paham sekolah', - 'paham', - 'negara', - 'nazri nyata', - 'mohamed nazri', - 'mohamad', - 'pandang peribadi'], - 'cluster-top-words': ['negara', - 'mohamad', - 'pandang peribadi', - 'wakil pandang umno', - 'mohamed nazri', - 'nazri nyata', - 'paham sekolah vernakular']} + {'summary': 'Konsert berbayar Mewakili golongan anak seni, Sheila menaruh harapan semoga Festival KL Jamm akan menjadi platform buat artis yang sudah ada nama dan artis muda untuk membuat persembahan, sekali gus sama-sama memartabatkan industri muzik tempatan. Festival KL Jamm bakal menghimpunkan barisan artis tempatan baru dan nama besar dalam konsert iaitu Datuk Ramli Sarip, Datuk Afdlin Shauki, Zamani, Amelina, Radhi OAG, Dr Burn, Santesh, Rabbit Mac, Sheezy, kumpulan Bunkface, Ruffedge, Pot Innuendo, artis dari Kartel (Joe Flizzow, Sona One, Ila Damia, Yung Raja, Faris Jabba dan Abu Bakarxli) dan Malaysia Pasangge (artis India tempatan). "Sedangkan artis juga menyanyi untuk kerjaya dan ia juga punca pendapatan bagi menyara hidup," katanya.', + 'top-words': ['artis', + 'sheila', + 'konsert', + 'muzik', + 'nyanyi', + 'sembah', + 'festival', + 'jamm', + 'kl', + 'babit'], + 'cluster-top-words': ['muzik', + 'babit', + 'sheila', + 'konsert', + 'jamm', + 'nyanyi', + 'artis', + 'festival', + 'kl', + 'sembah']} @@ -369,47 +421,114 @@ Train LDA model .. parsed-literal:: - {'summary': 'Menurut beliau, persefahaman dan keupayaan meraikan kepelbagaian itu menjadi kelebihan dan kekuatan UMNO dan BN selama ini. 
Kata beliau, komitmen UMNO dan BN berhubung perkara itu dapat dilihat dengan jelas dalam bentuk sokongan infrastruktur, pengiktirafan dan pemberian peruntukan yang diperlukan. "Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN," katanya.', - 'top-words': ['wakil pandang umno', - 'mohamed', - 'paham sekolah vernakular', - 'paham sekolah', - 'paham', - 'negara', - 'nazri nyata', - 'mohamed nazri', - 'mohamad', - 'pandang peribadi'], - 'cluster-top-words': ['negara', - 'mohamad', - 'pandang peribadi', - 'wakil pandang umno', - 'mohamed nazri', - 'nazri nyata', - 'paham sekolah vernakular']} + {'summary': '"Saya ingin menegaskan dua perkara penting. "Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN," katanya. Mohamad yang menjalankan tugas-tugas Presiden UMNO berkata, UMNO konsisten dengan pendirian itu dalam mengiktiraf kepelbagaian bangsa dan etnik termasuk hak untuk beragama serta mendapat pendidikan.', + 'top-words': ['umno', + 'nyata', + 'sekolah', + 'pandang', + 'vernakular', + 'hormat', + 'sekolah vernakular', + 'nazri', + 'hormat paham', + 'hak'], + 'cluster-top-words': ['nazri', + 'umno', + 'pandang', + 'hak', + 'nyata', + 'hormat paham', + 'sekolah vernakular']} -Not clustering important words -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +.. code:: python + + malaya.summarize.lda(isu_string,important_words=10, vectorizer = 'skip-gram') + + + + +.. parsed-literal:: + + {'summary': 'Namun, ada satu persamaan yang mengeratkan hubungan mereka kerana sama-sama mencintai bidang muzik sejak dulu. "Kami memang meminati bidang muzik dan saling memahami antara satu sama lain. DUA legenda hebat dan \'The living legend\' ini sudah memartabatkan bidang muzik sejak lebih tiga dekad lalu.', + 'top-words': ['artis', + 'sheila', + 'konsert', + 'muzik', + 'festival', + 'sembah', + 'nyanyi', + 'kl', + 'kl jamm', + 'jamm'], + 'cluster-top-words': ['kl jamm', + 'sheila', + 'nyanyi', + 'sembah', + 'muzik', + 'artis', + 'festival', + 'konsert']} + + + +Load doc2vec summarization +-------------------------- + +We need to load word vector provided by Malaya. ``doc2vec`` does not +return ``top-words``, so parameter ``important_words`` cannot be use. + +Important parameters, 1. ``aggregation``, aggregation function to +accumulate word vectors. Default is ``mean``. + +:: + + * ``'mean'`` - mean. + * ``'min'`` - min. + * ``'max'`` - max. + * ``'sum'`` - sum. + * ``'sqrt'`` - square root. + +Using word2vec +^^^^^^^^^^^^^^ + +I will use ``load_news``, word2vec from wikipedia took a very long time. + +.. code:: python + + embedded_news = malaya.word2vec.load_news(64) + w2v_wiki = malaya.word2vec.word2vec(embedded_news['nce_weights'], + embedded_news['dictionary']) .. code:: python - malaya.summarize.lda(isu_kerajaan,important_words=10,return_cluster=False) + malaya.summarize.doc2vec(w2v_wiki, isu_kerajaan, soft = False, top_k = 5) .. parsed-literal:: - {'summary': 'Menurut beliau, persefahaman dan keupayaan meraikan kepelbagaian itu menjadi kelebihan dan kekuatan UMNO dan BN selama ini. Kata beliau, komitmen UMNO dan BN berhubung perkara itu dapat dilihat dengan jelas dalam bentuk sokongan infrastruktur, pengiktirafan dan pemberian peruntukan yang diperlukan. 
"Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN," katanya.', - 'top-words': ['wakil pandang umno', - 'mohamed', - 'paham sekolah vernakular', - 'paham sekolah', - 'paham', - 'negara', - 'nazri nyata', - 'mohamed nazri', - 'mohamad', - 'pandang peribadi']} + 'Timbalan Presiden UMNO, Datuk Seri Mohamad Hasan berkata, kenyataan tersebut tidak mewakili pendirian serta pandangan UMNO kerana parti itu menghormati serta memahami keperluan sekolah vernakular dalam negara. Mohamad yang menjalankan tugas-tugas Presiden UMNO berkata, UMNO konsisten dengan pendirian itu dalam mengiktiraf kepelbagaian bangsa dan etnik termasuk hak untuk beragama serta mendapat pendidikan. Kata Nazri dalam kenyataannya itu, beliau menekankan bahawa semua pihak perlu menghormati hak orang Melayu dan bumiputera. Mohamed Nazri semalam menjelaskan, kenyataannya mengenai sekolah jenis kebangsaan Cina dan Tamil baru-baru ini disalah petik pihak media. "Kedua UMNO sebagai sebuah parti sangat menghormati dan memahami keperluan sekolah vernakular di Malaysia.' + + + +Using fast-text +^^^^^^^^^^^^^^^ + +.. code:: python + + wiki, ngrams = malaya.fast_text.load_wiki() + fast_text_embed = malaya.fast_text.fast_text(wiki['embed_weights'],wiki['dictionary'],ngrams) + +.. code:: python + + malaya.summarize.doc2vec(fast_text_embed, isu_kerajaan, soft = False, top_k = 5) + + + + +.. parsed-literal:: + + 'Timbalan Presiden UMNO, Datuk Seri Mohamad Hasan berkata, kenyataan tersebut tidak mewakili pendirian serta pandangan UMNO kerana parti itu menghormati serta memahami keperluan sekolah vernakular dalam negara. Mohamad yang menjalankan tugas-tugas Presiden UMNO berkata, UMNO konsisten dengan pendirian itu dalam mengiktiraf kepelbagaian bangsa dan etnik termasuk hak untuk beragama serta mendapat pendidikan. Kata Nazri dalam kenyataannya itu, beliau menekankan bahawa semua pihak perlu menghormati hak orang Melayu dan bumiputera. "Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN," katanya. Kata beliau, komitmen UMNO dan BN berhubung perkara itu dapat dilihat dengan jelas dalam bentuk sokongan infrastruktur, pengiktirafan dan pemberian peruntukan yang diperlukan.' diff --git a/docs/models-accuracy.rst b/docs/models-accuracy.rst index 2ddbc210..0e0e3f74 100644 --- a/docs/models-accuracy.rst +++ b/docs/models-accuracy.rst @@ -6,10 +6,10 @@ Trained on 80% of dataset, tested on 20% of dataset. All training sessions stored in `session/entities `__ -.. code:: ipython3 +.. code:: python from IPython.core.display import Image, display - + display(Image('ner-accuracy.png', width=500)) @@ -132,7 +132,7 @@ Trained on 80% of dataset, tested on 20% of dataset. All training sessions stored in `session/language-detection `__ -.. code:: ipython3 +.. code:: python display(Image('language-detection-accuracy.png', width=500)) @@ -213,7 +213,7 @@ Trained on 80% of dataset, tested on 20% of dataset. All training sessions stored in `session/pos `__ -.. code:: ipython3 +.. code:: python display(Image('pos-accuracy.png', width=500)) @@ -356,7 +356,7 @@ Trained on 80% of dataset, tested on 20% of dataset. All training sessions stored in `session/sentiment `__ -.. code:: ipython3 +.. 
code:: python display(Image('sentiment-accuracy.png', width=500)) @@ -467,7 +467,7 @@ Labels are, {0: 'toxic', 1: 'severe_toxic', 2: 'obscene', 3: 'threat', 4: 'insult', 5: 'identity_hate'} -.. code:: ipython3 +.. code:: python display(Image('toxic-accuracy.png', width=500)) @@ -596,7 +596,7 @@ Trained on 80% of dataset, tested on 20% of dataset. All training sessions stored in `session/subjectivity `__ -.. code:: ipython3 +.. code:: python display(Image('subjectivity-accuracy.png', width=500)) @@ -702,7 +702,7 @@ Trained on 80% of dataset, tested on 20% of dataset. All training sessions stored in `session/emotion `__ -.. code:: ipython3 +.. code:: python display(Image('emotion-accuracy.png', width=500)) @@ -824,6 +824,71 @@ BERT avg / total 0.88 0.87 0.86 84104 +Similarity +---------- + +Trained on 80% of dataset, tested on 20% of dataset. All training +sessions stored in +`session/similarity `__ + +.. code:: python + + display(Image('similarity-accuracy.png', width=500)) + + + +.. image:: models-accuracy_files/models-accuracy_58_0.png + :width: 500px + + +bahdanau +^^^^^^^^ + +.. code:: text + + precision recall f1-score support + + not similar 0.83 0.83 0.83 31524 + similar 0.71 0.71 0.71 18476 + + avg / total 0.79 0.79 0.79 50000 + +self-attention +^^^^^^^^^^^^^^ + +.. code:: text + + precision recall f1-score support + + not similar 0.81 0.83 0.82 31524 + similar 0.70 0.67 0.68 18476 + + avg / total 0.77 0.77 0.77 50000 + +dilated-cnn +^^^^^^^^^^^ + +.. code:: text + + precision recall f1-score support + + not similar 0.82 0.82 0.82 31524 + similar 0.69 0.69 0.69 18476 + + avg / total 0.77 0.77 0.77 50000 + +bert +^^^^ + +.. code:: text + + precision recall f1-score support + + not similar 0.86 0.86 0.86 50757 + similar 0.77 0.76 0.76 30010 + + avg / total 0.83 0.83 0.83 80767 + Dependency parsing ------------------ @@ -831,13 +896,13 @@ Trained on 90% of dataset, tested on 10% of dataset. All training sessions stored in `session/dependency `__ -.. code:: ipython3 +.. code:: python display(Image('dependency-accuracy.png', width=500)) -.. image:: models-accuracy_files/models-accuracy_58_0.png +.. 
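The ``avg / total`` rows above are support-weighted means of the
per-class scores. A quick sanity check for the bahdanau similarity
table, with the numbers hard-coded from above:

.. code:: python

    # support-weighted average of the bahdanau f1-scores above
    f1 = {'not similar': 0.83, 'similar': 0.71}
    support = {'not similar': 31524, 'similar': 18476}
    total = sum(support.values())
    weighted_f1 = sum(f1[k] * support[k] for k in f1) / total
    print(round(weighted_f1, 2))  # ~0.79, matching the avg / total row

.. 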
image:: models-accuracy_files/models-accuracy_64_0.png
   :width: 500px

@@ -882,7 +947,7 @@ Bahdanau
     xcomp     0.8878    0.9039    0.8958      1217
 
 avg / total     0.9953    0.9953    0.9953    951993
-    
+
                  precision    recall  f1-score   support
 
               0     1.0000    1.0000    1.0000    843055
@@ -1062,7 +1127,7 @@ Luong
     xcomp     0.9225    0.8364    0.8774      1253
 
 avg / total     0.9950    0.9950    0.9950    951993
-    
+
                  precision    recall  f1-score   support
 
               0     1.0000    1.0000    1.0000    840905
@@ -1475,7 +1540,7 @@ Attention is all you need
     xcomp     0.8580    0.8593    0.8587      1301
 
 avg / total     0.9906    0.9906    0.9906    951993
-    
+
                  precision    recall  f1-score   support
 
               0     1.0000    1.0000    1.0000    841796
@@ -1689,7 +1754,7 @@ CRF
       aux     0.5000    0.2500    0.3333         4
 
 avg / total     0.8953    0.8961    0.8953    112332
-    
+
                  precision    recall  f1-score   support
 
               5     0.5452    0.5875    0.5656      5964
diff --git a/docs/models-accuracy_files/models-accuracy_58_0.png b/docs/models-accuracy_files/models-accuracy_58_0.png
index c79647b9..2ab5798d 100644
Binary files a/docs/models-accuracy_files/models-accuracy_58_0.png and b/docs/models-accuracy_files/models-accuracy_58_0.png differ
diff --git a/docs/models-accuracy_files/models-accuracy_64_0.png b/docs/models-accuracy_files/models-accuracy_64_0.png
new file mode 100644
index 00000000..c79647b9
Binary files /dev/null and b/docs/models-accuracy_files/models-accuracy_64_0.png differ
diff --git a/example/similarity/README.rst b/example/similarity/README.rst
index a99eafd9..c657701c 100644
--- a/example/similarity/README.rst
+++ b/example/similarity/README.rst
@@ -7,10 +7,296 @@
 .. parsed-literal::
 
-    CPU times: user 10.7 s, sys: 906 ms, total: 11.6 s
-    Wall time: 12 s
+    CPU times: user 12.5 s, sys: 1.77 s, total: 14.3 s
+    Wall time: 19.5 s
 
+Deep Siamese network
+--------------------
+
+The purpose of a deep Siamese network is to study semantic similarity
+between 2 strings; a score near 1.0 means more similar. Deep Siamese
+models leverage the power of word vectors, and we also implemented
+BERT, which leverages the power of attention, to study semantic
+similarity.
+
+List deep siamese models
+------------------------
+
+.. code:: ipython3
+
+    malaya.similarity.available_deep_siamese()
+
+
+
+
+.. parsed-literal::
+
+    ['self-attention', 'bahdanau', 'dilated-cnn']
+
+
+
+- ``'self-attention'`` - Fast-text architecture, embedded and logits
+  layers only with self attention.
+- ``'bahdanau'`` - LSTM with bahdanau attention architecture.
+- ``'dilated-cnn'`` - Pyramid Dilated CNN architecture.
+
+Load deep siamese models
+------------------------
+
+.. code:: ipython3
+
+    string1 = 'Pemuda mogok lapar desak kerajaan prihatin isu iklim'
+    string2 = 'Perbincangan isu pembalakan perlu babit kerajaan negeri'
+    string3 = 'kerajaan perlu kisah isu iklim, pemuda mogok lapar'
+    string4 = 'Kerajaan dicadang tubuh jawatankuasa khas tangani isu alam sekitar'
+
+Load bahdanau model
+^^^^^^^^^^^^^^^^^^^
+
+.. code:: ipython3
+
+    model = malaya.similarity.deep_siamese('bahdanau')
+
+Calculate similarity between 2 strings
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+``predict`` needs 2 strings, a left and a right string
+
+.. code:: ipython3
+
+    model.predict(string1, string2)
+
+
+
+
+.. parsed-literal::
+
+    0.4267301
+
+
+
+.. code:: ipython3
+
+    model.predict(string1, string3)
+
+
+
+
+.. parsed-literal::
+
+    0.28711933
+
+
+
+Calculate similarity for more than 2 strings
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+``predict_batch`` needs 2 lists of strings, left and right
+strings
+
+.. code:: ipython3
+
+    model.predict_batch([string1, string2], [string3, string4])
+
+
+
+
+.. 
parsed-literal::
+
+    array([0.39504164, 0.33375728], dtype=float32)
+
+
+
+Load self-attention model
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: ipython3
+
+    model = malaya.similarity.deep_siamese('self-attention')
+
+.. code:: ipython3
+
+    model.predict_batch([string1, string2], [string3, string4])
+
+
+
+
+.. parsed-literal::
+
+    array([0.08130383, 0.09907728], dtype=float32)
+
+
+
+Load dilated-cnn model
+^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: ipython3
+
+    model = malaya.similarity.deep_siamese('dilated-cnn')
+
+.. code:: ipython3
+
+    model.predict_batch([string1, string2], [string3, string4])
+
+
+
+
+.. parsed-literal::
+
+    array([0.1886251 , 0.00937402], dtype=float32)
+
+
+
+Calculate similarity using doc2vec
+----------------------------------
+
+We need to load a word vector provided by Malaya.
+
+Important parameters,
+
+1. ``aggregation``, aggregation function to accumulate word vectors.
+   Default is ``mean``.
+
+::
+
+    * ``'mean'`` - mean.
+    * ``'min'`` - min.
+    * ``'max'`` - max.
+    * ``'sum'`` - sum.
+    * ``'sqrt'`` - square root.
+
+2. ``similarity``, distance function to calculate similarity. Default is
+   ``cosine``.
+
+   - ``'cosine'`` - cosine similarity.
+   - ``'euclidean'`` - euclidean similarity.
+   - ``'manhattan'`` - manhattan similarity.
+
+Using word2vec
+^^^^^^^^^^^^^^
+
+I will use ``load_news``; word2vec from wikipedia takes a very long time
+to load, although it is much more accurate.
+
+.. code:: ipython3
+
+    embedded_news = malaya.word2vec.load_news(64)
+    w2v_wiki = malaya.word2vec.word2vec(embedded_news['nce_weights'],
+                                        embedded_news['dictionary'])
+
+.. code:: ipython3
+
+    malaya.similarity.doc2vec(w2v_wiki, string1, string2)
+
+
+
+
+.. parsed-literal::
+
+    0.9181415736675262
+
+
+
+.. code:: ipython3
+
+    malaya.similarity.doc2vec(w2v_wiki, string1, string4)
+
+
+
+
+.. parsed-literal::
+
+    0.9550771713256836
+
+
+
+.. code:: ipython3
+
+    malaya.similarity.doc2vec(w2v_wiki, string1, string4, similarity = 'euclidean')
+
+
+
+
+.. parsed-literal::
+
+    0.4642694249990522
+
+
+
+Different similarity functions give different percentages.
+
+**So you can also try fast-text and elmo for the similarity study.**
+
+Calculate similarity using summarizer
+-------------------------------------
+
+We can use the extractive summarization model
+``malaya.summarize.deep_extractive()`` to embed strings and
+calculate similarity between the vectors.
+
+.. code:: ipython3
+
+    deep_summary = malaya.summarize.deep_extractive(model = 'skip-thought')
+
+.. code:: ipython3
+
+    malaya.similarity.summarizer(deep_summary, string1, string3)
+
+
+
+
+.. parsed-literal::
+
+    0.8722701370716095
+
+
+
+BERT model
+----------
+
+BERT is the best similarity model in terms of accuracy; you can check
+similarity accuracy here,
+https://malaya.readthedocs.io/en/latest/Accuracy.html#similarity. But
+be warned, the model size is 700MB! Make sure you have enough resources
+to use BERT, and install bert-tensorflow first,
+
+.. code:: ipython3
+
+    model = malaya.similarity.bert()
+
+.. code:: ipython3
+
+    model.predict(string1, string3)
+
+
+
+
+.. parsed-literal::
+
+    0.97767043
+
+
+
+.. code:: ipython3
+
+    model.predict_batch([string1, string2], [string3, string4])
+
+
+
+
+.. parsed-literal::
+
+    array([0.9253927, 0.0317315], dtype=float32)
+
+
+
+**BERT is the best!**
+
+Topics similarity
+-----------------
+
+If you are interested in searching for multiple topics inside a string,
+given a supervised set of topics, Malaya provides an interface and
+topic sets related to the political landscape in Malaysia
+
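Before calling the interface, it may help to see the idea in miniature.
The sketch below is a toy keyword-overlap matcher written for this page
only; it is not the Malaya API, and the ``topics`` dictionary is made up
for illustration.

.. code:: ipython3

    # toy illustration of topic searching: flag each topic whose
    # keywords overlap with the words of the input string
    topics = {
        'najib razak': ['najib', 'razak'],
        'mahathir': ['mahathir'],
        'air': ['air', 'bekalan air'],
    }

    def match_topics(string, topics):
        words = set(string.lower().split())
        return [topic for topic, keys in topics.items()
                if words & set(' '.join(keys).split())]

    match_topics('najib razak dan mahathir mengalami masalah air', topics)
    # ['najib razak', 'mahathir', 'air']

The Malaya interface below is more robust than exact word overlap; the
toy above only conveys the intuition.

.. 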
code:: ipython3 news = 'najib razak dan mahathir mengalami masalah air di kemamam terengganu' diff --git a/example/similarity/load-similarity.ipynb b/example/similarity/load-similarity.ipynb index 14bec755..d9b6a5fb 100644 --- a/example/similarity/load-similarity.ipynb +++ b/example/similarity/load-similarity.ipynb @@ -9,8 +9,8 @@ "name": "stdout", "output_type": "stream", "text": [ - "CPU times: user 10.7 s, sys: 906 ms, total: 11.6 s\n", - "Wall time: 12 s\n" + "CPU times: user 12.5 s, sys: 1.77 s, total: 14.3 s\n", + "Wall time: 19.5 s\n" ] } ], @@ -19,6 +19,466 @@ "import malaya" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Deep Siamese network\n", + "\n", + "Purpose of deep siamese network to study semantic similarity between 2 strings, near to 1.0 means more similar. Deep Siamese leverage the power of word-vector, and we also implemented BERT to study semantic similarity and BERT leverage the power of attention!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## List deep siamese models" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['self-attention', 'bahdanau', 'dilated-cnn']" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "malaya.similarity.available_deep_siamese()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "* ``'self-attention'`` - Fast-text architecture, embedded and logits layers only with self attention.\n", + "* ``'bahdanau'`` - LSTM with bahdanau attention architecture.\n", + "* ``'dilated-cnn'`` - Pyramid Dilated CNN architecture." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load deep siamese models" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "string1 = 'Pemuda mogok lapar desak kerajaan prihatin isu iklim'\n", + "string2 = 'Perbincangan isu pembalakan perlu babit kerajaan negeri'\n", + "string3 = 'kerajaan perlu kisah isu iklim, pemuda mogok lapar'\n", + "string4 = 'Kerajaan dicadang tubuh jawatankuasa khas tangani isu alam sekitar'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Load bahdanau model" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "model = malaya.similarity.deep_siamese('bahdanau')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Calculate similarity between 2 strings\n", + "\n", + "`predict` need to give 2 strings, left and right string" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.4267301" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.predict(string1, string2)" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.28711933" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.predict(string1, string3)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Calculate similarity more than 2 strings\n", + "\n", + "`predict_batch` need to give 2 lists of strings, left and right strings" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ 
+ { + "data": { + "text/plain": [ + "array([0.39504164, 0.33375728], dtype=float32)" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.predict_batch([string1, string2], [string3, string4])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Load self-attention model" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "model = malaya.similarity.deep_siamese('self-attention')" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([0.08130383, 0.09907728], dtype=float32)" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.predict_batch([string1, string2], [string3, string4])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Load dilated-cnn model" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [], + "source": [ + "model = malaya.similarity.deep_siamese('dilated-cnn')" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([0.1886251 , 0.00937402], dtype=float32)" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.predict_batch([string1, string2], [string3, string4])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Calculate similarity using doc2vec\n", + "\n", + "We need to load word vector provided by Malaya.\n", + "\n", + "Important parameters,\n", + "1. `aggregation`, aggregation function to accumulate word vectors. Default is `mean`.\n", + "\n", + " * ``'mean'`` - mean.\n", + " * ``'min'`` - min.\n", + " * ``'max'`` - max.\n", + " * ``'sum'`` - sum.\n", + " * ``'sqrt'`` - square root.\n", + " \n", + "2. `similarity` distance function to calculate similarity. Default is `cosine`.\n", + "\n", + " * ``'cosine'`` - cosine similarity.\n", + " * ``'euclidean'`` - euclidean similarity.\n", + " * ``'manhattan'`` - manhattan similarity." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Using word2vec\n", + "\n", + "I will use `load_news`, word2vec from wikipedia took a very long time. wikipedia much more accurate." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "embedded_news = malaya.word2vec.load_news(64)\n", + "w2v_wiki = malaya.word2vec.word2vec(embedded_news['nce_weights'],\n", + " embedded_news['dictionary'])" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.9181415736675262" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "malaya.similarity.doc2vec(w2v_wiki, string1, string2)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.9550771713256836" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "malaya.similarity.doc2vec(w2v_wiki, string1, string4)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.4642694249990522" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "malaya.similarity.doc2vec(w2v_wiki, string1, string4, similarity = 'euclidean')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Different similarity function different percentage." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**So you can try use fast-text and elmo to do the similarity study.**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Calculate similarity using summarizer\n", + "\n", + "We can use extractive summarization model `malaya.summarize.deep_extractive()` to get strings embedded and calculate similarity between the vectors." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "deep_summary = malaya.summarize.deep_extractive(model = 'skip-thought')" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.8722701370716095" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "malaya.similarity.summarizer(deep_summary, string1, string3)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## BERT model\n", + "\n", + "BERT is the best similarity model in term of accuracy, you can check similarity accuracy here, https://malaya.readthedocs.io/en/latest/Accuracy.html#similarity. But warning, the model size is 700MB! 
Make sure you have enough resources to use BERT, and installed bert-tensorflow first," + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "model = malaya.similarity.bert()" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.97767043" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.predict(string1, string3)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([0.9253927, 0.0317315], dtype=float32)" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.predict_batch([string1, string2], [string3, string4])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**BERT is the best!**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Topics similarity\n", + "\n", + "If you are interested in multiple topics searching inside a string when giving set of topics to supervised, Malaya provided some interface and topics related to political landscape in Malaysia" + ] + }, { "cell_type": "code", "execution_count": 2, diff --git a/example/summarization/README.rst b/example/summarization/README.rst index 6a0a7c03..f7c085ce 100644 --- a/example/summarization/README.rst +++ b/example/summarization/README.rst @@ -7,8 +7,8 @@ .. parsed-literal:: - CPU times: user 11.9 s, sys: 1.46 s, total: 13.4 s - Wall time: 17 s + CPU times: user 12.3 s, sys: 1.53 s, total: 13.8 s + Wall time: 17.8 s .. code:: ipython3 @@ -36,12 +36,40 @@ We also can give a string, Malaya will always split a string by into multiple sentences. -Load Pretrained News summarization deep learning ------------------------------------------------- +Important parameters, + +1. ``top_k``, number of summarized strings. +2. ``important_words``, number of important words. + +List available deep extractive models +------------------------------------- + +.. code:: ipython3 + + malaya.summarize.available_deep_extractive() + + + + +.. parsed-literal:: + + ['skip-thought', 'residual-network'] + + + +- ``'skip-thought'`` - skip-thought summarization deep learning model + trained on news dataset. Hopefully we can train on wikipedia dataset. +- ``'residual-network'`` - residual network with Bahdanau Attention + summarization deep learning model trained on wikipedia dataset. + +We use TextRank for scoring algorithm. + +Load Pretrained extractive skip-thought summarization +----------------------------------------------------- .. code:: ipython3 - deep_summary = malaya.summarize.deep_model_news() + deep_summary = malaya.summarize.deep_extractive(model = 'skip-thought') .. code:: ipython3 @@ -52,7 +80,7 @@ Load Pretrained News summarization deep learning .. parsed-literal:: - {'summary': 'Namun, ada satu persamaan yang mengeratkan hubungan mereka kerana sama-sama mencintai bidang muzik sejak dulu. "Kami pernah terbabit dengan showcase dan majlis korporat sebelum ini. "Sedangkan artis juga menyanyi untuk kerjaya dan ia juga punca pendapatan bagi menyara hidup," katanya.', + {'summary': 'Bagi Sheila pula, dia memang ada terbabit dengan beberapa persembahan bersama Zainal cuma tiada publisiti ketika itu. 
"Sebab itu, saya sukar menolak untuk bekerjasama dengannya dalam Festival KL Jamm yang dianjurkan buat julung kali dan berkongsi pentas dalam satu konsert bertaraf antarabangsa," katanya. "Saya bersama Sheila serta Datuk Afdlin Shauki akan terbabit dalam satu segmen yang ditetapkan.', 'top-words': ['dumex', 'unchallenged', 'yussoffkaunsel', @@ -63,16 +91,16 @@ Load Pretrained News summarization deep learning 'kepulangan', 'mandat', 'kelembaban'], - 'cluster-top-words': ['kelembaban', - 'merotan', - 'pancaroba', + 'cluster-top-words': ['unchallenged', + 'kelembaban', 'yussoffkaunsel', 'dumex', - 'unchallenged', - 'vienna', - 'mandat', 'sharmini', - 'kepulangan']} + 'merotan', + 'pancaroba', + 'kepulangan', + 'mandat', + 'vienna']} @@ -85,7 +113,7 @@ Load Pretrained News summarization deep learning .. parsed-literal:: - {'summary': '"Kedua UMNO sebagai sebuah parti sangat menghormati dan memahami keperluan sekolah vernakular di Malaysia. Kenyataan kontroversi Setiausaha Agung Barisan Nasional (BN), Datuk Seri Mohamed Nazri Aziz berhubung sekolah vernakular merupakan pandangan peribadi beliau. Pertama pendirian beliau tersebut adalah pandangan peribadi yang tidak mewakili pendirian dan pandangan UMNO.', + {'summary': 'Pertama pendirian beliau tersebut adalah pandangan peribadi yang tidak mewakili pendirian dan pandangan UMNO. UMNO berpendirian sekolah jenis ini perlu terus wujud di negara kita," katanya dalam satu kenyataan akhbar malam ini. "Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN," katanya.', 'top-words': ['bersabdabarangsiapa', 'kepulangan', 'seliakekurangan', @@ -96,16 +124,16 @@ Load Pretrained News summarization deep learning 'chusus', 'mempunya', 'diharap'], - 'cluster-top-words': ['seliakekurangan', - 'bersabdabarangsiapa', + 'cluster-top-words': ['bersabdabarangsiapa', + 'sharmini', 'poupart', - 'chusus', - 'sakailah', + 'diharap', + 'kepulangan', 'pembikin', - 'sharmini', + 'seliakekurangan', + 'sakailah', 'mempunya', - 'kepulangan', - 'diharap']} + 'chusus']} @@ -127,28 +155,55 @@ You also can change sentences to vector representation using .. code:: ipython3 - deep_summary.vectorize(isu_kerajaan).shape + deep_summary.vectorize(isu_string).shape .. parsed-literal:: - (12, 128) + (34, 128) -Load Pretrained Wikipedia summarization deep learning ------------------------------------------------------ +Load Pretrained extractive residual-network summarization +--------------------------------------------------------- + +.. code:: ipython3 + + deep_summary = malaya.summarize.deep_extractive(model = 'residual-network') .. code:: ipython3 - deep_summary = malaya.summarize.deep_model_wiki() + deep_summary.summarize(isu_string,important_words=10) + + .. parsed-literal:: - WARNING: this model is using convolutional based, Tensorflow-GPU above 1.10 may got a problem. Please downgrade to Tensorflow-GPU v1.8 if got any cuDNN error. + {'summary': "Manakala, artis antarabangsa pula membabitkan J Arie (Hong Kong), NCT Dream (Korea Selatan) dan DJ Sura (Korea Selatan). DUA legenda hebat dan 'The living legend' ini sudah memartabatkan bidang muzik sejak lebih tiga dekad lalu. 
Bagi Sheila pula, dia memang ada terbabit dengan beberapa persembahan bersama Zainal cuma tiada publisiti ketika itu.", + 'top-words': ['jagaannya', + 'ferdy', + 'hoe', + 'laksmi', + 'zulkifli', + 'televisyen', + 'lanun', + 'ongr', + 'sharidake', + 'kawan'], + 'cluster-top-words': ['sharidake', + 'hoe', + 'ferdy', + 'lanun', + 'zulkifli', + 'laksmi', + 'televisyen', + 'ongr', + 'jagaannya', + 'kawan']} + .. code:: ipython3 @@ -160,27 +215,27 @@ Load Pretrained Wikipedia summarization deep learning .. parsed-literal:: - {'summary': 'Mohamed Nazri semalam menjelaskan, kenyataannya mengenai sekolah jenis kebangsaan Cina dan Tamil baru-baru ini disalah petik pihak media. "Kedua UMNO sebagai sebuah parti sangat menghormati dan memahami keperluan sekolah vernakular di Malaysia. "Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN," katanya.', + {'summary': 'Timbalan Presiden UMNO, Datuk Seri Mohamad Hasan berkata, kenyataan tersebut tidak mewakili pendirian serta pandangan UMNO kerana parti itu menghormati serta memahami keperluan sekolah vernakular dalam negara. "Saya ingin menegaskan dua perkara penting. "Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN," katanya.', 'top-words': ['jagaannya', 'ferdy', 'hoe', 'zulkifli', - 'televisyen', 'lanun', 'laksmi', 'ongr', + 'televisyen', 'kawan', - 'diimbau'], - 'cluster-top-words': ['televisyen', - 'jagaannya', - 'diimbau', - 'zulkifli', + 'sharidake'], + 'cluster-top-words': ['sharidake', + 'hoe', + 'ferdy', 'lanun', + 'zulkifli', 'laksmi', - 'kawan', + 'televisyen', 'ongr', - 'hoe', - 'ferdy']} + 'jagaannya', + 'kawan']} @@ -202,80 +257,74 @@ You also can change sentences to vector representation using .. code:: ipython3 - deep_summary.vectorize(isu_kerajaan).shape + deep_summary.vectorize(isu_string).shape .. parsed-literal:: - (12, 64) + (34, 64) -Train skip-thought summarization deep learning model ----------------------------------------------------- - -.. code:: ipython3 +Train LSA model +--------------- - deep_summary = malaya.summarize.train_skip_thought(isu_kerajaan, batch_size = 2) +Important parameters, +1. ``vectorizer``, vectorizer technique. Allowed values: -.. parsed-literal:: - - minibatch loop: 100%|██████████| 5/5 [00:01<00:00, 2.94it/s, cost=9.45] - minibatch loop: 100%|██████████| 5/5 [00:01<00:00, 4.56it/s, cost=7.99] - minibatch loop: 100%|██████████| 5/5 [00:01<00:00, 4.67it/s, cost=6.61] - minibatch loop: 100%|██████████| 5/5 [00:01<00:00, 4.62it/s, cost=5.34] - minibatch loop: 100%|██████████| 5/5 [00:01<00:00, 4.55it/s, cost=4.17] + - ``'bow'`` - Bag of Word. + - ``'tfidf'`` - Term frequency inverse Document Frequency. + - ``'skip-gram'`` - Bag of Word with skipping certain n-grams. +2. ``ngram``, n-grams size to train a corpus. +3. ``important_words``, number of important words. +4. ``top_k``, number of summarized strings. .. code:: ipython3 - deep_summary.summarize(isu_kerajaan,important_words=10) + malaya.summarize.lsa(isu_kerajaan,important_words=10) .. parsed-literal:: - {'summary': 'Pertama pendirian beliau tersebut adalah pandangan peribadi yang tidak mewakili pendirian dan pandangan UMNO. Kenyataan kontroversi Setiausaha Agung Barisan Nasional (BN), Datuk Seri Mohamed Nazri Aziz berhubung sekolah vernakular merupakan pandangan peribadi beliau. 
Kata beliau, komitmen UMNO dan BN berhubung perkara itu dapat dilihat dengan jelas dalam bentuk sokongan infrastruktur, pengiktirafan dan pemberian peruntukan yang diperlukan.', - 'top-words': ['vernakular', - 'bentuk', - 'parti', - 'jelas', - 'pertama', - 'disalah', - 'adalah', - 'kekuatan', - 'bahawa', - 'penting'], - 'cluster-top-words': ['adalah', - 'penting', - 'bentuk', - 'pertama', - 'bahawa', - 'parti', - 'disalah', - 'kekuatan', - 'jelas', - 'vernakular']} - - - -Train LSA model ---------------- + {'summary': 'Timbalan Presiden UMNO, Datuk Seri Mohamad Hasan berkata, kenyataan tersebut tidak mewakili pendirian serta pandangan UMNO kerana parti itu menghormati serta memahami keperluan sekolah vernakular dalam negara. "Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN," katanya. UMNO berpendirian sekolah jenis ini perlu terus wujud di negara kita," katanya dalam satu kenyataan akhbar malam ini.', + 'top-words': ['umno', + 'nyata', + 'sekolah', + 'pandang', + 'vernakular', + 'hormat', + 'sekolah vernakular', + 'nazri', + 'hormat paham', + 'hak'], + 'cluster-top-words': ['hak', + 'pandang', + 'sekolah vernakular', + 'hormat paham', + 'umno', + 'nazri', + 'nyata']} + + + +We can use ``tfidf`` as vectorizer. .. code:: ipython3 - malaya.summarize.lsa(isu_kerajaan,important_words=10) + malaya.summarize.lsa(isu_kerajaan,important_words=10, ngram = (1,3), vectorizer = 'tfidf') .. parsed-literal:: - {'summary': 'Menurut beliau, persefahaman dan keupayaan meraikan kepelbagaian itu menjadi kelebihan dan kekuatan UMNO dan BN selama ini. Kata beliau, komitmen UMNO dan BN berhubung perkara itu dapat dilihat dengan jelas dalam bentuk sokongan infrastruktur, pengiktirafan dan pemberian peruntukan yang diperlukan. "Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN," katanya.', + {'summary': 'Timbalan Presiden UMNO, Datuk Seri Mohamad Hasan berkata, kenyataan tersebut tidak mewakili pendirian serta pandangan UMNO kerana parti itu menghormati serta memahami keperluan sekolah vernakular dalam negara. Mohamad yang menjalankan tugas-tugas Presiden UMNO berkata, UMNO konsisten dengan pendirian itu dalam mengiktiraf kepelbagaian bangsa dan etnik termasuk hak untuk beragama serta mendapat pendidikan. UMNO berpendirian sekolah jenis ini perlu terus wujud di negara kita," katanya dalam satu kenyataan akhbar malam ini.', 'top-words': ['wakil pandang umno', 'mohamed', 'paham sekolah vernakular', @@ -286,74 +335,77 @@ Train LSA model 'mohamed nazri', 'mohamad', 'pandang peribadi'], - 'cluster-top-words': ['negara', - 'mohamad', - 'pandang peribadi', + 'cluster-top-words': ['pandang peribadi', 'wakil pandang umno', - 'mohamed nazri', 'nazri nyata', - 'paham sekolah vernakular']} + 'negara', + 'paham sekolah vernakular', + 'mohamad', + 'mohamed nazri']} + +We can use ``skip-gram`` as vectorizer, and can override ``skip`` value. .. code:: ipython3 - malaya.summarize.lsa(isu_string,important_words=10) + malaya.summarize.lsa(isu_kerajaan,important_words=10, ngram = (1,3), vectorizer = 'skip-gram', skip = 3) .. parsed-literal:: - {'summary': "KL Jamm dianjurkan Music Unlimited International Sdn Bhd dan bakal menggabungkan pelbagai genre muzik seperti rock, hip hop, jazz dan pop dengan lebih 100 persembahan, 20 'showcase', pameran dan perdagangan berkaitan. 
Festival tiga hari itu bakal berlangsung di Pusat Pameran dan Perdagangan Antarabangsa Malaysia (MITEC), Kuala Lumpur pada 26 hingga 28 April ini. Maklumat mengenai pembelian tiket dan keterangan lanjut boleh melayari www.kljamm.com.", - 'top-words': ['zaman', - 'jamm anjur', - 'genre muzik rock', - 'hip', - 'hip hop', - 'hip hop jazz', - 'hop', - 'hop jazz', - 'hop jazz pop', - 'jazz pop'], - 'cluster-top-words': ['hip hop jazz', - 'genre muzik rock', - 'hop jazz pop', - 'jamm anjur', - 'zaman']} - - - -Train NMF model ---------------- + {'summary': 'Mohamed Nazri semalam menjelaskan, kenyataannya mengenai sekolah jenis kebangsaan Cina dan Tamil baru-baru ini disalah petik pihak media. UMNO berpendirian sekolah jenis ini perlu terus wujud di negara kita," katanya dalam satu kenyataan akhbar malam ini. Kata Nazri dalam kenyataannya itu, beliau menekankan bahawa semua pihak perlu menghormati hak orang Melayu dan bumiputera.', + 'top-words': ['umno', + 'sekolah', + 'nyata', + 'pandang', + 'nazri', + 'hormat', + 'vernakular', + 'pandang umno', + 'sekolah vernakular', + 'presiden umno'], + 'cluster-top-words': ['pandang umno', + 'sekolah vernakular', + 'nazri', + 'nyata', + 'presiden umno', + 'hormat']} + + .. code:: ipython3 - malaya.summarize.nmf(isu_kerajaan,important_words=10) + malaya.summarize.lsa(isu_string,important_words=10) .. parsed-literal:: - {'summary': 'Menurut beliau, persefahaman dan keupayaan meraikan kepelbagaian itu menjadi kelebihan dan kekuatan UMNO dan BN selama ini. Kata beliau, komitmen UMNO dan BN berhubung perkara itu dapat dilihat dengan jelas dalam bentuk sokongan infrastruktur, pengiktirafan dan pemberian peruntukan yang diperlukan. "Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN," katanya.', - 'top-words': ['wakil pandang umno', - 'mohamed', - 'paham sekolah vernakular', - 'paham sekolah', - 'paham', - 'negara', - 'nazri nyata', - 'mohamed nazri', - 'mohamad', - 'pandang peribadi'], - 'cluster-top-words': ['negara', - 'mohamad', - 'pandang peribadi', - 'wakil pandang umno', - 'mohamed nazri', - 'nazri nyata', - 'paham sekolah vernakular']} + {'summary': 'Konsert berbayar Mewakili golongan anak seni, Sheila menaruh harapan semoga Festival KL Jamm akan menjadi platform buat artis yang sudah ada nama dan artis muda untuk membuat persembahan, sekali gus sama-sama memartabatkan industri muzik tempatan. Festival KL Jamm bakal menghimpunkan barisan artis tempatan baru dan nama besar dalam konsert iaitu Datuk Ramli Sarip, Datuk Afdlin Shauki, Zamani, Amelina, Radhi OAG, Dr Burn, Santesh, Rabbit Mac, Sheezy, kumpulan Bunkface, Ruffedge, Pot Innuendo, artis dari Kartel (Joe Flizzow, Sona One, Ila Damia, Yung Raja, Faris Jabba dan Abu Bakarxli) dan Malaysia Pasangge (artis India tempatan). "Sedangkan artis juga menyanyi untuk kerjaya dan ia juga punca pendapatan bagi menyara hidup," katanya.', + 'top-words': ['artis', + 'sheila', + 'konsert', + 'muzik', + 'nyanyi', + 'sembah', + 'festival', + 'jamm', + 'kl', + 'babit'], + 'cluster-top-words': ['muzik', + 'babit', + 'sheila', + 'konsert', + 'jamm', + 'nyanyi', + 'artis', + 'festival', + 'kl', + 'sembah']} @@ -369,49 +421,116 @@ Train LDA model .. parsed-literal:: - {'summary': 'Menurut beliau, persefahaman dan keupayaan meraikan kepelbagaian itu menjadi kelebihan dan kekuatan UMNO dan BN selama ini. 
Kata beliau, komitmen UMNO dan BN berhubung perkara itu dapat dilihat dengan jelas dalam bentuk sokongan infrastruktur, pengiktirafan dan pemberian peruntukan yang diperlukan. "Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN," katanya.', - 'top-words': ['wakil pandang umno', - 'mohamed', - 'paham sekolah vernakular', - 'paham sekolah', - 'paham', - 'negara', - 'nazri nyata', - 'mohamed nazri', - 'mohamad', - 'pandang peribadi'], - 'cluster-top-words': ['negara', - 'mohamad', - 'pandang peribadi', - 'wakil pandang umno', - 'mohamed nazri', - 'nazri nyata', - 'paham sekolah vernakular']} + {'summary': '"Saya ingin menegaskan dua perkara penting. "Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN," katanya. Mohamad yang menjalankan tugas-tugas Presiden UMNO berkata, UMNO konsisten dengan pendirian itu dalam mengiktiraf kepelbagaian bangsa dan etnik termasuk hak untuk beragama serta mendapat pendidikan.', + 'top-words': ['umno', + 'nyata', + 'sekolah', + 'pandang', + 'vernakular', + 'hormat', + 'sekolah vernakular', + 'nazri', + 'hormat paham', + 'hak'], + 'cluster-top-words': ['nazri', + 'umno', + 'pandang', + 'hak', + 'nyata', + 'hormat paham', + 'sekolah vernakular']} -Not clustering important words -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +.. code:: ipython3 + + malaya.summarize.lda(isu_string,important_words=10, vectorizer = 'skip-gram') + + + + +.. parsed-literal:: + + {'summary': 'Namun, ada satu persamaan yang mengeratkan hubungan mereka kerana sama-sama mencintai bidang muzik sejak dulu. "Kami memang meminati bidang muzik dan saling memahami antara satu sama lain. DUA legenda hebat dan \'The living legend\' ini sudah memartabatkan bidang muzik sejak lebih tiga dekad lalu.', + 'top-words': ['artis', + 'sheila', + 'konsert', + 'muzik', + 'festival', + 'sembah', + 'nyanyi', + 'kl', + 'kl jamm', + 'jamm'], + 'cluster-top-words': ['kl jamm', + 'sheila', + 'nyanyi', + 'sembah', + 'muzik', + 'artis', + 'festival', + 'konsert']} + + + +Load doc2vec summarization +-------------------------- + +We need to load word vector provided by Malaya. ``doc2vec`` does not +return ``top-words``, so parameter ``important_words`` cannot be use. + +Important parameters, 1. ``aggregation``, aggregation function to +accumulate word vectors. Default is ``mean``. + +:: + + * ``'mean'`` - mean. + * ``'min'`` - min. + * ``'max'`` - max. + * ``'sum'`` - sum. + * ``'sqrt'`` - square root. + +Using word2vec +^^^^^^^^^^^^^^ + +I will use ``load_news``, word2vec from wikipedia took a very long time. + +.. code:: ipython3 + + embedded_news = malaya.word2vec.load_news(64) + w2v_wiki = malaya.word2vec.word2vec(embedded_news['nce_weights'], + embedded_news['dictionary']) .. code:: ipython3 - malaya.summarize.lda(isu_kerajaan,important_words=10,return_cluster=False) + malaya.summarize.doc2vec(w2v_wiki, isu_kerajaan, soft = False, top_k = 5) .. parsed-literal:: - {'summary': 'Menurut beliau, persefahaman dan keupayaan meraikan kepelbagaian itu menjadi kelebihan dan kekuatan UMNO dan BN selama ini. Kata beliau, komitmen UMNO dan BN berhubung perkara itu dapat dilihat dengan jelas dalam bentuk sokongan infrastruktur, pengiktirafan dan pemberian peruntukan yang diperlukan. 
"Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN," katanya.', - 'top-words': ['wakil pandang umno', - 'mohamed', - 'paham sekolah vernakular', - 'paham sekolah', - 'paham', - 'negara', - 'nazri nyata', - 'mohamed nazri', - 'mohamad', - 'pandang peribadi']} + 'Timbalan Presiden UMNO, Datuk Seri Mohamad Hasan berkata, kenyataan tersebut tidak mewakili pendirian serta pandangan UMNO kerana parti itu menghormati serta memahami keperluan sekolah vernakular dalam negara. Mohamad yang menjalankan tugas-tugas Presiden UMNO berkata, UMNO konsisten dengan pendirian itu dalam mengiktiraf kepelbagaian bangsa dan etnik termasuk hak untuk beragama serta mendapat pendidikan. Kata Nazri dalam kenyataannya itu, beliau menekankan bahawa semua pihak perlu menghormati hak orang Melayu dan bumiputera. Mohamed Nazri semalam menjelaskan, kenyataannya mengenai sekolah jenis kebangsaan Cina dan Tamil baru-baru ini disalah petik pihak media. "Kedua UMNO sebagai sebuah parti sangat menghormati dan memahami keperluan sekolah vernakular di Malaysia.' + + + +Using fast-text +^^^^^^^^^^^^^^^ + +.. code:: ipython3 + + wiki, ngrams = malaya.fast_text.load_wiki() + fast_text_embed = malaya.fast_text.fast_text(wiki['embed_weights'],wiki['dictionary'],ngrams) + +.. code:: ipython3 + + malaya.summarize.doc2vec(fast_text_embed, isu_kerajaan, soft = False, top_k = 5) + + + + +.. parsed-literal:: + + 'Timbalan Presiden UMNO, Datuk Seri Mohamad Hasan berkata, kenyataan tersebut tidak mewakili pendirian serta pandangan UMNO kerana parti itu menghormati serta memahami keperluan sekolah vernakular dalam negara. Mohamad yang menjalankan tugas-tugas Presiden UMNO berkata, UMNO konsisten dengan pendirian itu dalam mengiktiraf kepelbagaian bangsa dan etnik termasuk hak untuk beragama serta mendapat pendidikan. Kata Nazri dalam kenyataannya itu, beliau menekankan bahawa semua pihak perlu menghormati hak orang Melayu dan bumiputera. "Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN," katanya. Kata beliau, komitmen UMNO dan BN berhubung perkara itu dapat dilihat dengan jelas dalam bentuk sokongan infrastruktur, pengiktirafan dan pemberian peruntukan yang diperlukan.' diff --git a/example/summarization/load-summarization.ipynb b/example/summarization/load-summarization.ipynb index 1db0bcfb..ae89d3fe 100644 --- a/example/summarization/load-summarization.ipynb +++ b/example/summarization/load-summarization.ipynb @@ -9,8 +9,8 @@ "name": "stdout", "output_type": "stream", "text": [ - "CPU times: user 11.9 s, sys: 1.46 s, total: 13.4 s\n", - "Wall time: 17 s\n" + "CPU times: user 12.3 s, sys: 1.53 s, total: 13.8 s\n", + "Wall time: 17.8 s\n" ] } ], @@ -55,34 +55,76 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We also can give a string, Malaya will always split a string by into multiple sentences." + "We also can give a string, Malaya will always split a string by into multiple sentences.\n", + "\n", + "Important parameters,\n", + "\n", + "1. `top_k`, number of summarized strings.\n", + "2. `important_words`, number of important words." 
{ "cell_type": "markdown", "metadata": {}, "source": [ - "## Load Pretrained News summarization deep learning" + "## List available deep extractive models" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "['skip-thought', 'residual-network']" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "malaya.summarize.available_deep_extractive()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, "source": [ - "deep_summary = malaya.summarize.deep_model_news()" + "* ``'skip-thought'`` - skip-thought summarization deep learning model trained on news dataset. We hope to train it on the wikipedia dataset as well.\n", + "* ``'residual-network'`` - residual network with Bahdanau Attention summarization deep learning model trained on wikipedia dataset.\n", + "\n", + "We use TextRank as the scoring algorithm." ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load Pretrained extractive skip-thought summarization" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, + "outputs": [], + "source": [ + "deep_summary = malaya.summarize.deep_extractive(model = 'skip-thought')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "{'summary': 'Namun, ada satu persamaan yang mengeratkan hubungan mereka kerana sama-sama mencintai bidang muzik sejak dulu. \"Kami pernah terbabit dengan showcase dan majlis korporat sebelum ini. \"Sedangkan artis juga menyanyi untuk kerjaya dan ia juga punca pendapatan bagi menyara hidup,\" katanya.',\n", + "{'summary': 'Bagi Sheila pula, dia memang ada terbabit dengan beberapa persembahan bersama Zainal cuma tiada publisiti ketika itu. \"Sebab itu, saya sukar menolak untuk bekerjasama dengannya dalam Festival KL Jamm yang dianjurkan buat julung kali dan berkongsi pentas dalam satu konsert bertaraf antarabangsa,\" katanya. \"Saya bersama Sheila serta Datuk Afdlin Shauki akan terbabit dalam satu segmen yang ditetapkan.',\n", " 'top-words': ['dumex',\n", " 'unchallenged',\n", " 'yussoffkaunsel',\n", @@ -93,19 +135,19 @@ " 'kepulangan',\n", " 'mandat',\n", " 'kelembaban'],\n", - " 'cluster-top-words': ['kelembaban',\n", - " 'merotan',\n", - " 'pancaroba',\n", + " 'cluster-top-words': ['unchallenged',\n", + " 'kelembaban',\n", " 'yussoffkaunsel',\n", " 'dumex',\n", - " 'unchallenged',\n", - " 'vienna',\n", - " 'mandat',\n", " 'sharmini',\n", - " 'kepulangan']}" + " 'merotan',\n", + " 'pancaroba',\n", + " 'kepulangan',\n", + " 'mandat',\n", + " 'vienna']}" ] }, - "execution_count": 5, + "execution_count": 6, "metadata": {}, "output_type": "execute_result" } @@ -116,13 +158,13 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "{'summary': '\"Kedua UMNO sebagai sebuah parti sangat menghormati dan memahami keperluan sekolah vernakular di Malaysia. Kenyataan kontroversi Setiausaha Agung Barisan Nasional (BN), Datuk Seri Mohamed Nazri Aziz berhubung sekolah vernakular merupakan pandangan peribadi beliau. Pertama pendirian beliau tersebut adalah pandangan peribadi yang tidak mewakili pendirian dan pandangan UMNO.',\n", + "{'summary': 'Pertama pendirian beliau tersebut adalah pandangan peribadi yang tidak mewakili pendirian dan pandangan UMNO. UMNO berpendirian sekolah jenis ini perlu terus wujud di negara kita,\" katanya dalam satu kenyataan akhbar malam ini. 
\"Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN,\" katanya.',\n", " 'top-words': ['bersabdabarangsiapa',\n", " 'kepulangan',\n", " 'seliakekurangan',\n", @@ -133,19 +175,19 @@ " 'chusus',\n", " 'mempunya',\n", " 'diharap'],\n", - " 'cluster-top-words': ['seliakekurangan',\n", - " 'bersabdabarangsiapa',\n", + " 'cluster-top-words': ['bersabdabarangsiapa',\n", + " 'sharmini',\n", " 'poupart',\n", - " 'chusus',\n", - " 'sakailah',\n", + " 'diharap',\n", + " 'kepulangan',\n", " 'pembikin',\n", - " 'sharmini',\n", + " 'seliakekurangan',\n", + " 'sakailah',\n", " 'mempunya',\n", - " 'kepulangan',\n", - " 'diharap']}" + " 'chusus']}" ] }, - "execution_count": 6, + "execution_count": 7, "metadata": {}, "output_type": "execute_result" } @@ -163,7 +205,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 8, "metadata": {}, "outputs": [ { @@ -172,7 +214,7 @@ "(12, 128)" ] }, - "execution_count": 7, + "execution_count": 8, "metadata": {}, "output_type": "execute_result" } @@ -183,80 +225,112 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "(12, 128)" + "(34, 128)" ] }, - "execution_count": 8, + "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "deep_summary.vectorize(isu_kerajaan).shape" + "deep_summary.vectorize(isu_string).shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Load Pretrained Wikipedia summarization deep learning" + "## Load Pretrained extractive residual-network summarization" ] }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "deep_summary = malaya.summarize.deep_extractive(model = 'residual-network')" + ] + }, + { + "cell_type": "code", + "execution_count": 11, "metadata": {}, "outputs": [ { - "name": "stdout", - "output_type": "stream", - "text": [ - "WARNING: this model is using convolutional based, Tensorflow-GPU above 1.10 may got a problem. Please downgrade to Tensorflow-GPU v1.8 if got any cuDNN error.\n" - ] + "data": { + "text/plain": [ + "{'summary': \"Manakala, artis antarabangsa pula membabitkan J Arie (Hong Kong), NCT Dream (Korea Selatan) dan DJ Sura (Korea Selatan). DUA legenda hebat dan 'The living legend' ini sudah memartabatkan bidang muzik sejak lebih tiga dekad lalu. Bagi Sheila pula, dia memang ada terbabit dengan beberapa persembahan bersama Zainal cuma tiada publisiti ketika itu.\",\n", + " 'top-words': ['jagaannya',\n", + " 'ferdy',\n", + " 'hoe',\n", + " 'laksmi',\n", + " 'zulkifli',\n", + " 'televisyen',\n", + " 'lanun',\n", + " 'ongr',\n", + " 'sharidake',\n", + " 'kawan'],\n", + " 'cluster-top-words': ['sharidake',\n", + " 'hoe',\n", + " 'ferdy',\n", + " 'lanun',\n", + " 'zulkifli',\n", + " 'laksmi',\n", + " 'televisyen',\n", + " 'ongr',\n", + " 'jagaannya',\n", + " 'kawan']}" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" } ], "source": [ - "deep_summary = malaya.summarize.deep_model_wiki()" + "deep_summary.summarize(isu_string,important_words=10)" ] }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "{'summary': 'Mohamed Nazri semalam menjelaskan, kenyataannya mengenai sekolah jenis kebangsaan Cina dan Tamil baru-baru ini disalah petik pihak media. 
\"Kedua UMNO sebagai sebuah parti sangat menghormati dan memahami keperluan sekolah vernakular di Malaysia. \"Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN,\" katanya.',\n", + "{'summary': 'Timbalan Presiden UMNO, Datuk Seri Mohamad Hasan berkata, kenyataan tersebut tidak mewakili pendirian serta pandangan UMNO kerana parti itu menghormati serta memahami keperluan sekolah vernakular dalam negara. \"Saya ingin menegaskan dua perkara penting. \"Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN,\" katanya.',\n", " 'top-words': ['jagaannya',\n", " 'ferdy',\n", " 'hoe',\n", " 'zulkifli',\n", - " 'televisyen',\n", " 'lanun',\n", " 'laksmi',\n", " 'ongr',\n", + " 'televisyen',\n", " 'kawan',\n", - " 'diimbau'],\n", - " 'cluster-top-words': ['televisyen',\n", - " 'jagaannya',\n", - " 'diimbau',\n", - " 'zulkifli',\n", + " 'sharidake'],\n", + " 'cluster-top-words': ['sharidake',\n", + " 'hoe',\n", + " 'ferdy',\n", " 'lanun',\n", + " 'zulkifli',\n", " 'laksmi',\n", - " 'kawan',\n", + " 'televisyen',\n", " 'ongr',\n", - " 'hoe',\n", - " 'ferdy']}" + " 'jagaannya',\n", + " 'kawan']}" ] }, - "execution_count": 10, + "execution_count": 12, "metadata": {}, "output_type": "execute_result" } @@ -274,7 +348,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 13, "metadata": {}, "outputs": [ { @@ -283,7 +357,7 @@ "(12, 64)" ] }, - "execution_count": 11, + "execution_count": 13, "metadata": {}, "output_type": "execute_result" } @@ -294,108 +368,99 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "(12, 64)" + "(34, 64)" ] }, - "execution_count": 12, + "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "deep_summary.vectorize(isu_kerajaan).shape" + "deep_summary.vectorize(isu_string).shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Train skip-thought summarization deep learning model" + "## Train LSA model" ] }, { - "cell_type": "code", - "execution_count": 13, + "cell_type": "markdown", "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "minibatch loop: 100%|██████████| 5/5 [00:01<00:00, 2.94it/s, cost=9.45]\n", - "minibatch loop: 100%|██████████| 5/5 [00:01<00:00, 4.56it/s, cost=7.99]\n", - "minibatch loop: 100%|██████████| 5/5 [00:01<00:00, 4.67it/s, cost=6.61]\n", - "minibatch loop: 100%|██████████| 5/5 [00:01<00:00, 4.62it/s, cost=5.34]\n", - "minibatch loop: 100%|██████████| 5/5 [00:01<00:00, 4.55it/s, cost=4.17]\n" - ] - } - ], "source": [ - "deep_summary = malaya.summarize.train_skip_thought(isu_kerajaan, batch_size = 2)" + "Important parameters,\n", + "\n", + "1. `vectorizer`, vectorizer technique. Allowed values:\n", + " * ``'bow'`` - Bag of Word.\n", + " * ``'tfidf'`` - Term frequency inverse Document Frequency.\n", + " * ``'skip-gram'`` - Bag of Word with skipping certain n-grams.\n", + "2. `ngram`, n-grams size to train a corpus.\n", + "3. `important_words`, number of important words.\n", + "4. `top_k`, number of summarized strings." 
] }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "{'summary': 'Pertama pendirian beliau tersebut adalah pandangan peribadi yang tidak mewakili pendirian dan pandangan UMNO. Kenyataan kontroversi Setiausaha Agung Barisan Nasional (BN), Datuk Seri Mohamed Nazri Aziz berhubung sekolah vernakular merupakan pandangan peribadi beliau. Kata beliau, komitmen UMNO dan BN berhubung perkara itu dapat dilihat dengan jelas dalam bentuk sokongan infrastruktur, pengiktirafan dan pemberian peruntukan yang diperlukan.',\n", - " 'top-words': ['vernakular',\n", - " 'bentuk',\n", - " 'parti',\n", - " 'jelas',\n", - " 'pertama',\n", - " 'disalah',\n", - " 'adalah',\n", - " 'kekuatan',\n", - " 'bahawa',\n", - " 'penting'],\n", - " 'cluster-top-words': ['adalah',\n", - " 'penting',\n", - " 'bentuk',\n", - " 'pertama',\n", - " 'bahawa',\n", - " 'parti',\n", - " 'disalah',\n", - " 'kekuatan',\n", - " 'jelas',\n", - " 'vernakular']}" + "{'summary': 'Timbalan Presiden UMNO, Datuk Seri Mohamad Hasan berkata, kenyataan tersebut tidak mewakili pendirian serta pandangan UMNO kerana parti itu menghormati serta memahami keperluan sekolah vernakular dalam negara. \"Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN,\" katanya. UMNO berpendirian sekolah jenis ini perlu terus wujud di negara kita,\" katanya dalam satu kenyataan akhbar malam ini.',\n", + " 'top-words': ['umno',\n", + " 'nyata',\n", + " 'sekolah',\n", + " 'pandang',\n", + " 'vernakular',\n", + " 'hormat',\n", + " 'sekolah vernakular',\n", + " 'nazri',\n", + " 'hormat paham',\n", + " 'hak'],\n", + " 'cluster-top-words': ['hak',\n", + " 'pandang',\n", + " 'sekolah vernakular',\n", + " 'hormat paham',\n", + " 'umno',\n", + " 'nazri',\n", + " 'nyata']}" ] }, - "execution_count": 14, + "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "deep_summary.summarize(isu_kerajaan,important_words=10)" + "malaya.summarize.lsa(isu_kerajaan,important_words=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Train LSA model" + "We can use `tfidf` as vectorizer." ] }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "{'summary': 'Menurut beliau, persefahaman dan keupayaan meraikan kepelbagaian itu menjadi kelebihan dan kekuatan UMNO dan BN selama ini. Kata beliau, komitmen UMNO dan BN berhubung perkara itu dapat dilihat dengan jelas dalam bentuk sokongan infrastruktur, pengiktirafan dan pemberian peruntukan yang diperlukan. \"Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN,\" katanya.',\n", + "{'summary': 'Timbalan Presiden UMNO, Datuk Seri Mohamad Hasan berkata, kenyataan tersebut tidak mewakili pendirian serta pandangan UMNO kerana parti itu menghormati serta memahami keperluan sekolah vernakular dalam negara. Mohamad yang menjalankan tugas-tugas Presiden UMNO berkata, UMNO konsisten dengan pendirian itu dalam mengiktiraf kepelbagaian bangsa dan etnik termasuk hak untuk beragama serta mendapat pendidikan. 
UMNO berpendirian sekolah jenis ini perlu terus wujud di negara kita,\" katanya dalam satu kenyataan akhbar malam ini.',\n", " 'top-words': ['wakil pandang umno',\n", " 'mohamed',\n", " 'paham sekolah vernakular',\n", @@ -406,101 +471,105 @@ " 'mohamed nazri',\n", " 'mohamad',\n", " 'pandang peribadi'],\n", - " 'cluster-top-words': ['negara',\n", - " 'mohamad',\n", - " 'pandang peribadi',\n", + " 'cluster-top-words': ['pandang peribadi',\n", " 'wakil pandang umno',\n", - " 'mohamed nazri',\n", " 'nazri nyata',\n", - " 'paham sekolah vernakular']}" + " 'negara',\n", + " 'paham sekolah vernakular',\n", + " 'mohamad',\n", + " 'mohamed nazri']}" ] }, - "execution_count": 15, + "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "malaya.summarize.lsa(isu_kerajaan,important_words=10)" + "malaya.summarize.lsa(isu_kerajaan,important_words=10, ngram = (1,3), vectorizer = 'tfidf')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can use `skip-gram` as vectorizer, and can override `skip` value." ] }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "{'summary': \"KL Jamm dianjurkan Music Unlimited International Sdn Bhd dan bakal menggabungkan pelbagai genre muzik seperti rock, hip hop, jazz dan pop dengan lebih 100 persembahan, 20 'showcase', pameran dan perdagangan berkaitan. Festival tiga hari itu bakal berlangsung di Pusat Pameran dan Perdagangan Antarabangsa Malaysia (MITEC), Kuala Lumpur pada 26 hingga 28 April ini. Maklumat mengenai pembelian tiket dan keterangan lanjut boleh melayari www.kljamm.com.\",\n", - " 'top-words': ['zaman',\n", - " 'jamm anjur',\n", - " 'genre muzik rock',\n", - " 'hip',\n", - " 'hip hop',\n", - " 'hip hop jazz',\n", - " 'hop',\n", - " 'hop jazz',\n", - " 'hop jazz pop',\n", - " 'jazz pop'],\n", - " 'cluster-top-words': ['hip hop jazz',\n", - " 'genre muzik rock',\n", - " 'hop jazz pop',\n", - " 'jamm anjur',\n", - " 'zaman']}" + "{'summary': 'Mohamed Nazri semalam menjelaskan, kenyataannya mengenai sekolah jenis kebangsaan Cina dan Tamil baru-baru ini disalah petik pihak media. UMNO berpendirian sekolah jenis ini perlu terus wujud di negara kita,\" katanya dalam satu kenyataan akhbar malam ini. Kata Nazri dalam kenyataannya itu, beliau menekankan bahawa semua pihak perlu menghormati hak orang Melayu dan bumiputera.',\n", + " 'top-words': ['umno',\n", + " 'sekolah',\n", + " 'nyata',\n", + " 'pandang',\n", + " 'nazri',\n", + " 'hormat',\n", + " 'vernakular',\n", + " 'pandang umno',\n", + " 'sekolah vernakular',\n", + " 'presiden umno'],\n", + " 'cluster-top-words': ['pandang umno',\n", + " 'sekolah vernakular',\n", + " 'nazri',\n", + " 'nyata',\n", + " 'presiden umno',\n", + " 'hormat']}" ] }, - "execution_count": 17, + "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "malaya.summarize.lsa(isu_string,important_words=10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Train NMF model" + "malaya.summarize.lsa(isu_kerajaan,important_words=10, ngram = (1,3), vectorizer = 'skip-gram', skip = 3)" ] }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "{'summary': 'Menurut beliau, persefahaman dan keupayaan meraikan kepelbagaian itu menjadi kelebihan dan kekuatan UMNO dan BN selama ini. 
Kata beliau, komitmen UMNO dan BN berhubung perkara itu dapat dilihat dengan jelas dalam bentuk sokongan infrastruktur, pengiktirafan dan pemberian peruntukan yang diperlukan. \"Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN,\" katanya.',\n", - " 'top-words': ['wakil pandang umno',\n", - " 'mohamed',\n", - " 'paham sekolah vernakular',\n", - " 'paham sekolah',\n", - " 'paham',\n", - " 'negara',\n", - " 'nazri nyata',\n", - " 'mohamed nazri',\n", - " 'mohamad',\n", - " 'pandang peribadi'],\n", - " 'cluster-top-words': ['negara',\n", - " 'mohamad',\n", - " 'pandang peribadi',\n", - " 'wakil pandang umno',\n", - " 'mohamed nazri',\n", - " 'nazri nyata',\n", - " 'paham sekolah vernakular']}" + "{'summary': 'Konsert berbayar Mewakili golongan anak seni, Sheila menaruh harapan semoga Festival KL Jamm akan menjadi platform buat artis yang sudah ada nama dan artis muda untuk membuat persembahan, sekali gus sama-sama memartabatkan industri muzik tempatan. Festival KL Jamm bakal menghimpunkan barisan artis tempatan baru dan nama besar dalam konsert iaitu Datuk Ramli Sarip, Datuk Afdlin Shauki, Zamani, Amelina, Radhi OAG, Dr Burn, Santesh, Rabbit Mac, Sheezy, kumpulan Bunkface, Ruffedge, Pot Innuendo, artis dari Kartel (Joe Flizzow, Sona One, Ila Damia, Yung Raja, Faris Jabba dan Abu Bakarxli) dan Malaysia Pasangge (artis India tempatan). \"Sedangkan artis juga menyanyi untuk kerjaya dan ia juga punca pendapatan bagi menyara hidup,\" katanya.',\n", + " 'top-words': ['artis',\n", + " 'sheila',\n", + " 'konsert',\n", + " 'muzik',\n", + " 'nyanyi',\n", + " 'sembah',\n", + " 'festival',\n", + " 'jamm',\n", + " 'kl',\n", + " 'babit'],\n", + " 'cluster-top-words': ['muzik',\n", + " 'babit',\n", + " 'sheila',\n", + " 'konsert',\n", + " 'jamm',\n", + " 'nyanyi',\n", + " 'artis',\n", + " 'festival',\n", + " 'kl',\n", + " 'sembah']}" ] }, - "execution_count": 18, + "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "malaya.summarize.nmf(isu_kerajaan,important_words=10)" + "malaya.summarize.lsa(isu_string,important_words=10)" ] }, { @@ -512,33 +581,33 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "{'summary': 'Menurut beliau, persefahaman dan keupayaan meraikan kepelbagaian itu menjadi kelebihan dan kekuatan UMNO dan BN selama ini. Kata beliau, komitmen UMNO dan BN berhubung perkara itu dapat dilihat dengan jelas dalam bentuk sokongan infrastruktur, pengiktirafan dan pemberian peruntukan yang diperlukan. \"Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN,\" katanya.',\n", - " 'top-words': ['wakil pandang umno',\n", - " 'mohamed',\n", - " 'paham sekolah vernakular',\n", - " 'paham sekolah',\n", - " 'paham',\n", - " 'negara',\n", - " 'nazri nyata',\n", - " 'mohamed nazri',\n", - " 'mohamad',\n", - " 'pandang peribadi'],\n", - " 'cluster-top-words': ['negara',\n", - " 'mohamad',\n", - " 'pandang peribadi',\n", - " 'wakil pandang umno',\n", - " 'mohamed nazri',\n", - " 'nazri nyata',\n", - " 'paham sekolah vernakular']}" + "{'summary': '\"Saya ingin menegaskan dua perkara penting. 
\"Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN,\" katanya. Mohamad yang menjalankan tugas-tugas Presiden UMNO berkata, UMNO konsisten dengan pendirian itu dalam mengiktiraf kepelbagaian bangsa dan etnik termasuk hak untuk beragama serta mendapat pendidikan.',\n", + " 'top-words': ['umno',\n", + " 'nyata',\n", + " 'sekolah',\n", + " 'pandang',\n", + " 'vernakular',\n", + " 'hormat',\n", + " 'sekolah vernakular',\n", + " 'nazri',\n", + " 'hormat paham',\n", + " 'hak'],\n", + " 'cluster-top-words': ['nazri',\n", + " 'umno',\n", + " 'pandang',\n", + " 'hak',\n", + " 'nyata',\n", + " 'hormat paham',\n", + " 'sekolah vernakular']}" ] }, - "execution_count": 19, + "execution_count": 4, "metadata": {}, "output_type": "execute_result" } @@ -547,49 +616,138 @@ "malaya.summarize.lda(isu_kerajaan,important_words=10)" ] }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'summary': 'Namun, ada satu persamaan yang mengeratkan hubungan mereka kerana sama-sama mencintai bidang muzik sejak dulu. \"Kami memang meminati bidang muzik dan saling memahami antara satu sama lain. DUA legenda hebat dan \\'The living legend\\' ini sudah memartabatkan bidang muzik sejak lebih tiga dekad lalu.',\n", + " 'top-words': ['artis',\n", + " 'sheila',\n", + " 'konsert',\n", + " 'muzik',\n", + " 'festival',\n", + " 'sembah',\n", + " 'nyanyi',\n", + " 'kl',\n", + " 'kl jamm',\n", + " 'jamm'],\n", + " 'cluster-top-words': ['kl jamm',\n", + " 'sheila',\n", + " 'nyanyi',\n", + " 'sembah',\n", + " 'muzik',\n", + " 'artis',\n", + " 'festival',\n", + " 'konsert']}" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "malaya.summarize.lda(isu_string,important_words=10, vectorizer = 'skip-gram')" + ] + }, { "cell_type": "markdown", "metadata": {}, "source": [ - "#### Not clustering important words" + "## Load doc2vec summarization\n", + "\n", + "We need to load word vector provided by Malaya. `doc2vec` does not return `top-words`, so parameter `important_words` cannot be use.\n", + "\n", + "Important parameters,\n", + "1. `aggregation`, aggregation function to accumulate word vectors. Default is `mean`.\n", + "\n", + " * ``'mean'`` - mean.\n", + " * ``'min'`` - min.\n", + " * ``'max'`` - max.\n", + " * ``'sum'`` - sum.\n", + " * ``'sqrt'`` - square root." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Using word2vec\n", + "\n", + "I will use `load_news`, word2vec from wikipedia took a very long time." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "embedded_news = malaya.word2vec.load_news(64)\n", + "w2v_wiki = malaya.word2vec.word2vec(embedded_news['nce_weights'],\n", + " embedded_news['dictionary'])" ] }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "{'summary': 'Menurut beliau, persefahaman dan keupayaan meraikan kepelbagaian itu menjadi kelebihan dan kekuatan UMNO dan BN selama ini. Kata beliau, komitmen UMNO dan BN berhubung perkara itu dapat dilihat dengan jelas dalam bentuk sokongan infrastruktur, pengiktirafan dan pemberian peruntukan yang diperlukan. 
\"Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN,\" katanya.',\n", - " 'top-words': ['wakil pandang umno',\n", - " 'mohamed',\n", - " 'paham sekolah vernakular',\n", - " 'paham sekolah',\n", - " 'paham',\n", - " 'negara',\n", - " 'nazri nyata',\n", - " 'mohamed nazri',\n", - " 'mohamad',\n", - " 'pandang peribadi']}" + "'Timbalan Presiden UMNO, Datuk Seri Mohamad Hasan berkata, kenyataan tersebut tidak mewakili pendirian serta pandangan UMNO kerana parti itu menghormati serta memahami keperluan sekolah vernakular dalam negara. Mohamad yang menjalankan tugas-tugas Presiden UMNO berkata, UMNO konsisten dengan pendirian itu dalam mengiktiraf kepelbagaian bangsa dan etnik termasuk hak untuk beragama serta mendapat pendidikan. Kata Nazri dalam kenyataannya itu, beliau menekankan bahawa semua pihak perlu menghormati hak orang Melayu dan bumiputera. Mohamed Nazri semalam menjelaskan, kenyataannya mengenai sekolah jenis kebangsaan Cina dan Tamil baru-baru ini disalah petik pihak media. \"Kedua UMNO sebagai sebuah parti sangat menghormati dan memahami keperluan sekolah vernakular di Malaysia.'" ] }, - "execution_count": 20, + "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "malaya.summarize.lda(isu_kerajaan,important_words=10,return_cluster=False)" + "malaya.summarize.doc2vec(w2v_wiki, isu_kerajaan, soft = False, top_k = 5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Using fast-text" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 6, "metadata": {}, "outputs": [], - "source": [] + "source": [ + "wiki, ngrams = malaya.fast_text.load_wiki()\n", + "fast_text_embed = malaya.fast_text.fast_text(wiki['embed_weights'],wiki['dictionary'],ngrams)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'Timbalan Presiden UMNO, Datuk Seri Mohamad Hasan berkata, kenyataan tersebut tidak mewakili pendirian serta pandangan UMNO kerana parti itu menghormati serta memahami keperluan sekolah vernakular dalam negara. Mohamad yang menjalankan tugas-tugas Presiden UMNO berkata, UMNO konsisten dengan pendirian itu dalam mengiktiraf kepelbagaian bangsa dan etnik termasuk hak untuk beragama serta mendapat pendidikan. Kata Nazri dalam kenyataannya itu, beliau menekankan bahawa semua pihak perlu menghormati hak orang Melayu dan bumiputera. \"Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN,\" katanya. 
Kata beliau, komitmen UMNO dan BN berhubung perkara itu dapat dilihat dengan jelas dalam bentuk sokongan infrastruktur, pengiktirafan dan pemberian peruntukan yang diperlukan.'" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "malaya.summarize.doc2vec(fast_text_embed, isu_kerajaan, soft = False, top_k = 5)" + ] } ], "metadata": { diff --git a/malaya/__init__.py b/malaya/__init__.py index 3385762f..56a3df37 100644 --- a/malaya/__init__.py +++ b/malaya/__init__.py @@ -17,8 +17,8 @@ from pathlib import Path home = os.path.join(str(Path.home()), 'Malaya') -version = '2.5' -bump_version = '2.5.0' +version = '2.6' +bump_version = '2.6.0' version_path = os.path.join(home, 'version') diff --git a/malaya/_models/_tensorflow_model.py b/malaya/_models/_tensorflow_model.py index 4925e6cb..8458692e 100644 --- a/malaya/_models/_tensorflow_model.py +++ b/malaya/_models/_tensorflow_model.py @@ -14,6 +14,7 @@ language_detection_textcleaning, tag_chunk, bert_tokenization, + bert_tokenization_siamese, ) from .._utils._parse_dependency import DependencyGraph from ..preprocessing import preprocessing_classification_index @@ -1633,3 +1634,172 @@ def predict_batch(self, strings, get_proba = False): list_result.append(label) results.append(list_result) return results + + +class SIAMESE: + def __init__(self, X_left, X_right, logits, sess, dictionary): + self._X_left = X_left + self._X_right = X_right + self._logits = logits + self._sess = sess + self._dictionary = dictionary + + def predict(self, string_left, string_right): + """ + calculate similarity for two different texts. + + Parameters + ---------- + string_left : str + string_right : str + + Returns + ------- + float: float + """ + if not isinstance(string_left, str): + raise ValueError('string_left must be a string') + if not isinstance(string_right, str): + raise ValueError('string_right must be a string') + _, splitted = preprocessing_classification_index(string_left) + batch_x_left = str_idx( + [' '.join(splitted)], self._dictionary, len(splitted), UNK = 3 + ) + _, splitted = preprocessing_classification_index(string_right) + batch_x_right = str_idx( + [' '.join(splitted)], self._dictionary, len(splitted), UNK = 3 + ) + return self._sess.run( + 1 - self._logits, + feed_dict = { + self._X_left: batch_x_left, + self._X_right: batch_x_right, + }, + )[0] + + def predict_batch(self, strings_left, strings_right): + """ + calculate similarity for two different batch of texts. 
+ +        Parameters +        ---------- +        strings_left : list of strings +        strings_right : list of strings + +        Returns +        ------- +        list: list of float +        """ +        if not isinstance(strings_left, list): +            raise ValueError('strings_left must be a list') +        if not isinstance(strings_left[0], str): +            raise ValueError('strings_left must be list of strings') +        if not isinstance(strings_right, list): +            raise ValueError('strings_right must be a list') +        if not isinstance(strings_right[0], str): +            raise ValueError('strings_right must be list of strings') + +        strings = [ +            ' '.join(preprocessing_classification_index(i)[1]) +            for i in strings_left +        ] +        maxlen = max([len(i.split()) for i in strings]) +        batch_x_left = str_idx(strings, self._dictionary, maxlen, UNK = 3) + +        strings = [ +            ' '.join(preprocessing_classification_index(i)[1]) +            for i in strings_right +        ] +        maxlen = max([len(i.split()) for i in strings]) +        batch_x_right = str_idx(strings, self._dictionary, maxlen, UNK = 3) + +        return self._sess.run( +            1 - self._logits, +            feed_dict = { +                self._X_left: batch_x_left, +                self._X_right: batch_x_right, +            }, +        ) + + +class SIAMESE_BERT(BERT): +    def __init__( +        self, +        X, +        segment_ids, +        input_masks, +        logits, +        sess, +        tokenizer, +        maxlen, +        label = ['not similar', 'similar'], +    ): +        BERT.__init__( +            self, +            X, +            segment_ids, +            input_masks, +            logits, +            sess, +            tokenizer, +            maxlen, +            label, +        ) + +    def _base(self, strings_left, strings_right): +        input_ids, input_masks, segment_ids = bert_tokenization_siamese( +            self._tokenizer, strings_left, strings_right, self._maxlen +        ) + +        return self._sess.run( +            tf.nn.softmax(self._logits), +            feed_dict = { +                self._X: input_ids, +                self._segment_ids: segment_ids, +                self._input_masks: input_masks, +            }, +        ) + +    def predict(self, string_left, string_right): +        """ +        calculate similarity for two different texts. + +        Parameters +        ---------- +        string_left : str +        string_right : str + +        Returns +        ------- +        float: float +        """ +        if not isinstance(string_left, str): +            raise ValueError('string_left must be a string') +        if not isinstance(string_right, str): +            raise ValueError('string_right must be a string') + +        return self._base([string_left], [string_right])[0, 1]
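+
+    # `_base` returns softmax probabilities over ['not similar', 'similar'],
+    # so column 1 is the probability that the two texts are similar. A sketch:
+    #
+    #   model = malaya.similarity.bert()
+    #   model.predict('makan nasi', 'makan mee')  # probability of `similar`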
+ +    def predict_batch(self, strings_left, strings_right): +        """ +        calculate similarity for two different batches of texts. + +        Parameters +        ---------- +        strings_left : list of strings +        strings_right : list of strings + +        Returns +        ------- +        list: list of float +        """ +        if not isinstance(strings_left, list): +            raise ValueError('strings_left must be a list') +        if not isinstance(strings_left[0], str): +            raise ValueError('strings_left must be list of strings') +        if not isinstance(strings_right, list): +            raise ValueError('strings_right must be a list') +        if not isinstance(strings_right[0], str): +            raise ValueError('strings_right must be list of strings') + +        return self._base(strings_left, strings_right)[:, 1] diff --git a/malaya/_utils/_paths.py index fb825666..aeb3d62d 100644 --- a/malaya/_utils/_paths.py +++ b/malaya/_utils/_paths.py @@ -595,7 +595,7 @@ 'setting': 'v24/emotion/emotion-dictionary.json', }, 'self-attention': { - 'model': 'v24/emotion/luong-emotion.pb', + 'model': 'v24/emotion/self-attention-emotion.pb', 'setting': 'v24/emotion/emotion-dictionary.json', }, 'multinomial': { @@ -691,3 +691,45 @@ 'setting': 'v24/relevancy/relevancy-dictionary.json', }, } + +PATH_SIMILARITY = { + 'bahdanau': { + 'model': home + '/similarity/bahdanau/bahdanau-similarity.pb', + 'setting': home + '/similarity/similarity-dictionary.json', + 'version': 'v26', + }, + 'dilated-cnn': { + 'model': home + '/similarity/luong/dilated-cnn-similarity.pb', + 'setting': home + '/similarity/similarity-dictionary.json', + 'version': 'v26', + }, + 'self-attention': { + 'model': home + '/similarity/luong/self-attention-similarity.pb', + 'setting': home + '/similarity/similarity-dictionary.json', + 'version': 'v26', + }, + 'bert': { + 'model': home + '/similarity/bert/bert-similarity.pb', + 'vocab': home + '/bert/multilanguage-vocab.txt', + 'version': 'v26', + }, +} + +S3_PATH_SIMILARITY = { + 'bahdanau': { + 'model': 'v26/similarity/bahdanau-similarity.pb', + 'setting': 'v26/similarity/similarity-dictionary.json', + }, + 'self-attention': { + 'model': 'v26/similarity/self-attention-similarity.pb', + 'setting': 'v26/similarity/similarity-dictionary.json', + }, + 'dilated-cnn': { + 'model': 'v26/similarity/dilated-cnn-similarity.pb', + 'setting': 'v26/similarity/similarity-dictionary.json', + }, + 'bert': { + 'model': 'v26/similarity/bert-similarity.pb', + 'vocab': 'v24/multilanguage-vocab.txt', + }, +} diff --git a/malaya/cluster.py index c2ff0db3..e0b65469 100644 --- a/malaya/cluster.py +++ b/malaya/cluster.py @@ -190,7 +190,7 @@ def cluster_scatter( ngram = (1, 3), cleaning = simple_textcleaning, vectorizer = 'bow', - stop_words = STOPWORDS, + stop_words = None, num_clusters = 5, clustering = KMeans, decomposition = MDS, @@ -219,8 +219,8 @@ n-grams size to train a corpus. cleaning: function, (default=simple_textcleaning) function to clean the corpus. - stop_words: list, (default=STOPWORDS) - list of stop words to remove. + stop_words: list, (default=None) + list of stop words to remove. If None, default is malaya.texts._text_functions.STOPWORDS vectorizer: str, (default='bow') vectorizer technique. 
Allowed values: @@ -283,6 +283,8 @@ def cluster_scatter( Vectorizer = SkipGramVectorizer else: raise Exception("vectorizer must be in ['tfidf', 'bow', 'skip-gram']") + if stop_words is None: + stop_words = STOPWORDS try: import matplotlib.pyplot as plt @@ -365,7 +367,7 @@ def cluster_dendogram( ngram = (1, 3), cleaning = simple_textcleaning, vectorizer = 'bow', - stop_words = STOPWORDS, + stop_words = None, random_samples = 0.3, figsize = (17, 9), **kwargs @@ -389,8 +391,8 @@ def cluster_dendogram( n-grams size to train a corpus. cleaning: function, (default=simple_textcleaning) function to clean the corpus. - stop_words: list, (default=STOPWORDS) - list of stop words to remove. + stop_words: list, (default=None) + list of stop words to remove. If None, default is malaya.texts._text_functions.STOPWORDS vectorizer: str, (default='bow') vectorizer technique. Allowed values: @@ -450,6 +452,8 @@ def cluster_dendogram( raise Exception( 'matplotlib and seaborn not installed. Please install it and try again.' ) + if stop_words is None: + stop_words = STOPWORDS tf_vectorizer = Vectorizer( ngram_range = ngram, @@ -507,7 +511,7 @@ def cluster_graph( ngram = (1, 3), cleaning = simple_textcleaning, vectorizer = 'bow', - stop_words = STOPWORDS, + stop_words = None, num_clusters = 5, clustering = KMeans, figsize = (17, 9), @@ -538,8 +542,8 @@ def cluster_graph( n-grams size to train a corpus. cleaning: function, (default=simple_textcleaning) function to clean the corpus. - stop_words: list, (default=STOPWORDS) - list of stop words to remove. + stop_words: list, (default=None) + list of stop words to remove. If None, default is malaya.texts._text_functions.STOPWORDS vectorizer: str, (default='bow') vectorizer technique. Allowed values: @@ -620,6 +624,8 @@ def cluster_graph( raise Exception( 'matplotlib, seaborn, networkx not installed. Please install it and try again.' ) + if stop_words is None: + stop_words = STOPWORDS tf_vectorizer = Vectorizer( ngram_range = ngram, @@ -701,7 +707,7 @@ def cluster_entity_linking( stemming = True, cleaning = simple_textcleaning, vectorizer = 'bow', - stop_words = STOPWORDS, + stop_words = None, figsize = (17, 9), **kwargs ): @@ -733,8 +739,8 @@ def cluster_entity_linking( n-grams size to train a corpus. cleaning: function, (default=simple_textcleaning) function to clean the corpus. - stop_words: list, (default=STOPWORDS) - list of stop words to remove. + stop_words: list, (default=None) + list of stop words to remove. If None, default is malaya.texts._text_functions.STOPWORDS vectorizer: str, (default='bow') vectorizer technique. 
Allowed values: @@ -795,6 +801,8 @@ def cluster_entity_linking( raise ValueError( 'threshold must be bigger than 0, less than or equal to 1' ) + if stop_words is None: + stop_words = STOPWORDS try: import matplotlib.pyplot as plt diff --git a/malaya/emotion.py b/malaya/emotion.py index a7f2ea75..58dee3d2 100644 --- a/malaya/emotion.py +++ b/malaya/emotion.py @@ -146,7 +146,7 @@ def bert(validate = True): Returns ------- - XGB : malaya._models._tensorflow_model.MULTICLASS_BERT class + MULTICLASS_BERT : malaya._models._tensorflow_model.MULTICLASS_BERT class """ if not isinstance(validate, bool): raise ValueError('validate must be a boolean') diff --git a/malaya/normalize.py b/malaya/normalize.py index 634c72bd..57c6aff0 100644 --- a/malaya/normalize.py +++ b/malaya/normalize.py @@ -35,9 +35,8 @@ ) from .num2word import to_cardinal from .word2num import word2num -from .preprocessing import _SocialTokenizer +from .preprocessing import _tokenizer -_tokenizer = _SocialTokenizer().tokenize ignore_words = ['ringgit', 'sen'] ignore_postfix = ['adalah'] @@ -368,7 +367,7 @@ def fuzzy(corpus): Returns ------- - FUZZY_NORMALIZE: Trained malaya.normalizer._FUZZY_NORMALIZE class + _FUZZY_NORMALIZE: Trained malaya.normalizer._FUZZY_NORMALIZE class """ if not isinstance(corpus, list): raise ValueError('corpus must be a list') @@ -415,7 +414,7 @@ def spell(speller): Returns ------- - SPELL_NORMALIZE: malaya.normalizer._SPELL_NORMALIZE class + _SPELL_NORMALIZE: malaya.normalizer._SPELL_NORMALIZE class """ if not hasattr(speller, 'correct') and not hasattr( speller, 'normalize_elongated' diff --git a/malaya/preprocessing.py b/malaya/preprocessing.py index 5c60b287..535d983c 100644 --- a/malaya/preprocessing.py +++ b/malaya/preprocessing.py @@ -110,7 +110,7 @@ def _get_expression_dict(): } -class _SocialTokenizer: +class SocialTokenizer: def __init__(self, lowercase = False, **kwargs): """ Args: @@ -362,7 +362,7 @@ def __init__( self._remove_postfix = remove_postfix self._regexes = _get_expression_dict() self._expand_hashtags = expand_hashtags - self._tokenizer = _SocialTokenizer(lowercase = lowercase).tokenize + self._tokenizer = SocialTokenizer(lowercase = lowercase).tokenize if self._expand_hashtags: self._segmenter = _Segmenter(maxlen_segmenter) self._expand_contractions = expand_english_contractions @@ -691,7 +691,7 @@ def segmenter(max_split_length = 20, validate = True): return _Segmenter(max_split_length = max_split_length) -_tokenizer = _SocialTokenizer().tokenize +_tokenizer = SocialTokenizer().tokenize _rejected = ['wkwk', 'http', 'https', 'lolol', 'hahaha'] diff --git a/malaya/similarity.py b/malaya/similarity.py index f8a5f73d..15b9f392 100644 --- a/malaya/similarity.py +++ b/malaya/similarity.py @@ -12,7 +12,11 @@ import json from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer from .texts.vectorizer import SkipGramVectorizer -from sklearn.metrics.pairwise import cosine_similarity +from sklearn.metrics.pairwise import ( + cosine_similarity, + euclidean_distances, + manhattan_distances, +) from sklearn.utils import shuffle from .texts._text_functions import ( STOPWORDS, @@ -22,7 +26,12 @@ ) from .generator import sentence_ngram from . 
import home, _delete_macos -from ._utils._utils import download_file +from ._utils._utils import ( + check_file, + load_graph, + check_available, + generate_session, +) from ._models._skip_thought import ( train_model as skip_train, batch_sequence, @@ -32,9 +41,11 @@ train_model as siamese_train, load_siamese as load_deep_siamese, ) - +from .preprocessing import _tokenizer from .topic import calon from .topic import location +from ._utils._paths import PATH_SIMILARITY, S3_PATH_SIMILARITY +from ._models._tensorflow_model import SIAMESE, SIAMESE_BERT def _apply_stopwords_calon(string): @@ -323,7 +334,7 @@ return _FUZZY(dictionary) -def is_location(string, fuzzy_ratio = 90, location = _location): +def is_location(string, fuzzy_ratio = 90, location = None): """ check whether a string is a location, default is malaysia location. @@ -332,8 +343,8 @@ string: str fuzzy_ratio: int, (default=90) ratio of similar characters by positions, if 90, means 90%. - location: list, (default=_location) - list of locations. + location: list, (default=None) + list of locations. If None, will use malaya.similarity._location Returns ------- @@ -343,6 +354,8 @@ raise ValueError('input must be a string') if not isinstance(fuzzy_ratio, int): raise ValueError('fuzzy_ratio must be an integer') + if location is None: + location = _location if not (fuzzy_ratio > 0 and fuzzy_ratio < 100): raise ValueError('fuzzy_ratio must be bigger than 0 and less than 100') for loc in location: @@ -690,3 +703,299 @@ vectorizer = char_vectorizer.fit(output) vectorized = vectorizer.transform(output) return _FAST_SIMILARITY(vectorizer, vectorized, keys) + + +def doc2vec( +    vectorizer, +    left_string, +    right_string, +    aggregation = 'mean', +    similarity = 'cosine', +    tokenizer = _tokenizer, +    soft = True, +): +    """ +    Calculate similarity between 2 documents using doc2vec. + +    Parameters +    ---------- +    vectorizer : object +        fast-text or word2vec interface object. +    left_string: str +        first string to compare. +    right_string: str +        second string to compare. +    aggregation : str, optional (default='mean') +        Aggregation supported. Allowed values: + +        * ``'mean'`` - mean. +        * ``'min'`` - min. +        * ``'max'`` - max. +        * ``'sum'`` - sum. +        * ``'sqrt'`` - square root. +    similarity : str, optional (default='cosine') +        similarity supported. Allowed values: + +        * ``'cosine'`` - cosine similarity. +        * ``'euclidean'`` - euclidean similarity. +        * ``'manhattan'`` - manhattan similarity. +    tokenizer : object +        default is tokenizer from malaya.preprocessing.SocialTokenizer +    soft: bool, optional (default=True) +        if True, a word not inside the vectorizer vocabulary is replaced with its nearest fuzzy match; if False, it is skipped.
+ +    Returns +    ------- +    result: float +    """ + +    if not hasattr(vectorizer, 'get_vector_by_name'): +        raise ValueError('vectorizer must have a `get_vector_by_name` method') +    if not isinstance(left_string, str): +        raise ValueError('left_string must be a string') +    if not isinstance(right_string, str): +        raise ValueError('right_string must be a string') +    if not isinstance(aggregation, str): +        raise ValueError('aggregation must be a string') +    if not isinstance(similarity, str): +        raise ValueError('similarity must be a string') + +    aggregation = aggregation.lower() +    if aggregation == 'mean': +        aggregation_function = np.mean +    elif aggregation == 'min': +        aggregation_function = np.min +    elif aggregation == 'max': +        aggregation_function = np.max +    elif aggregation == 'sum': +        aggregation_function = np.sum +    elif aggregation == 'sqrt': +        aggregation_function = np.sqrt +    else: +        raise ValueError( +            'aggregation only supports `mean`, `min`, `max`, `sum` and `sqrt`' +        ) + +    similarity = similarity.lower() +    if similarity == 'cosine': +        similarity_function = cosine_similarity +    elif similarity == 'euclidean': +        similarity_function = euclidean_distances +    elif similarity == 'manhattan': +        similarity_function = manhattan_distances +    else: +        raise ValueError( +            'similarity only supports `cosine`, `euclidean`, and `manhattan`' +        ) + +    left_tokenized = tokenizer(left_string) +    if not len(left_tokenized): +        raise ValueError('left_string must not be empty') +    right_tokenized = tokenizer(right_string) +    if not len(right_tokenized): +        raise ValueError('right_string must not be empty') + +    left_vectors, right_vectors = [], [] +    for token in left_tokenized: +        try: +            left_vectors.append(vectorizer.get_vector_by_name(token)) +        except: +            if not soft: +                pass +            else: +                arr = np.array([fuzz.ratio(token, k) for k in vectorizer.words]) +                idx = (-arr).argsort()[0] +                left_vectors.append( +                    vectorizer.get_vector_by_name(vectorizer.words[idx]) +                ) +    for token in right_tokenized: +        try: +            right_vectors.append(vectorizer.get_vector_by_name(token)) +        except: +            if not soft: +                pass +            else: +                arr = np.array([fuzz.ratio(token, k) for k in vectorizer.words]) +                idx = (-arr).argsort()[0] +                right_vectors.append( +                    vectorizer.get_vector_by_name(vectorizer.words[idx]) +                ) +    left_vectors = [aggregation_function(left_vectors, axis = 0)] +    right_vectors = [aggregation_function(right_vectors, axis = 0)] +    similar = similarity_function(left_vectors, right_vectors)[0, 0] +    if similarity == 'cosine': +        return (similar + 1) / 2 +    else: +        return 1 / (similar + 1)
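+
+
+# Example usage (a sketch; assumes the news word2vec is available locally):
+#
+#   embedded_news = malaya.word2vec.load_news(64)
+#   w2v = malaya.word2vec.word2vec(embedded_news['nce_weights'], embedded_news['dictionary'])
+#   doc2vec(w2v, 'sekolah vernakular', 'sekolah jenis kebangsaan', aggregation = 'mean')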
+ + +def summarizer(summarizer, left_string, right_string, similarity = 'cosine'): +    """ +    Calculate similarity between 2 documents using summarizer. + +    Parameters +    ---------- +    summarizer : object +        malaya summarizer interface object; must have a `vectorize` method. +    left_string: str +        first string to compare. +    right_string: str +        second string to compare. +    similarity : str, optional (default='cosine') +        similarity supported. Allowed values: + +        * ``'cosine'`` - cosine similarity. +        * ``'euclidean'`` - euclidean similarity. +        * ``'manhattan'`` - manhattan similarity. + +    Returns +    ------- +    result: float +    """ + +    if not hasattr(summarizer, 'vectorize'): +        raise ValueError('summarizer must have a `vectorize` method') +    if not isinstance(left_string, str): +        raise ValueError('left_string must be a string') +    if not isinstance(right_string, str): +        raise ValueError('right_string must be a string') +    if not isinstance(similarity, str): +        raise ValueError('similarity must be a string') + +    similarity = similarity.lower() +    if similarity == 'cosine': +        similarity_function = cosine_similarity +    elif similarity == 'euclidean': +        similarity_function = euclidean_distances +    elif similarity == 'manhattan': +        similarity_function = manhattan_distances +    else: +        raise ValueError( +            'similarity only supports `cosine`, `euclidean`, and `manhattan`' +        ) + +    left_vectorized = summarizer.vectorize(left_string + '.') +    right_vectorized = summarizer.vectorize(right_string + '.') +    similar = similarity_function(left_vectorized, right_vectorized)[0, 0] +    if similarity == 'cosine': +        return (similar + 1) / 2 +    else: +        return 1 / (similar + 1) + + +def available_deep_siamese(): +    """ +    List available deep siamese models. +    """ +    return ['self-attention', 'bahdanau', 'dilated-cnn'] + + +def deep_siamese(model = 'bahdanau', validate = True): +    """ +    Load deep siamese model. + +    Parameters +    ---------- +    model : str, optional (default='bahdanau') +        Model architecture supported. Allowed values: + +        * ``'self-attention'`` - Fast-text architecture, embedded and logits layers only with self attention. +        * ``'bahdanau'`` - LSTM with bahdanau attention architecture. +        * ``'dilated-cnn'`` - Pyramid Dilated CNN architecture. +    validate: bool, optional (default=True) +        if True, malaya will check model availability and download if not available. + +    Returns +    ------- +    SIAMESE: malaya._models._tensorflow_model.SIAMESE class +    """ + +    if not isinstance(model, str): +        raise ValueError('model must be a string') +    if not isinstance(validate, bool): +        raise ValueError('validate must be a boolean') +    model = model.lower() +    if model not in available_deep_siamese(): +        raise Exception( +            'model is not supported, please check supported models from malaya.similarity.available_deep_siamese()' +        ) +    if validate: +        check_file(PATH_SIMILARITY[model], S3_PATH_SIMILARITY[model]) +    else: +        if not check_available(PATH_SIMILARITY[model]): +            raise Exception( +                'similarity/%s is not available, please `validate = True`' +                % (model) +            ) +    try: +        with open(PATH_SIMILARITY[model]['setting'], 'r') as fopen: +            dictionary = json.load(fopen)['dictionary'] +        g = load_graph(PATH_SIMILARITY[model]['model']) +    except: +        raise Exception( +            "model corrupted due to some reasons, please run malaya.clear_cache('similarity/%s') and try again" +            % (model) +        ) +    return SIAMESE( +        g.get_tensor_by_name('import/Placeholder:0'), +        g.get_tensor_by_name('import/Placeholder_1:0'), +        g.get_tensor_by_name('import/logits:0'), +        generate_session(graph = g), +        dictionary, +    )
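+
+
+# Example usage (a sketch; the model file downloads on the first call when validate = True):
+#
+#   model = deep_siamese(model = 'dilated-cnn')
+#   model.predict('makan nasi di kedai', 'makan nasi di restoran')
+#   model.predict_batch(['string kiri 1', 'string kiri 2'], ['string kanan 1', 'string kanan 2'])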
+ ) + if validate: + check_file(PATH_SIMILARITY['bert'], S3_PATH_SIMILARITY['bert']) + else: + if not check_available(PATH_SIMILARITY['bert']): + raise Exception( + 'toxic/bert is not available, please `validate = True`' + ) + + tokenization.validate_case_matches_checkpoint(True, '') + tokenizer = tokenization.FullTokenizer( + vocab_file = PATH_SIMILARITY['bert']['vocab'], do_lower_case = True + ) + try: + g = load_graph(PATH_SIMILARITY['bert']['model']) + except: + raise Exception( + "model corrupted due to some reasons, please run malaya.clear_cache('similarity/bert') and try again" + ) + + return SIAMESE_BERT( + X = g.get_tensor_by_name('import/Placeholder:0'), + segment_ids = g.get_tensor_by_name('import/Placeholder_1:0'), + input_masks = g.get_tensor_by_name('import/Placeholder_2:0'), + logits = g.get_tensor_by_name('import/logits:0'), + sess = generate_session(graph = g), + tokenizer = tokenizer, + maxlen = 100, + label = ['not similar', 'similar'], + ) diff --git a/malaya/spell.py b/malaya/spell.py index 38b0ae0d..b6f21163 100644 --- a/malaya/spell.py +++ b/malaya/spell.py @@ -350,7 +350,7 @@ def fuzzy(corpus): Returns ------- - SPELL: Trained malaya.spell._SPELL class + _SPELL: Trained malaya.spell._SPELL class """ if not isinstance(corpus, list): raise ValueError('corpus must be a list') @@ -370,7 +370,7 @@ def probability(validate = True): Returns ------- - SPELL: Trained malaya.spell._SpellCorrector class + _SpellCorrector: Trained malaya.spell._SpellCorrector class """ if validate: check_file(PATH_NGRAM[1], S3_PATH_NGRAM[1]) diff --git a/malaya/subjective.py b/malaya/subjective.py index c632aeb8..46600b4e 100644 --- a/malaya/subjective.py +++ b/malaya/subjective.py @@ -24,7 +24,7 @@ def available_deep_model(): def sparse_deep_model(model = 'fast-text-char', validate = True): """ - Load deep learning sentiment analysis model. + Load deep learning subjectivity analysis model. 
Parameters ---------- @@ -145,7 +145,7 @@ def bert(validate = True): Returns ------- - XGB : malaya._models._tensorflow_model.BINARY_BERT class + BERT : malaya._models._tensorflow_model.BINARY_BERT class """ if not isinstance(validate, bool): raise ValueError('validate must be a boolean') diff --git a/malaya/summarize.py b/malaya/summarize.py index 0fffa98b..9e582ac2 100644 --- a/malaya/summarize.py +++ b/malaya/summarize.py @@ -7,22 +7,22 @@ import numpy as np import re import random -from scipy.linalg import svd -from operator import itemgetter -from sklearn.feature_extraction.text import TfidfVectorizer +from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer +from sklearn.decomposition import TruncatedSVD, NMF, LatentDirichletAllocation from sklearn.utils import shuffle -from sklearn.cluster import KMeans -from sklearn.metrics import pairwise_distances_argmin_min -from sklearn.decomposition import NMF, LatentDirichletAllocation +from fuzzywuzzy import fuzz +from sklearn.metrics.pairwise import cosine_similarity from .texts._text_functions import ( summary_textcleaning, classification_textcleaning, STOPWORDS, split_into_sentences, ) +import networkx as nx from .stem import sastrawi from ._models import _skip_thought from .cluster import cluster_words +from .texts.vectorizer import SkipGramVectorizer class _DEEP_SUMMARIZER: @@ -60,9 +60,7 @@ def vectorize(self, corpus): self._logits, feed_dict = {self._X: np.array(sequences)} ) - def summarize( - self, corpus, top_k = 3, important_words = 3, return_cluster = True - ): + def summarize(self, corpus, top_k = 3, important_words = 3): """ Summarize list of strings / corpus @@ -71,9 +69,9 @@ def summarize( corpus: str, list top_k: int, (default=3) - number of summarized strings + number of summarized strings. important_words: int, (default=3) - number of important words + number of important words. 
Returns ------- @@ -83,8 +81,6 @@ raise ValueError('top_k must be an integer') if not isinstance(important_words, int): raise ValueError('important_words must be an integer') - if not isinstance(return_cluster, bool): - raise ValueError('return_cluster must be a boolean') if not isinstance(corpus, list) and not isinstance(corpus, str): raise ValueError('corpus must be a list') if isinstance(corpus, list): @@ -102,63 +98,64 @@ sequences = _skip_thought.batch_sequence( cleaned_strings, self.dictionary, maxlen = self._maxlen ) - encoded, attention = self._sess.run( + vectors, attention = self._sess.run( [self._logits, self._attention], feed_dict = {self._X: np.array(sequences)}, ) attention = attention.sum(axis = 0) - kmeans = KMeans(n_clusters = top_k, random_state = 0) - kmeans = kmeans.fit(encoded) - avg = [] - for j in range(top_k): - idx = np.where(kmeans.labels_ == j)[0] - avg.append(np.mean(idx)) - closest, _ = pairwise_distances_argmin_min( - kmeans.cluster_centers_, encoded - ) indices = np.argsort(attention)[::-1] top_words = [self._rev_dictionary[i] for i in indices[:important_words]] - ordering = sorted(range(top_k), key = lambda k: avg[k]) - summarized = ' '.join( - [original_strings[closest[idx]] for idx in ordering] + + similar = cosine_similarity(vectors, vectors) + similar[similar >= 0.999] = 0 + nx_graph = nx.from_numpy_array(similar) + scores = nx.pagerank(nx_graph, max_iter = 1000) + ranked_sentences = sorted( + ((scores[i], s) for i, s in enumerate(original_strings)), + reverse = True, ) - if return_cluster: - return { - 'summary': summarized, - 'top-words': top_words, - 'cluster-top-words': cluster_words(top_words), - } - return {'summary': summarized, 'top-words': top_words} + summary = [r[1] for r in ranked_sentences[:top_k]] + return { + 'summary': ' '.join(summary), + 'top-words': top_words, + 'cluster-top-words': cluster_words(top_words), + } -def deep_model_news(): - """ - Load skip-thought summarization deep learning model trained on news dataset. - Returns - ------- - _DEEP_SUMMARIZER: _DEEP_SUMMARIZER class +def available_deep_extractive(): """ - sess, x, logits, attention, dictionary, maxlen = ( - _skip_thought.news_load_model() - ) - return _DEEP_SUMMARIZER(sess, x, logits, attention, dictionary, maxlen) + List available deep extractive summarization models. + """ + return ['skip-thought', 'residual-network'] -def deep_model_wiki(): +def deep_extractive(model = 'skip-thought'): """ - Load residual network with Bahdanau Attention summarization deep learning model trained on wikipedia dataset. + Load deep extractive summarization model; sentences are scored using TextRank. + + Parameters + ---------- + model : str, optional (default='skip-thought') + Model architecture supported. Allowed values: + + * ``'skip-thought'`` - skip-thought summarization deep learning model trained on news dataset. We hope to train it on the wikipedia dataset as well. + * ``'residual-network'`` - residual network with Bahdanau Attention summarization deep learning model trained on wikipedia dataset. Returns ------- - _DEEP_SUMMARIZER: _DEEP_SUMMARIZER class + _DEEP_SUMMARIZER: malaya.summarize._DEEP_SUMMARIZER class """ - print( - 'WARNING: this model is using convolutional based, Tensorflow-GPU above 1.10 may got a problem. Please downgrade to Tensorflow-GPU v1.8 if got any cuDNN error.' 
@@ -245,49 +242,38 @@ def train_skip_thought( ) -def lsa( +def _base_summarizer( corpus, - ngram = (1, 3), - min_df = 2, + decomposition, top_k = 3, - important_words = 3, - return_cluster = True, - **kwargs + max_df = 0.95, + min_df = 2, + ngram = (1, 3), + vectorizer = 'bow', + important_words = 10, + **kwargs, ): - """ - summarize a list of strings using LSA. - - Parameters - ---------- - corpus: list - ngram: tuple, (default=(1,3)) - n-grams size to train a corpus - min_df: int, (default=2) - minimum document frequency for a word - top_k: int, (default=3) - number of summarized strings - important_words: int, (default=3) - number of important words - return_cluster: bool, (default=True) - if True, will cluster important_words to similar texts - - Returns - ------- - dictionary: result - """ + if not isinstance(vectorizer, str): + raise ValueError('vectorizer must be a string') if not isinstance(top_k, int): raise ValueError('top_k must be an integer') - if not isinstance(important_words, int): - raise ValueError('important_words must be an integer') - if not isinstance(return_cluster, bool): - raise ValueError('return_cluster must be a boolean') + vectorizer = vectorizer.lower() + if vectorizer not in ['tfidf', 'bow', 'skip-gram']: + raise ValueError("vectorizer must be in ['tfidf', 'bow', 'skip-gram']") if not isinstance(ngram, tuple): raise ValueError('ngram must be a tuple') if not len(ngram) == 2: raise ValueError('ngram size must equal to 2') - if not isinstance(min_df, int) or isinstance(min_df, float): - raise ValueError('min_df must be an integer or a float') - + if not isinstance(min_df, int): + raise ValueError('min_df must be an integer') + if not (isinstance(max_df, int) or isinstance(max_df, float)): + raise ValueError('max_df must be an integer or a float') + if min_df < 1: + raise ValueError('min_df must be bigger than 0') + if not (max_df <= 1 and max_df > 0): + raise ValueError( 'max_df must be bigger than 0, less than or equal to 1' ) if not isinstance(corpus, list) and not isinstance(corpus, str): raise ValueError('corpus must be a list') if isinstance(corpus, list): @@ -303,170 +289,176 @@ def lsa( original_strings = [i[0] for i in splitted_fullstop] cleaned_strings = [i[1] for i in splitted_fullstop] stemmed = [sastrawi(i) for i in cleaned_strings] - tfidf = TfidfVectorizer( - ngram_range = ngram, min_df = min_df, stop_words = STOPWORDS, **kwargs - ).fit(stemmed) - U, S, Vt = svd(tfidf.transform(stemmed).todense().T, full_matrices = False) - summary = [ - (original_strings[i], np.linalg.norm(np.dot(np.diag(S), Vt[:, b]), 2)) - for i in range(len(splitted_fullstop)) - for b in range(len(Vt)) - ] - summary = sorted(summary, key = itemgetter(1)) - summary = dict( - (v[0], v) for v in sorted(summary, key = lambda summary: summary[1]) - ).values() - summarized = ' '.join([a for a, b in summary][len(summary) - (top_k) :]) - indices = np.argsort(tfidf.idf_)[::-1] - features = tfidf.get_feature_names() + + if vectorizer == 'tfidf': + Vectorizer = TfidfVectorizer
+ elif vectorizer == 'bow': + Vectorizer = CountVectorizer + elif vectorizer == 'skip-gram': + Vectorizer = SkipGramVectorizer + else: + raise Exception("vectorizer must be in ['tfidf', 'bow', 'skip-gram']") + tf_vectorizer = Vectorizer( + max_df = max_df, + min_df = min_df, + ngram_range = ngram, + stop_words = STOPWORDS, + **kwargs, + ) + tf = tf_vectorizer.fit_transform(stemmed) + if hasattr(tf_vectorizer, 'idf_'): + indices = np.argsort(tf_vectorizer.idf_)[::-1] + else: + indices = np.argsort(np.asarray(tf.sum(axis = 0))[0])[::-1] + + features = tf_vectorizer.get_feature_names() top_words = [features[i] for i in indices[:important_words]] - if return_cluster: - return { - 'summary': summarized, - 'top-words': top_words, - 'cluster-top-words': cluster_words(top_words), - } - return {'summary': summarized, 'top-words': top_words} + vectors = decomposition(tf.shape[1] // 2).fit_transform(tf) + similar = cosine_similarity(vectors, vectors) + similar[similar >= 0.999] = 0 + nx_graph = nx.from_numpy_array(similar) + scores = nx.pagerank(nx_graph, max_iter = 1000) + ranked_sentences = sorted( + ((scores[i], s) for i, s in enumerate(original_strings)), reverse = True + ) + summary = [r[1] for r in ranked_sentences[:top_k]] + return { + 'summary': ' '.join(summary), + 'top-words': top_words, + 'cluster-top-words': cluster_words(top_words), + } -def nmf( +def lda( corpus, - ngram = (1, 3), - min_df = 2, top_k = 3, - important_words = 3, - return_cluster = True, - **kwargs + important_words = 10, + max_df = 0.95, + min_df = 2, + ngram = (1, 3), + vectorizer = 'bow', + **kwargs, ): """ - summarize a list of strings using NMF. + summarize a list of strings using LDA, scoring sentences with TextRank. Parameters ---------- corpus: list - ngram: tuple, (default=(1,3)) - n-grams size to train a corpus top_k: int, (default=3) - number of summarized strings - important_words: int, (default=3) - number of important words + number of summarized strings. + important_words: int, (default=10) + number of important words. + max_df: float, (default=0.95) + maximum document frequency for a word to be selected. min_df: int, (default=2) - minimum document frequency for a word - return_cluster: bool, (default=True) - if True, will cluster important_words to similar texts + minimum document frequency for a word to be selected. + ngram: tuple, (default=(1,3)) + n-grams size to train a corpus. + vectorizer: str, (default='bow') + vectorizer technique. Allowed values: + + * ``'bow'`` - Bag of Word. + * ``'tfidf'`` - Term Frequency-Inverse Document Frequency. + * ``'skip-gram'`` - Bag of Word with skipping certain n-grams. Returns ------- - dictionary: result + dict: result """ - if not isinstance(top_k, int): - raise ValueError('top_k must be an integer') - if not isinstance(important_words, int): - raise ValueError('important_words must be an integer') - if not isinstance(return_cluster, bool): - raise ValueError('return_cluster must be a boolean') - if not isinstance(ngram, tuple): - raise ValueError('ngram must be a tuple') - if not len(ngram) == 2: - raise ValueError('ngram size must equal to 2') - if not isinstance(min_df, int) or isinstance(min_df, float): - raise ValueError('min_df must be an integer or a float') - - if not isinstance(corpus, list) and not isinstance(corpus, str): - raise ValueError('corpus must be a list') - if isinstance(corpus, list): - if not isinstance(corpus[0], str): - raise ValueError('corpus must be list of strings') - if isinstance(corpus, str): - corpus = split_into_sentences(corpus) - else: - corpus = '. '.join(corpus) - corpus = split_into_sentences(corpus) - - splitted_fullstop = [summary_textcleaning(i) for i in corpus] - original_strings = [i[0] for i in splitted_fullstop] - cleaned_strings = [i[1] for i in splitted_fullstop] - stemmed = [sastrawi(i) for i in cleaned_strings] - tfidf = TfidfVectorizer( - ngram_range = ngram, min_df = min_df, stop_words = STOPWORDS, **kwargs - ).fit(stemmed) - densed_tfidf = tfidf.transform(stemmed).todense() - nmf = NMF(len(splitted_fullstop)).fit(densed_tfidf) - vectors = nmf.transform(densed_tfidf) - components = nmf.components_.mean(axis = 1) - summary = [ - ( - original_strings[i], - np.linalg.norm(np.dot(np.diag(components), vectors[:, b]), 2), - ) - for i in range(len(splitted_fullstop)) - for b in range(len(vectors)) - ] - summary = sorted(summary, key = itemgetter(1)) - summary = dict( - (v[0], v) for v in sorted(summary, key = lambda summary: summary[1]) - ).values() - summarized = ' '.join([a for a, b in summary][len(summary) - (top_k) :]) - indices = np.argsort(tfidf.idf_)[::-1] - features = tfidf.get_feature_names() - top_words = [features[i] for i in indices[:important_words]] - if return_cluster: - return { - 'summary': summarized, - 'top-words': top_words, - 'cluster-top-words': cluster_words(top_words), - } - return {'summary': summarized, 'top-words': top_words} + return _base_summarizer( + corpus, + LatentDirichletAllocation, + top_k = top_k, + max_df = max_df, + min_df = min_df, + ngram = ngram, + vectorizer = vectorizer, + important_words = important_words, + **kwargs, + )
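A quick usage sketch of the reworked `lda` summarizer; the three-sentence `corpus` is a hypothetical toy example (pass `min_df = 1` for such a tiny corpus, since the default of 2 may reject every term):

```python
import malaya

# hypothetical toy corpus; real inputs would be longer articles
corpus = [
    'Kerajaan mengumumkan bajet baru untuk tahun hadapan.',
    'Bajet baru itu memberi tumpuan kepada pendidikan dan kesihatan.',
    'Ramai penganalisis menyambut baik pengumuman kerajaan itu.',
]
result = malaya.summarize.lda(
    corpus, top_k = 2, important_words = 5, min_df = 1, vectorizer = 'tfidf'
)
print(result['summary'])
print(result['top-words'])
print(result['cluster-top-words'])
```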
-def lda( +def lsa( corpus, - maintain_original = False, - ngram = (1, 3), - min_df = 2, top_k = 3, - important_words = 3, - return_cluster = True, - **kwargs + important_words = 10, + max_df = 0.95, + min_df = 2, + ngram = (1, 3), + vectorizer = 'bow', + **kwargs, ): """ - summarize a list of strings using LDA. + summarize a list of strings using LSA, scoring sentences with TextRank. Parameters ---------- corpus: list - maintain_original: bool, (default=False) - If False, will apply malaya.text_functions.classification_textcleaning - ngram: tuple, (default=(1,3)) - n-grams size to train a corpus + top_k: int, (default=3) + number of summarized strings. + important_words: int, (default=10) + number of important words. + max_df: float, (default=0.95) + maximum document frequency for a word to be selected. min_df: int, (default=2) - minimum document frequency for a word + minimum document frequency for a word to be selected. + ngram: tuple, (default=(1,3)) + n-grams size to train a corpus. + vectorizer: str, (default='bow') + vectorizer technique. Allowed values: + + * ``'bow'`` - Bag of Word. + * ``'tfidf'`` - Term Frequency-Inverse Document Frequency. + * ``'skip-gram'`` - Bag of Word with skipping certain n-grams. + + Returns + ------- + dict: result + """ + return _base_summarizer( + corpus, + TruncatedSVD, + top_k = top_k, + max_df = max_df, + min_df = min_df, + ngram = ngram, + vectorizer = vectorizer, + important_words = important_words, + **kwargs, + )
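Both `lda` and `lsa` now delegate to `_base_summarizer`, whose scoring step is plain TextRank over sentence vectors. Isolated as a standalone sketch for review, mirroring the code above (requires scikit-learn and networkx):

```python
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity


def textrank(vectors, sentences, top_k = 3):
    # fully-connected graph weighted by cosine similarity; zero out
    # near-1.0 entries so a sentence is not rewarded for matching itself
    similar = cosine_similarity(vectors, vectors)
    similar[similar >= 0.999] = 0
    scores = nx.pagerank(nx.from_numpy_array(similar), max_iter = 1000)
    ranked = sorted(
        ((scores[i], s) for i, s in enumerate(sentences)), reverse = True
    )
    return ' '.join(r[1] for r in ranked[:top_k])
```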
+ + +def doc2vec(vectorizer, corpus, top_k = 3, aggregation = 'mean', soft = True): + """ + summarize a list of strings using doc2vec, scoring sentences with TextRank. + + Parameters + ---------- + vectorizer : object + fast-text or word2vec interface object. + corpus: list top_k: int, (default=3) - number of summarized strings - important_words: int, (default=3) - number of important words - return_cluster: bool, (default=True) - if True, will cluster important_words to similar texts + number of summarized strings. + aggregation : str, optional (default='mean') + Aggregation supported. Allowed values: + + * ``'mean'`` - mean. + * ``'min'`` - min. + * ``'max'`` - max. + * ``'sum'`` - sum. + * ``'sqrt'`` - square root. + soft: bool, optional (default=True) + if True, a word not inside the vectorizer vocabulary will be replaced with the nearest word, otherwise it will be skipped. Returns ------- - dictionary: result + string: summarized string """ - if not isinstance(maintain_original, bool): - raise ValueError('maintain_original must be a boolean') + if not hasattr(vectorizer, 'get_vector_by_name'): + raise ValueError('vectorizer must have a `get_vector_by_name` method') if not isinstance(top_k, int): raise ValueError('top_k must be an integer') - if not isinstance(important_words, int): - raise ValueError('important_words must be an integer') - if not isinstance(return_cluster, bool): - raise ValueError('return_cluster must be a boolean') - if not isinstance(ngram, tuple): - raise ValueError('ngram must be a tuple') - if not len(ngram) == 2: - raise ValueError('ngram size must equal to 2') - if not isinstance(min_df, int) or isinstance(min_df, float): - raise ValueError('min_df must be an integer or a float') - if not isinstance(corpus, list) and not isinstance(corpus, str): raise ValueError('corpus must be a list') if isinstance(corpus, list): @@ -477,38 +469,50 @@ def lda( else: corpus = '. '.join(corpus) corpus = split_into_sentences(corpus) - splitted_fullstop = [summary_textcleaning(i) for i in corpus] original_strings = [i[0] for i in splitted_fullstop] cleaned_strings = [i[1] for i in splitted_fullstop] - stemmed = [sastrawi(i) for i in cleaned_strings] - tfidf = TfidfVectorizer( - ngram_range = ngram, min_df = min_df, stop_words = STOPWORDS, **kwargs - ).fit(stemmed) - densed_tfidf = tfidf.transform(stemmed).todense() - lda = LatentDirichletAllocation(len(splitted_fullstop)).fit(densed_tfidf) - vectors = lda.transform(densed_tfidf) - components = lda.components_.mean(axis = 1) - summary = [ - ( - original_strings[i], - np.linalg.norm(np.dot(np.diag(components), vectors[:, b]), 2), + + aggregation = aggregation.lower() + if aggregation == 'mean': + aggregation_function = np.mean + elif aggregation == 'min': + aggregation_function = np.min + elif aggregation == 'max': + aggregation_function = np.max + elif aggregation == 'sum': + aggregation_function = np.sum + elif aggregation == 'sqrt': + # np.sqrt is an elementwise ufunc with no `axis` argument and would crash below; + # assume a root-of-summed-squares aggregation was intended + aggregation_function = lambda x, axis: np.sqrt(np.sum(np.square(x), axis = axis)) + else: + raise ValueError( 'aggregation only supports `mean`, `min`, `max`, `sum` and `sqrt`' ) - for i in range(len(splitted_fullstop)) - for b in range(len(vectors)) - ] - summary = sorted(summary, key = itemgetter(1)) - summary = dict( - (v[0], v) for v in sorted(summary, key = lambda summary: summary[1]) - ).values() - summarized = ' '.join([a for a, b in summary][len(summary) - (top_k) :]) - indices = np.argsort(tfidf.idf_)[::-1] - features = tfidf.get_feature_names() - top_words = [features[i] for i in indices[:important_words]] - if return_cluster: - return { - 'summary': summarized, - 'top-words': top_words, - 'cluster-top-words': cluster_words(top_words), - } - return {'summary': summarized, 'top-words': top_words} + + vectors = [] + for string in cleaned_strings: + inside = [] + for token in string.split(): + try: + inside.append(vectorizer.get_vector_by_name(token)) + except: + if soft: + arr = np.array( + [fuzz.ratio(token, k) for k in vectorizer.words] + ) + idx = (-arr).argsort()[0] + inside.append( + vectorizer.get_vector_by_name(vectorizer.words[idx]) + ) + vectors.append(aggregation_function(inside, axis = 0)) + similar = cosine_similarity(vectors, vectors) + similar[similar >= 0.999] = 0 + nx_graph = nx.from_numpy_array(similar) + scores = nx.pagerank(nx_graph, max_iter = 1000) + ranked_sentences = sorted( + ((scores[i], s) for i, s in enumerate(original_strings)), reverse = True + ) + summary = [r[1] for r in ranked_sentences[:top_k]] + return ' '.join(summary)
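A usage sketch for `doc2vec`; the `word_vector` loader below is a placeholder for whichever Malaya fast-text / word2vec interface you use, since the implementation above only requires the object to expose `get_vector_by_name` and `words`:

```python
import malaya

# placeholder: substitute the actual Malaya word-embedding loader;
# the only hard requirements are `get_vector_by_name` and `words`
word_vector = load_malay_word_vector()  # hypothetical helper

summary = malaya.summarize.doc2vec(
    word_vector, corpus, top_k = 2, aggregation = 'mean', soft = True
)
# unlike lda / lsa, doc2vec returns the joined summary string directly
print(summary)
```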
diff --git a/malaya/texts/__init__.py index a78fec95..87ae5af3 100644 --- a/malaya/texts/__init__.py +++ b/malaya/texts/__init__.py @@ -5,3 +5,29 @@ # Author: huseinzol05 # URL: # For license information, see https://github.com/huseinzol05/Malaya/blob/master/LICENSE + +from ._tatabahasa import ( + tanya_list, + perintah_list, + pangkal_list, + bantu_list, + penguat_list, + penegas_list, + nafi_list, + pemeri_list, + sendi_list, + pembenar_list, + nombor_list, + suku_bilangan_list, + pisahan_list, + keterangan_list, + arah_list, + hubung_list, + gantinama_list, + permulaan, + hujung, + hujung_malaysian, + calon_dictionary, + stopwords_calon, +) +from ._text_functions import STOPWORDS as stopwords diff --git a/malaya/texts/_text_functions.py index b72232b8..48ad1443 100644 --- a/malaya/texts/_text_functions.py +++ b/malaya/texts/_text_functions.py @@ -478,3 +478,51 @@ def bert_tokenization(tokenizer,
texts, maxlen): segment_ids.append(segment_id) return input_ids, input_masks, segment_ids + + +def _truncate_seq_pair(tokens_a, tokens_b, max_length): + while True: + total_length = len(tokens_a) + len(tokens_b) + if total_length <= max_length: + break + if len(tokens_a) > len(tokens_b): + tokens_a.pop() + else: + tokens_b.pop() + + +def bert_tokenization_siamese(tokenizer, left, right, maxlen): + input_ids, input_masks, segment_ids = [], [], [] + for i in range(len(left)): + tokens_a = tokenizer.tokenize(left[i]) + tokens_b = tokenizer.tokenize(right[i]) + _truncate_seq_pair(tokens_a, tokens_b, maxlen - 3) + + tokens = [] + segment_id = [] + tokens.append('[CLS]') + segment_id.append(0) + for token in tokens_a: + tokens.append(token) + segment_id.append(0) + + tokens.append('[SEP]') + segment_id.append(0) + for token in tokens_b: + tokens.append(token) + segment_id.append(1) + tokens.append('[SEP]') + segment_id.append(1) + input_id = tokenizer.convert_tokens_to_ids(tokens) + input_mask = [1] * len(input_id) + + while len(input_id) < maxlen: + input_id.append(0) + input_mask.append(0) + segment_id.append(0) + + input_ids.append(input_id) + input_masks.append(input_mask) + segment_ids.append(segment_id) + + return input_ids, input_masks, segment_ids diff --git a/malaya/topic_model.py b/malaya/topic_model.py index 70118a81..1687bc47 100644 --- a/malaya/topic_model.py +++ b/malaya/topic_model.py @@ -349,7 +349,7 @@ def _base_topic_modelling( vectorizer = 'bow', stemming = True, cleaning = simple_textcleaning, - stop_words = STOPWORDS, + stop_words = None, **kwargs, ): if not isinstance(corpus, list): @@ -423,7 +423,7 @@ def lda( stemming = True, vectorizer = 'bow', cleaning = simple_textcleaning, - stop_words = STOPWORDS, + stop_words = None, **kwargs, ): """ @@ -450,13 +450,15 @@ def lda( * ``'skip-gram'`` - Bag of Word with skipping certain n-grams. cleaning: function, (default=simple_textcleaning) function to clean the corpus. - stop_words: list, (default=STOPWORDS) - list of stop words to remove. + stop_words: list, (default=None) + list of stop words to remove. If None, default is malaya.texts._text_functions.STOPWORDS Returns ------- _TOPIC: malaya.topic_modelling._TOPIC class """ + if stop_words is None: + stop_words = STOPWORDS return _base_topic_modelling( corpus, n_topics, @@ -481,7 +483,7 @@ def nmf( stemming = True, vectorizer = 'bow', cleaning = simple_textcleaning, - stop_words = STOPWORDS, + stop_words = None, **kwargs, ): """ @@ -508,13 +510,15 @@ def nmf( * ``'skip-gram'`` - Bag of Word with skipping certain n-grams. cleaning: function, (default=simple_textcleaning) function to clean the corpus. - stop_words: list, (default=STOPWORDS) - list of stop words to remove. + stop_words: list, (default=None) + list of stop words to remove. If None, default is malaya.texts._text_functions.STOPWORDS Returns ------- _TOPIC: malaya.topic_modelling._TOPIC class """ + if stop_words is None: + stop_words = STOPWORDS return _base_topic_modelling( corpus, n_topics, @@ -539,7 +543,7 @@ def lsa( vectorizer = 'bow', stemming = True, cleaning = simple_textcleaning, - stop_words = STOPWORDS, + stop_words = None, **kwargs, ): """ @@ -566,13 +570,15 @@ def lsa( If True, sastrawi_stemmer will apply. cleaning: function, (default=simple_textcleaning) function to clean the corpus. - stop_words: list, (default=STOPWORDS) - list of stop words to remove. + stop_words: list, (default=None) + list of stop words to remove. 
If None, default is malaya.texts._text_functions.STOPWORDS Returns ------- _TOPIC: malaya.topic_modelling._TOPIC class """ + if stop_words is None: + stop_words = STOPWORDS return _base_topic_modelling( corpus, n_topics, @@ -597,7 +603,7 @@ def lda2vec( ngram = (1, 3), cleaning = simple_textcleaning, vectorizer = 'bow', - stop_words = STOPWORDS, + stop_words = None, window_size = 2, embedding_size = 128, epoch = 10, @@ -623,8 +629,8 @@ def lda2vec( n-grams size to train a corpus. cleaning: function, (default=simple_textcleaning) function to clean the corpus. - stop_words: list, (default=STOPWORDS) - list of stop words to remove. + stop_words: list, (default=None) + list of stop words to remove. If None, default is malaya.texts._text_functions.STOPWORDS embedding_size: int, (default=128) embedding size of lda2vec tensors. training_iteration: int, (default=10) @@ -696,6 +702,8 @@ def lda2vec( max_df = max_df, stop_words = stop_words, ) + if stop_words is None: + stop_words = STOPWORDS if cleaning is not None: for i in range(len(corpus)): diff --git a/malaya/toxic.py b/malaya/toxic.py index dd404ba6..9a4c83fa 100644 --- a/malaya/toxic.py +++ b/malaya/toxic.py @@ -124,7 +124,7 @@ def deep_model(model = 'luong', validate = True): Returns ------- - TOXIC: malaya._models._tensorflow_model.SIGMOID class + SIGMOID: malaya._models._tensorflow_model.SIGMOID class """ if not isinstance(model, str): raise ValueError('model must be a string') @@ -163,6 +163,18 @@ def deep_model(model = 'luong', validate = True): def bert(validate = True): + """ + Load BERT toxicity model. + + Parameters + ---------- + validate: bool, optional (default=True) + if True, malaya will check model availability and download if not available. + + Returns + ------- + SIGMOID_BERT : malaya._models._tensorflow_model.SIGMOID_BERT class + """ try: from bert import tokenization except: @@ -215,7 +227,7 @@ def sparse_deep_model(model = 'fast-text-char', validate = True): Returns ------- - SPARSE_SOFTMAX: malaya._models._tensorflow_model.SPARSE_SIGMOID class + SPARSE_SIGMOID: malaya._models._tensorflow_model.SPARSE_SIGMOID class """ if not isinstance(model, str): diff --git a/session/similarity/augmenting.ipynb b/session/similarity/augmenting.ipynb new file mode 100644 index 00000000..cbe06b24 --- /dev/null +++ b/session/similarity/augmenting.ipynb @@ -0,0 +1,367 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "import glob\n", + "import re\n", + "import malaya" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "tokenizer = malaya.preprocessing._SocialTokenizer().tokenize\n", + "\n", + "def is_number_regex(s):\n", + " if re.match(\"^\\d+?\\.\\d+?$\", s) is None:\n", + " return s.isdigit()\n", + " return True\n", + "\n", + "def detect_money(word):\n", + " if word[:2] == 'rm' and is_number_regex(word[2:]):\n", + " return True\n", + " else:\n", + " return False\n", + "\n", + "def preprocessing(string):\n", + " tokenized = tokenizer(string)\n", + " tokenized = [w.lower() for w in tokenized if len(w) > 2]\n", + " tokenized = ['' if is_number_regex(w) else w for w in tokenized]\n", + " tokenized = ['' if detect_money(w) else w for w in tokenized]\n", + " return tokenized" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "left, right, label = [], [], []\n", + "for file in glob.glob('quora/*.json'):\n", + " with open(file) as 
fopen:\n", + " x = json.load(fopen)\n", + " for i in x:\n", + " splitted = i[0].split(' <> ')\n", + " if len(splitted) != 2:\n", + " continue\n", + " left.append(splitted[0])\n", + " right.append(splitted[1])\n", + " label.append(i[1])" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(403831, 403831, 403831)" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(left), len(right), len(label)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "with open('synonym0.json') as fopen:\n", + " s = json.load(fopen)\n", + " \n", + "with open('synonym1.json') as fopen:\n", + " s1 = json.load(fopen)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "synonyms = {}\n", + "for l, r in (s + s1):\n", + " if l not in synonyms:\n", + " synonyms[l] = r + [l]\n", + " else:\n", + " synonyms[l].extend(r)\n", + "synonyms = {k: list(set(v)) for k, v in synonyms.items()}" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "import random\n", + "\n", + "def augmentation(s, maximum = 0.8):\n", + " s = s.lower().split()\n", + " for i in range(int(len(s) * maximum)):\n", + " index = random.randint(0, len(s) - 1)\n", + " word = s[index]\n", + " sy = synonyms.get(word, [word])\n", + " sy = random.choice(sy)\n", + " s[index] = sy\n", + " return s" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "train_left, test_left = left[:-50000], left[-50000:]\n", + "train_right, test_right = right[:-50000], right[-50000:]\n", + "train_label, test_label = label[:-50000], label[-50000:]" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(353831, 50000)" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(train_left), len(test_left)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['apakah maksud mengecap sejati kepada anda?',\n", + " 'apakah maksud pilihan sejati kepada anda?',\n", + " 'apakah maksud mencinta sejati kepada anda?',\n", + " 'apakah maksud mengasihi sejati kepada anda?',\n", + " 'apakah maksud cinta sejati kepada anda?',\n", + " 'apakah maksud menyayangi sejati kepada anda?',\n", + " 'apakah maksud percintaan sejati kepada anda?']" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "aug = [' '.join(augmentation(train_left[0])) for _ in range(10)] + [train_left[0].lower()]\n", + "aug = list(set(aug))\n", + "aug" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['apakah maksud \"cinta sejati\"?']" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "aug = [' '.join(augmentation(train_right[0])) for _ in range(10)] + [train_right[0].lower()]\n", + "aug = list(set(aug))\n", + "aug" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + 
"train_label[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 353831/353831 [00:46<00:00, 7536.26it/s]\n" + ] + } + ], + "source": [ + "from tqdm import tqdm\n", + "\n", + "LEFT, RIGHT, LABEL = [], [], []\n", + "for i in tqdm(range(len(train_left))):\n", + " aug_left = [' '.join(augmentation(train_left[i])) for _ in range(3)] + [train_left[i].lower()]\n", + " aug_left = list(set(aug_left))\n", + " \n", + " aug_right = [' '.join(augmentation(train_right[i])) for _ in range(3)] + [train_right[i].lower()]\n", + " aug_right = list(set(aug_right))\n", + " \n", + " for l in aug_left:\n", + " for r in aug_right:\n", + " LEFT.append(l)\n", + " RIGHT.append(r)\n", + " LABEL.append(train_label[i])" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(4136391, 4136391, 4136391)" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(LEFT), len(RIGHT), len(LABEL)" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 4136391/4136391 [10:34<00:00, 6523.13it/s]\n" + ] + } + ], + "source": [ + "for i in tqdm(range(len(LEFT))):\n", + " LEFT[i] = preprocessing(LEFT[i])\n", + " RIGHT[i] = preprocessing(RIGHT[i])" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 50000/50000 [00:06<00:00, 7268.75it/s]\n" + ] + } + ], + "source": [ + "for i in tqdm(range(len(test_left))):\n", + " test_left[i] = preprocessing(test_left[i])\n", + " test_right[i] = preprocessing(test_right[i])" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "with open('train-similarity.json', 'w') as fopen:\n", + " json.dump({'left': LEFT, 'right': RIGHT, 'label': LABEL}, fopen)" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "with open('test-similarity.json', 'w') as fopen:\n", + " json.dump({'left': test_left, 'right': test_right, 'label': test_label}, fopen)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/session/similarity/bahdanau-contrastive.ipynb b/session/similarity/bahdanau-contrastive.ipynb new file mode 100644 index 00000000..955dae83 --- /dev/null +++ b/session/similarity/bahdanau-contrastive.ipynb @@ -0,0 +1,914 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import tensorflow as tf\n", + "import re\n", + "import numpy as np\n", + "import pandas as pd\n", + "from tqdm import tqdm\n", + "import collections\n", + "import itertools\n", + "from unidecode import unidecode\n", + "import malaya\n", + "import re\n", + "import json" + ] 
+ }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "def build_dataset(words, n_words, atleast=2):\n", + " count = [['PAD', 0], ['GO', 1], ['EOS', 2], ['UNK', 3]]\n", + " counter = collections.Counter(words).most_common(n_words - 10)\n", + " counter = [i for i in counter if i[1] >= atleast]\n", + " count.extend(counter)\n", + " dictionary = dict()\n", + " for word, _ in count:\n", + " dictionary[word] = len(dictionary)\n", + " data = list()\n", + " unk_count = 0\n", + " for word in words:\n", + " index = dictionary.get(word, 0)\n", + " if index == 0:\n", + " unk_count += 1\n", + " data.append(index)\n", + " count[0][1] = unk_count\n", + " reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))\n", + " return data, count, dictionary, reversed_dictionary\n", + "\n", + "def str_idx(corpus, dic, maxlen, UNK = 3):\n", + " X = np.zeros((len(corpus), maxlen))\n", + " for i in range(len(corpus)):\n", + " for no, k in enumerate(corpus[i][:maxlen]):\n", + " X[i, no] = dic.get(k, UNK)\n", + " return X\n", + "\n", + "tokenizer = malaya.preprocessing._SocialTokenizer().tokenize\n", + "\n", + "def is_number_regex(s):\n", + " if re.match(\"^\\d+?\\.\\d+?$\", s) is None:\n", + " return s.isdigit()\n", + " return True\n", + "\n", + "def detect_money(word):\n", + " if word[:2] == 'rm' and is_number_regex(word[2:]):\n", + " return True\n", + " else:\n", + " return False\n", + "\n", + "def preprocessing(string):\n", + " tokenized = tokenizer(string)\n", + " tokenized = [w.lower() for w in tokenized if len(w) > 2]\n", + " tokenized = ['' if is_number_regex(w) else w for w in tokenized]\n", + " tokenized = ['' if detect_money(w) else w for w in tokenized]\n", + " return tokenized" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "with open('train-similarity.json') as fopen:\n", + " train = json.load(fopen)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "left, right, label = train['left'], train['right'], train['label']" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "with open('test-similarity.json') as fopen:\n", + " test = json.load(fopen)\n", + "test_left, test_right, test_label = test['left'], test['right'], test['label']" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(array([0, 1]), array([2605321, 1531070]))" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.unique(label, return_counts = True)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "vocab from size: 73142\n", + "Most common words [('saya', 3584482), ('yang', 3541065), ('untuk', 2110965), ('apakah', 1948962), ('dan', 1556927), ('anda', 1375550)]\n", + "Sample data [7, 355, 325, 2415, 43, 9, 7, 355, 4166, 2415] ['apakah', 'maksud', 'cinta', 'sejati', 'kepada', 'anda', 'apakah', 'maksud', 'memuja', 'sejati']\n" + ] + } + ], + "source": [ + "concat = list(itertools.chain(*(left + right)))\n", + "vocabulary_size = len(list(set(concat)))\n", + "data, count, dictionary, rev_dictionary = build_dataset(concat, vocabulary_size, 1)\n", + "print('vocab from size: %d'%(vocabulary_size))\n", + "print('Most common words', count[4:10])\n", + 
"print('Sample data', data[:10], [rev_dictionary[i] for i in data[:10]])" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "with open('similarity-dictionary.json','w') as fopen:\n", + " fopen.write(json.dumps({'dictionary':dictionary,'reverse_dictionary':rev_dictionary}))" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "class Model:\n", + " def __init__(self, size_layer, num_layers, embedded_size,\n", + " dict_size, learning_rate, dropout):\n", + " \n", + " def cells(size, reuse=False):\n", + " cell = tf.nn.rnn_cell.LSTMCell(size,initializer=tf.orthogonal_initializer(),reuse=reuse)\n", + " return tf.contrib.rnn.DropoutWrapper(cell,output_keep_prob=dropout)\n", + " \n", + " def rnn(inputs, scope):\n", + " with tf.variable_scope(scope, reuse = tf.AUTO_REUSE):\n", + " attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(\n", + " num_units = size_layer, memory = inputs)\n", + " rnn_cells = tf.contrib.seq2seq.AttentionWrapper(\n", + " cell = tf.nn.rnn_cell.MultiRNNCell(\n", + " [cells(size_layer) for _ in range(num_layers)]\n", + " ),\n", + " attention_mechanism = attention_mechanism,\n", + " attention_layer_size = size_layer,\n", + " alignment_history = True,\n", + " )\n", + " outputs, last_state = tf.nn.dynamic_rnn(\n", + " rnn_cells, inputs, dtype = tf.float32\n", + " )\n", + " return outputs[:,-1]\n", + " \n", + " self.X_left = tf.placeholder(tf.int32, [None, None])\n", + " self.X_right = tf.placeholder(tf.int32, [None, None])\n", + " self.Y = tf.placeholder(tf.float32, [None])\n", + " self.batch_size = tf.shape(self.X_left)[0]\n", + " encoder_embeddings = tf.Variable(tf.random_uniform([dict_size, embedded_size], -1, 1))\n", + " embedded_left = tf.nn.embedding_lookup(encoder_embeddings, self.X_left)\n", + " embedded_right = tf.nn.embedding_lookup(encoder_embeddings, self.X_right)\n", + " \n", + " def contrastive_loss(y,d):\n", + " tmp= y * tf.square(d)\n", + " tmp2 = (1-y) * tf.square(tf.maximum((1 - d),0))\n", + " return tf.reduce_sum(tmp +tmp2)/tf.cast(self.batch_size,tf.float32)/2\n", + " \n", + " self.output_left = rnn(embedded_left, 'left')\n", + " self.output_right = rnn(embedded_right, 'right')\n", + " self.distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(self.output_left,self.output_right)),\n", + " 1,keep_dims=True))\n", + " self.distance = tf.div(self.distance, tf.add(tf.sqrt(tf.reduce_sum(tf.square(self.output_left),\n", + " 1,keep_dims=True)),\n", + " tf.sqrt(tf.reduce_sum(tf.square(self.output_right),\n", + " 1,keep_dims=True))))\n", + " self.distance = tf.reshape(self.distance, [-1])\n", + " self.logits = tf.identity(self.distance, name = 'logits')\n", + " self.cost = contrastive_loss(self.Y,self.distance)\n", + " \n", + " self.temp_sim = tf.subtract(tf.ones_like(self.distance),\n", + " tf.rint(self.distance))\n", + " correct_predictions = tf.equal(self.temp_sim, self.Y)\n", + " self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, \"float\"))\n", + " self.optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(self.cost)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "size_layer = 256\n", + "num_layers = 2\n", + "embedded_size = 128\n", + "learning_rate = 1e-4\n", + "maxlen = 50\n", + "batch_size = 128\n", + "dropout = 0.8" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "from 
sklearn.cross_validation import train_test_split\n", + "\n", + "train_X_left = str_idx(left, dictionary, maxlen)\n", + "train_X_right = str_idx(right, dictionary, maxlen)\n", + "train_Y = label\n", + "\n", + "test_X_left = str_idx(test_left, dictionary, maxlen)\n", + "test_X_right = str_idx(test_right, dictionary, maxlen)\n", + "test_Y = test_label" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.\n", + "Instructions for updating:\n", + "Colocations handled automatically by placer.\n", + "WARNING:tensorflow:From :6: LSTMCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version.\n", + "Instructions for updating:\n", + "This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0.\n", + "WARNING:tensorflow:From :15: MultiRNNCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version.\n", + "Instructions for updating:\n", + "This class is equivalent as tf.keras.layers.StackedRNNCells, and will be replaced by that in Tensorflow 2.0.\n", + "WARNING:tensorflow:From :22: dynamic_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.\n", + "Instructions for updating:\n", + "Please use `keras.layers.RNN(cell)`, which is equivalent to this API\n", + "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/rnn_cell_impl.py:1259: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.\n", + "Instructions for updating:\n", + "Please use `rate` instead of `keep_prob`. 
Rate should be set to `rate = 1 - keep_prob`.\n", + "WARNING:tensorflow:From :42: calling reduce_sum_v1 (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.\n", + "Instructions for updating:\n", + "keep_dims is deprecated, use keepdims instead\n", + "WARNING:tensorflow:From :46: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n", + "Instructions for updating:\n", + "Deprecated in favor of operator or tf.math.divide.\n", + "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n", + "Instructions for updating:\n", + "Use tf.cast instead.\n" + ] + } + ], + "source": [ + "tf.reset_default_graph()\n", + "sess = tf.InteractiveSession()\n", + "model = Model(size_layer,num_layers,embedded_size,len(dictionary),learning_rate,dropout)\n", + "sess.run(tf.global_variables_initializer())" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'bahdanau/model.ckpt'" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "saver = tf.train.Saver(tf.trainable_variables())\n", + "saver.save(sess, 'bahdanau/model.ckpt')" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "train minibatch loop: 100%|██████████| 32316/32316 [1:55:21<00:00, 4.74it/s, accuracy=0.718, cost=0.0761]\n", + "test minibatch loop: 100%|██████████| 391/391 [00:34<00:00, 11.29it/s, accuracy=0.775, cost=0.0865]\n", + "train minibatch loop: 0%| | 0/32316 [00:00 CURRENT_ACC:\n", + " print(\n", + " 'epoch: %d, pass acc: %f, current acc: %f'\n", + " % (EPOCH, CURRENT_ACC, test_acc)\n", + " )\n", + " CURRENT_ACC = test_acc\n", + " CURRENT_CHECKPOINT = 0\n", + " else:\n", + " CURRENT_CHECKPOINT += 1\n", + " \n", + " print('time taken:', time.time()-lasttime)\n", + " print('epoch: %d, training loss: %f, training acc: %f, valid loss: %f, valid acc: %f\\n'%(EPOCH,train_loss,\n", + " train_acc,test_loss,\n", + " test_acc))" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['Placeholder',\n", + " 'Placeholder_1',\n", + " 'Placeholder_2',\n", + " 'Variable',\n", + " 'left/memory_layer/kernel',\n", + " 'left/rnn/attention_wrapper/multi_rnn_cell/cell_0/lstm_cell/kernel/Read/ReadVariableOp',\n", + " 'left/rnn/attention_wrapper/multi_rnn_cell/cell_0/lstm_cell/bias/Read/ReadVariableOp',\n", + " 'left/rnn/attention_wrapper/multi_rnn_cell/cell_1/lstm_cell/kernel/Read/ReadVariableOp',\n", + " 'left/rnn/attention_wrapper/multi_rnn_cell/cell_1/lstm_cell/bias/Read/ReadVariableOp',\n", + " 'left/rnn/attention_wrapper/bahdanau_attention/query_layer/kernel',\n", + " 'left/rnn/attention_wrapper/bahdanau_attention/attention_v',\n", + " 'left/rnn/attention_wrapper/attention_layer/kernel',\n", + " 'right/memory_layer/kernel',\n", + " 'right/rnn/attention_wrapper/multi_rnn_cell/cell_0/lstm_cell/kernel/Read/ReadVariableOp',\n", + " 'right/rnn/attention_wrapper/multi_rnn_cell/cell_0/lstm_cell/bias/Read/ReadVariableOp',\n", + " 'right/rnn/attention_wrapper/multi_rnn_cell/cell_1/lstm_cell/kernel/Read/ReadVariableOp',\n", + " 
'right/rnn/attention_wrapper/multi_rnn_cell/cell_1/lstm_cell/bias/Read/ReadVariableOp',\n", + " 'right/rnn/attention_wrapper/bahdanau_attention/query_layer/kernel',\n", + " 'right/rnn/attention_wrapper/bahdanau_attention/attention_v',\n", + " 'right/rnn/attention_wrapper/attention_layer/kernel',\n", + " 'logits']" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "strings = ','.join(\n", + " [\n", + " n.name\n", + " for n in tf.get_default_graph().as_graph_def().node\n", + " if ('Variable' in n.op\n", + " or 'Placeholder' in n.name\n", + " or 'logits' in n.name\n", + " or 'alphas' in n.name)\n", + " and 'Adam' not in n.name\n", + " and '_power' not in n.name\n", + " and 'gradient' not in n.name\n", + " and 'Initializer' not in n.name\n", + " and 'Assign' not in n.name\n", + " ]\n", + ")\n", + "strings.split(',')" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'bahdanau/model.ckpt'" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "saver.save(sess, 'bahdanau/model.ckpt')" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[array([0.], dtype=float32), array([0.11445844], dtype=float32)]" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "left = str_idx(['a person is outdoors, on a horse.'], dictionary, maxlen)\n", + "right = str_idx(['a person on a horse jumps over a broken down airplane.'], dictionary, maxlen)\n", + "sess.run([model.temp_sim,1-model.distance], feed_dict = {model.X_left : left, \n", + " model.X_right: right})" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "validation minibatch loop: 100%|██████████| 391/391 [00:34<00:00, 11.42it/s]\n" + ] + } + ], + "source": [ + "real_Y, predict_Y = [], []\n", + "\n", + "pbar = tqdm(\n", + " range(0, len(test_X_left), batch_size), desc = 'validation minibatch loop'\n", + ")\n", + "for i in pbar:\n", + " batch_x_left = test_X_left[i:min(i+batch_size,train_X_left.shape[0])]\n", + " batch_x_right = test_X_right[i:min(i+batch_size,train_X_left.shape[0])]\n", + " batch_y = test_Y[i:min(i+batch_size,train_X_left.shape[0])]\n", + " predict_Y += sess.run(model.temp_sim, feed_dict = {model.X_left : batch_x_left, \n", + " model.X_right: batch_x_right,\n", + " model.Y : batch_y}).tolist()\n", + " real_Y += batch_y" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " precision recall f1-score support\n", + "\n", + "not similar 0.83 0.83 0.83 31524\n", + " similar 0.71 0.71 0.71 18476\n", + "\n", + "avg / total 0.79 0.79 0.79 50000\n", + "\n" + ] + } + ], + "source": [ + "from sklearn import metrics\n", + "\n", + "print(\n", + " metrics.classification_report(\n", + " real_Y, predict_Y, target_names = ['not similar', 'similar']\n", + " )\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [], + "source": [ + "def freeze_graph(model_dir, output_node_names):\n", + "\n", + " if not tf.gfile.Exists(model_dir):\n", + " raise AssertionError(\n", + " \"Export directory doesn't exists. 
Please specify an export \"\n", + " 'directory: %s' % model_dir\n", + " )\n", + "\n", + " checkpoint = tf.train.get_checkpoint_state(model_dir)\n", + " input_checkpoint = checkpoint.model_checkpoint_path\n", + "\n", + " absolute_model_dir = '/'.join(input_checkpoint.split('/')[:-1])\n", + " output_graph = absolute_model_dir + '/frozen_model.pb'\n", + " clear_devices = True\n", + " with tf.Session(graph = tf.Graph()) as sess:\n", + " saver = tf.train.import_meta_graph(\n", + " input_checkpoint + '.meta', clear_devices = clear_devices\n", + " )\n", + " saver.restore(sess, input_checkpoint)\n", + " output_graph_def = tf.graph_util.convert_variables_to_constants(\n", + " sess,\n", + " tf.get_default_graph().as_graph_def(),\n", + " output_node_names.split(','),\n", + " )\n", + " with tf.gfile.GFile(output_graph, 'wb') as f:\n", + " f.write(output_graph_def.SerializeToString())\n", + " print('%d ops in the final graph.' % len(output_graph_def.node))" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.\n", + "Instructions for updating:\n", + "Use standard file APIs to check for files with this prefix.\n", + "INFO:tensorflow:Restoring parameters from bahdanau/model.ckpt\n", + "WARNING:tensorflow:From :23: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.\n", + "Instructions for updating:\n", + "Use tf.compat.v1.graph_util.convert_variables_to_constants\n", + "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/graph_util_impl.py:245: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.\n", + "Instructions for updating:\n", + "Use tf.compat.v1.graph_util.extract_sub_graph\n", + "INFO:tensorflow:Froze 17 variables.\n", + "INFO:tensorflow:Converted 17 variables to const ops.\n", + "647 ops in the final graph.\n" + ] + } + ], + "source": [ + "freeze_graph('bahdanau', strings)" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [], + "source": [ + "def load_graph(frozen_graph_filename):\n", + " with tf.gfile.GFile(frozen_graph_filename, 'rb') as f:\n", + " graph_def = tf.GraphDef()\n", + " graph_def.ParseFromString(f.read())\n", + " with tf.Graph().as_default() as graph:\n", + " tf.import_graph_def(graph_def)\n", + " return graph" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([0.11765248], dtype=float32)" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "g = load_graph('bahdanau/frozen_model.pb')\n", + "x1 = g.get_tensor_by_name('import/Placeholder:0')\n", + "x2 = g.get_tensor_by_name('import/Placeholder_1:0')\n", + "logits = g.get_tensor_by_name('import/logits:0')\n", + "test_sess = tf.InteractiveSession(graph = g)\n", + "test_sess.run(1-logits, feed_dict = {x1 : left, x2: right})" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([0.4636389 , 0.5283668 , 0.43854022, 0.8202803 , 0.64394784,\n", + 
" 0.84979135, 0.745062 , 0.01964164, 0.07101661, 0.02169931,\n", + " 0.8392247 , 0.22707516, 0.19469285, 0.4840045 , 0.05370182,\n", + " 0.4678564 , 0.4111814 , 0.11001766, 0.20520616, 0.07242185,\n", + " 0.7431572 , 0.52817804, 0.4351002 , 0.63338685, 0.52839124,\n", + " 0.07311231, 0.1716168 , 0.09279257, 0.02310717, 0.02681172,\n", + " 0.2308088 , 0.551746 , 0.8105283 , 0.66022396, 0.739179 ,\n", + " 0.38779128, 0.8515695 , 0.7534613 , 0.05358309, 0.05516434,\n", + " 0.63869566, 0.7444098 , 0.63428354, 0.49298012, 0.75610924,\n", + " 0.54483724, 0.9024776 , 0.05228931, 0.05101156, 0.02496451,\n", + " 0.7684243 , 0.37446058, 0.8911811 , 0.39399248, 0.04925126,\n", + " 0.89727813, 0.34909683, 0.09850705, 0.04967946, 0.05255091,\n", + " 0.58232725, 0.40308565, 0.68486273, 0.41244376, 0.06464297,\n", + " 0.07472116, 0.06430554, 0.42752308, 0.10852087, 0.0495699 ,\n", + " 0.11905402, 0.26009667, 0.53447616, 0.88553053, 0.04034108,\n", + " 0.05235732, 0.43953466, 0.10045218, 0.07925862, 0.06360978],\n", + " dtype=float32)" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "test_sess.run(1-logits, feed_dict = {x1 : batch_x_left, x2: batch_x_right})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/session/similarity/bert-crossentropy.ipynb b/session/similarity/bert-crossentropy.ipynb new file mode 100644 index 00000000..7f2c3072 --- /dev/null +++ b/session/similarity/bert-crossentropy.ipynb @@ -0,0 +1,1147 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# !wget https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip\n", + "# !unzip multi_cased_L-12_H-768_A-12.zip" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import bert\n", + "from bert import run_classifier\n", + "from bert import optimization\n", + "from bert import tokenization\n", + "from bert import modeling\n", + "import numpy as np\n", + "import tensorflow as tf\n", + "import pandas as pd\n", + "from tqdm import tqdm" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "BERT_VOCAB = 'multi_cased_L-12_H-768_A-12/vocab.txt'\n", + "BERT_INIT_CHKPNT = 'multi_cased_L-12_H-768_A-12/bert_model.ckpt'\n", + "BERT_CONFIG = 'multi_cased_L-12_H-768_A-12/bert_config.json'\n", + "\n", + "tokenization.validate_case_matches_checkpoint(True, '')\n", + "tokenizer = tokenization.FullTokenizer(\n", + " vocab_file=BERT_VOCAB, do_lower_case=True)\n", + "MAX_SEQ_LENGTH = 100" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "import glob\n", + "\n", + "left, right, label = [], [], []\n", + "for file in glob.glob('quora/*.json'):\n", + " with open(file) as fopen:\n", + " x = json.load(fopen)\n", + " for i in x:\n", + " splitted = i[0].split(' <> ')\n", + " if len(splitted) != 2:\n", + " continue\n", + " 
left.append(splitted[0])\n", + " right.append(splitted[1])\n", + " label.append(i[1])" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(array([0, 1]), array([254659, 149172]))" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.unique(label, return_counts = True)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "def _truncate_seq_pair(tokens_a, tokens_b, max_length):\n", + " while True:\n", + " total_length = len(tokens_a) + len(tokens_b)\n", + " if total_length <= max_length:\n", + " break\n", + " if len(tokens_a) > len(tokens_b):\n", + " tokens_a.pop()\n", + " else:\n", + " tokens_b.pop()\n", + " \n", + "def get_inputs(left, right):\n", + "\n", + " input_ids, input_masks, segment_ids = [], [], []\n", + "\n", + " for i in tqdm(range(len(left))):\n", + " tokens_a = tokenizer.tokenize(' '.join(left[i]))\n", + " tokens_b = tokenizer.tokenize(' '.join(right[i]))\n", + " _truncate_seq_pair(tokens_a, tokens_b, MAX_SEQ_LENGTH - 3)\n", + "\n", + " tokens = []\n", + " segment_id = []\n", + " tokens.append(\"[CLS]\")\n", + " segment_id.append(0)\n", + " for token in tokens_a:\n", + " tokens.append(token)\n", + " segment_id.append(0)\n", + " tokens.append(\"[SEP]\")\n", + " segment_id.append(0)\n", + " for token in tokens_b:\n", + " tokens.append(token)\n", + " segment_id.append(1)\n", + " tokens.append(\"[SEP]\")\n", + " segment_id.append(1)\n", + " input_id = tokenizer.convert_tokens_to_ids(tokens)\n", + " input_mask = [1] * len(input_id)\n", + "\n", + " while len(input_id) < MAX_SEQ_LENGTH:\n", + " input_id.append(0)\n", + " input_mask.append(0)\n", + " segment_id.append(0)\n", + "\n", + " input_ids.append(input_id)\n", + " input_masks.append(input_mask)\n", + " segment_ids.append(segment_id)\n", + " \n", + " return input_ids, input_masks, segment_ids" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 403831/403831 [06:23<00:00, 1051.93it/s]\n" + ] + } + ], + "source": [ + "input_ids, input_masks, segment_ids = get_inputs(left, right)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "bert_config = modeling.BertConfig.from_json_file(BERT_CONFIG)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "epoch = 10\n", + "batch_size = 60\n", + "warmup_proportion = 0.1\n", + "num_train_steps = int(len(left) / batch_size * epoch)\n", + "num_warmup_steps = int(num_train_steps * warmup_proportion)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "class Model:\n", + " def __init__(\n", + " self,\n", + " dimension_output,\n", + " learning_rate = 2e-5,\n", + " ):\n", + " self.X = tf.placeholder(tf.int32, [None, None])\n", + " self.segment_ids = tf.placeholder(tf.int32, [None, None])\n", + " self.input_masks = tf.placeholder(tf.int32, [None, None])\n", + " self.Y = tf.placeholder(tf.int32, [None])\n", + " \n", + " model = modeling.BertModel(\n", + " config=bert_config,\n", + " is_training=True,\n", + " input_ids=self.X,\n", + " input_mask=self.input_masks,\n", + " token_type_ids=self.segment_ids,\n", + " use_one_hot_embeddings=False)\n", + " \n", + " output_layer = model.get_pooled_output()\n", + " 
self.logits = tf.layers.dense(output_layer, dimension_output)\n", + " self.logits = tf.identity(self.logits, name = 'logits')\n", + " \n", + " self.cost = tf.reduce_mean(\n", + " tf.nn.sparse_softmax_cross_entropy_with_logits(\n", + " logits = self.logits, labels = self.Y\n", + " )\n", + " )\n", + " \n", + " self.optimizer = optimization.create_optimizer(self.cost, learning_rate, \n", + " num_train_steps, num_warmup_steps, False)\n", + " correct_pred = tf.equal(\n", + " tf.argmax(self.logits, 1, output_type = tf.int32), self.Y\n", + " )\n", + " self.accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.\n", + "Instructions for updating:\n", + "Colocations handled automatically by placer.\n", + "\n", + "WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.\n", + "For more information, please see:\n", + " * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md\n", + " * https://github.com/tensorflow/addons\n", + "If you depend on functionality not listed there, please file an issue.\n", + "\n", + "WARNING:tensorflow:From /home/jupyter/.local/lib/python3.6/site-packages/bert/modeling.py:358: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.\n", + "Instructions for updating:\n", + "Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.\n", + "WARNING:tensorflow:From /home/jupyter/.local/lib/python3.6/site-packages/bert/modeling.py:671: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.\n", + "Instructions for updating:\n", + "Use keras.layers.dense instead.\n", + "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/learning_rate_decay_v2.py:321: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n", + "Instructions for updating:\n", + "Deprecated in favor of operator or tf.math.divide.\n", + "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n", + "Instructions for updating:\n", + "Use tf.cast instead.\n", + "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.\n", + "Instructions for updating:\n", + "Use standard file APIs to check for files with this prefix.\n", + "INFO:tensorflow:Restoring parameters from multi_cased_L-12_H-768_A-12/bert_model.ckpt\n" + ] + } + ], + "source": [ + "dimension_output = 2\n", + "learning_rate = 2e-5\n", + "\n", + "tf.reset_default_graph()\n", + "sess = tf.InteractiveSession()\n", + "model = Model(\n", + " dimension_output,\n", + " learning_rate\n", + ")\n", + "\n", + "sess.run(tf.global_variables_initializer())\n", + "var_lists = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope = 'bert')\n", + "saver = tf.train.Saver(var_list = var_lists)\n", + 
"saver.restore(sess, BERT_INIT_CHKPNT)" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "train_input_ids, test_input_ids, train_input_masks, test_input_masks, train_segment_ids, test_segment_ids, train_Y, test_Y = train_test_split(\n", + " input_ids, input_masks, segment_ids, label, test_size = 0.2\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "train minibatch loop: 100%|██████████| 5385/5385 [37:10<00:00, 2.84it/s, accuracy=0.75, cost=0.568] \n", + "test minibatch loop: 100%|██████████| 1347/1347 [03:05<00:00, 7.25it/s, accuracy=0.857, cost=0.425]\n", + "train minibatch loop: 0%| | 0/5385 [00:00 CURRENT_ACC:\n", + " print(\n", + " 'epoch: %d, pass acc: %f, current acc: %f'\n", + " % (EPOCH, CURRENT_ACC, test_acc)\n", + " )\n", + " CURRENT_ACC = test_acc\n", + " CURRENT_CHECKPOINT = 0\n", + " else:\n", + " CURRENT_CHECKPOINT += 1\n", + " \n", + " print('time taken:', time.time() - lasttime)\n", + " print(\n", + " 'epoch: %d, training loss: %f, training acc: %f, valid loss: %f, valid acc: %f\\n'\n", + " % (EPOCH, train_loss, train_acc, test_loss, test_acc)\n", + " )\n", + " EPOCH += 1" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "validation minibatch loop: 100%|██████████| 1347/1347 [03:03<00:00, 7.33it/s]\n" + ] + } + ], + "source": [ + "real_Y, predict_Y = [], []\n", + "\n", + "pbar = tqdm(\n", + " range(0, len(test_input_ids), batch_size), desc = 'validation minibatch loop'\n", + ")\n", + "for i in pbar:\n", + " index = min(i + batch_size, len(test_input_ids))\n", + " batch_x = test_input_ids[i: index]\n", + " batch_masks = test_input_masks[i: index]\n", + " batch_segment = test_segment_ids[i: index]\n", + " batch_y = test_Y[i: index]\n", + " predict_Y += np.argmax(sess.run(model.logits,\n", + " feed_dict = {\n", + " model.Y: batch_y,\n", + " model.X: batch_x,\n", + " model.segment_ids: batch_segment,\n", + " model.input_masks: batch_masks\n", + " },\n", + " ), 1, ).tolist()\n", + " real_Y += batch_y" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " precision recall f1-score support\n", + "\n", + "not similar 0.86 0.86 0.86 50757\n", + " similar 0.77 0.76 0.76 30010\n", + "\n", + "avg / total 0.83 0.83 0.83 80767\n", + "\n" + ] + } + ], + "source": [ + "from sklearn import metrics\n", + "\n", + "print(\n", + " metrics.classification_report(\n", + " real_Y, predict_Y, target_names = ['not similar', 'similar']\n", + " )\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'bert-similarity/model.ckpt'" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "saver = tf.train.Saver(tf.trainable_variables())\n", + "saver.save(sess, 'bert-similarity/model.ckpt')" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['Placeholder',\n", + " 'Placeholder_1',\n", + " 'Placeholder_2',\n", + " 'Placeholder_3',\n", + " 'bert/embeddings/word_embeddings',\n", + " 'bert/embeddings/token_type_embeddings',\n", + " 
'bert/embeddings/position_embeddings',\n", + " 'bert/embeddings/LayerNorm/gamma',\n", + " 'bert/encoder/layer_0/attention/self/query/kernel',\n", + " 'bert/encoder/layer_0/attention/self/query/bias',\n", + " 'bert/encoder/layer_0/attention/self/key/kernel',\n", + " 'bert/encoder/layer_0/attention/self/key/bias',\n", + " 'bert/encoder/layer_0/attention/self/value/kernel',\n", + " 'bert/encoder/layer_0/attention/self/value/bias',\n", + " 'bert/encoder/layer_0/attention/output/dense/kernel',\n", + " 'bert/encoder/layer_0/attention/output/dense/bias',\n", + " 'bert/encoder/layer_0/attention/output/LayerNorm/gamma',\n", + " 'bert/encoder/layer_0/intermediate/dense/kernel',\n", + " 'bert/encoder/layer_0/intermediate/dense/bias',\n", + " 'bert/encoder/layer_0/output/dense/kernel',\n", + " 'bert/encoder/layer_0/output/dense/bias',\n", + " 'bert/encoder/layer_0/output/LayerNorm/gamma',\n", + " 'bert/encoder/layer_1/attention/self/query/kernel',\n", + " 'bert/encoder/layer_1/attention/self/query/bias',\n", + " 'bert/encoder/layer_1/attention/self/key/kernel',\n", + " 'bert/encoder/layer_1/attention/self/key/bias',\n", + " 'bert/encoder/layer_1/attention/self/value/kernel',\n", + " 'bert/encoder/layer_1/attention/self/value/bias',\n", + " 'bert/encoder/layer_1/attention/output/dense/kernel',\n", + " 'bert/encoder/layer_1/attention/output/dense/bias',\n", + " 'bert/encoder/layer_1/attention/output/LayerNorm/gamma',\n", + " 'bert/encoder/layer_1/intermediate/dense/kernel',\n", + " 'bert/encoder/layer_1/intermediate/dense/bias',\n", + " 'bert/encoder/layer_1/output/dense/kernel',\n", + " 'bert/encoder/layer_1/output/dense/bias',\n", + " 'bert/encoder/layer_1/output/LayerNorm/gamma',\n", + " 'bert/encoder/layer_2/attention/self/query/kernel',\n", + " 'bert/encoder/layer_2/attention/self/query/bias',\n", + " 'bert/encoder/layer_2/attention/self/key/kernel',\n", + " 'bert/encoder/layer_2/attention/self/key/bias',\n", + " 'bert/encoder/layer_2/attention/self/value/kernel',\n", + " 'bert/encoder/layer_2/attention/self/value/bias',\n", + " 'bert/encoder/layer_2/attention/output/dense/kernel',\n", + " 'bert/encoder/layer_2/attention/output/dense/bias',\n", + " 'bert/encoder/layer_2/attention/output/LayerNorm/gamma',\n", + " 'bert/encoder/layer_2/intermediate/dense/kernel',\n", + " 'bert/encoder/layer_2/intermediate/dense/bias',\n", + " 'bert/encoder/layer_2/output/dense/kernel',\n", + " 'bert/encoder/layer_2/output/dense/bias',\n", + " 'bert/encoder/layer_2/output/LayerNorm/gamma',\n", + " 'bert/encoder/layer_3/attention/self/query/kernel',\n", + " 'bert/encoder/layer_3/attention/self/query/bias',\n", + " 'bert/encoder/layer_3/attention/self/key/kernel',\n", + " 'bert/encoder/layer_3/attention/self/key/bias',\n", + " 'bert/encoder/layer_3/attention/self/value/kernel',\n", + " 'bert/encoder/layer_3/attention/self/value/bias',\n", + " 'bert/encoder/layer_3/attention/output/dense/kernel',\n", + " 'bert/encoder/layer_3/attention/output/dense/bias',\n", + " 'bert/encoder/layer_3/attention/output/LayerNorm/gamma',\n", + " 'bert/encoder/layer_3/intermediate/dense/kernel',\n", + " 'bert/encoder/layer_3/intermediate/dense/bias',\n", + " 'bert/encoder/layer_3/output/dense/kernel',\n", + " 'bert/encoder/layer_3/output/dense/bias',\n", + " 'bert/encoder/layer_3/output/LayerNorm/gamma',\n", + " 'bert/encoder/layer_4/attention/self/query/kernel',\n", + " 'bert/encoder/layer_4/attention/self/query/bias',\n", + " 'bert/encoder/layer_4/attention/self/key/kernel',\n", + " 'bert/encoder/layer_4/attention/self/key/bias',\n", + " 
'bert/encoder/layer_4/attention/self/value/kernel',\n", + " 'bert/encoder/layer_4/attention/self/value/bias',\n", + " 'bert/encoder/layer_4/attention/output/dense/kernel',\n", + " 'bert/encoder/layer_4/attention/output/dense/bias',\n", + " 'bert/encoder/layer_4/attention/output/LayerNorm/gamma',\n", + " 'bert/encoder/layer_4/intermediate/dense/kernel',\n", + " 'bert/encoder/layer_4/intermediate/dense/bias',\n", + " 'bert/encoder/layer_4/output/dense/kernel',\n", + " 'bert/encoder/layer_4/output/dense/bias',\n", + " 'bert/encoder/layer_4/output/LayerNorm/gamma',\n", + " 'bert/encoder/layer_5/attention/self/query/kernel',\n", + " 'bert/encoder/layer_5/attention/self/query/bias',\n", + " 'bert/encoder/layer_5/attention/self/key/kernel',\n", + " 'bert/encoder/layer_5/attention/self/key/bias',\n", + " 'bert/encoder/layer_5/attention/self/value/kernel',\n", + " 'bert/encoder/layer_5/attention/self/value/bias',\n", + " 'bert/encoder/layer_5/attention/output/dense/kernel',\n", + " 'bert/encoder/layer_5/attention/output/dense/bias',\n", + " 'bert/encoder/layer_5/attention/output/LayerNorm/gamma',\n", + " 'bert/encoder/layer_5/intermediate/dense/kernel',\n", + " 'bert/encoder/layer_5/intermediate/dense/bias',\n", + " 'bert/encoder/layer_5/output/dense/kernel',\n", + " 'bert/encoder/layer_5/output/dense/bias',\n", + " 'bert/encoder/layer_5/output/LayerNorm/gamma',\n", + " 'bert/encoder/layer_6/attention/self/query/kernel',\n", + " 'bert/encoder/layer_6/attention/self/query/bias',\n", + " 'bert/encoder/layer_6/attention/self/key/kernel',\n", + " 'bert/encoder/layer_6/attention/self/key/bias',\n", + " 'bert/encoder/layer_6/attention/self/value/kernel',\n", + " 'bert/encoder/layer_6/attention/self/value/bias',\n", + " 'bert/encoder/layer_6/attention/output/dense/kernel',\n", + " 'bert/encoder/layer_6/attention/output/dense/bias',\n", + " 'bert/encoder/layer_6/attention/output/LayerNorm/gamma',\n", + " 'bert/encoder/layer_6/intermediate/dense/kernel',\n", + " 'bert/encoder/layer_6/intermediate/dense/bias',\n", + " 'bert/encoder/layer_6/output/dense/kernel',\n", + " 'bert/encoder/layer_6/output/dense/bias',\n", + " 'bert/encoder/layer_6/output/LayerNorm/gamma',\n", + " 'bert/encoder/layer_7/attention/self/query/kernel',\n", + " 'bert/encoder/layer_7/attention/self/query/bias',\n", + " 'bert/encoder/layer_7/attention/self/key/kernel',\n", + " 'bert/encoder/layer_7/attention/self/key/bias',\n", + " 'bert/encoder/layer_7/attention/self/value/kernel',\n", + " 'bert/encoder/layer_7/attention/self/value/bias',\n", + " 'bert/encoder/layer_7/attention/output/dense/kernel',\n", + " 'bert/encoder/layer_7/attention/output/dense/bias',\n", + " 'bert/encoder/layer_7/attention/output/LayerNorm/gamma',\n", + " 'bert/encoder/layer_7/intermediate/dense/kernel',\n", + " 'bert/encoder/layer_7/intermediate/dense/bias',\n", + " 'bert/encoder/layer_7/output/dense/kernel',\n", + " 'bert/encoder/layer_7/output/dense/bias',\n", + " 'bert/encoder/layer_7/output/LayerNorm/gamma',\n", + " 'bert/encoder/layer_8/attention/self/query/kernel',\n", + " 'bert/encoder/layer_8/attention/self/query/bias',\n", + " 'bert/encoder/layer_8/attention/self/key/kernel',\n", + " 'bert/encoder/layer_8/attention/self/key/bias',\n", + " 'bert/encoder/layer_8/attention/self/value/kernel',\n", + " 'bert/encoder/layer_8/attention/self/value/bias',\n", + " 'bert/encoder/layer_8/attention/output/dense/kernel',\n", + " 'bert/encoder/layer_8/attention/output/dense/bias',\n", + " 'bert/encoder/layer_8/attention/output/LayerNorm/gamma',\n", + " 
'bert/encoder/layer_8/intermediate/dense/kernel',\n", + " 'bert/encoder/layer_8/intermediate/dense/bias',\n", + " 'bert/encoder/layer_8/output/dense/kernel',\n", + " 'bert/encoder/layer_8/output/dense/bias',\n", + " 'bert/encoder/layer_8/output/LayerNorm/gamma',\n", + " 'bert/encoder/layer_9/attention/self/query/kernel',\n", + " 'bert/encoder/layer_9/attention/self/query/bias',\n", + " 'bert/encoder/layer_9/attention/self/key/kernel',\n", + " 'bert/encoder/layer_9/attention/self/key/bias',\n", + " 'bert/encoder/layer_9/attention/self/value/kernel',\n", + " 'bert/encoder/layer_9/attention/self/value/bias',\n", + " 'bert/encoder/layer_9/attention/output/dense/kernel',\n", + " 'bert/encoder/layer_9/attention/output/dense/bias',\n", + " 'bert/encoder/layer_9/attention/output/LayerNorm/gamma',\n", + " 'bert/encoder/layer_9/intermediate/dense/kernel',\n", + " 'bert/encoder/layer_9/intermediate/dense/bias',\n", + " 'bert/encoder/layer_9/output/dense/kernel',\n", + " 'bert/encoder/layer_9/output/dense/bias',\n", + " 'bert/encoder/layer_9/output/LayerNorm/gamma',\n", + " 'bert/encoder/layer_10/attention/self/query/kernel',\n", + " 'bert/encoder/layer_10/attention/self/query/bias',\n", + " 'bert/encoder/layer_10/attention/self/key/kernel',\n", + " 'bert/encoder/layer_10/attention/self/key/bias',\n", + " 'bert/encoder/layer_10/attention/self/value/kernel',\n", + " 'bert/encoder/layer_10/attention/self/value/bias',\n", + " 'bert/encoder/layer_10/attention/output/dense/kernel',\n", + " 'bert/encoder/layer_10/attention/output/dense/bias',\n", + " 'bert/encoder/layer_10/attention/output/LayerNorm/gamma',\n", + " 'bert/encoder/layer_10/intermediate/dense/kernel',\n", + " 'bert/encoder/layer_10/intermediate/dense/bias',\n", + " 'bert/encoder/layer_10/output/dense/kernel',\n", + " 'bert/encoder/layer_10/output/dense/bias',\n", + " 'bert/encoder/layer_10/output/LayerNorm/gamma',\n", + " 'bert/encoder/layer_11/attention/self/query/kernel',\n", + " 'bert/encoder/layer_11/attention/self/query/bias',\n", + " 'bert/encoder/layer_11/attention/self/key/kernel',\n", + " 'bert/encoder/layer_11/attention/self/key/bias',\n", + " 'bert/encoder/layer_11/attention/self/value/kernel',\n", + " 'bert/encoder/layer_11/attention/self/value/bias',\n", + " 'bert/encoder/layer_11/attention/output/dense/kernel',\n", + " 'bert/encoder/layer_11/attention/output/dense/bias',\n", + " 'bert/encoder/layer_11/attention/output/LayerNorm/gamma',\n", + " 'bert/encoder/layer_11/intermediate/dense/kernel',\n", + " 'bert/encoder/layer_11/intermediate/dense/bias',\n", + " 'bert/encoder/layer_11/output/dense/kernel',\n", + " 'bert/encoder/layer_11/output/dense/bias',\n", + " 'bert/encoder/layer_11/output/LayerNorm/gamma',\n", + " 'bert/pooler/dense/kernel',\n", + " 'bert/pooler/dense/bias',\n", + " 'dense/kernel',\n", + " 'dense/bias',\n", + " 'logits']" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "strings = ','.join(\n", + " [\n", + " n.name\n", + " for n in tf.get_default_graph().as_graph_def().node\n", + " if ('Variable' in n.op\n", + " or 'Placeholder' in n.name\n", + " or 'logits' in n.name\n", + " or 'alphas' in n.name)\n", + " and 'adam' not in n.name\n", + " and 'beta' not in n.name\n", + " and 'global_step' not in n.name\n", + " ]\n", + ")\n", + "strings.split(',')" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "def freeze_graph(model_dir, output_node_names):\n", + "\n", + " if not 
tf.gfile.Exists(model_dir):\n", + " raise AssertionError(\n", + " \"Export directory doesn't exists. Please specify an export \"\n", + " 'directory: %s' % model_dir\n", + " )\n", + "\n", + " checkpoint = tf.train.get_checkpoint_state(model_dir)\n", + " input_checkpoint = checkpoint.model_checkpoint_path\n", + "\n", + " absolute_model_dir = '/'.join(input_checkpoint.split('/')[:-1])\n", + " output_graph = absolute_model_dir + '/frozen_model.pb'\n", + " clear_devices = True\n", + " with tf.Session(graph = tf.Graph()) as sess:\n", + " saver = tf.train.import_meta_graph(\n", + " input_checkpoint + '.meta', clear_devices = clear_devices\n", + " )\n", + " saver.restore(sess, input_checkpoint)\n", + " output_graph_def = tf.graph_util.convert_variables_to_constants(\n", + " sess,\n", + " tf.get_default_graph().as_graph_def(),\n", + " output_node_names.split(','),\n", + " )\n", + " with tf.gfile.GFile(output_graph, 'wb') as f:\n", + " f.write(output_graph_def.SerializeToString())\n", + " print('%d ops in the final graph.' % len(output_graph_def.node))" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:tensorflow:Restoring parameters from bert-similarity/model.ckpt\n", + "WARNING:tensorflow:From :23: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.\n", + "Instructions for updating:\n", + "Use tf.compat.v1.graph_util.convert_variables_to_constants\n", + "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/graph_util_impl.py:245: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.\n", + "Instructions for updating:\n", + "Use tf.compat.v1.graph_util.extract_sub_graph\n", + "INFO:tensorflow:Froze 201 variables.\n", + "INFO:tensorflow:Converted 201 variables to const ops.\n", + "2132 ops in the final graph.\n" + ] + } + ], + "source": [ + "freeze_graph('bert-similarity', strings)" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [], + "source": [ + "def load_graph(frozen_graph_filename):\n", + " with tf.gfile.GFile(frozen_graph_filename, 'rb') as f:\n", + " graph_def = tf.GraphDef()\n", + " graph_def.ParseFromString(f.read())\n", + " with tf.Graph().as_default() as graph:\n", + " tf.import_graph_def(graph_def)\n", + " return graph" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[9.8868495e-01, 1.1315098e-02],\n", + " [9.5157367e-01, 4.8426390e-02],\n", + " [9.8955989e-01, 1.0440130e-02],\n", + " [9.9740845e-01, 2.5915783e-03],\n", + " [6.8217957e-01, 3.1782040e-01],\n", + " [9.9995232e-01, 4.7634694e-05],\n", + " [9.6060568e-01, 3.9394356e-02]], dtype=float32)" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "g = load_graph('bert-similarity/frozen_model.pb')\n", + "x = g.get_tensor_by_name('import/Placeholder:0')\n", + "segment_ids = g.get_tensor_by_name('import/Placeholder_1:0')\n", + "input_masks = g.get_tensor_by_name('import/Placeholder_2:0')\n", + "logits = g.get_tensor_by_name('import/logits:0')\n", + "test_sess = tf.InteractiveSession(graph = g)\n", + "result = test_sess.run(tf.nn.softmax(logits), feed_dict = {x: batch_x,\n", + " segment_ids: batch_segment,\n", + " input_masks: batch_masks})\n", 
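+ "# each row of result is [P(not similar), P(similar)] after the softmax over the two logits\n",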
+ "result" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([1.1315098e-02, 4.8426390e-02, 1.0440130e-02, 2.5915783e-03,\n", + " 3.1782040e-01, 4.7634694e-05, 3.9394356e-02], dtype=float32)" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "result[:,1]" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[0, 0, 0, 0, 1, 0, 0]" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "batch_y" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/session/similarity/dilated-cnn-contrastive.ipynb b/session/similarity/dilated-cnn-contrastive.ipynb new file mode 100644 index 00000000..f96d245b --- /dev/null +++ b/session/similarity/dilated-cnn-contrastive.ipynb @@ -0,0 +1,873 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import tensorflow as tf\n", + "import re\n", + "import numpy as np\n", + "import pandas as pd\n", + "from tqdm import tqdm\n", + "import collections\n", + "import itertools\n", + "from unidecode import unidecode\n", + "import malaya\n", + "import re\n", + "import json" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "def build_dataset(words, n_words, atleast=2):\n", + " count = [['PAD', 0], ['GO', 1], ['EOS', 2], ['UNK', 3]]\n", + " counter = collections.Counter(words).most_common(n_words - 10)\n", + " counter = [i for i in counter if i[1] >= atleast]\n", + " count.extend(counter)\n", + " dictionary = dict()\n", + " for word, _ in count:\n", + " dictionary[word] = len(dictionary)\n", + " data = list()\n", + " unk_count = 0\n", + " for word in words:\n", + " index = dictionary.get(word, 0)\n", + " if index == 0:\n", + " unk_count += 1\n", + " data.append(index)\n", + " count[0][1] = unk_count\n", + " reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))\n", + " return data, count, dictionary, reversed_dictionary\n", + "\n", + "def str_idx(corpus, dic, maxlen, UNK = 3):\n", + " X = np.zeros((len(corpus), maxlen))\n", + " for i in range(len(corpus)):\n", + " for no, k in enumerate(corpus[i][:maxlen]):\n", + " X[i, no] = dic.get(k, UNK)\n", + " return X\n", + "\n", + "tokenizer = malaya.preprocessing._SocialTokenizer().tokenize\n", + "\n", + "def is_number_regex(s):\n", + " if re.match(\"^\\d+?\\.\\d+?$\", s) is None:\n", + " return s.isdigit()\n", + " return True\n", + "\n", + "def detect_money(word):\n", + " if word[:2] == 'rm' and is_number_regex(word[2:]):\n", + " return True\n", + " else:\n", + " return False\n", + "\n", + "def preprocessing(string):\n", + " tokenized = tokenizer(string)\n", + " tokenized = [w.lower() for w in tokenized if len(w) > 2]\n", + " tokenized = ['' if is_number_regex(w) else w for w in tokenized]\n", + " tokenized = ['' if 
detect_money(w) else w for w in tokenized]\n", + " return tokenized" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "with open('train-similarity.json') as fopen:\n", + " train = json.load(fopen)\n", + " \n", + "left, right, label = train['left'], train['right'], train['label']" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "with open('test-similarity.json') as fopen:\n", + " test = json.load(fopen)\n", + "test_left, test_right, test_label = test['left'], test['right'], test['label']" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(array([0, 1]), array([2605321, 1531070]))" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.unique(label, return_counts = True)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "with open('similarity-dictionary.json') as fopen:\n", + " x = json.load(fopen)\n", + "dictionary = x['dictionary']\n", + "rev_dictionary = x['reverse_dictionary']" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "def position_encoding(inputs):\n", + " T = tf.shape(inputs)[1]\n", + " repr_dim = inputs.get_shape()[-1].value\n", + " pos = tf.reshape(tf.range(0.0, tf.to_float(T), dtype=tf.float32), [-1, 1])\n", + " i = np.arange(0, repr_dim, 2, np.float32)\n", + " denom = np.reshape(np.power(10000.0, i / repr_dim), [1, -1])\n", + " enc = tf.expand_dims(tf.concat([tf.sin(pos / denom), tf.cos(pos / denom)], 1), 0)\n", + " return tf.tile(enc, [tf.shape(inputs)[0], 1, 1])\n", + "\n", + "def layer_norm(inputs, epsilon=1e-8):\n", + " mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)\n", + " normalized = (inputs - mean) / (tf.sqrt(variance + epsilon))\n", + " params_shape = inputs.get_shape()[-1:]\n", + " gamma = tf.get_variable('gamma', params_shape, tf.float32, tf.ones_initializer())\n", + " beta = tf.get_variable('beta', params_shape, tf.float32, tf.zeros_initializer())\n", + " return gamma * normalized + beta\n", + "\n", + "def cnn_block(x, dilation_rate, pad_sz, hidden_dim, kernel_size):\n", + " x = layer_norm(x)\n", + " pad = tf.zeros([tf.shape(x)[0], pad_sz, hidden_dim])\n", + " x = tf.layers.conv1d(inputs = tf.concat([pad, x, pad], 1),\n", + " filters = hidden_dim,\n", + " kernel_size = kernel_size,\n", + " dilation_rate = dilation_rate)\n", + " x = x[:, :-pad_sz, :]\n", + " x = tf.nn.relu(x)\n", + " return x\n", + "\n", + "class Model:\n", + " def __init__(self, size_layer, num_layers, embedded_size,\n", + " dict_size, learning_rate, dropout, kernel_size = 5):\n", + " \n", + " def cnn(x, scope):\n", + " x += position_encoding(x)\n", + " with tf.variable_scope(scope, reuse = tf.AUTO_REUSE):\n", + " for n in range(num_layers):\n", + " dilation_rate = 2 ** n\n", + " pad_sz = (kernel_size - 1) * dilation_rate \n", + " with tf.variable_scope('block_%d'%n,reuse=tf.AUTO_REUSE):\n", + " x += cnn_block(x, dilation_rate, pad_sz, size_layer, kernel_size)\n", + " \n", + " with tf.variable_scope('logits', reuse=tf.AUTO_REUSE):\n", + " return tf.layers.dense(x, size_layer)[:, -1]\n", + " \n", + " self.X_left = tf.placeholder(tf.int32, [None, None])\n", + " self.X_right = tf.placeholder(tf.int32, [None, None])\n", + " self.Y = tf.placeholder(tf.float32, [None])\n", + " self.batch_size = 
tf.shape(self.X_left)[0]\n", + " encoder_embeddings = tf.Variable(tf.random_uniform([dict_size, embedded_size], -1, 1))\n", + " embedded_left = tf.nn.embedding_lookup(encoder_embeddings, self.X_left)\n", + " embedded_right = tf.nn.embedding_lookup(encoder_embeddings, self.X_right)\n", + " \n", + " def contrastive_loss(y,d):\n", + " tmp= y * tf.square(d)\n", + " tmp2 = (1-y) * tf.square(tf.maximum((1 - d),0))\n", + " return tf.reduce_sum(tmp +tmp2)/tf.cast(self.batch_size,tf.float32)/2\n", + " \n", + " self.output_left = cnn(embedded_left, 'left')\n", + " self.output_right = cnn(embedded_right, 'right')\n", + " print(self.output_left, self.output_right)\n", + " self.distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(self.output_left,self.output_right)),\n", + " 1,keep_dims=True))\n", + " self.distance = tf.div(self.distance, tf.add(tf.sqrt(tf.reduce_sum(tf.square(self.output_left),\n", + " 1,keep_dims=True)),\n", + " tf.sqrt(tf.reduce_sum(tf.square(self.output_right),\n", + " 1,keep_dims=True))))\n", + " self.distance = tf.reshape(self.distance, [-1])\n", + " self.logits = tf.identity(self.distance, name = 'logits')\n", + " self.cost = contrastive_loss(self.Y,self.distance)\n", + " \n", + " self.temp_sim = tf.subtract(tf.ones_like(self.distance),\n", + " tf.rint(self.distance))\n", + " correct_predictions = tf.equal(self.temp_sim, self.Y)\n", + " self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, \"float\"))\n", + " self.optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(self.cost)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "size_layer = 128\n", + "num_layers = 4\n", + "embedded_size = 128\n", + "learning_rate = 1e-3\n", + "maxlen = 50\n", + "batch_size = 128\n", + "dropout = 0.8" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Tensor(\"left/logits/strided_slice:0\", shape=(?, 128), dtype=float32) Tensor(\"right/logits/strided_slice:0\", shape=(?, 128), dtype=float32)\n", + "WARNING:tensorflow:From :62: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.\n", + "Instructions for updating:\n", + "keep_dims is deprecated, use keepdims instead\n" + ] + }, + { + "data": { + "text/plain": [ + "'dilated-cnn/model.ckpt'" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "tf.reset_default_graph()\n", + "sess = tf.InteractiveSession()\n", + "model = Model(size_layer,num_layers,embedded_size,len(dictionary),learning_rate,dropout)\n", + "sess.run(tf.global_variables_initializer())\n", + "saver = tf.train.Saver(tf.trainable_variables())\n", + "saver.save(sess, 'dilated-cnn/model.ckpt')" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "strings = ','.join(\n", + " [\n", + " n.name\n", + " for n in tf.get_default_graph().as_graph_def().node\n", + " if ('Variable' in n.op\n", + " or 'Placeholder' in n.name\n", + " or 'logits' in n.name\n", + " or 'alphas' in n.name)\n", + " and 'Adam' not in n.name\n", + " and '_power' not in n.name\n", + " and 'gradient' not in n.name\n", + " and 'Initializer' not in n.name\n", + " and 'Assign' not in n.name\n", + " ]\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": 
"stream", + "text": [ + "train minibatch loop: 100%|██████████| 32316/32316 [39:10<00:00, 13.75it/s, accuracy=0.549, cost=0.119] \n", + "test minibatch loop: 100%|██████████| 391/391 [00:08<00:00, 44.01it/s, accuracy=0.725, cost=0.0879]\n", + "train minibatch loop: 0%| | 2/32316 [00:00<39:09, 13.76it/s, accuracy=0.852, cost=0.0575]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "epoch: 0, pass acc: 0.000000, current acc: 0.754236\n", + "time taken: 2359.090886592865\n", + "epoch: 0, training loss: 0.088377, training acc: 0.736193, valid loss: 0.083123, valid acc: 0.754236\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "train minibatch loop: 100%|██████████| 32316/32316 [39:10<00:00, 14.49it/s, accuracy=0.577, cost=0.0901] \n", + "test minibatch loop: 100%|██████████| 391/391 [00:08<00:00, 45.23it/s, accuracy=0.712, cost=0.0807]\n", + "train minibatch loop: 0%| | 2/32316 [00:00<38:55, 13.84it/s, accuracy=0.898, cost=0.0552]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "epoch: 0, pass acc: 0.754236, current acc: 0.770444\n", + "time taken: 2359.3658940792084\n", + "epoch: 0, training loss: 0.075040, training acc: 0.782331, valid loss: 0.078482, valid acc: 0.770444\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "train minibatch loop: 100%|██████████| 32316/32316 [39:12<00:00, 14.52it/s, accuracy=0.775, cost=0.0608] \n", + "test minibatch loop: 100%|██████████| 391/391 [00:08<00:00, 45.24it/s, accuracy=0.725, cost=0.0816]\n", + "train minibatch loop: 0%| | 2/32316 [00:00<39:00, 13.81it/s, accuracy=0.906, cost=0.0465]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "epoch: 0, pass acc: 0.770444, current acc: 0.773316\n", + "time taken: 2360.77591919899\n", + "epoch: 0, training loss: 0.065331, training acc: 0.815129, valid loss: 0.078364, valid acc: 0.773316\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "train minibatch loop: 100%|██████████| 32316/32316 [39:12<00:00, 14.52it/s, accuracy=0.831, cost=0.0589] \n", + "test minibatch loop: 100%|██████████| 391/391 [00:08<00:00, 45.04it/s, accuracy=0.712, cost=0.088] \n", + "train minibatch loop: 0%| | 2/32316 [00:00<39:09, 13.75it/s, accuracy=0.945, cost=0.0312]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "time taken: 2361.537866592407\n", + "epoch: 0, training loss: 0.055691, training acc: 0.847078, valid loss: 0.078892, valid acc: 0.772364\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "train minibatch loop: 100%|██████████| 32316/32316 [39:10<00:00, 14.47it/s, accuracy=0.901, cost=0.0346] \n", + "test minibatch loop: 100%|██████████| 391/391 [00:08<00:00, 45.22it/s, accuracy=0.725, cost=0.0924]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "time taken: 2359.5909333229065\n", + "epoch: 0, training loss: 0.046877, training acc: 0.875735, valid loss: 0.080502, valid acc: 0.771336\n", + "\n", + "break epoch:0\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\n" + ] + } + ], + "source": [ + "import time\n", + "\n", + "EARLY_STOPPING, CURRENT_CHECKPOINT, CURRENT_ACC, EPOCH = 2, 0, 0, 0\n", + "\n", + "while True:\n", + " lasttime = time.time()\n", + " if CURRENT_CHECKPOINT == EARLY_STOPPING:\n", + " print('break epoch:%d\\n' % (EPOCH))\n", + " break\n", + "\n", + " train_acc, train_loss, test_acc, test_loss = 0, 0, 0, 0\n", 
+ " pbar = tqdm(range(0, len(left), batch_size), desc='train minibatch loop')\n", + " for i in pbar:\n", + " index = min(i+batch_size,len(left))\n", + " batch_x_left = str_idx(left[i: index], dictionary, maxlen)\n", + " batch_x_right = str_idx(right[i: index], dictionary, maxlen)\n", + " batch_y = label[i:index]\n", + " acc, loss, _ = sess.run([model.accuracy, model.cost, model.optimizer], \n", + " feed_dict = {model.X_left : batch_x_left, \n", + " model.X_right: batch_x_right,\n", + " model.Y : batch_y})\n", + " assert not np.isnan(loss)\n", + " train_loss += loss\n", + " train_acc += acc\n", + " pbar.set_postfix(cost=loss, accuracy = acc)\n", + " \n", + " pbar = tqdm(range(0, len(test_left), batch_size), desc='test minibatch loop')\n", + " for i in pbar:\n", + " index = min(i+batch_size,len(test_left))\n", + " batch_x_left = str_idx(test_left[i: index], dictionary, maxlen)\n", + " batch_x_right = str_idx(test_right[i: index], dictionary, maxlen)\n", + " batch_y = test_label[i: index]\n", + " acc, loss = sess.run([model.accuracy, model.cost], \n", + " feed_dict = {model.X_left : batch_x_left, \n", + " model.X_right: batch_x_right,\n", + " model.Y : batch_y})\n", + " test_loss += loss\n", + " test_acc += acc\n", + " pbar.set_postfix(cost=loss, accuracy = acc)\n", + " \n", + " train_loss /= (len(left) / batch_size)\n", + " train_acc /= (len(left) / batch_size)\n", + " test_loss /= (len(test_left) / batch_size)\n", + " test_acc /= (len(test_left) / batch_size)\n", + " \n", + " if test_acc > CURRENT_ACC:\n", + " print(\n", + " 'epoch: %d, pass acc: %f, current acc: %f'\n", + " % (EPOCH, CURRENT_ACC, test_acc)\n", + " )\n", + " CURRENT_ACC = test_acc\n", + " CURRENT_CHECKPOINT = 0\n", + " else:\n", + " CURRENT_CHECKPOINT += 1\n", + " \n", + " print('time taken:', time.time()-lasttime)\n", + " print('epoch: %d, training loss: %f, training acc: %f, valid loss: %f, valid acc: %f\\n'%(EPOCH,train_loss,\n", + " train_acc,test_loss,\n", + " test_acc))" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'dilated-cnn/model.ckpt'" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "saver.save(sess, 'dilated-cnn/model.ckpt')" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[array([0.], dtype=float32), array([0.0343591], dtype=float32)]" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "left = str_idx(['a person is outdoors, on a horse.'], dictionary, maxlen)\n", + "right = str_idx(['a person on a horse jumps over a broken down airplane.'], dictionary, maxlen)\n", + "sess.run([model.temp_sim,1-model.distance], feed_dict = {model.X_left : left, \n", + " model.X_right: right})" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "validation minibatch loop: 100%|██████████| 391/391 [00:08<00:00, 47.37it/s]\n" + ] + } + ], + "source": [ + "real_Y, predict_Y = [], []\n", + "\n", + "pbar = tqdm(\n", + " range(0, len(test_left), batch_size), desc = 'validation minibatch loop'\n", + ")\n", + "for i in pbar:\n", + " index = min(i+batch_size,len(test_left))\n", + " batch_x_left = str_idx(test_left[i: index], dictionary, maxlen)\n", + " batch_x_right = str_idx(test_right[i: index], dictionary, maxlen)\n", + " batch_y = 
test_label[i: index]\n", + " predict_Y += sess.run(model.temp_sim, feed_dict = {model.X_left : batch_x_left, \n", + " model.X_right: batch_x_right,\n", + " model.Y : batch_y}).tolist()\n", + " real_Y += batch_y" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " precision recall f1-score support\n", + "\n", + "not similar 0.82 0.82 0.82 31524\n", + " similar 0.69 0.69 0.69 18476\n", + "\n", + "avg / total 0.77 0.77 0.77 50000\n", + "\n" + ] + } + ], + "source": [ + "from sklearn import metrics\n", + "\n", + "print(\n", + " metrics.classification_report(\n", + " real_Y, predict_Y, target_names = ['not similar', 'similar']\n", + " )\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['Placeholder',\n", + " 'Placeholder_1',\n", + " 'Placeholder_2',\n", + " 'Variable',\n", + " 'left/block_0/gamma',\n", + " 'left/block_0/beta',\n", + " 'left/block_0/conv1d/kernel',\n", + " 'left/block_0/conv1d/bias',\n", + " 'left/block_1/gamma',\n", + " 'left/block_1/beta',\n", + " 'left/block_1/conv1d/kernel',\n", + " 'left/block_1/conv1d/bias',\n", + " 'left/block_2/gamma',\n", + " 'left/block_2/beta',\n", + " 'left/block_2/conv1d/kernel',\n", + " 'left/block_2/conv1d/bias',\n", + " 'left/block_3/gamma',\n", + " 'left/block_3/beta',\n", + " 'left/block_3/conv1d/kernel',\n", + " 'left/block_3/conv1d/bias',\n", + " 'left/logits/dense/kernel',\n", + " 'left/logits/dense/kernel/read',\n", + " 'left/logits/dense/bias',\n", + " 'left/logits/dense/bias/read',\n", + " 'left/logits/dense/Tensordot/Shape',\n", + " 'left/logits/dense/Tensordot/Rank',\n", + " 'left/logits/dense/Tensordot/axes',\n", + " 'left/logits/dense/Tensordot/GreaterEqual/y',\n", + " 'left/logits/dense/Tensordot/GreaterEqual',\n", + " 'left/logits/dense/Tensordot/Cast',\n", + " 'left/logits/dense/Tensordot/mul',\n", + " 'left/logits/dense/Tensordot/Less/y',\n", + " 'left/logits/dense/Tensordot/Less',\n", + " 'left/logits/dense/Tensordot/Cast_1',\n", + " 'left/logits/dense/Tensordot/add',\n", + " 'left/logits/dense/Tensordot/mul_1',\n", + " 'left/logits/dense/Tensordot/add_1',\n", + " 'left/logits/dense/Tensordot/range/start',\n", + " 'left/logits/dense/Tensordot/range/delta',\n", + " 'left/logits/dense/Tensordot/range',\n", + " 'left/logits/dense/Tensordot/ListDiff',\n", + " 'left/logits/dense/Tensordot/GatherV2/axis',\n", + " 'left/logits/dense/Tensordot/GatherV2',\n", + " 'left/logits/dense/Tensordot/GatherV2_1/axis',\n", + " 'left/logits/dense/Tensordot/GatherV2_1',\n", + " 'left/logits/dense/Tensordot/Const',\n", + " 'left/logits/dense/Tensordot/Prod',\n", + " 'left/logits/dense/Tensordot/Const_1',\n", + " 'left/logits/dense/Tensordot/Prod_1',\n", + " 'left/logits/dense/Tensordot/concat/axis',\n", + " 'left/logits/dense/Tensordot/concat',\n", + " 'left/logits/dense/Tensordot/concat_1/axis',\n", + " 'left/logits/dense/Tensordot/concat_1',\n", + " 'left/logits/dense/Tensordot/stack',\n", + " 'left/logits/dense/Tensordot/transpose',\n", + " 'left/logits/dense/Tensordot/Reshape',\n", + " 'left/logits/dense/Tensordot/transpose_1/perm',\n", + " 'left/logits/dense/Tensordot/transpose_1',\n", + " 'left/logits/dense/Tensordot/Reshape_1/shape',\n", + " 'left/logits/dense/Tensordot/Reshape_1',\n", + " 'left/logits/dense/Tensordot/MatMul',\n", + " 'left/logits/dense/Tensordot/Const_2',\n", + " 'left/logits/dense/Tensordot/concat_2/axis',\n", + " 
'left/logits/dense/Tensordot/concat_2',\n", + " 'left/logits/dense/Tensordot',\n", + " 'left/logits/dense/BiasAdd',\n", + " 'left/logits/strided_slice/stack',\n", + " 'left/logits/strided_slice/stack_1',\n", + " 'left/logits/strided_slice/stack_2',\n", + " 'left/logits/strided_slice',\n", + " 'right/block_0/gamma',\n", + " 'right/block_0/beta',\n", + " 'right/block_0/conv1d/kernel',\n", + " 'right/block_0/conv1d/bias',\n", + " 'right/block_1/gamma',\n", + " 'right/block_1/beta',\n", + " 'right/block_1/conv1d/kernel',\n", + " 'right/block_1/conv1d/bias',\n", + " 'right/block_2/gamma',\n", + " 'right/block_2/beta',\n", + " 'right/block_2/conv1d/kernel',\n", + " 'right/block_2/conv1d/bias',\n", + " 'right/block_3/gamma',\n", + " 'right/block_3/beta',\n", + " 'right/block_3/conv1d/kernel',\n", + " 'right/block_3/conv1d/bias',\n", + " 'right/logits/dense/kernel',\n", + " 'right/logits/dense/kernel/read',\n", + " 'right/logits/dense/bias',\n", + " 'right/logits/dense/bias/read',\n", + " 'right/logits/dense/Tensordot/Shape',\n", + " 'right/logits/dense/Tensordot/Rank',\n", + " 'right/logits/dense/Tensordot/axes',\n", + " 'right/logits/dense/Tensordot/GreaterEqual/y',\n", + " 'right/logits/dense/Tensordot/GreaterEqual',\n", + " 'right/logits/dense/Tensordot/Cast',\n", + " 'right/logits/dense/Tensordot/mul',\n", + " 'right/logits/dense/Tensordot/Less/y',\n", + " 'right/logits/dense/Tensordot/Less',\n", + " 'right/logits/dense/Tensordot/Cast_1',\n", + " 'right/logits/dense/Tensordot/add',\n", + " 'right/logits/dense/Tensordot/mul_1',\n", + " 'right/logits/dense/Tensordot/add_1',\n", + " 'right/logits/dense/Tensordot/range/start',\n", + " 'right/logits/dense/Tensordot/range/delta',\n", + " 'right/logits/dense/Tensordot/range',\n", + " 'right/logits/dense/Tensordot/ListDiff',\n", + " 'right/logits/dense/Tensordot/GatherV2/axis',\n", + " 'right/logits/dense/Tensordot/GatherV2',\n", + " 'right/logits/dense/Tensordot/GatherV2_1/axis',\n", + " 'right/logits/dense/Tensordot/GatherV2_1',\n", + " 'right/logits/dense/Tensordot/Const',\n", + " 'right/logits/dense/Tensordot/Prod',\n", + " 'right/logits/dense/Tensordot/Const_1',\n", + " 'right/logits/dense/Tensordot/Prod_1',\n", + " 'right/logits/dense/Tensordot/concat/axis',\n", + " 'right/logits/dense/Tensordot/concat',\n", + " 'right/logits/dense/Tensordot/concat_1/axis',\n", + " 'right/logits/dense/Tensordot/concat_1',\n", + " 'right/logits/dense/Tensordot/stack',\n", + " 'right/logits/dense/Tensordot/transpose',\n", + " 'right/logits/dense/Tensordot/Reshape',\n", + " 'right/logits/dense/Tensordot/transpose_1/perm',\n", + " 'right/logits/dense/Tensordot/transpose_1',\n", + " 'right/logits/dense/Tensordot/Reshape_1/shape',\n", + " 'right/logits/dense/Tensordot/Reshape_1',\n", + " 'right/logits/dense/Tensordot/MatMul',\n", + " 'right/logits/dense/Tensordot/Const_2',\n", + " 'right/logits/dense/Tensordot/concat_2/axis',\n", + " 'right/logits/dense/Tensordot/concat_2',\n", + " 'right/logits/dense/Tensordot',\n", + " 'right/logits/dense/BiasAdd',\n", + " 'right/logits/strided_slice/stack',\n", + " 'right/logits/strided_slice/stack_1',\n", + " 'right/logits/strided_slice/stack_2',\n", + " 'right/logits/strided_slice',\n", + " 'logits']" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "strings.split(',')" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "def freeze_graph(model_dir, output_node_names):\n", + "\n", + " if not 
tf.gfile.Exists(model_dir):\n", + " raise AssertionError(\n", + " \"Export directory doesn't exists. Please specify an export \"\n", + " 'directory: %s' % model_dir\n", + " )\n", + "\n", + " checkpoint = tf.train.get_checkpoint_state(model_dir)\n", + " input_checkpoint = checkpoint.model_checkpoint_path\n", + "\n", + " absolute_model_dir = '/'.join(input_checkpoint.split('/')[:-1])\n", + " output_graph = absolute_model_dir + '/frozen_model.pb'\n", + " clear_devices = True\n", + " with tf.Session(graph = tf.Graph()) as sess:\n", + " saver = tf.train.import_meta_graph(\n", + " input_checkpoint + '.meta', clear_devices = clear_devices\n", + " )\n", + " saver.restore(sess, input_checkpoint)\n", + " output_graph_def = tf.graph_util.convert_variables_to_constants(\n", + " sess,\n", + " tf.get_default_graph().as_graph_def(),\n", + " output_node_names.split(','),\n", + " )\n", + " with tf.gfile.GFile(output_graph, 'wb') as f:\n", + " f.write(output_graph_def.SerializeToString())\n", + " print('%d ops in the final graph.' % len(output_graph_def.node))" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:tensorflow:Restoring parameters from dilated-cnn/model.ckpt\n", + "INFO:tensorflow:Froze 37 variables.\n", + "INFO:tensorflow:Converted 37 variables to const ops.\n", + "875 ops in the final graph.\n" + ] + } + ], + "source": [ + "freeze_graph('dilated-cnn', strings)" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "def load_graph(frozen_graph_filename):\n", + " with tf.gfile.GFile(frozen_graph_filename, 'rb') as f:\n", + " graph_def = tf.GraphDef()\n", + " graph_def.ParseFromString(f.read())\n", + " with tf.Graph().as_default() as graph:\n", + " tf.import_graph_def(graph_def)\n", + " return graph" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([0.0343591], dtype=float32)" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "g = load_graph('dilated-cnn/frozen_model.pb')\n", + "x1 = g.get_tensor_by_name('import/Placeholder:0')\n", + "x2 = g.get_tensor_by_name('import/Placeholder_1:0')\n", + "logits = g.get_tensor_by_name('import/logits:0')\n", + "test_sess = tf.InteractiveSession(graph = g)\n", + "test_sess.run(1-logits, feed_dict = {x1 : left, x2: right})" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([0.21152252, 0.14559478, 0.20776057, 0.70417494, 0.49244803,\n", + " 0.33945912, 0.9202117 , 0.02324635, 0.12748677, 0.8314166 ,\n", + " 0.975024 , 0.62006444, 0.18129557, 0.6427861 , 0.07265455,\n", + " 0.4061333 , 0.18890274, 0.02502632, 0.0484429 , 0.10148406,\n", + " 0.8321909 , 0.05768776, 0.55261767, 0.6817114 , 0.11403704,\n", + " 0.44246477, 0.4924479 , 0.18728226, 0.07191038, 0.05914503,\n", + " 0.0800122 , 0.3046261 , 0.60251844, 0.761145 , 0.95517516,\n", + " 0.88605934, 0.814803 , 0.07416344, 0.06447667, 0.03957129,\n", + " 0.03240418, 0.75431895, 0.6757686 , 0.76394105, 0.9388763 ,\n", + " 0.24763906, 0.98832715, 0.05210805, 0.02429408, 0.12788087,\n", + " 0.1121434 , 0.8168456 , 0.9283892 , 0.5351901 , 0.01739019,\n", + " 0.9779401 , 0.02959573, 0.07608068, 0.16026843, 0.07550842,\n", + " 0.6336924 , 0.23004955, 0.8670918 , 0.68216723, 0.06849951,\n", + " 0.02407455, 0.01773602, 
0.88574535, 0.06930637, 0.01752573,\n", + " 0.02795351, 0.5855931 , 0.1376006 , 0.958021 , 0.00917709,\n", + " 0.01847631, 0.8541901 , 0.5811751 , 0.02121222, 0.07525933],\n", + " dtype=float32)" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "test_sess.run(1-logits, feed_dict = {x1 : batch_x_left, x2: batch_x_right})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/session/similarity/self-attention-contrastive.ipynb b/session/similarity/self-attention-contrastive.ipynb new file mode 100644 index 00000000..23f9d0cc --- /dev/null +++ b/session/similarity/self-attention-contrastive.ipynb @@ -0,0 +1,1006 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import tensorflow as tf\n", + "import re\n", + "import numpy as np\n", + "import pandas as pd\n", + "from tqdm import tqdm\n", + "import collections\n", + "import itertools\n", + "from unidecode import unidecode\n", + "import malaya\n", + "import re\n", + "import json" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "def build_dataset(words, n_words, atleast=2):\n", + " count = [['PAD', 0], ['GO', 1], ['EOS', 2], ['UNK', 3]]\n", + " counter = collections.Counter(words).most_common(n_words - 10)\n", + " counter = [i for i in counter if i[1] >= atleast]\n", + " count.extend(counter)\n", + " dictionary = dict()\n", + " for word, _ in count:\n", + " dictionary[word] = len(dictionary)\n", + " data = list()\n", + " unk_count = 0\n", + " for word in words:\n", + " index = dictionary.get(word, 0)\n", + " if index == 0:\n", + " unk_count += 1\n", + " data.append(index)\n", + " count[0][1] = unk_count\n", + " reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))\n", + " return data, count, dictionary, reversed_dictionary\n", + "\n", + "def str_idx(corpus, dic, maxlen, UNK = 3):\n", + " X = np.zeros((len(corpus), maxlen))\n", + " for i in range(len(corpus)):\n", + " for no, k in enumerate(corpus[i][:maxlen]):\n", + " X[i, no] = dic.get(k, UNK)\n", + " return X\n", + "\n", + "tokenizer = malaya.preprocessing._SocialTokenizer().tokenize\n", + "\n", + "def is_number_regex(s):\n", + " if re.match(\"^\\d+?\\.\\d+?$\", s) is None:\n", + " return s.isdigit()\n", + " return True\n", + "\n", + "def detect_money(word):\n", + " if word[:2] == 'rm' and is_number_regex(word[2:]):\n", + " return True\n", + " else:\n", + " return False\n", + "\n", + "def preprocessing(string):\n", + " tokenized = tokenizer(string)\n", + " tokenized = [w.lower() for w in tokenized if len(w) > 2]\n", + " tokenized = ['' if is_number_regex(w) else w for w in tokenized]\n", + " tokenized = ['' if detect_money(w) else w for w in tokenized]\n", + " return tokenized" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "with open('train-similarity.json') as fopen:\n", + " train = json.load(fopen)\n", + " \n", + "left, right, label 
= train['left'], train['right'], train['label']" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "with open('test-similarity.json') as fopen:\n", + " test = json.load(fopen)\n", + "test_left, test_right, test_label = test['left'], test['right'], test['label']" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(array([0, 1]), array([2605321, 1531070]))" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.unique(label, return_counts = True)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "with open('similarity-dictionary.json') as fopen:\n", + " x = json.load(fopen)\n", + "dictionary = x['dictionary']\n", + "rev_dictionary = x['reverse_dictionary']" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "def position_encoding(inputs):\n", + " T = tf.shape(inputs)[1]\n", + " repr_dim = inputs.get_shape()[-1].value\n", + " pos = tf.reshape(tf.range(0.0, tf.to_float(T), dtype=tf.float32), [-1, 1])\n", + " i = np.arange(0, repr_dim, 2, np.float32)\n", + " denom = np.reshape(np.power(10000.0, i / repr_dim), [1, -1])\n", + " enc = tf.expand_dims(tf.concat([tf.sin(pos / denom), tf.cos(pos / denom)], 1), 0)\n", + " return tf.tile(enc, [tf.shape(inputs)[0], 1, 1])\n", + "\n", + "def layer_norm(inputs, epsilon=1e-8):\n", + " mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)\n", + " normalized = (inputs - mean) / (tf.sqrt(variance + epsilon))\n", + " params_shape = inputs.get_shape()[-1:]\n", + " gamma = tf.get_variable('gamma', params_shape, tf.float32, tf.ones_initializer())\n", + " beta = tf.get_variable('beta', params_shape, tf.float32, tf.zeros_initializer())\n", + " return gamma * normalized + beta\n", + "\n", + "def self_attention(inputs, is_training, num_units, num_heads = 8, activation=None):\n", + " T_q = T_k = tf.shape(inputs)[1]\n", + " Q_K_V = tf.layers.dense(inputs, 3*num_units, activation)\n", + " Q, K, V = tf.split(Q_K_V, 3, -1)\n", + " Q_ = tf.concat(tf.split(Q, num_heads, axis=2), 0)\n", + " K_ = tf.concat(tf.split(K, num_heads, axis=2), 0)\n", + " V_ = tf.concat(tf.split(V, num_heads, axis=2), 0)\n", + " align = tf.matmul(Q_, K_, transpose_b=True)\n", + " align *= tf.rsqrt(tf.to_float(K_.get_shape()[-1].value))\n", + " paddings = tf.fill(tf.shape(align), float('-inf'))\n", + " lower_tri = tf.ones([T_q, T_k])\n", + " lower_tri = tf.linalg.LinearOperatorLowerTriangular(lower_tri).to_dense()\n", + " masks = tf.tile(tf.expand_dims(lower_tri,0), [tf.shape(align)[0],1,1])\n", + " align = tf.where(tf.equal(masks, 0), paddings, align)\n", + " align = tf.nn.softmax(align)\n", + " align = tf.layers.dropout(align, 0.1, training=is_training) \n", + " x = tf.matmul(align, V_)\n", + " x = tf.concat(tf.split(x, num_heads, axis=0), 2)\n", + " x += inputs\n", + " x = layer_norm(x)\n", + " return x\n", + "\n", + "def ffn(inputs, hidden_dim, activation=tf.nn.relu):\n", + " x = tf.layers.conv1d(inputs, 4* hidden_dim, 1, activation=activation) \n", + " x = tf.layers.conv1d(x, hidden_dim, 1, activation=None)\n", + " x += inputs\n", + " x = layer_norm(x)\n", + " return x\n", + "\n", + "class Model:\n", + " def __init__(self, size_layer, num_layers, embedded_size,\n", + " dict_size, learning_rate, dropout, kernel_size = 5):\n", + " \n", + " def cnn(x, scope):\n", + " x += 
position_encoding(x)\n", + " with tf.variable_scope(scope, reuse = tf.AUTO_REUSE):\n", + " for n in range(num_layers):\n", + " with tf.variable_scope('attn_%d'%n,reuse=tf.AUTO_REUSE):\n", + " x = self_attention(x, True, size_layer)\n", + " with tf.variable_scope('ffn_%d'%n, reuse=tf.AUTO_REUSE):\n", + " x = ffn(x, size_layer)\n", + " \n", + " with tf.variable_scope('logits', reuse=tf.AUTO_REUSE):\n", + " return tf.layers.dense(x, size_layer)[:, -1]\n", + " \n", + " self.X_left = tf.placeholder(tf.int32, [None, None])\n", + " self.X_right = tf.placeholder(tf.int32, [None, None])\n", + " self.Y = tf.placeholder(tf.float32, [None])\n", + " self.batch_size = tf.shape(self.X_left)[0]\n", + " encoder_embeddings = tf.Variable(tf.random_uniform([dict_size, embedded_size], -1, 1))\n", + " embedded_left = tf.nn.embedding_lookup(encoder_embeddings, self.X_left)\n", + " embedded_right = tf.nn.embedding_lookup(encoder_embeddings, self.X_right)\n", + " \n", + " def contrastive_loss(y,d):\n", + " tmp= y * tf.square(d)\n", + " tmp2 = (1-y) * tf.square(tf.maximum((1 - d),0))\n", + " return tf.reduce_sum(tmp +tmp2)/tf.cast(self.batch_size,tf.float32)/2\n", + " \n", + " self.output_left = cnn(embedded_left, 'left')\n", + " self.output_right = cnn(embedded_right, 'right')\n", + " print(self.output_left, self.output_right)\n", + " self.distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(self.output_left,self.output_right)),\n", + " 1,keep_dims=True))\n", + " self.distance = tf.div(self.distance, tf.add(tf.sqrt(tf.reduce_sum(tf.square(self.output_left),\n", + " 1,keep_dims=True)),\n", + " tf.sqrt(tf.reduce_sum(tf.square(self.output_right),\n", + " 1,keep_dims=True))))\n", + " self.distance = tf.reshape(self.distance, [-1])\n", + " self.logits = tf.identity(self.distance, name = 'logits')\n", + " self.cost = contrastive_loss(self.Y,self.distance)\n", + " \n", + " self.temp_sim = tf.subtract(tf.ones_like(self.distance),\n", + " tf.rint(self.distance))\n", + " correct_predictions = tf.equal(self.temp_sim, self.Y)\n", + " self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, \"float\"))\n", + " self.optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(self.cost)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "size_layer = 128\n", + "num_layers = 4\n", + "embedded_size = 128\n", + "learning_rate = 1e-4\n", + "maxlen = 50\n", + "batch_size = 128\n", + "dropout = 0.8" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Tensor(\"left/logits/strided_slice:0\", shape=(?, 128), dtype=float32) Tensor(\"right/logits/strided_slice:0\", shape=(?, 128), dtype=float32)\n", + "WARNING:tensorflow:From :80: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.\n", + "Instructions for updating:\n", + "keep_dims is deprecated, use keepdims instead\n" + ] + }, + { + "data": { + "text/plain": [ + "'self-attention/model.ckpt'" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "tf.reset_default_graph()\n", + "sess = tf.InteractiveSession()\n", + "model = Model(size_layer,num_layers,embedded_size,len(dictionary),learning_rate,dropout)\n", + "sess.run(tf.global_variables_initializer())\n", + "saver = tf.train.Saver(tf.trainable_variables())\n", + "saver.save(sess, 'self-attention/model.ckpt')" + ] + }, + { + 
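"cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Both siamese notebooks optimise the same contrastive loss on the normalised distance $d$ between the left and right tower outputs:\n",
+ "\n",
+ "$$L = \\frac{1}{2N}\\sum_{i=1}^{N} y_i d_i^2 + (1 - y_i)\\max(1 - d_i, 0)^2,$$\n",
+ "\n",
+ "so `1 - distance` reads as a similarity score in $[0, 1]$ and `temp_sim = 1 - rint(distance)` is the hard prediction."
+ ]
+ },
+ {
+ 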
"cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "strings = ','.join(\n", + " [\n", + " n.name\n", + " for n in tf.get_default_graph().as_graph_def().node\n", + " if ('Variable' in n.op\n", + " or 'Placeholder' in n.name\n", + " or 'logits' in n.name\n", + " or 'alphas' in n.name)\n", + " and 'Adam' not in n.name\n", + " and '_power' not in n.name\n", + " and 'gradient' not in n.name\n", + " and 'Initializer' not in n.name\n", + " and 'Assign' not in n.name\n", + " ]\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "train minibatch loop: 100%|██████████| 32316/32316 [1:48:08<00:00, 5.65it/s, accuracy=0.549, cost=0.123] \n", + "test minibatch loop: 100%|██████████| 391/391 [00:29<00:00, 13.98it/s, accuracy=0.725, cost=0.089] \n", + "train minibatch loop: 0%| | 0/32316 [00:00 CURRENT_ACC:\n", + " print(\n", + " 'epoch: %d, pass acc: %f, current acc: %f'\n", + " % (EPOCH, CURRENT_ACC, test_acc)\n", + " )\n", + " CURRENT_ACC = test_acc\n", + " CURRENT_CHECKPOINT = 0\n", + " else:\n", + " CURRENT_CHECKPOINT += 1\n", + " \n", + " print('time taken:', time.time()-lasttime)\n", + " print('epoch: %d, training loss: %f, training acc: %f, valid loss: %f, valid acc: %f\\n'%(EPOCH,train_loss,\n", + " train_acc,test_loss,\n", + " test_acc))" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'self-attention/model.ckpt'" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "saver.save(sess, 'self-attention/model.ckpt')" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[array([0.], dtype=float32), array([0.02327037], dtype=float32)]" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "left = str_idx(['a person is outdoors, on a horse.'], dictionary, maxlen)\n", + "right = str_idx(['a person on a horse jumps over a broken down airplane.'], dictionary, maxlen)\n", + "sess.run([model.temp_sim,1-model.distance], feed_dict = {model.X_left : left, \n", + " model.X_right: right})" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "validation minibatch loop: 100%|██████████| 391/391 [00:29<00:00, 14.17it/s]\n" + ] + } + ], + "source": [ + "real_Y, predict_Y = [], []\n", + "\n", + "pbar = tqdm(\n", + " range(0, len(test_left), batch_size), desc = 'validation minibatch loop'\n", + ")\n", + "for i in pbar:\n", + " index = min(i+batch_size,len(test_left))\n", + " batch_x_left = str_idx(test_left[i: index], dictionary, maxlen)\n", + " batch_x_right = str_idx(test_right[i: index], dictionary, maxlen)\n", + " batch_y = test_label[i: index]\n", + " predict_Y += sess.run(model.temp_sim, feed_dict = {model.X_left : batch_x_left, \n", + " model.X_right: batch_x_right,\n", + " model.Y : batch_y}).tolist()\n", + " real_Y += batch_y" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " precision recall f1-score support\n", + "\n", + "not similar 0.81 0.83 0.82 31524\n", + " similar 0.70 0.67 0.68 18476\n", + "\n", + "avg / total 0.77 0.77 0.77 50000\n", + "\n" + ] + } + ], + 
"source": [ + "from sklearn import metrics\n", + "\n", + "print(\n", + " metrics.classification_report(\n", + " real_Y, predict_Y, target_names = ['not similar', 'similar']\n", + " )\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['Placeholder',\n", + " 'Placeholder_1',\n", + " 'Placeholder_2',\n", + " 'Variable',\n", + " 'left/attn_0/dense/kernel',\n", + " 'left/attn_0/dense/bias',\n", + " 'left/attn_0/gamma',\n", + " 'left/attn_0/beta',\n", + " 'left/ffn_0/conv1d/kernel',\n", + " 'left/ffn_0/conv1d/bias',\n", + " 'left/ffn_0/conv1d_1/kernel',\n", + " 'left/ffn_0/conv1d_1/bias',\n", + " 'left/ffn_0/gamma',\n", + " 'left/ffn_0/beta',\n", + " 'left/attn_1/dense/kernel',\n", + " 'left/attn_1/dense/bias',\n", + " 'left/attn_1/gamma',\n", + " 'left/attn_1/beta',\n", + " 'left/ffn_1/conv1d/kernel',\n", + " 'left/ffn_1/conv1d/bias',\n", + " 'left/ffn_1/conv1d_1/kernel',\n", + " 'left/ffn_1/conv1d_1/bias',\n", + " 'left/ffn_1/gamma',\n", + " 'left/ffn_1/beta',\n", + " 'left/attn_2/dense/kernel',\n", + " 'left/attn_2/dense/bias',\n", + " 'left/attn_2/gamma',\n", + " 'left/attn_2/beta',\n", + " 'left/ffn_2/conv1d/kernel',\n", + " 'left/ffn_2/conv1d/bias',\n", + " 'left/ffn_2/conv1d_1/kernel',\n", + " 'left/ffn_2/conv1d_1/bias',\n", + " 'left/ffn_2/gamma',\n", + " 'left/ffn_2/beta',\n", + " 'left/attn_3/dense/kernel',\n", + " 'left/attn_3/dense/bias',\n", + " 'left/attn_3/gamma',\n", + " 'left/attn_3/beta',\n", + " 'left/ffn_3/conv1d/kernel',\n", + " 'left/ffn_3/conv1d/bias',\n", + " 'left/ffn_3/conv1d_1/kernel',\n", + " 'left/ffn_3/conv1d_1/bias',\n", + " 'left/ffn_3/gamma',\n", + " 'left/ffn_3/beta',\n", + " 'left/logits/dense/kernel',\n", + " 'left/logits/dense/kernel/read',\n", + " 'left/logits/dense/bias',\n", + " 'left/logits/dense/bias/read',\n", + " 'left/logits/dense/Tensordot/Shape',\n", + " 'left/logits/dense/Tensordot/Rank',\n", + " 'left/logits/dense/Tensordot/axes',\n", + " 'left/logits/dense/Tensordot/GreaterEqual/y',\n", + " 'left/logits/dense/Tensordot/GreaterEqual',\n", + " 'left/logits/dense/Tensordot/Cast',\n", + " 'left/logits/dense/Tensordot/mul',\n", + " 'left/logits/dense/Tensordot/Less/y',\n", + " 'left/logits/dense/Tensordot/Less',\n", + " 'left/logits/dense/Tensordot/Cast_1',\n", + " 'left/logits/dense/Tensordot/add',\n", + " 'left/logits/dense/Tensordot/mul_1',\n", + " 'left/logits/dense/Tensordot/add_1',\n", + " 'left/logits/dense/Tensordot/range/start',\n", + " 'left/logits/dense/Tensordot/range/delta',\n", + " 'left/logits/dense/Tensordot/range',\n", + " 'left/logits/dense/Tensordot/ListDiff',\n", + " 'left/logits/dense/Tensordot/GatherV2/axis',\n", + " 'left/logits/dense/Tensordot/GatherV2',\n", + " 'left/logits/dense/Tensordot/GatherV2_1/axis',\n", + " 'left/logits/dense/Tensordot/GatherV2_1',\n", + " 'left/logits/dense/Tensordot/Const',\n", + " 'left/logits/dense/Tensordot/Prod',\n", + " 'left/logits/dense/Tensordot/Const_1',\n", + " 'left/logits/dense/Tensordot/Prod_1',\n", + " 'left/logits/dense/Tensordot/concat/axis',\n", + " 'left/logits/dense/Tensordot/concat',\n", + " 'left/logits/dense/Tensordot/concat_1/axis',\n", + " 'left/logits/dense/Tensordot/concat_1',\n", + " 'left/logits/dense/Tensordot/stack',\n", + " 'left/logits/dense/Tensordot/transpose',\n", + " 'left/logits/dense/Tensordot/Reshape',\n", + " 'left/logits/dense/Tensordot/transpose_1/perm',\n", + " 'left/logits/dense/Tensordot/transpose_1',\n", + " 'left/logits/dense/Tensordot/Reshape_1/shape',\n", + " 
'left/logits/dense/Tensordot/Reshape_1',\n", + " 'left/logits/dense/Tensordot/MatMul',\n", + " 'left/logits/dense/Tensordot/Const_2',\n", + " 'left/logits/dense/Tensordot/concat_2/axis',\n", + " 'left/logits/dense/Tensordot/concat_2',\n", + " 'left/logits/dense/Tensordot',\n", + " 'left/logits/dense/BiasAdd',\n", + " 'left/logits/strided_slice/stack',\n", + " 'left/logits/strided_slice/stack_1',\n", + " 'left/logits/strided_slice/stack_2',\n", + " 'left/logits/strided_slice',\n", + " 'right/attn_0/dense/kernel',\n", + " 'right/attn_0/dense/bias',\n", + " 'right/attn_0/gamma',\n", + " 'right/attn_0/beta',\n", + " 'right/ffn_0/conv1d/kernel',\n", + " 'right/ffn_0/conv1d/bias',\n", + " 'right/ffn_0/conv1d_1/kernel',\n", + " 'right/ffn_0/conv1d_1/bias',\n", + " 'right/ffn_0/gamma',\n", + " 'right/ffn_0/beta',\n", + " 'right/attn_1/dense/kernel',\n", + " 'right/attn_1/dense/bias',\n", + " 'right/attn_1/gamma',\n", + " 'right/attn_1/beta',\n", + " 'right/ffn_1/conv1d/kernel',\n", + " 'right/ffn_1/conv1d/bias',\n", + " 'right/ffn_1/conv1d_1/kernel',\n", + " 'right/ffn_1/conv1d_1/bias',\n", + " 'right/ffn_1/gamma',\n", + " 'right/ffn_1/beta',\n", + " 'right/attn_2/dense/kernel',\n", + " 'right/attn_2/dense/bias',\n", + " 'right/attn_2/gamma',\n", + " 'right/attn_2/beta',\n", + " 'right/ffn_2/conv1d/kernel',\n", + " 'right/ffn_2/conv1d/bias',\n", + " 'right/ffn_2/conv1d_1/kernel',\n", + " 'right/ffn_2/conv1d_1/bias',\n", + " 'right/ffn_2/gamma',\n", + " 'right/ffn_2/beta',\n", + " 'right/attn_3/dense/kernel',\n", + " 'right/attn_3/dense/bias',\n", + " 'right/attn_3/gamma',\n", + " 'right/attn_3/beta',\n", + " 'right/ffn_3/conv1d/kernel',\n", + " 'right/ffn_3/conv1d/bias',\n", + " 'right/ffn_3/conv1d_1/kernel',\n", + " 'right/ffn_3/conv1d_1/bias',\n", + " 'right/ffn_3/gamma',\n", + " 'right/ffn_3/beta',\n", + " 'right/logits/dense/kernel',\n", + " 'right/logits/dense/kernel/read',\n", + " 'right/logits/dense/bias',\n", + " 'right/logits/dense/bias/read',\n", + " 'right/logits/dense/Tensordot/Shape',\n", + " 'right/logits/dense/Tensordot/Rank',\n", + " 'right/logits/dense/Tensordot/axes',\n", + " 'right/logits/dense/Tensordot/GreaterEqual/y',\n", + " 'right/logits/dense/Tensordot/GreaterEqual',\n", + " 'right/logits/dense/Tensordot/Cast',\n", + " 'right/logits/dense/Tensordot/mul',\n", + " 'right/logits/dense/Tensordot/Less/y',\n", + " 'right/logits/dense/Tensordot/Less',\n", + " 'right/logits/dense/Tensordot/Cast_1',\n", + " 'right/logits/dense/Tensordot/add',\n", + " 'right/logits/dense/Tensordot/mul_1',\n", + " 'right/logits/dense/Tensordot/add_1',\n", + " 'right/logits/dense/Tensordot/range/start',\n", + " 'right/logits/dense/Tensordot/range/delta',\n", + " 'right/logits/dense/Tensordot/range',\n", + " 'right/logits/dense/Tensordot/ListDiff',\n", + " 'right/logits/dense/Tensordot/GatherV2/axis',\n", + " 'right/logits/dense/Tensordot/GatherV2',\n", + " 'right/logits/dense/Tensordot/GatherV2_1/axis',\n", + " 'right/logits/dense/Tensordot/GatherV2_1',\n", + " 'right/logits/dense/Tensordot/Const',\n", + " 'right/logits/dense/Tensordot/Prod',\n", + " 'right/logits/dense/Tensordot/Const_1',\n", + " 'right/logits/dense/Tensordot/Prod_1',\n", + " 'right/logits/dense/Tensordot/concat/axis',\n", + " 'right/logits/dense/Tensordot/concat',\n", + " 'right/logits/dense/Tensordot/concat_1/axis',\n", + " 'right/logits/dense/Tensordot/concat_1',\n", + " 'right/logits/dense/Tensordot/stack',\n", + " 'right/logits/dense/Tensordot/transpose',\n", + " 'right/logits/dense/Tensordot/Reshape',\n", + " 
'right/logits/dense/Tensordot/transpose_1/perm',\n",
+       " 'right/logits/dense/Tensordot/transpose_1',\n",
+       " 'right/logits/dense/Tensordot/Reshape_1/shape',\n",
+       " 'right/logits/dense/Tensordot/Reshape_1',\n",
+       " 'right/logits/dense/Tensordot/MatMul',\n",
+       " 'right/logits/dense/Tensordot/Const_2',\n",
+       " 'right/logits/dense/Tensordot/concat_2/axis',\n",
+       " 'right/logits/dense/Tensordot/concat_2',\n",
+       " 'right/logits/dense/Tensordot',\n",
+       " 'right/logits/dense/BiasAdd',\n",
+       " 'right/logits/strided_slice/stack',\n",
+       " 'right/logits/strided_slice/stack_1',\n",
+       " 'right/logits/strided_slice/stack_2',\n",
+       " 'right/logits/strided_slice',\n",
+       " 'logits']"
+      ]
+     },
+     "execution_count": 16,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "strings.split(',')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def freeze_graph(model_dir, output_node_names):\n",
+    "\n",
+    "    if not tf.gfile.Exists(model_dir):\n",
+    "        raise AssertionError(\n",
+    "            \"Export directory doesn't exist. Please specify an export \"\n",
+    "            'directory: %s' % model_dir\n",
+    "        )\n",
+    "\n",
+    "    # pick up the latest checkpoint inside model_dir\n",
+    "    checkpoint = tf.train.get_checkpoint_state(model_dir)\n",
+    "    input_checkpoint = checkpoint.model_checkpoint_path\n",
+    "\n",
+    "    absolute_model_dir = '/'.join(input_checkpoint.split('/')[:-1])\n",
+    "    output_graph = absolute_model_dir + '/frozen_model.pb'\n",
+    "    clear_devices = True\n",
+    "    with tf.Session(graph = tf.Graph()) as sess:\n",
+    "        saver = tf.train.import_meta_graph(\n",
+    "            input_checkpoint + '.meta', clear_devices = clear_devices\n",
+    "        )\n",
+    "        saver.restore(sess, input_checkpoint)\n",
+    "        # bake trained variables into constants, keeping only the subgraph\n",
+    "        # needed to compute output_node_names\n",
+    "        output_graph_def = tf.graph_util.convert_variables_to_constants(\n",
+    "            sess,\n",
+    "            tf.get_default_graph().as_graph_def(),\n",
+    "            output_node_names.split(','),\n",
+    "        )\n",
+    "        with tf.gfile.GFile(output_graph, 'wb') as f:\n",
+    "            f.write(output_graph_def.SerializeToString())\n",
+    "        print('%d ops in the final graph.' 
% len(output_graph_def.node))" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:tensorflow:Restoring parameters from self-attention/model.ckpt\n", + "INFO:tensorflow:Froze 85 variables.\n", + "INFO:tensorflow:Converted 85 variables to const ops.\n", + "1637 ops in the final graph.\n" + ] + } + ], + "source": [ + "freeze_graph('self-attention', strings)" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "def load_graph(frozen_graph_filename):\n", + " with tf.gfile.GFile(frozen_graph_filename, 'rb') as f:\n", + " graph_def = tf.GraphDef()\n", + " graph_def.ParseFromString(f.read())\n", + " with tf.Graph().as_default() as graph:\n", + " tf.import_graph_def(graph_def)\n", + " return graph" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([0.01998395], dtype=float32)" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "g = load_graph('self-attention/frozen_model.pb')\n", + "x1 = g.get_tensor_by_name('import/Placeholder:0')\n", + "x2 = g.get_tensor_by_name('import/Placeholder_1:0')\n", + "logits = g.get_tensor_by_name('import/logits:0')\n", + "test_sess = tf.InteractiveSession(graph = g)\n", + "test_sess.run(1-logits, feed_dict = {x1 : left, x2: right})" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([0.2318753 , 0.5197979 , 0.2777239 , 0.14316326, 0.8766695 ,\n", + " 0.22495192, 0.91102034, 0.0115208 , 0.070916 , 0.07542306,\n", + " 0.94589764, 0.04265296, 0.34291208, 0.43791467, 0.13047814,\n", + " 0.05099976, 0.04077601, 0.03098774, 0.05358207, 0.09898269,\n", + " 0.4222178 , 0.07683033, 0.27565062, 0.18730605, 0.34941596,\n", + " 0.08564615, 0.19999826, 0.05309838, 0.04758018, 0.01607895,\n", + " 0.13069487, 0.6605412 , 0.9515858 , 0.16830862, 0.5734025 ,\n", + " 0.5354396 , 0.749179 , 0.2538219 , 0.0801577 , 0.05013776,\n", + " 0.4355023 , 0.45459825, 0.03258169, 0.15339905, 0.9313603 ,\n", + " 0.42679828, 0.95682436, 0.07610172, 0.03255141, 0.00740314,\n", + " 0.52017945, 0.46709698, 0.74399465, 0.45834607, 0.02888119,\n", + " 0.9627122 , 0.1260702 , 0.03194386, 0.11266536, 0.05345899,\n", + " 0.5395947 , 0.34424478, 0.73064005, 0.17178106, 0.76854 ,\n", + " 0.03258795, 0.06777585, 0.8709656 , 0.09303659, 0.03535146,\n", + " 0.07395506, 0.06536621, 0.1412226 , 0.94608825, 0.07875746,\n", + " 0.01958525, 0.16110301, 0.19749928, 0.0451234 , 0.03573173],\n", + " dtype=float32)" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "test_sess.run(1-logits, feed_dict = {x1 : batch_x_left, x2: batch_x_right})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/setup-gpu.py b/setup-gpu.py index e0295a2e..57ff9fef 100644 --- a/setup-gpu.py +++ 
b/setup-gpu.py
@@ -6,7 +6,7 @@
 setuptools.setup(
     name = __packagename__,
     packages = setuptools.find_packages(),
-    version = '2.5',
+    version = '2.6',
     python_requires = '>=3.6.*',
     description = 'Natural-Language-Toolkit for bahasa Malaysia, powered by Deep Learning Tensorflow. GPU Version',
     author = 'huseinzol05',
@@ -30,6 +30,7 @@
         'PySastrawi',
         'toolz',
         'ftfy',
+        'networkx',
     ],
     license = 'MIT',
     classifiers = [
diff --git a/setup.py b/setup.py
index 21a72b0b..83f99259 100644
--- a/setup.py
+++ b/setup.py
@@ -6,7 +6,7 @@
 setuptools.setup(
     name = __packagename__,
     packages = setuptools.find_packages(),
-    version = '2.5',
+    version = '2.6',
     python_requires = '>=3.6.*',
     description = 'Natural-Language-Toolkit for bahasa Malaysia, powered by Deep Learning Tensorflow.',
     author = 'huseinzol05',
@@ -30,6 +30,7 @@
         'PySastrawi',
         'toolz',
         'ftfy',
+        'networkx',
     ],
     license = 'MIT',
     classifiers = [
diff --git a/xlnet/README.md b/xlnet/README.md
new file mode 100644
index 00000000..e98b029e
--- /dev/null
+++ b/xlnet/README.md
@@ -0,0 +1,125 @@
+# XLNET-Bahasa
+
+Thanks to [zihangdai](https://github.com/zihangdai) for open-sourcing XLNET, https://github.com/zihangdai/xlnet
+
+## Objective
+
+1. There is no multilanguage implementation of XLNET, and obviously no Bahasa Malaysia implementation either, so this directory provides scripts and instructions to pretrain XLNET for Bahasa Malaysia.
+
+## How-to
+
+1. Git clone [Malaya-Dataset](https://github.com/huseinzol05/Malaya-Dataset),
+
+```bash
+git clone https://github.com/huseinzol05/Malaya-Dataset.git
+```
+
+2. Run [tokenization.ipynb](tokenization.ipynb) to create the dictionary for the tokenizer and the text dataset for pretraining.
+
+3. Git clone [Sentence-Piece](https://github.com/google/sentencepiece),
+
+```bash
+git clone https://github.com/google/sentencepiece.git
+```
+
+4. Install [Sentence-Piece](https://github.com/google/sentencepiece),
+
+As of 23rd June 2019, the latest master could not be compiled with bazel; after some googling, we need to revert to an older commit.
+
+```bash
+cd sentencepiece
+git checkout d4dd947fe71c4fa4ee24ad8297beee32887d8828
+mkdir build
+cd build
+cmake ..
+make -j $(nproc)
+sudo make install
+sudo ldconfig -v
+```
+
+Run `spm_train` to make sure the installation works,
+
+```bash
+spm_train
+```
+
+```text
+ERROR: --input must not be empty
+
+sentencepiece
+
+Usage: sentencepiece [options] files
+
+   --accept_language (comma-separated list of languages this model can accept) type: string default:
+   --add_dummy_prefix (Add dummy whitespace at the beginning of text) type: bool default: true
+   --bos_id (Override BOS (<s>) id. Set -1 to disable BOS.) type: int32 default: 1
+   --bos_piece (Override BOS (<s>) piece.) type: string default: <s>
+   --character_coverage (character coverage to determine the minimum symbols) type: double default: 0.9995
+   --control_symbols (comma separated list of control symbols) type: string default:
+   --eos_id (Override EOS (</s>) id. Set -1 to disable EOS.) type: int32 default: 2
+...
+```
+
+5. Create the tokenizer using Sentence-Piece,
+
+```bash
+cd ../
+spm_train \
+--input=texts.txt \
+--model_prefix=sp10m.cased.v3 \
+--vocab_size=32000 \
+--character_coverage=0.99995 \
+--model_type=unigram \
+--control_symbols=\<cls\>,\<sep\>,\<pad\>,\<mask\>,\<eod\> \
+--user_defined_symbols=\<eop\>,.,\(,\),\",-,–,£,€ \
+--shuffle_input_sentence \
+--input_sentence_size=10000000
+```
+
+**In the future, I will use the Malaya tokenizer as the XLNET tokenizer, if XLNET's accuracy beats BERT's.**
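+
+As a quick sanity check (optional, not part of the original XLNET instructions), the trained tokenizer can be loaded back with the sentencepiece Python API; the sample sentence below is just an arbitrary Bahasa example:
+
+```python
+import sentencepiece as spm
+
+# load the model produced by `spm_train` above
+sp = spm.SentencePieceProcessor()
+sp.Load('sp10m.cased.v3.model')
+
+# arbitrary Bahasa sentence, purely for illustration
+print(sp.EncodeAsPieces('saya suka makan ayam'))
+print(sp.EncodeAsIds('saya suka makan ayam'))
+```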
+6. Convert the text files to tfrecords,
+
+```bash
+mkdir save-location
+python3 data_utils.py \
+  --bsz_per_host=8 \
+  --seq_len=256 \
+  --reuse_len=128 \
+  --input_glob=*.txt \
+  --save_dir=save-location \
+  --num_passes=20 \
+  --bi_data=True \
+  --sp_path=sp10m.cased.v3.model \
+  --mask_alpha=6 \
+  --mask_beta=1 \
+  --num_predict=85 \
+  --num_core_per_host=1
+```
+
+7. Run pretraining,
+
+I reduced the size of XLNET by half while keeping the same number of attention heads; here is the [original size](https://github.com/zihangdai/xlnet#pretraining-with-xlnet).
+
+```bash
+python3 train_gpu.py \
+  --corpus_info_path=save-location/corpus_info.json \
+  --record_info_dir=save-location/tfrecords \
+  --train_batch_size=8 \
+  --seq_len=256 \
+  --reuse_len=128 \
+  --perm_size=128 \
+  --n_layer=12 \
+  --d_model=512 \
+  --d_embed=512 \
+  --n_head=16 \
+  --d_head=64 \
+  --d_inner=2048 \
+  --untie_r=True \
+  --mask_alpha=6 \
+  --mask_beta=1 \
+  --num_predict=85 \
+  --model_dir=output-model \
+  --uncased=True \
+  --num_core_per_host=1
+```
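+
+To verify everything is wired up before burning GPU hours (again optional, not part of the original XLNET instructions), you can inspect the corpus info written in step 6 and watch for checkpoints; the paths assume the commands above:
+
+```python
+import json
+import tensorflow as tf
+
+# corpus_info.json is written by data_utils.py in step 6 and consumed
+# by train_gpu.py above
+with open('save-location/corpus_info.json') as f:
+    print(json.load(f))
+
+# once pretraining is running, checkpoints should appear in output-model/
+print(tf.train.latest_checkpoint('output-model'))
+```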