
# Hello, Juncture

This Juncture essay illustrates the use of a few Markdown formatting tags and the incorporation of an image and a map into a Juncture essay.

Official tutorial

Juncture home page


> [Wikidata home page](https://www.wikidata.org/wiki/Wikidata:Main_Page)

## Example

### Aulacophora indica

.ve-media wc:The_Bug_Peek.jpg right

The image depicts a leaf beetle (*Aulacophora indica*; family Chrysomelidae, subfamily Galerucinae) looking out from a leaf hole in an *Alnus nepalensis* tree. Adult leaf beetles make holes in host-plant leaves while feeding, and they camouflage themselves with these holes.

This image is hosted on Wikimedia Commons and was runner-up for Wikimedia Commons Picture of the Year for 2021.

Image controls are located in the top-left corner of the image and appear when hovering over the image. These controls support zooming, rotation, full-screen viewing, and resetting to the start position. Panning can be performed with the keyboard arrow keys or by mouse click-and-drag.

Image information can be seen by hovering the cursor over the info icon in the top-right corner of the image. The image information popover includes the image title, description, attribution statement, and reuse rights.

### Chitwan National Park, Nepal

.ve-map Q1075023 right

The map is centered on Chitwan National Park in Nepal, the location associated with the image above. The Wikidata identifier for Chitwan National Park is Q1075023. When a map location is specified using a Wikidata ID (QID), Juncture can automatically retrieve the geographic coordinates for map centering.

An alternative to using a Wikidata identifier for map positioning is to use ordinary latitude and longitude coordinates. In that approach, the QID would be replaced with the coordinates 27.5,84.333, resulting in an identical map.
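For example, based on the description above, these two tags should produce the same map (shown here as plain text rather than as live tags):

```
.ve-map Q1075023 right
.ve-map 27.5,84.333 right
```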

Similar to the image viewer, map zooming is controlled using the buttons in the top-left corner of the map viewer. Panning is performed with the keyboard arrow keys or by mouse click-and-drag.

## Implementation: ALA Annual Conference 2023 Poster

<style>
  .divStyle1 {
    width: 100%;
    height: 100%;
    background: -webkit-radial-gradient(circle, #E6B2FF, #76EEFF, #E6B2FF);
    background: -o-radial-gradient(circle, #E6B2FF, #76EEFF, #E6B2FF);
    background: -moz-radial-gradient(circle, #E6B2FF, #76EEFF, #E6B2FF);
    background: radial-gradient(circle, #E6B2FF, #76EEFF, #E6B2FF);
    padding: 20px;
  }
  .titleTextStyle {
    background: -webkit-linear-gradient(left top, #000066, #0099cc);
    background: -o-linear-gradient(bottom right, #000066, #0099cc);
    background: -moz-linear-gradient(bottom right, #000066, #0099cc);
    background: linear-gradient(to bottom right, #000066, #0099cc);
    background-clip: text;
    -webkit-background-clip: text;
    color: transparent;
  }
  .textStyle1 { color: #0D00C0; }
  .textStyle2 { cursor: pointer; color: blue; }
  .paragraphStyle1 { border: 1px solid #000; padding: 10px; background-color: #EEF5FF; }
  .detailsStyle1 {
    border: 2px solid #000;
    padding: 10px;
    background: -webkit-linear-gradient(left top, #76EEFF, #E6B2FF);
    background: -o-linear-gradient(bottom right, #76EEFF, #E6B2FF);
    background: -moz-linear-gradient(bottom right, #76EEFF, #E6B2FF);
    background: linear-gradient(to bottom right, #76EEFF, #E6B2FF);
  }
  .summaryTextStyle { text-align: center; }
</style>

.ve-media gh:weihungtseng/media/picture/ALA_Annual_Conference_2023.png


## Subject Heading Prediction based on the BERT Model

Huei-Yu Wang | Wei-Hung Tseng | Yu-Hao Lai | Ming-Hsin Phoebe Chiu

Graduate Institute of Library & Information Studies

<br>
<details class="detailsStyle1">
    <summary><h3 class="summaryTextStyle">Motivation</h3></summary>
    <p class="paragraphStyle1">Cataloging is an essential part of the mission of the libraries, as the collection serves as a carrier of knowledge, so the users can effectively retrieve and utilize this knowledge. Automatic classification technologies have been introduced to the library technical services to enhance efficiency and improve inconsistency in cataloging.
    </p>
</details>

<br>
<details class="detailsStyle1">
    <summary><h3 class="summaryTextStyle">Problem to be solved</h3></summary>
    <p class="paragraphStyle1">To address the actual cataloging needs and problems in libraries, this study used 620,217 titles from the National Taiwan Normal University Library as experiment datasets and trained with the <mark qid="Q61726893">BERT</mark> distilbert-base-multilingual-cased model on different combinations of call number, titles, and authors’ data to make multiple subject cataloging predictions in both Chinese and English languages.
    </p>
</details>
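A minimal sketch of the fine-tuning setup described above, assuming the Hugging Face transformers and PyTorch libraries; the label list and input text are placeholders, not the study's actual code or data:

```python
# Multi-label classification with DistilBERT (illustrative placeholders).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "distilbert-base-multilingual-cased"
labels = ["subject_kw_1", "subject_kw_2", "subject_kw_3"]  # top-N subject keywords

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL,
    num_labels=len(labels),
    problem_type="multi_label_classification",  # sigmoid outputs + BCE loss
)

# Text input: e.g. call-number prefix concatenated with the book title.
text = "020 Subject Heading Prediction based on the BERT Model"
enc = tokenizer(text, truncation=True, padding=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits

# A keyword is predicted when its sigmoid probability exceeds 0.5.
probs = torch.sigmoid(logits)
predicted = [l for l, p in zip(labels, probs[0]) if p > 0.5]
print(predicted)
```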

<br>
<details class="detailsStyle1">
    <summary><h3 class="summaryTextStyle">Procedure</h3></summary>
    <p class="paragraphStyle1">After analyzing the dataset, we found that there are 191,478 usable data entries. Each entry has at least one subject keyword, with a total of 66,548 subject keywords distributed among them. The subject keyword with the highest frequency appears 5,085 times, while the subject keyword with the lowest frequency appears only once. Due to the large number of subject keywords and significant differences in frequency, we will sort them in descending order based on their occurrence frequency. Furthermore, we define the coverage of data entries as the dataset, and only when all the subject keywords of these data entries fall within the current range of subject keywords, they are considered part of the dataset.
    </p>
    
    <div>
        <ve-media anno-base="None/None/" src="gh:weihungtseng/juncture-media/picture/Analysis_and_Statistics_of_Subject_Keywords.png"></ve-media>
    </div>
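The coverage rule above might be implemented as follows; this is a sketch with hypothetical records, not the study's data:

```python
# Rank keywords by descending frequency, then keep only the entries whose
# subject keywords ALL fall within the current top-K keyword set.
from collections import Counter

records = [
    {"title": "t1", "keywords": {"libraries", "cataloging"}},
    {"title": "t2", "keywords": {"cataloging", "rare-topic"}},
]

freq = Counter(kw for r in records for kw in r["keywords"])
ranked = [kw for kw, _ in freq.most_common()]  # descending frequency

K = 2                                          # size of the keyword range
top_k = set(ranked[:K])
dataset = [r for r in records if r["keywords"] <= top_k]
```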
    
    <p class="paragraphStyle1">In this study, BERT technology was applied, using a pre-trained model for fine-tuning. During training, different combinations of the first three digits of the call number, book title, and author were attempted as the text input, while the subject keywords served as labels. The data set was divided into a training set and a test set using the iterative_train_test_split() function from the scikit-multilearn package, with a ratio of nine to one, achieving stratification for multi-label data.
    </p>
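A minimal sketch of that split, assuming the scikit-multilearn package; the arrays here are placeholders, not the study's data:

```python
# Stratified 9:1 split for multi-label data (placeholder arrays).
import numpy as np
from skmultilearn.model_selection import iterative_train_test_split

X = np.arange(100).reshape(100, 1)          # e.g. indices into the text records
y = np.random.randint(0, 2, size=(100, 5))  # binary indicator matrix of keywords

# test_size=0.1 gives the nine-to-one ratio described above.
X_train, y_train, X_test, y_test = iterative_train_test_split(X, y, test_size=0.1)
```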

    <p class="paragraphStyle1">Initially, due to the large number of subject keywords, the experiment focused on the top 50 subject keywords with the highest occurrence frequency. The data set consisted of records covered by these subject keywords, and training experiments were conducted using the text input as the call number. Finally, the distilbert-base-multilingual-cased model, which performed better, was selected as the model for subsequent experiments. This model is capable of handling multilingual classification tasks.
    </p>

    <div>
        <ve-media anno-base="None/None/" src="gh:weihungtseng/juncture-media/picture/Performance_Comparison_of_Different_Models.png"></ve-media>
    </div>
    
    <p class="paragraphStyle1">In the subsequent experiments, various combinations of Text were tested, using data sets covered by the top 0.07%, 0.15%, 0.45%, 1%, 2%, 5%, and 10% of subject keywords respectively. The reason for not selecting 15%, 20%, and 25% is that the newly added subject keyword samples were too few, with minimum occurrence counts lower than 10.
    </p>

    <div>
        <ve-media anno-base="None/None/" src="gh:weihungtseng/juncture-media/picture/The_result_of_the_experiment.png"></ve-media>
    </div>
</details>
<br>

<details class="detailsStyle1">
    <summary><h3 class="summaryTextStyle">Conclusion</h3></summary>
    <p class="paragraphStyle1">From the experiments, it can be concluded that the most suitable BERT model for this study is distilbert-base-multilingual-cased. Among the different Text options, the combination of Book Title and Call Number achieved the best prediction performance. When considering the data set covered by the top 50 subject keywords, the Micro-F <sup><span class="textStyle2">1</span></sup> score reached 0.8623. In summary, there are two key factors for achieving good performance in predicting subject keywords using BERT. The first factor is the frequency of occurrence (sample size) of the subject keywords, and the second factor is ensuring that the meaning represented by the Text is closely related to the meaning of the subject keywords.
    </p>
</details>
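For reference, a micro-averaged F1 score like the 0.8623 reported above can be computed as follows; this is a sketch with toy arrays, assuming scikit-learn:

```python
# Micro-averaged F1 for multi-label predictions (toy arrays, not study data).
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

# Micro averaging pools every label decision before computing precision and
# recall, so frequent keywords dominate the single summary score.
print(f1_score(y_true, y_pred, average="micro"))
```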
<section class="footnote">
    <hr>
    <ol>
        <li id="fn:1">
            <p>Micro-F: Micro, Macro & Weighted Averages of F1 Score, Clearly Explained
                <a href="https://towardsdatascience.com/micro-macro-weighted-averages-of-f1-score-clearly-explained-b603420b292f">
                    https://towardsdatascience.com/micro-macro-weighted-averages-of-f1-score-clearly-explained-b603420b292f
                </a> 
                &nbsp;
                <span class="textStyle2">↩</span>
            </p>
        </li>
    </ol>
</section>

### Location of NTNU

Fly to ==NTNU=={flyto:Q706712}

.ve-map Q1755318 17 height=200px width=50%
- Q706712

### NTNU Introduction Video

<iframe width="70%" height="500px" src="https://www.youtube.com/embed/WIeFdYGbHPw"></iframe>
