Feat : Update staff scraping for the university website renewal and add position during scraping #223

Merged
merged 48 commits into from
Dec 3, 2024
cdab7ff
fix : change base URL of staff pages to scrape (department websites)
rlagkswn00 Nov 27, 2024
48a182e
remove : delete Living Design and Communication Design parsers (unused)
rlagkswn00 Nov 27, 2024
39d0b7c
fix : update department staff page parser (excluding the Real Estate department)
rlagkswn00 Nov 27, 2024
7d1e6db
fix : change parsing logic for the Real Estate department
rlagkswn00 Nov 27, 2024
8db54c6
feat : change the information required when scraping staff
rlagkswn00 Nov 27, 2024
9c64b73
feat : update siteId and siteName for all departments
rlagkswn00 Nov 27, 2024
0fd0dc8
feat : add method to verify whether staff information is supported
rlagkswn00 Nov 27, 2024
28a7557
feat : change getters following the addition of siteId and siteName fields
rlagkswn00 Nov 27, 2024
961b896
remove : delete unused classes after unifying the staff scraping API client
rlagkswn00 Nov 27, 2024
fd4025e
feat : change department staff scraping API client logic
rlagkswn00 Nov 27, 2024
cca4bd4
feat : add position information during staff scraping
rlagkswn00 Nov 27, 2024
c8c22e1
feat : update StaffUpdater scraping logic
rlagkswn00 Nov 27, 2024
a20e5de
feat : apply distinct to remove duplicate staff between Pre-Veterinary Medicine and Veterinary Medicine
rlagkswn00 Nov 27, 2024
5c915ee
feat : add test HTML files (Computer Engineering, Real Estate) and move legacy files
rlagkswn00 Nov 27, 2024
f6bc03e
test : write test code for department staff scraping logic
rlagkswn00 Nov 27, 2024
49ca89b
fix : revert to the logic from before entity position was added
rlagkswn00 Nov 28, 2024
3ae1d1a
remove : delete legacy StaffScraperTest
rlagkswn00 Nov 29, 2024
38654a6
remove : remove unnecessary comments
rlagkswn00 Nov 29, 2024
8e921da
test : create MockServerSupport object
rlagkswn00 Nov 29, 2024
2a84b06
remove : remove unnecessary DTO
rlagkswn00 Nov 29, 2024
4a5f28a
test : add new StaffScraperTest
rlagkswn00 Nov 29, 2024
708f4bf
remove : delete unnecessary import statements
rlagkswn00 Nov 30, 2024
9ac2010
feat : add StaffDTO identifier() method
rlagkswn00 Nov 30, 2024
2b36510
feat : scrape only Pre-Veterinary Medicine when scraping College of Veterinary Medicine staff
rlagkswn00 Nov 30, 2024
30c6675
feat : extract phone number utility class
rlagkswn00 Nov 30, 2024
8ce7b42
feat : extract email utility class
rlagkswn00 Nov 30, 2024
3727f63
test : add email utility class tests
rlagkswn00 Nov 30, 2024
ffd86bd
test : add phone number utility class tests
rlagkswn00 Nov 30, 2024
49c684b
feat : add position column to Staff DB
rlagkswn00 Nov 30, 2024
eb9cded
feat : add position to Staff & StaffDTO
rlagkswn00 Nov 30, 2024
5d3a715
feat : add StaffDTO position comparison logic
rlagkswn00 Nov 30, 2024
666db2c
feat : update email validation policy (allow blank)
rlagkswn00 Nov 30, 2024
b607552
feat : add position when updating Staff
rlagkswn00 Nov 30, 2024
980fe0f
feat : enable staff scraping schedule once a month
rlagkswn00 Nov 30, 2024
143a5e5
feat : separate validation/conversion logic in EmailSupporter & PhoneNumberSupporter
rlagkswn00 Dec 2, 2024
3049c30
feat : add tests for EmailSupporter & PhoneNumberSupporter validation methods
rlagkswn00 Dec 2, 2024
47b08f9
feat : validate and convert email and phone when creating StaffDTO objects
rlagkswn00 Dec 2, 2024
05bdb5d
refactor : remove unnecessary import statements
rlagkswn00 Dec 2, 2024
c8f848e
refactor : remove public keyword from test classes and methods
rlagkswn00 Dec 2, 2024
48b48c1
refactor : remove TODO keyword from comments and change replaceAll() -> replace()
rlagkswn00 Dec 2, 2024
ec2fafb
feat : change StaffUpdate logic following the addition of position
rlagkswn00 Dec 2, 2024
5cf1377
refactor : remove unnecessary parentheses in lambda expressions
rlagkswn00 Dec 2, 2024
518ac33
feat : change default stored value when the phone number is missing ("-" -> "")
rlagkswn00 Dec 2, 2024
bcadaba
fix : update Staff domain tests following StaffUpdate logic change (identifier(), phone number)
rlagkswn00 Dec 2, 2024
4216676
test : update test code following StaffUpdate logic change (identifier(), phone number, position)
rlagkswn00 Dec 2, 2024
90f5958
refactor : fix SonarQube issues (variable naming convention)
rlagkswn00 Dec 2, 2024
3954e23
remove : remove unnecessary print statements
rlagkswn00 Dec 2, 2024
5afe41f
feat : change staff scraping schedule time to 1 minute (to be reverted to 30 minutes after testing)
rlagkswn00 Dec 3, 2024
@@ -0,0 +1,68 @@
+package com.kustacks.kuring.common.utils.converter;
+
+import java.util.Arrays;
+import java.util.regex.Pattern;
+
+public class EmailSupporter {
+    private static final Pattern AT_PATTERN = Pattern.compile("\\s+at\\s+");
+    private static final Pattern DOT_PATTERN = Pattern.compile("\\s+dot\\s+");
+    private static final Pattern EMAIL_PATTERN = Pattern.compile("^[a-zA-Z0-9_!#$%&'\\*+/=?{|}~^.-]+@[a-zA-Z0-9.-]+$");
+
+    private static final String KONKUK_DOMAIN = "@konkuk.ac.kr";
+    private static final String EMPTY_EMAIL = "";
+
+    public static boolean isNullOrBlank(String email) {
+        return email == null || email.isBlank();
+    }
+
+    public static String convertValidEmail(String email) {
+        if (isNullOrBlank(email)) {
+            return EMPTY_EMAIL;
+        }
+
+        String[] emailGroups = splitEmails(email);
+        String[] normalizedEmails = normalizeEmails(emailGroups);
+
+        // Among multiple emails, prefer the konkuk one; otherwise use the first entry
+        return selectPreferredEmail(normalizedEmails);
+    }
+
+    private static String[] splitEmails(String email) {
+        return email.split("[/,]");
+    }
+
+    private static String[] normalizeEmails(String[] emailGroups) {
+        return Arrays.stream(emailGroups)
+                .map(EmailSupporter::normalizeEmail)
+                .toArray(String[]::new);
+    }
+
+    private static String normalizeEmail(String email) {
+        if (EMAIL_PATTERN.matcher(email).matches()) {
+            return email;
+        }
+
+        if (containsSubstitutePatterns(email)) {
+            return replaceSubstitutePatterns(email);
+        }
+
+        return EMPTY_EMAIL;
+    }
+
+    private static String replaceSubstitutePatterns(String email) {
+        return email.replaceAll(DOT_PATTERN.pattern(), ".")
+                .replaceAll(AT_PATTERN.pattern(), "@");
+    }
+
+    private static boolean containsSubstitutePatterns(String email) {
+        return DOT_PATTERN.matcher(email).find() && AT_PATTERN.matcher(email).find();
+    }
+
+    // Prefer the Konkuk domain
+    private static String selectPreferredEmail(String[] emails) {
+        return Arrays.stream(emails)
+                .filter(email -> email.endsWith(KONKUK_DOMAIN))
+                .findFirst()
+                .orElseGet(() -> emails.length > 0 ? emails[0] : EMPTY_EMAIL);
+    }
+}
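The conversion rules in EmailSupporter can be tried out in isolation. The following is a minimal sketch, assuming the same three patterns and the same "prefer @konkuk.ac.kr, else first entry" rule as the diff above; the class name EmailNormalizeSketch and its method names are hypothetical and are not part of the PR.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch that mirrors the rules of EmailSupporter; not the project's class.
class EmailNormalizeSketch {
    private static final Pattern AT = Pattern.compile("\\s+at\\s+");
    private static final Pattern DOT = Pattern.compile("\\s+dot\\s+");
    private static final Pattern EMAIL =
            Pattern.compile("^[a-zA-Z0-9_!#$%&'\\*+/=?{|}~^.-]+@[a-zA-Z0-9.-]+$");
    private static final String PREFERRED_DOMAIN = "@konkuk.ac.kr";

    static String normalize(String raw) {
        if (raw == null || raw.isBlank()) {
            return "";
        }
        // Split on '/' or ',' and normalize each piece independently.
        String[] candidates = Arrays.stream(raw.split("[/,]"))
                .map(EmailNormalizeSketch::normalizeOne)
                .toArray(String[]::new);
        // Prefer the first address on the preferred domain, otherwise fall back to the first piece.
        return Arrays.stream(candidates)
                .filter(c -> c.endsWith(PREFERRED_DOMAIN))
                .findFirst()
                .orElse(candidates.length > 0 ? candidates[0] : "");
    }

    private static String normalizeOne(String part) {
        if (EMAIL.matcher(part).matches()) {
            return part;
        }
        // Obfuscated form such as "hong at konkuk dot ac dot kr".
        if (AT.matcher(part).find() && DOT.matcher(part).find()) {
            String dotted = DOT.matcher(part).replaceAll(".");
            return AT.matcher(dotted).replaceAll("@");
        }
        return "";
    }

    public static void main(String[] args) {
        System.out.println(normalize("hong at konkuk dot ac dot kr")); // hong@konkuk.ac.kr
        System.out.println(normalize("a@gmail.com/b@konkuk.ac.kr"));   // b@konkuk.ac.kr
        System.out.println(normalize("a@gmail.com,b@naver.com"));      // a@gmail.com
    }
}
```

Note that, as in the diff, a piece that is neither a plain address nor an "at/dot" form becomes an empty string, so an unparseable first entry can end up as the selected value.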
@@ -0,0 +1,42 @@
+package com.kustacks.kuring.common.utils.converter;
+
+import java.util.regex.Pattern;
+
+public class PhoneNumberSupporter {
+
+    private static final Pattern LAST_FOUR_NUMBER_PATTERN = Pattern.compile("\\d{4}");
+    private static final Pattern FULL_NUMBER_PATTERN = Pattern.compile("02-\\d{3,4}-\\d{4}");
+    private static final Pattern FULL_NUMBER_WITH_PARENTHESES_PATTERN = Pattern.compile("02[)]\\d{3,4}-\\d{4}");
+
+    private static final String EMPTY_PHONE = "";
+
+    public static boolean isNullOrBlank(String number) {
+        return number == null || number.isBlank();
+    }
+
+    public static String convertFullExtensionNumber(String number) {
+        if (isNullOrBlank(number)) {
+            return EMPTY_PHONE;
+        }
+
+        if (FULL_NUMBER_PATTERN.matcher(number).matches()) {
+            return number;
+        }
+        if (containsLastFourNumber(number)) {
+            return "02-450-" + number;
+        }
+        if (containsParenthesesPattern(number)) {
+            return number.replace(")", "-");
+        }
+
+        return EMPTY_PHONE;
+    }
+
+    private static boolean containsLastFourNumber(String number) {
+        return LAST_FOUR_NUMBER_PATTERN.matcher(number).matches();
+    }
+
+    private static boolean containsParenthesesPattern(String number) {
+        return FULL_NUMBER_WITH_PARENTHESES_PATTERN.matcher(number).matches();
+    }
+}
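To make the three accepted input shapes concrete, here is a minimal sketch that assumes the same patterns and the same "02-450-" prefix as the diff above. The class name PhoneNormalizeSketch is hypothetical and exists only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch that mirrors the rules of PhoneNumberSupporter; not the project's class.
class PhoneNormalizeSketch {
    private static final Pattern LAST_FOUR = Pattern.compile("\\d{4}");
    private static final Pattern FULL = Pattern.compile("02-\\d{3,4}-\\d{4}");
    private static final Pattern FULL_WITH_PAREN = Pattern.compile("02[)]\\d{3,4}-\\d{4}");

    static String normalize(String raw) {
        if (raw == null || raw.isBlank()) {
            return "";
        }
        if (FULL.matcher(raw).matches()) {
            return raw;                       // already in 02-xxx-xxxx form
        }
        if (LAST_FOUR.matcher(raw).matches()) {
            return "02-450-" + raw;           // extension only: prepend the fixed prefix
        }
        if (FULL_WITH_PAREN.matcher(raw).matches()) {
            return raw.replace(")", "-");     // 02)450-3645 -> 02-450-3645
        }
        return "";                            // anything else is treated as missing
    }

    public static void main(String[] args) {
        System.out.println(normalize("3645"));          // 02-450-3645
        System.out.println(normalize("02)450-3645"));   // 02-450-3645
        System.out.println(normalize("010-1234-5678")); // prints an empty line
    }
}
```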
3 changes: 2 additions & 1 deletion src/main/java/com/kustacks/kuring/staff/domain/Email.java
@@ -30,7 +30,8 @@ public Email(String email) {
     }

     private boolean isValidEmail(String email) {
-        return !Objects.isNull(email) && patternMatches(email);
+        return Objects.nonNull(email) &&
+                (patternMatches(email) || Objects.equals(email,""));
     }

     private boolean patternMatches(String email) {
5 changes: 3 additions & 2 deletions src/main/java/com/kustacks/kuring/staff/domain/Phone.java
@@ -22,13 +22,14 @@ public class Phone {
             = Pattern.compile("(\\d{3,4})[-\\s]*(\\d{4})");
     private static final String SEOUL_AREA_CODE = "02";
     private static final String DELIMITER = "-";
+    private static final String EMPTY_NUMBER = "";

     @Column(name = "phone", length = 64)
     private String value;

     public Phone(String phone) {
         if(isEmptyNumbers(phone)) {
-            this.value = DELIMITER;
+            this.value = EMPTY_NUMBER;
             return;
         }

@@ -71,7 +72,7 @@ private boolean isValidNumbersAndSet(String phone) {
     }

     private static boolean isEmptyNumbers(String phone) {
-        return phone == null || phone.isBlank() || phone.equals(DELIMITER);
+        return phone == null || phone.isBlank();
     }

     public boolean isSameValue(String phone) {
18 changes: 16 additions & 2 deletions src/main/java/com/kustacks/kuring/staff/domain/Staff.java
@@ -29,6 +29,10 @@ public class Staff {
     @Column(name = "lab", length = 64)
     private String lab;

+    @Getter(AccessLevel.PUBLIC)
+    @Column(name = "position", length = 64)
+    private String position;
+
     @Embedded
     private Phone phone;

@@ -45,24 +49,26 @@ public class Staff {
     private College college;

     @Builder
-    private Staff(String name, String major, String lab, String phone, String email, String dept, String college) {
+    private Staff(String name, String major, String lab, String phone, String email, String dept, String college, String position) {
         this.name = new Name(name);
         this.major = major;
         this.lab = lab;
         this.phone = new Phone(phone);
         this.email = new Email(email);
         this.dept = dept;
         this.college = College.valueOf(college);
+        this.position = position;
     }

-    public void updateInformation(String name, String major, String lab, String phone, String email, String deptName, String college) {
+    public void updateInformation(String name, String major, String lab, String phone, String email, String deptName, String college, String position) {
         this.name = new Name(name);
         this.major = major;
         this.lab = lab;
         this.phone = new Phone(phone);
         this.email = new Email(email);
         this.dept = deptName;
         this.college = College.valueOf(college);
+        this.position = position;
     }

     public String getEmail() {
@@ -105,6 +111,14 @@ public boolean isSameCollege(String collegeName) {
         return this.college == College.valueOf(collegeName);
     }

+    public boolean isSamePosition(String position) {
+        return this.position.equals(position);
+    }
+
+    public String identifier() {
+        return String.join(",", getName(), position, dept);
+    }
+
     @Override
     public boolean equals(Object o) {
         if (this == o) return true;
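
The identifier() method in this hunk joins name, position, and department into one comma-separated key. A minimal sketch of how such a key could be used to find newly scraped people, assuming stored rows and scraped rows are compared by that key: the Person class and the newIdentifiers helper are hypothetical stand-ins and are not the project's Staff, StaffDto, or StaffUpdater.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of identifier-keyed comparison; not the project's update logic.
class StaffDiffSketch {
    // Minimal stand-in for a staff row with the three fields that form the key.
    static final class Person {
        final String name;
        final String position;
        final String dept;

        Person(String name, String position, String dept) {
            this.name = name;
            this.position = position;
            this.dept = dept;
        }

        // Same shape as the identifier() in the diff: name, position, dept joined by commas.
        String identifier() {
            return String.join(",", name, position, dept);
        }
    }

    // Returns identifiers that appear in the scraped list but not in the stored list.
    // A LinkedHashSet keeps scrape order and drops duplicate scraped rows.
    static Set<String> newIdentifiers(List<Person> stored, List<Person> scraped) {
        Set<String> known = new HashSet<>();
        for (Person p : stored) {
            known.add(p.identifier());
        }
        Set<String> added = new LinkedHashSet<>();
        for (Person p : scraped) {
            if (!known.contains(p.identifier())) {
                added.add(p.identifier());
            }
        }
        return added;
    }

    public static void main(String[] args) {
        List<Person> stored = Arrays.asList(new Person("Kim", "Professor", "Computer Engineering"));
        List<Person> scraped = Arrays.asList(
                new Person("Kim", "Professor", "Computer Engineering"),
                new Person("Lee", "Associate Professor", "Computer Engineering"),
                new Person("Lee", "Associate Professor", "Computer Engineering"));
        System.out.println(newIdentifiers(stored, scraped)); // [Lee,Associate Professor,Computer Engineering]
    }
}
```

One thing to keep in mind with a key like this is that String.join renders a null field as the text "null", so rows whose position is not yet populated would all share that placeholder in the key.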
@@ -1,8 +1,6 @@
 package com.kustacks.kuring.worker.parser.staff;

 import com.kustacks.kuring.worker.scrap.deptinfo.DeptInfo;
-import com.kustacks.kuring.worker.scrap.deptinfo.art_design.CommunicationDesignDept;
-import com.kustacks.kuring.worker.scrap.deptinfo.art_design.LivingDesignDept;
 import com.kustacks.kuring.worker.scrap.deptinfo.real_estate.RealEstateDept;
 import lombok.NoArgsConstructor;
 import lombok.extern.slf4j.Slf4j;
@@ -18,33 +16,22 @@ public class EachDeptStaffHtmlParser extends StaffHtmlParserTemplate {

     @Override
     public boolean support(DeptInfo deptInfo) {
-        return !(deptInfo instanceof RealEstateDept) &&
-                !(deptInfo instanceof LivingDesignDept) &&
-                !(deptInfo instanceof CommunicationDesignDept);
+        return !(deptInfo instanceof RealEstateDept);
     }

     protected Elements selectStaffInfoRows(Document document) {
-        Element table = document.select(".photo_intro").get(0);
-        return table.getElementsByTag("dl");
+        return document.select(".row");
     }

     protected String[] extractStaffInfoFromRow(Element row) {
-        Elements infos = row.getElementsByTag("dd");
-
-        // Parse in order: professor name, position, major, lab, contact number, email
-        // Lab and contact information are often missing, so the index is checked before accessing childNode
-        String name = infos.get(0).getElementsByTag("span").get(1).text();
-
-        String jobPosition = String.valueOf(infos.get(1).childNodeSize() < 2 ? "" : infos.get(1).childNode(1));
-        if (jobPosition.contains("명예") || jobPosition.contains("대우") || jobPosition.contains("휴직") || !jobPosition.contains("교수")) {
-            log.info("스크래핑 스킵 -> {} 교수", name);
-            return new String[]{};
-        }
-
-        String major = infos.get(2).childNodeSize() < 2 ? "" : String.valueOf(infos.get(2).childNode(1));
-        String lab = infos.get(3).childNodeSize() < 2 ? "" : String.valueOf(infos.get(3).childNode(1));
-        String phone = infos.get(4).childNodeSize() < 2 ? "" : String.valueOf(infos.get(4).childNode(1));
-        String email = infos.get(5).getElementsByTag("a").get(0).text();
-        return new String[]{name, major, lab, phone, email};
+        String name = row.select(".info .title .name").text();
+
+        Elements detailElement = row.select(".detail");
+        String jobPosition = detailElement.select(".ico1 dd").text().trim();
Member

Do all of these HTML elements always exist?
Some professors' entries may occasionally be missing certain information.
There seems to be room for a NullPointerException here.

Collaborator Author

I looked into the point you raised, and it seems Jsoup is built so that a null value does not come out!

For example, when running detailElement.select(".ico1 dd"):

  1. there is no element with class="ico1"
  2. class="ico1" exists but has no dd tag beneath it

I tested both cases directly, and in both an empty string "" rather than null is stored in the array.

In practice, I confirmed that a department without this data (e.g. Pre-Veterinary Medicine) falls under case 1.

Just in case, I skimmed the select method in the Jsoup library, and it appears to return an empty Elements object when no matching element is found.

public static Elements select(String query, Iterable<Element> roots) {
        Validate.notEmpty(query);
        Validate.notNull(roots);
        Evaluator evaluator = QueryParser.parse(query);
        Elements elements = new Elements();
        IdentityHashMap<Element, Boolean> seenElements = new IdentityHashMap();
        Iterator var5 = roots.iterator();

        while(var5.hasNext()) {
            Element root = (Element)var5.next();
            Elements found = select(evaluator, root);
            Iterator var8 = found.iterator();

            while(var8.hasNext()) {
                Element el = (Element)var8.next();
                if (seenElements.put(el, Boolean.TRUE) == null) {
                    elements.add(el);
                }
            }
        }

        return elements;
    }

Likewise, the text() method creates and uses an empty StringBuilder, so when there is no value it appears to return an empty string as-is.

 public String text() {
        StringBuilder sb = StringUtil.borrowBuilder();

        Element element;
        for(Iterator var2 = this.iterator(); var2.hasNext(); sb.append(element.text())) {
            element = (Element)var2.next();
            if (sb.length() != 0) {
                sb.append(" ");
            }
        }

        return StringUtil.releaseBuilder(sb);
    }

To be honest, I did think about this for a moment, but it worked, so I think I just left it as it was, haha...😂

+        String major = detailElement.select(".ico2 dd").text().trim();
+        String lab = detailElement.select(".ico3 dd").text().trim();
+        String extensionNumber = detailElement.select(".ico4 dd").text().trim();
+        String email = detailElement.select(".ico5 dd").text().trim();
+        return new String[]{name, jobPosition, major, lab, extensionNumber, email};
     }
 }

This file was deleted.

@@ -14,23 +14,20 @@ public class RealEstateStaffHtmlParser extends StaffHtmlParserTemplate {
     public boolean support(DeptInfo deptInfo) {
         return deptInfo instanceof RealEstateDept;
     }

     protected Elements selectStaffInfoRows(Document document) {
-        Element table = document.select(".sub0201_list").get(0).getElementsByTag("ul").get(0);
-        return table.getElementsByTag("li");
+        return document.select(".row");
     }

     protected String[] extractStaffInfoFromRow(Element row) {
-        Element content = row.select(".con").get(0);
-
-        String name = content.select("dl > dt > a > strong").get(0).text();
-        String major = String.valueOf(content.select("dl > dd").get(0).childNode(4)).replaceFirst("\\s", "").trim();
-
-        Element textMore = content.select(".text_more").get(0);
-
-        String lab = String.valueOf(textMore.childNode(4)).split(":")[1].replaceFirst("\\s", "").trim();
-        String phone = String.valueOf(textMore.childNode(6)).split(":")[1].replaceFirst("\\s", "").trim();
-        String email = textMore.getElementsByTag("a").get(0).text();
-        return new String[]{name, major, lab, phone, email};
+        String name = row.select(".info .title .name").text();
Member

This function could also throw an NPE, so it might be good to catch it once with a try/catch and leave a log.


+        Elements detalTagElement = row.select(".detail");
+        String jobPosition = detalTagElement.select("dt:contains(직위) + dd").text();
+        String major = detalTagElement.select("dt:contains(연구분야) + dd").text().trim();
+        String lab = detalTagElement.select("dt:contains(연구실) + dd").text().trim();
+        String extensionNumber = detalTagElement.select("dt:contains(연락처) + dd").text().trim();
+        String email = detalTagElement.select("dt:contains(이메일) + dd").text().trim();
+        return new String[]{name, jobPosition, major, lab, extensionNumber, email};
     }
 }
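
The review suggestion above (catch a failure per row and log it) can be sketched without any dependency on the parser classes. This is only an illustration under stated assumptions: the class SafeRowExtractionSketch and the method extractAll are hypothetical, the row type is left generic, and java.util.logging stands in for whatever logger the project uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch: apply a row extractor defensively so one bad row does not abort the scrape.
class SafeRowExtractionSketch {
    private static final Logger LOG = Logger.getLogger(SafeRowExtractionSketch.class.getName());

    // Applies the extractor to each row; a row that throws is logged and skipped.
    static <R> List<String[]> extractAll(List<R> rows, Function<R, String[]> extractor) {
        List<String[]> results = new ArrayList<>();
        for (R row : rows) {
            try {
                results.add(extractor.apply(row));
            } catch (RuntimeException e) {
                LOG.log(Level.WARNING, "Skipping staff row that failed to parse", e);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // The null row makes the extractor throw a NullPointerException, which is logged and skipped.
        List<String> rows = Arrays.asList("Kim,Professor", null, "Lee,Associate Professor");
        List<String[]> parsed = extractAll(rows, row -> row.split(","));
        System.out.println(parsed.size()); // 2
    }
}
```

Whether skipping a row or failing the whole department is preferable depends on how the updater treats missing staff, which this diff does not show.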

@@ -60,10 +60,11 @@ private static List<StaffDto> convertStaffDtos(DeptInfo deptInfo, List<String[]>
         return parseResult.stream()
                 .map(oneStaffInfo -> StaffDto.builder()
                         .name(oneStaffInfo[0])
-                        .major(oneStaffInfo[1])
-                        .lab(oneStaffInfo[2])
-                        .phone(oneStaffInfo[3])
-                        .email(oneStaffInfo[4])
+                        .position(oneStaffInfo[1])
+                        .major(oneStaffInfo[2])
+                        .lab(oneStaffInfo[3])
+                        .phone(oneStaffInfo[4])
+                        .email(oneStaffInfo[5])
                         .deptName(deptInfo.getDeptName())
                         .collegeName(deptInfo.getCollegeName()
                         ).build()