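[Dacon] Malicious URL Classification AI Competition (2)

Malicious URL Classification AI Competition - EDA

This post covers the exploratory data analysis (EDA) I carried out for the malicious URL classification competition. The competition was a good chance to learn about URLs themselves: how a URL is structured and, from Kaggle notebooks and papers, which preprocessing methods are commonly applied to URL data before classification. Let's walk through the EDA process.

Summary

1. Checking the data

- URL structure

2. Feature extraction techniques

3. Other techniques (BERT-based URL embedding)

Dacon Malicious URL Classification AI Competition link

: https://dacon.io/competitions/official/236451/overview/description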

Exploratory Data Analysis (EDA)

1. Checking the data

import pandas as pd

# Load the training and test data
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# Restore '[.]' back to '.'
train_df['URL'] = train_df['URL'].str.replace(r'\[\.\]', '.', regex=True)
test_df['URL'] = test_df['URL'].str.replace(r'\[\.\]', '.', regex=True)

train_df.head(5)

When you first load the data, you can see the ID, URL, and Label columns. My first impression was that the URLs come in so many different forms that I would need to study them first. So I looked at Kaggle and searched the web to see how others had classified malicious URLs.
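The `[.]` replacement in the loading code undoes what is commonly called "defanging", where dots are bracketed so that a suspicious URL cannot be clicked by accident. As a minimal sketch of the same step in plain Python (the `refang` helper name and the sample URL are my own, not from the competition data):

```python
def refang(url: str) -> str:
    """Turn a defanged URL such as 'example[.]com' back into 'example.com'."""
    # A plain string replacement is enough here; no regular expression is needed.
    return url.replace("[.]", ".")


sample = "login.example[.]com/update[.]php"  # hypothetical defanged URL
print(refang(sample))
```

The pandas equivalent would be `str.replace('[.]', '.', regex=False)`, which avoids having to escape the brackets.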

1-1. URL structure

First, the definition: a URL (Uniform Resource Locator) is "an address that specifies the location of a particular resource (web page, image, video, etc.) on the internet." In other words, it is the address used to find a web page.

Now let's look at how the URLs we see every day are structured.

A typical URL is divided into Scheme, Domain Name, Port, Path, Parameter, and Anchor. Each part is described below.

1) Scheme

: Defines the protocol the URL uses. Usually http, https, ftp, and so on.

2) Domain Name

: The address of the web server; it identifies a particular site on the internet.

3) Port

: Specifies the server port. By default http uses port 80 and https uses port 443; another port can also be given explicitly.

4) Path

: The location of a specific resource on the server.

5) Parameter

: The query string that passes additional information to the server. It starts with ?, and multiple parameters are separated by &.

6) Anchor

: A link to a specific part within the page.
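The six parts above can be pulled apart with Python's standard `urllib.parse` module. The sketch below is only an illustration: `split_url` is a helper name I made up, the example URL is hypothetical, and the `//` prefix is a workaround I assume is needed because URLs without a scheme would otherwise be parsed as a bare path.

```python
import re
from urllib.parse import urlsplit, parse_qs

_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://")


def split_url(url: str) -> dict:
    """Split a URL into scheme, domain, port, path, parameters, and anchor."""
    # Without a scheme, urlsplit treats 'example.com/a' as a path only,
    # so prefix '//' to make it read the first part as the host.
    if not _SCHEME.match(url):
        url = "//" + url
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "domain": parts.hostname,
        "port": parts.port,
        "path": parts.path,
        "parameters": parse_qs(parts.query),
        "anchor": parts.fragment,
    }


# Hypothetical example URL, not taken from the competition data
info = split_url("https://www.example.com:443/docs/page.html?key1=value1&key2=value2#section")
for name, value in info.items():
    print(f"{name}: {value}")
```

For a scheme-less input such as `example.com/a/b`, the same helper returns an empty scheme, `example.com` as the domain, and `/a/b` as the path.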

In addition, the domain name in 2) can be broken down as follows.

1) Subdomain

: Used to indicate a specific service or section.

2) Second-Level Domain (SLD)

: Usually the unique name of the website.

3) Top-Level Domain (TLD)

: Usually a country code or a generic TLD.
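To make the three levels concrete, here is a deliberately naive sketch that splits a host name on dots. The `split_domain` name and the example hosts are assumptions for illustration. It treats the last label as the TLD, which is wrong for two-label suffixes such as `co.kr` and for IP addresses; handling those properly would need something like the Public Suffix List.

```python
def split_domain(host: str):
    """Naively split a host name into (subdomain, second-level domain, TLD)."""
    labels = host.lower().strip(".").split(".")
    if len(labels) < 2:
        # A single label (e.g. 'localhost') has no TLD to separate
        return "", labels[0], ""
    tld = labels[-1]
    sld = labels[-2]
    subdomain = ".".join(labels[:-2])
    return subdomain, sld, tld


print(split_domain("blog.example.com"))   # ('blog', 'example', 'com')
print(split_domain("example.com"))        # ('', 'example', 'com')
# Limitation: 'co.kr' is a two-label suffix, so the naive split is off by one
print(split_domain("www.example.co.kr"))  # ('www.example', 'co', 'kr')
```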

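Having studied the structure of a URL this way, I next looked at the various ways people use URL characteristics to tell malicious URLs from benign ones.

2. Feature extraction techniques

To extract features from the URLs, I created columns according to the following criteria.

1) Length-based features

2) Presence-based features

3) Count-based features

4) Other features

2-1. Length-based features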

import re

# URL length
train_df['length'] = train_df['URL'].str.len()
test_df['length'] = test_df['URL'].str.len()

# Maximum and minimum URL length (computed on the test set)
max_length = test_df['length'].max()
min_length = test_df['length'].min()

# Length of the longest run of consecutive lowercase letters
train_df['max_lowercase_sequence'] = train_df['URL'].apply(lambda x: max([len(seq) for seq in re.findall(r'[a-z]+', x)] or [0]))
test_df['max_lowercase_sequence'] = test_df['URL'].apply(lambda x: max([len(seq) for seq in re.findall(r'[a-z]+', x)] or [0]))

# Length of the longest run of consecutive digits
train_df['max_numeric_sequence'] = train_df['URL'].apply(lambda x: max([len(seq) for seq in re.findall(r'\d+', x)] or [0]))
test_df['max_numeric_sequence'] = test_df['URL'].apply(lambda x: max([len(seq) for seq in re.findall(r'\d+', x)] or [0]))

# Length of the longest run of consecutive uppercase letters
train_df['max_uppercase_sequence'] = train_df['URL'].apply(lambda x: max([len(seq) for seq in re.findall(r'[A-Z]+', x)] or [0]))
test_df['max_uppercase_sequence'] = test_df['URL'].apply(lambda x: max([len(seq) for seq in re.findall(r'[A-Z]+', x)] or [0]))

# Length of the segment right after the first '/' (i.e. the first path segment,
# assuming the URL carries no 'http://' prefix)
def host_length_after_slash(url):
    if '/' in url:
        # Take everything after the first slash
        host_part = url.split('/', 1)[1]
        return len(host_part.split('/')[0])  # length up to the next slash
    return 0  # no slash: return 0

train_df['Host Length After Slash'] = train_df['URL'].apply(host_length_after_slash)
test_df['Host Length After Slash'] = test_df['URL'].apply(host_length_after_slash)

# Length of the part before the first '/' (the host)
def host_length_before_slash(url):
    if '/' in url:
        # Take everything before the first slash
        host_part = url.split('/', 1)[0]
        return len(host_part)
    return len(url)  # no slash: the whole URL is the host

train_df['Host Length Before Slash'] = train_df['URL'].apply(host_length_before_slash)
test_df['Host Length Before Slash'] = test_df['URL'].apply(host_length_before_slash)
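The three longest-run columns share one pattern, and the two slash-based lengths can be read off a single split. The sketch below shows both ideas on one made-up URL; the helper names `longest_run` and `host_and_first_segment_lengths` are mine, and it assumes, like the code above, that the URL has no `http://` prefix.

```python
import re


def longest_run(pattern: str, text: str) -> int:
    """Length of the longest substring of `text` matched by `pattern`.

    `pattern` should contain no capture groups, e.g. r'[a-z]+'.
    """
    return max((len(m) for m in re.findall(pattern, text)), default=0)


def host_and_first_segment_lengths(url: str):
    """Lengths of the part before the first '/' and of the segment right after it."""
    host, _, rest = url.partition("/")
    first_segment = rest.split("/", 1)[0]
    return len(host), len(first_segment)


url = "secure-login.example.com/ABC12345/index.html"  # hypothetical URL
print(longest_run(r"[a-z]+", url))  # 7  ('example')
print(longest_run(r"[A-Z]+", url))  # 3  ('ABC')
print(longest_run(r"\d+", url))     # 5  ('12345')
print(host_and_first_segment_lengths(url))  # (24, 8)
```

With pandas, each helper could be passed to `Series.apply` in place of the repeated lambdas.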

2-2. Presence-based features

# Keyword list
keywords = [
    "login", "bank", "secure", "update", "verify",
    "account", "password", "security", "transaction",
    "sensitive", "confidential", "payment", "access",
    "protect", "fraud", "alert", "notify", "register", "dashboard", "profile", "checkout", "cart", "search", "terms", "privacy"
]
# Count keyword occurrences in each URL and sum them into a single column
def count_keywords(url):
    return sum(url.count(keyword) for keyword in keywords)

# Lowercase every URL before counting
train_df['keyword_count'] = train_df['URL'].apply(lambda x: count_keywords(x.lower()))
test_df['keyword_count'] = test_df['URL'].apply(lambda x: count_keywords(x.lower()))

# Check whether the URL uses a URL-shortening service
def shortening_service(url):
    match = re.search(r'bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|'
                      r'yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|'
                      r'short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|'
                      r'doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|'
                      r'db\.tt|qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|'
                      r'q\.gs|is\.gd|po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|'
                      r'x\.co|prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|'
                      r'tr\.im|link\.zip\.net', url)
    if match:
        return 1  # a shortening service was found
    else:
        return 0  # no shortening service

# Add the short-URL flag as a new column
train_df['short_url'] = train_df['URL'].apply(shortening_service)
test_df['short_url'] = test_df['URL'].apply(shortening_service)
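One thing to watch with the regex above: the alternatives are not anchored, so a short pattern such as `t\.co` also matches inside an ordinary host like `microsoft.com`. A stricter variant, sketched here under my own assumptions (the `is_short_url` name, the reduced host set, and the example URLs are all illustrative), compares the extracted host against a set of shortener domains instead.

```python
import re

# Small subset of the shortener list above, for illustration only
SHORTENER_HOSTS = {"bit.ly", "goo.gl", "t.co", "tinyurl.com", "ow.ly", "is.gd"}


def is_short_url(url: str) -> int:
    """Return 1 if the URL's host is exactly a known shortener domain, else 0."""
    # Drop an optional scheme, then keep everything before the first '/', '?' or '#'
    without_scheme = re.sub(r"^[A-Za-z][A-Za-z0-9+.\-]*://", "", url)
    host = re.split(r"[/?#]", without_scheme, maxsplit=1)[0].lower()
    # Ignore a trailing port such as ':8080' and a leading 'www.'
    host = host.split(":", 1)[0]
    if host.startswith("www."):
        host = host[4:]
    return int(host in SHORTENER_HOSTS)


# The unanchored pattern fires on a host that is not a shortener
print(re.search(r"t\.co", "microsoft.com/login") is not None)  # True
print(is_short_url("microsoft.com/login"))  # 0
print(is_short_url("bit.ly/abc123"))        # 1
```

Whether the stricter flag helps the model is an empirical question; the looser regex may still carry signal, just not the one its name suggests.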

# File extensions to check
extensions = [
    ".jpg", ".jpeg", ".png", ".gif", ".bmp",  # images
    ".pdf", ".doc", ".docx", ".xls", ".xlsx",  # documents
    ".mp4", ".avi", ".mov",  # video
    ".mp3", ".wav",  # audio
    ".zip", ".rar",  # archives
    ".tiff", ".tif",  # images
    ".webp",  # images
    ".svg",  # images
    ".ppt", ".pptx",  # documents
    ".txt",  # documents
    ".csv",  # documents
    ".xml",  # documents
    ".html", ".htm", ".hwpx",  # documents (the trailing comma was missing, which merged ".hwpx" with ".mkv")
    ".mkv",  # video
    ".wmv",  # video
    ".flv",  # video
    ".mpeg", ".mpg",  # video
    ".aac",  # audio
    ".flac",  # audio
    ".ogg",  # audio
    ".7z",  # archives
    ".tar",  # archives
    ".gz",  # archives
    ".bz2",  # archives
    ".iso",  # other
    ".json",  # other
    ".md",  # other
    ".psd",  # other
    ".ai",  # other
    ".lnk", ".vbs"
]


# Check whether the URL ends with one of the file extensions
def check_extensions(url):
    return any(url.lower().endswith(ext) for ext in extensions)

# New column: has_extension
train_df['has_extension'] = train_df['URL'].apply(check_extensions)
test_df['has_extension'] = test_df['URL'].apply(check_extensions)

# Special characters to count
special_characters = ['-', '=']

# Create a count column for each special character
for char in special_characters:
    train_df[f'count_{char}'] = train_df['URL'].apply(lambda x: x.count(char))
    test_df[f'count_{char}'] = test_df['URL'].apply(lambda x: x.count(char))
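The `has_extension` check above looks at the end of the whole URL, so a file followed by a query string (for example `report.pdf?download=1`) is not detected. A possible variant, sketched with a hypothetical `has_file_extension` helper and a shortened extension set, strips the query string and fragment first and then reads the extension of the last path segment.

```python
import posixpath
import re

CHECKED_EXTENSIONS = {".pdf", ".zip", ".html", ".jpg"}  # subset for illustration


def has_file_extension(url: str) -> bool:
    """True if the path part of the URL ends with one of the checked extensions."""
    # Cut off the query string and fragment before looking at the extension
    path = re.split(r"[?#]", url, maxsplit=1)[0]
    # splitext only considers dots in the last '/'-separated segment
    _, ext = posixpath.splitext(path.lower())
    return ext in CHECKED_EXTENSIONS


url = "example.com/files/report.pdf?download=1"  # hypothetical URL
print(url.lower().endswith(".pdf"))  # False: the query string hides the extension
print(has_file_extension(url))       # True
```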

2-3. Count-based features

import re
import string

# Number of '/' characters (path depth)
train_df['path_depth'] = train_df['URL'].str.count('/')
test_df['path_depth'] = test_df['URL'].str.count('/')

# Number of subdomains (dot-separated parts of the URL minus 2)
train_df['subdomain_count'] = train_df['URL'].str.split('.').apply(lambda x: len(x) - 2)
test_df['subdomain_count'] = test_df['URL'].str.split('.').apply(lambda x: len(x) - 2)

# Number of 'www' occurrences in the URL
train_df['count-www'] = train_df['URL'].apply(lambda url: url.count('www'))
test_df['count-www'] = test_df['URL'].apply(lambda url: url.count('www'))

# Number of 'mail' occurrences
train_df['count-mail'] = train_df['URL'].str.count('mail')
test_df['count-mail'] = test_df['URL'].str.count('mail')

# Number of 'blog' occurrences
train_df['count-blog'] = train_df['URL'].str.count('blog')
test_df['count-blog'] = test_df['URL'].str.count('blog')

# Number of digits
train_df['digit_count'] = train_df['URL'].str.count(r'\d')
test_df['digit_count'] = test_df['URL'].str.count(r'\d')

# Number of lowercase letters (a count, not a ratio)
train_df['lowercase_count'] = train_df['URL'].str.count(r'[a-z]')
test_df['lowercase_count'] = test_df['URL'].str.count(r'[a-z]')

# Number of uppercase letters (a count, not a ratio)
train_df['uppercase_count'] = train_df['URL'].str.count(r'[A-Z]')
test_df['uppercase_count'] = test_df['URL'].str.count(r'[A-Z]')

# Count letters, digits, and special characters in a string
def count_characters(text):
    if not isinstance(text, str):  # missing value (None/NaN)
        return 0, 0, 0
    letters_count = sum(c.isalpha() for c in text)  # letters
    digits_count = sum(c.isdigit() for c in text)   # digits
    special_chars_count = sum(c in string.punctuation for c in text)  # special characters

    return letters_count, digits_count, special_chars_count

# NOTE: the 'Prime_url' and 'Other_domain' columns are not created in this post;
# they are assumed to come from an earlier preprocessing step.
train_df['letters_count'], train_df['digits_count'], train_df['special_chars_count'] = zip(*train_df['Prime_url'].apply(count_characters))
test_df['letters_count'], test_df['digits_count'], test_df['special_chars_count'] = zip(*test_df['Prime_url'].apply(count_characters))

# Same counts for Other_domain
train_df['letters_count_other'], train_df['digits_count_other'], train_df['special_chars_count_other'] = zip(*train_df['Other_domain'].apply(count_characters))
test_df['letters_count_other'], test_df['digits_count_other'], test_df['special_chars_count_other'] = zip(*test_df['Other_domain'].apply(count_characters))

# Number of characters left after removing ASCII letters
def non_alpha_count(url):
    return len(re.sub(r'[a-zA-Z]', '', url))

train_df['Non Alpha Count'] = train_df['URL'].apply(non_alpha_count)
test_df['Non Alpha Count'] = test_df['URL'].apply(non_alpha_count)
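Note that `subdomain_count` above splits the whole URL on dots, so dots in the path (such as the one in `index.html`) are counted as well. If the intent is to count subdomains only, one option is to restrict the split to the host. The sketch below compares the two on made-up URLs; both function names are mine, and the host-only version is still naive about suffixes like `co.kr`.

```python
def whole_url_count(url: str) -> int:
    """The formula used above: split the whole URL on '.' and subtract 2."""
    return len(url.split(".")) - 2


def host_subdomain_count(url: str) -> int:
    """Count the labels in front of the last two labels of the host (naive)."""
    host = url.split("/", 1)[0]  # assumes the URL has no 'http://' prefix
    labels = [label for label in host.split(".") if label]
    return max(len(labels) - 2, 0)


for url in ["example.com/index.html", "mail.shop.example.com/a.b.c", "localhost"]:
    print(url, whole_url_count(url), host_subdomain_count(url))
```

The whole-URL version is effectively a dot count, which may still be a useful feature on its own; the difference is mainly in what the column name promises.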

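2-4. Other features

import hashlib

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Map a top-level domain to a region name through a ccTLD lookup table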
def get_url_region(primary_domain):
    ccTLD_to_region =  {
".ac": "Ascension Island",
".ad": "Andorra",
".ae": "United Arab Emirates",
".af": "Afghanistan",
".ag": "Antigua and Barbuda",
".ai": "Anguilla",
".al": "Albania",
".am": "Armenia",
".an": "Netherlands Antilles",
".ao": "Angola",
".aq": "Antarctica",
".ar": "Argentina",
".as": "American Samoa",
".at": "Austria",
".au": "Australia",
".aw": "Aruba",
".ax": "Åland Islands",
".az": "Azerbaijan",
".ba": "Bosnia and Herzegovina",
".bb": "Barbados",
".bd": "Bangladesh",
".be": "Belgium",
".bf": "Burkina Faso",
".bg": "Bulgaria",
".bh": "Bahrain",
".bi": "Burundi",
".bj": "Benin",
".bm": "Bermuda",
".bn": "Brunei Darussalam",
".bo": "Bolivia",
".br": "Brazil",
".bs": "Bahamas",
".bt": "Bhutan",
".bv": "Bouvet Island",
".bw": "Botswana",
".by": "Belarus",
".bz": "Belize",
".ca": "Canada",
".cc": "Cocos Islands",
".cd": "Democratic Republic of the Congo",
".cf": "Central African Republic",
".cg": "Republic of the Congo",
".ch": "Switzerland",
".ci": "Côte d'Ivoire",
".ck": "Cook Islands",
".cl": "Chile",
".cm": "Cameroon",
".cn": "China",
".co": "Colombia",
".cr": "Costa Rica",
".cu": "Cuba",
".cv": "Cape Verde",
".cw": "Curaçao",
".cx": "Christmas Island",
".cy": "Cyprus",
".cz": "Czech Republic",
".de": "Germany",
".dj": "Djibouti",
".dk": "Denmark",
".dm": "Dominica",
".do": "Dominican Republic",
".dz": "Algeria",
".ec": "Ecuador",
".ee": "Estonia",
".eg": "Egypt",
".er": "Eritrea",
".es": "Spain",
".et": "Ethiopia",
".eu": "European Union",
".fi": "Finland",
".fj": "Fiji",
".fk": "Falkland Islands",
".fm": "Federated States of Micronesia",
".fo": "Faroe Islands",
".fr": "France",
".ga": "Gabon",
".gb": "United Kingdom",
".gd": "Grenada",
".ge": "Georgia",
".gf": "French Guiana",
".gg": "Guernsey",
".gh": "Ghana",
".gi": "Gibraltar",
".gl": "Greenland",
".gm": "Gambia",
".gn": "Guinea",
".gp": "Guadeloupe",
".gq": "Equatorial Guinea",
".gr": "Greece",
".gs": "South Georgia and the South Sandwich Islands",
".gt": "Guatemala",
".gu": "Guam",
".gw": "Guinea-Bissau",
".gy": "Guyana",
".hk": "Hong Kong",
".hm": "Heard Island and McDonald Islands",
".hn": "Honduras",
".hr": "Croatia",
".ht": "Haiti",
".hu": "Hungary",
".id": "Indonesia",
".ie": "Ireland",
".il": "Israel",
".im": "Isle of Man",
".in": "India",
".io": "British Indian Ocean Territory",
".iq": "Iraq",
".ir": "Iran",
".is": "Iceland",
".it": "Italy",
".je": "Jersey",
".jm": "Jamaica",
".jo": "Jordan",
".jp": "Japan",
".ke": "Kenya",
".kg": "Kyrgyzstan",
".kh": "Cambodia",
".ki": "Kiribati",
".km": "Comoros",
".kn": "Saint Kitts and Nevis",
".kp": "Democratic People's Republic of Korea (North Korea)",
".kr": "Republic of Korea (South Korea)",
".kw": "Kuwait",
".ky": "Cayman Islands",
".kz": "Kazakhstan",
".la": "Laos",
".lb": "Lebanon",
".lc": "Saint Lucia",
".li": "Liechtenstein",
".lk": "Sri Lanka",
".lr": "Liberia",
".ls": "Lesotho",
".lt": "Lithuania",
".lu": "Luxembourg",
".lv": "Latvia",
".ly": "Libya",
".ma": "Morocco",
".mc": "Monaco",
".md": "Moldova",
".me": "Montenegro",
".mf": "Saint Martin (French part)",
".mg": "Madagascar",
".mh": "Marshall Islands",
".mk": "North Macedonia",
".ml": "Mali",
".mm": "Myanmar",
".mn": "Mongolia",
".mo": "Macao",
".mp": "Northern Mariana Islands",
".mq": "Martinique",
".mr": "Mauritania",
".ms": "Montserrat",
".mt": "Malta",
".mu": "Mauritius",
".mv": "Maldives",
".mw": "Malawi",
".mx": "Mexico",
".my": "Malaysia",
".mz": "Mozambique",
".na": "Namibia",
".nc": "New Caledonia",
".ne": "Niger",
".nf": "Norfolk Island",
".ng": "Nigeria",
".ni": "Nicaragua",
".nl": "Netherlands",
".no": "Norway",
".np": "Nepal",
".nr": "Nauru",
".nu": "Niue",
".nz": "New Zealand",
".om": "Oman",
".pa": "Panama",
".pe": "Peru",
".pf": "French Polynesia",
".pg": "Papua New Guinea",
".ph": "Philippines",
".pk": "Pakistan",
".pl": "Poland",
".pm": "Saint Pierre and Miquelon",
".pn": "Pitcairn",
".pr": "Puerto Rico",
".ps": "Palestinian Territory",
".pt": "Portugal",
".pw": "Palau",
".py": "Paraguay",
".qa": "Qatar",
".re": "Réunion",
".ro": "Romania",
".rs": "Serbia",
".ru": "Russia",
".rw": "Rwanda",
".sa": "Saudi Arabia",
".sb": "Solomon Islands",
".sc": "Seychelles",
".sd": "Sudan",
".se": "Sweden",
".sg": "Singapore",
".sh": "Saint Helena",
".si": "Slovenia",
".sj": "Svalbard and Jan Mayen",
".sk": "Slovakia",
".sl": "Sierra Leone",
".sm": "San Marino",
".sn": "Senegal",
".so": "Somalia",
".sr": "Suriname",
".ss": "South Sudan",
".st": "São Tomé and Príncipe",
".sv": "El Salvador",
".sx": "Sint Maarten (Dutch part)",
".sy": "Syria",
".sz": "Eswatini",
".tc": "Turks and Caicos Islands",
".td": "Chad",
".tf": "French Southern Territories",
".tg": "Togo",
".th": "Thailand",
".tj": "Tajikistan",
".tk": "Tokelau",
".tl": "Timor-Leste",
".tm": "Turkmenistan",
".tn": "Tunisia",
".to": "Tonga",
".tr": "Turkey",
".tt": "Trinidad and Tobago",
".tv": "Tuvalu",
".tw": "Taiwan",
".tz": "Tanzania",
".ua": "Ukraine",
".ug": "Uganda",
".uk": "United Kingdom",
".us": "United States",
".uy": "Uruguay",
".uz": "Uzbekistan",
".va": "Vatican City",
".vc": "Saint Vincent and the Grenadines",
".ve": "Venezuela",
".vg": "British Virgin Islands",
".vi": "U.S. Virgin Islands",
".vn": "Vietnam",
".vu": "Vanuatu",
".wf": "Wallis and Futuna",
".ws": "Samoa",
".ye": "Yemen",
".yt": "Mayotte",
".za": "South Africa",
".zm": "Zambia",
".zw": "Zimbabwe"
}

    for ccTLD in ccTLD_to_region:
        if primary_domain.endswith(ccTLD):
            return ccTLD_to_region[ccTLD]

    return "ETC"

# Extract the top-level domain from Prime_url
def extract_top_level_domain(prime_url):
    # Split on '.' and take the last part
    domain_parts = prime_url.split('.')
    return '.' + domain_parts[-1]  # top-level domain

# Extract the top-level domain from Prime_url, then look up its region
train_df['Top Level Domain'] = train_df['Prime_url'].apply(extract_top_level_domain)
train_df['Region'] = train_df['Top Level Domain'].apply(get_url_region)

test_df['Top Level Domain'] = test_df['Prime_url'].apply(extract_top_level_domain)
test_df['Region'] = test_df['Top Level Domain'].apply(get_url_region)

# Create the LabelEncoder
label_encoder = LabelEncoder()
# Label-encode the Region column (training data)
train_df['Region_Encoded'] = label_encoder.fit_transform(train_df['Region'])
# Apply the same encoder to the test data
# (transform raises ValueError if the test set contains a region not seen in training)
test_df['Region_Encoded'] = label_encoder.transform(test_df['Region'])
# Mapping between the original labels and their encoded values
label_mapping = dict(zip(label_encoder.classes_, range(len(label_encoder.classes_))))

def hash_url(url):
    if url is None:
        return 0  # return 0 for None
    # SHA-256 gives a 256-bit integer, so pandas stores this column as Python objects
    return int(hashlib.sha256(url.encode()).hexdigest(), 16)

# Hash the Prime_url and Other_domain columns into new columns
train_df['Prime_url_hashed'] = train_df['Prime_url'].apply(hash_url)
train_df['Other_domain_hashed'] = train_df['Other_domain'].apply(hash_url)

test_df['Prime_url_hashed'] = test_df['Prime_url'].apply(hash_url)
test_df['Other_domain_hashed'] = test_df['Other_domain'].apply(hash_url)


# Concatenate the code points of all characters and keep the last 10 digits
def url_to_number(url):
    return int(''.join(str(ord(c)) for c in url)) % (10 ** 10)

# Convert the Prime_url column to a number
train_df['Prime_url_encoded'] = train_df['Prime_url'].apply(url_to_number)
test_df['Prime_url_encoded'] = test_df['Prime_url'].apply(url_to_number)
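A full SHA-256 value does not fit in a 64-bit integer column, which can be awkward for models that expect numeric input. One alternative, which is my own substitution rather than what was used in the competition, is to fold the hash into a fixed number of buckets (often called the hashing trick). The bucket count and the `hash_bucket` name below are assumptions.

```python
import hashlib

N_BUCKETS = 2 ** 20  # assumed bucket count; tune to the number of distinct domains


def hash_bucket(text, n_buckets: int = N_BUCKETS) -> int:
    """Map a string to a stable bucket index in [0, n_buckets)."""
    if not isinstance(text, str):
        return 0  # missing values (None/NaN) are sent to bucket 0
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned 64-bit integer, then fold into buckets
    return int.from_bytes(digest[:8], "big") % n_buckets


full_hash = int(hashlib.sha256("example.com".encode()).hexdigest(), 16)
print(full_hash.bit_length())      # far more than 64 bits
print(hash_bucket("example.com"))  # small, stable integer
```

Different domains can collide in the same bucket, so the bucket count trades memory for collision rate.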

3. BERT Based URL Embedding

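The features above already give a prediction accuracy above 95%, but I wanted to push performance a little further. Looking through Kaggle, I found notebooks that extract features with BERT-based URL embeddings, and with that approach I was able to confirm performance above 96%.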

import torch
import numpy as np
from transformers import BertTokenizer, BertModel
from tqdm import tqdm  # progress bar

# Use the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased').to(device)

# Build URL embeddings for train_df (with a progress bar)
url_embeddings_train = []
for url in tqdm(train_df['URL'], desc="Processing train URLs"):
    tokens = tokenizer.encode(url, add_special_tokens=True, max_length=512, truncation=True, return_tensors='pt').to(device)
    with torch.no_grad():
        outputs = model(tokens)
    # Mean-pool the token vectors into one vector per URL, then move it to the CPU
    embeddings = outputs.last_hidden_state.mean(dim=1).squeeze(0).cpu().numpy()
    url_embeddings_train.append(embeddings)

# Stack the per-URL vectors into one (n_urls, 768) tensor
# (torch.tensor on a list of NumPy arrays is slow and triggers a warning)
url_embeddings_train = torch.from_numpy(np.stack(url_embeddings_train))

# Build URL embeddings for test_df (with a progress bar)
url_embeddings_test = []
for url in tqdm(test_df['URL'], desc="Processing test URLs"):
    tokens = tokenizer.encode(url, add_special_tokens=True, max_length=512, truncation=True, return_tensors='pt').to(device)
    with torch.no_grad():
        outputs = model(tokens)
    # Mean-pool the token vectors into one vector per URL, then move it to the CPU
    embeddings = outputs.last_hidden_state.mean(dim=1).squeeze(0).cpu().numpy()
    url_embeddings_test.append(embeddings)

url_embeddings_test = torch.from_numpy(np.stack(url_embeddings_test))
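The `.mean(dim=1)` call above is mean pooling: the vectors of all tokens in a URL are averaged into one vector. Because the loop encodes one URL at a time, there is no padding to worry about. If the URLs were batched to speed things up, padded positions would need to be excluded with the attention mask. The following dependency-free sketch shows that idea on toy numbers; it is an illustration with a made-up `masked_mean_pool` helper, not the code used in the competition.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1, ignoring padded positions."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy example: 3 real tokens followed by 1 padding position, hidden size 2
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]
print(masked_mean_pool(tokens, mask))          # [3.0, 4.0]
print(masked_mean_pool(tokens, [1, 1, 1, 1]))  # [2.25, 3.0], diluted by the padding
```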

■ References

URL structure

: https://velog.io/@liankim/URL%EC%9D%98-%EA%B5%AC%EC%A1%B0

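URL features

: 김영준, 이재우. (2022). URL 주요특징을 고려한 악성URL 머신러닝 탐지모델 개발 [Development of a machine-learning model for detecting malicious URLs based on key URL features]. 한국정보통신학회논문지, 26(12), 1786-1793.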

https://www.dbpia.co.kr/journal/articleDetail?nodeId=NODE11185752

Kaggle

: https://www.kaggle.com/code/prasanthsagar32/nlp-final-project-urldecomposition-bert
