[Python] Beutiful Soup4 - decompose() 와 extract()

목표

미디어 위키 페이지에서 중첩된 리스트 내의 요소들을 텍스트로 추출한다.

위와 같은 중첩 리스트에서 list 내의 text를 모두 추출한 결과 다음과 같았다.

중첩 리스트의 텍스트가 모두 추출되고, 그 다음 요소에 중복해서 등장한다.

따라서 두 list를 분리하기 위해 두 가지 방식을 알게 되었다.

.decompose()

태그를 트리에서 제거한 다음, 그와 그의 내용물을 완전히 파괴한다.

lists = content.find_all('li')

    for idx in range(len(lists)):
        if lists[idx].ul is not None:
            lists[idx].ul.decompose()
            
        text = lists[idx].text
        print(f"{idx} 번째 list : {text}")

그 결과 자식 list의 태그가 완전히 제거되어서 다음 순서에서 text를 추출할 수 없었다.

.extract()

실행 시에만 1회성으로 태그를 제거하고 return한다.

.decompose()는 None을 리턴하는 반면, .extract()는 제거된 태그를 리턴한다.

lists = content.find_all('li')

    for idx in range(len(lists)):
        if lists[idx].ul is not None:
            lists[idx].ul.extract()
            
        text = lists[idx].text
        print(f"{idx} 번째 list : {text}")

전체 코드

import json
import requests


from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.parse import urlparse, urlencode

import time

from inkling.fulltext_search import FullTextSearch, FullTextSearchException
from inkling.env import INKLING_HOME
import pathlib
from tunip.yaml_loader import YamlLoader

import re
import string
from konlpy.tag import Mecab

if __name__=="__main__":

    req_session = requests.Session()
    params = urlencode(
        {
            "title": "KIA",
        }
    )
    res = req_session.get(
        url=f"http://ascentkorea.iptime.org:31015/mediawiki/index.php?{params}"
    )
    if res.status_code == 200:
        soup = BeautifulSoup(res.text, 'html.parser')
    else:
        print("fail")

    topic_keyword=["자동차"]

    candidate = {}
    
    content = soup.find('div', class_='mw-body')
    lists = content.find_all('li')

    for idx in range(len(lists)):
        if lists[idx].ul is not None:
            lists[idx].ul.extract()
            
        text = lists[idx].text

        candidate[idx] = 0

        print(f"{idx} 번째 list : {text}")
        for topic in topic_keyword:
            match = re.findall(topic, text)
            if match is not None:
                candidate[idx] += len(match)
        
    candidate = sorted(candidate.items(), key=lambda x: x[1], reverse=True)
    idx = candidate[0][0]
    print(lists[idx].find('a')['title'])

Ref.

https://studyforus.com/innisfree/650714

https://yongbeomkim.github.io/python/bs4-tutorial/

'Python' 카테고리의 다른 글

[디자인패턴] 파이썬에서 Decorator Pattern (데코레이터 패턴) 구현하기 (0)	2023.05.21
[디자인패턴] 파이썬에서 Delegate Pattern (위임 패턴) 구현하기 (1)	2023.05.07
[Python/NLP] 위키피디아 덤프 데이터에서 하이퍼링크(anchor text) 추출하기 (0)	2021.12.06
[Python] 백준 알고리즘 5단계: 1차원 배열 (0)	2021.09.11

자, 연어를 먹어보쟈 🍣

[Python] Beutiful Soup4 - decompose() 와 extract()

.decompose()

.extract()

'Python' 카테고리의 다른 글

티스토리툴바

[Python] Beutiful Soup4 - decompose() 와 extract()

.decompose()

.extract()

'Python' 카테고리의 다른 글

'Python' Related Articles

티스토리툴바