[Medical Term Detection and Interpretation Model] Corpus Construction


[Medical Term Detection and Interpretation Model] Corpus Construction - [1] Crawling

나는 라미 2023. 4. 21. 14:29


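The goal is to build a dictionary of medical terms and their definitions by crawling a reliable medical terminology dictionary site.

Crawling, put simply, means fetching web pages and extracting the information you want from them.

Python has a wide range of ready-made libraries, so once you learn how to use them the work is straightforward.

The libraries needed for crawling are likewise existing Python libraries: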

- BeautifulSoup (the beautifulsoup4 package, imported as bs4)
- requests

Install both libraries first.
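Before going further, it can help to confirm that both libraries are actually importable. The following is a minimal sketch using only the standard library; the helper name missing_packages is my own and not part of either library. Note that the PyPI name beautifulsoup4 differs from the import name bs4.

```python
import importlib.util

# Import name -> PyPI name. BeautifulSoup is installed as "beautifulsoup4"
# but imported as "bs4"; requests uses the same name for both.
REQUIRED = {"bs4": "beautifulsoup4", "requests": "requests"}


def missing_packages(required=REQUIRED):
    """Return the PyPI names of the packages that cannot be imported."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    # Print the install command instead of running it automatically.
    print("Install with: pip install " + " ".join(missing))
else:
    print("All crawling dependencies are available.")
```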

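Open the page you want to crawl and press F12 to open the browser's developer tools.

Hovering over an element in the developer tools highlights the corresponding part of the page, so you can see which part of the page it is.

Once you have located the element that holds the information you want, pull it out with a regular expression and the crawl is done.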
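The locate-then-extract step can be tried without touching the network. The sketch below parses a small, made-up HTML snippet (it is an assumption about what a term list might look like, not markup copied from a real site), finds the anchors by class, and uses a regular expression to take the text between the tags. The function name extract_terms is hypothetical.

```python
import re

from bs4 import BeautifulSoup

# Made-up HTML standing in for a term-list page.
SAMPLE_HTML = """
<div class="list-group">
  <a class="list-group-item active" href="#">A</a>
  <a class="list-group-item" href="/term/1">abdomen</a>
  <a class="list-group-item" href="/term/2">abscess</a>
</div>
"""


def extract_terms(html):
    """Return (term, link) pairs for anchors whose class is exactly list-group-item."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for anchor in soup.find_all("a", "list-group-item"):
        # Skip anchors that carry extra classes, such as the active header item.
        if anchor["class"] != ["list-group-item"]:
            continue
        # Capture everything between the first ">" and the last "<" of the tag.
        match = re.search(r">(.*)<", str(anchor))
        if match:
            pairs.append((match.group(1), anchor["href"]))
    return pairs


print(extract_terms(SAMPLE_HTML))
```

For simple anchors like these, anchor.get_text() would return the same text without a regular expression; the regex version is shown because it mirrors the approach described above.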

The code used to crawl the medical terminology pages is as follows.

Full code

# ----- about crawling -------
import csv
import re
import time

import requests
from bs4 import BeautifulSoup


def crawling():
    # Placeholder address: replace it with the main page of the site to crawl.
    url = "http://www.원하는 페이지/"

    # Naver blocks requests that are sent without a header, probably because so many people crawl it.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
    }

    # Send the request
    r = requests.get(url, headers=headers)
    # Parse the main page
    soup = BeautifulSoup(r.text, "html.parser")

    # Collect each category section
    categoris = soup.find_all("div", "section-content")

    # Links to the per-category index pages
    jaso_index = []
    # Store the address of each category
    jaso_index = get_categoris(categoris, jaso_index)

    medical_term, medical_link = get_term(jaso_index)
    medical_value = get_value(medical_link)

    write_csv('medical_data.csv', medical_value, medical_term)


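def get_categoris(categoris, jaso_index):
    """Append the href of every category link found in the sections to jaso_index."""
    for idx in categoris:
        cat_list = idx.find_all("li")
        for inx in cat_list:
            link = inx.find("a")
            # Skip list items that do not contain a link
            if link is not None:
                jaso_index.append(link["href"])
    return jaso_index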

def get_term(jaso_index):
    """Visit each category page and return the terms and their detail-page links."""
    medical_term, medical_link = [], []

    for ith in jaso_index:
        url = f"http://www.kmle.co.kr/{ith}"
        r = requests.get(url)
        soup = BeautifulSoup(r.text, "html.parser")
        term_list = soup.find_all("div", "list-group")

        tmp_term, tmp_link = [], []

        for content in term_list:
            for j in content.find_all("a", "list-group-item"):
                # Keep only anchors whose class is exactly "list-group-item"
                if j["class"] != ['list-group-item']:
                    continue
                a = str(j)

                # Take the text between the first ">" and the last "<" of the tag
                b = re.search(">.*<", a)
                if b is None:
                    continue
                tmp_term.append(b[0][1:-1])
                tmp_link.append(j["href"])

        medical_term.append(tmp_term)
        medical_link.append(tmp_link)
    return medical_term, medical_link

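def get_value(medical_link):
    """Fetch each term page and return the definitions, grouped like medical_link."""
    medical_value = []
    for links in medical_link:
        print("get link")
        tmp_value = []
        for link in links:
            url = f"http://www.kmle.co.kr/{link}"
            r = requests.get(url)
            # Pause between requests to avoid overloading the server
            time.sleep(0.5)
            soup = BeautifulSoup(r.text, "html.parser")
            panel = soup.find("div", "panel-body")
            # Store an empty string when the panel is missing, so the values
            # stay aligned with the terms that write_csv pairs them with.
            txt = panel.text.strip().replace('\xa0', ' ') if panel is not None else ''
            tmp_value.append(txt)
        medical_value.append(tmp_value)
    return medical_value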

# ----- about csv -------
def write_csv(fname, medical_value, medical_term):
    with open(fname, 'w', encoding='utf8', newline='') as f:
        wr = csv.writer(f)
        for cat_idx, sents in enumerate(medical_value):
            for snt_idx, value in enumerate(sents):
                wr.writerow([medical_term[cat_idx][snt_idx], value])

#def write_csv(fname, medi_dict):
#    with open(fname, 'w', encoding='ANSI', newline = '') as f:
#        wr = csv.writer(f)
#        for idx, term in enumerate(medi_dict.items()):
#            wr.writerow([term[0], term[1]])

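def read_csv(fname):
    # Read all rows eagerly so the file is closed before returning
    with open(fname, 'r', encoding='utf8', newline='') as f:
        return list(csv.reader(f))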

It looks complicated, but taken apart piece by piece it is not.
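One way to see this is to run a single piece on its own. The sketch below exercises only the CSV step, using made-up terms and definitions in the same nested shape (one inner list per category) that the crawler produces. The names write_rows and read_rows are hypothetical stand-ins rather than the functions from the full code, and the file is written to a temporary directory.

```python
import csv
import os
import tempfile


def write_rows(fname, terms, values):
    """Write one (term, definition) row per entry, category by category."""
    with open(fname, "w", encoding="utf8", newline="") as f:
        writer = csv.writer(f)
        for cat_terms, cat_values in zip(terms, values):
            for term, value in zip(cat_terms, cat_values):
                writer.writerow([term, value])


def read_rows(fname):
    """Read the file back as a list of [term, definition] rows."""
    with open(fname, "r", encoding="utf8", newline="") as f:
        return list(csv.reader(f))


# Made-up data: two categories, three terms in total.
terms = [["abdomen", "abscess"], ["biopsy"]]
values = [["made-up definition 1", "made-up definition 2"], ["made-up definition 3"]]

path = os.path.join(tempfile.mkdtemp(), "medical_data_demo.csv")
write_rows(path, terms, values)
rows = read_rows(path)
print(rows)
```

Pairing terms and values with zip means a category with a missing definition silently drops the unmatched term instead of raising an IndexError, which is one possible design choice for this step.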

The site I wanted to crawl has a main page whose categories are divided by jaso (Korean letters) and by the Latin alphabet.
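To make that layout concrete, here is a minimal sketch of collecting the category links from such a main page. The HTML is invented for illustration (only the section-content class name comes from the full code), and category_links is a hypothetical helper that follows the same idea as get_categoris.

```python
from bs4 import BeautifulSoup

# Invented main-page markup: one section for jaso, one for the alphabet.
MAIN_PAGE_HTML = """
<div class="section-content">
  <ul>
    <li><a href="index/ga">ㄱ</a></li>
    <li><a href="index/na">ㄴ</a></li>
  </ul>
</div>
<div class="section-content">
  <ul>
    <li><a href="index/a">A</a></li>
    <li><a href="index/b">B</a></li>
  </ul>
</div>
"""


def category_links(html):
    """Return the href of every category link, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for section in soup.find_all("div", "section-content"):
        for item in section.find_all("li"):
            anchor = item.find("a")
            # Ignore list items that have no link or no href attribute.
            if anchor is not None and anchor.has_attr("href"):
                links.append(anchor["href"])
    return links


print(category_links(MAIN_PAGE_HTML))
```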