Crawling ) requests + BeautifulSoup

EEYatHo 2023. 3. 15. 22:22

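requests + BeautifulSoup

One way to crawl a web page.
The requests library makes it easy to fetch HTML,
and the BeautifulSoup library parses that HTML into a form that is easy to work with.
Together they let you find the tags you want and extract their data.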
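
As a minimal offline sketch of that flow, the snippet below parses a small hard-coded HTML string instead of a downloaded page, so it runs without network access. The markup, the `SAMPLE_HTML` name, and the `extract_headline` helper are made up for illustration; in a real crawl the string would come from `requests.get(...).text`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page
SAMPLE_HTML = """
<html>
  <body>
    <h1 id="headline">Hello, crawler</h1>
    <p class="summary">First paragraph</p>
  </body>
</html>
"""


def extract_headline(html):
    """Return the text of the tag with id="headline", or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one("#headline")
    return tag.text if tag is not None else None


print(extract_headline(SAMPLE_HTML))
```

Returning None when the tag is missing keeps the caller from hitting an AttributeError if the page layout changes.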

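Limitations
- Pages that require login are very hard to crawl ( you have to manage the session yourself )
- Dynamic pages cannot be crawled, because requests only returns the HTML the server sends and does not run JavaScript ( use selenium for dynamic pages )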
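
The dynamic-page limitation can be illustrated offline. In the hypothetical page below, the `#price` element is empty in the HTML the server sends and would only be filled in by JavaScript running in a browser. BeautifulSoup parses the markup but never executes the script, so there is no value to extract. The markup and variable names are invented for this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical server response: the price is filled in later by JavaScript
DYNAMIC_HTML = """
<html>
  <body>
    <div id="price"></div>
    <script>
      document.getElementById("price").textContent = "9,900";
    </script>
  </body>
</html>
"""

soup = BeautifulSoup(DYNAMIC_HTML, "html.parser")
price_tag = soup.select_one("#price")

# The tag exists, but its text is empty because the script never ran
price_text = price_tag.text.strip()
print(repr(price_text))
```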

Usage

Loading a page and selecting a tag

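import requests
from bs4 import BeautifulSoup

# Fetch the HTML (the timeout keeps the request from hanging indefinitely)
response = requests.get("https://www.naver.com", timeout=10)
response.raise_for_status()  # raise an error on a 4xx/5xx response
html = response.text

# Parse the HTML with BeautifulSoup so it is easy to query
soup = BeautifulSoup(html, 'html.parser')

# Find the tag whose id is NM_set_home_btn
word = soup.select_one("#NM_set_home_btn")

# Print the tag's text; select_one returns None when nothing matches
if word is not None:
    print(word.text)

# Output when this post was written: "네이버를 시작 페이지로"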

Getting a tag's attribute values
- text = the tag's content
- attrs[key] = the value of the tag's attribute

# Assumes soup was parsed from a page that contains .news_tit links
links = soup.select(".news_tit")

for link in links:
    title = link.text            # the text inside the tag
    url = link.attrs['href']     # the value of the href attribute
    print(title, url)
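
One caveat with `attrs[key]`: it raises `KeyError` when the tag does not have that attribute. BeautifulSoup tags also provide `get(key)`, which returns `None` instead. The sketch below shows this on made-up markup; the HTML and the `collect_links` helper are hypothetical, not part of any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second link has no href attribute
SAMPLE_HTML = """
<a class="news_tit" href="https://example.com/a">First article</a>
<a class="news_tit">Article without a link</a>
"""


def collect_links(html):
    """Return (title, url) pairs, skipping tags that have no href."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for link in soup.select(".news_tit"):
        url = link.get("href")  # None when the attribute is missing
        if url is None:
            continue
        results.append((link.text, url))
    return results


print(collect_links(SAMPLE_HTML))
```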
