
[NLP] Day 6 - CSS SELECTOR


Selector

In [51]:
headers = {'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537'}


def getDownload(url, param = None, retries = 3):
    # GET the url with the custom user-agent header; retry up to `retries` times on 5xx server errors
    resp = None
    try:
        resp = requests.get(url, params = param, headers = headers)
        resp.raise_for_status()
    except requests.exceptions.HTTPError as e:
        if 500 <= resp.status_code < 600 and retries > 0:
            print('Retries : {0}'.format(retries))
            return getDownload(url, param, retries - 1)
        else:
            # non-5xx error or retries exhausted: print diagnostics and return the failed response
            print(resp.status_code)
            print(resp.reason)
            print(resp.request.headers)

    return resp
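
A quick usage sketch of the helper above: on a 5xx response it retries up to `retries` times, otherwise it prints the status code, reason, and request headers and returns the failed response. (httpbin.org/status/503 below is just an illustrative endpoint that echoes back the requested status code.)

resp = getDownload('https://httpbin.org/status/503')
# expected to print "Retries : 3", "Retries : 2", "Retries : 1",
# then the status code / reason / request headers once the retries are exhausted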

Selector => tag, ID (#id), class (.classname), attribute ([name='asdf'])

[name^='asdf'] matches elements whose name attribute starts with asdf

[name$='asdf'] matches elements whose name attribute ends with asdf

div p => like find_all: find a div, then every p among its descendants
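
A small sketch of these selector patterns using BeautifulSoup on a made-up HTML snippet (the tag names and attribute values are purely illustrative):

from bs4 import BeautifulSoup

sample = """
<div>
  <p name="asdf-start">starts with asdf</p>
  <p name="ends-with-asdf">ends with asdf</p>
  <p class="note" id="intro">class and id</p>
</div>
"""
doc = BeautifulSoup(sample, 'lxml')

print(doc.select("[name^='asdf']"))  # attribute value starts with 'asdf'
print(doc.select("[name$='asdf']"))  # attribute value ends with 'asdf'
print(doc.select("p.note"))          # tag + class
print(doc.select("#intro"))          # by id
print(doc.select("div p"))           # descendant: every <p> under the <div>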

In [52]:
import requests
In [62]:
import requests
from bs4 import BeautifulSoup
url = "https://search.daum.net/nate"
param = {
        "thr":"sbma",
         "w":"tot",
         "q":"%ED%8C%8C%EC%9D%B4%EC%8D%AC"}

html = getDownload(url,param)
dom = BeautifulSoup(html.content,"lxml")
In [179]:
for tag in dom.select('div#blogColl  a.wrap_tit + span'):
    print(tag.text)
    if tag.has_attr('href'):
        print(tag['href'])
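
The selector above combines an id (`div#blogColl`), a descendant, a class (`a.wrap_tit`), and the adjacent-sibling combinator `+`, which matches the `<span>` that immediately follows the matched `<a>`. A tiny sketch on made-up markup, reusing the imports above:

snippet = '<div id="blogColl"><a class="wrap_tit">post title</a><span>blog name</span></div>'
mini = BeautifulSoup(snippet, 'lxml')
print(mini.select('div#blogColl a.wrap_tit + span'))  # -> [<span>blog name</span>]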

Google

In [72]:
url ='https://www.google.com/search?q=%ED%8C%8C%EC%9D%B4%EC%8D%AC&oq=%ED%8C%8C%EC%9D%B4%EC%8D%AC&aqs=chrome..69i57j35i39j69i60j69i65j69i60l2.1587j0j4&sourceid=chrome&ie=UTF-8'
html = getDownload(url,{})
In [73]:
dom = BeautifulSoup(html.text,'lxml')
In [74]:
len(dom.select(' .r > a'))
Out[74]:
11
In [75]:
for tag in dom.select(".r > a > h3"):
    print(tag.text)
    print(tag.find_parent()['href'])
    
Welcome to Python.org
https://www.python.org/
파이썬 자습서 — Python 3.7.2 documentation
https://docs.python.org/ko/3/tutorial/index.html
Python - 나무위키
https://namu.wiki/w/Python
파이썬 - 위키백과, 우리 모두의 백과사전
https://ko.wikipedia.org/wiki/%ED%8C%8C%EC%9D%B4%EC%8D%AC
파이썬 입문 | 프로그래머스
https://programmers.co.kr/learn/courses/2
01-5 파이썬 둘러보기 - 점프 투 파이썬 - WikiDocs
https://wikidocs.net/9
1. 파이썬 시작하기 - 왕초보를 위한 Python 2.7 - WikiDocs
https://wikidocs.net/43
01-2 파이썬의 특징 - 점프 투 파이썬 - WikiDocs
https://wikidocs.net/6
02-2 문자열 자료형 - 점프 투 파이썬 - WikiDocs
https://wikidocs.net/13
Python & Ruby - 생활코딩
https://opentutorials.org/course/1750
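
`tag.find_parent()` climbs one level up the tree, from the `<h3>` title back to its parent `<a>`, which carries the result URL. A small check against the same `dom`:

h3 = dom.select_one('.r > a > h3')
if h3 is not None:
    print(h3.find_parent().name)      # 'a'
    print(h3.find_parent()['href'])   # the link of the first result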

NAVER

In [85]:
url = 'https://search.naver.com/search.naver?sm=top_hty&fbm=1&ie=utf8&query=%ED%8C%8C%EC%9D%B4%EC%8D%AC'
html = getDownload(url)
dom = BeautifulSoup(html.text,'lxml')
In [95]:
len(dom.select(' .blog dt > a '))
Out[95]:
5
In [96]:
for tag in dom.select(" .blog dt > a "):
    print(tag.text)
    print(tag['href'])
파이썬 기초 공부 방법 - 입문하는 초보들에게
https://blog.naver.com/urmyver?Redirect=Log&logNo=221460054164
파이썬으로 디스코드 봇 만들기
https://tjgus1668.blog.me/221462704457
[수학으로 배우는 파이썬] 다나카 카즈나리 저 / 유세라 역
https://parksehoon1971.blog.me/221484676644
파이썬 웹구축부터 머신러닝까지 다재다능한 코딩언어!
https://blog.naver.com/ridesafe?Redirect=Log&logNo=221465766151
삼성이 챙기는 파이썬
https://blog.naver.com/tech-plus?Redirect=Log&logNo=221403058110

Crawling

In [122]:
seed = 'http://example.webscraping.com/places/default/index'
html = getDownload(seed)
dom = BeautifulSoup(html.text,'lxml')
In [141]:
# inspect the dom
In [124]:
# the first page contains 16 links
len(dom.select('a'))
Out[124]:
16
In [125]:
from urllib.parse import urljoin
requests.compat.urljoin(seed,'/search')

# equivalent: urljoin(seed, "/search") from urllib.parse
Out[125]:
'http://example.webscraping.com/search'
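
For reference, requests.compat.urljoin is the same function as urllib.parse.urljoin, so either import works. A few more cases, with illustrative paths, showing how relative and absolute hrefs resolve:

base = 'http://example.webscraping.com/places/default/index'

print(urljoin(base, '/places/default/search'))   # absolute path: replaces the path of base
print(urljoin(base, 'view/Algeria-4'))           # relative path: resolved against /places/default/
print(urljoin(base, 'http://other.example/x'))   # full URL: returned unchanged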
In [129]:
# links that resolve back to the seed page itself should be skipped

unseen = []


for tag in dom.select('a'):
    if tag.has_attr('href'):
        href = tag['href']
        
        if href.startswith('http'): # HTTP(S)
            print("External : {0}".format(href))
        elif href.startswith("/"): #
            newSeed = requests.compat.urljoin(seed, href)
            if seed != newSeed:
                unseen.append(newSeed)
        else:
            print("Skipped: {0}".format(href))
        
        # analysis when no error occurs:
        # all of these links are internal
Skipped: #
In [130]:
unseen
Out[130]:
['http://example.webscraping.com/places/default/user/register?_next=/places/default/index',
 'http://example.webscraping.com/places/default/user/login?_next=/places/default/index',
 'http://example.webscraping.com/places/default/search',
 'http://example.webscraping.com/places/default/view/Afghanistan-1',
 'http://example.webscraping.com/places/default/view/Aland-Islands-2',
 'http://example.webscraping.com/places/default/view/Albania-3',
 'http://example.webscraping.com/places/default/view/Algeria-4',
 'http://example.webscraping.com/places/default/view/American-Samoa-5',
 'http://example.webscraping.com/places/default/view/Andorra-6',
 'http://example.webscraping.com/places/default/view/Angola-7',
 'http://example.webscraping.com/places/default/view/Anguilla-8',
 'http://example.webscraping.com/places/default/view/Antarctica-9',
 'http://example.webscraping.com/places/default/view/Antigua-and-Barbuda-10',
 'http://example.webscraping.com/places/default/index/1']
In [150]:
# turn the crawl step into a function

def getUrls(base):
    # send a request to the base URL, parse the response -> extract <a> tags -> normalize the hrefs
    # -> manage them through a DB / list
    
    unseen = []
     # use getDownload to send the request
        
    html = getDownload(base)
    dom = BeautifulSoup(html.text,'lxml')
    
    
    for tag in dom.select('a'):
        if tag.has_attr('href'):
            href = tag['href']
        
            if href.startswith('http'): # HTTP(S)
                 #print("External : {0}".format(href))
                unseen.append(href)
            elif href.startswith("/"): #
                newSeed = requests.compat.urljoin(base, href)
                if base != newSeed:
                    unseen.append(newSeed)
#             else:
#                 print("Skipped: {0}".format(href))
    
    print("{0} -> {1}".format(base,len(unseen)))
    
    return unseen
        # analysis when no error occurs:
        # all of these links are internal
    
In [146]:
queue = getUrls(seed)
seen = []

while queue:
    seed = queue.pop(0) # pop() normally removes the last element; pop(0) takes the first, so the queue is FIFO

    time.sleep(random.randint(1,3)) # delay so the requests look more like a human browsing

    unseen = getUrls(seed)
    seen.append(seed)

    # a link is queued only if it has not been seen and is not already in the queue
    print("Q : {0}, Unseen : {1}".format(len(queue),len(unseen)))

    queue.extend([link for link in unseen if link not in seen and link not in queue])

# FIFO: the crawler visits one URL at a time
# the queue grows to roughly 500 entries before shrinking back down
Skipped: #
http://example.webscraping.com/places/default/user/login?_next=/places/default/view/Cape-Verde-42 -> 4
Skipped: #
http://example.webscraping.com/places/default/user/register -> 3
Q : 3, Unseen : 3
Skipped: #
http://example.webscraping.com/places/default/user/login -> 3
Q : 5, Unseen : 3
429
TOO MANY REQUESTS
{'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
http://example.webscraping.com/places/default/index -> 0
Q : 6, Unseen : 0
429
TOO MANY REQUESTS
{'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
http://example.webscraping.com/places/default/search -> 0
Q : 5, Unseen : 0
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-146-b37d68ea4e91> in <module>()
      5     seed = queue.pop(0) # pop() normally removes the last element; pop(0) takes the first
      6 
----> 7     time.sleep(random.randint(1,3))
      8 
      9     unseen = getUrls(seed)

KeyboardInterrupt: 
In [137]:
# delay helper
import time
import random

time.sleep(random.randint(1,3)) # sleep for a random 1-3 seconds
In [155]:
# attach new links with extend, not append
# extend adds the individual elements rather than the list itself
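
A quick illustration of the difference:

a = [1, 2]
a.append([3, 4])   # the list goes in as a single element -> [1, 2, [3, 4]]
print(a)

b = [1, 2]
b.extend([3, 4])   # each element is added individually   -> [1, 2, 3, 4]
print(b)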

Applying it to Google

In [ ]:
queue = ['https://www.google.com/search?q=%ED%8C%8C%EC%9D%B4%EC%8D%AC&oq=%ED%8C%8C%EC%9D%B4%EC%8D%AC&aqs=chrome..69i57j69i60j35i39j0l3.1994j0j8&sourceid=chrome&ie=UTF-8']

while queue:
    base = queue.pop(0)
    links = getUrls(base)
    queue.extend(links)
In [170]:
# collect the links from the first results page
# run this cell first

url = 'https://www.google.com/search'
param = {'q':'파이썬'}

queue =[]

html = getDownload(url,param)
dom = BeautifulSoup(html.text,'lxml')

# . selects by class
for tag in dom.select(".r > a > h3"):
    print(tag.text)
    print(tag.find_parent()['href'])
    queue.append({"url" : tag.find_parent()['href'], "depth" : 0})
Welcome to Python.org
https://www.python.org/
파이썬 자습서 — Python 3.7.2 documentation
https://docs.python.org/ko/3/tutorial/index.html
Python - 나무위키
https://namu.wiki/w/Python
파이썬 - 위키백과, 우리 모두의 백과사전
https://ko.wikipedia.org/wiki/%ED%8C%8C%EC%9D%B4%EC%8D%AC
파이썬 입문 | 프로그래머스
https://programmers.co.kr/learn/courses/2
01-5 파이썬 둘러보기 - 점프 투 파이썬 - WikiDocs
https://wikidocs.net/9
1. 파이썬 시작하기 - 왕초보를 위한 Python 2.7 - WikiDocs
https://wikidocs.net/43
01-2 파이썬의 특징 - 점프 투 파이썬 - WikiDocs
https://wikidocs.net/6
02-2 문자열 자료형 - 점프 투 파이썬 - WikiDocs
https://wikidocs.net/13
Python & Ruby - 생활코딩
https://opentutorials.org/course/1750
In [171]:
# final version

def getUrls(link, depth=3):
    if depth > 3 :
        return None
# for link in queue:
    links = []
    html = getDownload(link)
    dom = BeautifulSoup(html.text,'lxml')
    
    for a in dom.select('a'):
        if a.has_attr('href'):    # check the href attribute exists
            if a['href'].startswith('http'):
                links.append({"url":a['href'],"depth":depth+1})
            elif  a['href'].startswith('/') and len(a['href']) > 1:
                links.append({"url":requests.compat.urljoin(link,a['href']), 'depth':depth+1})
#             else:
#                 print("Skipped : {0}".format(a['href']))
            
    print("{0} {1} : {2}".format(">"*depth, link, len(links)))
    return links
# hrefs that are just '#' or '/', and anything starting with javascript:, should all be filtered out
# keep hrefs of the form /(.+) and anything starting with http(s)
# the printed number is the count of links found on each page
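
One way to express that filtering rule, sketched with a regular expression (the pattern and the helper name isCrawlable are my own, not from the lesson):

import re

# keep absolute http(s) URLs and site-relative paths longer than a bare '/';
# '#', '/', 'javascript:...', 'mailto:...' and similar hrefs all fail the match
KEEP = re.compile(r'^(?:https?://.+|/.+)$')

def isCrawlable(href):
    return bool(KEEP.match(href))

print(isCrawlable('http://example.com/a'))              # True
print(isCrawlable('/places/default/view/Algeria-4'))    # True
print(isCrawlable('#'))                                 # False
print(isCrawlable('javascript:void(0)'))                # False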
In [174]:
depth = 0

while queue:
    link = queue.pop(0)
    links = getUrls(link['url'], link['depth'])
    
    if links != None:
        queue.extend(links)
        
# each iteration prints the depth, the URL being visited, and the number of links found on it
In [178]:
# note on pop
a = [1,2,3,4]
b = a.pop(0)
print(a, b)
[2, 3, 4] 1
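
Note that list.pop(0) shifts every remaining element, so for a long crawl queue collections.deque is the usual drop-in replacement:

from collections import deque

q = deque([1, 2, 3, 4])
b = q.popleft()        # O(1), unlike list.pop(0) which is O(n)
print(list(q), b)      # [2, 3, 4] 1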

