Week 4 — Data Crawling
Importing Data from Websites using Python
--
Alright. Now it’s time to apply our knowledge of Python to import specific sets of data for manipulation, a process called ‘data crawling.’
Using a Python Package
Requesting and importing data from an OpenAPI requires the ‘requests’ library, which you can install with pip install requests. Using the package, we’ll import data from an OpenAPI that presents real-time concentrations of fine dust in different parts of Seoul.
import requests

r = requests.get('http://openapi.seoul.go.kr:8088/6d4d776b466c656533356a4b4b5872/json/RealtimeCityAir/1/99')
rjson = r.json()

print(rjson['RealtimeCityAir']['row'][0]['NO2'])
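If you are curious about the shape of the JSON before drilling into it, here is a minimal sketch for inspecting the response; it relies only on the 'RealtimeCityAir' and 'row' keys already used above (any other keys the API may return are not assumed here).

# inspect the structure of the response step by step
print(rjson.keys())                            # top-level keys; 'RealtimeCityAir' is the one we use
print(type(rjson['RealtimeCityAir']['row']))   # 'row' is a list with one item per district
print(rjson['RealtimeCityAir']['row'][0])      # the first district's full record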
Now, we’ll try importing the ‘IDEX_MVL’ value of every ‘gu’ (district):
import requests  # the requests library must be installed

r = requests.get('http://openapi.seoul.go.kr:8088/6d4d776b466c656533356a4b4b5872/json/RealtimeCityAir/1/99')
rjson = r.json()

gus = rjson['RealtimeCityAir']['row']
for gu in gus:
    print(gu['MSRSTE_NM'], gu['IDEX_MVL'])
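As a small variation that is not part of the original exercise, the same list can be sorted so the districts with the highest IDEX_MVL print first; this sketch reuses the gus list and the two fields already used above.

# sort districts by IDEX_MVL, highest first, and print them
for gu in sorted(gus, key=lambda g: g['IDEX_MVL'], reverse=True):
    print(gu['MSRSTE_NM'], gu['IDEX_MVL'])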
What if we wanted to print only those with IDEX_MVL < 60?
import requests  # the requests library must be installed

r = requests.get('http://openapi.seoul.go.kr:8088/6d4d776b466c656533356a4b4b5872/json/RealtimeCityAir/1/99')
rjson = r.json()

gus = rjson['RealtimeCityAir']['row']
for gu in gus:
    if gu['IDEX_MVL'] < 60:
        print(gu['MSRSTE_NM'], gu['IDEX_MVL'])
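The same filter can also be written more compactly as a list comprehension; here is a brief sketch reusing the gus list from above.

# collect districts whose IDEX_MVL is below 60, then print them
low_dust = [gu for gu in gus if gu['IDEX_MVL'] < 60]
for gu in low_dust:
    print(gu['MSRSTE_NM'], gu['IDEX_MVL'])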
Now, let us actually crawl some data.
Basics of Web Scraping
Let us import data from Naver Movies, the website we used last time. We need to install a library called ‘BeautifulSoup’ (pip install beautifulsoup4) in order to parse the HTML for web scraping. The basic template is as follows:
import requests
from bs4 import BeautifulSoup
# read the target URL and import HTML
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
data = requests.get('https://movie.naver.com/movie/sdb/rank/rmovie.nhn?sel=pnt&date=20200303', headers=headers)
# make the HTML search-friendly using BeautifulSoup
# inside the variable 'soup' is a parse-friendly version of html
# extract necessary components out of the HTML/soup through coding
soup = BeautifulSoup(data.text, 'html.parser')
#############################
# code!!
#############################
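Before writing the extraction code, it can be worth confirming that the request actually succeeded. This is an optional sanity check, not part of the original template; it uses only the data and soup variables defined above.

# optional sanity check: a status code of 200 means the page was fetched successfully
print(data.status_code)
print(soup.title.text if soup.title else 'no <title> found')  # quick look at the parsed page title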
It is important to get used to the .select and .select_one commands: .select returns a list of every element matching a selector, while .select_one returns only the first match. Let us learn them by importing movie titles.
import requests
from bs4 import BeautifulSoup

# read the target URL and import HTML
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
data = requests.get('https://movie.naver.com/movie/sdb/rank/rmovie.nhn?sel=pnt&date=20200303', headers=headers)

# make the HTML search-friendly using BeautifulSoup
soup = BeautifulSoup(data.text, 'html.parser')

# using select, import every tr (one row per movie)
movies = soup.select('#old_content > table > tbody > tr')

# loop over the movies (the tr's)
for movie in movies:
    a_tag = movie.select_one('td.title > div > a')
    if a_tag is not None:
        print(a_tag.text)
Here, use the Chrome inspection tool to copy element selectors (e.g., #old_content > table > tbody > tr): right-click the element, choose Inspect, then Copy > Copy selector.
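A selector copied this way can be passed straight to .select or .select_one. As a minimal sketch, here it is with the selector already used above; the exact selector you copy will depend on the element you inspect.

# select_one returns only the first element matching the copied selector
first_row = soup.select_one('#old_content > table > tbody > tr')
print(first_row)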
Practicing Web Scraping
Let us use the skills we have learned to import the rank, title, and star rating of each movie listed on Naver Movies. The completed code looks as follows:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
data = requests.get('https://movie.naver.com/movie/sdb/rank/rmovie.nhn?sel=pnt&date=20200303', headers=headers)
soup = BeautifulSoup(data.text, 'html.parser')
movies = soup.select('#old_content > table > tbody > tr')
for movie in movies:
    a_tag = movie.select_one('td.title > div > a')
    if a_tag is not None:
        rank = movie.select_one('td:nth-child(1) > img')['alt']  # the rank is stored in the image's alt text
        title = a_tag.text
        star = movie.select_one('td.point').text
        print(rank, title, star)
The result: the rank, title, and star rating of each movie are printed line by line.
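As a small preview of storing scraped data (a hedged sketch, not part of the original exercise), the same loop can collect each movie into a dictionary so the whole list is ready to be saved later; it reuses the movies list and selectors from above.

# collect every movie as a dictionary instead of printing it
results = []
for movie in movies:
    a_tag = movie.select_one('td.title > div > a')
    if a_tag is not None:
        results.append({
            'rank': movie.select_one('td:nth-child(1) > img')['alt'],
            'title': a_tag.text,
            'star': movie.select_one('td.point').text,
        })
print(len(results))  # number of movies collected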
Alright. I think that was very interesting. And you know what’s more interesting? You can store these sets of data in your own database and manipulate them! Next class, we’ll learn about databases and their uses in data crawling with MongoDB.
See you then!
Fin.