Preparation

First, let’s load the R packages we will use on this page. (Don’t forget to install them before loading.)

library("plyr")
library("readr")
library("XML")
library("stringr")

HTML

Almost all web pages (including this page) are written in HTML, which stands for HyperText Markup Language.

For example, let’s take a look at this page’s source code. Most web browsers let you view the source code of the page you are viewing. In Firefox, hit “Cmd” (or “Ctrl”) and “U” at the same time. In Safari, “Cmd” + “Option” + “V”. In Google Chrome, “Cmd” + “Option” + “U” on Mac or “Ctrl” + “U” on Windows. As you can see in the source, a typical static web page is just a text file written in HTML (with CSS). Your web browser interprets the grammar of HTML and shows you the nicely rendered (visualized) version of the page on screen.

When you extract data from web pages, you need elementary knowledge of HTML to figure out the structure of the pages. You should learn HTML (in addition to R) because

  1. it is likely that you will have to collect data online for your research, and
  2. you will be expected to have your own website when you become a professional researcher.

Fortunately, HTML is one of the easiest computer languages, and there are tons of free online learning resources. For instance, you can start studying it here.
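
To get a feel for this structure, here is a minimal, made-up HTML document stored as an R character string (eg_html and eg_parsed are names used only for this illustration); we parse it with the XML package functions that reappear later on this page.

eg_html <- "<html>
  <head><title>My First Page</title></head>
  <body>
    <h1>A Heading</h1>
    <p>A paragraph with a <a href='https://www.r-project.org/'>link</a>.</p>
  </body>
</html>"
eg_parsed <- htmlParse(eg_html, asText = TRUE)  ## parse the HTML (XML package)
title_node <- eg_parsed["//title"]              ## find the <title> tag
xmlValue(title_node[[1]])                       ## extract the text inside the tag
## [1] "My First Page"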

Regular Expressions

Most data on web pages are text, and they are usually unstructured. We sometimes want the unstructured text as a whole, but we normally need only the subset of the text that matches a specific pattern. Thus, we have to find systematic patterns in the text. Regular expressions let us do that.

Here, I will explain some basic regular expressions.

For our practice, create two sample character-string objects.

eg_Acton <- "Power tends to corrupt. Absolute power corrupts absolutely."
eg_Wilde <- "A little sincerity is a dangerous thing, and a great deal of it is absolutely fatal."

Character Matching

We use the stringr package to manipulate text (character strings). First, the stringr::str_extract() function extracts a specified string from an object.

str_extract(eg_Acton, pattern = "corrupt")
## [1] "corrupt"

Because the object eg_Acton contains strings that match the specified pattern “corrupt”, the function returns the pattern itself. Though the object has two occurrences of “corrupt” in the string, this function only returns the first occurrence.

If the string and the pattern do not match, NA is returned.

str_extract(eg_Acton, pattern = "Acton")
## [1] NA

While str_extract() returns only the first appearance of the pattern, stringr::str_extract_all() returns all matches.

str_extract_all(eg_Acton, pattern = "corrupt")
## [[1]]
## [1] "corrupt" "corrupt"

As this example shows, str_extract_all() returns a list. The length of the list equals the number of strings we pass to the function.

str_extract_all(c(eg_Acton, eg_Wilde), pattern = "absolute")
## [[1]]
## [1] "absolute"
## 
## [[2]]
## [1] "absolute"

These functions are both case-sensitive. To ignore case, we wrap the pattern in ignore.case().

str_extract_all(c(eg_Acton, eg_Wilde), pattern = ignore.case("absolute"))
## [[1]]
## [1] "Absolute" "absolute"
## 
## [[2]]
## [1] "absolute"
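
Note that ignore.case() has been deprecated in more recent versions of stringr. If it is not available in your installation, wrapping the pattern in regex() with ignore_case = TRUE should give the same result:

str_extract_all(c(eg_Acton, eg_Wilde),
                pattern = regex("absolute", ignore_case = TRUE))
## (same output as above)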

To match a pattern only at the beginning of the string, we use “^” (the caret symbol) as follows.

str_extract(eg_Acton, pattern = "^corrupt")
## [1] NA
str_extract(eg_Acton, pattern = "^Power")
## [1] "Power"

Similarly, to match a pattern only at the end of the string, we use “$” as follows.

str_extract_all(c(eg_Acton, eg_Wilde), pattern = ".$")
## [[1]]
## [1] "."
## 
## [[2]]
## [1] "."

By using the “|” symbol, we can find strings that match either of the specified patterns.

str_extract_all(eg_Acton, pattern = "corrupt|absolute")
## [[1]]
## [1] "corrupt"  "corrupt"  "absolute"

Pay attention to spaces when you use regular expressions. For instance, if you put spaces between the words and the symbol, the return value might not be what you want.

str_extract_all(eg_Acton, pattern = "corrupt | absolute")
## [[1]]
## [1] " absolute"

The command above looked for the pattern “corrupt ” (with one white space at the end) or “ absolute” (with one white space at the beginning) and found only the latter.
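
The reason is that “|” has the lowest precedence: everything to its left forms one alternative and everything to its right forms the other. To limit the scope of the alternation, group the alternatives with parentheses. A small sketch (not one of the original examples):

str_extract_all(eg_Acton, pattern = "(corrupt|absolute)ly")
## [[1]]
## [1] "absolutely"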

General Pattern Matching

So far, we have tried to find the specific characters in the strings. However, with regular expressions, we can find more general patterns.

For example, “.” (dot) means any character.

str_extract_all(eg_Wilde, pattern = "i.")
## [[1]]
## [1] "it" "in" "it" "is" "in" "it" "is"

To extract the dot itself, we use “\.”

str_extract_all(eg_Wilde, pattern = "\\.")
## [[1]]
## [1] "."

Here, the backslash (“\”) is the escape character: it lets us strip a character of its special meaning. In regular expressions, “.” means “any character”, but “\.” escapes it so that it matches a literal “.”. Because the backslash itself has to be escaped inside an R string, we type it as “\\.” in R code.
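
As a quick check, stringr::str_count(), which counts the number of matches, shows how differently the escaped and unescaped dot behave:

str_count(eg_Acton, pattern = "\\.")  ## "\\." matches the literal dots only
## [1] 2
str_count(eg_Acton, pattern = ".") == nchar(eg_Acton)  ## "." matches every single character
## [1] TRUE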

To find “it” or “in” (but not “is”), run the following.

str_extract_all(eg_Wilde, pattern = "i[tn]")
## [[1]]
## [1] "it" "in" "it" "in" "it"

To find the words “it” and “is” only (that is, excluding, for instance, the “it” in “little”), we can do

str_extract_all(eg_Wilde, pattern = "\\bi[ts]")
## [[1]]
## [1] "is" "it" "is"

As these examples show, we can extract the strings that match a specified pattern. For this purpose, we use the following character classes (a short example follows the list).

  • [:digit:] – Digits: 0 through 9
  • [:lower:] – Lowercase characters: a to z
  • [:upper:] – Uppercase characters: A to Z
  • [:alpha:] – Alphabetic characters: a to z and A to Z
  • [:alnum:] – Alphanumeric characters: a to z, A to Z, and 0 to 9
  • [:punct:] – Punctuation characters
  • [:blank:] – Blank characters: space and tab
  • [:space:] – Space characters: space, tab, newline, and other space characters
  • [:print:] – Printable characters: [:alnum:], [:punct:], and [:space:]
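
For instance, here is a short sketch with a made-up string (eg_date is introduced only for this illustration):

eg_date <- "Submitted at 10:30 on 2016-04-12"            ## a made-up example string
unlist(str_extract_all(eg_date, pattern = "[:digit:]"))   ## every digit, one at a time
## [1] "1" "0" "3" "0" "2" "0" "1" "6" "0" "4" "1" "2"
unlist(str_extract_all(eg_date, pattern = "[:punct:]"))   ## the punctuation characters
## [1] ":" "-" "-"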

Furthermore, some symbols have special meanings. (When you use them in R, add one more backslash, as in “\\w”.) A short example follows the list.

  • \w – Word characters
  • \W – Non-word characters
  • \s – Space characters
  • \S – Non-space characters
  • \d – Digits
  • \D – Non-digits
  • \b – Word boundary
  • \B – Not a word boundary
  • \< – Beginning of a word
  • \> – End of a word
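
For example, compare a plain pattern with one wrapped in word boundaries (“\\b”); the results agree with the earlier examples on eg_Wilde:

unlist(str_extract_all(eg_Wilde, pattern = "it"))        ## "it" anywhere, even inside words
## [1] "it" "it" "it"
unlist(str_extract_all(eg_Wilde, pattern = "\\bit\\b"))  ## "it" as a whole word only
## [1] "it"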

In addition, we can specify how many times each expression should be matched. For instance, to find the 5-letter words in the text, we run the following.

str_extract_all(eg_Wilde, pattern = "\\b[:alpha:]{5}\\b")
## [[1]]
## [1] "thing" "great" "fatal"

We use the following quantifiers, placed right after the expression to be quantified.

  • ?: The preceding item is optional and will be matched at most once
  • *: The preceding item will be matched zero or more times
  • +: The preceding item will be matched one or more times
  • {n}: The preceding item is matched exactly n times
  • {n,}: The preceding item is matched n or more times
  • {n,m}: The preceding item is matched at least n times, but not more than m times

Without a quantifier, each item is matched exactly once.

Let’s examine some examples.

y_names <- c("Yamada", "Yamaji", "Yamamoto", "Yamashita",
             "Yamai", "Yanai", "Yanagi", "Yoshida")
unlist(str_extract_all(y_names, pattern = "Ya.a.+"))
## [1] "Yamada"    "Yamaji"    "Yamamoto"  "Yamashita" "Yamai"     "Yanai"    
## [7] "Yanagi"
unlist(str_extract_all(y_names, pattern = "Ya.a.?i"))
## [1] "Yamaji" "Yamai"  "Yanai"  "Yanagi"
unlist(str_extract_all(y_names, pattern = "Y.+da"))
## [1] "Yamada"  "Yoshida"
unlist(str_extract_all(y_names, pattern = "Yama.+t.?"))
## [1] "Yamamoto"  "Yamashita"
unlist(str_extract_all(y_names, pattern = "Ya[:alpha:]{4}$"))
## [1] "Yamada" "Yamaji" "Yanagi"
unlist(str_extract_all(y_names, pattern = "^[:alpha:]{6,}$"))
## [1] "Yamada"    "Yamaji"    "Yamamoto"  "Yamashita" "Yanagi"    "Yoshida"
unlist(str_extract_all(y_names, pattern = "^[:alpha:]{6,8}$"))
## [1] "Yamada"   "Yamaji"   "Yamamoto" "Yanagi"   "Yoshida"
unlist(str_extract_all(y_names, pattern = "\\w+m\\w+"))
## [1] "Yamada"    "Yamaji"    "Yamamoto"  "Yamashita" "Yamai"

For more information, read Regular Expressions as used in R. To learn more about regular expressions, see Jeffrey E. F. Friedl. 2006. Mastering Regular Expressions, 3rd ed. (O’Reilly) [Japanese translation: 『詳説 正規表現 第3版』, trans. 長尾高広 (株式会社ロングテール), オライリー・ジャパン, 2008].



Collecting Data Online

Downloading Datasets

Nowadays, it is common to use datasets available online for research. For example, let’s visit the GDELT Project, which provides huge datasets of events around the world. To find the datasets, click “Data” at the top, then “Raw Data Files”, then “Download Events”, which leads you here. As you can see, a lot of event files are provided on the page. To obtain the datasets, you can click the hyperlinks to the files and download them. However, what will you do if you want to get more than a few files? Will you click every hyperlink?

With R (or another programming language such as Python), we do not have to click the hyperlinks manually. Instead, we let R download multiple files for us.

Let’s try downloading the event data for the year 2014.

(Note: Unless you will use the GDELT data for your research, don’t download all the files, because they are large. If you would like to try downloading yourself to learn the process, I recommend replacing 2014 in the code with 2012, because there are a large number of files for 2014.)

First, we specify the URL of the webpage that leads us to the data sets.

url <- "http://data.gdeltproject.org/events/index.html"

Then, we use the XML::getHTMLLinks() function to find all the hyperlinks on the page. On the GDELT raw-data page, the hyperlinks are highly visible and easy to find. On other webpages, however, it might be hard to find the hyperlinks that lead to datasets. Even in such a case, we can rely on getHTMLLinks() to detect the links.

links <- getHTMLLinks(url)
links[1:5]    ## check the content (the first 5 links only)
## [1] "md5sums"                            
## [2] "filesizes"                          
## [3] "GDELT.MASTERREDUCEDV2.1979-2013.zip"
## [4] "20151103.export.CSV.zip"            
## [5] "20151102.export.CSV.zip"

The function getHTMLLinks(url) returns all the hyperlinks on the page, and the page has some hyperlinks that do not point to our target datasets. Accordingly, the object links contains links we do not want, so we have to pick out the relevant ones. We want the datasets for 2014. What do the files for the 2014 events have in common? You have to answer this question by looking at the target webpage.

It turns out that all the target file names begin with “2014”. Thus, we should find the hyperlinks beginning with “2014”. stringr::str_detect() checks whether an object contains a specific string and returns TRUE or FALSE. The string we want to find is something beginning with “2014”, so the regular expression “^2014” does the job. (“^” anchors the match at the beginning of the string. Accordingly, “^2014” matches “201410”, but not “102014”, for example.)

filenames_2014 <- links[str_detect(links, "^2014")]  ## pick out the 2014 files
filenames_2014[1:10]  ## check the content (the first 10 elements only)
##  [1] "20141231.export.CSV.zip" "20141230.export.CSV.zip"
##  [3] "20141229.export.CSV.zip" "20141228.export.CSV.zip"
##  [5] "20141227.export.CSV.zip" "20141226.export.CSV.zip"
##  [7] "20141225.export.CSV.zip" "20141224.export.CSV.zip"
##  [9] "20141223.export.CSV.zip" "20141222.export.CSV.zip"
length(filenames_2014)
## [1] 361

We found 361 links for the 2014 event data. Do you want to click the links that many times? I guess not.

Now let’s download these files. We can use the download.file() function to get a file from the web. To streamline the downloading procedure, let’s write a function that downloads a file into a specific folder (directory).

download_file <- function(file, base_url, folder) {
    ## Args: file = name of the target file
    ##       base_url = path to the dir where the file exists
    ##       folder = folder to save the file
    
    ## create a folder in the current wd if it doesn't exist
    dir.create(folder, showWarnings = FALSE)
    file_url <- str_c(base_url, file)    ## URL of the file
    outfile <- str_c(folder, "/", file)  ## path to the file on computer
    if (!file.exists(outfile)) { ## download only if the file doesn't exist
        download.file(file_url, outfile)
        Sys.sleep(1)    ## pause one second after downloading a file
    }
}

This function takes the name of a single file as its first argument. By using Sys.sleep(), we intentionally slow down the downloading process. Without it, we could download the files faster, but we should not do so. If R accesses the website many times in a short period, the server might not be able to handle the requests, or the webmaster might think that you are attacking the website. Therefore, when we automate a process that is normally performed by a human being (by “clicking”, in this example), we had better slow it down so that the server can cope with it.
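
As an optional refinement (not used in the function above), you could randomize the pause so that the requests are spaced a little less mechanically:

Sys.sleep(runif(1, min = 1, max = 2))  ## pause for a random 1-2 seconds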

Since we have a lot of files to download, we would like to apply this function to all of them at once. Thus, we make a list of file names so that we can apply the download_file() function to multiple files.

file_list <- as.list(filenames_2014)
file_list[1:3]  ## check the content
## [[1]]
## [1] "20141231.export.CSV.zip"
## 
## [[2]]
## [1] "20141230.export.CSV.zip"
## 
## [[3]]
## [1] "20141229.export.CSV.zip"

Now we are ready. Let’s apply download_file() to the list file_list by lapply().

lapply(file_list, FUN = download_file,
       base_url = "http://data.gdeltproject.org/events/", 
       folder = "GDELT_2014_zip")

Let’s examine the files in the created folder.

saved <- list.files("./GDELT_2014_zip")
saved[1:5]
## [1] "20140101.export.CSV.zip" "20140102.export.CSV.zip"
## [3] "20140103.export.CSV.zip" "20140104.export.CSV.zip"
## [5] "20140105.export.CSV.zip"
length(saved)
## [1] 361

We can see that there are 361 zip files in the folder, which means that we successfully downloaded all the target files.

If you want to unzip the zip files, you can do that in R too.

dir.create("GDELT_2014_csv")  ## create folder to save csv files
zip_list <- as.list(str_c("GDELT_2014_zip/", saved))  ## list of the zip files 
## unzip the files and save them in exdir
lapply(zip_list, FUN = unzip, exdir = "GDELT_2014_csv")

Now all the unzipped CSV files are in the folder named GDELT_2014_csv.

saved <- list.files("GDELT_2014_csv")
saved[1:5]
## [1] "20140101.export.CSV" "20140102.export.CSV" "20140103.export.CSV"
## [4] "20140104.export.CSV" "20140105.export.CSV"
length(saved)
## [1] 361

You can then load these datasets into R with readr functions as usual.
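
For example, here is a minimal sketch for loading one of the files. (The GDELT export files have no header row and, despite the “.CSV” extension, are typically tab-delimited; check the first few lines and the GDELT codebook before choosing the reader and the column names.)

## adjust delim and col_names to the actual file format
gdelt_20140101 <- read_delim("GDELT_2014_csv/20140101.export.CSV",
                             delim = "\t", col_names = FALSE)
dim(gdelt_20140101)  ## the number of events (rows) and fields (columns)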

Web Scraping

You can find a variety of data on the internet, but they are not necessarily available as readily downloadable files. Rather, most information on the internet is displayed as the text of a page. Thus, we have to scrape web pages to gather the information we need for our research.

If the information you need is available on a single page, you might simply copy the content and paste it into a local text or CSV file (or you might want to use OutWit Hub). However, it is likely that the information is scattered across multiple pages of a website. In addition, it usually takes some time to decide which part of a page you should copy. In such a situation, web scraping is the way to go.

As Mitchell (2015) puts it, “web scraping is the practice of gathering data through any means other than a program interacting with API.” When an API (Application Programming Interface) is available, you should use it. For instance, e-Stat, the portal website for statistics published by the government of Japan, provides an API. Thus, we do not have to, and should not, scrape e-Stat. However, not all websites provide APIs, so we sometimes need web scraping.

Nowadays, one of the easiest and most popular ways to scrape the web is to use Python with Beautiful Soup. See the example script on the course website if you are interested, and see Ryan Mitchell. 2015. Web Scraping with Python (O’Reilly) for more information.

Here, we will learn some basic web scraping techniques with a case study.


Case Study: Collecting MPs’ Email Addresses

Let’s collect information about the members of the House of Commons in the UK. Suppose we want to know each MP’s name, party, constituency, and email address. The UK Parliament provides these data on its official website. Visit www.parliament.uk. On this page, you can click each MP’s name and see the required information. However, it takes some time to gather the information for all MPs. Since each MP has their own page, we need to write a program that opens many different pages and collects the data. That is, we would like to scrape the web. Let’s try web scraping with R.

First, specify the base URL.

url <- "http://www.parliament.uk/mps-lords-and-offices/mps/"

The hyperlinks to the individual MPs’ pages exist on this page. Let’s find them with getHTMLLinks().

links <- getHTMLLinks(url)
links[1:10]
##  [1] "/"                                                                        
##  [2] "/site-information/accessibility/"                                         
##  [3] "/site-information/privacy/"                                               
##  [4] "https://subscriptions.parliament.uk/accounts/UKPARLIAMENT/subscriber/new?"
##  [5] "/site-information/rss-feeds/"                                             
##  [6] "/site-information/contact-us/"                                            
##  [7] "/"                                                                        
##  [8] "/business/"                                                               
##  [9] "http://www.parliament.uk/business/commons/"                               
## [10] "http://www.parliament.uk/business/lords/"

As shown above, links also includes hyperlinks to pages that we do not want to visit.

Now, we have to figure out what the URLs of the MPs’ pages have in common. By clicking the names of several MPs, it turns out that each MP’s page contains “biographies/commons” in its URL. Thus, we want to find the hyperlinks containing that string. We can use the stringr::str_detect() function, which returns TRUE if the specified pattern is found in the text.

commons_MPs <- links[str_detect(links, "biographies/commons/")]
commons_MPs[1:5]
## [1] "http://www.parliament.uk/biographies/commons/ms-diane-abbott/172"         
## [2] "http://www.parliament.uk/biographies/commons/debbie-abrahams/4212"        
## [3] "http://www.parliament.uk/biographies/commons/nigel-adams/4057"            
## [4] "http://www.parliament.uk/biographies/commons/adam-afriyie/1586"           
## [5] "http://www.parliament.uk/biographies/commons/ms-tasmina-ahmed-sheikh/4427"

Let’s check how many MPs there are.

## the number of MPs in the House of Commons
(n <- length(commons_MPs))
## [1] 649

Assuming our code above and the Parliament’s website are both correct, there are 649 MPs. So it is really fortunate that we can scrape the web with R instead of visiting 649 different pages to gather information about the MPs.

Now we are ready for scraping. Web scraping goes through the following steps.

  1. Open a page
  2. Parse the HTML code
  3. Find the specific content by HTML tags
  4. Extract the information
  5. Go to the next page and repeat

To open a page, we use the readLines() function, which reads each line of the page. Then, with the XML::htmlParse() function, we parse the source code. By doing so, we can access specific parts of the HTML by relying on its structure (i.e., by tags).

To specify which parts of the page we want, we need to open the source code (HTML) of the webpage and find the relevant information. Thus, we should know a little about HTML to understand its structure and efficiently specify the parts we want.

To extract the content from a parsed object, we use the XML::xmlValue() function. Then, we find more specific character strings with str_extract() and regular expressions. Finally, we trim the strings with the stringr::str_sub() function.
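
Before wrapping everything into a function, here is a minimal sketch of steps 1 through 4 for a single MP’s page; the XPath query mirrors the one used in the function below.

mp_url <- commons_MPs[1]                               ## 1. pick one page to open
mp_source <- readLines(mp_url, encoding = "UTF-8")     ##    read the page
mp_parsed <- htmlParse(mp_source, encoding = "UTF-8")  ## 2. parse the HTML
title_node <- mp_parsed["//title"]                     ## 3. find the <title> tag by XPath
xmlValue(title_node[[1]])                              ## 4. extract the information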

The following function scrapes a page.

scrape_MP <- function(url) {
    ## Argument: url = URL of target page
    ## Return: res = data frame containing 4 variables
    
    mp_source <- readLines(url, encoding = "UTF-8")  ## read the page
    mp_parsed <- htmlParse(mp_source, encoding = "UTF-8") ## parse the HTML
    
    ## MP's name
    mp_name <- mp_parsed["//title"]  ## extract the "title" tag
    mp_name <- mp_name[[1]]          ## extract the list element
    mp_name <- xmlValue(mp_name)     ## extract the content (value)
    mp_name <- str_extract(mp_name, "[:upper:].+MP")  ## extract the string of name
    mp_name <- str_sub(mp_name,      ## trim the fat
                       start = 1, end = str_length(mp_name) - 3)    
    ## MP's party
    ## extract the "div" tag whose id is "commons-party"
    mp_party <- mp_parsed["//div[@id='commons-party']"]
    mp_party <- mp_party[[1]]        ## extract the list element
    mp_party <- xmlValue(mp_party)   ## extract the content
    mp_party <- str_extract(mp_party, "[:upper:].+")  ## extract the string
    
    ## MP's constituency
    ## extract the "div" tag whose id is "commons-constituency"
    mp_const <- mp_parsed["//div[@id='commons-constituency']"]
    mp_const <- mp_const[[1]]
    mp_const <- xmlValue(mp_const)
    mp_const <- str_extract(mp_const, "[:upper:].+")    
    
    ## MP's email address
    ## extract the "p" tag whose data-generic-id is "email-address"
    mp_email <- mp_parsed["//p[@data-generic-id='email-address']"]
    if (length(mp_email) == 0) {  ## if email is not available
        mp_email <- NA
    } else {
        mp_email <- mp_email[[1]]       ## extract the list element
        mp_email <- xmlValue(mp_email)  ## extract the content
        ## find the string containing email address
        mp_email <- str_extract(mp_email, "Email:[:space:]+.+")
        ## extract the email address 
        mp_email <- str_extract(mp_email, "[:space:][:alpha:].+")
        mp_email <- str_sub(mp_email, start = 2) ## trim
    }
    ## save in the data frame
    res <- data.frame(name = mp_name, 
                      party = mp_party, 
                      constituency = mp_const,
                      email = mp_email)
    Sys.sleep(1)   ## pause a second 
    return(res)
}

We pause a second on each page to behave nicely toward the server.

We apply this function to each MP’s page using plyr::ldply(), which applies a function to a list (hence “l”dply) and returns a data frame (hence l“d”ply).

lst_MPs <- as.list(commons_MPs)
MP_contacts <- ldply(lst_MPs, .fun = scrape_MP)

To save the data frame in a CSV file, we use the readr::write_csv() function.

dir.create("data", showWarnings = FALSE)
write_csv(MP_contacts, path = "data/MP-contacts.csv")
## alternatively, you can use write.csv()
# write.csv(MP_contacts, file = "data/MP-contacts.csv", row.names = FALSE)

Here is the product.


