Learning to webscrape and playing with regex

When I posted about the trend in movies toward franchises, I briefly ruminated on the idea of doing a similar, more in-depth analysis for all of the wide-release movies in the US from 1980-2011. The biggest stumbling block was the size of the data set: 31 years, each with anywhere from 80 to 160 movies gaining wide-release status. Just putting together the spreadsheets by hand would be time-consuming. But the other night as I was falling asleep, I realized that it should be possible to write a web-scraping script that would pull the lists off of BoxOfficeMojo’s site and store them as CSVs.

It took me a few hours to hack this together, as I don’t program regularly, though it’s something I enjoy challenging myself with on occasion. It ended up being a nice learning experience. I’ll go ahead and drop my code here. It’s far from clean, and someone who programs regularly can probably find a dozen better ways to do what I did.

<?php
for ($year=1980; $year<=2010; $year++) {
 $titles = array();

 for ($i=1; $i<=2; $i++) {
  $url = "http://boxofficemojo.com/yearly/chart/?page=$i&view=widedate&view2=domestic&yr=$year&p=.htm";
  $ch = curl_init($url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  $curl_scraped_page = curl_exec($ch);
  curl_close($ch);
  //echo $curl_scraped_page;
  $content = preg_replace("#><#", ">\n<", $curl_scraped_page);
  //echo $content;
  $regex = '#<a href=\"/movies/.+\">.*</a>#';
  preg_match_all($regex,$content,$matches); // stores strings as $matches[][], size [1][102]
  //var_dump($matches);
  array_shift($matches[0]); // drop the first match, a link that is not part of the movie list
  array_pop($matches[0]); // drop the last match, also not part of the list
  //var_dump($matches);
  $moviereg = '#>.*<#';

  foreach($matches[0] as $movies) {
   //echo $movies . "\n";
   preg_match($moviereg,$movies,$title);
   //echo $title[0] . "\n";
   $title[0] = substr($title[0],1,-1); // strip the leading ">" and trailing "<"
   //echo $title[0] . "\n";
   array_push($titles, $title[0]);
  }
 }
 //print_r($titles);
 $titlelist = "";

 foreach($titles as $title) {
  $title = str_replace(",","",$title);
  $titlelist .= "$title,";
 }

 $fileyear = "$year.csv";
 if (!$handle = fopen($fileyear,'a')) {
  echo "Cannot open file ($fileyear)";
  exit;
 }
 if (fwrite($handle,$titlelist) === FALSE) {
  echo "Cannot write to file ($fileyear)";
  exit;
 }

 echo "Success, wrote to file ($fileyear)\n";

 fclose($handle);
}
?>

The meat of the program opens the BoxOfficeMojo page for a given year and pulls the page’s HTML into a string. I had to do a bit of formatting by sticking new lines between any neighboring HTML tags, because the regular expression (“regex”) that I cobbled together was returning one massive block of text. The most likely cause is that . does not match a new line by default while .+ and .* are greedy, so wherever several links sat on one long line, a single match ran from the first link to the last closing tag on that line.

Then it was a simple case of creating an array of all the occurrences of a specific sequence, the a href=”/movies/…” which marks the link and title of each movie in the list (plus two other links that aren’t part of the list). After that, I strip everything other than the actual movie title, remove any commas in the movies’ names, build another string of the titles separated by commas, and save it as a CSV. Toss all of that into a pair of loops, one incrementing the year and one fetching the second page for years with more than 100 movies, and everything’s groovy.
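The newline workaround and the second title-stripping regex could probably be folded into a single pattern. What follows is only a sketch: the function name extract_titles and the sample markup are made up for illustration, and the real page's HTML may well differ. It uses [^"]+ and a lazy (.*?) so that each match stops at the end of one link, and a capture group so the title comes out without the surrounding > and <.

```php
<?php
// Hypothetical helper: returns the link text of every /movies/ link in $html.
function extract_titles(string $html): array
{
    // [^"]+ stops at the closing quote of the href, and (.*?) is lazy,
    // so each match covers exactly one link even when links share a line.
    preg_match_all('#<a href="/movies/[^"]+">(.*?)</a>#', $html, $matches);

    // $matches[1] holds only the captured group, i.e. the titles.
    return $matches[1];
}

// Made-up sample markup with two links on one line.
$sample = '<td><a href="/movies/?id=alien.htm">Alien</a></td>'
        . '<td><a href="/movies/?id=airplane.htm">Airplane!</a></td>';
print_r(extract_titles($sample));
```

The two unrelated links at the start and end of the list would still need to be dropped, as in the original script.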

Open all the CSVs as sheets in the same spreadsheet, convert each row of movies to a column, and I now have a 31-tab file that’s ready to be evaluated for the nature of the scripts. When I get around to doing all of that evaluation is another question entirely. If only I had an intern.
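The row-to-column step could likely be skipped by writing one title per line in the first place. This is a rough sketch rather than what the script above does: write_titles_csv is a hypothetical name, and it leans on PHP's built-in fputcsv, which quotes any field containing a comma, so the commas in titles would not need to be stripped.

```php
<?php
// Hypothetical helper: writes one title per row, so the file opens as a column.
function write_titles_csv(string $path, array $titles): void
{
    // 'w' truncates the file, so re-running does not append duplicate titles.
    $handle = fopen($path, 'w');
    if ($handle === false) {
        throw new RuntimeException("Cannot open file ($path)");
    }

    foreach ($titles as $title) {
        // fputcsv adds quotes where needed, e.g. around titles with commas.
        // Separator, enclosure and escape are passed explicitly.
        fputcsv($handle, [$title], ',', '"', '\\');
    }

    fclose($handle);
}
```

Calling write_titles_csv("$year.csv", $titles) at the end of each pass through the year loop would then take the place of the string-building and fwrite steps.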
