There are a lot of hypocritical people who complain about modern life while benefiting enormously from it every day… I know this is true because I am one of those people.
In a lot of ways people had it better in the past. They could discover new continents, invent the theory of gravity, and drive without seatbelts.
The American expansion into the West was one of those golden periods of time. Cowboys riding around, miles and miles of open range, tuberculosis—what more could anyone ask for?
I am a fan of Western films. Those films captured the West with all the cool shootouts and uniquely western landscapes. Sure, they are romanticized, but that is what I like. While watching a spaghetti western I wondered where these events were supposed to be taking place.
The question in my head became, “According to Western films, where is the West?” We can answer that. We have the technology. After a few false starts I got a working process together:
- Find a bunch of lists of Western films on Wikipedia
- Scrape the film titles
- Run those film titles by Wikipedia to see if we can find their plots
- Use a natural language processor (NLP) to pull place names from the plots
- Run the place names through Wikipedia again to remove junk place names (a sketch of this step follows the list)
- Geocode the place names and load it all into a spatially enabled dataframe
- Map it
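Step five is the least obvious part, so here is one way it could work. The keyword heuristic below is my assumption about what filtering out "junk" might mean, not the notebook's actual rule:

```python
import wikipedia  # pip install wikipedia

# Keyword heuristic -- an assumption, not the notebook's actual rule
PLACE_WORDS = ("city", "town", "state", "county", "territory", "river", "mountain")

def looks_like_place(name):
    """Keep a name only if its Wikipedia summary reads like a place."""
    try:
        summary = wikipedia.summary(name, sentences=1).lower()
    except wikipedia.exceptions.WikipediaException:
        return False  # no page, or an ambiguous one: treat as junk
    return any(word in summary for word in PLACE_WORDS)

candidates = ["Cripple Creek", "Texas", "Marshal"]  # sample NLP output
place_names = [n for n in candidates if looks_like_place(n)]
```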
Let’s take the example of the 1952 film Cripple Creek. First, we get the title.
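The post doesn't reproduce the notebook's code, but a sketch with requests and BeautifulSoup gives the idea; the list URL and the `<i>`-tag selector below are my assumptions, not necessarily what the notebook uses:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical source list; Wikipedia has several decade-by-decade lists
url = "https://en.wikipedia.org/wiki/List_of_Western_films_of_the_1950s"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
# Film titles on these list pages are typically italicized (<i> tags)
titles = [i.get_text(strip=True) for i in soup.select("table.wikitable i")]
```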
Next, we use the title to find the plot on its Wikipedia page.
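One easy way is the `wikipedia` package; the exact page title here is an assumption, and real titles often need disambiguation handling:

```python
import wikipedia  # pip install wikipedia

# Page title is a guess; some films need qualifiers like "(1952 film)"
page = wikipedia.page("Cripple Creek (film)", auto_suggest=False)
plot = page.section("Plot")  # the section's plain text, or None if absent
```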
After running the plot through the NLP, we get the following “GPE” entities.
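"GPE" (geopolitical entity) is the label spaCy's named-entity recognizer uses, so spaCy is a reasonable guess for this step, though the post doesn't name the library:

```python
import spacy

# Model must be downloaded once: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp(plot)  # `plot` is the plot text fetched above
places = {ent.text for ent in doc.ents if ent.label_ == "GPE"}
```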
Next, we geocode these addresses to get X and Y coordinates and read them into a pandas dataframe.
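Any geocoder could fill this role; here is a sketch with geopy's Nominatim client, which is my choice rather than a detail from the post (note that Nominatim rate-limits bulk use):

```python
import pandas as pd
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="western-films-demo")  # identify your app
rows = []
for name in places:  # GPE entities from the previous step
    location = geolocator.geocode(name)
    if location is not None:
        rows.append({"place": name, "x": location.longitude, "y": location.latitude})
df = pd.DataFrame(rows)
```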
We see Cripple Creek, CO and Texas mapped from the plot of the film Cripple Creek.
The notebook is broken up as follows:
- Import packages
- Search Plots with Natural Language Processor for Place Names
This is a map of the West according to our definition and process.
Many of the hotspots come from repeated mentions of states like Texas, Colorado, Kansas, and California, but plenty of individual cities and towns made it through as well.
To start, 1,995 titles were collected. From these we found over 1,750 movie plots, which yielded about 3,750 potential place names. Those were filtered down to around 1,750 by checking their Wikipedia entries. Finally, 1,000 or so points made it to the end to be plotted.
Plotting the centroid, we find (36.35037924014938, -106.2693241648988), which sits neatly in north-central New Mexico.
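For the curious, the centroid here is just the mean of the point coordinates; a sketch, assuming the x/y dataframe built earlier:

```python
import pandas as pd

# Two sample points stand in for the ~1,000 real ones
df = pd.DataFrame({"x": [-105.18, -99.90], "y": [38.75, 31.17]})
centroid = (df["y"].mean(), df["x"].mean())  # (latitude, longitude)
```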
If you ever make it onto Jeopardy and the answer “The centroid of the American West as defined by geocoding place names mentioned in the plots of Western films between 1920 and 1969” comes up, do not respond “Carson National Forest” or you will get it wrong. You need to answer in the form of a question on Jeopardy.
The Jupyter notebook can be found here.
You need to have Selenium installed first: run `pip install selenium` in the terminal, or with IPython run `!pip install selenium`.
First, set up a Firefox webdriver and point it to our URL of interest.
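A minimal version, assuming the geckodriver binary is already on your PATH:

```python
from selenium import webdriver

# Start a Firefox session (requires geckodriver on PATH)
driver = webdriver.Firefox()
driver.get("https://en.wikipedia.org/wiki/International_Space_Station")
```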
Let’s select the first `<table>` element from Wikipedia’s ISS article as an example.
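With a current Selenium release the selection looks like this (older Selenium code used `find_element_by_xpath` instead):

```python
from selenium.webdriver.common.by import By

# XPath positions are 1-based, so this grabs the article's first <table>
table = driver.find_element(By.XPATH, "(//table)[1]")
```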
Now we want to see if our XPath selector got us what we were looking for.
We can look at the raw HTML of that first `<table>` and see if it’s what we wanted. To get the raw HTML of a selected element, we can get its `outerHTML` attribute.
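Continuing with the `table` element selected above:

```python
# outerHTML includes the element's own tag, not just its contents
raw_html = table.get_attribute("outerHTML")
print(raw_html[:500])  # peek at the first 500 characters
```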
Reading raw HTML isn’t very nice.
Let’s take advantage of some IPython Notebook magic: since we’re viewing the notebook in a web browser, we can also render HTML content directly in the notebook.
We lose whatever CSS styling was in the scraped website, as well as anything loaded from relative links, but we can see the general structure which is often all we want anyway.
This can make it much easier to see what our XPath selectors are actually pulling from the site. Is it what we intended? Scraping HTML is a messy business and selectors often surprise you, so it’s nice to be able to get visual feedback.
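One way to do it, with IPython’s built-in display helpers:

```python
from IPython.display import HTML, display

# Render the scraped markup as live HTML in the notebook's output cell
display(HTML(table.get_attribute("outerHTML")))
```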
Here is the same table as above, rendered as HTML in the IPython notebook. Relative links won’t work, but in the example below the image of the ISS shows up correctly because its `src` is an absolute link.
*The International Space Station on 23 May 2010 as seen from the departing Space Shuttle Atlantis during STS-132.*

| Station statistics | |
| --- | --- |
| COSPAR ID | 1998-067A |
| Call sign | Alpha, Station |
| Crew | Fully crewed: 6; currently aboard: 6 (Expedition 47) |
| Launch | 20 November 1998 |
| Launch pad | Baikonur 1/5 and 81/23; Kennedy LC-39 |
| Mass | Appx. 419,455 kg (924,740 lb)[1] |
| Length | 72.8 m (239 ft) |
| Width | 108.5 m (356 ft) |
| Height | c. 20 m (c. 66 ft) nadir–zenith, arrays forward–aft (27 November 2009) |
| Pressurised volume | 916 m³ (32,300 cu ft) (3 November 2014) |
| Atmospheric pressure | 101.3 kPa (29.91 inHg; 1 atm) |
| Perigee | 409 km (254 mi) AMSL[2] |
| Apogee | 416 km (258 mi) AMSL[2] |
| Orbital inclination | 51.65 degrees[2] |
| Average speed | 7.66 kilometres per second (27,600 km/h; 17,100 mph)[2] |
| Orbital period | 92.69 minutes[2] |
| Orbit epoch | 25 January 2015[2] |
| Days in orbit | 6353 (12 April) |
| Days occupied | 5640 (12 April) |
| Number of orbits | 95912[2] |
| Orbital decay | 2 km/month |
| Statistics as of 9 March 2011 (unless noted otherwise) | |
| References: [1][2][3][4][5][6] | |
| Configuration | |
| Station elements as of May 2015 (exploded view) | |
Much nicer!