Python Web Scraping Tutorial 2 – Our First Funny Web Scraper


Let’s create our first, very simple web scraper. We’ll look at how the process of accessing a webpage works, step by step, and we’ll try to extract the HTML from a simple page.
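As a rough idea of what such a scraper can look like, here is a minimal sketch built on `urlopen` from the standard library's `urllib.request`, the function the video imports. The helper names `fetch_html` and `show_page` are made up for this illustration, and decoding the page as UTF-8 is an assumption, so treat it as a sketch rather than the exact script typed in the video.

```python
from urllib.request import urlopen


def fetch_html(url):
    """Send a GET request for `url` and return the raw response body as bytes."""
    # urlopen returns a file-like response object; the with-block closes it.
    with urlopen(url) as response:
        return response.read()


def show_page(url):
    """Print the HTML found at `url` as text."""
    # Assumption: the page is UTF-8; bytes that cannot be decoded are replaced.
    print(fetch_html(url).decode("utf-8", errors="replace"))
```

Calling `show_page("http://pythonscraping.com/pages/page1.html")`, the page named in the video, should print that page's HTML, provided the site is reachable from your machine.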

Source Code for tutorials on Youtube:

[bg_collapse view=”link-inline” color=”#4a4949″ icon=”zoom” expand_text=”View transcript” collapse_text=”Hide transcript” ]Hello Pythonistas on YouTube, and welcome to another Python web scraping tutorial. This is our second video, and in this tutorial we’re going to create our very first, very simple web scraper. We’ll go over the basics, so let’s get into it.

Once you start web scraping, you start to appreciate all the little things that browsers do for us. The web without a layer of HTML formatting, CSS styling, JavaScript execution, and image rendering can look a little intimidating at first, but in this tutorial, as well as the next several, we’ll cover how to format and interpret data without the help of a browser. In this lesson we’ll start with the basics: sending a GET request to a web server for a specific page, reading the HTML output from that page, and doing some simple data extraction to isolate the content we are looking for.

Let’s talk a little about what happens when we connect to a web server. If you haven’t spent much time in networking or network security, the mechanics of the Internet might seem a little mysterious. We don’t want to think about what exactly the network is doing every time we open a browser and, for example, go to an http address, and these days we don’t have to know what’s happening under the hood. In fact, it’s fantastic that computer interfaces have advanced to the point where most people who use the Internet don’t have the faintest idea how it works. However, web scraping requires stripping away some of this shroud of interface, not just at the browser level (how it interprets all of this HTML, CSS, and JavaScript) but occasionally at the level of the network connection. To give you some idea of the infrastructure required to get information to your browser, let’s use the following example, the famous Alice and Bob example.
Let’s say Alice owns a web server and Bob uses a desktop computer, which is trying to connect to Alice’s server. When one machine wants to talk to another machine, something like the following exchange takes place.

Bob’s computer sends along a stream of 1 and 0 bits, indicated by high and low voltages on a wire. These bits form some information containing a header and a body. The header contains an immediate destination of his local router’s MAC address, with a final destination of Alice’s IP address. The body contains his request for Alice’s server application.

The next step is that Bob’s local router receives all these ones and zeros and interprets them as a packet from Bob’s own MAC address, destined for Alice’s IP address. His router stamps its own IP address on the packet as the “from” IP address and sends it off across the Internet. Bob’s packet then traverses several intermediary servers, which direct his packet toward the correct physical, wired path to Alice’s server.

Alice’s server receives the packet at her IP address. It then reads the packet’s port destination in the header, which is almost always port 80 for web applications (this can be thought of as something like an apartment number for packet data, where the IP address is the street address), and passes it off to the appropriate application, in this case the web server application. The web server application receives a stream of data from the server processor. This data says something like: this is a GET request, and the following file is requested, for instance index.html. The web server then locates the correct HTML file, bundles it up into a new packet to send to Bob, and sends it through to its local router for transfer back to Bob’s machine through the same process. And there we have it: that’s the Internet.
Where in this exchange did the web browser come into play? Absolutely nowhere. In fact, browsers are a relatively recent invention in the history of the Internet, considering Nexus was released in 1990. Yes, the web browser is a very useful application for creating these packets of information, sending them off, and interpreting the data you get back as pretty pictures, sounds, videos, and text. However, a web browser is just code, and code can be taken apart, broken into its basic components, rewritten, reused, and made to do anything we want. A web browser can tell the processor to send some data to the application that handles your wireless or wired interface, but many languages have libraries that can do that just as well. So let’s look at how it’s done in Python.

I’m going to create a new file in my folder structure. By the way, I’m using Python 3.5, and I’m using Visual Studio Code as an editor, but you can use whatever you want. I hope you are proficient and comfortable with a text editor of your choice and that you know how to execute Python scripts from the command line. So let’s create our first file. This is our very simple scraper. I’ll explain in a minute what we are going to do here; let me just type this out first. This website is the one used in the book that I showed you guys in the previous video, so you can use that as well.

Let’s run this code. I’m going to just step through the code: it gets a page, and the page that we are referring to is this page. So it’s scraping this page, and it gets all the HTML tags, the title, head, and body. Run it on your system. If you’d like to do it from the command line, that’s totally up to you, or if you want to use an IDE, that’s also perfectly fine. What it does is output the complete HTML code of the pythonscraping.com page. Let’s go back to our presentation. It gets the HTML code from this URL. More
accurately, this outputs the HTML file page1.html, found in the /pages directory under the web root, on the server located at the domain name http://pythonscraping.com.

So why is it important to start thinking of these addresses as files rather than pages? Most modern web pages have many resource files associated with them. These could be image files, JavaScript files, CSS files, or any other content that the page you are requesting is linked to. When a web browser hits a tag such as an img tag with src equal to cutekitten.jpg, the browser knows that it needs to make another request to the server to get the data in the file cutekitten.jpg in order to fully render the page for the user. Keep in mind that our Python script doesn’t have the logic to go back and request multiple files, at least not yet, so it can only read the single HTML file that we requested.

So how does it do this? Thanks to the plain-English nature of Python, the line from urllib.request import urlopen means what it looks like it means: it looks at the Python module request, found within the urllib library, and imports only the function urlopen.

A note on urllib versus urllib2: if you’ve used the urllib2 library in Python 2.x, you might have noticed that things have changed somewhat between urllib2 and urllib. In Python 3.x, urllib2 was renamed urllib and was split into several submodules: urllib.request, urllib.parse, and urllib.error. Although function names mostly remain the same, you might want to note which functions have moved to submodules when using the new urllib. urllib is a standard Python library, meaning you don’t have to install anything extra to run the example we have just done. It contains functions for requesting data across the web, handling cookies, and even changing metadata such as headers and your user agent. We will be using urllib extensively throughout these tutorials, so I recommend that you read the
Python documentation for the library. Let me go back: this is the library documentation, so please do read it. If there is some code that you don’t understand, just go through the documentation so that you understand what the functions and the code structure are doing.

urlopen is used to open a remote object across a network and read it. Because it is a fairly generic library, it can read HTML files, image files, or any other file stream, so we’ll be using it quite frequently throughout the tutorials.

So that’s it for this tutorial, guys. I hope you have enjoyed creating your very first, very simple web scraper, and we’ll get into making slightly more complex applications in later tutorials. If you liked the video, please subscribe, hit the like button, leave a comment if you have one, and please help me share this content as well. Thank you so much for watching, and I hope to see you in the next video. Thanks guys, bye.[/bg_collapse]
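The Alice and Bob exchange in the transcript ends with the web server receiving something like "this is a GET request, and this file is requested". To make that message a little less mysterious, here is a hedged sketch that writes such a request by hand and sends it over a plain TCP connection to port 80. The function names are hypothetical, it assumes plain HTTP (not HTTPS), and it uses HTTP/1.0 only to keep the reply simple. The MAC addresses, routers, and packets from the story are handled by the operating system and the network, so they never appear in the code.

```python
import socket


def build_get_request(host, path="/"):
    """Return the bytes of a minimal HTTP/1.0 GET request for `path` on `host`."""
    lines = [
        "GET {} HTTP/1.0".format(path),  # "this is a GET request" for this file
        "Host: {}".format(host),         # which site on the server we want
        "",                              # an empty line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def split_response(raw):
    """Split a raw HTTP response into (status_line, headers_dict, body_bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


def fetch_raw(host, path="/", port=80):
    """Connect to `host` on `port`, send the GET request, and read the whole reply."""
    with socket.create_connection((host, port), timeout=10) as conn:
        conn.sendall(build_get_request(host, path))
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # the server closed the connection: reply is complete
                break
            chunks.append(chunk)
    return split_response(b"".join(chunks))
```

For example, `fetch_raw("pythonscraping.com", "/pages/page1.html")` would request the same file the video scrapes, assuming the server still answers plain HTTP on port 80. In practice `urlopen` does this work for us, which is why the tutorials use it instead.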

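The transcript points out that a page is really one file that links to other files (images, JavaScript, CSS), and that our script does not go back for those. As a small illustration of how a script could at least list them, the sketch below uses the standard library's `html.parser` and `urllib.parse.urljoin`. The class and function names are made up for this example, and looking only at `img`, `script`, and `link` tags is a simplifying assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceFinder(HTMLParser):
    """Collect the addresses of files a page links to (images, scripts, stylesheets)."""

    # Tag name -> attribute that holds the linked file's address.
    # Assumption: only these three tags matter for this sketch.
    RESOURCE_ATTRS = {"img": "src", "script": "src", "link": "href"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = self.RESOURCE_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative addresses against the page's own URL.
                self.resources.append(urljoin(self.base_url, value))


def find_resources(html_text, base_url):
    """Return the absolute URLs of the resource files referenced in `html_text`."""
    finder = ResourceFinder(base_url)
    finder.feed(html_text)
    finder.close()
    return finder.resources
```

Each address in the returned list is a file a browser would request separately in order to render the page fully.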

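The transcript notes that in Python 3.x urllib is split into the submodules `urllib.request`, `urllib.parse`, and `urllib.error`. Here is a brief sketch touching all three, so you can see which job lives where. The helper names `describe_url` and `fetch_or_none` are hypothetical, and printing a message on failure is just one possible way to handle errors.

```python
from urllib.error import HTTPError, URLError
from urllib.parse import urlparse
from urllib.request import urlopen


def describe_url(url):
    """Break a URL into its scheme, domain name, and file path."""
    parts = urlparse(url)
    return {"scheme": parts.scheme, "domain": parts.netloc, "path": parts.path}


def fetch_or_none(url):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urlopen(url) as response:
            return response.read()
    except HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it has to be caught first.
        print("Server returned an error:", err.code)
    except URLError as err:
        # No usable answer at all (unknown scheme, bad domain, no network, ...).
        print("Could not complete the request:", err.reason)
    return None
```

For the page used in the video, `describe_url("http://pythonscraping.com/pages/page1.html")` separates the domain name from the `/pages/page1.html` file path, which matches the "files rather than pages" way of thinking about addresses.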