[bg_collapse view="link-inline" color="#4a4949" icon="zoom" expand_text="View transcript" collapse_text="Hide transcript" ]
In this tutorial I want to show you how we can build a web scraper in Node.js using the packages request and cheerio. We're going to get all of the different jobs in the San Francisco Bay Area inside the IT sector, and we're going to see how you can get the job title, the date it was posted, the location, the description of the job, and maybe even the compensation. In the end we'll have a big JSON array with all the data from the various jobs. I hope you'll enjoy this tutorial; I'll see you in a bit while we start building the scraper.

Okay, so the first thing we're going to do is make a folder for our project; in this case I'll call it craigslist. Then I'm going to go inside the folder in the terminal and we'll write npm init --yes. The --yes flag accepts all the default answers, so we don't have to press Enter over and over. After we've run this we can open the project inside our code editor, so I'm going to go ahead and open the folder in Visual Studio Code. Now the project is open in Visual Studio Code.

Okay, so now we're going to add the packages we'll be using in the web scraper project, which are request, request-promise and cheerio, and I'll show you later what we use these different packages for. If you prefer to use npm instead of yarn, which I'm using here, you can run npm install with --save instead. Once all of the packages have finished installing, we can create our index.js, where we'll be writing our web scraper. On the left side is my editor, Visual Studio Code, which I use to edit the code, and on the right side is the browser, where I navigate around the page that I want to scrape. This lets me look at the elements, figure out how to select them, and then write the code on the other side. So first let's get the request module: we say require('request-promise'), and this enables us to download pages.
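For reference, after running npm init --yes and installing the packages, the dependencies section of package.json should look roughly like this (the version numbers here are only an example, not taken from the video):

```json
{
  "dependencies": {
    "cheerio": "^1.0.0-rc.3",
    "request": "^2.88.0",
    "request-promise": "^4.2.4"
  }
}
```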
We can then load the downloaded page into cheerio to select different elements, just like jQuery. Okay, so let's create our main function for the web scraper, which is going to be an async function. If you're not familiar with async, it's a feature from modern JavaScript (ES2017) that enables us to use keywords such as await, and I'll explain that later. We'll make the call for the function down here. So it enables us to write await in front of every asynchronous call we make. Let me show you in practice how this works. We'll get the HTML from the page first, so we say request.get, and this basically downloads any URL we pass into request, so we give it the Craigslist page with the jobs. I'll make a url variable; I'm going to paste this URL in the description as well, so if you want it, look in the description or the resources for the Udemy lecture. Then we can use the await keyword to wait for this request to finish, and then we can do something with the HTML result. The old-fashioned way would be something like request.get(url).then(...) and then doing something with the result, maybe with a catch clause, but I think using the await keyword is a lot cleaner than chaining then clauses and so on. That was just a short intro to await in case you're not used to it. Then I'm also going to paste in a try-catch clause so we catch any errors in the code and can print them to the console. Now let's just see in the console what request is actually getting from the Craigslist page, so you can see it in practice: we write console.log(html), run node index.js, which runs our web scraper, and you can see in the console that it's just the basic HTML page as a string. Now, we can't really do much with the page as it is right now; that's what we need cheerio for.
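The await-versus-then contrast described here can be sketched with a plain promise; downloadPage below is a made-up stand-in for request-promise's request.get(url), so this runs without any network access:

```javascript
// Stand-in for request.get(url): resolves with a fake HTML string,
// so the async/await pattern can be shown without a real request.
const downloadPage = (url) =>
  Promise.resolve('<html><title>' + url + '</title></html>');

// Old style: chained .then()/.catch()
function scrapeWithThen(url) {
  return downloadPage(url)
    .then((html) => html.length)
    .catch((err) => console.error(err));
}

// New style: async/await with try/catch, as used in the tutorial
async function scrapeWithAwait(url) {
  try {
    const html = await downloadPage(url);
    return html.length;
  } catch (err) {
    console.error(err);
  }
}

// Both return the same value for the same input.
scrapeWithAwait('https://example.org').then((len) => console.log(len));
```

Both functions do the same work; the await version reads top to bottom like synchronous code, which is what the narration means by "a lot cleaner".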
Okay, so now we have all the HTML of the Craigslist page inside this string variable, but we also need to be able to select the different elements and values that we want. You could maybe use regular expressions, but it's a lot easier and more efficient to use something like jQuery, or in Node.js's case, cheerio. You can see we use the dollar sign for the cheerio-loaded page, which lets us select all the different elements using CSS selectors. So let's start out by selecting all the titles of the jobs. Let's open the Chrome developer tools; we use Ctrl+Shift+I, or you can go into the Google Chrome menu, under More Tools > Developer Tools. Then we can pick any element on the page with the inspect icon up here: we click on one, land in the Elements panel, and see how the HTML looks for that specific element. For example, right here we have the title for the autonomous vehicle technical trainer job. We have a result-info class inside this unordered list with the class rows: the job title, the date and so on are inside each result-row list item, which contains a p element with the class result-info. Inside this p.result-info element we have all the information we want from the first page.

Let's first define a sample object with all the information that we want to scrape from the list, so we get an overview of what we want. I'm going to call it scrapeResult, and it'll have a title for the job, for example "Technical autonomous vehicle trainer". We're also going to have a description, which is the longer text you get when you click the job's link, with a long description of the job, the requirements, who they are and so on. Then let's see what else we should get: the date it was posted could be interesting too, so I'll make that a JavaScript Date object. I think we should also get the URL for the job, so you know where you can read the full description, see the location and so on; that's what you see when you click the link. Let's also get the neighborhood the job is in, or the "hood", and then I think it could be interesting to get the address of the job, over here on the right side, and the compensation, which is interesting data to have for all of the jobs, to see which ones pay the most. Of course, this is just a sample so you can see what we're going to get, and we're going to get this for every job on this page of Craigslist.

Now let's first check whether jQuery is available on this page, so we go into the Console tab of the Chrome developer tools, and we can see something comes up when I type the dollar sign, which means jQuery is on the page. Remember that every job has this result-info class on a p element. When we write the selector for result-info, we get 120 elements, which means we get all of the jobs on the page, so that's great. Now let's look inside a result-info element: we see there's an anchor element which has the link and the title of the job. So maybe if we select the hdrlnk class, or the result-title class, we can get all of the job titles. Let's try that. We go through each of the result-info elements with an .each() loop, which gives us an index and an element, where the element is each job row. So I write console.log, the dollar sign with my element in here, and then I can use a selector method like .find() or .children(); I'm going to use .children() and select all the children that have the class result-title. That's going to get me the title of every job: we have the result-info class for every job, and inside there's an anchor element with the class result-title, which we can use to get the job title. Now I just need to figure out where I'm missing a parenthesis here, so hold on a moment; there are a lot of parentheses, so hang in there, but I'll figure it out. Okay, yeah, I missed one at the end here, and now we get all of the job titles.

Now we can basically just paste this code that we made in the Chrome console into our Node.js project. Let me show you how that works: I select all of it and paste it into my Node.js project, like so. Notice we're using the dollar sign that we defined with cheerio.load, which gives us the opportunity to copy-paste code from the Chrome console, which also uses the dollar sign to select elements with jQuery. And you'll see we get exactly the same output as in the Chrome console: we run the index.js file using node index.js, and there we go, the exact same output. Okay, the last thing I'll do is make a variable, so I'll say const title instead of having a console.log like that.

So now let's get the URL for the job, where we can read the description. You can see it's on the anchor element, in the href attribute, as it's called. So let's make a variable for the element, and then we can select both the title and the URL; let me show you what I mean. I'll call it resultTitle and put the element itself in here, and from resultTitle I can just say .text() to get the title of the job, and to get the URL of the job I say .attr('href'). This is exactly the same as if I were using jQuery, so if you're familiar with jQuery this should feel really familiar. And I think I want to put these inside an object, like the sample object we defined before, which I'll call scrapeResult.
I'll rename the sample object up here to scrapeSample, and here we put in the title and the URL. We're putting all of the different scrape results, which are the different jobs, into an array I've defined, so I'll just push each result onto it, and at the end we can do a console.log of our array with the jobs to see what it looks like. Run node index.js and we can see a nice array coming up with lots of objects, each with a title and a URL for the job. Now we just need to add more properties to these scrape results.

Next, let's get the date and time the job was posted to Craigslist, which is inside this time element, the HTML time tag with the datetime attribute, and I think we can use this datetime attribute to create a new JavaScript Date object. Notice that the time element is inside result-info, so we can use the same .each() loop as before and just select the time element instead of the result-title class. So let me just write 'time', and then we read the datetime attribute instead of the text, and now we have the time that every job was posted. Again, we can just paste this code into our Node.js project and it'll run exactly the same as in the Chrome console. So I've added the element and the children selector in the Node.js project, and then I can add the value to my scrapeResult object and we'll have the date. I also want to turn this into a JavaScript Date object; actually, let me just add it first. All right: new Date(...) makes it into a JavaScript Date object, and then we should have a Date object with the date the job was posted. Let's run the code with Node.js, and we can see datePosted is the date the job was posted. Okay, so that was it.

Now let's get the neighborhood the job is in, or the hood, which is inside the result-meta class, in the span with the result-hood class. So it's not a direct child of result-info; it's actually a grandchild, I guess you'd call it.
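The datetime-to-Date step can be sketched like this; the attribute value below is an example in the format Craigslist uses, not a value taken from the video:

```javascript
// The <time> element carries a machine-readable datetime attribute,
// e.g. <time class="result-date" datetime="2019-03-08 11:02">Mar 8</time>.
// In the scraper this string would come from .attr('datetime').
const datetimeAttr = '2019-03-08 11:02'; // example value

// Passing the string to the Date constructor gives us a real Date object,
// which is much easier to sort and compare than the raw string.
const datePosted = new Date(datetimeAttr);

console.log(datePosted instanceof Date); // true
console.log(datePosted.getFullYear());   // 2019
```

Storing a Date object instead of the string means you can later sort jobs by recency with a plain numeric comparison.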
So instead of chaining .children().children(), I'll just use .find() to traverse all the descendants. That means I can write find('.result-hood'), and we just want the text of it, so we get the neighborhood text, like so. Now we have that for all of the different jobs. Before we move on to the job description, let's copy-paste the selector we have in the Chrome console for the neighborhood, make a variable called hood, paste the selector in here, and add it to our result object.

Now the next thing we need to do is go through each of these job URLs and get the job description; we can also get the address from that page, and maybe the compensation. So first, let's rename our main function from scrapeCraigslist to scrapeJobHeader, like so, and then make another main function, which we'll call scrapeCraigslist. In here we'll call scrapeJobHeader instead, and we get the array back from it: the job array with the headers, with the titles and links and so on. I'll say await, and then I want to return the scrapeResults array from scrapeJobHeader, which contains the title, the date it was posted, and so on. Then let's make a function here, which we'll call scrapeDescription, where we'll go through each of the jobs inside this scrapeResults array and get the description for each job, so we pass in the jobsWithHeaders array. Inside here we need to do a little magic using Promise.all to iterate through each of these jobs and make a request. Let me show you what we'll get from this function: we'll get the jobs with the full data, and we pass in the jobsWithHeaders. We say await first, and then Promise.all; this takes in an array of promises, and it's only going to resolve once all of the promises have resolved. And we use jobsWithHeaders.map to iterate over the array of jobs with headers. This is basically modern JavaScript (ES6 and async/await) I'm showing you here: we iterate over the jobsWithHeaders array using map, and then for each job we make a request to the job's URL that we got earlier from the main page. I'm also throwing in async here as well, so we can use the await keyword inside the map callback. All of this can seem confusing if you don't know ES6 and async/await yet; if you don't know exactly what I'm doing here, maybe brush up a bit on those, but hopefully you'll get the gist of how this works.

Then let's make the cheerio variable just like we did before for the titles, only now we're doing it for every job, and let's get the postingbody ID, which contains the whole job description. Now let's see what the postingbody text looks like: we get all of the job description here, but I don't like the "QR Code Link to This Post" text at the top so much, so let's get rid of that. Since the QR-code thing is inside postingbody, I want to remove that element first, and then we can call .text() on postingbody to get the job description. Let me show you how that works: we can use the print-qrcode-container class to select it, and then I'll simply remove it using jQuery. Now the element has been removed, and when I use .text() on postingbody again, I can see the QR-code link text is no longer at the top. Now we can simply do the exact same thing inside the Node.js project: we remove the element first, so we don't get the QR-code link text in the job description, and then use .text() to create our description variable, like so. Now let me make a property on the job object, and we have job.description. Let's also grab the address while we're at it. To find out where the address is, I select the element, and we can see there's a div with the class mapaddress, so why not try to select this. You can also right-click an element and use Copy Selector inside the Chrome dev tools.
But I don't think that selector is always as short or efficient as one you write yourself. Let's see what this selector looks like: you can see it's very long, because it wants to make sure we're selecting exactly the right element, but in this case I think we can just use div.mapaddress. Let's look at the text for that, and it gets the address of the job perfectly, so let's copy that and make it our job.address.

Inside the jobsWithHeaders.map we are returning a promise, and await Promise.all is waiting for all of these promises to resolve; again, this returns a complete array of all the jobs with their descriptions. We use return await, and this is going to return all the jobs with their descriptions. Let's console.log all of the data to see what we're getting; it's going to take more time now. We can see there's undefined in this array, so I'm clearly doing something wrong here: I think I also need to return the object each time from the map callback, like so. I'll run it again, and this time we see an array full of objects containing the description and all of the other properties we scraped before.

Now let's get the compensation for each job as well. We go and select the element again, and we see it's just a span element without any class or ID, but it's inside this p element with the attrgroup class. So let's select the p element with the class attrgroup instead, and see if we can get the compensation out of that. I could, for example, select p.attrgroup and then take the first child of this element. Let's try that: with attrgroup and .text() we can see there's the compensation and the employment type, but we just want the compensation, so let's select the first child. And then we get the compensation, $23 per hour; that's exactly what we want. I think I want to clean this up a bit.
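The Promise.all pattern, including the missing-return pitfall the video runs into, can be sketched like this; fetchDescription is a stand-in for the per-job request, so this runs without any network access:

```javascript
// Stand-in for downloading and parsing one job page; in the real
// scraper this would be a request-promise call plus cheerio selection.
const fetchDescription = async (url) => `description for ${url}`;

async function scrapeDescriptions(jobsWithHeaders) {
  // map() starts one async task per job; Promise.all resolves only
  // once every one of those promises has resolved.
  return await Promise.all(
    jobsWithHeaders.map(async (job) => {
      job.description = await fetchDescription(job.url);
      return job; // forgetting this `return` yields an array of undefined
    })
  );
}

// Usage:
scrapeDescriptions([{ title: 'Trainer', url: 'https://example.org/job/1' }])
  .then((jobs) => console.log(jobs[0].description));
```

Because map fires all the requests at once rather than awaiting them one by one, the jobs are fetched concurrently, which is much faster than a sequential loop.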
Maybe I'll remove the "compensation:" prefix and just have the "23/hour" instead. Let's make a variable first and try it out: take the compensation text and replace "compensation:" with an empty string; this way our data is a bit cleaner, I think. Let's do that inside the Node.js project as well. First let's get the raw compensation data; I'll make that into a variable. Then let's get the cleaner version of the text without the "compensation:" prefix, where we use .replace() to swap it for an empty string, like so, and then we have the data nicely formatted on job.compensation. Let's see how the output looks: we can see we have the compensation here, an hourly rate. Some of the jobs don't have a compensation listed, but I think it's an interesting property to have at least. Then let's put everything inside a try-catch clause to catch any errors we might get; I'll send the error to console.log for now, or console.error, and that's it.

Now we've got a good base for building a scraper, at least for Craigslist, and there are probably a lot of other sites you can use these techniques on. Let's see how many jobs we got: we should have 120, like there are on the page, and you can see 120 jobs, so that's exactly the amount we need. Now you can keep building on this project. You can make it scrape other sites or pages as well; you're only scraping one page right now, but there are multiple pages, and you can build it up so it scrapes all of the pages of jobs there are. You can also build a REST API based on this scraper, so when you go to a REST endpoint, bam, you get all of the data like we get in the console here, and you could present it inside a client or just a basic HTML view; it's up to you. You could also just save all of the data to a file if that's what you want, like a CSV file; that's up to you.
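The cleanup step can be sketched like this; the input string mirrors what the first child of p.attrgroup typically returns on a posting, as an example rather than the exact text from the video:

```javascript
// Raw text as scraped from the first child of p.attrgroup,
// e.g. "compensation: $23 per hour"
const compensationRaw = 'compensation: $23 per hour';

// Strip the "compensation: " prefix to keep just the amount.
const compensation = compensationRaw.replace('compensation: ', '');

console.log(compensation); // "$23 per hour"
```

Note that String.replace with a string argument only replaces the first occurrence, which is exactly what we want for a fixed prefix.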
Okay, so congratulations on finishing this little mini course on scraping Craigslist; I hope you got something out of it. I'm assuming that you already know jQuery and basic JavaScript, so if it went a little fast and you didn't get that much out of it, try to brush up on jQuery, for selecting the HTML elements on the page, and on basic JavaScript and Node.js, for the code we're writing in the editor. I hope you got something out of the tutorial, and see you next time.

[/bg_collapse]
