If you’ve never had the chance to look into web scraping and how it works, this is your chance.
I’m going to show you 4 easy steps to get started with web scraping in Node.js and build a simple scraper of your own.
Table of contents
- Prerequisites
- Initiating the project
- Installing the dependencies
- Writing the code
- 1. Including the libraries in our code
- Request-Promise
- Cheerio
- 2. Sending the request
- 3. Parsing with Cheerio
- 4. Returning and making use of them
- Full Code
- Want to learn more?
Let’s start. I’m sure you will have learned something interesting by the end of this post.
Prerequisites
First, in order to start scraping and coding with Node.js, you need to:
- Have Node.js installed ( try to stay on the latest stable version )
You can easily get Node.js from the official website, nodejs.org, if you don’t have it already.
Initiating the project
Now you need to create a new blank project, initialized with npm init:
- Create a new, empty folder
- Open the terminal ( on either Mac or Windows ) and navigate to that folder
- Then type the following command in the terminal
$ npm init
This will ask you to enter some details about your project. If you don’t want to fill some of them in, you can just hit Enter to leave them blank or accept the suggested default.
By the end of the initialization, you will have something like this in your package.json file:
{
    "name": "LearnScraping.com",
    "version": "1.0.0",
    "description": "",
    "main": "index.js",
    "scripts": {
        "test": "echo \"Error: no test specified\" && exit 1"
    },
    "author": "Grohs Fabian",
    "license": "ISC",
    "dependencies": {}
}
Installing the dependencies
Now we need to install the dependencies the project requires so that we can start writing some code.
Here’s what you’ll need to install:
- request-promise
- cheerio
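You can install these by running the following command in your terminal, just like you did with the npm init command:
$ npm install request request-promise --save
Note that request is installed here too: request-promise declares it as a peer dependency and will not work without it.
Then do the same with the cheerio library:
$ npm install cheerio --save
To make sure they were installed properly, check that they are listed under dependencies in your package.json.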
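If you would rather check this from code than by eye, a small sketch like the one below could do it. The helper name missingDependencies and the sample object are made up for illustration; in a real project you would pass in the result of require('./package.json') instead.

```javascript
// Hypothetical helper: returns the names from `required` that are not
// listed under "dependencies" in a parsed package.json object.
function missingDependencies(pkg, required) {
    const installed = Object.keys(pkg.dependencies || {});
    return required.filter((name) => !installed.includes(name));
}

// Sample object standing in for require('./package.json'). The version
// range is a placeholder, not what npm will actually write for you.
const samplePkg = {
    name: 'my-scraper',
    dependencies: { 'request-promise': '*' },
};

console.log(missingDependencies(samplePkg, ['request-promise', 'cheerio']));
// prints [ 'cheerio' ]
```

An empty array means everything you asked for is listed.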
Writing the code
In this example, we are going to build a simple scraper for an IMDB movie page, the same kind of target we used in NodeJs Web Scraping with Puppeteer, only this time with request-promise and cheerio.
1. Including the libraries in our code
First, we need to import the libraries that we just installed so that we can make use of them.
const request = require('request-promise');
const cheerio = require('cheerio');
And here is some info about them:
Request-Promise
We’re going to use this library to send requests directly to a URL and get the response back so that we can parse it.
In our case, we are going to send a GET request.
PS: We are using this library instead of the “normal” one, which is called “request”, simply because ( as the name says ) it returns a promise. That means we can use the async/await syntax ( part of ES2017, not ES6 ), which I personally love.
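To see why a promise-returning function is convenient, here is a small sketch that does not touch the network. fakeRequest is a made-up stand-in for request-promise; it only mimics the fact that the call returns a promise that resolves with the response body.

```javascript
// Hypothetical stand-in for request-promise: resolves with an HTML string
// after a short delay instead of performing a real HTTP request.
function fakeRequest(options) {
    return new Promise((resolve) => {
        setTimeout(() => resolve(`<h1>Response for ${options.uri}</h1>`), 10);
    });
}

// Style 1: chaining .then() on the returned promise.
function withThen(uri) {
    return fakeRequest({ uri }).then((body) => body.length);
}

// Style 2: async/await, which reads like synchronous code.
async function withAwait(uri) {
    const body = await fakeRequest({ uri });
    return body.length;
}

// Both styles produce the same result.
Promise.all([withThen('https://example.com'), withAwait('https://example.com')])
    .then(([a, b]) => console.log(a === b)); // prints true
```

The second style is the one used in the rest of this post.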
Cheerio
For parsing HTML content, cheerio is one of the most popular solutions in the Node.js ecosystem.
You give it the HTML, and then you can query it using jQuery-style selectors and methods.
What I mean by that is that you can use most of the functions available in jQuery to parse / scrape content from the HTML that we get. Stay tuned for the actual piece of code that’s doing it.
2. Sending the request
Now that we have everything else ready, let’s write the code that actually sends the request to their server.
First, I want to create a variable that holds the link to the movie that we want to scrape details from, just like this:
const movieUrl = 'https://www.imdb.com/title/tt0102926/?ref_=nv_sr_1';
And now, let’s deal with the request. Keep in mind that await only works inside an async function, so this call has to live in one:
const response = await request({
    uri: movieUrl,
    headers: {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        // 'br' is left out on purpose: request can only decompress gzip and deflate responses
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'en-US,en;q=0.9,fr;q=0.8,ro;q=0.7,ru;q=0.6,la;q=0.5,pt;q=0.4,de;q=0.3',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Host': 'www.imdb.com',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    },
    // Decompress the compressed response body for us
    gzip: true
});
This code sends a GET request to the movieUrl.
The headers object contains ordinary HTTP headers that I copied by hand from a normal browser request, as shown in Chrome DevTools.
Sometimes you don’t need to specify them and can leave them at their defaults, but it’s good practice to add them so that the request looks to the server like one coming from a real browser.
Besides this, I set gzip to true, which tells the library to decompress a compressed response for us.
Now, after this line runs, we should have the HTML content of the movie page inside the response variable.
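A request can also fail, for example when the server answers with a 404 or the connection drops. By default request-promise rejects the promise in those cases, and for non-2xx answers the error carries a statusCode property. Below is a minimal sketch of how that could be handled. fetchHtml is a hypothetical helper, and the request function is passed in as a parameter so the sketch can run with stubs instead of the network.

```javascript
// Hypothetical helper: returns the HTML on success, or null when the
// request promise rejects. `requestFn` is expected to behave like
// request-promise: take an options object and return a promise.
async function fetchHtml(requestFn, url) {
    try {
        return await requestFn({ uri: url, gzip: true });
    } catch (err) {
        // Status-code errors expose `statusCode`; other failures only have a message.
        const reason = err.statusCode ? `HTTP ${err.statusCode}` : err.message;
        console.error(`Request to ${url} failed: ${reason}`);
        return null;
    }
}

// Stubs used only for this demonstration.
const okStub = async () => '<html>ok</html>';
const notFoundStub = async () => {
    const err = new Error('404 - Not Found');
    err.statusCode = 404;
    throw err;
};

fetchHtml(okStub, 'https://example.com/a').then((html) => console.log(html));
fetchHtml(notFoundStub, 'https://example.com/b').then((html) => console.log(html));
```

In the scraper itself you would call it as fetchHtml(request, movieUrl), with request being the request-promise import, and skip the parsing step when it returns null.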
3. Parsing with Cheerio
It’s time to make use of the second library that we’re using for scraping, which is Cheerio.
let $ = cheerio.load(response);
With this line of code, we instantiate the cheerio library with the HTML we received, and $ is what we will use to query it.
Let’s start with the parsing!
let title = $('div[class="title_wrapper"] > h1').text().trim();
let rating = $('div[class="ratingValue"] > strong > span').text();
let ratingCount = $('div[class="imdbRating"] > a').text();
As you can see, it looks pretty easy and simple. If you have some basic knowledge of the jQuery library, this will feel very familiar to you.
We’re using the typical CSS selectors.
For example, the CSS selector for the title basically translates to:
“Give me the div whose class attribute is equal to "title_wrapper", and then select its direct child element that is an h1.”
And then we use the .text() function to get the text inside the selected HTML element.
You can also read more about Cheerio on their GitHub page.
4. Returning and making use of them
You now have access to the values that you scraped. You can return them as an object for later use, like this ( optional ):
let data = { title, rating, ratingCount };
You can use this data however you want ( save it to a file, save it to a database, turn it into a CSV, etc. ); this really depends on your needs.
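As a rough sketch of the "save it to a file" option, the snippet below writes the object out as both JSON and CSV using only Node's built-in modules. The file names, the temp-folder location and the toCsvRow helper are assumptions made for this example, and the sample values are the ones from this post.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Same shape as the scraped object, filled with sample values.
const data = { title: 'The Silence of the Lambs', rating: '8.1', ratingCount: '1,078,589' };

// Quote every field and double any embedded quotes, so values that contain
// commas (like "1,078,589") stay in a single CSV column.
function toCsvRow(values) {
    return values.map((v) => `"${String(v).replace(/"/g, '""')}"`).join(',');
}

const header = toCsvRow(Object.keys(data));
const row = toCsvRow(Object.values(data));

// Hypothetical output location: a fresh temp folder, so nothing in the project is touched.
const outDir = fs.mkdtempSync(path.join(os.tmpdir(), 'scraper-'));
fs.writeFileSync(path.join(outDir, 'movie.json'), JSON.stringify(data, null, 2));
fs.writeFileSync(path.join(outDir, 'movie.csv'), `${header}\n${row}\n`);

console.log(row); // prints "The Silence of the Lambs","8.1","1,078,589"
```

The quoting matters here because the rating count itself contains commas; without it, a spreadsheet would split that value into three columns.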
For now, let’s just console.log(data);
{
    rating: "8.1",
    ratingCount: "1,078,589",
    title: "The Silence of the Lambs"
}
Of course, the output depends on which movie you scraped.
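Note that everything .text() returns is a string, including the rating and the rating count. If you want to sort or do math on them, you could convert them first. The sketch below shows one way to do it; normalizeMovie is a hypothetical helper, not part of the scraper above.

```javascript
// Hypothetical helper: turns the scraped strings into numbers.
function normalizeMovie(raw) {
    return {
        title: raw.title.trim(),
        rating: parseFloat(raw.rating),
        // Strip everything that is not a digit, e.g. the thousands separators.
        ratingCount: parseInt(raw.ratingCount.replace(/\D/g, ''), 10),
    };
}

const cleaned = normalizeMovie({
    title: 'The Silence of the Lambs',
    rating: '8.1',
    ratingCount: '1,078,589',
});

console.log(cleaned.ratingCount); // prints 1078589
console.log(typeof cleaned.rating); // prints number
```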
Full Code
If you followed the steps, your code should look like this. The only addition is the async function wrapped around the scraping code, because await cannot be used at the top level of a regular ( CommonJS ) Node.js script:
const request = require('request-promise');
const cheerio = require('cheerio');

const movieUrl = 'https://www.imdb.com/title/tt0102926/?ref_=nv_sr_1';

// await is only valid inside an async function, so the scraper runs inside one
(async () => {
    // Send the GET request with browser-like headers
    const response = await request({
        uri: movieUrl,
        headers: {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            // 'br' is left out on purpose: request can only decompress gzip and deflate responses
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'en-US,en;q=0.9,fr;q=0.8,ro;q=0.7,ru;q=0.6,la;q=0.5,pt;q=0.4,de;q=0.3',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Host': 'www.imdb.com',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
        },
        gzip: true
    });

    // Load the HTML into cheerio and pull out the fields we want
    let $ = cheerio.load(response);
    let title = $('div[class="title_wrapper"] > h1').text().trim();
    let rating = $('div[class="ratingValue"] > strong > span').text();
    let ratingCount = $('div[class="imdbRating"] > a').text();

    let data = { title, rating, ratingCount };
    console.log(data);
})().catch(console.error); // log request or parsing errors instead of crashing silently
Want to learn more?
Also, if you want to learn more and go much more in depth, including the downloading of files, I have a course with 7 extra hours of content on web scraping with Node.js.