NodeJs Web Scraping with Puppeteer

So you’ve probably heard of Web Scraping and what you can do with it, and you’re probably here because you want some more info on it.

Web Scraping is basically the process of extracting data from a website. That’s it.

Today we’re going to look at how you can start scraping with Puppeteer for NodeJs.

What is Puppeteer?

Puppeteer is a library created for NodeJs which basically gives you the ability to control everything on the Chrome or Chromium browser, with NodeJs.

You can do the things a normal browser and a normal human user would do, for example:

  • Open up different pages ( multiple at the same time )
  • Move the mouse and make use of it just like a human would
  • Press the keyboard and type stuff into input boxes
  • Take screenshots programmatically for different situations
  • Generate PDFs from website pages
  • Automate specific actions for websites

and many more things.
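As a rough sketch of a few of the actions listed above, here is what they could look like in code. This assumes `page` is a Puppeteer page object created with `browser.newPage()` (the setup comes later in this tutorial). The function name `demoActions`, the input selector and the output file names are placeholders I made up for illustration.

```javascript
/*
 * Sketch only: `page` is assumed to be a Puppeteer Page from browser.newPage().
 * The selector and file names below are placeholders, not taken from a real site.
 */
async function demoActions(page, url) {
  /* Open the page */
  await page.goto(url);

  /* Move the mouse to a position on the page (x, y in pixels) */
  await page.mouse.move(100, 200);

  /* Type some text into an input box */
  await page.type('input[name="q"]', 'puppeteer');

  /* Take a screenshot and generate a PDF of the current page */
  await page.screenshot({ path: 'page.png' });
  await page.pdf({ path: 'page.pdf', format: 'A4' });
}
```

Note that generating a PDF has traditionally only worked when the browser runs in headless mode, so treat that last line as something to verify against the Puppeteer version you install.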

Puppeteer is created and maintained by the folks at Google, and even though the project is still pretty new, it has skyrocketed past its competitors ( NightmareJs, Casper, etc. ) with over 40,000 stars on GitHub.

Setup of the project

The first thing you need is to have NodeJs installed on your PC or Mac.

After that you can initiate your first project in a new, empty folder with npm.

You can simply do this with the Terminal by going to the newly created folder and then running the following command:

$ npm init

Now you can input all the project details, or just hit Enter to accept the defaults.

After the setup you should now have a package.json file with content that looks similar to this:

{
  "name": "LearnScraping.com",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "Grohs Fabian",
  "license": "ISC",
  "dependencies": {
 
  }
}

Installing dependencies

Now we can start installing the packages we need.

Here’s what we’re going to need:

  • puppeteer

So we are going to use npm install:

$ npm install puppeteer --save

While this is installing, I’m going to take the time to explain what Puppeteer is.

Puppeteer is an API that lets you manage the Chromium Browser with code written in NodeJs.

And the cool part about this is that Web Scraping with Puppeteer

Preparing the example

Now that we’re done with the boring stuff, let’s actually create an example just so that we can confirm that it’s working and see it in action.

Here is what we are going to build so that you get used to Puppeteer and understand how it works.

Let’s create a simple web scraper for IMDB with Puppeteer.

And here is what we need to do:

  • Initiate the Puppeteer browser and create a new page
  • Go to the specified movie page, selected by a Movie Id
  • Use evaluate to tap into the html of the current page opened with Puppeteer
  • Extract the specific strings / text that you want using query selectors

Seems pretty easy, right?

Building the IMDB Scraper

I’m just going to give you a quick snippet of code and then we’re going to talk about it just a bit.

I am using Chrome DevTools to check the html content and the classes so that I can build a query selector for the Title, Rating and RatingCount.

Learning the Selectors and how they work is very useful for this if you want to build custom selectors for different parts of the website that you want to scrape.

Here’s what I’ve built.

const puppeteer = require('puppeteer');
const IMDB_URL = (movie_id) => `https://www.imdb.com/title/${movie_id}/`;
const MOVIE_ID = `tt6763664`;

(async () => {
  /* Initiate the Puppeteer browser */
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  /* Go to the IMDB Movie page and wait for it to load */
  await page.goto(IMDB_URL(MOVIE_ID), { waitUntil: 'networkidle2' });

  /* Run javascript inside of the page */
  let data = await page.evaluate(() => {

    let title = document.querySelector('div[class="title_wrapper"] > h1').innerText;
    let rating = document.querySelector('span[itemprop="ratingValue"]').innerText;
    let ratingCount = document.querySelector('span[itemprop="ratingCount"]').innerText;

    /* Returning an object filled with the scraped data */
    return {
      title,
      rating,
      ratingCount
    }

  });

  /* Outputting what we scraped */
  console.log(data);

  await browser.close();
})();

You can test out exactly this code and after running it you should see something like this:

{
      rating:"9.0"
      ratingCount:"48,386"
      title:"The Haunting of Hill House"
}
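Notice that every value comes back as a string, because `innerText` always returns text. If you want to do math or sorting on the results, you can convert them first. Here is a minimal sketch of that, where `normalizeMovieData` is a helper name I made up (it is not part of Puppeteer):

```javascript
/* Convert the scraped strings into numbers (hypothetical helper, not part of Puppeteer) */
function normalizeMovieData(data) {
  return {
    title: data.title.trim(),
    /* "9.0" -> 9 */
    rating: parseFloat(data.rating),
    /* "48,386" -> 48386 (strip the thousands separators first) */
    ratingCount: parseInt(data.ratingCount.replace(/,/g, ''), 10)
  };
}

console.log(normalizeMovieData({
  title: 'The Haunting of Hill House',
  rating: '9.0',
  ratingCount: '48,386'
}));
```

This assumes the rating count uses commas as thousands separators, like in the output above. If the site shows numbers in a different format, the `replace` call would need to be adjusted.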

And of course, you can edit the code and improve it to go and scrape more details.
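One improvement worth considering: the example above throws an error if any selector matches nothing (for instance when the site changes its html). Below is a sketch of a more forgiving version that returns null for missing fields. It assumes `page` is the Puppeteer page from the example; `extractFields`, `SELECTORS` and `scrapeMovie` are names I picked for illustration.

```javascript
/*
 * Runs inside the page (it is passed to page.evaluate), so it must not
 * reference variables from the NodeJs scope. `selectors` is sent in as an argument.
 */
function extractFields(selectors) {
  const result = {};
  for (const [name, selector] of Object.entries(selectors)) {
    const element = document.querySelector(selector);
    /* Store null instead of throwing when the selector matches nothing */
    result[name] = element ? element.innerText.trim() : null;
  }
  return result;
}

/* The selectors from the example above; add more fields here as needed */
const SELECTORS = {
  title: 'div[class="title_wrapper"] > h1',
  rating: 'span[itemprop="ratingValue"]',
  ratingCount: 'span[itemprop="ratingCount"]'
};

/* `page` is assumed to be the Puppeteer page from the example above */
async function scrapeMovie(page) {
  return page.evaluate(extractFields, SELECTORS);
}
```

To scrape another detail you would only add one more entry to `SELECTORS`, with a selector you found through the DevTools.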

This was just for demonstration purposes so that you can see how powerful Web Scraping with Puppeteer is.

I wrote and tested this code in 15 minutes at most, which shows how easily and quickly you can do certain things with Puppeteer.

How to run it

There are multiple ways of running the code and I am going to show you 2 ways of doing that.

Via the terminal

You can run it from the terminal with a simple command like this:

$ node index.js

And of course, you need to make sure you are in the right project directory with your terminal before actually running the code.

And instead of index.js, you can specify whatever file you want to run / execute.

Via an editor ( VSCode )

You can also run it directly from an editor that has the option to do so. In my case, I am using both VSCode and PhpStorm.

You can run it very easily with VSCode by clicking the Debugger tab and then just running it, simple and nice.

And of course, you can change the actual movie that you want to scrape by easily editing this part of the code:

const MOVIE_ID = `tt6763664`;

Here you can put the movie id taken from any IMDB movie URL that looks like this:

https://www.imdb.com/title/tt6763664/?ref_=nv_sr_1

Here the actual movie id is tt6763664.
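If you would rather paste the whole URL than pick the id out by hand, a small helper can do it for you. This is just a sketch; `extractMovieId` is a made-up name, and it assumes the id is always `tt` followed by digits right after `/title/`, like in the URL above.

```javascript
/* Hypothetical helper: pull the movie id (tt + digits) out of an IMDB URL */
function extractMovieId(url) {
  const match = url.match(/\/title\/(tt\d+)/);
  /* Return null when the URL does not look like a movie page */
  return match ? match[1] : null;
}

const MOVIE_ID = extractMovieId('https://www.imdb.com/title/tt6763664/?ref_=nv_sr_1');
console.log(MOVIE_ID); // tt6763664
```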

How to visually debug with Puppeteer

Before I end this short tutorial, I want to give you the most useful snippets of code that you can use when building scrapers with Puppeteer.

Go ahead and replace the line where you initialize the browser, with this:

const browser = await puppeteer.launch({headless: false}); // default is true

What is this going to do?

This is basically going to tell the Chromium browser to NOT use the headless mode, meaning it will show up on your screen and do all the commands you tell it to so that you can see it visually.

Why is this powerful?

Because you can watch the browser, pause with a debugger at any point of the execution and check exactly what is happening with your code.

This is very powerful when building it for the first time and when checking for errors.

You should not use this mode in a production build, use it for development only.

More debugging tips

I feel like when you are starting out, debugging tips are the most valuable: you try certain things without knowing for sure whether they work, and you want the tools to check your work and make it happen.

Slowing down everything

When you are building scrapers with Puppeteer, you have the option to give a delay to the browser so that it slows down every action that you program it to do.

const browser = await puppeteer.launch({
   headless: false,
   slowMo: 250 // 250ms slow down
 });

This slows down every Puppeteer operation by 250ms.
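Since the visible browser and the slow down are only meant for development, one option is to switch them on with an environment variable instead of editing the code each time. A minimal sketch, where `SCRAPER_DEBUG` and `buildLaunchOptions` are names I invented for this example:

```javascript
/*
 * Build the options object for puppeteer.launch().
 * SCRAPER_DEBUG is a made-up environment variable name for this sketch.
 */
function buildLaunchOptions(env = process.env) {
  const debug = env.SCRAPER_DEBUG === '1';
  return debug
    ? { headless: false, slowMo: 250 } /* visible browser, 250ms slow down */
    : { headless: true };              /* no window, full speed */
}

/* Usage: const browser = await puppeteer.launch(buildLaunchOptions()); */
```

On macOS or Linux you could then run `SCRAPER_DEBUG=1 node index.js` while developing, and plain `node index.js` everywhere else.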

Making use of an integrated debugger;

This applies to any kind of work you do with NodeJs, so this tip will either blow your mind or you already know it.

Usage of a debugger; 

I personally use Visual Studio Code and PhpStorm with the NodeJs plugin.

If you don’t have a PhpStorm or WebStorm license, no worries, you can use VSCode.

How do you make use of the debugger?

You simply need to either set a breakpoint or write debugger; in your code:

debugger;

And when you run it, it will then stop at exactly the line where you put the breakpoint or the debugger.

And how is this powerful?

Once execution is paused in the debugger, you can access any variable available at that point, run code and inspect whatever you need.
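Here is a tiny example you can try this on. The `summarize` function is something I made up for the demonstration; the data is the sample output from the IMDB scraper.

```javascript
function summarize(data) {
  /* Execution pauses here only when a debugger is attached; otherwise this line does nothing */
  debugger;
  return `${data.title}: ${data.rating} (${data.ratingCount} votes)`;
}

console.log(summarize({
  title: 'The Haunting of Hill House',
  rating: '9.0',
  ratingCount: '48,386'
}));
// The Haunting of Hill House: 9.0 (48,386 votes)
```

If you don’t want to use an editor, `node inspect index.js` starts the command-line debugger that ships with NodeJs. Keep in mind that a debugger; statement written inside `page.evaluate` runs in the browser page, not in NodeJs, so your editor’s debugger will not stop on it.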

If you still don’t use the debugger, you are missing out.

Bonus snippet

Before ending the actual code related content for this web scraping tutorial, I will give you a cool snippet to play around with and to make use of when needed.

How to take screenshots of the page

Taking a screenshot of the current page opened with Puppeteer can be very useful for testing, debugging and more.

You can easily do that with the following command:

  await page.screenshot({ path: 'screenshot.png' });

You can place this anywhere in the code where you want to take a screenshot and save it.

You can also check out the other parameters for the screenshot function from the actual Puppeteer API documentation because there are a lot of other interesting parameters that you can give and make use of.
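As one example of those parameters, `fullPage` captures the whole scrollable page instead of just the visible part. The sketch below combines it with a timestamped file name so that repeated runs don’t overwrite each other. `screenshotOptions` is a helper name I made up, not something from the Puppeteer API.

```javascript
/* Hypothetical helper: build options for page.screenshot() with a timestamped file name */
function screenshotOptions(label, date = new Date()) {
  /* ISO timestamps contain ":" and ".", which are awkward in file names */
  const stamp = date.toISOString().replace(/[:.]/g, '-');
  return {
    path: `${label}-${stamp}.png`,
    fullPage: true /* capture the whole scrollable page, not just the viewport */
  };
}

/* Usage: await page.screenshot(screenshotOptions('imdb')); */
```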

What you shouldn’t do

And of course, it comes to this part where I need to tell you that Scraping is a gray area and not everyone accepts it.

Since you’re basically using someone else’s bandwidth and resources ( when you go to a page and scrape it ), you should be respectful and do it in a well-mannered way.

Don’t overdo it, know when to stop and what is exceeding the limit.

But how can I know that?

Think of what it actually means to go and scrape 10,000 users or images from someone else’s site and how that will impact the person running the site.

Think of what you would not like to have someone do to your website and don’t do that to others either.

If it seems shady, it probably is and you should probably not do it.
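One practical way to stay on the polite side is to scrape pages one at a time and wait between requests, instead of firing them all at once. A minimal sketch of that idea follows. The names `sleep`, `scrapePolitely` and `scrapeOne` are my own, and the 2 second default is just an assumption; pick a delay that makes sense for the site.

```javascript
/* Resolve after `ms` milliseconds */
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

/*
 * Scrape the ids one at a time, pausing between requests.
 * `scrapeOne` is a placeholder for your own async function, for example
 * one that opens a movie page with Puppeteer and returns the scraped data.
 */
async function scrapePolitely(ids, scrapeOne, delayMs = 2000) {
  const results = [];
  for (let i = 0; i < ids.length; i++) {
    results.push(await scrapeOne(ids[i]));
    /* No need to wait after the last item */
    if (i < ids.length - 1) {
      await sleep(delayMs);
    }
  }
  return results;
}
```

With this, scraping 100 pages takes at least a few minutes, which is far easier on the site than 100 requests in the same second.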

PS: Make sure to read the Terms of Service / Terms of Usage of the specific websites. Some have clear specific terms that don’t allow you to scrape and automate anything. ( Instagram for example )

Want to learn more?

Hopefully you will give this a try and test the code for yourself. Puppeteer is very powerful and you can do a lot with it, fast.

Also if you want to learn more and go much more in-depth with the downloading of files, I have a great course with more  hours of secret content on web scraping with nodejs.

You’re in for a treat! Get 95% Off my first course on Udemy ( Limited to the first 50 registrations, make sure you’re fast )