
Web Scraping Using Puppeteer and NodeJS

Puppeteer

Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but it can be configured to run full Chrome. It is maintained by the Chrome DevTools team and an awesome open-source community. When you install Puppeteer, it downloads a recent version of Chromium that is guaranteed to work with the API. One benefit of Puppeteer is that it gives access to the loading and rendering measurements provided by the Chrome performance analysis tools. Most things that you can do manually in the browser can be done with Puppeteer, such as generating screenshots and PDFs of pages, automating form submission, UI testing, simulating keyboard input, and more.
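
For example, a minimal script might launch Chromium, open a page, and capture both a screenshot and a PDF. This is just an illustrative sketch; the URL and output file names are placeholders.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch(); // headless Chromium by default
  const page = await browser.newPage();
  await page.goto('https://example.com'); // placeholder URL
  await page.screenshot({ path: 'example.png' }); // capture the rendered page
  await page.pdf({ path: 'example.pdf', format: 'A4' }); // PDF generation requires headless mode
  await browser.close();
})();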

A Short Demo for Web Scraping, e2e with Puppeteer

Chrome vs Chromium

Chrome:

  • Chrome is a proprietary web browser developed and maintained by Google.
  • Chrome has automatic updates, browsing-data reporting, and native support for Flash.
  • Chrome has sandbox support.

Chromium:

  • Chromium is an open-source web browser developed and maintained by the Chromium Projects.
  • Chromium has no automatic updates, no browsing-data reporting, and no Flash support.
  • Chromium also has sandbox support, but some Linux distributions may disable it.

Prerequisites

To use Puppeteer, you must have Node.js installed on your machine.

Installation

npm i puppeteer

Web Scraping

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Scraping a web page involves fetching it and then extracting data from it. Fetching is the downloading of a page, so web crawling is a main component of web scraping: pages are fetched for later processing. Once fetched, extraction can take place. The content of a page may be parsed, searched, reformatted, copied into a spreadsheet, and so on. Web scrapers typically take something out of a page to make use of it for another purpose somewhere else.
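
This fetch-then-extract flow might look like the following sketch: Puppeteer fetches (navigates to) a placeholder page, then extracts data from the parsed DOM with page.evaluate.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Fetching: download and render the page.
  await page.goto('https://example.com'); // placeholder URL

  // Extraction: run code inside the page context and pull out parsed content.
  const data = await page.evaluate(() => {
    const h1 = document.querySelector('h1');
    return {
      title: document.title,
      heading: h1 ? h1.textContent : null,
    };
  });

  console.log(data); // a real scraper might write this to a spreadsheet or CSV
  await browser.close();
})();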

Headless Browser

A headless browser is a web browser without a graphical user interface. Programs that drive one to read and interact with pages are often called scrapers or crawlers. A headless browser provides automated control of a web page in an environment similar to popular web browsers, but it is driven via a command-line interface or network communication. Headless browsers are particularly useful for testing web pages, as they render and understand HTML the same way a browser would, including styling elements such as page layout, colour, and font selection, and they execute JavaScript and AJAX, which are usually not available with other testing methods.

Headless Chrome

Headless Chrome is a way to run the Chrome browser in a headless environment, without the full browser UI. One of the benefits of using Headless Chrome is that your JavaScript tests are executed in the same environment as the users of your site. Headless Chrome gives you a real browser context without the memory overhead of running a full version of Chrome. Because headless is the default, we do not need to specify the headless option when launching the browser.

Headful Chrome

Headful Chrome means running the browser with its graphical user interface displayed, which is very useful for debugging. For this, we need to set the headless option to false when launching the browser.
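
As a sketch, the only difference from a headless launch is the headless option (and, optionally, slowMo, which delays each operation so you can watch it happen):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: false, // headless is the default, so this is only needed for headful runs
    slowMo: 100,     // optional: delay (in ms) between Puppeteer operations, handy for debugging
  });
  const page = await browser.newPage();
  await page.goto('https://example.com'); // placeholder URL
  // ...inspect the visible browser while debugging, then:
  await browser.close();
})();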

Automated Scripts

An automation script consists of a launch point, variables with corresponding binding values, and the source code. You use wizards to create the components of an automation script: you create scripts and launch points, or you create a launch point and associate it with an existing script. Automation provides many benefits, including faster execution of repetitive tasks, the ability to parallelize workloads, and improved test coverage for your website. The example below sketches a simple login process: the script enters an email and password, and the user is then redirected to their respective page.
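
In this sketch, the URL and the selectors (#email, #password, #login-button) are hypothetical; replace them with the ones from your own login page.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com/login'); // placeholder login page

  await page.type('#email', 'user@example.com');   // enter the email
  await page.type('#password', 'secret-password'); // enter the password

  // Click the login button and wait for the post-login redirect in parallel,
  // so the navigation is not missed.
  await Promise.all([
    page.waitForNavigation(),
    page.click('#login-button'),
  ]);

  console.log('Redirected to:', page.url());
  await browser.close();
})();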

Web Scraping in Puppeteer

The first step of web scraping is to acquire the selectors. A selector is just a path to the data. To acquire a selector, inspect the element on the page so that the developer tools window opens. In the Elements tab of Developer Tools, right-click the highlighted element and select Copy > Copy selector. In the example below, forEach is used instead of map to loop over the scraped data.
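
A sketch of this, using a hypothetical copied selector (.article .title) and a placeholder URL, might look like:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com/articles'); // placeholder URL

  const titles = await page.evaluate(() => {
    const results = [];
    // forEach (rather than map) walks the NodeList returned by
    // querySelectorAll and collects the text of each matching element.
    document.querySelectorAll('.article .title').forEach((el) => {
      results.push(el.textContent.trim());
    });
    return results;
  });

  console.log(titles);
  await browser.close();
})();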