Web scraping is a handy technique for gathering information from websites, and most developers will use it at some point in their careers. Web scrapers usually load, parse, and extract useful data from a website's HTML code. However, there are times when this isn't enough, and you might need to take screenshots of a website instead.
Elixir is a functional, concurrent, general-purpose programming language mostly used to build scalable and maintainable web applications. In this article, you'll learn several different ways of taking screenshots programmatically with the Elixir programming language, as well as the limitations of each method. You'll also get a better understanding of what is happening behind the scenes when those screenshots are generated.
For this article, you'll be using Elixir, along with Node.js for one of the example dependencies. Before getting started, make sure you have both installed.
Please note that it's highly recommended that you use a tool like asdf to install these dependencies.
According to Stack Overflow, Elixir has quickly become one of the most-loved languages, and not without reason—it combines powerful language features with Ruby-like syntax. In the context of web scraping, Elixir is a great language to use because of its concurrency and functional features.
However, one important thing to keep in mind is that in order to take screenshots of a website, you need to be able to render the website just like any browser would. The following examples will showcase different libraries and tools that can be used, and the limitations of each.
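To make the concurrency point concrete, here is a minimal sketch of how a list of URLs could be captured in parallel with `Task.async_stream/3`. The module name `ConcurrentCapture` and the `capture_fun` argument are assumptions for illustration only; `capture_fun` stands in for any function that takes a URL and returns a result, such as the screenshot functions built later in this article.

```elixir
defmodule ConcurrentCapture do
  @moduledoc """
  Sketch: run a capture function over many URLs concurrently.
  `capture_fun` is a placeholder for any one-argument function.
  """

  def capture_all(urls, capture_fun, max_concurrency \\ 4) do
    urls
    # Results come back in the same order as the input list.
    |> Task.async_stream(capture_fun,
      max_concurrency: max_concurrency,
      timeout: 60_000,
      on_timeout: :kill_task
    )
    |> Enum.zip(urls)
    |> Enum.map(fn
      {{:ok, result}, url} -> {url, {:ok, result}}
      # A task that exceeds the timeout is killed and reported here.
      {{:exit, reason}, url} -> {url, {:error, reason}}
    end)
  end
end

# Usage with a stand-in capture function that only builds a file name.
# Note: an exception raised inside capture_fun still crashes the caller,
# because the tasks are linked to it.
fake_capture = fn url -> URI.parse(url).host <> ".png" end
ConcurrentCapture.capture_all(["https://example.com", "https://elixir-lang.org"], fake_capture)
```

Limiting `max_concurrency` matters here because each real capture starts or drives a browser, which is far heavier than a plain HTTP request.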
The first example uses PuppeteerImg, an Elixir wrapper around a Node.js package called puppeteer-img, which is a simple library for generating screenshots of websites.
The code for this example can be found in this GitHub repo.
Start by creating a new project in your Elixir workspace:
mix new puppeteer_example --sup
On success, you'll see the following output:
* creating README.md
* creating .formatter.exs
* creating .gitignore
* creating mix.exs
* creating lib
* creating lib/puppeteer_example.ex
* creating lib/puppeteer_example/application.ex
* creating test
* creating test/test_helper.exs
* creating test/puppeteer_example_test.exs
Your Mix project was created successfully.
You can use "mix" to compile it, test it, and more:
cd puppeteer_example
mix test
Run "mix help" for more commands.
Next, you'll set up the dependencies and configuration for your project. Go into the puppeteer_example directory, and add the following to the mix.exs file:
defp deps do
  [
    {:puppeteer_img, "~> 0.1.3"}
  ]
end
Proceed to install the dependencies by running the following command:
mix deps.get
You'll also need to install puppeteer-img globally, using the following command:
npm i puppeteer-img -g
The next step is to add the main function to your project. Go into the puppeteer_example directory, and add the following to the lib/puppeteer_example.ex file:
defmodule PuppeteerExample do
  def take_screenshot(url, filename) do
    options = [
      type: "jpeg",
      path: "./" <> filename
    ]

    case PuppeteerImg.generate_image(url, options) do
      # "path" is the final path where the generated image is stored.
      {:ok, path} -> IO.puts(path)
      # "error" is an error message describing what went wrong.
      {:error, error} -> IO.puts(error)
    end
  end
end
Open up the interactive REPL by running iex -S mix, then run the following command to take a screenshot of the website:
PuppeteerExample.take_screenshot("http://techcrunch.com", "techcrunch.jpeg")
If everything worked correctly, you should see the following output:
./techcrunch.jpeg
:ok
A new screenshot should also be generated in the puppeteer_example directory.
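Because the function above needs a file name for every URL, a small helper can derive one and reject obviously invalid input before a browser is launched. This is only a sketch: `ScreenshotName` and `from_url/2` are hypothetical names and are not part of PuppeteerImg.

```elixir
defmodule ScreenshotName do
  # Hypothetical helper (not part of PuppeteerImg): derive a file name
  # from the host of an http(s) URL.
  def from_url(url, extension \\ "jpeg") do
    case URI.parse(url) do
      %URI{scheme: scheme, host: host}
      when scheme in ["http", "https"] and is_binary(host) and host != "" ->
        # Replace anything that is not a letter or digit with an underscore.
        base = String.replace(host, ~r/[^a-zA-Z0-9]+/, "_")
        {:ok, base <> "." <> extension}

      _ ->
        {:error, :invalid_url}
    end
  end
end

ScreenshotName.from_url("http://techcrunch.com")
# → {:ok, "techcrunch_com.jpeg"}
```

The returned name could then be passed as the second argument to `PuppeteerExample.take_screenshot/2`.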
With very little code, you were able to create an Elixir application that can take screenshots of a website. However, there are a few things to consider with this approach:
For more complex scenarios, you can use Hound to take screenshots of a website. Hound is an Elixir library meant for browser automation and writing integration tests. Behind the scenes, Hound supports multiple headless browsers. Notable features include:
The code for this example can be found in this GitHub repo.
Start by creating a new project in your Elixir workspace:
mix new hound_example --sup
On success, you will see the following output:
* creating README.md
* creating .formatter.exs
* creating .gitignore
* creating mix.exs
* creating lib
* creating lib/hound_example.ex
* creating lib/hound_example/application.ex
* creating test
* creating test/test_helper.exs
* creating test/hound_example_test.exs
Your Mix project was created successfully.
You can use "mix" to compile it, test it, and more:
cd hound_example
mix test
Run "mix help" for more commands.
Start by adding the dependencies to the mix.exs file:
defp deps do
  [
    {:hound, "~> 1.0"}
  ]
end
Proceed to install the dependencies by running the following command:
mix deps.get
Unlike PuppeteerImg, which took care of setting up and launching a headless browser behind the scenes, Hound requires that you do this manually. Hound supports several drivers, including PhantomJS, but this example uses the Selenium WebDriver, which avoids adding another Node.js package.
Start by downloading the Selenium standalone server.
Start the server with:
java -jar selenium-server-standalone-3.9.1.jar
Alternatively, if you are using macOS with Homebrew, you can install and start the Selenium standalone server with:
brew install selenium-server-standalone
selenium-server standalone
In either case, you can confirm that the server is running correctly by visiting localhost:4444 in your browser and checking that the Selenium server responds.
Finally, you'll need to configure Hound to use the Selenium server. Go into the hound_example directory, and create a new configuration file called config/config.exs:
mkdir config
touch config/config.exs
Add the following to the config/config.exs file:
import Config
config :hound, driver: "selenium", port: 4444
With your initial configuration complete, you can now work on the main logic. Go into the hound_example directory, and add the following to the lib/hound_example.ex file:
defmodule HoundExample do
  require Logger
  use Hound.Helpers

  def take_screenshot(url, filename) do
    Logger.info("Taking screenshot of #{url} and saving to #{filename}")

    # Start a browser session, load the page, and capture it.
    Hound.start_session()
    navigate_to(url)
    # This calls the one-argument take_screenshot/1 brought in by Hound.Helpers.
    take_screenshot("./#{filename}")
    Hound.end_session()

    Logger.info("Screenshot saved to #{filename}")
    {:ok, filename}
  end
end
Open up the interactive REPL by running iex -S mix, then run the following command to take a screenshot of the website:
HoundExample.take_screenshot("https://techcrunch.com/", "techcrunch_hound.jpeg")
If things worked correctly, you should see the following output:
Interactive Elixir (1.13.3) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> HoundExample.take_screenshot("https://techcrunch.com/", "techcrunch_hound.jpeg")
12:40:00.790 [info] Taking screenshot of https://techcrunch.com/ and saving to techcrunch_hound.jpeg
12:40:05.423 [info] Screenshot saved to techcrunch_hound.jpeg
{:ok, "techcrunch_hound.jpeg"}
You should also have a screenshot of the page generated in the hound_example directory.
You might have noticed that unlike the previous example, where the screenshot happened without opening a new browser window, in this instance, Hound will open a new browser window for you. The Selenium WebDriver defaults to this when taking a screenshot, although headless mode can be achieved by further tweaking the configuration.
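As a rough idea of what such a tweak might look like, the fragment below switches Hound to ChromeDriver with a headless Chrome browser. Treat it as an assumption to verify against the Hound documentation for your version: it presumes that the `"chrome_driver"` driver and the `"chrome_headless"` browser option are available, and that ChromeDriver is installed and already running.

```elixir
# config/config.exs
import Config

# Assumption: ChromeDriver is running locally (it listens on port 9515 by
# default) and this Hound version supports the "chrome_headless" browser.
config :hound, driver: "chrome_driver", browser: "chrome_headless"
```

With a setup like this, no visible browser window should open while the screenshot is taken.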
Further configuration options can be found in the Hound documentation.
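One detail worth checking behind the scenes is the image format. The WebDriver screenshot command returns PNG data, so the file written in this example is likely a PNG even though its name ends in .jpeg. The sketch below inspects the first bytes of a file to find out; `ImageFormat` is a hypothetical helper and is not part of Hound.

```elixir
defmodule ImageFormat do
  # Hypothetical helper (not part of Hound): identify an image by its
  # leading "magic" bytes.

  # PNG files start with a fixed 8-byte signature.
  def detect(<<0x89, "PNG", 0x0D, 0x0A, 0x1A, 0x0A, _rest::binary>>), do: :png
  # JPEG files start with the bytes FF D8 FF.
  def detect(<<0xFF, 0xD8, 0xFF, _rest::binary>>), do: :jpeg
  def detect(_other), do: :unknown

  # Returns {:ok, format} or the {:error, reason} tuple from File.read/1.
  def detect_file(path) do
    with {:ok, data} <- File.read(path), do: {:ok, detect(data)}
  end
end
```

Running `ImageFormat.detect_file("./techcrunch_hound.jpeg")` after taking the screenshot would show which format was actually saved, so you can pick a matching file extension.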
One of the main advantages of using Hound as opposed to PuppeteerImg is that everything happens as part of a session, and you can interact with the page programmatically. This means it can support more complex scenarios, such as:
However, there are still some drawbacks to this approach:
The take_screenshot function is not configurable, and lacks useful options like quality and full-page support.
For the final example, you'll leverage Urlbox, a website screenshot service with a simple API. Specifically, this tutorial will use ExUrlbox, a light wrapper around the Urlbox API.
The code used in this tutorial can be found in this GitHub repo.
Start by creating a new project in your Elixir workspace:
mix new urlbox_example --sup
On success, you will see the following output:
* creating README.md
* creating .formatter.exs
* creating .gitignore
* creating mix.exs
* creating lib
* creating lib/urlbox_example.ex
* creating lib/urlbox_example/application.ex
* creating test
* creating test/test_helper.exs
* creating test/urlbox_example_test.exs
Your Mix project was created successfully.
You can use "mix" to compile it, test it, and more:
cd urlbox_example
mix test
Run "mix help" for more commands.
Start by adding the dependencies to the mix.exs file:
defp deps do
  [
    {:ex_urlbox, "~> 0.2.0"}
  ]
end
Proceed to install the dependencies by running the following command:
mix deps.get
To use Urlbox, you'll need to create an account and get a pair of API credentials. The registration process is straightforward, only asking for an email and password. Once registered, you can retrieve your API credentials directly from the dashboard.
Grab the credentials from the dashboard and add them to the .env file:
URLBOX_API_KEY="YoUrApIKeY"
URLBOX_API_SECRET="YoUrApISeCreT"
Next, you'll have to configure the project to pull the credentials from the .env file. Go into the urlbox_example directory, and create a new configuration file called config/config.exs:
mkdir config
touch config/config.exs
Add the following to the config/config.exs file:
import Config

config :ex_urlbox,
  api_key: {:system, "URLBOX_API_KEY"},
  api_secret: {:system, "URLBOX_API_SECRET"}
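This will automatically pull the credentials from the environment variables. Keep in mind that Elixir doesn't read .env files on its own, so these variables need to be exported in your shell session before you start iex.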
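Since a missing environment variable can be hard to spot, a quick check at startup can help. The following is a minimal sketch using `System.fetch_env/1`; the `Credentials` module is a hypothetical helper and not part of ex_urlbox.

```elixir
defmodule Credentials do
  # Hypothetical helper (not part of ex_urlbox): confirm that both
  # Urlbox variables are set and non-empty.
  @required ["URLBOX_API_KEY", "URLBOX_API_SECRET"]

  def check do
    case Enum.reject(@required, &present?/1) do
      [] -> :ok
      missing -> {:error, {:missing_env, missing}}
    end
  end

  defp present?(name) do
    case System.fetch_env(name) do
      {:ok, value} -> String.trim(value) != ""
      :error -> false
    end
  end
end
```

Calling `Credentials.check()` in iex before requesting a screenshot returns `:ok` when both variables are visible to the Elixir process, or a tuple naming the ones that are missing.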
Next, you'll add the main logic to the lib/urlbox_example.ex file:
defmodule UrlboxExample do
  @moduledoc """
  Documentation for `UrlboxExample`.
  """

  def take_screenshot(url, options \\ [format: "png"]) do
    {:ok, screenshot} = ExUrlbox.get(url, options)
    screenshot.url
  end
end
Then you can open your REPL and run the following command to take a screenshot of the website:
UrlboxExample.take_screenshot("https://techcrunch.com/")
Unlike previous examples, this time around, we get back a Urlbox URL that points to our screenshot, like https://api.urlbox.io/v1/S6vqoSXoPaKZCVjd/0d1a4c912dc683784022d993a5fc45c1c73a2062/png?url=https%3A%2F%2Ftechcrunch.com%2F
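The URL above appears to follow the layout key/token/format followed by a query string. As a sketch of how such a link might be assembled, the code below assumes the token is a lowercase hex HMAC-SHA1 of the query string, signed with the API secret. That assumption is consistent with the 40-character token shown above, but it should be checked against the Urlbox documentation. `UrlboxLink` is a hypothetical module, and ExUrlbox handles this step for you.

```elixir
defmodule UrlboxLink do
  # Sketch only. Assumptions: the token is a lowercase hex HMAC-SHA1 of the
  # query string, and the URL layout matches the example in this article.
  @base "https://api.urlbox.io/v1"

  def build(api_key, api_secret, format, params) do
    # Percent-encode the options, e.g. url=https%3A%2F%2Ftechcrunch.com%2F
    query = URI.encode_query(params)

    token =
      :crypto.mac(:hmac, :sha, api_secret, query)
      |> Base.encode16(case: :lower)

    "#{@base}/#{api_key}/#{token}/#{format}?#{query}"
  end
end

# "KEY" and "SECRET" are placeholders, not real credentials.
UrlboxLink.build("KEY", "SECRET", "png", url: "https://techcrunch.com/")
```

Signing the query string this way means the secret never appears in the link itself, so the link can be shared or embedded directly.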
Urlbox is doing all the heavy lifting for us, even storing the resulting screenshot. But this is not all we can do with Urlbox, as it provides some advanced features that go beyond just taking a screenshot.
Let's try some of them out by running the following command:
UrlboxExample.take_screenshot("https://www.geeksforgeeks.org/", [format: "pdf", full_page: true, timeout: 100000])
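For this request, we added a few additional options: format: "pdf" to request a PDF instead of an image, full_page: true to capture the whole page rather than only the visible area, and a higher timeout value to give the page more time to load.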
This results in a full-page PDF generated with the contents of the page.
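Note that passing your own options to `take_screenshot/2` replaces the default `[format: "png"]` list entirely. If you would rather layer overrides on top of a set of defaults, `Keyword.merge/2` is one way to do it. The sketch below is illustrative only: `ScreenshotOptions` is a hypothetical module, and the `full_page: false` default is an assumption rather than a documented Urlbox default.

```elixir
defmodule ScreenshotOptions do
  # Hypothetical defaults for illustration; adjust to your own needs.
  @defaults [format: "png", full_page: false]

  # Values in `overrides` win over the defaults for matching keys.
  def build(overrides \\ []) do
    Keyword.merge(@defaults, overrides)
  end
end

options = ScreenshotOptions.build(format: "pdf", timeout: 100_000)
Keyword.get(options, :format)
# → "pdf"
```

The merged list could then be passed as the second argument to `UrlboxExample.take_screenshot/2`.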
This kind of flexibility and power opens many different use cases, from the ones covered at the beginning of the article to potential uses in ad-tech to generate advertising assets by converting websites and spreadsheets to sharable PDFs.
In this article, you've learned about three distinct ways of taking screenshots with Elixir, from PuppeteerImg, the most limited approach, to a much more flexible approach using Urlbox.
You've also seen the drawbacks and considerations for each approach, and how relying on tools like Puppeteer and Selenium requires special care when deploying your application to production.