Golang Web Scraper




I have been flirting with Go for a few weeks now, and I built a simple forum-like website using Gin, a popular web framework for Go. After building the application, I was satisfied with how much I was able to learn about the language, so I decided to do another little project with it. While I was browsing the web mindlessly (like most of us do), I stumbled upon a comment about web scraping in Python, and an idea popped into my mind: why not scrape the frontpage of a popular online forum and use the data to populate my own database? Now, scraping a website is not illegal, but it’s good to know what you can and cannot scrape from a website. Many websites have a robots.txt file which gives such information. While there are tons of web scraping tutorials on the web, mostly in Python, I felt there weren’t enough of them in Go, so I decided to write one. I did my research and found an elegant Go framework for scraping websites called Colly, and with this tool I was able to scrape the frontpage of a popular Nigerian forum called Nairaland.

Colly is a scraping framework for extracting the data you need from websites, and it is used for a wide range of applications like data mining, data processing, and archiving.

The Go language has a ton of hype around it. It is relatively new, its syntax is relatively easy to pick up compared to other statically typed languages, it is very fast, and it natively supports concurrency, which makes it a language of choice for many people building cloud services and network applications. We can leverage this speed to scrape websites in a fast and easy way.

Web Scraping

Web scraping is a form of data extraction that basically extracts data from websites. A web scraper is usually a bot that uses the HTTP protocol to access websites, extracts HTML elements from them, and uses the data for various purposes. I’ll be sharing with you how you can scrape a website with minimal effort in Go. Let’s go 🚀

First, you will need to have Go installed on your system and know the basics of the language before you can proceed. We’ll start by creating a folder to house our project: open your terminal and create one.
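For example (the folder name here is just a placeholder, pick whatever you like):

```sh
mkdir nairaland-scraper
cd nairaland-scraper
```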

Then initialize a Go module using the go toolchain.
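For example, with a placeholder module path:

```sh
go mod init github.com/username/projectname
```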

Replace the username and project name with appropriate values. By now we should have a go.mod file in our folder (a go.sum file will appear once we add a dependency); these track our dependencies. Next we go get colly with the following command.
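Assuming Colly v2 (the module path differs for v1):

```sh
go get github.com/gocolly/colly/v2
```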

Then we can get our hands dirty. Create a new main.go file and fire up your favorite text editor or IDE.
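The original listing isn’t reproduced here, so here is a minimal sketch of how main.go could start. The struct fields simply mirror the pieces of a post we care about (title, URL, body, and author), and the imports cover the snippets that follow:

```go
package main

import (
	"encoding/json"
	"log"
	"os"

	"github.com/gocolly/colly/v2"
)

// Post holds the data we extract for a single frontpage post.
// The field names are illustrative; name them however you like.
type Post struct {
	Title  string `json:"title"`
	URL    string `json:"url"`
	Body   string `json:"body"`
	Author string `json:"author"`
}

// posts collects every scraped post before we dump them to JSON.
var posts []Post
```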

The above is the data structure we will store a single post in; it contains the necessary information about each post. This was all I needed to populate my database. I was not interested in getting the comments since we all know how toxic the comments section of forums can be :).

We need to call the NewCollector function to create our web scraper; then, using CSS selectors, we can identify specific elements to extract data from. The main idea is that we target specific nodes, extract data, build our data structure, and dump it in a JSON file. After inspecting the Nairaland HTML structure (which I think is quite messy), I was able to target the specific nodes I wanted.
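A sketch of that setup inside func main; the CSS selector is a placeholder, the real one comes from inspecting Nairaland’s markup:

```go
	// inside func main()

	// Create the collector and restrict it to the target domain.
	c := colly.NewCollector(
		colly.AllowedDomains("www.nairaland.com", "nairaland.com"),
	)

	// Visit every frontpage news link the scraper comes across.
	// "td.featured a[href]" is an illustrative selector, not Nairaland's real markup.
	c.OnHTML("td.featured a[href]", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})
```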

The OnHTML method registers a callback function to be called every time the scraper comes across an HTML node matching the selector we passed in. The above code visits every frontpage news link.
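Continuing inside func main, a sketch of the extraction and the request/response hooks (again, the selectors are placeholders):

```go
	// On each topic page, pull out the fields we need and build a Post.
	// "div.body", "h2", "div.narrow" and "a.user" are placeholder selectors.
	c.OnHTML("div.body", func(e *colly.HTMLElement) {
		post := Post{
			Title:  e.ChildText("h2"),
			URL:    e.Request.URL.String(),
			Body:   e.ChildText("div.narrow"),
			Author: e.ChildText("a.user"),
		}
		posts = append(posts, post)
	})

	// Log every request we make and every response we receive.
	c.OnRequest(func(r *colly.Request) {
		log.Println("visiting", r.URL.String())
	})
	c.OnResponse(func(r *colly.Response) {
		log.Println("got a response from", r.Request.URL.String())
	})

	// Kick everything off from the frontpage.
	if err := c.Visit("https://www.nairaland.com/"); err != nil {
		log.Fatal(err)
	}
```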

What is happening here is that when we visit each frontpage news link, we extract the title, URL, body, and author name using CSS selectors to identify where they are located; we then build up our Post struct with this data and append it to our slice. The OnRequest and OnResponse functions each register a callback to be called when our scraper makes a request and receives a response, respectively. With this data at our disposal, we can then serialize it into JSON to be dumped on disk. There are other storage backends you can use if you want to do something advanced; check out the docs. We then make a call to c.Visit to visit our target website.
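A sketch of the serialization step, still inside func main:

```go
	// Serialize the collected posts and dump them to a file on disk.
	data, err := json.MarshalIndent(posts, "", "  ")
	if err != nil {
		log.Fatal(err)
	}
	if err := os.WriteFile("posts.json", data, 0o644); err != nil {
		log.Fatal(err)
	}
```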

We use the standard library’s json package to serialize the data and write it to a file on disk, and voila, we have written our first scraping tool in Go. Easy, right? Armed with this tool, you can conquer all of the web, but remember to check the robots.txt file, which tells you what data you can scrape and how you should handle it. You can read more about the robots file here, and remember to visit the docs to learn more; there are a ton of great examples you can follow along with there. Cheers ✌️

Thank you for reading

We can use the Spiral Queue and the RoadRunner server to implement applications different from the classic web setup. In this tutorial, we will try to implement a simple web-scraper application for CLI usage.

The scraped data will be stored in a runtime folder.

The code produced here only demonstrates the capabilities and can be improved a lot.

Installing Dependencies

We will base our application on spiral/app-cli, a minimalistic Spiral build without ORM, HTTP, and other extensions.

To implement all needed features we will need a set of extensions:

Extension                    Comment
spiral/jobs                  Queue support
spiral/scaffolder            Faster scaffolding (dev only)
spiral/prototype             Faster prototyping (dev only)
paquettg/php-html-parser     Parsing HTML

To install all the needed packages and download the application server:
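Assuming Composer is used, the packages from the table can be pulled in like this; the app-server download is shown with vendor/bin/spiral get, which is an assumption and may differ between Spiral versions:

```sh
composer require spiral/jobs paquettg/php-html-parser
composer require --dev spiral/scaffolder spiral/prototype

# download the RoadRunner application server binary
# (assumption: the exact command depends on your Spiral version)
./vendor/bin/spiral get
```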

Activate the installed extensions in your App\App kernel:

Make sure to run php app.php configure to ensure proper installation.

Configure App Server

Let's configure the application server with one default in-memory queue. Create a .rr.yaml file in the root of the project:

Create Job Handler

Now, let's write a simple job handler which will scan the website, get the HTML content, and follow links until the specified depth is reached. All the content will be stored in the runtime directory.

Create a job handler via php app.php create:job scrape. For simplicity, we are not going to use cURL.

Create command

Create a command to start scraping via php app.php create:command scrape:

Test it

Launch the application server first:

Scrape any URL via the console command (keep the server running):

To observe how many pages have been scraped, use the interactive console:

The demo solution will scan some pages multiple times; use a proper database or locking mechanism to avoid that.