YouTube video monitoring & analytics by crawling the video detail page

I have started a weird side-project. Ok, let's be honest: if it is not strange, it's not a side-project.

My confession: I enjoy watching US late-night shows. It might be because when I was traveling in the US, it was a nighttime ritual to watch one (mostly Conan). In Europe, the easiest way to watch it is YouTube (even when you get each episode as seven small videos).

I have some favorite hosts, but I am subscribed to all of them. What I always wondered was how they perform on YouTube against each other. Who has the best ratios, which topics work better (Fallon's music videos are likely at the top), …?

To analyze that, I need some data. And I thought that would be easy; I could set up the data pipeline over a weekend. Spoiler: it was not, and it took me roughly three weeks.

What data do I need?

I would like to have the basic video metrics (views, likes, dislikes, comments) on a daily basis, plus channel metrics like subscribers and some context data like title and tags.

Pretty basic.
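
To make it concrete, this is roughly the record I want to snapshot per video and day (the field names are my own choice, not a fixed schema, and the values are just examples):

# One snapshot per video and day; field names are illustrative, values are examples.
daily_snapshot = {
    "snapshot_date": "2020-05-01",
    "video_id": "yeFk2709fgY",
    "channel_id": "UCMtFAi84ehTSYSE9XoHefig",
    "title": "",
    "tags": "",
    "views": 0,
    "likes": 0,
    "dislikes": 0,
    "comments": 0,
    "subscribers": 0,
}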

YouTube API

A glance at the API documentation told me that I could get this data without any user authorization since it is public data. You can also get all videos for a channel from the API.

(https://developers.google.com/youtube/v3/docs/search/list)

Great, let's go.

You need a Google Cloud account and a project first.

Once that is created, head over to the API section and activate the YouTube Data API.


The first step is to get all videos from a channel. You can do this with a simple call:

import requests

base_url = "https://www.googleapis.com/youtube/v3/"
channel_id = "UCMtFAi84ehTSYSE9XoHefig"
YT_API_KEY = "YOUR_API_KEY"  # the API key from your Google Cloud project

videos = []

# search.list returns up to 50 results per page for the given channel
r = requests.get(
    "{}search?part=snippet&channelId={}&maxResults=50&key={}".format(
        base_url, channel_id, YT_API_KEY
    )
)

result = r.json()

for video in result["items"]:
    video_id = video.get("id", {}).get("videoId", "")
    if video_id:
        videos.append(
            {
                "video_id": video_id,
                "channel_id": channel_id,
                "video_published_at": video.get("snippet", {}).get("publishedAt", "").split("T")[0],
                "video_title": video.get("snippet", {}).get("title", ""),
                "video_description": video.get("snippet", {}).get("description", ""),
                "video_image": video.get("snippet", {}).get("thumbnails", {}).get("high", {}).get("url", ""),
            }
        )
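
One note: search.list only returns up to 50 results per call. To backfill a whole channel, you have to follow the nextPageToken the API returns; a rough sketch, continuing with the variables from above:

# Follow nextPageToken until the channel's video list is exhausted.
page_token = result.get("nextPageToken")
while page_token:
    r = requests.get(
        "{}search?part=snippet&channelId={}&maxResults=50&pageToken={}&key={}".format(
            base_url, channel_id, page_token, YT_API_KEY
        )
    )
    page = r.json()
    for video in page.get("items", []):
        video_id = video.get("id", {}).get("videoId", "")
        if video_id:
            videos.append({"video_id": video_id, "channel_id": channel_id})
    page_token = page.get("nextPageToken")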

Based on each video ID, you can retrieve the essential information and some stats:

import requests

base_url = "https://www.googleapis.com/youtube/v3/"
YT_API_KEY = "YOUR_API_KEY"  # the API key from your Google Cloud project

video_ids = ["yeFk2709fgY"]

# videos.list accepts a comma-separated list of video IDs
r = requests.get(
    "{}videos?part=snippet,statistics&id={}&key={}".format(
        base_url, ",".join(video_ids), YT_API_KEY
    )
)
result = r.json()

video_data = []

for video in result.get("items", []):
    video_data.append(
        {
            "id": video.get("id", ""),
            "published_at": video.get("snippet", {}).get("publishedAt", ""),
            "title": video.get("snippet", {}).get("title", ""),
            "description": video.get("snippet", {}).get("description", ""),
            "image": video.get("snippet", {}).get("thumbnails", {}).get("standard", {}).get("url", ""),
            "image_max": video.get("snippet", {}).get("thumbnails", {}).get("maxres", {}).get("url", ""),
            "tags": ",".join(video.get("snippet", {}).get("tags", [])),
            "views": video.get("statistics", {}).get("viewCount", ""),
            "likes": video.get("statistics", {}).get("likeCount", ""),
            "dislikes": video.get("statistics", {}).get("dislikeCount", ""),
            "comments": video.get("statistics", {}).get("commentCount", ""),
        }
    )

This script came together quickly, and a first test run gave me all the data I wanted.

I then wrapped the Serverless Framework around it so that I could run it as Lambda functions.
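
In essence, that just means exposing the fetch logic as a handler function the Serverless Framework can deploy on a schedule. A minimal sketch of what such a handler looks like (the function and module names are illustrative, not my exact setup):

# handler.py: a minimal Lambda entry point; wire the snippets above into collect_snapshots().
import json


def collect_snapshots(channel_id):
    # Placeholder: call search.list and videos.list here (see the snippets above)
    # and return a list of snapshot dicts.
    return []


def daily_snapshot(event, context):
    # Referenced from serverless.yml as something like handler.daily_snapshot
    channel_id = event.get("channel_id", "UCMtFAi84ehTSYSE9XoHefig")
    snapshots = collect_snapshots(channel_id)
    return {"statusCode": 200, "body": json.dumps({"count": len(snapshots)})}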

And then I did a test run with all videos from a channel as a backfill.

It looked great for the first 500 videos, then I got a message no one likes to see: daily quota exceeded.

Some research showed that the quota thing is pretty tricky with the YouTube API.

(https://support.google.com/youtube/thread/13311954?hl=en)

You get 10,000 quota units per day, and every API request consumes some of them. Some requests are cheap (getting the stats for a video costs very little), while others are much more expensive (a search request costs 100 units). You can request an extension, but there is no guarantee when, or if, you will get it.

With this quota, I would never get enough data. And since some people reported unannounced downgrades of their quota, I decided to look into a different approach.

Scrape YouTube with Scrapy

Since I had started a Python project, I first looked into scraping the YouTube video pages with Scrapy. Scrapy is a well-equipped and powerful scraping tool, probably even more than my scenario needs. But it does not take long to start a Scrapy project, so I went with it. A lightweight alternative would be to fetch the page with requests and do the HTML parsing with BeautifulSoup.

I ran the first crawl, checked how I could access the data via CSS IDs and classes, defined some test elements (title and views), and crawled. Nothing…
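
For reference, my test spider looked roughly like this (the CSS selectors are illustrative guesses based on the rendered page, which turns out to be exactly the problem):

import scrapy


class YoutubeVideoSpider(scrapy.Spider):
    # Minimal sketch of the test spider; selectors are illustrative guesses.
    name = "youtube_video"
    start_urls = ["https://www.youtube.com/watch?v=yeFk2709fgY"]

    def parse(self, response):
        yield {
            # Both come back empty because the data is not in the initial HTML.
            "title": response.css("h1.title yt-formatted-string::text").get(),
            "views": response.css("span.view-count::text").get(),
        }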

I did some tests to see whether I had simply missed the IDs and finally logged the full HTML that Scrapy received. None of the data I needed was there.

Some more analysis showed that YouTube likes to make it difficult for you to crawl them.

The initial page source does not contain any details for the video. No title, no views…

The data is loaded and rendered after page load by asynchronous JavaScript.

Excellent. So no Scrapy or BeautifulSoup, since they only see the initial HTML source.

Scrape YouTube with Puppeteer

In this case, I needed a different class of tools: website testing tools. These tools drive a real browser, wait until the initial JavaScript has run, and then inspect the rendered page. The grandfather of these tools is Selenium, but it is a hassle to set up on a server, so I went with Puppeteer, a Node.js package that Google maintains for browser automation and testing.

Another local test setup. Puppeteer does not need much code to get your data. After some tweaking of the waiting times and the XPath definitions, voilà, I got all the data I wanted. Here is the script:

const puppeteer = require('puppeteer');
const moment = require('moment');

// XPaths of the elements on the video page that hold the data I need
const dataPath = {
    'title': '//ytd-video-primary-info-renderer/div/h1/yt-formatted-string',
    'count': '//yt-view-count-renderer/span[1]',
    'uploadDate': '//ytd-video-primary-info-renderer/div/div/div[1]/div[2]/yt-formatted-string',
    'likes': "//ytd-menu-renderer/div/ytd-toggle-button-renderer[1]/a/*[@id='text']",
    'unlikes': "//ytd-menu-renderer/div/ytd-toggle-button-renderer[2]/a/*[@id='text']",
    'subscribers': "//*[@id='owner-sub-count']"
};

// convert YouTube's shorthand counts ("1.2K", "3.4M") into plain integer strings
const parseCount = (str) => {
    const clean = str.replace(/[^0-9.KM]/gi, '');
    let multiplier = 1;
    if (/K/i.test(clean)) multiplier = 1000;
    if (/M/i.test(clean)) multiplier = 1000000;
    return Math.round(parseFloat(clean) * multiplier).toString();
};

const getYoutubeData = async (videoId) => {
    console.log("getting data for", videoId);

    const browser = await puppeteer.launch({
        headless: true,
        args: [
            "--no-sandbox",
        ]
    });

    const page = await browser.newPage();
    const data = {};
    await page.goto("https://www.youtube.com/watch?v=" + videoId);
    await page.waitFor(2000);
    console.log("on video page");

    // loop through all XPaths defined above to get the different kinds of data
    for (const key in dataPath) {
        let path = dataPath[key];
        await page.waitForXPath(path);
        const xElement = await page.$x(path);
        data[key] = await page.evaluate((el, prop) => el[prop], xElement[0], 'innerHTML');
    }

    // some final data cleaning
    data['count'] = data['count'].replace(/,/g, '').replace(' views', '');
    data['uploadDate'] = moment(data['uploadDate'], "MMM DD,YYYY").format('YYYY-MM-DD');
    data['likes'] = parseCount(data['likes']);
    data['unlikes'] = parseCount(data['unlikes']);
    data['subscribers'] = parseCount(data['subscribers']);
    console.log("data gathered");

    // I use a different function to save data to a BigQuery table
    await save_data(videoId, data, "base_youtube_daily_snaps", "late_night_dash");
    console.log("saved:", videoId);
    await browser.close();
};

I added some final tweaks to the returned data (mostly removing strings that are not needed and converting the shorthand counts).

Then I wrote a small function that saves the data into a BigQuery table. I ran it, and the data was saved. Perfect. Now let's get the whole thing onto a server and set it up as a nightly job.

Run Puppeteer on AWS Lambda or Heroku

I did not know it then, but this is where the second part of the journey started: finding a proper setup to run Puppeteer in the cloud.

Puppeteer has one issue. It needs a full Chromium binary to run its browser sessions. This binary is a bit heavy (more than 50 MB), and it also consumes a lot of RAM (as we all know from having 70 tabs open in Chrome). And I needed roughly 5,000 tabs open.

My initial plan was to check the views of every video the late-night shows have uploaded since the beginning of time. So roughly 15k crawls a day.

Getting Puppeteer running on AWS Lambda hit the first blocker. The deployment package (your source code plus installed modules) is limited to 50 MB, and getting Chromium in there takes significantly more.

Some people have created stripped-down Chromium packages specifically for Lambda, but none of them ran without a lot of hassle for me. If I had more time, I would invest some to find a working and scalable way.

(https://github.com/alixaxel/chrome-aws-lambda)

So, next up: Heroku. There, the package size is not an issue.

I added a small Express server, a Redis queue, and moved the Puppeteer function into a separate worker.

I deployed it and again hit some issues. It turns out that, out of the box, Puppeteer will not work on Heroku. There is a workaround buildpack, but it did not work for me after some tests, so I looked at the next candidate.

(https://elements.heroku.com/buildpacks/jontewks/puppeteer-heroku-buildpack)

Run Puppeteer on Google App Engine

In some posts, I saw that the Google App Engine standard environment supports Puppeteer out of the box (it ships the system dependencies that headless Chrome needs). Huzzah…

(https://github.com/puppeteer/puppeteer/issues/1036)

Some small reconfiguration was needed. App Engine uses Cloud Tasks as its queue system, and it is not intuitive to set up.
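
Just to illustrate the moving parts, enqueueing one crawl job with the Cloud Tasks Python client looks roughly like this (project, region, queue name, and the /crawl route are placeholders standing in for my setup):

import json

from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
# Placeholders: your GCP project, queue region, and queue name
parent = client.queue_path("my-project", "europe-west1", "youtube-crawl-queue")

task = {
    "app_engine_http_request": {
        "http_method": tasks_v2.HttpMethod.POST,
        "relative_uri": "/crawl",  # App Engine route that triggers the Puppeteer worker
        "body": json.dumps({"videoId": "yeFk2709fgY"}).encode(),
    }
}

client.create_task(parent=parent, task=task)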

But finally, everything worked: videos were added to the queue, and I got Puppeteer to crawl my pages. With one significant caveat: I needed to run App Engine on the biggest instance class (F4_1G). Unfortunately, this would cost me xx a month. A bit much for a side-project.

Run Puppeteer with Apify

It was clear that I needed a more cost-effective solution and had to strip down the project scope (not checking data for every video out there).

The good thing is, there is a Puppeteer-as-a-service called Apify (https://apify.com/). They also offer other crawlers, and it is not only crawling as a service: they provide a bunch of services around it, such as queues, storage, webhooks, and APIs.


I already had them on my list of tools to test, so now was a good time.

It took some time to understand their different services and how to use them, but after some testing, I had a crawler set up that fetched my YouTube data. And after some more effort, I had a Lambda function that gets all videos for a channel, puts them on an Apify queue, and runs the crawl, which saves the data in a data store; a different Lambda function then takes this data and writes it to BigQuery.
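
As a rough sketch of that last step, assuming the crawl results land in an Apify dataset and reusing the table names from the Puppeteer script (the dataset ID and project are placeholders):

import os

import requests
from google.cloud import bigquery

APIFY_TOKEN = os.environ["APIFY_TOKEN"]
DATASET_ID = "YOUR_APIFY_DATASET_ID"  # placeholder: the dataset the crawler writes to

# Fetch the crawl results from the Apify dataset via their REST API
items = requests.get(
    "https://api.apify.com/v2/datasets/{}/items?token={}&format=json".format(
        DATASET_ID, APIFY_TOKEN
    )
).json()

# Write the rows into BigQuery (dataset and table as used in the Puppeteer script)
client = bigquery.Client()
table = client.get_table("late_night_dash.base_youtube_daily_snaps")
errors = client.insert_rows_json(table, items)
if errors:
    print("BigQuery insert errors:", errors)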

For my budget, I could not run a crawl on all videos, but it is still a bit cheaper than the App Engine instance.

So I changed my project to show the performance on a monthly basis, which reduces the number of crawls so it won't break the bank.

Other options with Puppeteer

If you want to scale a Puppeteer service, you need bigger instances. I also had a local Docker setup with Puppeteer and Redis, which worked fine. But then I tested Apify and put the Docker approach on hold. If I had to scale, I would look into it again: use Docker on something like DigitalOcean or AWS Fargate and test some more. And maybe I would even spend some time getting it running on AWS Lambda.

For now, I am quite happy with Apify and already have a new side-project idea.