When looking for a YouTuber to sponsor a video for your product, the first thing that comes to your mind is to check the demographics data of that channel. Unless you are working with that YouTuber’s agent, and they are nice enough to send a screenshot of the demographics data, you can only do the guessing game: Since he/she is from the US, then his/her audience must mostly reside in the US, etc.

As the chief builder of the SocialBook website, which has the most comprehensive data in the world for any YouTuber, I am going to share with you how we calculate all kinds of data, including demographics, psychographics, and even the price of a YouTuber.


Profile Overview

The complete profile of a YouTuber on SocialBook looks like this: https://socialbook.io/demo/AzzyLand-614958

SocialBook sample profile

Let’s explain the blocks below one by one. We build the profile in blocks so that later, when we have more engineers (money), we can have them build drag-and-drop features: you just pick whatever blocks you like, organize them, and print them out in a fancy PDF format. (Shh! I can hear someone complaining behind my back already.)

Channel Owner Bio Information


Name, bio, follower count, total views, and total posts are pretty easy. You can retrieve them directly from the YouTube API. (Thanks, YouTube.) And Google does provide decent documentation for using their API. (Come on, guys, stop complaining about the YouTube API docs; you have yet to see the tech docs we have written.) When you Google “how to get a YouTube API key,” you get their official document here: https://developers.google.com/youtube/android/player/register
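To make the "pretty easy" part concrete, here is a minimal sketch of such a public lookup, assuming the YouTube Data API v3 `channels.list` endpoint with the `snippet` and `statistics` parts. The function names and the output field names are made up for illustration; this is not SocialBook's actual code.

```python
from urllib.parse import urlencode

API_URL = "https://www.googleapis.com/youtube/v3/channels"

def build_channel_request(channel_id, api_key):
    """Build the URL for a channels.list call that needs only an API key."""
    params = {"part": "snippet,statistics", "id": channel_id, "key": api_key}
    return f"{API_URL}?{urlencode(params)}"

def parse_channel(response):
    """Pull the bio fields out of a decoded channels.list JSON response."""
    item = response["items"][0]
    stats = item["statistics"]
    return {
        "name": item["snippet"]["title"],
        "bio": item["snippet"]["description"],
        # Counters arrive as strings; the subscriber count can be missing
        # when a channel hides it, so fall back to 0 in that case.
        "followers": int(stats.get("subscriberCount", 0)),
        "total_views": int(stats["viewCount"]),
        "total_posts": int(stats["videoCount"]),
    }
```

Fetching the URL with any HTTP client and passing the decoded JSON to `parse_channel` gives you the bio block. No OAuth consent from the YouTuber is involved.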

The data can be returned directly from the public API. When I say public, it means that you do NOT need YouTubers to authorize your app to get this information. In other words, NO such popup as this:

(BTW, the image looks so scary! You can even view and manage my rental and purchase history?? I would NEVER click ‘Allow’ in the above …)

Now, how do we get their email? If you have ever visited a YouTuber’s About page, did you see the stupid, never-disappearing captcha box? Apparently, nowadays, machines are smarter than humans because I fail the captcha nearly 80% of the time.

Of course I am not a robot. Wait a minute; I just found that the annoying “click all images containing street sign/car” box didn’t pop up! Did YouTube just remove that because too many people hate it?

We use a browser simulation, meaning we write code that simulates a human being opening a browser, going to a YouTube About page, and clicking on “I’m not a robot.” So, yes, you are right; a robot just clicked on “I’m not a robot” … Sorry, YouTube.

What if the super annoying “click all images containing street sign/car” box pops up again? Well, I have to keep it a secret for now. You can email me directly, and I’ll tell you how to achieve that.

Now let’s talk about the main channel language. It’s pretty easy for a human to recognize which language a channel uses but not so easy when it comes to code. We first have to get all the video titles and descriptions, run a language detection algorithm on them, and then hold a vote on the detected languages. For example, if we have 5 videos, and more than 3 of them are 90% in English, then we know this channel is mainly an English channel. We encountered two problems during this process, namely:

1)   It’s hard to differentiate Korean from Chinese. (Really? Come on; your algorithm is so dumb!) Yes, we eventually fixed it.

2)   Some non-English channels tend to put lots of English in the description, so voting does not reflect the actual channel language in this case.

How do we solve this? Instead of using more data science power on it, we decided to store all the languages for each YouTuber so that when you search for Chinese, those wrongly identified Chinese channels will still be shown.
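The voting idea, plus the "store all the languages" fallback, can be sketched roughly like this. The `detect` argument is a stand-in for whatever language detector you use, and `min_share=0.6` simply mirrors the "more than 3 out of 5" example above; both are assumptions for illustration only.

```python
from collections import Counter

def channel_languages(video_texts, detect, min_share=0.6):
    """Vote on the main language and keep every language that was seen.

    detect is a hypothetical detector: it takes a string and returns a
    language code such as "en".
    """
    votes = Counter(detect(text) for text in video_texts)
    if not votes:
        return None, []
    language, count = votes.most_common(1)[0]
    # The winner only counts as the main language with a clear majority.
    main = language if count / len(video_texts) > min_share else None
    # Every detected language is kept so that search can still match the
    # channel when the vote picks the wrong winner.
    return main, sorted(votes)
```

Keeping the full list is cheap, and it is what lets a channel still show up in a search for a language that lost the vote.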

Oh, and I just noticed ‘Boost Score.’ Well, some of you might have heard of Klout before: https://klout.com/home. It calculates a score for each influencer, showing their level of influence. The Boost Score is similar. We calculate a score from 1 to 100 to gauge how effective a YouTuber is. We use more than 10 signals (such as follower count, engagement rate, average views, active time, etc.) to calculate the score, and each signal has a different weight. There is also a green Brand Safe icon:

This is again just using simple NLP (Natural Language Processing). If the channel contains lots of cursing/bad words, the channel will not have this cute leaf icon.
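A word-list check of this kind could look like the sketch below. The `bad_words` set and the `max_ratio` threshold are placeholders I made up; the real word list and cutoff are not described here.

```python
import re

def is_brand_safe(texts, bad_words, max_ratio=0.01):
    """Return True when flagged words make up at most max_ratio of all words."""
    words = [w for text in texts for w in re.findall(r"[a-z']+", text.lower())]
    if not words:
        return True
    flagged = sum(1 for w in words if w in bad_words)
    return flagged / len(words) <= max_ratio
```

You would feed it the titles and descriptions of a channel's videos and show the leaf icon only when it returns `True`.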

Featured Post and Recent Posts

So, having explained the first block, let’s move to the next one: Featured Post and Recent Posts. These are visualized posts that we think are important to brands.

Top performance video is the video with the most views among all the videos. Top performance video within three months is the video from the last three months with the most views. Top sponsored video is the sponsored video (paid for by a brand) with the most views.

Getting this data is relatively easy. You first need a daily cronjob to get all the information at once. If your database is Elasticsearch (www.elastic.co), getting it all at once is a piece of cake. If you work in big data and do not know Elasticsearch, well, go and learn it. A simple trick is to have dedicated masters for these heavy write operations so your end users will not be affected. I will write a separate article about our database structure; it’s pretty heavy: MySQL, Mongo, Elasticsearch, Cassandra, Redis, you name it.

Regarding the sponsored video, I got quite a lot of people asking me how to calculate that. It is also much easier than you think. You just need to search for certain keywords to identify whether a brand paid for a video. For example, you can search for “Sponsored by XXX” or “Paid by XXX.” You get the gist. We have a pretty thorough keywords list.
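As a rough illustration of the keyword search, the sketch below looks for just the two phrases mentioned above and grabs the word that follows. The pattern and the function name are mine; a real keyword list would be much longer.

```python
import re

# Only the two example phrases from the text; extend the alternation as needed.
SPONSOR_PATTERN = re.compile(
    r"\b(?:sponsored by|paid by)\s+([\w&'-]+)", re.IGNORECASE
)

def find_sponsor(description):
    """Return the word following a sponsorship phrase, or None if none is found."""
    match = SPONSOR_PATTERN.search(description)
    return match.group(1) if match else None
```

Note that this captures a single word only, so multi-word brand names would need a lookup against a brand list rather than a plain regex.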

Price and Average Views

Now comes some cool stuff:
channel price

Average views is still easy: we get it from the cronjob mentioned above. One thing worthy of note is that this number is the 75 percentile average views. It means that 75% of all videos have more views than this number (534.2k). This is super important because sometimes, brands or even YouTubers themselves will ‘Boost’ their videos, i.e., use ads to get more views for a video. (Caution: lots of fake views are involved as well. There is an entire industry for this.) Some claim that the video went viral naturally. Well, I leave that judgment to you. As a result, some videos might have extremely high views. If we just calculated the plain average, it would be skewed by those videos and probably would not reflect the true health of the channel.
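Here is a small sketch of that idea, assuming a simple nearest-rank rule: return the view count that at least 75% of the videos reach. The function name and the exact ranking rule are my own choices for illustration.

```python
import math

def robust_average_views(views, share=0.75):
    """Return the view count that at least `share` of the videos reach."""
    if not views:
        raise ValueError("need at least one video")
    ranked = sorted(views, reverse=True)
    # Nearest-rank position of the last video inside the top `share` slice.
    index = math.ceil(share * len(ranked)) - 1
    return ranked[index]

views = [100, 200, 300, 10_000]       # one boosted outlier
typical = robust_average_views(views)  # 200: three of the four videos reach it
plain_mean = sum(views) / len(views)   # 2650.0: dragged up by the outlier
```

The single boosted video moves the plain mean a lot and the percentile-style number not at all, which is exactly why the latter is used.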

The price part is controversial. Some brands say it’s too low; some say it’s too high. Well, this is the price we think this channel is worth. The algorithm is based on the number of followers, the average views of the channel, the engagement data, and the activity level of the channel. So, if a channel has not been active for quite some time, even if it still has a high number of followers and average views, its price will be lower than that of channels with fewer followers that have been more active.

Channel performance

Again, all the numbers (highest views, lowest views, latest views, comments per post, likes per post, latest 10 average views) can be obtained with one single query against Elastic. However, the progress bar you see in the graph is not as easy to get as you might think. Average views in a certain time frame means the number of views that this channel’s videos reach within the given time after their release date. This gives you an idea of what to expect if you pick this channel to make a sponsored video for your brand, i.e., how long you should wait to see whether the video reaches the channel’s overall average views, organic average views, and sponsored average views respectively.

This basically means that we have to track stats for all videos and, as the graph above shows, track them daily for each video, which is a massive amount of data: hundreds of billions of data entries. We do have some optimization, though. We discovered that once a video has been out for a certain time, the number of views will no longer increase. To save machine power, we set up the following strategy: if a video has just launched, we track its stats hourly. If the video is 3 days old, we track the stats daily. If the video is a week old, we track weekly, and so on.
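The tiered tracking schedule can be sketched as a small function. Only the three tiers named above are encoded; whatever comes after "and so on" is unknown, so this sketch simply stays at weekly for older videos.

```python
from datetime import timedelta

def tracking_interval(age):
    """Map a video's age to how often its stats are polled."""
    if age < timedelta(days=3):
        return timedelta(hours=1)   # freshly launched: hourly
    if age < timedelta(days=7):
        return timedelta(days=1)    # 3 days old: daily
    return timedelta(weeks=1)       # a week old or more: weekly
```

A scheduler would call this with `now - published_at` after each poll to decide when to look at the video again.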

There is still a minor issue, though. Since SocialBook is a real-time system, when a YouTuber has just been added to our database, we will not be able to get all the data. It’s like a baby just born in the hospital, with no historical data. But as the baby grows, more and more data will accumulate about this baby (channel).

Links and Tags

These are scraped from the About page of a YouTuber; a simple curl will do.

Surprisingly, tags are also retrieved from the YouTube public API. When you upload a video to YouTube, you can specify tags for that video. Therefore, we can get all tags from all videos on a channel and then choose the 10 most frequent ones.
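Picking the most frequent tags is a few lines with a counter. In this sketch each video is a dict with an optional `"tags"` list, which is an assumed shape, and lowercasing the tags before counting is my own choice.

```python
from collections import Counter

def top_tags(videos, limit=10):
    """Return the most frequent tags across all videos of a channel."""
    counts = Counter(
        tag.lower() for video in videos for tag in video.get("tags", [])
    )
    return [tag for tag, _ in counts.most_common(limit)]
```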

Game titles is a cool feature. You can also see what games this channel has played and the number of videos containing this gameplay. Furthermore, you can also click on each game title to see which videos contain this game:

We maintain a huge list of all games, including PC games and mobile games. Each entry contains the game title, genre, publisher, and platform. How do we build the list? There are lots of websites we use. Initially, we thought about crawling the Apple/Google app stores, but their data does not give us detailed enough genres. So, first, we use Google to find the most popular mobile games in a given genre and then crawl them separately. If some detail is missing, we go back to the Apple/Google app stores to get it. Once the mobile game list is done, we start putting together the PC game list. Steam is a good source. Another hidden treasure is Wikipedia! Each known game on Wikipedia has very detailed information, like genre, mode, publisher, … Everything! For example, you can go to https://en.wikipedia.org/wiki/Counter-Strike to get the game details of Counter-Strike:

With all this information, we can allow users to do some cool stuff like searching game genre directly, e.g., FPS. We can even search by publisher, e.g., EA.

Related Channels and Featured Channels

These two are also scraped directly from the YouTube page. Related channels are added by YouTube. Featured channels can be added by the channel owners themselves.

Engagement Rate and Video Category

Easy piece: 75 percentile average views divided by total follower count gives us the ‘Average Views to Subscriber.’ Total likes divided by total views gives us the ‘Likes Per View,’ and the same applies to ‘Comments Per View.’ Note that the total likes count, view counts, and comments count are also retrieved along with the cronjob we mentioned more than once above.
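The three ratios written out as code, as a minimal sketch. The parameter names and the dictionary keys are mine, and returning 0.0 for an empty channel is an assumption to avoid dividing by zero.

```python
def engagement_rates(typical_views, followers, total_likes, total_comments, total_views):
    """Compute the three engagement ratios described in the text."""
    def ratio(numerator, denominator):
        # Guard against brand-new channels with no followers or views yet.
        return numerator / denominator if denominator else 0.0

    return {
        "views_to_subscriber": ratio(typical_views, followers),
        "likes_per_view": ratio(total_likes, total_views),
        "comments_per_view": ratio(total_comments, total_views),
    }
```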

Video category is another easy but time-consuming calculation. YouTube’s public API provides a category for each video. As a result, we just need to get the category of each video and then do some simple math.

Progress Graph

There is no magic in getting this data. You just need a daily job calling the YouTube API to get the total views/subscribers every day. Now you may ask, how come the graph shows two years’ data? Well, it’s because we started the data collection process two years ago. This is also a data barrier for other providers. If you start building a scraper right now, you will need some time to accumulate all the data.

Another cool thing I like is that we put the most performant video of the month on the trend line. You can easily tell which video goes viral, contributing lots of views/followers to this channel. Again, using Elastic as back end makes all the calculation 100% easier.
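In production this would presumably be an Elasticsearch aggregation, but the logic of picking each month's best video is simple enough to show in plain Python. The dict shape with `"published"` and `"views"` keys is an assumption for this sketch.

```python
from datetime import date

def top_video_per_month(videos):
    """Return the most viewed video for each publication month (YYYY-MM)."""
    best = {}
    for video in videos:
        month = video["published"].strftime("%Y-%m")
        if month not in best or video["views"] > best[month]["views"]:
            best[month] = video
    return best

videos = [
    {"id": "a", "published": date(2019, 1, 5), "views": 10},
    {"id": "b", "published": date(2019, 1, 20), "views": 50},
    {"id": "c", "published": date(2019, 2, 1), "views": 5},
]
monthly_best = top_video_per_month(videos)
```

Each entry of `monthly_best` is the video you would pin on the trend line for that month.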

Audience Demographics

Now comes the core part: the demographics data of a YouTuber.

In order to calculate the demographics data, we need to know each follower’s data. Since the follower list is not available through the YouTube APIs, all the calculation is based on active commenters. In fact, the data calculated from commenters makes more sense because it represents the most engaged audience. This is actually very important to keep in mind because, sometimes, you will see that our calculated data is a bit skewed from the actual channel demographics shown in the channel’s management portal.

We select the most active commenters and then start getting the country and interest of each commenter through YouTube APIs. Be aware that not all commenters will have their interest and country available through APIs. In that case, we can simply ignore that. Remember, this is a sampling process; so, as long as we only sample from the commenters who have data, the result is pretty accurate.
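The "ignore commenters without data" step amounts to computing shares over the known values only. A minimal sketch, assuming each sampled commenter is a dict whose `"country"` key may be missing or empty:

```python
from collections import Counter

def country_distribution(commenters):
    """Share of each country among the commenters whose country is known."""
    countries = [c["country"] for c in commenters if c.get("country")]
    counts = Counter(countries)
    total = len(countries)
    # Commenters without a country are left out of the denominator as well.
    return {country: n / total for country, n in counts.items()}
```

The same shape works for interests; only the key changes.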

Now comes the most challenging part: the age and gender distribution. The core technology behind this is face recognition. We scan through each commenter. First, we check whether the commenter uses their own face as the profile image. If the answer is yes, then we can easily estimate the age/gender from that avatar. However, in many cases, commenters do not use their face as the avatar. Our algorithm has to be smart about that. A common fallback is to capture the face inside the video frames. For example:
Famous Pewdiepie:

His pretty face is relatively easy to capture. In some other cases, the face is hidden in the corner:

Keep in mind that there is a lot of noise, and we have to be really careful about that.

There are several ways to improve the algorithm. One common way is to cross-reference. For example, if this commenter has an Instagram account or a Twitter account, we can go there and cross-check. If the commenter has also uploaded videos, we can extract the scripts from his/her videos, perform voice recognition, and then do a vote again: if x+1 out of 2x videos are narrated by a female voice, then we classify this commenter as female.
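The "x+1 out of 2x" rule is just a strict majority vote. Here is a sketch that takes one label per video, as produced by some voice classifier that is not shown here. The symmetric rule for male and the `None` result on a tie are my assumptions.

```python
def vote_gender(video_voice_labels):
    """Strict-majority vote over per-video voice labels."""
    total = len(video_voice_labels)
    female = sum(1 for label in video_voice_labels if label == "female")
    male = sum(1 for label in video_voice_labels if label == "male")
    if female * 2 > total:   # at least x+1 out of 2x videos
        return "female"
    if male * 2 > total:     # assumed symmetric rule
        return "male"
    return None              # no clear majority: leave undecided
```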

All these algorithms are pretty complicated. That is why we need 5 powerful GPU machines to do these heavy calculations. That is also the reason why, sometimes, you need to wait a bit longer when calculating some YouTubers. However, as the data accumulates, channel audiences overlap more and more, so the calculation will eventually become faster and faster.

Furthermore, we can still tune the data quite a bit because we have lots of training data. Most times, it is not really about the algorithm; it’s about what training data you have. Fortunately enough, we have about 50,000 registered users who authorized us to get their official demographics data. This makes our data much more accurate than our competitors’. For example, we found that females are more willing than guys to use their own photo as an avatar. We can thus adjust the rate accordingly.

Brand Mentions

The last piece is brand mentions:
brand mentions
This process is similar to that of getting game titles. We maintain a large pool of well-known brands and then just do a brute-force search on all videos. Be aware that sometimes, YouTubers use tags instead of brand names to represent the brands. We also maintain tags for each brand and search for them as well. The tags will come in even handier once we get into the Instagram space.
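The brute-force search can be sketched as a nested loop over videos and brands. Here `brands` maps each brand name to its known tags. The data shapes and names are invented for this example, and plain substring matching will produce some false positives on short brand names.

```python
def find_brand_mentions(videos, brands):
    """Map each brand to the ids of the videos that mention it or its tags."""
    mentions = {}
    for video in videos:
        text = (video["title"] + " " + video["description"]).lower()
        for brand, tags in brands.items():
            terms = [brand] + list(tags)
            if any(term.lower() in text for term in terms):
                mentions.setdefault(brand, []).append(video["id"])
    return mentions
```

The resulting lists of video ids are what a popup per brand would display.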

Just like game titles, when you click on a brand, a popup will show all the videos that contain this brand.

More features are being added to the YouTube profile. For example, similar channels will become available soon. Similarity here does not refer to video content across channels but to audience similarity. A very good use case is that you can run video pre-roll ads targeting those similar channels once one of them performs well.

There is also a separate blog post explaining the accuracy of our algorithm, in case you are interested.