If you are running an influencer marketing campaign on your own and choosing an Instagrammer to promote your product, you should always have one metric in mind: the audience demographics of the influencer’s channel. You can skip this article if you are running your campaign through an MCN or an influencer marketing agency. Be aware, though, that agencies often show you only the good and juicy parts of an influencer and leave out anything that might harm their deal. In that case, you can only guess at the audience demographics of the influencers the agency selected. So I would still recommend that you read this article and ask some sharp questions before you hire any agent.

As the chief technology officer of SocialBook.io, which generates the most comprehensive data for any Instagrammer, big or small, I am writing this article from a technology perspective to explain how we calculate the data we provide, including audience demographics, psychographics, and even the estimated price of a given Instagrammer.





Profile Overview

The complete profile of an Instagrammer or Instagram content creator on SocialBook looks like this: https://socialbook.io/#/demo/instagram/pewdiepie 

Let’s go through the blocks one by one. We built the profile in blocks so that later, when we have more engineers (money), we can have them build drag-and-drop features: you will be able to pick whichever blocks you like, arrange them, and print them out as a fancy PDF. (Shh! I can hear someone complaining behind my back already.)



Channel Owner Bio Information

Sample Instagram influencer profile

Basic content creator information such as name, bio, follower count, total views, and total posts is self-explanatory. To be honest, this is publicly available information, and we use the Instagram API to retrieve it. (Glad that Instagram still allows API access.) However, we have recently had some difficulty retrieving certain Instagram posts because API access is getting stricter. What makes it harder is that Instagram constantly changes image URLs, which makes the images hard to capture. Instagram’s official developer pages cover API access; you will know what I am talking about after reading the Instagram Graph API documents.

According to Instagram’s official developer API instructions, the Instagram API Platform can be used to build non-automated, authentic, high-quality apps and services. But make sure you follow their platform policy before you actually implement API access, especially the policies related to privacy. It’s America, you know: follow the rules.

Next, let’s look at the language of the influencer’s channel posts. People are visual, highly intelligent animals who can understand the language content creators post in. Machines are not, especially since many content creators use several languages besides English. All machines understand is code. Our AI has to read all the titles and descriptions, and then vote on the language detected by our algorithm. For example, if we have 5 Instagram posts and more than 3 of them are written 90% or more in English, the machine determines that this is an English-based channel. The machine has to solve two critical problems during language detection:

1)   It is hard to believe, but AI today still cannot tell Korean and Chinese apart with 100% accuracy. However, we eventually fixed it.

2)   Many content creators are trying to attract English-speaking followers, so they put a lot of English in their channel, which makes it very hard for AI to determine the channel language.

Well, these are just two of many problems. AI today is still not as smart as you might think.
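The voting rule from the 5-post example can be sketched roughly as follows. This is only an illustration, not SocialBook’s actual detector: the helper `english_share` is a hypothetical stand-in that counts ASCII letters, and the thresholds (90% per post, more than 3 of 5 posts) are taken straight from the example.

```python
def english_share(text: str) -> float:
    """Hypothetical stand-in for a real language detector:
    the fraction of alphabetic characters that are ASCII letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isascii() for ch in letters) / len(letters)


def is_english_channel(posts, post_threshold=0.9, vote_ratio=3 / 5):
    """Vote over posts: the channel counts as English when the share of
    posts that are at least `post_threshold` English exceeds `vote_ratio`."""
    if not posts:
        return False
    english_votes = sum(english_share(p) >= post_threshold for p in posts)
    return english_votes / len(posts) > vote_ratio


posts = [
    "New video is out, go check it out!",
    "Thanks for all the support this year",
    "Behind the scenes from today",
    "Weekend vibes",
    "오늘도 좋은 하루 보내세요",
]
print(is_english_channel(posts))
```

A letter-counting heuristic like this would stumble on exactly the two problems listed above, which is why a real detector is needed in practice.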

Let’s move on to the ‘Boost Score.’ The SocialBook system automatically assigns each influencer a score based on their influence, on a scale of 1 to 100. More than 10 signals go into the score, such as follower count, total engagement rate, average views, and last active time. Different signals carry different weights, so some factors matter more than others. For example, the engagement rate is much more important than the follower count. You know there are lots of fake followers.
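A weighted score of this kind could look roughly like the sketch below. The signal names, the weights, and the assumption that every signal is already normalized to the 0..1 range are all hypothetical; the only thing taken from the description is that engagement rate outweighs follower count and that the result lands on a 1 to 100 scale.

```python
# Hypothetical weights: the only published fact is that engagement rate
# matters more than follower count. Real signals and weights are not public.
WEIGHTS = {
    "engagement_rate": 0.4,
    "average_views": 0.25,
    "recency": 0.2,
    "followers": 0.15,
}


def boost_score(signals):
    """Combine signals (each assumed normalized to 0..1) into a
    weighted score on a 1-100 scale."""
    total_weight = sum(WEIGHTS.values())
    weighted = sum(
        # Clamp each signal to 0..1; missing signals count as 0.
        WEIGHTS[name] * min(max(signals.get(name, 0.0), 0.0), 1.0)
        for name in WEIGHTS
    )
    # Map the 0..1 weighted average onto 1..100.
    return round(1 + 99 * weighted / total_weight)


engaged = boost_score({"engagement_rate": 0.9, "followers": 0.1})
popular = boost_score({"engagement_rate": 0.1, "followers": 0.9})
print(engaged, popular)
```

With weights like these, a small but highly engaged channel outranks a large channel with weak engagement, which is the behavior the score is meant to have.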

The cute leaf icon is again just simple NLP (Natural Language Processing). If a channel contains lots of cursing or bad words, it will not get the leaf icon.


Featured Post and Recent Posts

PewDiePie all Instagram posts

Now that you know how SocialBook works using AI and big data from the Instagram Open API, let’s take a closer look at the value we provide for influencer marketing managers, starting with the top right section: All Posts and Featured Posts. The All Posts section summarizes every post published by the selected influencer, in this example PewDiePie. We include the like and comment counts for each post so that you can see clearly how each one performed.

The Featured Posts section is a summary of the influencer’s most popular posts, which we think is important for influencer marketing managers. It includes top-performing posts, top-performing posts from the last three months, and top sponsored posts.

PewDiePie featured posts

Top-performing posts are the posts that have received the most likes among all posts. Recent three-month top-performing posts are the posts published within the last three months that have gained the most likes. Top sponsored posts are the posts paid for by brands (through their influencer marketing managers) that have earned the most likes.

It is not hard to get this data. Simply use Elasticsearch (www.elastic.co) with a daily cron job that collects all the information and you are good to go! If your database runs on Elasticsearch, getting everything at once is a piece of cake. If you are an influencer marketing manager, you might not know elastic.co, but as an engineer in the big data area you have to know Elasticsearch; just go and learn it. A simple trick is to have dedicated masters for these write-heavy operations so your end users are not affected. I will write a separate article about our database structure; it’s pretty heavy: MySQL, Mongo, Elasticsearch, Cassandra, Redis, you name it.

A lot of people ask me how we identify sponsored posts. It is also much easier than you think. You just need to search for certain keywords that show a brand paid for a post. For example, you can search for “Sponsored by XXX” or “Paid by XXX.” You get the idea. We have a pretty thorough keyword list that we check each post against to decide whether it is a sponsored post.
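A keyword check like this takes only a few lines. The sketch below is illustrative: it uses just the two phrases mentioned above, whereas the real keyword list is described as much more thorough, and the function name `is_sponsored` is made up for this example.

```python
import re

# Illustrative keyword list only; the real list is described as far longer.
SPONSOR_PATTERNS = [
    r"\bsponsored by\b",
    r"\bpaid by\b",
]
SPONSOR_RE = re.compile("|".join(SPONSOR_PATTERNS), re.IGNORECASE)


def is_sponsored(caption: str) -> bool:
    """Return True when the caption contains any sponsorship keyword."""
    return SPONSOR_RE.search(caption) is not None


print(is_sponsored("This post is Sponsored by XXX"))
print(is_sponsored("Just a sunny day at the beach"))
```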


Price and Average Views

Now comes some cool stuff for influencer marketing: the overall channel performance of the influencer.

The average likes figure is still easy: we get it from the cron job mentioned above. One thing worth noting is that this figure is the 75 percentile average likes. It means that 75% of all posts have more likes than this number (399.3k). This is super important because brands, or even Instagram influencers themselves, will sometimes ‘Boost’ their posts: they use ads to get more views for a post. (Caution: lots of fake likes are involved as well. There is an entire industry for fake followers and fake likes.) Some claim that their posts went viral naturally. Well, you can figure this out yourself. As a result, some posts get extremely high like counts. If we just calculated a plain average, it would be biased and probably would not reflect the true health of the channel.
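To see why a plain average is misleading, here is a small sketch with made-up like counts and one boosted outlier. It follows the description literally and returns the like count that 75% of the posts reach or exceed (counted from the bottom, that level sits at the 25th percentile). The function name `robust_like_level` is hypothetical, and this is not necessarily the exact formula SocialBook uses.

```python
import math


def robust_like_level(likes, share_above=0.75):
    """Return the like count that `share_above` of the posts reach or exceed."""
    if not likes:
        raise ValueError("need at least one post")
    ranked = sorted(likes, reverse=True)
    # Position of the last post inside the top `share_above` share.
    index = math.ceil(share_above * len(ranked)) - 1
    return ranked[index]


likes = [100, 110, 120, 130, 140, 150, 160, 5000]  # made-up data, one boosted post
print(sum(likes) / len(likes))   # plain mean, pulled up by the outlier
print(robust_like_level(likes))  # level that 75% of the posts reach
```

The single boosted post drags the plain mean far above what a typical post earns, while the percentile-based figure stays close to normal performance.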

The estimated price is controversial. Some influencer marketing managers say that our estimated price is too low; others say it is too high. Well, as a data-driven influencer marketing company, we report the price that the data suggests, one that can represent the worth of the channel. The algorithm is based on the channel’s follower count, average likes, engagement rate, and activity level. For example, if a channel has not been active for quite some time, its price will be lower than that of a channel with fewer followers that has been more active, even if the inactive channel still has a high follower count and high average likes.

Again, all the numbers (highest likes, lowest likes, latest likes, comments per post, likes per post, average likes of the latest 10 posts) can be obtained with one single query against Elasticsearch. However, the progress bar you see in the graph is not as easy to get as you might think. Average likes in a certain time frame means the number of likes that this channel’s posts reach within the given time from their release date. So if you hire this Instagram channel to make a sponsored post for your brand, you get an idea of how long to wait before judging whether the post has reached the channel’s overall average likes, organic average likes, and sponsored average likes respectively.

This basically means that we have to track stats for every post, and, from the graph above, track them daily, which is a massive amount of data: hundreds of billions of data entries. We do have some optimization, though. We discovered that once a post passes a certain age, its number of likes no longer increases. That makes sense, because it has become old feed content that no one sees once the influencer publishes new posts. To save machine power, we set up a tiered strategy: if a post has just been published, we track its stats hourly. Once the post is 3 days old, we track the stats daily. Once the post is a week old, we refresh the stats weekly, and so on.
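The tiered schedule can be written as a small lookup on the post’s age. This is a minimal sketch using only the three tiers spelled out above; the “and so on” tiers beyond weekly are not specified, so they are left out, and the function name is hypothetical.

```python
from datetime import timedelta


def tracking_interval(post_age: timedelta) -> timedelta:
    """Pick how often to refresh a post's stats based on its age."""
    if post_age < timedelta(days=3):
        return timedelta(hours=1)   # fresh post: hourly
    if post_age < timedelta(days=7):
        return timedelta(days=1)    # 3 to 7 days old: daily
    return timedelta(weeks=1)       # a week or older: weekly


for age in (timedelta(hours=2), timedelta(days=4), timedelta(days=30)):
    print(age, "->", tracking_interval(age))
```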

There is still a minor issue, though. Since SocialBook is a real-time system, when an Instagram channel has just been added to our database we will not have all of its data yet. It’s like a baby just born in the hospital, with no historical data. But as the baby (channel) grows, more and more data accumulates. Plus, we keep some data for content that is later deleted.


Tags

This is scraped from the profile info of an Instagrammer; a simple curl will do.

 

Surprisingly, tags are also retrieved from the Instagram public API. When you publish a post to Instagram, you add hashtags. Therefore, we can collect the tags from all posts on a channel and then choose the 10 most frequent ones.
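Picking the most frequent tags is a simple counting job. The sketch below assumes the post captions are already available as plain strings and extracts hashtags with a regular expression; how the captions are fetched from the API is left out, and `top_tags` is a made-up name.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")


def top_tags(captions, limit=10):
    """Return the `limit` most frequent hashtags across all captions."""
    counts = Counter(
        tag.lower()  # treat #Gaming and #gaming as the same tag
        for caption in captions
        for tag in HASHTAG_RE.findall(caption)
    )
    return [tag for tag, _ in counts.most_common(limit)]


captions = ["#gaming #fun", "#Gaming day", "#fun #gaming"]
print(top_tags(captions, limit=2))
```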

Game titles are a cool feature. You can see which games this channel has played and the number of posts containing each game’s gameplay. Furthermore, you can click on a game title to see which posts contain that game:

 

 

We maintain a huge list of games, including PC games and mobile games. Each entry contains a game title, genre, publisher, and platform. How did we manage to build the list? We used lots of web services. Initially, we thought about crawling the Apple and Google stores, but their data does not give us enough detail on genre. So, first, we use Google to find the most popular mobile games in a given genre and then crawl them separately. If some detail is missing, we go back to the Apple or Google store to get it. Once the mobile game list was done, we started putting together the PC game list. Steam, the famous distribution platform developed by Valve Corporation, is a good source for reference. Another hidden treasure is Wikipedia! In the FREE online encyclopedia, each known game has very detailed information, such as genre, mode, publisher, and more. For example, you can go to the Wikipedia page for Counter-Strike to get details on that game:

With all this information, we now let influencer marketing managers find influencers by searching for a game genre directly, e.g., FPS. They can even search by publisher, e.g., EA.


 

Engagement Rate

We all know that in the influencer marketing industry, the second most important number to influencer marketing managers, after follower count, is the influencer’s engagement rate. This number tells marketers how many people the influencer is actually influencing. For us, computing it with machine learning is easy, but for an influencer marketing manager computing it by hand is much harder. This is how we calculate it: the 75 percentile average views divided by the total follower count gives us the ‘Average Likes to Subscriber.’ Total likes divided by total views gives us the ‘Likes Per View,’ and the same applies to ‘Comments Per View.’ Note that the total like and comment counts are also retrieved by the cron job we have mentioned more than once above.
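Written out, the three ratios are plain divisions. The sketch below uses made-up numbers and hypothetical parameter names; it only restates the formulas given above and returns 0.0 when a denominator is zero.

```python
def safe_ratio(numerator, denominator):
    """Divide, returning 0.0 instead of failing on a zero denominator."""
    return numerator / denominator if denominator else 0.0


def engagement_metrics(p75_average, followers, total_likes,
                       total_comments, total_views):
    """Compute the three engagement ratios described in the text."""
    return {
        "average_likes_to_subscriber": safe_ratio(p75_average, followers),
        "likes_per_view": safe_ratio(total_likes, total_views),
        "comments_per_view": safe_ratio(total_comments, total_views),
    }


# Made-up numbers for illustration only.
metrics = engagement_metrics(
    p75_average=500, followers=10_000,
    total_likes=2_000, total_comments=100, total_views=40_000,
)
print(metrics)
```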


Progress Graph

There is no magic in getting this channel progress data. You just need a daily job that calls the Instagram API to get the total views and subscribers. Now you may ask, how come the graph shows two years of data? Well, it’s because we started collecting data two years ago. This is also a data barrier for other providers. If you start building a scraper right now, you will need some time to accumulate all the data.

Another cool thing I like about SocialBook is that we put the best-performing post of each month on the trend line. You can easily tell which post went viral and brought lots of views and new followers to the channel. Again, using Elasticsearch as the back end makes all the calculation much easier, which also makes life much easier for influencer marketing managers.

Why is this graph important to influencer marketing? Influencer marketing managers need to know the overall health of the channel: whether the influencer is steadily growing his or her follower base, whether there has been a sudden drop in follower count, and so on. As an influencer marketing manager, you do not want to hire anyone who just lost 50% of his or her followers.


Audience Demographics

Now comes the core part for influencer marketing managers: the demographics data of an Instagram influencer.

To calculate an influencer’s demographics data, we need data on each of the followers. Since the follower list is not available through the Instagram APIs, all the calculation is based on active commenters. In fact, data calculated from commenters makes more sense for influencer marketing because it represents the most engaged audience. This is very important, because influencers might sometimes provide you with a prettier graph that includes zombie followers. According to a Points North Group study, up to 20% of mid-level influencers’ followers are likely fraudulent. So, as an influencer marketing manager, you should do your best to spot them and pay only for the “actual followers” to increase your ROI.

We select the most active commenters and then get the country and interests of each commenter through the Instagram APIs. Be aware that not all commenters have their interests and country available through the APIs. In that case, we simply skip them. Remember, this is a sampling process, so as long as we sample only from commenters who have data, the result is pretty accurate.

Now comes the most challenging part: the age and gender distribution. The core technology behind this is face recognition. We scan through each commenter. First, we check whether the commenter has an avatar as his or her profile image. If so, we can easily estimate age and gender from that avatar. However, in many cases commenters do not use their own face as the avatar, and our algorithm has to be smart about that.

Keep in mind that there is a lot of noise, and we have to be really careful about it.

There are several ways to improve the algorithm. One common way is to cross-reference. For example, if this commenter has another account, such as a Twitter account, we can go there and cross-check. If the commenter has also published posts, we can extract the scripts from those posts, perform voice recognition, and then vote again: if x+1 out of 2x posts are narrated by a female voice, we classify this commenter as female.
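The “x+1 out of 2x” rule is a strict majority vote. Here is a minimal sketch, assuming some upstream voice recognition step has already produced one label per post; that step is not shown, and the function name `vote_gender` is hypothetical.

```python
from collections import Counter


def vote_gender(per_post_labels):
    """Return the label held by a strict majority of posts, else None.

    With 2x posts, a strict majority means at least x+1 matching labels.
    """
    if not per_post_labels:
        return None
    label, count = Counter(per_post_labels).most_common(1)[0]
    return label if count > len(per_post_labels) / 2 else None


print(vote_gender(["female", "female", "female", "male"]))  # 3 of 4 posts agree
print(vote_gender(["female", "male"]))                      # tie, no decision
```

Returning no decision on a tie keeps noisy cases out of the demographics instead of forcing a guess.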

All these algorithms are pretty complicated. That is why we have 5 very powerful GPU machines to do these heavy calculations. It is also why you sometimes need to wait a while when we calculate some Instagrammers. However, as the data accumulates, the audiences of different channels overlap more and more, so the calculation keeps getting faster.

Furthermore, we can still tune the data quite a bit because we have lots of training data. Most of the time, it is not really about the algorithm; it’s about the training data you have. Fortunately, we have about 50,000 registered users who authorized us to access their official demographics data. This makes our data much more accurate than our competitors’. For example, we found that women are more willing than men to use their own photos as avatars, so we can adjust the rates accordingly.


Brand Mentions

Here comes the most interesting part for influencer marketing managers: the brands the influencer has mentioned in the past.

brand mentions
The process is very similar to that of getting game titles. We maintain a large pool of well-known brands, then just do a brute-force search on all posts. Be aware that Instagrammers sometimes use hashtags instead of brand names to represent brands, so we also maintain tags for each brand and search for them as well. The tags are especially handy in the Instagram space.
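A brute-force brand search can be sketched as a nested loop over posts and brand terms. The brand names and hashtags below are invented placeholders, and the data layout (one list of search terms per brand, covering both the name and its tags) is an assumption made for this example.

```python
def find_brand_mentions(posts, brands):
    """Map each brand to the indexes of posts mentioning its name or tags."""
    mentions = {}
    for index, caption in enumerate(posts):
        text = caption.lower()
        for brand, terms in brands.items():
            # Plain substring match against every name and hashtag.
            if any(term.lower() in text for term in terms):
                mentions.setdefault(brand, []).append(index)
    return mentions


# Hypothetical brands with their names and hashtags.
brands = {
    "ExampleCola": ["examplecola", "#drinkexample"],
    "SampleShoes": ["sampleshoes"],
}
posts = [
    "Loving my new SampleShoes!",
    "Refreshing #DrinkExample",
    "No brands here",
]
print(find_brand_mentions(posts, brands))
```

Keeping the post indexes, not just a count, is what makes it possible to show every post that contains a brand when it is clicked.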

Just like game titles, when you click on a brand, a popup will show all the posts that contain this brand.

More features are being added to the Instagram profile. For example, similar channels will become available soon. Similarity here does not refer to post content across channels but to audience similarity. A very good use case is that you can run pre-roll ads targeting those similar channels once one of them performs well.


Top Commenters

The Top Commenters section is designed to let influencer marketing managers know whether the selected influencer has been maintaining his or her community well. You can tell by checking whether the influencer is the top commenter. If so, the influencer is constantly commenting and replying under his or her own posts, which means he or she is making an effort to maintain the community by being active.

 

There is also a separate blog post explaining the accuracy of our algorithm, in case you are interested.

If you want to take a look at our sample Instagram influencer profile, please click here.