Facebook Friend Recommendation Using Graph Mining

Swetapadma Das
25 min read · Jul 15, 2021

Still amazed at how accurate the “people you may know” thing on Facebook is?

Real-World Problem

  • Here, we represent each user of a social network as a dot called a node/vertex, and we connect people who are friends/followers/connections with edges, also called links. In computer science and applied mathematics, this whole structure is called a graph.
  • On Facebook, you must have encountered this as the friend recommendation; on Instagram, the same idea appears as a follower recommendation.
  • Given the graph, we take the help of the already existing links. Here, u5 is a friend of u3 and u3 is a friend of u1. Now, is it sensible to assume that u5 could be a friend of u1?
    Or
    Is the u1-u5 link/edge a valid link or not?
  • The graph is a set of vertices and edges. Mathematically, it is written as :
    G = <V,E>
  • Each edge is a pair of two vertices.
    edge = (ui, uj)
  • Directed graphs are typically used for Instagram and Twitter, where one person follows another and the other person can also follow back, so there can be two edges between the same pair of vertices.
    The dataset we have for our problem statement is also a directed graph.
  • For example, Instagram has a follow and follow-back system.
    Follow means you’re opting to follow the person. Follow back means someone followed you and you’re being offered the option to follow them back.
  • A path is a sequence of valid edges to go from one vertex to another, and the length of a path equals the number of edges along it (a small code sketch after this list illustrates this).
    Here, a valid path between a and h:
    a → f → c → b → h, where length = 4
    or
    a → f → c → d → e → b → h, where length = 6
  • Why directed graphs?
    When we are dealing with travel along a path and the cost of moving along the same path changes with direction due to some factor, we have to consider the directions. So a directed graph is used in such situations.
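To make these notions concrete, here is a minimal sketch using the networkx library (which we rely on throughout) that builds a small directed graph and measures a path length; the vertex names are just illustrative.

import networkx as nx

# build a tiny directed graph: an edge (u, v) means "u follows v"
g = nx.DiGraph()
g.add_edges_from([('a', 'f'), ('f', 'c'), ('c', 'b'), ('b', 'h'),
                  ('c', 'd'), ('d', 'e'), ('e', 'b')])

# shortest path from a to h and its length (number of edges along it)
print(nx.shortest_path(g, 'a', 'h'))         # ['a', 'f', 'c', 'b', 'h']
print(nx.shortest_path_length(g, 'a', 'h'))  # 4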

Data Format & Limitations

train.csv
  • The dataset was originally provided by Facebook but is currently unavailable from the original source. I’ll provide the data in my GitHub repository.
  • In a nutshell, the dataset consists of a simple file with two columns.
  • Each datapoint is a pair of vertices.
  • In the dataset, we are given pairs of vertices/nodes, each of which represents a directed edge/link.
  • Data columns ( 2 columns )
    - Source node
    - Destination node
  • The first datapoint in the above diagram states that there is an edge from U1 to U690569. Here, U1 is the source node and U690569 is the destination node.
  • Here, U1 follows U690569, U315892, and U189226: all the edges originating from “1” are listed together. Similarly, there are edges originating from 2, 3, etc. Remember, these are directed edges.
  • In total, we have 1862220 (1.86 million) vertices/nodes and 9437520 (9.43 million) edges/links in our directed graph.
  • We could have metadata, such as the city a person lives in, the college they studied at, whether two people studied at the same college, or other additional information. But, as far as this task is concerned, we only have the directed graph’s nodes and edges and have to work with those.
  • So, this is a purely graph-based link prediction problem.
  • But, as the network grows, people keep following new people. The network is very dynamic in the real world: today I may discover an old friend on Facebook and start following them. As far as this problem is concerned, Facebook gave us a snapshot of the graph at one point in time. So there are some constraints, as we cannot observe the evolution of the graph.

Mapping to a Supervised Classification Problem

Let’s map our data to a supervised classification problem.

  • Let’s say we have vertices Ui and Uj.
    If Ui is following Uj, i.e. there is a directed edge from Ui to Uj:
    → Then we will label it as “1”.
    If Ui is not following Uj, i.e. there is no edge from Ui to Uj:
    → Then we will label it as “0”.
    So, we are mapping this to a binary classification task with “0” implying the absence of an edge and “1” implying the presence of a directed edge.
  • Now, how do we featurize our data ?

    → Feature:
    In the figure below, we are trying to predict whether the edge between U1 and U2 should be present or not.
    U1 follows → {U3, U4, U5}
    U2 follows → {U3, U4, U6}
    Here, U1 and U2 each have their own set of nodes they follow. As they have many common vertices, i.e. the two sets overlap heavily, there is a high chance that U1 and U2 have common interests.
    So, there is a high chance that U1 may want to follow U6 and U2 may want to follow U5.
    Similarly, there is a high chance that U1 could follow U2 and U2 could follow U1.

The fact that U1 is following U2 signifies that there is a high chance that U2 will follow back U1.

  • So, these are called graph features.

Business Constraints & Metrics

  • No low-latency requirements
    You can precompute the top 5 or 10 friends Ui should follow once every 2 days, weekly, or every night, store it in a hash-table-like structure, and show it whenever Ui logs in. As we can precompute, there’s no strong latency requirement.
  • We will recommend the highest-probability links to a user, so we need to predict the probabilities of the candidate links.
    I could have 100 such users whom Ui could follow, along with their probability values. I might have only 5 or 10 slots, where I want to show the most probable top 5 friends the user Ui may want to follow.
  • Ideally, we want high precision and high recall when we are recommending Uj to Ui.

Performance Metric for Supervised Learning

  • Both precision and recall are important, hence F1 score is a good choice here.
  • We will also look at the confusion matrix.
  • Another reasonable metric is Precision@topK.
    Let’s say our K = 10
    Let’s say the recommendations for Ui are {U1, U2, U3, …, U10}: these are the top 10 most probable vertices/friends Ui may want to follow.
    Now,
    Precision@top10 means: how many of them are actually correct?
    In most social networks you cannot show all the users whom Ui may want to follow, as we have limited space. So, this metric is sensible.
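As a quick illustration, here is a minimal sketch of how Precision@top-K could be computed for one user; the function and the example data are hypothetical, not taken from the case-study code.

def precision_at_k(recommended, actually_followed, k=10):
    """Fraction of the top-k recommended users that the user actually followed."""
    top_k = recommended[:k]
    hits = sum(1 for u in top_k if u in actually_followed)
    return hits / k

# hypothetical example: 10 recommendations for Ui, 7 of them turn out correct
recommended = ['U1', 'U2', 'U3', 'U4', 'U5', 'U6', 'U7', 'U8', 'U9', 'U10']
actually_followed = {'U1', 'U2', 'U3', 'U5', 'U6', 'U8', 'U9'}
print(precision_at_k(recommended, actually_followed, k=10))  # 0.7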

Exploratory Data Analysis(EDA)- PART I

  • Here, we are using the networkx library, which is one of the most popular libraries for performing computations on a network or a graph.
  • We will use networkx extensively.
  • There is a file called train.csv which is the raw data.
  • We will create a new file without the headers “source_node” and “destination_node” present in train.csv.
  • Once we have created that, we will build the graph using the networkx library.
    Here, we provide the file without headers and ask networkx to read the edge list and build the graph.
    When creating the graph, we ask it to build a directed graph (nx.DiGraph()).
  • There is a very simple function “nx.info(graph)”, which gives us information about the graph.
    It says :
    - Type : DiGraph
    - Number of nodes : 1862220
    - Number of edges : 9437519
    - Average in degree : 5.0679
    - Average out degree : 5.0679
  • Out-degree: the number of edges originating from U1 and travelling to other vertices.
    In-degree: the number of edges coming into U1 from other vertices.
  • Here, in the graph information, we got the values of the average in-degree and average out-degree: for every vertex, it measures the in-degree and out-degree and then averages them. Our graph information says that on average there are about 5 edges going into a vertex and about 5 edges coming out of a vertex.
  • - A subgraph is basically a subset of vertices and the edges between those vertices. If we want to visualize a subgraph, we can easily create one using the “nx.read_edgelist()” function.
    - We use the file without headers and take only the top 50 rows.
    - We can easily draw it using the “nx.draw()” function and save the figure.
    - In the subgraph information below, the average in-degree and out-degree are 0.75. They are less than 1 because there are some nodes with no edges at all in the subgraph.
    - In that case, we can assume that the nodes with no edges may be new users, temporarily banned users, or users having zero followers or friends.
This is a small subgraph.
  • So, networkx is a very important toolbox.
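A minimal sketch of these steps is shown below; the file names and the exact column handling are assumptions, but the networkx calls (read_edgelist, DiGraph, draw) are the ones described above.

import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd

# assuming 'train_woheader.csv' is train.csv with the header row removed
g = nx.read_edgelist('train_woheader.csv', delimiter=',',
                     create_using=nx.DiGraph(), nodetype=int)
print(g.number_of_nodes(), g.number_of_edges())

# visualize a small subgraph built from the first 50 rows only
pd.read_csv('train.csv', nrows=50).to_csv('train_woheader_sample.csv',
                                          header=False, index=False)
subgraph = nx.read_edgelist('train_woheader_sample.csv', delimiter=',',
                            create_using=nx.DiGraph(), nodetype=int)
nx.draw(subgraph, node_size=40, arrows=True)
plt.savefig('subgraph.png')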

Exploratory Data Analysis(EDA)- PART II

The below code tells us how many people there are in the graph.

dict in_degree → keys are the unique nodes in the graph and the corresponding values are the number of in-degree edges to that node.
dict out_degree → keys are the unique nodes in the graph and the corresponding values are the number of out-degree edges from that node.
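A minimal sketch, assuming the directed graph built above is available as g:

# keys: unique node ids, values: number of incoming / outgoing edges
indegree_dist = dict(g.in_degree())
outdegree_dist = dict(g.out_degree())

print(len(indegree_dist))                   # number of people (nodes) in the graph
print(indegree_dist[1], outdegree_dist[1])  # followers of node 1, people node 1 follows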

In Degree Analysis

  • How many followers are there for each person?
    The number of followers is exactly equal to the in-degree.
    In the below graph, we can observe that:
    — There are many people with 0 followers.
    — There are some users with more than 0 followers, and you can notice a sharp spike which shows that there is at least one user with more than 500 followers in our dataset.
    — But most of the users have fewer than 40 followers.
  • As there is a scale difference in the above graph, let’s zoom in.
    — As we move from the 0th user to the 1.86 millionth user, sorted by number of followers, I get this steep curve.
    — I’m zooming in on the graph from the 0th user to the 1.5 millionth user.
    — So, in the below figure, we can observe that around 1.5 million users have fewer than 7 followers.
  • Let’s plot a boxplot.
    — Here, there are a bunch of outliers.
    — A few people have many followers, but most people don’t.
    — Here the mean, median, 25th percentile, 50th percentile and 75th percentile are all very small, so the plot is quite difficult to read.
  • Let’s look at the 90th to 100th percentiles.
    — We can observe that 90% of the users have fewer than 12 followers.
    — 99% of the users have 40 or fewer followers.
    — There is one user with 552 followers.
  • Let’s zoom in from 99th percentile to 100th percentile.
    — Only 0.1% of users have more than 112 followers.
  • Of course, we can also plot the PDF, and we observe the same pattern.

Out Degree Analysis

  • How many people is each person following?
    — There is one person who is following more than 1500 people.
    — Most people are following very few people.
  • Let’s zoom in on the graph from the 0th user to the 1.5 millionth user.
    — We can observe from the graph that around 1.5 million users follow fewer than 7 people.
  • Let’s draw a box plot for the analysis.
    — There are a bunch of outliers.
    — Reading other parameters like the mean, median and percentiles from the box plot is difficult, as they are very small.
  • There is one person who is following 1566 people, and 99% of people are following fewer than 40 people.
  • 99.9% of people are following fewer than 123 people.
  • pdf of out degree analysis
  • The number of persons not following anyone is 274512, i.e. 14.74%.
  • The number of persons having zero followers is 188043, i.e. 10.10%.
  • The number of persons who are not following anyone and also have no followers is 0.

Both ( followers + following )

In Di-Graph:
degree(Ui) = In_degree(Ui) + Out_degree(Ui)

  • The minimum number of (followers + following) is 1.
    334291 people have this minimum number of (followers + following).
  • The maximum number of (followers + following) is 1579.
    Only 1 person has this maximum number of (followers + following).
  • The number of persons having (followers + following) less than 10 is 1320326.
  • Number of weakly connected components: 45558
    Weakly connected components with 2 nodes: 32195
    If all the nodes in a subgraph are connected by some path, ignoring direction, then that subgraph is a weakly connected component.

Exploratory Data Analysis(EDA)- PART III

BINARY CLASSIFICATION TASK

(POSING A PROBLEM AS CLASSIFICATION PROBLEM)

  • We want to convert this into a binary classification problem, as discussed before, such that existing edges are labelled “1” and non-existing (absent) edges are labelled “0”.
  • In the original graph in the above figure, there is no edge between U1 and U2, U1 and U3, U1 and U4, U2 and U3, U2 and U4, U3 and U4.
  • These edges should be labelled “0” in the data. But such pairs are not provided in the train.csv file, i.e. we don’t have zero-labelled data in the dataset.
  • So, I can add those node pairs with no edge between them and label them “0” in the data.

Now, there’s a catch.

  • We need to generate data for label “0”.
  • For each vertex, I can have (n-1) possible edges.
  • As there are n vertices, I will have in total n(n-1) possible edges.
  • If I have 1.86 million nodes, then the number of possible edges = 1.86 x 10⁶ x (1.86 x 10⁶ - 1), which is very large.
  • But we only need a small fraction of these possible edges (about 9.43 million, to match the number of existing edges) so that our binary classification dataset is balanced.
Let’s call the possible edges which are not present to be bad edges.
  • We will now randomly sample 9.43 million edges, out of all possible edges which are not present in train.csv, and construct our new dataset.
  • Remember, we will generate only those bad links which are not in the graph and whose shortest path length is greater than 2 (a sketch of this sampling follows).
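Below is a minimal sketch of how such “bad” (label-0) edges could be sampled; the function and variable names are assumptions, but the logic follows the rule stated above (the pair must not already be an edge, and its shortest path must be longer than 2).

import random
import networkx as nx

def sample_bad_edges(g, n_samples, seed=25):
    """Randomly sample missing directed edges whose endpoints are more than 2 hops apart."""
    random.seed(seed)
    nodes = list(g.nodes())
    bad_edges = set()
    while len(bad_edges) < n_samples:
        a, b = random.sample(nodes, 2)
        if g.has_edge(a, b) or (a, b) in bad_edges:
            continue
        try:
            if nx.shortest_path_length(g, a, b) > 2:
                bad_edges.add((a, b))
        except nx.NetworkXNoPath:
            bad_edges.add((a, b))   # unreachable pairs are also valid label-0 edges
    return list(bad_edges)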

Exploratory Data Analysis(EDA)- PART IV

TRAIN & TEST SPLIT

In the real world, if you are an employee of Facebook or Instagram, you have to deal with the temporal nature of the data, so you would need to carry out a time-based split.

But we are just provided with a graph at a particular timestamp. Since we don’t have the timestamps, we will carry out random splitting.

  • We split the data in an 80:20 ratio.
  • We split the positive links and negative links separately, because only the positive training data is needed for building the graph and for feature generation.
  • For positive links : X_train_pos, X_test_pos, y_train_pos, y_test_pos
  • For negative links: X_train_neg, X_test_neg, y_train_neg, y_test_neg
  • Number of data points in the graph with edges (label 1) = 9437519
    Number of data points in the graph without edges (label 0) = 9437519
  • Number of data points in the train data with edges = 7550015
    Number of data points in the train data without edges = 7550015
  • Number of data points in the test data with edges = 1887504
    Number of data points in the test data without edges = 1887504
  • We removed the header and saved the files separately.
  • We created di-graph separately for train positive and test positive.
  • We found the unique nodes in both train positive and test positive graphs.
  • Number of people common to train positive and test positive = 1063125
  • Number of people present in train but not in test = 717597
  • Number of people present in test but not in train = 81498
  • % of people not present in train but present in test, out of the total test users = 7.12%
  • Since 7.12% of the test users are not already present in the train data, we face a cold-start problem.
  • Now, we appended X_train_neg to X_train_pos.
  • We concatenated (y_train_pos,y_train_neg).
  • We appended X_test_neg to X_test_pos.
  • We concatenated (y_test_pos,y_test_neg).
Data points in train data: (15100030, 2)
Data points in test data: (3775008, 2)
Shape of target variable in train: (15100030,)
Shape of target variable in test: (3775008,)

Graph : Featurization

Now, we will convert these pairs of vertices into numerical, categorical or binary features so we can apply machine learning, as we cannot run algorithms directly on the raw vertex pairs.

For any graph-based machine learning problem, of course featurization is the most important part of the project.

For the code: refer to the “fb_featurization ipynb notebook” in my GitHub repository.
Remember, feature engineering on existing features and creation of new features should be done after the train-test split to avoid data leakage.

Here we will just focus on how to featurize graph based data.

First of all, we will operate on the sets of followers and followees.

1) Similarity Measures: Jaccard Distance

X is the set of followers of U1
Y is the set of followers of U2

𝑗=|𝑋∩𝑌|/|𝑋∪𝑌|

J = 2/4

Given any two sets, the Jaccard distance (or Jaccard similarity coefficient) is a statistic used for gauging the similarity of sample sets: the size of X intersection Y divided by the size of X union Y.

In a nutshell, if there are a lot of common followers between U1 and U2, then U1 may want to follow U2 and U2 may want to follow U1.

  • The higher the Jaccard index, i.e. the more the two sets of followers or followees overlap, the higher the probability that an edge or link exists between U1 and U2.
  • I can find jaccard index for follower set as well as for followee set.
  • We will use this metric in our actual model too.
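A minimal sketch of the Jaccard feature on the follower sets, assuming the positive train graph is available as train_graph (the followee version just swaps predecessors for successors):

def jaccard_for_followers(a, b):
    """|followers(a) ∩ followers(b)| / |followers(a) ∪ followers(b)|"""
    try:
        X = set(train_graph.predecessors(a))
        Y = set(train_graph.predecessors(b))
        if len(X | Y) == 0:
            return 0
        return len(X & Y) / len(X | Y)
    except Exception:   # e.g. node not present in the train graph
        return 0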

2) Similarity Measure : Cosine distance (Otsuka-Ochiai coefficient)

X is the set of followers of U1
Y is the set of followers of U2

Cosine Similarity = (X · Y) / (||X|| ⋅ ||Y||),
which is the form used when X and Y are vectors.

But, we will use a simple extension to cosine distance, which is Otsuka Ochiai Coefficient.

Cosine Distance(Otsuka-Ochiai coefficient)= |𝑋∩𝑌|/ SQRT(|𝑋|⋅|𝑌|)
Otsuka-Ochiai coefficient is used when X and Y are set.

  • So, Cosine distance ( Otsuka-Ochiai Coefficient ) will be high when there is more overlap between sets X and Y.
  • Here also, we will compute for the follower as well as for followee.
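And a minimal sketch of the Otsuka-Ochiai coefficient on the same follower sets (again assuming train_graph is the positive train graph):

import math

def cosine_for_followers(a, b):
    """|X ∩ Y| / sqrt(|X| * |Y|) over the follower sets of a and b."""
    try:
        X = set(train_graph.predecessors(a))
        Y = set(train_graph.predecessors(b))
        if len(X) == 0 or len(Y) == 0:
            return 0
        return len(X & Y) / math.sqrt(len(X) * len(Y))
    except Exception:
        return 0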

3) Page Rank

  • PageRank is a very very popular featurization technique for directed graphs as it was used by Google.
  • PageRank was named after Larry Page, one of the founders of Google.
  • It is an algorithm used by google search to rank web pages in their search engine results.
  • It is a way of measuring the importance of website pages.
  • PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the page is.
  • If a lot of pages have “B” as their destination, then “B” must be important.
    and
    If my page “B” is linked from many important pages like C, E, D, etc., then page “B”’s value increases.
  • If a user Ui has a high PageRank score, it implies that other users, including highly important users, are linking to Ui.
  • PageRank can tell us about relative importance.
  • Bill Gates is a celebrity.
    He is followed by some important people like Mark and Warren Buffett and also by some common people like me. So it is quite certain that he is also an important person.
    Now there is a significantly higher probability that Bill Gates will be followed by “you”.
  • Now, for that 7.12% of test users which are not present in the train data, we can assign the mean PageRank value as their PageRank score.
  • The way we will use this for both vertices (source and destination) is sketched below:
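A minimal sketch of how the PageRank scores could be computed on the train graph, with the mean value as a fallback for cold-start nodes; train_graph is assumed to be the positive train graph.

import networkx as nx

pr = nx.pagerank(train_graph, alpha=0.85)     # dict: node -> PageRank score
mean_pr = sum(pr.values()) / len(pr)

def pagerank_feature(u):
    # nodes absent from the train graph (cold start) get the mean PageRank
    return pr.get(u, mean_pr)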

4) Shortest Path

  • If there is no path between the two vertices, then we will use a default value of “-1”.
  • If there is a direct edge between the two vertices, the shortest path would trivially be of length 1, which is not informative. In that case, we temporarily remove that edge and then calculate the shortest path between the two vertices.
  • The shortest path measures how far apart two vertices are at the least.
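A minimal sketch of such a helper (my reconstruction of the compute_shortest_path_length function referenced in the WCC code further below); train_graph is the positive train graph.

import networkx as nx

def compute_shortest_path_length(a, b):
    p = -1
    try:
        if train_graph.has_edge(a, b):
            # temporarily drop the direct edge, otherwise the answer is trivially 1
            train_graph.remove_edge(a, b)
            try:
                p = nx.shortest_path_length(train_graph, source=a, target=b)
            except nx.NetworkXNoPath:
                p = -1
            train_graph.add_edge(a, b)          # restore the original graph
        else:
            p = nx.shortest_path_length(train_graph, source=a, target=b)
    except Exception:                            # e.g. no path or node not in graph
        p = -1
    return p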

5) Connected Components

  • Strongly Connected Component
    A graph is said to be strongly connected if every vertex is reachable from every other vertex. The strongly connected components of an arbitrary directed graph form a partition into subgraphs that are themselves strongly connected.
  • Weakly connected component
    A weakly connected component is one in which all components are connected by some path, ignoring direction.
  • Checking for same weakly connected component (Community)

When I ignore directions, I have a path from each vertex to every other vertex within S1, and similarly within S2. So S1 and S2 are weakly connected components of the graph.

My college friends and my work-related friends can be completely different. So weakly connected components actually create something like a “community”.

A group of similar users, people with similar interests, people from the same college, or people with a shared interest in machine learning can probably form the same community. So, if Ui and Uj belong to the same weakly connected component, then it is quite probable that they have some interest in common, belong to the same city, or share something else.

If there were even one edge between S1 and S2, then we would have a single weakly connected component instead of two.

# getting weakly connected components from the train graph
wcc = list(nx.weakly_connected_components(train_graph))

def belongs_to_same_wcc(a, b):
    index = []
    if train_graph.has_edge(b, a):
        for i in wcc:
            if b in i:
                index = i
                break
        if (b in index):
            train_graph.remove_edge(b, a)
            if compute_shortest_path_length(b, a) == -1:
                train_graph.add_edge(b, a)
                return 0
            else:
                train_graph.add_edge(b, a)
                return 1
        else:
            return 0

    if train_graph.has_edge(a, b):
        for i in wcc:
            if a in i:
                index = i
                break
        if (b in index):
            train_graph.remove_edge(a, b)
            if compute_shortest_path_length(a, b) == -1:
                train_graph.add_edge(a, b)
                return 0
            else:
                train_graph.add_edge(a, b)
                return 1
        else:
            return 0
    else:
        for i in wcc:
            if a in i:
                index = i
                break
        if (b in index):
            return 1
        else:
            return 0
else:
for i in wcc:
if a in i:
index= i
break
if(b in index):
return 1
else:
return 0

In the above code:

  • First, we check whether there is a direct edge from b to a. If yes, then we check whether both b and a belong to the same WCC/community.
  • If yes, then we temporarily remove the direct edge from b to a and calculate the shortest path from b to a. If no such path exists, we declare that they are not in the same WCC/community, even though technically both of them belong to the same WCC.
  • However, if there does exist a shortest path from b to a, then it means they could be friends, and hence we declare them to be in the same WCC/community.
  • Secondly, we check whether there is a direct edge from a to b and follow the same approach as above, returning 1 or 0.
  • If, however, there is no direct edge from a to b or from b to a in the first place, then we simply check whether they belong to the same WCC. If yes, we return 1, else 0.
  • Why have we checked for the existence of a direct edge from b to a and from a to b?
    In our dataset, most of the data points have a direct edge from a to b or from b to a. So if we don’t check for this condition, the feature would return 1 for most of the data points. Hence, if there is a direct edge, our strategy is to remove it and check whether a is still reachable from b (or b from a) via an indirect path; only then do we declare them to be in the same WCC/community.

6) Adar Index

  • The Adamic/Adar Index is a measure designed to predict links in a social network.
  • The neighborhood N(x) is basically the subset of vertices/nodes that are connected (in either direction) to x.
  • Now, the formula for the Adar Index is:
    A(x, y) = Σ over u ∈ N(x) ∩ N(y) of 1 / log(|N(u)|)
  • Here, we apply this function to every u that belongs to the intersection of the neighborhood of node x and the neighborhood of node y.
  • Here, let’s say U1 and U2 are two vertices belonging to the intersection of N(x) and N(y), i.e. both are connected to both x and y.
    U1 has a very large neighborhood, so it is probably a celebrity; hence there is only a small chance that x and y are related through U1.
    U2, on the other hand, has a small neighborhood, like a common person; since a small group can be a college group or a work group, x and y are more likely to be related in this case.
  • As the size of N(u) increases, log(|N(u)|) increases and 1/log(|N(u)|) decreases. So the contribution of nodes like U1, which have a large number of neighbors, to the Adar Index is small, and vice versa.
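A minimal sketch of the Adar Index on our directed train graph, treating the neighborhood N(u) as the union of successors and predecessors of u (an assumption consistent with the definition above):

import math

def neighborhood(u):
    # nodes connected to u in either direction
    return set(train_graph.successors(u)) | set(train_graph.predecessors(u))

def adar_index(x, y):
    score = 0
    try:
        for u in neighborhood(x) & neighborhood(y):
            n = len(neighborhood(u))
            if n > 1:                 # skip log(1) = 0, which would divide by zero
                score += 1 / math.log(n)
        return score
    except Exception:
        return 0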

7) Does the person follow back?

If b is following a, then there is a very high probability that a will follow b back.
So if there is a directed edge from b to a, the function returns 1, else 0. This is called the follows_back feature, and it is a very useful feature in social media like Instagram (see the sketch below).
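A minimal sketch of this feature, assuming train_graph:

def follows_back(a, b):
    # 1 if b already follows a, i.e. there is a directed edge b -> a
    return 1 if train_graph.has_edge(b, a) else 0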

8) Katz Centrality

  • It is similar to what we have seen in Google’s pagerank, but quite an old technique from 1953.
  • Katz centrality computes the centrality of a node based on the centrality of its neighbors. It is a generalization of eigenvector centrality.
  • It is used to measure the relative degree of influence of neighboring nodes within a graph or social network.
  • The Katz centrality for node i is:
    x_i = α Σ_j A_ij x_j + β,
    where A is the adjacency matrix of the graph G with eigenvalues λ.
  • The parameter β controls the initial centrality, and α < 1/λ_max.
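networkx ships an implementation of Katz centrality; a minimal sketch of how it could be used is below. The α and β values are illustrative, and the mean is again just a fallback for cold-start nodes.

import networkx as nx

katz = nx.katz_centrality(train_graph, alpha=0.005, beta=1)   # dict: node -> score
mean_katz = sum(katz.values()) / len(katz)

def katz_feature(u):
    return katz.get(u, mean_katz)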

9) Hits Score

  • HITS stands for Hyperlink-Induced Topic Search (often referred to as hubs and authorities); it is a link analysis algorithm that rates web pages.
  • The HITS algorithm computes two numbers for each node. The authority score estimates the node’s value based on its incoming links; the hub score estimates the node’s value based on its outgoing links.
  • Hubs → many outlinks, e.g. yahoo.com
  • Authorities → many inlinks, e.g. cnn.com, bbc.com, mit.edu
  • So, HITS gives every web page wi two scores <hubs, authorities>.
  • Step 1: This is an iterative algorithm. At the very initial stage, for every page, we have auth(p) = 1 and hub(p) = 1.
  • Step 2: Authority update rule: for each p, we update auth(p) to
    auth(p) = Σ over q ∈ P_to of hub(q),
    where P_to is the set of all pages which link to page p.
  • Step 3: Hub update rule: for each p, we update hub(p) to
    hub(p) = Σ over q ∈ P_from of auth(q),
    where P_from is the set of all pages to which p links.
  • Step 4: If we keep running these update rules as-is, the scores grow without bound. So after each update of the authority or hub scores, we normalize them so that the values do not become infinitely large.
  • We run these steps iteratively; it can be proven mathematically that the authority and hub scores of every page eventually converge. So we run the steps repeatedly until the authority and hub scores no longer change.
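networkx also provides HITS; a minimal sketch is below, with the (hubs, authorities) dictionaries used as per-node features.

import networkx as nx

hubs, authorities = nx.hits(train_graph, max_iter=100)   # two dicts: node -> score

def hits_features(u):
    # hub score and authority score of a node, 0 if unseen
    return hubs.get(u, 0), authorities.get(u, 0)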

10)SVD

  • So you are given a graph G with directed edges Ui → Uj.
  • There is a concept called the adjacency matrix, which represents all of the information we have in our graph (here, around 1.78 million users) in the form of a matrix.
  • Adjacency matrix changes graph to graph. Let’s stick to the graph we have.
  • Adjacency matrix(A) of G, where A is of size 1.78M x 1.78M.
  • This matrix is going to be a nxn matrix, whose ijth cell is going to be 1 if there is a directed edge from Ui to Uj else 0.
  • In our case, the adjacency matrix is a binary matrix.
  • There are other variations of this matrix, for example for weighted graphs where the entries carry edge weights, but for now let’s stick to our directed binary graph.
  • We will get very large and sparse matrix as most of the values will be zero.
  • We will decompose the adjacency matrix using SVD.
  • A = U Σ VT
    Where U = size of 1.78M x 6
    Σ = size of 6 x 6
    VT(transpose) = size of 6 x 1.78M
  • U’s ith row and VT’s ith column are vectors of the same size; both are 6-dimensional representations of the ith vertex of the graph, obtained from the left singular matrix U and the right singular matrix VT respectively.
  • So, for a pair of vertices (Ui, Uj) for which we need to predict whether an edge exists, we have 24 features: 6 from U and 6 from VT for Ui, and 6 from U and 6 from VT for Uj.
  • Matrix Factorization helps to incorporate implicit feedback i.e information that is not directly given but can be derived from analyzing user behavior.
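A minimal sketch of the rank-6 decomposition using scipy’s sparse SVD; the node ordering used for indexing into U and VT is an assumption you would have to keep consistent in real code.

import networkx as nx
import numpy as np
from scipy.sparse.linalg import svds

nodes = list(train_graph.nodes())
node_index = {n: i for i, n in enumerate(nodes)}        # node id -> row/column index

# sparse binary adjacency matrix: A[i, j] = 1 iff there is an edge nodes[i] -> nodes[j]
A = nx.adjacency_matrix(train_graph, nodelist=nodes).astype(float)

U, s, Vt = svds(A, k=6)                                 # U: n x 6, s: 6, Vt: 6 x n

def svd_features(a, b):
    i, j = node_index[a], node_index[b]
    # 6 + 6 features for each endpoint -> 24 features per candidate edge
    return np.concatenate([U[i], Vt[:, i], U[j], Vt[:, j]])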

11)Weight Features

If Ui is a celebrity, the chance that any two random neighbours of Ui know each other is very small. But if Uj is not a celebrity, just a common person like us, then the chance that any two random neighbours of Uj know each other is quite high.

How do we encode whether a node is a celebrity or not, using just the number of links coming into it (or going out of it)?

There is a weight feature “W” for any node Ui:

Here, X = the set of vertices linking into Ui (or linking out from Ui), and the weight shrinks as |X| grows.
  • We can compute the weight features for in-links and out-links in the same way.

For every vertex Ui, we have these 6 features.

  • weight of incoming edges
  • weight of outgoing edges
  • weight of incoming edges + weight of outgoing edges
  • weight of incoming edges * weight of outgoing edges
  • 2*weight of incoming edges + weight of outgoing edges
  • weight of incoming edges + 2*weight of outgoing edges

In order to determine the similarity of nodes, an edge weight value was calculated for each node. The edge weight decreases as the neighbor count goes up. As we have directed edges, we calculated the weighted-in and weighted-out values separately (a small sketch follows below).
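A minimal sketch of these weight features is below. The exact weighting formula was shown as an image in the original write-up; here I assume the common choice weight = 1 / sqrt(1 + |X|), which matches the description that the weight falls as the neighbor count grows.

import math

def weight_in(u):
    # X = set of vertices linking into u; weight shrinks as |X| grows (assumed formula)
    return 1 / math.sqrt(1 + len(set(train_graph.predecessors(u))))

def weight_out(u):
    # X = set of vertices u links out to
    return 1 / math.sqrt(1 + len(set(train_graph.successors(u))))

def weight_features(u):
    win, wout = weight_in(u), weight_out(u)
    return [win, wout, win + wout, win * wout, 2 * win + wout, win + 2 * wout]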

The most important part of this case study was featurization. Now it’s quite easy to work through the rest of the problem.

These are the extracted features.

Let’s train our model.

Machine Learning Model

  • Based on our case study, a fairly non-linear model might work well.
  • Here, I have used the df_final_test data for cross-validation. Here, “test” basically means cross-validation, just to be clear.
  • I will be using a Random Forest Classifier, because we know random forests tend to work well when there is non-linearity and a reasonable number of features. I don’t need to do any non-linear transformation here.
  • I recommend you try other models too, but I got pretty good accuracy here. Of course, SVM, GBDT and XGBoost might also work well on this problem.
  • So, there are two main hyperparameters for the Random Forest:
    - Number of estimators
    - Depth
  • I carried out standard hyperparameter tuning, using the F1 score as the metric.
  • I tried training my model with 10, 50, 100, 250 and 450 base models.
  • Here, I plotted how the score changes between my train data and my test/cross-validation data as the number of estimators changes.
  • I also tuned the depth of the base learners (depths of 3, 9, 11, 15, 20, 35, 50, 70, 130), plotted the scores as the depth changes, and observed that beyond a certain depth the model does not improve significantly.
  • I found reasonably good values for the number of base learners/estimators and the depth using RandomizedSearchCV; a sketch of such a search follows the final model below.
RandomForestClassifier(max_depth=14, min_samples_leaf=28, min_samples_split=111, n_estimators=121, n_jobs=-1, random_state=25)
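For reference, here is a minimal sketch of the kind of RandomizedSearchCV setup that could produce such a model; the parameter ranges and the data names (df_final_train, y_train) are assumptions, not the notebook’s exact values.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_dist = {
    'n_estimators': randint(100, 500),
    'max_depth': randint(5, 20),
    'min_samples_split': randint(50, 200),
    'min_samples_leaf': randint(20, 60),
}
clf = RandomForestClassifier(random_state=25, n_jobs=-1)
search = RandomizedSearchCV(clf, param_distributions=param_dist,
                            n_iter=10, scoring='f1', cv=3,
                            random_state=25, n_jobs=-1)
search.fit(df_final_train, y_train)        # assumed feature matrix and labels
print(search.best_estimator_)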
  • Here, we can observe the F1 scores on train and test, which are 96.2% and 92.5% respectively. Since these two values are quite close, we can conclude that the model is not overfitting. Again, “test” here basically means cross-validation.
  • I plotted the confusion matrix to get an intuition for the errors. According to the observations, precision for class “0” is dropping and recall for class “1” is dropping.
Train confusion_matrix
Test confusion_matrix
  • This gives us the intuition that we could add more features to classify more precisely whether a pair belongs to class “0” or not. The model does pretty well on the train data while it is a little duller on the test data. This can happen due to several issues; recall, for instance, the cold-start problem, where some of the vertices in the test set were not present in the train set at all.
  • The very encouraging thing is that there is a fairly small difference between the train score and the test score, which indicates essentially no overfitting. So even on future or completely new data, the metrics should hold up pretty well.
  • Receiver operating characteristic (ROC) curve for the cross-validation data:
  • The most important part of the analysis after training our model is feature importance. follows_back turns out to be the most important feature: if Ui is following Uj, then there is a very high chance that Uj will follow back Ui. If you observe the feature importance graph, the SVD, PageRank and Katz features are the lowest. Sometimes very simple features stand out; of course, complex features still add value to the performance.
  • The Random Forest Classifier turns out to be one of the best performing models.

The implementation is available on my GitHub page:
