Relationship Prediction in Social Networks
Relationship Prediction in Social Networks
Extracted features based on the triadic closure theory. Applied multi-layer perceptron with scikit-learn. Entered the top 20% in the Kaggle in-class competition.
In real world social networks, predicting the relationship between two users is an important but also challenging task. In this report, we apply machine learning algorithms to a set of real world data to predict the existence of edges between nodes.
The Problem
Date sets of this project include two parts, training data set and test data. The training set each is a partial crawl of the Twitter social network from several years ago. The training set has 20,000 sources, and each source contains 1,200 sinks on average. The test set is a list of 2,000 pairs of a source and a sink, which represents pairwise relations in the Twitter social network. 1,000 of these test edges are real and withheld from the training network, while the other 1,000 do not actually exist.
In this project, we selection positive and negative training samples from the training data set, generate features from them, and fit those data in a multilayer perceptron model. With the trained neural network, we can predict the probabilities that the edges are true.
Feature Engineering
In this project, the reason behind our selection of features is derived from the triadic closure theory: in a social network, two people are more likely to be friends in the future if they have a common friend 1. Based on this theory, the first feature is the number of nodes which have edges from the source and have edges to the sink, we call these nodes common nodes. The second one is sum of weighted number of common nodes:
$$ \text{Score}(\text{source, sink}) = \sum_{k\in \text{neighbour}(i)\cap \text{neighbour}(j)} \frac{1}{\text{neighbour}(k)} $$
The third one is the number of the edges of the source and the last one is the number of the edges of the sink, because people who have more friends are more likely to have relationships between others. Both numbers are calculated by logarithm.
Methodology
This report introduces the use of Multi-Layer Perceptron (MLP) classifier to predict whether pairwise relationship exist among the two users.
Multi-Layer Perceptron (MLP) is a class of feed-forward artificial neural network that maps a set of input vectors to a set of output vectors. It can be thought of as a directed graph consisting of multiple
1 node layers, each connected to the next layer. Except for input nodes, each node is a neuron (or processing unit) with a non-linear activation function. MLP is a generalization of perceptron, which overcomes the weakness that the perceptron only can recognize linear separable data.
Problems and Solutions
Since the edges between nodes are unidirectional and the crawled network is only a portion of the whole network, most of the sinks does not exist as sources and most of the sources does not exist as sinks in the original data set. Therefore, when we try to process the data and generate training samples, we encounter a problem that if we do a totally random selection of source and sink pairs from all nodes, only a few samples are positive (true edges in the training data). We take a new approach to get the training samples. For positive samples, we randomly choose one source node from the original data, then randomly choose the sink from its sinks. For negative samples, we randomly choose a source from the original data, then we keep choosing a sink randomly until we find one that is not a sink of the source. Although the selected negative sample may be positive in the real world network due to the incompleteness of the crawled data, this is the best strategy we can come up with.
Another problem we have encountered is speed. When trying to find out a feature represent the primary relationship between the source and sink, we calculate the number of common nodes between them. In our naive approach, we do a traverse of the whole data set to find the sources of the sink. Soon we realize that we need a trade-off between time and space since each run takes hours to finish. Alternatively ,we have a single traverse of the data and save all the sinks and their corresponding sources into a separate hash table. Following this approaching, each prediction only takes a couple of minutes.
Improvements
In the chosen features, we calculate the number of sinks of a source and sources of a sink. However, the numbers may vary greatly. We improve it by taking the logarithms of the numbers, which actually achieves a higher accuracy. Furthermore, for the number of common nodes, we treat equally without discrimination to every node, but the nodes have different number of edges. For the nodes have less edges, they should be more important. Thus, we weight common nodes by dividing by the number of their edges. We take the weighted number of common nodes as a new feature.
An important process during the improvement is parameter optimization. After comparing several models, we choose the MLP classifier as the model. We focus on three parameters of the model, including the size of hidden layer, solver and the initial learning rate. The first parameter is a tuple, the ith element in the tuple represents the number of neurons in the ith hidden layer. The default value of it is (100, 0) and we set it as (50,50). Because if we add more layers, the speed of model fitting will be unacceptable and (50,50) is the best fit after comparison. The second parameter, solver, is the solver for weight optimization. There are three choices which are ‘lbfgs’, ‘sgd’ and ‘adam’. After comparison, the default value, ‘adam’, get the best result. The third parameter is the initial learning rate used. It controls the step-size in updating the weights. The default value is 0.001 and we change to 0.0001 and 0.01. After comparison, we choose the default value which get the best result.
Alternatives and Comparison
During the exploration, there are several other methods we found that have similar performance compared to our final choice.
2 Support vector machine is supposed to be one of the most powerful classification algorithms in machine learning. However, the SVM classifier implemented in the scikit-learn package seems to be impractical because of its extremely poor speed. The document of scikit indicates that the underlying algorithm uses libsvm which is written in C language and optimized. We have tried to make the training data set shrink to hundreds of items and gotten low AUCs (which are under 0.7 in a few tests). With an increase of a thousand training samples, the time consumed by training models becomes unacceptable.
Ensemble algorithms such as bagging decision tree and random forest can also achieve a prediction with AUC > 0.80. Random forest is known to have good performance under non-linear circumstances or handling high-dimension features. However, when dealing with low-dimension data in this project, random forest tends to have unstable performance, possibly due to over-fitting.
Logistic regression algorithm implemented in scikit-learn is a simple but also a decent regression algorithm. During the initial state of this project, we use this algorithm to compare different selections of features, because of its stable performance and speed. Nevertheless, this algorithm is difficult to be optimized to achieve a higher AUC.
Limitations and Potentials
There are attempts of finding more features including secondary relationships. We got a pretty high AUC (greater than 0.90) when testing with our own randomly generated test data. But those features failed to describe the public test data since there is a huge difference between the online testing result (lower than 0.80). We suspect that there are flaws in our training data set in the terms of secondary relationships, especially in negative training samples. Other different sampling methods have been adopted but failed to solve this problem. Thus, in our final method, this feature is abandoned.
Since we are given the data of a relation network, there are possibly more information we could dig for. For example, dumping those data into a graph database could make it more efficient to select and calculate features.
TensorFlow is a powerful tool for machine learning tasks especially for neural networks and deep learning. We have already implemented a logistic regression algorithm but have failed to do the optimization work and have not explored the further potentials of TensorFlow due to time limit. An improved, customized and optimized multiple layer perceptron is believed to achieve a better performance compared to the one in scikit-learn.
Conclusion
In our final choice, the data are pre-processed into four-dimension features. Positive and negative training samples are randomly selected and fitted into a multiple layer perceptron. The final AUC we believed is around 0.85 which is the best performance that we could achieve during this competition.
Footnotes
-
Huang, H., Tang, J., & Liu, L. (2015). Triadic Closure Pattern Analysis and Prediction in Social Networks. IEEE Transactions On Knowledge And Data Engineering, 27(12). ↩