Fast Graph Embedding
-
Sampled Spectral Distance Embedding (SSDE)
Contributors: Ali Civril, Eli Bocke-Rivel, Jonathan Purnell,
Malik Magdon-Ismail
Summary:
A given (undirected and connected) graph is
embedded into Cartesian space in such a way as to minimize the loss of
information contained in pairwise distances.
The algorithm executes quickly due to
clever sampling of the nodes and a matrix approximation method that
runs in linear time with respect to the number of nodes in the graph.
Code:
SSDE;
Test Data;
Documentation.
Papers:
SSDE (2007).
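The embedding step described above can be sketched with classical MDS on graph (hop) distances. A minimal numpy sketch, leaving out SSDE's node sampling and fast matrix approximation (the function names here are illustrative, not from the released code):

```python
import numpy as np
from collections import deque

def bfs_dist(adj, src):
    """Hop distances from src in an unweighted graph (adjacency dict)."""
    d = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in d:
                d[v] = d[u] + 1
                q.append(v)
    return d

def classical_mds(D, k=2):
    """Embed a pairwise-distance matrix D into R^k, preserving distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J          # double-centered Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:k]        # top-k eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

# Tiny example: a 5-node path graph 0-1-2-3-4
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
nodes = sorted(adj)
D = np.array([[bfs_dist(adj, u)[v] for v in nodes] for u in nodes], float)
X = classical_mds(D, k=2)                # Cartesian coordinates per node
```

SSDE gains its speed by computing distances only from a sampled set of nodes and approximating the rest, rather than forming the full matrix as done here.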
Unsupervised Learning
-
Low-Rank Perturbation Gaussian
Mixture Model (LRPGMM)
Contributors: Jonathan Purnell, Saurabh Paul, Malik Magdon-Ismail
Summary:
A modified EM algorithm for
training a Gaussian Mixture Model (GMM). The defining difference is the
approximation of the covariance matrix for each mixture. The GMM can be
trained using the full covariance matrix, a diagonal matrix
approximation (i.e. only the variances), or using a low-rank perturbed
diagonal matrix decomposition (LRPDD). With the LRPDD, the covariance
matrix is represented by a linear number of parameters (with respect to
the dimension of the data being modeled). This decomposition allows for
linear-time (vs. quadratic or cubic) training of the GMM and reduces the
overfitting that occurs with a full-rank covariance matrix.
Additionally, there is a naive overlapping clustering algorithm
built into the code that clusters the data based on the posterior
probabilities.
Code:
LRPGMM;
Documentation.
Note: This code relies on the ALGLIB
library.
Papers:
LRPGMM (IDEAL 2010);
LRPGMM (IJDMMM 2011).
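The linear-parameter covariance above can be made concrete: with Sigma = diag(d) + V V^T, the Gaussian log-density can be evaluated without ever forming Sigma, using the Woodbury identity and the matrix determinant lemma. A hedged numpy sketch (function and variable names are illustrative):

```python
import numpy as np

def lowrank_diag_logpdf(x, mu, d, V):
    """
    Gaussian log-density with covariance Sigma = diag(d) + V V^T,
    evaluated via the Woodbury identity in O(n k^2) instead of O(n^3).
    d: (n,) positive diagonal; V: (n, k) low-rank perturbation.
    """
    n, k = V.shape
    r = x - mu
    Dinv_r = r / d                       # D^{-1} r
    Dinv_V = V / d[:, None]              # D^{-1} V
    M = np.eye(k) + V.T @ Dinv_V         # k x k capacitance matrix
    # Sigma^{-1} r = D^{-1} r - D^{-1} V M^{-1} V^T D^{-1} r
    Sinv_r = Dinv_r - Dinv_V @ np.linalg.solve(M, V.T @ Dinv_r)
    # log|Sigma| = log|M| + sum log d   (matrix determinant lemma)
    logdet = np.linalg.slogdet(M)[1] + np.sum(np.log(d))
    return -0.5 * (n * np.log(2 * np.pi) + logdet + r @ Sinv_r)
```

Inside EM, this replaces the cubic-cost density evaluation of a full covariance; the released LRPGMM code implements the corresponding parameter updates as well.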
Graph Clustering; Overlapping Communities;
-
SSDE-Cluster
Contributors: Jonathan Purnell, Malik Magdon-Ismail
Summary:
Fast overlapping clustering of a graph which
first applies SSDE to embed the graph into
Cartesian coordinates and then uses a GMM to do soft (probabilistic)
clustering.
Finally the data is clustered in one of two
ways. One method is to use the posterior probabilities to cluster the
nodes in such a way as to create higher quality clusters than would be
found using the GMM's naive clustering function. The quality of the
clusters is based on a metric described in the related work. The other
method is to specify the average number of clusters a data sample is
assigned to. The SSDE-Cluster package provides the code for Cluster, which
clusters the data, and the SSDE-Cluster shell script, which wraps
SSDE, GMM, and Cluster.
Code:
SSDE-Cluster;
Documentation.
Papers:
SSDE-Cluster (SocialCom 2011).
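The second clustering method above (targeting an average number of clusters per sample) can be sketched by thresholding the GMM posteriors globally. This is a plausible reading, not the released implementation; `avg_memberships` is a hypothetical knob mirroring the option described:

```python
import numpy as np

def overlap_assign(post, avg_memberships=1.5):
    """
    Turn soft GMM posteriors (n samples x K clusters) into overlapping
    clusters: pick a global threshold so the average number of clusters
    per sample matches avg_memberships (a hypothetical parameter name).
    """
    n, K = post.shape
    vals = np.sort(post.ravel())[::-1]        # candidate thresholds, descending
    target = int(round(avg_memberships * n))  # total memberships wanted
    t = vals[min(target, vals.size) - 1]
    member = post >= t
    # Every sample keeps at least its best cluster
    member[np.arange(n), post.argmax(1)] = True
    return member
```

Samples whose posterior mass is spread across components naturally land in several clusters, which is what makes the clustering overlapping.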
-
Connected Iterative Scan (CIS)
Contributors: Jeff Baumes, Stephen Kelley, Robert Escriva, James Thompson, Mark Goldberg, Malik Magdon-Ismail
Summary:
Another approach to overlapping clustering of a network. It iteratively expands seed clusters in order to maximize a
two-part density function. The first part focuses on improving the density of edges inside the cluster relative
to edges leaving the cluster, and the second examines the edge probabilities within the cluster. The balance between the two
parts is controlled by a lambda value in [0, 1]. Higher values usually result in fewer communities, with a lambda
value of 1 considering only cliques as clusters.
Code:
CIS;
Papers:
Overlapping Communities
(SocialCom 2010);
(ISI 2005)
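A sketch of the two-part density the summary describes; the exact functional form is in the cited papers, so this is one plausible instantiation, not the released code:

```python
from itertools import combinations

def cis_density(cluster, adj, lam=0.5):
    """
    CIS-style two-part objective (illustrative form): a relative
    edge-density term (internal vs. outgoing edges) blended with the
    internal edge probability. With lam = 1 only cliques score the
    maximum value of 1.
    """
    C = set(cluster)
    w_in = sum(1 for u, v in combinations(C, 2) if v in adj[u])
    w_out = sum(1 for u in C for v in adj[u] if v not in C)
    pairs = len(C) * (len(C) - 1) // 2
    relative = w_in / (w_in + w_out) if w_in + w_out else 0.0
    internal = w_in / pairs if pairs else 0.0
    return (1 - lam) * relative + lam * internal
```

The algorithm repeatedly adds or removes single nodes from a seed cluster whenever doing so increases this score, stopping at a local maximum.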
-
Overlapping Clustering Wrapper
Summary:
This code provides a central program from which multiple overlapping clustering algorithms can be run. Currently, the following algorithms are supported
(algorithms other than Connected Iterative Scan must be installed separately; links are provided to the software):
Clique Percolation; CONGA; Connected Iterative Scan; COPRA; Link Communities; OSLOM; Repeated Random Walks; SLPA
Code:
Overlapping Clustering;
Documentation.
Papers:
Clique Percolation; CONGA; COPRA; Repeated Random Walks; SLPA; Link Communities; OSLOM
-
Spectral Clustering of Signed Networks
Contributors: Pranay Anchuri, Malik Magdon-Ismail
Summary:
A fast algorithm for partitioning signed networks based on minimizing
frustration or maximizing signed modularity. The algorithm uses a spectral
approach to obtain a partition into 2 communities and then iteratively improves
this partitioning by moving the nodes which have low certainty. A partition
into more than two components is obtained by recursively continuing to partition
the current partitioning as long as the objective can be improved.
Code:
Spectral Partition of Signed Graphs;
Documentation.
Papers:
Communities and Balance in Signed Networks: A Spectral Approach (ASONAM 2012).
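The spectral step can be sketched as follows: split nodes by the sign of the leading eigenvector of the signed adjacency matrix, then score the split by its frustration. This omits the paper's iterative refinement of low-certainty nodes:

```python
import numpy as np

def spectral_sign_partition(A):
    """
    Two-way partition of a signed network: take the leading eigenvector
    of the signed adjacency matrix A (entries +1/-1/0) and split nodes
    by its sign. A sketch of the spectral step only.
    """
    w, V = np.linalg.eigh(A)
    v = V[:, np.argmax(w)]
    return np.where(v >= 0, 0, 1)

def frustration(A, part):
    """Positive edges across the cut plus negative edges inside it."""
    n = A.shape[0]
    f = 0
    for i in range(n):
        for j in range(i + 1, n):
            if A[i, j] > 0 and part[i] != part[j]:
                f += 1
            elif A[i, j] < 0 and part[i] == part[j]:
                f += 1
    return f
```

On a perfectly balanced network (positive within groups, negative across) this recovers the two groups with zero frustration.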
-
Post-processing Community Structure
Contributors: James Thompson, Malik Magdon-Ismail
Summary:
Algorithms for detecting and pruning communities that are too alike in a community structure.
Code:
Post-processing Community Structure;
Documentation.
Community Evolution Detection
-
LOS Temporal (LOST)
Contributors: James Thompson, Malik Magdon-Ismail
Summary:
A program to detect community evolutions in temporal social networks via a two-step process. First, uninterrupted, tightly connected evolutions are detected. Second, similar evolutions are merged together.
Code:
Evolution Detection Code;
Papers:
Detecting Uninterrupted Evolutions.
Merging Evolutions.
-
Anonymizing Vertices
Contributors: James Thompson, Malik Magdon-Ismail
Summary:
Randomly anonymizes a given percentage of actors or vertices in a temporal community structure.
Code:
Vertex Anonymizer;
-
Synthetic Network Model
Contributors: James Thompson, Malik Magdon-Ismail
Summary:
A program to generate a synthetic temporal social network with characteristics similar to those found in real world networks. The temporal network is represented as a series of static networks, frequently called snapshots. A community structure is embedded in each snapshot and community evolutions are embedded in the transitions between snapshots.
Code:
Synthetic Network Generator;
Ranking and Prominence in Networks
-
ABC-Centrality (attentive betweenness centrality)
Contributors: Xiaohui Lu, Sibel Adali, Malik Magdon-Ismail
Summary:
Betweenness centrality measures how critical a node is to information flow in a network. A node is critical (and hence should have high betweenness) if it is on many shortest paths. Two shortcomings of such a measure are:
- It ignores nodes on "almost shortest" paths;
- It assumes that a node can provide the same attention to information flow through each of those shortest paths, no matter how many shortest paths the node controls.
There have been attempts to address these concerns in the literature, with partial success. We provide a new measure, ABC centrality, that measures criticality by the amount of attention a node devotes to the information flow between other nodes. Our measure addresses both the aforementioned concerns and can be computed efficiently. It performs as well or better than betweenness centrality on both stylized networks and large scale real data networks, and hence provides a useful tool for measuring node criticality.
Code:
ABC-Centrality;
Documentation.
Papers:
ABC-Centrality (SocialCom 2011).
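For reference, the baseline measure the summary critiques, shortest-path betweenness, can be computed with Brandes' algorithm. This is the standard measure, not ABC centrality itself, which modifies how a node's attention is split across the paths it controls:

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm for betweenness centrality, unweighted graph."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and predecessors
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        preds = {v: [] for v in adj}
        order, q = [], deque([s])
        while q:
            u = q.popleft(); order.append(u)
            for w in adj[u]:
                if dist[w] < 0:
                    dist[w] = dist[u] + 1; q.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]; preds[w].append(u)
        # Accumulate dependencies in reverse BFS order
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc  # each undirected pair counted twice; halve if desired
```

Note the `sigma[u] / sigma[w]` term: it spreads a node's credit evenly over the shortest paths through it, which is exactly the equal-attention assumption ABC centrality revisits.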
-
iHypR (Iterative Hyperedge Ranking)
Contributors: Xiaohui Lu, Sibel Adali, Malik Magdon-Ismail
Summary:
We present an iterative hyperedge ranking (iHypR) algorithm for computing prominence of actors in heterogeneous collaborative networks. The algorithm builds on the assumption that prominent actors collaborate on prominent objects, and prominent objects are naturally grouped into prominent clusters or groups (hyperedges in a graph). iHypR makes use of the relationships between actors, objects and hyperedges to compute a global prominence score for the actors in the network. The algorithm is customized for networks of collaborations, but it is generally applicable without further tuning. A core component in our algorithm is the
clustering of objects into hyperedges, for which we use SSDE-Cluster.
(SSDE-Cluster could be replaced with any overlapping clustering algorithm, but
we have found that it performs best.)
Code:
iHypR;
Documentation.
Papers:
iHypR (submitted).
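The mutual reinforcement between actors, objects, and hyperedges can be sketched as a power-iteration-style update. This is a simplified illustration in the spirit of iHypR; the exact update rules are in the paper, and the incidence-matrix names are assumptions:

```python
import numpy as np

def ihypr_sketch(AO, OH, iters=50):
    """
    Simplified mutual-reinforcement iteration: actor scores flow through
    the actor-object incidence matrix AO (actors x objects), object
    scores through the object-hyperedge incidence matrix OH
    (objects x hyperedges), and back.
    """
    a = np.ones(AO.shape[0])
    for _ in range(iters):
        o = AO.T @ a            # objects scored by their actors
        h = OH.T @ o            # hyperedges scored by their objects
        o = o + OH @ h          # objects boosted by prominent hyperedges
        a = AO @ o              # actors scored by their objects
        a /= np.linalg.norm(a)  # keep scores bounded
    return a
```

Actors attached to more (and more prominent) objects, and objects grouped into prominent hyperedges, end up with higher scores, capturing the "prominent actors collaborate on prominent objects" assumption.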
Hidden Groups
-
Extracting Hidden Groups and their Structure from Streaming Interaction Data
Contributors: Ben Caulfield, Mark K. Goldberg, Mykola Hayvanovych, Malik Magdon-Ismail, and William A. Wallace
Summary:
This program is an implementation of the algorithms described in "Extracting Hidden Groups and their Structure from Streaming Interaction Data" by Mark K. Goldberg, Mykola Hayvanovych, Malik Magdon-Ismail, and William A. Wallace. This program may be used to find chain and sibling communication triples from interaction data. There are two sub-programs that may be run with this program:
The first is used to determine at what frequency triples must occur in order to be considered significant (called the "significance threshold"). This is done by generating lists of random communications based on the original data and finding the number of triples that occur in these lists.
The second sub-program separates the triples into cliques and may be used to find larger communication structures in the data.
Code:
Hidden_Groups;
Documentation.
Papers:
Hidden_Groups.pdf.
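The triple-extraction step can be sketched as follows: a chain triple A->B->C is a message from A to B followed, within a time window, by a message from B to C. This is an illustrative sketch (the released program also handles sibling triples and estimates the significance threshold by counting triples in randomized copies of the data):

```python
def chain_triples(msgs, window):
    """
    Count chain triples A->B->C: a message (sender, receiver, time)
    A->B followed by a message B->C within `window` time units.
    """
    triples = set()
    for (s1, r1, t1) in msgs:
        for (s2, r2, t2) in msgs:
            if r1 == s2 and t1 < t2 <= t1 + window and s1 != r2:
                triples.add((s1, r1, r2))
    return triples
```

Comparing the count of observed triples to the counts found in randomized communication lists gives the frequency above which a triple is deemed significant.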
-
Find Matchings Between a List of Sent Messages and a List of Received Messages
Contributors: Lingxun Hu, Ben Caulfield, and Malik Magdon-Ismail
Summary:
In a social network, each node can send and receive messages; a node can send to any other node in the network but not to itself. The only information available is whether a node sent or received a message and the time at which it did so.
Our goal is to find as many valid matchings as possible between the sent messages and received messages, based on this information.
We use two algorithms: a dynamic programming method and the chain algorithm from the paper below. The output of this program can be used as the input of the program "Extracting Hidden Groups and their Structure from Streaming Interaction Data".
Code:
Find_matchings.zip;
Papers:
Hidden_Groups.pdf.
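A greedy baseline for this matching task can be sketched as follows: pair each send event with the earliest unused receive event at a different node that falls inside a plausible delay window. This is an illustrative sketch, not the dynamic-programming or chain algorithm in the released code, and `max_delay` is a hypothetical parameter:

```python
def match_messages(sends, recvs, max_delay):
    """
    Greedily match send events (node, time) to receive events
    (node, time): each send pairs with the earliest unused receive at a
    different node whose time falls in (t_send, t_send + max_delay].
    """
    recvs = sorted(enumerate(recvs), key=lambda p: p[1][1])
    used, matches = set(), []
    for s_node, s_time in sorted(sends, key=lambda p: p[1]):
        for i, (r_node, r_time) in recvs:
            if i in used or r_node == s_node:
                continue
            if s_time < r_time <= s_time + max_delay:
                used.add(i)
                matches.append(((s_node, s_time), (r_node, r_time)))
                break
    return matches
```

The matched pairs form the (sender, receiver, time) records that the hidden-groups program consumes.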
Portfolio Allocation
-
Optimization Algorithms for Portfolio Allocation
Contributors: Andrew Bolin, Malik Magdon-Ismail
Summary:
Modern portfolio theory (MPT) attempts to minimize risk for a given level of expected return or maximize expected return for a given level of risk. The code provided uses linear/quadratic programming to solve MPT optimizations and find efficient portfolios. Different optimization algorithms are included to handle different models of risk. This project is meant to simulate trading using MPT optimization algorithms. The code allows users to build and rebalance portfolios generated with real data using the MPT algorithms.
Code:
portsim.zip;
Papers:
PortfolioOptimization.pdf
Data:
prices.txt;
SP100.txt;
SP500.txt;
stock_legend.txt.
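The simplest point on the MPT efficient frontier, the minimum-variance portfolio, has a closed form when the only constraint is full investment (shorting allowed): w is proportional to Sigma^{-1} 1. A small numpy sketch; the released code instead solves the constrained versions with linear/quadratic programming:

```python
import numpy as np

def min_variance_weights(cov):
    """
    Closed-form minimum-variance portfolio, fully invested with
    shorting allowed: minimize w' cov w subject to sum(w) = 1,
    giving w proportional to cov^{-1} 1.
    """
    ones = np.ones(cov.shape[0])
    w = np.linalg.solve(cov, ones)   # cov^{-1} 1
    return w / w.sum()               # normalize to full investment
```

With uncorrelated assets this simply weights each asset inversely to its variance, which matches the intuition of shifting capital toward lower-risk holdings.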
Semantic Fragment Matching
-
Fragment Matching Through i-degree and i-partitioning
Contributors: John Schwartz, Mark Goldberg, Malik Magdon-Ismail
Summary:
This project attempts to identify graphical queries within a given semantically labeled database structure and return a set of potential matches. Primarily, this is done by indexing the database and query using i-degree technology for faster graph feature comparison. The included code will output a restricted set of potential matches for each query node, in a .MX2 file. The included data is derived from DBpedia, and includes 10 example queries (fragments) with the produced output.
Code:
FragmentMatchCode.zip.
Data:
FragmentMatchData.zip.
Supplemental:
Read Me;
Dual Set Sparsification of Matrices
Learning a Linear Model to Rank
-
Linear Programming Heuristic
Contributors: Malik Magdon-Ismail
Summary:
The matlab code LinearRankerLPsoft.m takes as input a data matrix
with ordered data and outputs a linear model that attempts to preserve
that order using an infinity norm linear programming heuristic with
"soft margin".
Matlab Code:
LinearRankerLPsoft.m (uses MATLAB linprog in the optimization toolbox).
Matlab Code:
TestLPranker.m
(a wrapper that tests the function on a random problem).
Supplemental:
A Short Note on the
Algorithm
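The soft-margin order-preservation idea can be illustrated without an LP solver by minimizing a pairwise hinge loss with gradient steps. This numpy version is a stand-in for the infinity-norm LP that LinearRankerLPsoft.m actually solves, not a port of it:

```python
import numpy as np

def rank_linear(X, iters=500, lr=0.1, margin=1.0):
    """
    Learn w so that scores X @ w follow the row order of X (row i ranked
    above row i+1), via subgradient descent on a soft-margin pairwise
    hinge loss. Illustrative substitute for the LP formulation.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        s = X @ w
        viol = s[:-1] - s[1:] < margin              # violated order pairs
        grad = (X[1:][viol] - X[:-1][viol]).sum(axis=0)
        w -= lr * grad / n
    return w
```

On data whose rows are already ordered by some linear score, the learned w reproduces that ordering in `X @ w`.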
Eye Tracking
-
Find Patterns of How People Play Tetris From Eye Tracking Data
Contributors: Lingxun Hu, Malik Magdon-Ismail, Wayne Gray
Summary:
This package has two .jar files: FindSingleMoves.jar and GetLongerChains.jar.
The output of FindSingleMoves.jar is the input of GetLongerChains.jar.
Code:
FindEyeTrackPatternsPackage
Greedy Deep Learning (Node-By-Node)
- GN (ranks data according to distance to first feature) and GCN
(partitions data into supervised classes)
Contributors: Ke Wu, Malik Magdon-Ismail
Summary:
An efficient algorithm for pre-training a deep network that
trains each node sequentially on partitions of the data. The features
produced by the algorithm are also more interpretable.
Python Code:
Code.zip.
ReadMe.txt.
Papers:
(MS Thesis)
(arXiv)
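The node-by-node idea can be sketched as: fit one sigmoid unit, then focus the next unit on the part of the data the current units handle worst. This is a hypothetical simplification for illustration, not the released GN/GCN code:

```python
import numpy as np

def train_node(X, y, iters=200, lr=0.5):
    """Fit one sigmoid unit by gradient descent on the logistic loss."""
    Xb = np.hstack([X, np.ones((len(X), 1))])       # bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        w -= lr * Xb.T @ (p - y) / len(X)
    return w

def greedy_layer(X, y, n_nodes=2):
    """
    Node-by-node sketch: each unit is trained on the samples its
    predecessor misclassifies, so units specialize sequentially.
    Returns the learned features (unit activations) for all samples.
    """
    Xb = np.hstack([X, np.ones((len(X), 1))])
    nodes, idx = [], np.arange(len(X))
    for _ in range(n_nodes):
        w = train_node(X[idx], y[idx])
        nodes.append(w)
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        miss = (p >= 0.5).astype(float) != y        # wrongly classified
        idx = np.flatnonzero(miss) if miss.any() else np.arange(len(X))
    return np.column_stack([1.0 / (1.0 + np.exp(-(Xb @ w))) for w in nodes])
```

Because each unit sees only a partition of the data, its weights tend to align with a specific region, which is one reading of why the produced features are more interpretable.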
Network Signature
- Get a structured 2D image embedding (signature) of a subgraph of a given network.
Contributors: Kshiteesh Hegde, Malik Magdon-Ismail
Summary:
Given a network and a subgraph size, it saves the structured image of a subgraph. The subgraph is obtained by performing a random walk on the network whose length is set by the given subgraph size. A BFS-based ordering scheme, given in this ASONAM paper, is applied to arrive at the signature.
Python Code:
.zip archive.
Papers:
(PhD Thesis, Chapter 3)
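The pipeline above (random-walk sampling, BFS ordering, ordered adjacency matrix) can be sketched in pure Python. A minimal version for illustration; it assumes the walk can reach the requested number of distinct nodes, and function names are not from the released code:

```python
import random
from collections import deque

def signature(adj, size, seed=0):
    """
    Collect `size` distinct nodes by random walk, order them by BFS
    from the walk's start, and return the reordered subgraph adjacency
    matrix (the "structured image").
    """
    rng = random.Random(seed)
    start = min(adj)                      # arbitrary deterministic start
    nodes, seen, cur = [start], {start}, start
    while len(nodes) < size:              # walk until `size` distinct nodes
        cur = rng.choice(list(adj[cur]))
        if cur not in seen:
            seen.add(cur)
            nodes.append(cur)
    # BFS ordering within the sampled subgraph
    order, q, vis = [], deque([start]), {start}
    while q:
        u = q.popleft(); order.append(u)
        for v in adj[u]:
            if v in seen and v not in vis:
                vis.add(v); q.append(v)
    idx = {u: i for i, u in enumerate(order)}
    img = [[0] * size for _ in range(size)]
    for u in order:
        for v in adj[u]:
            if v in idx:
                img[idx[u]][idx[v]] = 1
    return img
```

The BFS reordering is what gives subgraphs of the same network visually similar images, which the classification tools below exploit.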
Subgraph Classification using Network Signatures
- Identify the parent network of a small subgraph.
Contributors: Kshiteesh Hegde, Malik Magdon-Ismail
Summary:
This package uses the novel structured image embedding feature to perform subgraph classification. Given a subgraph, you can predict its parent network. Please see the cited papers for more details.
Python Code:
.zip archive.
Papers:
(PhD Thesis, Chapter 4)
(arXiv paper)
Network Lens: Node Classification in Topologically Heterogeneous Networks
- This tool, called Network Lens, can be used to classify nodes in a topologically heterogeneous network.
Contributors: Kshiteesh Hegde, Malik Magdon-Ismail
Summary:
This package uses the novel structured image embedding feature to perform node classification in topologically heterogeneous networks. Given a network, you can classify its nodes into networks the model has been trained on. One of the applications of this tool is clustering. Please see the cited papers for more details.
Python Code:
.zip archive.
Papers:
(PhD Thesis, Chapter 5)
Intrinsic Scale of a Network
- Given a network, get its intrinsic scale - the scale at which it reveals itself.
Contributors: Malik Magdon-Ismail, Kshiteesh Hegde
Summary:
This package uses the novel structured image embedding feature to extract the intrinsic scale of the given network. Networks reveal their identities at a small scale (8-20 nodes). This tool extracts the scale for a given network. Please see the cited papers for more details.
Python Code:
.zip archive.
Papers:
(PhD Thesis, Chapter 6)
Supervised Mixtures of Experts (SGMM, SBMM)
- Given a dataset with continuous or binary numerical values, train either an SGMM or an SBMM model for classification.
Contributors: Georgios Mavroudeas, Xiao Shou, Malik Magdon-Ismail
Summary:
This package trains two types of MoE models: one with a Gaussian-mixture gate function and one with a Bernoulli-mixture gate. The emission function is preset to logistic regression, but there is flexibility to replace it with any classification model. There are multiple options for per-cluster L1 and L2 regularization, as well as options to train with a mixture of labeled and unlabeled data.
Python Code:
.zip archive.
Papers:
(SGMM Paper)
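The prediction side of the Gaussian-gated model can be sketched as follows: the gate computes Gaussian-mixture responsibilities p(k|x), and the output blends each expert's logistic prediction by that posterior. An illustrative sketch with 1-D features and hypothetical parameter names, not the released trainer:

```python
import numpy as np

def sgmm_predict(X, pis, mus, sigmas, expert_w, expert_b):
    """
    Gaussian-gated mixture-of-experts prediction: the gate is the
    posterior p(k|x) of a 1-D Gaussian mixture, the experts are
    per-cluster logistic regressions, and the output is their
    gate-weighted blend.
    """
    # Gate: Gaussian mixture responsibilities
    dens = pis * np.exp(-0.5 * ((X[:, None] - mus) / sigmas) ** 2) / sigmas
    gate = dens / dens.sum(axis=1, keepdims=True)
    # Experts: per-cluster logistic predictions
    experts = 1.0 / (1.0 + np.exp(-(X[:, None] * expert_w + expert_b)))
    return (gate * experts).sum(axis=1)
```

Training fits the gate and expert parameters jointly (with the regularization and semi-supervised options noted above); prediction then reduces to this gated blend.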
HMM-Boost
- Train a semi-supervised HMM (HMM-Boost) model or a mixture of semi-supervised HMM models.
Contributors: Georgios Mavroudeas, Malik Magdon-Ismail
Summary:
This package applies the HMM-Boost methodology to train HMM and MHMM models using a semi-supervised algorithm. It also has the option to train a traditional unsupervised HMM. The observational model of the HMM is a mixture of Gaussians per state.
Python Code:
.zip archive.
Papers:
(HMM-BOOST paper)
Pre Cluster and Merge (PCM)
- Given a standard causal dataset, apply the PCM methodology and automatically uncover effect subpopulations.
Contributors: Georgios Mavroudeas, Malik Magdon-Ismail
Summary:
This package uses the novel PCM algorithm to automatically extract effect clusters from causal populations.
Python Code:
.zip archive.
Papers:
(PCM work, submitted in ICLR 2023)