Fast Graph Embedding
-
Sampled Spectral Distance Embedding (SSDE)
Contributors: Ali Civril, Eli Bocke-Rivel, Jonathan Purnell,
Malik Magdon-Ismail
Summary:
A given (undirected and connected) graph is
embedded into Cartesian space in such a way as to minimize the loss of
information contained in pairwise distances.
The algorithm executes quickly due to
clever sampling of the nodes and a matrix approximation method that
runs in linear time with respect to the number of nodes in the graph.
Code:
SSDE;
Test Data;
Documentation.
Papers:
SSDE (2007).
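The embedding step described above can be sketched with classical MDS on graph (hop) distances. A minimal numpy sketch, leaving out SSDE's node sampling and fast matrix approximation (the function names here are illustrative, not from the released code):

```python
import numpy as np
from collections import deque

def bfs_dist(adj, src):
    """Hop distances from src in an unweighted graph (adjacency dict)."""
    d = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in d:
                d[v] = d[u] + 1
                q.append(v)
    return d

def classical_mds(D, k=2):
    """Embed a pairwise-distance matrix D into R^k, preserving distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J          # double-centered Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:k]        # top-k eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

# Tiny example: a 5-node path graph 0-1-2-3-4
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
nodes = sorted(adj)
D = np.array([[bfs_dist(adj, u)[v] for v in nodes] for u in nodes], float)
X = classical_mds(D, k=2)                # Cartesian coordinates per node
```

SSDE gains its speed by computing distances only from a sampled set of nodes and approximating the rest, rather than forming the full matrix as done here.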
Unsupervised Learning
-
Low-Rank Perturbation Gaussian
Mixture Model (LRPGMM)
Contributors: Jonathan Purnell, Saurabh Paul, Malik Magdon-Ismail
Summary:
A modified EM algorithm for
training a Gaussian Mixture Model (GMM). The defining difference is the
approximation of the covariance matrix for each mixture. The GMM can be
trained using the full covariance matrix, a diagonal matrix
approximation (i.e. only the variances), or using a low-rank perturbed
diagonal matrix decomposition (LRPDD). With the LRPDD, the covariance
matrix is represented by a linear number of parameters (with respect to
the dimension of the data being modeled). This decomposition allows for
linear-time (vs. quadratic or cubic) training of the GMM and reduces the
overfitting that occurs with a full-rank covariance matrix.
Additionally, there is a naive overlapping clustering algorithm
built into the code that clusters the data based on the posterior
probabilities.
Code:
LRPGMM;
Documentation.
Note: This code relies on the ALGLIB
library.
Papers:
LRPGMM (IDEAL 2010);
LRPGMM (IJDMMM 2011).
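The linear-parameter covariance above can be made concrete: with Sigma = diag(d) + V V^T, the Gaussian log-density can be evaluated without ever forming Sigma, using the Woodbury identity and the matrix determinant lemma. A hedged numpy sketch (function and variable names are illustrative):

```python
import numpy as np

def lowrank_diag_logpdf(x, mu, d, V):
    """
    Gaussian log-density with covariance Sigma = diag(d) + V V^T,
    evaluated via the Woodbury identity in O(n k^2) instead of O(n^3).
    d: (n,) positive diagonal; V: (n, k) low-rank perturbation.
    """
    n, k = V.shape
    r = x - mu
    Dinv_r = r / d                       # D^{-1} r
    Dinv_V = V / d[:, None]              # D^{-1} V
    M = np.eye(k) + V.T @ Dinv_V         # k x k capacitance matrix
    # Sigma^{-1} r = D^{-1} r - D^{-1} V M^{-1} V^T D^{-1} r
    Sinv_r = Dinv_r - Dinv_V @ np.linalg.solve(M, V.T @ Dinv_r)
    # log|Sigma| = log|M| + sum log d   (matrix determinant lemma)
    logdet = np.linalg.slogdet(M)[1] + np.sum(np.log(d))
    return -0.5 * (n * np.log(2 * np.pi) + logdet + r @ Sinv_r)
```

Inside EM, this replaces the cubic-cost density evaluation of a full covariance; the released LRPGMM code implements the corresponding parameter updates as well.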
Graph Clustering; Overlapping Communities;
-
SSDE-Cluster
Contributors: Jonathan Purnell, Malik Magdon-Ismail
Summary:
Fast overlapping clustering of a graph which
first applies SSDE to embed the graph into
Cartesian coordinates and then uses a GMM to do soft (probabilistic)
clustering.
Finally the data is clustered in one of two
ways. One method is to use the posterior probabilities to cluster the
nodes in such a way as to create higher quality clusters than would be
found using the GMM's naive clustering function. The quality of the
clusters is based on a metric described in the related work. The other
method is to specify the average number of clusters a data sample is
assigned to. The SSDE-Cluster package provides the code for Cluster, which
clusters the data, and the SSDE-Cluster shell script, which wraps
SSDE, GMM, and Cluster.
Code:
SSDE-Cluster;
Documentation.
Papers:
SSDE-Cluster (SocialCom 2011).
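The second clustering method above (targeting an average number of clusters per sample) can be sketched by thresholding the GMM posteriors globally. This is a plausible reading, not the released implementation; `avg_memberships` is a hypothetical knob mirroring the option described:

```python
import numpy as np

def overlap_assign(post, avg_memberships=1.5):
    """
    Turn soft GMM posteriors (n samples x K clusters) into overlapping
    clusters: pick a global threshold so the average number of clusters
    per sample matches avg_memberships (a hypothetical parameter name).
    """
    n, K = post.shape
    vals = np.sort(post.ravel())[::-1]        # candidate thresholds, descending
    target = int(round(avg_memberships * n))  # total memberships wanted
    t = vals[min(target, vals.size) - 1]
    member = post >= t
    # Every sample keeps at least its best cluster
    member[np.arange(n), post.argmax(1)] = True
    return member
```

Samples whose posterior mass is spread across components naturally land in several clusters, which is what makes the clustering overlapping.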
-
Connected Iterative Scan (CIS)
Contributors: Jeff Baumes, Stephen Kelley, Robert Escriva, James Thompson, Mark Goldberg, Malik Magdon-Ismail
Summary:
Another approach to overlapping clustering of a network. It iteratively expands seed clusters in order to maximize a
two-part density function. The first part focuses on improving the density of edges inside the cluster relative
to edges leaving the cluster, and the second examines the edge probabilities within the cluster. The balance between the two
parts is controlled by a lambda value in [0, 1]. Higher values usually result in fewer communities, with a lambda
value of 1 considering only cliques as clusters.
Code:
CIS;
Papers:
Overlapping Communities
(SocialCom 2010);
(ISI 2005)
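A sketch of the two-part density the summary describes; the exact functional form is in the cited papers, so this is one plausible instantiation, not the released code:

```python
from itertools import combinations

def cis_density(cluster, adj, lam=0.5):
    """
    CIS-style two-part objective (illustrative form): a relative
    edge-density term (internal vs. outgoing edges) blended with the
    internal edge probability. With lam = 1 only cliques score the
    maximum value of 1.
    """
    C = set(cluster)
    w_in = sum(1 for u, v in combinations(C, 2) if v in adj[u])
    w_out = sum(1 for u in C for v in adj[u] if v not in C)
    pairs = len(C) * (len(C) - 1) // 2
    relative = w_in / (w_in + w_out) if w_in + w_out else 0.0
    internal = w_in / pairs if pairs else 0.0
    return (1 - lam) * relative + lam * internal
```

The algorithm repeatedly adds or removes single nodes from a seed cluster whenever doing so increases this score, stopping at a local maximum.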
-
Overlapping Clustering Wrapper
Summary:
This code provides a central program from which multiple overlapping clustering algorithms can be run. Currently, the following algorithms are supported
(algorithms other than Connected Iterative Scan must be installed separately; links are provided to the software):
Clique Percolation; CONGA; Connected Iterative Scan; COPRA; Link Communities; OSLOM; Repeated Random Walks; SLPA
Code:
Overlapping Clustering;
Documentation.
Papers:
Clique Percolation; CONGA; COPRA; Repeated Random Walks; SLPA; Link Communities; OSLOM
-
Spectral Clustering of Signed Networks
Contributors: Pranay Anchuri, Malik Magdon-Ismail
Summary:
A fast algorithm for partitioning signed networks based on minimizing
frustration or maximizing signed modularity. The algorithm uses a spectral
approach to obtain a partition into 2 communities and then iteratively improves
this partitioning by moving the nodes which have low certainty. A partition
into more than two components is obtained by recursively continuing to partition
the current partitioning as long as the objective can be improved.
Code:
Spectral Partition of Signed Graphs;
Documentation.
Papers:
Communities and Balance in Signed Networks: A Spectral Approach (ASONAM 2012).
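The spectral step can be sketched as follows: split nodes by the sign of the leading eigenvector of the signed adjacency matrix, then score the split by its frustration. This omits the paper's iterative refinement of low-certainty nodes:

```python
import numpy as np

def spectral_sign_partition(A):
    """
    Two-way partition of a signed network: take the leading eigenvector
    of the signed adjacency matrix A (entries +1/-1/0) and split nodes
    by its sign. A sketch of the spectral step only.
    """
    w, V = np.linalg.eigh(A)
    v = V[:, np.argmax(w)]
    return np.where(v >= 0, 0, 1)

def frustration(A, part):
    """Positive edges across the cut plus negative edges inside it."""
    n = A.shape[0]
    f = 0
    for i in range(n):
        for j in range(i + 1, n):
            if A[i, j] > 0 and part[i] != part[j]:
                f += 1
            elif A[i, j] < 0 and part[i] == part[j]:
                f += 1
    return f
```

On a perfectly balanced network (positive within groups, negative across) this recovers the two groups with zero frustration.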
-
Post-processing Community Structure
Contributors: James Thompson, Malik Magdon-Ismail
Summary:
Algorithms for detecting and pruning communities that are too alike in a community structure.
Code:
Post-processing Community Structure;
Documentation.
Community Evolution Detection
-
LOS Temporal (LOST)
Contributors: James Thompson, Malik Magdon-Ismail
Summary:
A program to detect community evolutions in temporal social networks via a two-step process. First, uninterrupted, tightly connected evolutions are detected. Second, similar evolutions are merged together.
Code:
Evolution Detection Code;
Papers:
Detecting Uninterrupted Evolutions.
Merging Evolutions.
-
Anonymizing Vertices
Contributors: James Thompson, Malik Magdon-Ismail
Summary:
Randomly anonymizes a given percentage of actors or vertices in a temporal community structure.
Code:
Vertex Anonymizer;
-
Synthetic Network Model
Contributors: James Thompson, Malik Magdon-Ismail
Summary:
A program to generate a synthetic temporal social network with characteristics similar to those found in real world networks. The temporal network is represented as a series of static networks, frequently called snapshots. A community structure is embedded in each snapshot and community evolutions are embedded in the transitions between snapshots.
Code:
Synthetic Network Generator;
Ranking and Prominence in Networks
-
ABC-Centrality (attentive betweenness centrality)
Contributors: Xiaohui Lu, Sibel Adali, Malik Magdon-Ismail
Summary:
Betweenness centrality measures how critical a node is to information flow in a network. A node is critical (and hence should have high betweenness) if it is on many shortest paths. Two shortcomings of such a measure are:
- It ignores nodes on "almost shortest" paths;
- It assumes that a node can provide the same attention to information flow through each of those shortest paths, no matter how many shortest paths the node controls.
There have been attempts to address these concerns in the literature, with partial success. We provide a new measure, ABC centrality, that measures criticality by the amount of attention a node devotes to the information flow between other nodes. Our measure addresses both the aforementioned concerns and can be computed efficiently. It performs as well or better than betweenness centrality on both stylized networks and large scale real data networks, and hence provides a useful tool for measuring node criticality.
Code:
ABC-Centrality;
Documentation.
Papers:
ABC-Centrality (SocialCom 2011).
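For reference, the baseline measure the summary critiques, shortest-path betweenness, can be computed with Brandes' algorithm. This is the standard measure, not ABC centrality itself, which modifies how a node's attention is split across the paths it controls:

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm for betweenness centrality, unweighted graph."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and predecessors
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        preds = {v: [] for v in adj}
        order, q = [], deque([s])
        while q:
            u = q.popleft(); order.append(u)
            for w in adj[u]:
                if dist[w] < 0:
                    dist[w] = dist[u] + 1; q.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]; preds[w].append(u)
        # Accumulate dependencies in reverse BFS order
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc  # each undirected pair counted twice; halve if desired
```

Note the `sigma[u] / sigma[w]` term: it spreads a node's credit evenly over the shortest paths through it, which is exactly the equal-attention assumption ABC centrality revisits.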
-
iHypR (Iterative Hyperedge Ranking)
Contributors: Xiaohui Lu, Sibel Adali, Malik Magdon-Ismail
Summary:
We present an iterative hyperedge ranking (iHypR) algorithm for computing prominence of actors in heterogeneous collaborative networks. The algorithm builds on the assumption that prominent actors collaborate on prominent objects, and prominent objects are naturally grouped into prominent clusters or groups (hyperedges in a graph). iHypR makes use of the relationships between actors, objects and hyperedges to compute a global prominence score for the actors in the network. The algorithm is customized for networks of collaborations, but it is generally applicable without further tuning. A core component in our algorithm is the
clustering of objects into hyperedges, for which we use SSDE-Cluster.
(SSDE-Cluster could be replaced with any overlapping clustering algorithm, but
we have found that it performs best.)
Code:
iHypR;
Documentation.
Papers:
iHypR (submitted).
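The mutual reinforcement between actors, objects, and hyperedges can be sketched as a power-iteration-style update. This is a simplified illustration in the spirit of iHypR; the exact update rules are in the paper, and the incidence-matrix names are assumptions:

```python
import numpy as np

def ihypr_sketch(AO, OH, iters=50):
    """
    Simplified mutual-reinforcement iteration: actor scores flow through
    the actor-object incidence matrix AO (actors x objects), object
    scores through the object-hyperedge incidence matrix OH
    (objects x hyperedges), and back.
    """
    a = np.ones(AO.shape[0])
    for _ in range(iters):
        o = AO.T @ a            # objects scored by their actors
        h = OH.T @ o            # hyperedges scored by their objects
        o = o + OH @ h          # objects boosted by prominent hyperedges
        a = AO @ o              # actors scored by their objects
        a /= np.linalg.norm(a)  # keep scores bounded
    return a
```

Actors attached to more (and more prominent) objects, and objects grouped into prominent hyperedges, end up with higher scores, capturing the "prominent actors collaborate on prominent objects" assumption.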
Hidden Groups
-
Extracting Hidden Groups and their Structure from Streaming Interaction Data
Contributors: Ben Caulfield, Mark K. Goldberg, Mykola Hayvanovych, Malik Magdon-Ismail, and William A. Wallace
Summary:
This program is an implementation of the algorithms described in "Extracting Hidden Groups and their Structure from Streaming Interaction Data" by Mark K. Goldberg, Mykola Hayvanovych, Malik Magdon-Ismail, and William A. Wallace. This program may be used to find chain and sibling communication triples from interaction data. There are two sub-programs that may be run with this program:
The first is used to determine at what frequency triples must occur in order to be considered significant (called the "significance threshold"). This is done by generating lists of random communications based on the original data and finding the number of triples that occur in these lists.
The second sub-program separates the triples into cliques and may be used to find larger communication structures in the data.
Code:
Hidden_Groups;
Documentation.
Papers:
Hidden_Groups.pdf.
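The triple-extraction step can be sketched as follows: a chain triple A->B->C is a message from A to B followed, within a time window, by a message from B to C. This is an illustrative sketch (the released program also handles sibling triples and estimates the significance threshold by counting triples in randomized copies of the data):

```python
def chain_triples(msgs, window):
    """
    Count chain triples A->B->C: a message (sender, receiver, time)
    A->B followed by a message B->C within `window` time units.
    """
    triples = set()
    for (s1, r1, t1) in msgs:
        for (s2, r2, t2) in msgs:
            if r1 == s2 and t1 < t2 <= t1 + window and s1 != r2:
                triples.add((s1, r1, r2))
    return triples
```

Comparing the count of observed triples to the counts found in randomized communication lists gives the frequency above which a triple is deemed significant.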
-
Find Matchings Between a List of Sent Messages and a List of Received Messages
Contributors: Lingxun Hu, Ben Caulfield, and Malik Magdon-Ismail
Summary:
In a social network, each node can send and receive messages; a node can send to any other node in the network but not to itself. The only information available is whether a node sent or received a message and the time at which it did so.
Our goal is to find as many valid matchings as possible between the sent messages and received messages, based on this information.
We use two algorithms: a dynamic programming method and the chain algorithm from the paper below. The output of this program can be used as the input of the program "Extracting Hidden Groups and their Structure from Streaming Interaction Data".
Code:
Find_matchings.zip;
Papers:
Hidden_Groups.pdf.
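A greedy baseline for this matching task can be sketched as follows: pair each send event with the earliest unused receive event at a different node that falls inside a plausible delay window. This is an illustrative sketch, not the dynamic-programming or chain algorithm in the released code, and `max_delay` is a hypothetical parameter:

```python
def match_messages(sends, recvs, max_delay):
    """
    Greedily match send events (node, time) to receive events
    (node, time): each send pairs with the earliest unused receive at a
    different node whose time falls in (t_send, t_send + max_delay].
    """
    recvs = sorted(enumerate(recvs), key=lambda p: p[1][1])
    used, matches = set(), []
    for s_node, s_time in sorted(sends, key=lambda p: p[1]):
        for i, (r_node, r_time) in recvs:
            if i in used or r_node == s_node:
                continue
            if s_time < r_time <= s_time + max_delay:
                used.add(i)
                matches.append(((s_node, s_time), (r_node, r_time)))
                break
    return matches
```

The matched pairs form the (sender, receiver, time) records that the hidden-groups program consumes.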
Portfolio Allocation
-
Optimization Algorithms for Portfolio Allocation
Contributors: Andrew Bolin, Malik Magdon-Ismail
Summary:
Modern portfolio theory (MPT) attempts to minimize risk for a given level of expected return or maximize expected return for a given level of risk. The code provided uses linear/quadratic programming to solve MPT optimizations and find efficient portfolios. Different optimization algorithms are included to handle different models of risk. This project is meant to simulate trading using MPT optimization algorithms. The code allows users to build and rebalance portfolios generated with real data using the MPT algorithms.
Code:
portsim.zip;
Papers:
PortfolioOptimization.pdf
Data:
prices.txt;
SP100.txt;
SP500.txt;
stock_legend.txt.
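The simplest point on the MPT efficient frontier, the minimum-variance portfolio, has a closed form when the only constraint is full investment (shorting allowed): w is proportional to Sigma^{-1} 1. A small numpy sketch; the released code instead solves the constrained versions with linear/quadratic programming:

```python
import numpy as np

def min_variance_weights(cov):
    """
    Closed-form minimum-variance portfolio, fully invested with
    shorting allowed: minimize w' cov w subject to sum(w) = 1,
    giving w proportional to cov^{-1} 1.
    """
    ones = np.ones(cov.shape[0])
    w = np.linalg.solve(cov, ones)   # cov^{-1} 1
    return w / w.sum()               # normalize to full investment
```

With uncorrelated assets this simply weights each asset inversely to its variance, which matches the intuition of shifting capital toward lower-risk holdings.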
Semantic Fragment Matching
-
Fragment Matching Through i-degree and i-partitioning
Contributors: John Schwartz, Mark Goldberg, Malik Magdon-Ismail
Summary:
This project attempts to identify graphical queries within a given semantically labeled database structure and return a set of potential matches. Primarily, this is done by indexing the database and query using i-degree technology for faster graph feature comparison. The included code will output a restricted set of potential matches for each query node, in a .MX2 file. The included data is derived from DBpedia, and includes 10 example queries (fragments) with the produced output.
Code:
FragmentMatchCode.zip.
Data:
FragmentMatchData.zip.
Supplemental:
Read Me;
Dual Set Sparsification of Matrices
Learning a Linear Model to Rank
-
Linear Programming Heuristic
Contributors: Malik Magdon-Ismail
Summary:
The matlab code LinearRankerLPsoft.m takes as input a data matrix
with ordered data and outputs a linear model that attempts to preserve
that order using an infinity norm linear programming heuristic with
"soft margin".
Matlab Code:
LinearRankerLPsoft.m (uses MATLAB linprog in the optimization toolbox).
Matlab Code:
TestLPranker.m
(a wrapper that tests the function on a random problem).
Supplemental:
A Short Note on the
Algorithm
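The soft-margin order-preservation idea can be illustrated without an LP solver by minimizing a pairwise hinge loss with gradient steps. This numpy version is a stand-in for the infinity-norm LP that LinearRankerLPsoft.m actually solves, not a port of it:

```python
import numpy as np

def rank_linear(X, iters=500, lr=0.1, margin=1.0):
    """
    Learn w so that scores X @ w follow the row order of X (row i ranked
    above row i+1), via subgradient descent on a soft-margin pairwise
    hinge loss. Illustrative substitute for the LP formulation.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        s = X @ w
        viol = s[:-1] - s[1:] < margin              # violated order pairs
        grad = (X[1:][viol] - X[:-1][viol]).sum(axis=0)
        w -= lr * grad / n
    return w
```

On data whose rows are already ordered by some linear score, the learned w reproduces that ordering in `X @ w`.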
Eye Tracking
-
Find Patterns of How People Play Tetris From Eye Tracking Data
Contributors: Lingxun Hu, Malik Magdon-Ismail, Wayne Gray
Summary:
This package has two .jar files: FindSingleMoves.jar and GetLongerChains.jar.
The output of FindSingleMoves.jar is the input of GetLongerChains.jar.
Code:
FindEyeTrackPatternsPackage
Greedy Deep Learning (Node-By-Node)
- GN (ranks data according to distance to first feature) and GCN
(partitions data into supervised classes)
Contributors: Ke Wu, Malik Magdon-Ismail
Summary:
An efficient algorithm for pre-training a deep network that
trains each node sequentially on partitions of the data. The features
produced by the algorithm are also more interpretable.
Python Code:
Code.zip.
ReadMe.txt.
Papers:
(MS Thesis)
(arXiv)
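The node-by-node idea can be sketched as: fit one sigmoid unit, then focus the next unit on the part of the data the current units handle worst. This is a hypothetical simplification for illustration, not the released GN/GCN code:

```python
import numpy as np

def train_node(X, y, iters=200, lr=0.5):
    """Fit one sigmoid unit by gradient descent on the logistic loss."""
    Xb = np.hstack([X, np.ones((len(X), 1))])       # bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        w -= lr * Xb.T @ (p - y) / len(X)
    return w

def greedy_layer(X, y, n_nodes=2):
    """
    Node-by-node sketch: each unit is trained on the samples its
    predecessor misclassifies, so units specialize sequentially.
    Returns the learned features (unit activations) for all samples.
    """
    Xb = np.hstack([X, np.ones((len(X), 1))])
    nodes, idx = [], np.arange(len(X))
    for _ in range(n_nodes):
        w = train_node(X[idx], y[idx])
        nodes.append(w)
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        miss = (p >= 0.5).astype(float) != y        # wrongly classified
        idx = np.flatnonzero(miss) if miss.any() else np.arange(len(X))
    return np.column_stack([1.0 / (1.0 + np.exp(-(Xb @ w))) for w in nodes])
```

Because each unit sees only a partition of the data, its weights tend to align with a specific region, which is one reading of why the produced features are more interpretable.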
Network Signature
- Get a structured 2D image embedding (signature) of a subgraph of a given network.
Contributors: Kshiteesh Hegde, Malik Magdon-Ismail
Summary:
Given a network and a subgraph size, it saves the structured image of a subgraph. The subgraph is obtained by performing a random walk on the network whose length is set by the given subgraph size. A BFS-based ordering scheme, given in this ASONAM paper, is applied to arrive at the signature.
Python Code:
.zip archive.
Papers:
(PhD Thesis, Chapter 3)
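The pipeline above (random-walk sampling, BFS ordering, ordered adjacency matrix) can be sketched in pure Python. A minimal version for illustration; it assumes the walk can reach the requested number of distinct nodes, and function names are not from the released code:

```python
import random
from collections import deque

def signature(adj, size, seed=0):
    """
    Collect `size` distinct nodes by random walk, order them by BFS
    from the walk's start, and return the reordered subgraph adjacency
    matrix (the "structured image").
    """
    rng = random.Random(seed)
    start = min(adj)                      # arbitrary deterministic start
    nodes, seen, cur = [start], {start}, start
    while len(nodes) < size:              # walk until `size` distinct nodes
        cur = rng.choice(list(adj[cur]))
        if cur not in seen:
            seen.add(cur)
            nodes.append(cur)
    # BFS ordering within the sampled subgraph
    order, q, vis = [], deque([start]), {start}
    while q:
        u = q.popleft(); order.append(u)
        for v in adj[u]:
            if v in seen and v not in vis:
                vis.add(v); q.append(v)
    idx = {u: i for i, u in enumerate(order)}
    img = [[0] * size for _ in range(size)]
    for u in order:
        for v in adj[u]:
            if v in idx:
                img[idx[u]][idx[v]] = 1
    return img
```

The BFS reordering is what gives subgraphs of the same network visually similar images, which the classification tools below exploit.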
Subgraph Classification using Network Signatures
- Identify the parent network of a small subgraph.
Contributors: Kshiteesh Hegde, Malik Magdon-Ismail
Summary:
This package uses the novel structured image embedding feature to perform subgraph classification. Given a subgraph, you can predict its parent network. Please see the cited papers for more details.
Python Code:
.zip archive.
Papers:
(PhD Thesis, Chapter 4)
(arXiv paper)
Network Lens: Node Classification in Topologically Heterogeneous Networks
- This tool, called Network Lens, can be used to classify nodes in a topologically heterogeneous network.
Contributors: Kshiteesh Hegde, Malik Magdon-Ismail
Summary:
This package uses the novel structured image embedding feature to perform node classification in topologically heterogeneous networks. Given a network, you can classify its nodes into networks the model has been trained on. One of the applications of this tool is clustering. Please see the cited papers for more details.
Python Code:
.zip archive.
Papers:
(PhD Thesis, Chapter 5)
Intrinsic Scale of a Network
- Given a network, get its intrinsic scale - the scale at which it reveals itself.
Contributors: Malik Magdon-Ismail, Kshiteesh Hegde
Summary:
This package uses the novel structured image embedding feature to extract the intrinsic scale of the given network. Networks reveal their identities at a small scale (8-20 nodes). This tool extracts the scale for a given network. Please see the cited papers for more details.
Python Code:
.zip archive.
Papers:
(PhD Thesis, Chapter 6)
Supervised Mixtures of Experts (SGMM, SBMM)
- Given a dataset with continuous or binary numerical values, train either an SGMM or an SBMM model for classification.
Contributors: Georgios Mavroudeas, Xiao Shou, Malik Magdon-Ismail
Summary:
This package trains two types of MoE models: one with a Gaussian-mixture gate function and one with a Bernoulli-mixture gate. The emission function is preset to logistic regression, but there is flexibility to replace it with any classification model. There are multiple options for per-cluster L1 and L2 regularization, as well as options to train with a mixture of labeled and unlabeled data.
Python Code:
.zip archive.
Papers:
(SGMM Paper)
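The prediction side of the Gaussian-gated model can be sketched as follows: the gate computes Gaussian-mixture responsibilities p(k|x), and the output blends each expert's logistic prediction by that posterior. An illustrative sketch with 1-D features and hypothetical parameter names, not the released trainer:

```python
import numpy as np

def sgmm_predict(X, pis, mus, sigmas, expert_w, expert_b):
    """
    Gaussian-gated mixture-of-experts prediction: the gate is the
    posterior p(k|x) of a 1-D Gaussian mixture, the experts are
    per-cluster logistic regressions, and the output is their
    gate-weighted blend.
    """
    # Gate: Gaussian mixture responsibilities
    dens = pis * np.exp(-0.5 * ((X[:, None] - mus) / sigmas) ** 2) / sigmas
    gate = dens / dens.sum(axis=1, keepdims=True)
    # Experts: per-cluster logistic predictions
    experts = 1.0 / (1.0 + np.exp(-(X[:, None] * expert_w + expert_b)))
    return (gate * experts).sum(axis=1)
```

Training fits the gate and expert parameters jointly (with the regularization and semi-supervised options noted above); prediction then reduces to this gated blend.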
HMM-Boost
- Train a semi-supervised HMM (HMM-Boost) model or a mixture of semi-supervised HMM models.
Contributors: Georgios Mavroudeas, Malik Magdon-Ismail
Summary:
This package applies the HMM-Boost methodology to train HMM and MHMM models using a semi-supervised algorithm. It also has the option to train a traditional unsupervised HMM. The observational model of the HMM is a mixture of Gaussians per state.
Python Code:
.zip archive.
Papers:
(HMM-BOOST paper)
Pre Cluster and Merge (PCM)
- Given a standard causal dataset, apply the PCM methodology and automatically uncover effect subpopulations.
Contributors: Georgios Mavroudeas, Malik Magdon-Ismail
Summary:
This package uses the novel PCM algorithm to automatically extract effect clusters from causal populations.
Python Code:
.zip archive.
Papers:
(PCM work, submitted in ICLR 2023)