Advances of Machine Learning in Theory & Applications





Research Projects

Each AMALTHEA REU team worked on a chosen research topic. Here you will find descriptions of these topics along with the posters that the teams presented during AMALTHEA's Symposium at the end of the summer experience. Each poster is a clickable link to a higher-resolution version. More details on the teams' research topics can be found in their technical reports (TRs; see the link below each project description). Finally, some of this research was published as conference and journal papers, which you can find on the Publications page.




Poster by Kelvin Cardona; Summer 2007

Title: Statistical Analysis of Functional Data
By: Emily Helfer & Bennett Smith
Graduate mentor(s): Rana Haber
Faculty mentor(s): Adrian M. Peter
Abstract: In this paper, we explore several different statistical techniques for functional data analysis. Initially, we explored non-parametric density estimation for (infinite dimensional) functional data. The theoretical development differs slightly from standard multivariate statistics due to the technical issues that arise when working in infinite dimensional function spaces; for example, care must be taken during operations such as integration. We discuss using wavelets and Hermite polynomials as bases for orthogonal series density estimation of finite dimensional data and their extensions to the function space setting. However, due to the exhaustive computations needed to apply orthogonal series density estimation to functional data, we decided to explore using kernel density estimation instead. The downside to this is that the functional data must be spatially dependent to use kernel density estimation. Finally, we study classification of functional data using distribution fields. We focus on using several function-specific features of the data and develop a novel metric learning framework that optimally combines the features and classifies the data. The practicality of this classification technique is demonstrated on several functional datasets.
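As a minimal sketch of the kernel density estimation route described above (a toy example with invented names and data; real functional data would first be reduced to a finite-dimensional representation, e.g. basis coefficients):

```python
import numpy as np

def gaussian_kde(samples, query, bandwidth=0.5):
    """Gaussian kernel density estimate at each query point.

    samples: (n, d) array of observations (e.g., basis coefficients of
    discretized functional data); query: (m, d) evaluation points.
    """
    samples = np.atleast_2d(samples)
    query = np.atleast_2d(query)
    n, d = samples.shape
    # Squared Euclidean distance between every query and sample point
    diff = query[:, None, :] - samples[None, :, :]
    sq = np.sum(diff ** 2, axis=-1)
    norm = (2 * np.pi * bandwidth ** 2) ** (d / 2)
    return np.exp(-sq / (2 * bandwidth ** 2)).sum(axis=1) / (n * norm)

# Toy check: the estimated density should peak near the data mass
data = np.array([[0.0], [0.1], [-0.1]])
p = gaussian_kde(data, np.array([[0.0], [5.0]]), bandwidth=0.3)
```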

TR: Helfer, E. and Smith, B., Haber, R. and Peter, A.M. (2015) Statistical Analysis of Functional Data, Technical Report TR-2015-05, The AMALTHEA REU Program, Summer 2015 [PDF]


Title: Aerial Geiger-Mode LiDAR Feature Extraction for Scene Classification
By: Chelse Bulthuis & Joshua Nuez
Graduate mentor(s): Ayokunle Ade-Aina
Faculty mentor(s): Anthony O. Smith
Abstract: Aerial Geiger-mode LiDAR (Light Detection and Ranging) is high-definition data that utilizes laser technology in order to collect information from target objects on the earth's surface. This information is collected over vast areas using an aircraft-mounted Geiger-mode sensor and is processed in order to represent objects in space with a point cloud scene. In this scene, groups of points belong to target objects such as houses, trees, cars, power lines, and terrain, which can be visually detected in a 3D graph of the point cloud. While this seems like a trivial task for the human brain, the high definition and unorganized nature of LiDAR data makes it challenging for a computer to identify these objects. For a computer to identify target objects successfully, four problems need to be solved: clustering, neighborhood selection, feature extraction, and classification. Clustering is the process of splitting the data into smaller groups, while feature extraction is the process of characterizing the data. Our research is focused on the first three. The first step is to partition the data into natural groups, or clusters. These clusters must capture the geometry of target objects so that the extracted information, or features, describes characteristics that make it easy for the computer to classify the points within a cluster. In the course of this research, we evaluated the clustering algorithms k-nearest neighbors (k-NN), radial search, k-Means, Gaussian mixture model expectation maximization (GMM), and affinity propagation on LiDAR data. We used the message-passing procedure of affinity propagation to identify the ideal number of clusters and initialize cluster centers for both k-Means and GMM, which eliminates the inconsistencies in clustering results due to random initialization that these algorithms were previously associated with.
We selected smart neighborhoods based on the degree of homogeneity of each cluster, and we extracted features describing the target objects. These features fall into the categories of point-specific features and geometric and statistical features of clusters. Our results showed that k-NN and radial search produce non-exclusive clusterings, which makes them unsuitable for feature extraction. k-Means and affinity propagation created clusters that are consistent in size and mostly homogeneous, while GMM creates clusters of inconsistent size but clusters noise, ground, and vegetation fairly well.
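The affinity-propagation-seeded clustering step can be sketched with scikit-learn, using a hypothetical two-blob stand-in for LiDAR point clusters (all variable names and data are invented for illustration):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, KMeans

rng = np.random.default_rng(0)
# Two well-separated 3-D "point cloud" blobs standing in for LiDAR objects
X = np.vstack([rng.normal(0, 0.3, (40, 3)), rng.normal(5, 0.3, (40, 3))])

# Affinity propagation's message passing picks exemplars, and hence the
# number of clusters, without it being specified in advance
ap = AffinityPropagation(random_state=0).fit(X)
k = len(ap.cluster_centers_)

# Seed k-Means with the exemplars to remove random-initialization variance
km = KMeans(n_clusters=k, init=ap.cluster_centers_, n_init=1).fit(X)
labels = km.labels_
```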

TR: Bulthuis, C. and Nuez, J., Ade-Aina, A. and Smith, A.O. (2015) Aerial Geiger-Mode LiDAR Feature Extraction for Scene Classification, Technical Report TR-2015-04, The AMALTHEA REU Program, Summer 2015 [PDF]


Title: A Machine Learning Evaluation of Aerial LiDAR for Classification of Terrestrial and Urban Structures
By: Benjamin Drebing & Michael Sokolich
Graduate mentor(s): Kaleb Smith
Faculty mentor(s): Anthony O. Smith
Abstract: Light Detection and Ranging (LiDAR) sensors can be used in aircraft to create aerial point clouds modeling landscapes. As these data sets become more readily available, there is an increasing need for point classification into discrete and meaningful classes. Historically, research to classify LiDAR has involved either an excruciating manual phase or the collection of auxiliary data. We seek to eliminate both by investigating automated classification techniques and their application to LiDAR. In this paper we present an in-depth analysis of multiple classification algorithms and evaluate their performance on standard data sets. We look to classify data based on measures calculated from only (latitude, longitude, elevation) coordinates. More specifically, we discuss Bayesian classification, Linear Discriminant Analysis, Quadratic Discriminant Analysis, and Kernel Linear Discriminant Analysis. Results from all classifiers are analyzed and compared.
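A sketch of the LDA/QDA comparison on (latitude, longitude, elevation)-style features, using synthetic stand-in classes rather than real LiDAR data (class definitions and parameters below are assumptions for illustration):

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(1)
# Synthetic stand-ins for two point classes: low "ground" vs tall "buildings",
# each row being an (x, y, elevation) coordinate
ground = np.column_stack([rng.uniform(0, 100, (200, 2)),
                          rng.normal(1, 0.5, 200)])
buildings = np.column_stack([rng.uniform(0, 100, (200, 2)),
                             rng.normal(12, 3, 200)])
X = np.vstack([ground, buildings])
y = np.array([0] * 200 + [1] * 200)

# Fit both discriminant classifiers and compare training accuracy
lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)
acc_lda = lda.score(X, y)
acc_qda = qda.score(X, y)
```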

TR: Drebing, B. and Sokolich, M., Smith, K. and Smith, A.O. (2015) A Machine Learning Evaluation of Aerial LiDAR for Classification of Terrestrial and Urban Structures, Technical Report TR-2015-03, The AMALTHEA REU Program, Summer 2015 [PDF]


Title: Classifying Frog Calls Using Gaussian Mixture Model and Locality Sensitive Hashing
By: Kathryn Hollowood & Olatide Omojaro
Graduate mentor(s): Dalwindeerjeet Kular
Faculty mentor(s): Eraldo Ribeiro
Abstract: Biodiversity conservation is a major concern of modern society. Pressures such as littering, pollution, and climate change deteriorate natural habitats and reduce biodiversity. To help address this issue, we need some way to track these habitat damages. We need a bio-indicator. Frogs make excellent bio-indicators due to their permeable skin. This skin allows them to absorb water and oxygen. However, when they live in polluted areas they also absorb toxins. As pollution rises, some frog populations suffer decline. Citizen scientists have been monitoring these frog populations by recording their calls for years. However, this process requires expensive equipment, training, and a sizable time commitment. This paper focuses on the automatic classification of frog calls using computer programs. Features were extracted from the audio data and classified using two classification methods: Locality Sensitive Hashing and Gaussian Mixture Modeling. Tests performed on a dataset of frog calls of 15 different species produced promising classification results for the Gaussian mixture model approach.
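The Gaussian-mixture route can be sketched as fitting one GMM per species and assigning a call to the species whose model gives the higher likelihood (the acoustic features and species here are invented placeholders, not the paper's actual feature set):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Fake 2-D acoustic features (e.g., dominant frequency, call duration)
# for two hypothetical species
calls_a = rng.normal([2.0, 0.1], 0.1, (50, 2))
calls_b = rng.normal([4.5, 0.3], 0.1, (50, 2))

# Fit one mixture model per species on its training calls
gmm_a = GaussianMixture(n_components=2, random_state=0).fit(calls_a)
gmm_b = GaussianMixture(n_components=2, random_state=0).fit(calls_b)

def classify(x):
    """Assign a call to the species whose GMM gives higher log-likelihood."""
    x = np.atleast_2d(x)
    return "a" if gmm_a.score_samples(x)[0] > gmm_b.score_samples(x)[0] else "b"
```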

TR: Hollowood, K. and Omojaro, O., Kular, D. and Ribeiro, E. (2015) Classifying Frog Calls Using Gaussian Mixture Model and Locality Sensitive Hashing, Technical Report TR-2015-02, The AMALTHEA REU Program, Summer 2015 [PDF]


Title: The Minority Game: Sharing Limited Resources in a Competitive World
By: Daniel Romero & Elissa M. Shinseki
Graduate mentor(s): S.M. Mahdi Seyednezhad
Faculty mentor(s): Ronaldo Menezes
Abstract: The minority game (MG) has been extensively studied as a model for resource allocation. This study contributes to the accumulating wealth of literature on MG by investigating two major areas lacking in current research: (1) the effect of varying parameters using an extended version of the standard MG with multiple resources and (2) the use of the evolutionary minority game on complex network topologies. In our multiple-resource research, we consider resource weighting, strategy sharing, and varying capacities to study effects on agent attendance, variance, winning rate, and effective resource usage. We find that the use of a single strategy is not as effective as different strategies when attempting to enjoy resources simultaneously. We also find that our original weighting methods may effectively manipulate attendance of multiple resources at once to prioritize one resource over another. In our complex network research, the distribution of probability values for each agent (governing global strategy adherence) and the rate at which these values change are compared across the Kleinberg Small World, Erdős–Rényi, and Barabási–Albert complex network topologies. When poorly performing agents consult their neighbors for new probability values, the distribution of their values becomes highly polarized. Further research in this area could be beneficial in better understanding consumer behavior in networks.
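For context, a minimal implementation of the standard single-resource minority game that the study extends (parameter values below are illustrative, not those used in the paper):

```python
import random

def minority_game(n_agents=101, memory=2, n_strategies=2, rounds=200, seed=0):
    """Minimal standard minority game: an odd number of agents repeatedly
    choose side 0 or 1, and whoever is in the minority wins the round.
    Each agent holds `n_strategies` random lookup tables from the last
    `memory` outcomes to a choice and plays its best-scoring table.
    Returns the per-round attendance (number of agents choosing side 1)."""
    rng = random.Random(seed)
    n_hist = 2 ** memory
    strategies = [[[rng.randrange(2) for _ in range(n_hist)]
                   for _ in range(n_strategies)] for _ in range(n_agents)]
    scores = [[0] * n_strategies for _ in range(n_agents)]
    history = 0
    attendance = []
    for _ in range(rounds):
        choices = [strategies[a][max(range(n_strategies),
                                     key=lambda s: scores[a][s])][history]
                   for a in range(n_agents)]
        ones = sum(choices)
        minority = 1 if ones < n_agents - ones else 0
        # Reward every strategy that would have picked the minority side
        for a in range(n_agents):
            for s in range(n_strategies):
                if strategies[a][s][history] == minority:
                    scores[a][s] += 1
        history = ((history << 1) | minority) % n_hist
        attendance.append(ones)
    return attendance

att = minority_game()
```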

TR: Romero, D. and Shinseki, E.M., Seyednezhad, S.M. and Menezes, R. (2015) The Minority Game: Sharing Limited Resources in a Competitive World, Technical Report TR-2015-01, The AMALTHEA REU Program, Summer 2015 [PDF]


Title: Exploration of Functional Data: Regression, Classification, Interpolation
By: Candice Schumann & John Talbot
Graduate mentor(s): Rana Haber
Faculty mentor(s): Adrian M. Peter
Abstract: Functional Data Analysis (FDA) is a growing field in machine learning which provides a more holistic approach to the analysis of data that is assumed to have a smooth underlying function. The underlying function can be represented using a basis expansion; we used both the Fourier and the wavelet series expansions to represent our functional data. Traditional machine learning approaches default to treating data as feature vectors and, in doing so, do not leverage the fact that there is a smooth underlying function. For example, derivatives of the underlying function can be taken, yielding additional information about the data. In this paper, two different methods of functional data classification are developed and discussed. The first method is classification with principal component kernel regression. The second method is classification with discriminative interpolation. In order to do functional data analysis, traditional machine learning approaches need to be extended to a functional context. For instance, principal component analysis and regression have been extended to functional principal component analysis and functional regression. Finally, we compare these two methods against a kernel support vector machine that treats all data as feature vectors. The classification results are comparable with traditional machine learning approaches.
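The basis-expansion idea can be sketched as fitting a truncated Fourier series to sampled functional data by least squares (the signal and all names below are our own toy choices):

```python
import numpy as np

def fourier_coeffs(t, y, n_harmonics=5):
    """Least-squares fit of a truncated Fourier series on [0, 1].

    Returns the coefficient vector and the design matrix of basis
    functions evaluated at the sample points t."""
    cols = [np.ones_like(t)]
    for k in range(1, n_harmonics + 1):
        cols += [np.cos(2 * np.pi * k * t), np.sin(2 * np.pi * k * t)]
    basis = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
    return coef, basis

# A toy "functional datum" that lies exactly in the Fourier span
t = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * t) + 0.5 * np.cos(4 * np.pi * t)
coef, basis = fourier_coeffs(t, y)
recon = basis @ coef
err = np.max(np.abs(recon - y))
```

Once the coefficients are known, derivatives of the underlying function follow analytically from differentiating the basis, which is one advantage the abstract points to over raw feature vectors.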

TR: Schumann, C. and Talbot, J., Haber, R. and Peter, A. (2014) Exploration of Functional Data: Regression, Classification, Interpolation, Technical Report TR-2014-05, The AMALTHEA REU Program, Summer 2014 [PDF]


Title: Wavelet based density estimation for multidimensional streaming data
By: Daniel Weinand & Gedeon Nyengele
Graduate mentor(s): Mark Moyou
Faculty mentor(s): Adrian M. Peter
Abstract: In the present information economy, the colossal amount of data generated daily has spawned the need for real-time, data-driven algorithms that extract actionable intelligence with minimal user influence. Density estimation is one path to extract this intelligence from the data, and the incorporation of wavelets serves to boost the accuracy of the density estimation framework. The goal of this project was to develop a multidimensional computational implementation of this wavelet density estimation framework and showcase its utility on a relevant application. The report contains background information on wavelets and density estimation, and all the details about the Matlab and Java implementations of the multidimensional wavelet density estimator. Finally, we demonstrate the utility of the approach by applying it to financial market analysis.

TR: Weinand, D. and Nyengele, G., Moyou, M. and Peter, A.M. (2014) Wavelet based density estimation for multidimensional streaming data, Technical Report TR-2014-04, The AMALTHEA REU Program, Summer 2014 [PDF]


Title: Non-Linear Dimensionality Reduction: An Evaluation of Diffusion Based Analysis
By: Jamie Fox & Tabitha Beavers
Graduate mentor(s): Kaleb Smith
Faculty mentor(s): Adrian M. Peter & Gnana B. Tenali
Abstract: In this paper, we compare two non-linear dimensionality reduction methods, namely diffusion maps, and a recent generalization of this method, vector diffusion maps. While diffusion maps simply use scalar weight values to describe the similarity between two data points, vector diffusion maps utilize both scalar weights and orthogonal transformations to describe this same similarity. Because the vector diffusion map algorithm incorporates this extra information in the composition of its distance metric, we expect this method to perform dimensionality reduction more effectively than the diffusion map algorithm. Consequently, we also expect the distance metric of vector diffusion maps to provide a better representation of the geodesic distance, since the distance metrics describe the relations within the embedded data set. This expectation was not confirmed by our results for the 3-dimensional data sets (such as the sphere, torus, and MATLAB's peaks data). In the vast majority of our results, the vector diffusion distance was slightly less accurate in appropriately classifying data points according to the geodesic distance.
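A bare-bones diffusion map, as a sketch of the scalar-weight variant the paper compares against (kernel width and coordinate count below are illustrative assumptions):

```python
import numpy as np

def diffusion_map(X, eps=1.0, n_coords=2):
    """Basic diffusion map: Gaussian kernel -> row-normalized Markov
    matrix -> top nontrivial eigenvectors as embedding coordinates."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / eps)                       # scalar similarity weights
    P = K / K.sum(axis=1, keepdims=True)        # Markov transition matrix
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    # Skip the trivial eigenvalue 1 with its constant eigenvector
    keep = order[1:n_coords + 1]
    return vecs.real[:, keep] * vals.real[keep]

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 3))
emb = diffusion_map(X)
```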

TR: Fox, J. and Beavers, T., Smith, K. and Peter, A.M. and Tenali, G.B. (2014) Non-Linear Dimensionality Reduction: An Evaluation of Diffusion Based Analysis, Technical Report TR-2014-03, The AMALTHEA REU Program, Summer 2014 [PDF]


Title: WhatFrog: A Comparison of Classification Algorithms for Automated Anuran Recognition
By: Nikita A. Belyaev & Yash-yee K. Logan
Graduate mentor(s): Katrina M. Smart
Faculty mentor(s): Eraldo Ribeiro
Abstract: The motivation for this paper centers on the comparative effectiveness of several classifiers, and combinations thereof, at identifying species of frogs by their calls. Frog identification is important because the presence of frogs indicates the health of a wetland, and an automated identification system helps to monitor invasive and endangered species. Classification methods investigated were Support Vector Machines (SVM), Gaussian Mixture Models (GMM), Decision Tree (DT), Random Forest (RF), Discriminant Function Analysis (DA), Naive Bayes (NB), Neural Networks (NN), and K-Nearest Neighbor (KNN). In addition, optimization of some classification methods was achieved using combination techniques. Overall, the classification algorithm that showed the best result on its own was the Random Forest, and the best combination algorithm was Random Forest with K-Nearest Neighbor.

TR: Belyaev, N.A. and Logan, Y.K., Smart, K.M. and Ribeiro, E. (2014) WhatFrog: A Comparison of Classification Algorithms for Automated Anuran Recognition, Technical Report TR-2014-02, The AMALTHEA REU Program, Summer 2014 [PDF]


Title: Uncovering Criminal Networks From Crime Locations
By: Sarah White & Tobin Yehle
Graduate mentor(s): Hugo Serrano & Marcos Oliviera
Faculty mentor(s): Dr. Ronaldo Menezes
Abstract: Rising computational power allows for more sophisticated tools to identify patterns in criminal activity. New studies show criminals typically commit crimes in areas with which they are familiar, usually close to home. Using this information, we proposed a new model based on networks built with links between crimes in close physical proximity. We showed that the structure of the criminal world can be partially represented by this spatial network, and we analyzed the centrality of nodes in the network to find patterns in the underlying structure of criminal activity.
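The spatial-network construction can be sketched as linking crimes that occurred within a fixed radius of each other and then computing a centrality measure (the coordinates and radius below are invented):

```python
import math
import networkx as nx

# Hypothetical crime locations (x, y); an edge links two crimes that
# occurred within `radius` of each other
crimes = [(0, 0), (0.5, 0.2), (0.3, 0.9), (10, 10), (10.4, 9.8)]
radius = 1.5

G = nx.Graph()
G.add_nodes_from(range(len(crimes)))
for i in range(len(crimes)):
    for j in range(i + 1, len(crimes)):
        if math.dist(crimes[i], crimes[j]) <= radius:
            G.add_edge(i, j)

# Centrality of each crime node in the resulting proximity network
centrality = nx.degree_centrality(G)
```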

TR: White, S. and Yehle, T.K., Serrano, H., Oliviera, M. and Menezes, R. (2014) Uncovering Criminal Networks From Crime Locations, Technical Report TR-2014-01, The AMALTHEA REU Program, Summer 2014 [PDF]


Title: Unified framework for human-object interaction recognition
By: Jacob Chen & Haidar K. Khan
Graduate mentor(s): Ivan Bogun
Faculty mentor(s): Dr. Eraldo Ribeiro
Abstract: We propose a framework for human-object interaction recognition in videos. Our method uses trajectories of body parts as its main source of information for classification. More specifically, we represent interactions by a combination of the trajectory's spatial information and the presence of objects. We exploit the appearance of an object in multiple frames to improve object classification and, ultimately, interaction recognition. Our experiments show that: (a) using multiple frames increases object recognition accuracy, (b) implementing a specialized dynamic time warping kernel allows for accurate classification of the trajectories without discarding spatial information, and (c) the combination of motion and object cues, using multiple kernel learning, improves classification. Our algorithm achieves results comparable or superior to the state of the art.
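For reference, the dynamic time warping distance underlying such a trajectory kernel can be sketched as follows; a similarity such as exp(-DTW/sigma) could then be built on top of it. This is the generic textbook recurrence, not the paper's specialized kernel:

```python
import numpy as np

def dtw(a, b):
    """Classic dynamic time warping distance between two 1-D sequences,
    filled in via the standard cumulative-cost recurrence."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best of: insertion, deletion, match
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```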

TR: Chen, J. and Khan, H.K., Bogun, I. and Ribeiro, E. (2013) Unified framework for human-object interaction recognition, Technical Report TR-2013-05, The AMALTHEA REU Program, Summer 2013 [PDF]


Title: Large-scale Clustering for Big Data Analytics: A MapReduce Implementation of Hierarchical Affinity Propagation
By: Dillon M. Rose & Jean Michel Rouly
Graduate mentor(s): Rana Haber
Faculty mentor(s): Dr. Adrian M. Peter
Abstract: The accelerated evolution and explosion of the Internet and social media is generating voluminous quantities of data (on zettabyte scales). Paramount amongst the desires to manipulate and extract actionable intelligence from vast big data volumes is the need for scalable, performance-conscious analytics algorithms. To directly address this need, we propose a novel MapReduce implementation of the exemplar-based clustering algorithm known as Affinity Propagation. Our parallelization strategy extends to the multilevel Hierarchical Affinity Propagation algorithm and enables tiered aggregation of unstructured data with minimal free parameters, in principle requiring only a similarity measure between data points. We detail the linear run-time complexity of our approach, overcoming the limiting quadratic complexity of the original algorithm. Experimental validation of our clustering methodology on a variety of synthetic and real data sets (Reuters articles, medical, etc.) demonstrates our competitiveness against other state-of-the-art MapReduce clustering techniques.

TR: Rose, D.M. and Rouly, J.M., Haber, R., Mijatovic, N. and Peter, A.M. (2013) Large-scale Clustering for Big Data Analytics: A MapReduce Implementation of Hierarchical Affinity Propagation, Technical Report TR-2013-04, The AMALTHEA REU Program, Summer 2013 [PDF]


Title: Mixture Distribution Modeling on the Tangent Space of Hyper-Spherical Reproducing Kernel Hilbert Spaces
By: John L. Karlen & Shraddha Singh
Graduate mentor(s):
Faculty mentor(s): Dr. Georgios C. Anagnostopoulos
Abstract: Transforming data non-linearly by lifting it into higher dimensions is a common technique for data analysis in Machine Learning (ML). Non-linear transformations have been shown to yield data distributions that are more useful for a given ML task than those of the original data. This article is motivated by the advances of previous research on using kernels to embed data onto a hypersphere in the corresponding Reproducing Kernel Hilbert Space (RKHS). The same research also introduced the idea of mapping the data embedded on the hypersphere onto a tangent plane (tangent to the hypersphere at a point referred to as the pivot point) via the Logarithmic Map of the corresponding hypersphere. Our research builds on this concept by treating the pivot point as a tunable parameter, which is optimized using a block coordinate descent method along with the projected sub-gradient method to model a dataset as a mixture of Gaussians. The benefits of this approach are demonstrated by tackling clustering and classification tasks on real, publicly available datasets.
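The Logarithmic Map central to the tangent-plane construction above can be sketched for the unit sphere in ordinary Euclidean space (the paper applies the analogous map in an RKHS; names here are ours):

```python
import numpy as np

def log_map(pivot, x):
    """Logarithmic map of the unit hypersphere: sends a point `x` on the
    sphere to the tangent plane at `pivot`, preserving the geodesic
    distance from the pivot as the length of the tangent vector."""
    pivot = pivot / np.linalg.norm(pivot)
    x = x / np.linalg.norm(x)
    cos_t = np.clip(np.dot(pivot, x), -1.0, 1.0)
    theta = np.arccos(cos_t)          # geodesic distance on the sphere
    if theta < 1e-12:
        return np.zeros_like(pivot)
    v = x - cos_t * pivot             # component of x orthogonal to pivot
    return theta * v / np.linalg.norm(v)

p = np.array([0.0, 0.0, 1.0])   # pivot point: north pole
q = np.array([1.0, 0.0, 0.0])   # a point a quarter-circle away
u = log_map(p, q)
```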

TR: Karlen, J.L. and Singh, S., and Anagnostopoulos, G.C. (2013) Mixture Distribution Modeling on the Tangent Space of Hyper-Spherical Reproducing Kernel Hilbert Spaces, Technical Report TR-2013-03, The AMALTHEA REU Program, Summer 2013 [PDF]


Title: Feature Mappings via Periodic Evolution of Diffeomorphisms
By: Matthew G. Kishe & Halley J. Weitzman
Graduate mentor(s):
Faculty mentor(s): Dr. Georgios C. Anagnostopoulos
Abstract: Identifying feature transformations to effectively address a variety of tasks lies at the core of many Machine Learning (ML) models and formulations. In this work we introduce a novel model that allows for transforming arbitrarily distributed data in a non-linear fashion to a finite mixture of Gaussian distributions in the target space. This allows the formation of a suitable basis to develop probabilistic generative models, which would have been impossible, when considering the distribution of the original data. The resulting Gaussian mixture model can subsequently be used for a variety of ML tasks, such as recognition, clustering and dimensionality reduction among others. Our model, dubbed Vortex, is derived from considering discrete-time periodic transformations (orbits) of the original data. Due, mainly, to the periodic nature of these transformations, the implied mapping is a diffeomorphism between the original and the transformed data, which implies that pre-images of the mapping are unique for each mapped sample and can, in principle, be computed efficiently. In this work we consider the application of Vortex to Quadratic Discriminant Analysis (QDA) classification in the target space. Its problem formulation incorporates Group Lasso regularization in order to promote sparsity in the obtained solution and it leads to a convex minimization problem, which we solve efficiently via a standard proximal gradient method. After discussing the model and its associated formulation we demonstrate its merits via a series of experiments involving benchmark classification problems and comparisons to other well-known classification approaches. The experimental results we've obtained indicate that the Vortex model may exhibit very competitive recognition performance on a variety of classification tasks, when compared to a few standard recognition approaches.

TR: Kishe, M.G. and Weitzman, H.J., and Anagnostopoulos, G.C. (2013) Feature Mappings via Periodic Evolution of Diffeomorphisms, Technical Report TR-2013-02, The AMALTHEA REU Program, Summer 2013 [PDF]


Title: Analysis of Medical Data Using Dimensionality Reduction Techniques
By: Robert E. Colgan & David E. Gutierrez
Graduate mentor(s): Jugesh Sundram
Faculty mentor(s): Prof. G. Bhaskar Tenali
Abstract: High-dimensional data can be difficult to analyze, almost impossible to visualize, and expensive to process and store. In many cases, the high-dimensional data points may all lie on or close to a much lower-dimensional surface, or manifold, implying that the intrinsic dimensionality of the data is much lower. In that case, the data could be described with fewer dimensions, allowing us to mitigate the curse of dimensionality. Transforming the high-dimensional representation of the data to a lower-dimensional one without losing important information is the central problem of dimensionality reduction. Many methods of dimensionality reduction have been developed, including classical techniques like Principal Component Analysis (PCA) and newer methods such as Diffusion Maps (DM). Most of these methods perform well on some types of data but poorly on others. We apply different dimensionality reduction methods to medical data, including breast tissue tumor data and kidney proteomics data, in order to determine which methods and parameters work best on the different types of data. To evaluate the performance of each reduction method, we also classify the data in the reduced dimension using standard classification algorithms and evaluate the accuracy.
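The evaluation pipeline described above can be sketched as reducing high-dimensional data with PCA and scoring a standard classifier in the reduced space (the synthetic data and parameters are invented stand-ins for the medical data sets):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
# High-dimensional stand-in for medical data: two classes whose signal
# lives in a low-dimensional subspace buried in many noisy dimensions
n, d = 200, 50
y = rng.integers(0, 2, n)
X = rng.normal(0, 1, (n, d))
X[:, 0] += 4 * y  # class separation along one latent direction

# Reduce, then classify in the reduced space and evaluate accuracy
Z = PCA(n_components=5).fit_transform(X)
clf = KNeighborsClassifier(n_neighbors=5).fit(Z, y)
acc = clf.score(Z, y)
```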

TR: Colgan, R.E. and Gutierrez, D.E., Sundram, J. and Tenali, G.B. (2013) Analysis of Medical Data Using Dimensionality Reduction Techniques, Technical Report TR-2013-01, The AMALTHEA REU Program, Summer 2013 [PDF]


Title: Modeling Group Dynamics in Virtual Worlds
By: Chris Usher
Graduate mentor(s): Fahad Shah
Faculty mentor(s): Dr. Gita Sukthankar
Abstract: In this study, we examine human social interactions within virtual worlds and address the question of how group interactions are affected by the surrounding game environment. To investigate this problem, we analyzed conversational data collected from Second Life, a massively multi-player online environment that allows users to construct and inhabit their own 3D world. Our data collection chatbots were created to be sufficiently lifelike to be indistinguishable from other human participants to casual observers, so as not to perturb neighboring social interactions. Using our partitioning algorithm, we separated continuous chat logs from each region into separate conversations, which were used to construct a social network of the participants. Based on statistical studies of social network measures, we conclude that there are significant regional differences between social networks formed in Second Life.

TR: Usher, C., Shah, F., and Sukthankar, G. (2009) Modeling Group Dynamics in Virtual Worlds, Technical Report TR-2009-05, The AMALTHEA REU Program, Summer 2009 [PDF]


Title: Kernel similarity scores for outlier detection in mixed-attribute data sets
By: David Foregger, Julie Manuel
Graduate mentor(s): Ruben Ramirez-Padron
Faculty mentor(s): Prof. Michael Georgiopoulos
Abstract: The majority of current outlier detection methods are limited to data sets with all attributes of the same type. The few methods that deal with data sets mixing numerical and categorical data are complex and explicitly depend on the data types of the attributes, which makes them difficult to extend to other types of mixed-attribute data. Kernel functions can be seen as similarity functions defined on any set of objects, and kernel functions on different feature representations can be combined by using a set of well-established rules. To the best of our knowledge, however, current kernel-based outlier detection methods have been defined only on single-type data sets, and the kernel functions used have been components of relatively complex statistical techniques. This work proposes simple but effective kernel-based outlier detection methods for single-type and mixed-attribute data sets. Three different kernel similarity scores are introduced in order to rank points for outlier detection. For arbitrary mixed-attribute data sets, different kernel and score combinations are used to obtain a single similarity score. It is shown that our approach outperforms the Greedy and AVF algorithms on categorical data sets and Otey et al.'s algorithm on data sets with numerical and categorical attributes.
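One simple score of this kind can be sketched as the mean kernel similarity of each point to the rest of the data set, with low scores flagging outliers (a generic illustration on numerical data only, not necessarily one of the paper's three scores; a categorical kernel could be combined with the RBF kernel below by the usual sum/product rules):

```python
import numpy as np

def kernel_outlier_scores(X, gamma=0.5):
    """Score each point by its mean Gaussian-kernel similarity to all
    other points; lower scores indicate likelier outliers."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq)
    n = len(X)
    # Exclude self-similarity (the diagonal of ones) from the average
    return (K.sum(axis=1) - 1.0) / (n - 1)

# Four clustered points and one far-away point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
scores = kernel_outlier_scores(X)
outlier = int(np.argmin(scores))
```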

TR: Foregger, D., Manuel, J., Ramirez-Padron, R., and Georgiopoulos, M. (2009) Kernel similarity scores for outlier detection in mixed-attribute data sets, Technical Report TR-2009-04, The AMALTHEA REU Program, Summer 2009 [PDF]


Title: Social Network Analysis for Target Recognition in Swarm Robotics
By: Michael C. Koval, Adina E. S. Rubinoff
Graduate mentor(s): Mahsa Maghami
Faculty mentor(s): Prof. Michael Georgiopoulos
Abstract: Over the past few years, swarm-based systems have emerged as an attractive and promising approach for implementing distributed autonomous systems. This is useful in different applications, such as automatic target recognition (ATR). In ATR, the most pressing concern is the accuracy of the system in detecting and recognizing the targets. To fulfill this requirement, previous approaches require more than one agent to classify a target, which decreases the number of recognition errors. Increasing the number of agents required to recognize a task will make the system more accurate; however, it delays recognition, which is not acceptable in most applications. In addition, agents are never completely identical in the real world; even two identical agents will have sensors with different error tolerances. Unfortunately, previous approaches fail to distinguish accurate robots from less accurate ones. In this paper we propose a novel algorithm, named the NetBots algorithm, which improves accuracy in ATR by introducing a social network to the swarm system. This social network keeps track of the robots' target detection histories and is used to estimate the robots' accuracy. Any time two robots deciding a task agree with the majority decision, a link is formed between them. The PageRank algorithm is then used to rank the robots based on the number and ranks of the robots linking to them. Robots that are more accurate will agree with the majority more often, so they will have more links and therefore a higher rank. When deciding a task, higher ranked robots have more influence, so inaccurate robots' votes will not negatively impact the majority decision. This approach increases the overall accuracy of the system by 1.76 percent over 30 tasks, compared to other approaches with no social network.
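The agreement-network ranking step can be sketched with networkx's PageRank on a toy agreement graph (the robots and agreement counts below are invented):

```python
import networkx as nx

# Hypothetical agreement counts: (i, j, w) means robots i and j agreed
# with the majority decision together w times
agreements = [(0, 1, 5), (0, 2, 4), (1, 2, 6), (3, 0, 1)]

G = nx.Graph()
for i, j, w in agreements:
    G.add_edge(i, j, weight=w)

# Robots that agree with the majority more often accumulate more (and
# heavier) links, and thus a higher PageRank score
rank = nx.pagerank(G, weight="weight")
least_trusted = min(rank, key=rank.get)
```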

TR: Koval, M.C., Rubinoff, A.E.S., Maghami, M., and Georgiopoulos, M. (2009) Social Network Analysis for Target Recognition in Swarm Robotics, Technical Report TR-2009-03, The AMALTHEA REU Program, Summer 2009 [PDF]


Title: Learning Approximate Isometries using Multi-Dimensional Scaling based on Radial Basis Function Neural Networks
By: Ryan Gonet, RuiZhi Yu
Graduate mentor(s): Mingbo Ma
Faculty mentor(s): Dr. Georgios C. Anagnostopoulos
Abstract: This paper presents a new multi-dimensional scaling (MDS) technique for generating approximate isometries by way of using Radial Basis Function neural networks (RBF-NN). Metric MDS using the Sammon mapping is available and commonly used to create approximate isometries in order to perform data visualization. However, the classical Sammon mapping has the disadvantage of being unable to interpolate or extrapolate novel samples. Other techniques, such as one using a Multi-Layer Perceptron (MLP) to implement the mapping, cannot map datasets where only dissimilarities between patterns are known. By using RBF-NNs to train a model to map patterns from a higher dimension into a lower dimension, one can both perform dimensionality reduction on datasets consisting only of dissimilarities and also interpolate and extrapolate novel patterns. For training, both the delta rule, through conjugate gradient descent, and the limited-memory Broyden-Fletcher-Goldfarb-Shanno quasi-Newton method were used. Step lengths were calculated using a line search that finds step lengths obeying the strong Wolfe conditions. Experimental results using a prototype implementation of RBF-NNs to perform the Sammon mapping demonstrate that approximate isometries can be found. They also show that datasets consisting of only dissimilarities can still be visualized, where the reduction reveals a structural relationship between patterns. This new ability, in conjunction with the ability to interpolate and extrapolate previously unseen patterns, creates novel opportunities for the application of MDS using the Sammon mapping as a visualization technique.
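For reference, Sammon's stress, the objective such a mapping minimizes, can be sketched as follows (a generic formulation operating on precomputed distance matrices; the names are ours):

```python
import numpy as np

def sammon_stress(D_high, D_low):
    """Sammon's stress between original and embedded pairwise distances,
    weighting small original distances more heavily (each squared error
    is divided by the original distance, then normalized by their sum)."""
    iu = np.triu_indices_from(D_high, k=1)
    dh, dl = D_high[iu], D_low[iu]
    return np.sum((dh - dl) ** 2 / dh) / np.sum(dh)

# Toy pairwise-distance matrix for three collinear points
D = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
perfect = sammon_stress(D, D)        # an exact isometry has zero stress
worse = sammon_stress(D, D * 0.5)    # shrinking all distances adds stress
```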

TR: Gonet, R., Yu, R., Ma, M., and Anagnostopoulos, G.C. (2009) Learning Approximate Isometries using Multi-Dimensional Scaling based on Radial Basis Function Neural Networks, Technical Report TR-2009-02, The AMALTHEA REU Program, Summer 2009 [PDF]


Title: Designing ART-based Classifiers through Multi-Objective Memetic Evolution
By: Timothy R. Mersch, Oriana X. Wen
Graduate mentor(s): Rong Li
Faculty mentor(s): Dr. Georgios C. Anagnostopoulos
Abstract: In multi-objective optimization, the two pivotal attributes of the Pareto front are density and diversity. A high-density, high-diversity Pareto front is characterized by an ample set of significantly unique solutions. With respect to classifier models, obtaining a well-sampled Pareto front maximizes the ability to identify, along the trade-off curve relating model complexity and accuracy, the best-generalizing model on a testing set. Evolutionary algorithms operating in the multi-objective space have proven, to a certain extent, successful in optimizing ART-based classifiers and approaching the Pareto-optimal set. Building on previous work, a memetic genetic algorithm has been devised that evolves, in parallel, subpopulations of Fuzzy ARTMAP classifier models in order to attain a Pareto front with improved density and diversity. The genetic algorithm works in a quantized objective space, wherein the population of individual models is subdivided by complexity. The incorporation of specialized structures, from the Hall of Fame to the gene pool, together with systematic mechanisms, including a mutation selection scheme and simulated annealing, produces a memetic algorithm that takes advantage of both global exploration and localized exploitation. Safeguards against duplication and the combined forces of global and local search yield substantial enhancement of the density and diversity of the Pareto front. Two sets of experiments were conducted: one to optimize mutation and one to evaluate this genetic algorithm against a simpler implementation. The first series of experiments favors a uniform selection of individuals for mutation. The second reveals that the genetic algorithm improves upon a pre-existing multi-objective evolutionary algorithm that optimizes ART-based classifiers; the two algorithms were compared with either random or trained initialization and with different subpopulation sizes.
Overall, the novel approach is shown to be superior in terms of hyper-area, density, and two-set coverage of the final Pareto front. Also, the champion networks produced by the novel approach exhibit greater generalization power than those of the previous work.
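For readers unfamiliar with the terminology, Pareto dominance can be sketched in a few lines (the helper names below are illustrative, not from the report). A model is on the Pareto front if no other model is at least as good in both objectives, here taken as (complexity, error) with lower being better, and strictly better in at least one:

```python
def dominates(q, p):
    """True if q is at least as good as p in every objective
    and strictly better in at least one (lower is better)."""
    return all(a <= b for a, b in zip(q, p)) and any(a < b for a, b in zip(q, p))

def pareto_front(points):
    """Return the non-dominated subset of (complexity, error) pairs."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Density and diversity then describe how many such non-dominated models the algorithm finds and how evenly they cover the trade-off curve.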

TR: Mersch, T.R., Wen, O.X., Li, R., and Anagnostopoulos, G.C. (2009) Designing ART-based Classifiers through Multi-Objective Memetic Evolution, Technical Report TR-2009-01, The AMALTHEA REU Program, Summer 2009 [PDF]


Title: Iterative Inner Solvers for Revised Simplex SVM Training
By: Eric P. Astor, Winnie J. Lung
Graduate mentor(s): Ruben Ramirez-Padron, Christopher G. Sentelle
Faculty mentor(s): Prof. Michael Georgiopoulos
Abstract: Support Vector Machine (SVM) training is equivalent to solving a large constrained optimization problem. Much work has been devoted to decompositional optimization methods for this problem, but non-decompositional approaches have only recently regained attention. Notably, Sentelle’s work applying Rusin’s revised simplex method to SVM training demonstrated significantly shorter training times for complex problems, but required much more memory than competing decompositional methods. We propose a revision to Sentelle’s approach, replacing his direct inner solver with an iterative one: the preconditioned conjugate residual method. Results show an unexpected performance penalty, due to extreme ill-conditioning of the inner problems; avoiding this may require the creation of a specialized functional preconditioner.
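The (here unpreconditioned) conjugate residual method underlying the proposed inner solver can be sketched as follows for a symmetric system Ax = b. This is a textbook version using plain Python lists, not the report's implementation:

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_residual(A, b, tol=1e-10, max_iter=100):
    """Textbook conjugate residual iteration for a symmetric matrix A."""
    n = len(b)
    x = [0.0] * n
    r = [bi - axi for bi, axi in zip(b, matvec(A, x))]  # residual b - Ax
    p = r[:]                                            # search direction
    Ar = matvec(A, r)
    Ap = Ar[:]
    rAr = dot(r, Ar)
    for _ in range(max_iter):
        alpha = rAr / dot(Ap, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) ** 0.5 < tol:
            break
        Ar = matvec(A, r)
        rAr_new = dot(r, Ar)
        beta = rAr_new / rAr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        Ap = [ari + beta * api for ari, api in zip(Ar, Ap)]
        rAr = rAr_new
    return x
```

Preconditioning replaces the residual updates with solves against an approximation of A; the abstract's point is that without a preconditioner tailored to these inner problems, ill-conditioning makes the iteration slow.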

TR: Astor, E.P., Lung, W.J., Ramirez-Padron, R., Sentelle, C.G., and Georgiopoulos, M. (2008) Iterative Inner Solvers for Revised Simplex SVM Training, Technical Report TR-2008-06, The AMALTHEA REU Program, Summer 2008 [PDF]


Title: Interactively Evolved Modular Neural Networks for Agent Control
By: Jessica C. Sparks, Roberto Miguez
Graduate mentor(s): John Reeder
Faculty mentor(s): Prof. Michael Georgiopoulos
Abstract: As realism in games continues to increase through improvements in graphics and 3D engines, more focus is placed on the behavior of the simulated agents that inhabit these virtual worlds. Agents in modern video games must become more lifelike in order to appear to belong in the environments in which they are portrayed. Many modern game AIs achieve a high level of realism, but only through significant developer time spent scripting the behaviors of non-playable characters (NPCs). These agents behave believably in the scenarios they have been programmed for but cannot adapt to new situations. In this paper we introduce a modularized, real-time co-evolution training technique to evolve adaptable agents with lifelike behaviors. The experiments conducted produced very promising results regarding the efficiency of the technique and demonstrate potential avenues for future research.

TR: Sparks, J.C., Miguez, R., Reeder, J., and Georgiopoulos, M. (2008) Interactively Evolved Modular Neural Networks for Agent Control, Technical Report TR-2008-05, The AMALTHEA REU Program, Summer 2008 [PDF]


Title: Detecting Outliers in Categorical Data Sets Using Non-Derivable Itemsets
By: Michelle Fox, Gary Gramajo
Graduate mentor(s): Anna Koufakou
Faculty mentor(s): Prof. Michael Georgiopoulos
Abstract: Outlier detection is a research field with many applications, such as detecting credit card fraud or network intrusions. Most previous research focused on numerical data and pair-wise distances among data points to detect outliers. Nevertheless, most categorical data sets lack a straightforward mapping to numerical values, and approaches that rely on computing distances do not apply as easily. Recently, a few outlier methods were proposed for categorical datasets using the concept of Frequent Itemsets (FIs). The number of generated FIs can be far too high, especially in the case of large, dense datasets containing a high number of categorical values. There has been much research towards summarizing and/or condensing the FIs in a dataset to address this issue; however, these ideas have not been applied directly to the field of outlier detection. In this report, we explore the effect of using a condensed representation of Frequent Itemsets, called Non-Derivable Itemsets (NDI), on the accuracy and efficiency of an outlier detection method. Our experimental results indicate that NDI-based outlier detection offers significant gains in terms of speed and scalability over an FI-based outlier detection method, while maintaining comparable detection accuracy.
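A minimal sketch of the FI-based scoring idea (a brute-force toy in the spirit of FPOF-style detectors, not the NDI method itself; all names are illustrative): a record that supports few frequent itemsets receives a low score and is flagged as a likely outlier.

```python
from itertools import combinations

def frequent_itemsets(rows, min_sup):
    """All itemsets whose support (fraction of rows containing them) is at
    least min_sup. Brute force over item combinations; rows are sets of
    categorical attribute values."""
    items = set().union(*rows)
    fis = {}
    for size in range(1, len(items) + 1):
        found = False
        for cand in combinations(sorted(items), size):
            sup = sum(1 for row in rows if set(cand) <= row) / len(rows)
            if sup >= min_sup:
                fis[cand] = sup
                found = True
        if not found:  # no frequent itemset of this size => none larger
            break
    return fis

def outlier_score(row, fis):
    """Sum of supports of the frequent itemsets this row contains;
    a lower score means the row is more anomalous."""
    return sum(sup for itemset, sup in fis.items() if set(itemset) <= row)
```

Real FI miners use Apriori-style pruning rather than full enumeration; the NDI representation studied in the report further shrinks the set of itemsets that must be generated and scored.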

TR: Fox, M.S., Gramajo, G.E., Koufakou, A., and Georgiopoulos, M. (2008) Detecting Outliers in Categorical Data Sets Using Non-Derivable Itemsets, Technical Report TR-2008-04, The AMALTHEA REU Program, Summer 2008 [PDF]


Title: Development of a Large Vocabulary Continuous Speech Recognition System for Rich Transcription Evaluation Using HTK
By: David A. Wax, Noah A. Larsen, Matthew J. Furstoss, and Veton Z. Kepuska
Graduate mentor(s):
Faculty mentor(s):
Abstract: The focus of this research is to build a large vocabulary continuous speech recognition (LVCSR) system that converts speech to text in accordance with the National Institute of Standards and Technology (NIST) Rich Transcription (RTE) Evaluation requirements. The result of the current effort will serve as a baseline for future work in the development of an advanced speech recognition system based on WUW technology. Cambridge University's HTK Speech Recognition Toolkit Version 3.4 serves as the engine in this process. In order to create a sufficiently large speech dataset, multiple corpora are combined, including TIMIT and the NIST RTE 2006 (RT06) and 2007 (RT07) data. Recognition testing and evaluation is performed under a variety of conditions to find the parameters for optimum accuracy; modifiable factors include insertion penalties (IP), language models, phonetic questioning, bootstrapping, and skip states. Performance is measured by word error rate (WER). The addition of insertion balancing consistently improved WER at both the phone and word level, while the removal of TIMIT shibboleth sentences produced no significant change in WER. Phonetic questioning effectively improved computation time without a significant increase in WER. Training and testing on TIMIT corpus data with a language model attained the lowest WER of 78.74%. Although this is higher than what other research has achieved with HTK, the addition of the language model improved WER by a relative difference of 48.97%. Additionally, performance is better than expected given that only 1 Gaussian mixture and 3 output states per HMM were used.
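The word error rate used as the performance measure is the word-level Levenshtein distance (substitutions + deletions + insertions) normalized by the reference length. A standard dynamic-programming sketch (not HTK's scoring tool) follows:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / #reference words,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution?
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[-1][-1] / len(ref)
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why values like 78.74% can still represent a meaningful relative improvement.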

TR: Wax, D.A., Larsen, N.A., Furstoss, M.J., and Kepuska, V.Z. (2008) Development of a Large Vocabulary Continuous Speech Recognition System for Rich Transcription Evaluation Using HTK, Technical Report TR-2008-03, The AMALTHEA REU Program, Summer 2008 [PDF]


Title: Multi-stage Automatic License Plate Location & Recognition
By: Rong Li, Musa Yassin Fort
Graduate mentor(s):
Faculty mentor(s): Dr. Georgios C. Anagnostopoulos
Abstract: Since as early as the 1970s, the need for an Automatic License Plate Recognition (ALPR) system has arisen from the demands of law enforcement and traffic control on transportation systems. This area has motivated a large amount of research effort, and various approaches and solutions have been proposed and implemented. Existing ALPR systems typically consist of modules addressing three tasks: license plate localization, character segmentation, and character recognition. In this paper, a novel ALPR system is proposed in which a new de-skewing stage is added between the license plate localization and character segmentation stages. This stage allows the system to process images taken at an angle with respect to the license plate's normal and from a relatively close distance, which typically results in skew distortion. The additional stage ensures that the characters are all lined up horizontally with the same height, which allows precise character segmentation; furthermore, it rectifies the characters and allows simple recognition techniques, such as cross-correlation, to yield good classification results. In addition to the de-skewing stage, a license plate localization method is also proposed. It shares some common ground with current approaches, as it is based on vertical edge density. Images with complicated edge backgrounds have been a common problem for algorithms that use an edge density approach, yet the new localization method achieves a success rate of 86.4% on a database of 822 images. While the prototype system's plate recognition accuracy of 12.3% is not yet practical, several of the rudimentary techniques it employs could be exchanged for more sophisticated and effective ones, improving system performance significantly in future efforts. Nonetheless, this work demonstrates that the de-skewing process can be significantly advantageous.
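A toy sketch of the vertical-edge-density idea behind the localization method (the threshold, window sizes, function names, and list-of-lists grayscale image representation are illustrative, not the report's): plate characters produce many strong horizontal intensity changes, so the window with the most vertical edges is a good plate candidate.

```python
def vertical_edge_map(img, thresh=40):
    """Mark pixels with a strong horizontal intensity change (i.e. a
    vertical edge), as produced by license-plate characters."""
    h, w = len(img), len(img[0])
    return [[1 if x + 1 < w and abs(img[y][x + 1] - img[y][x]) > thresh else 0
             for x in range(w)] for y in range(h)]

def densest_window(edges, win_h, win_w):
    """Return the top-left (row, col) of the window with the most edge
    pixels; this window is the candidate plate region."""
    h, w = len(edges), len(edges[0])
    best, best_pos = -1, (0, 0)
    for y in range(h - win_h + 1):
        for x in range(w - win_w + 1):
            density = sum(edges[yy][xx]
                          for yy in range(y, y + win_h)
                          for xx in range(x, x + win_w))
            if density > best:
                best, best_pos = density, (y, x)
    return best_pos
```

Cluttered backgrounds defeat this naive version because non-plate regions can also be edge-dense, which is exactly the failure mode the report's localization method is designed to handle better.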

TR: Li,R., Yassin-Fort, M.Y., and Anagnostopoulos G.C. (2008) Multi-stage Automatic License Plate Location & Recognition, Technical Report TR-2008-02, The AMALTHEA REU Program, Summer 2008 [PDF]


Title: Real-time, Static and Dynamic Hand Gesture Recognition for Human-Computer Interaction
By: S.M. Hassan Ahmed, Todd C. Alexander
Graduate mentor(s):
Faculty mentor(s): Dr. Georgios C. Anagnostopoulos
Abstract: Real-time, static and dynamic hand gesture recognition affords users the ability to interact with computers in more natural and intuitive ways. The hand by itself can communicate much more information than computer mice, joysticks, and similar devices, allowing a greater number of possibilities for computer interaction. The purpose of this work was to design, develop and study a practical framework for real-time gesture recognition that can be used in a variety of human-computer interaction applications, with the aim of developing a prototype system for controlling Microsoft PowerPoint™ presentations. The approach taken is to locate hands starting with the fingertips, after investigating potential regions of interest that are maintained through motion detection and subsequent tracking. The fingertips are easily distinguished from many backgrounds, allowing a very robust tracking system, and techniques such as Features from Accelerated Segment Test (FAST) corner detection find the tips efficiently. Circles generated with Bresenham's algorithm were employed for finding corners as well as the center of the palm. Using features generated from these results along with a semi-supervised Adaptive Resonance Theory (ART) neural network, static gestures were classified with an overall accuracy of 75%. Dynamic gestures, on the other hand, were detected using the trajectory formed by the center of the hand over a finite amount of time, with simple decision heuristics then utilized to identify the movement of the hand. This research produced a working prototype of a robust hand tracking and gesture recognition system that can be used in numerous applications.
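Bresenham's (midpoint) circle algorithm, mentioned above for sampling pixels around candidate fingertips and the palm center, can be sketched as follows. This is the textbook integer version, not the report's code:

```python
def bresenham_circle(cx, cy, r):
    """Integer midpoint-circle pixels of radius r centered at (cx, cy).
    Only one octant is computed; the other seven come from symmetry."""
    points = set()
    x, y, err = r, 0, 1 - r
    while x >= y:
        # Mirror the computed octant point into all eight octants.
        for dx, dy in ((x, y), (y, x), (-y, x), (-x, y),
                       (-x, -y), (-y, -x), (y, -x), (x, -y)):
            points.add((cx + dx, cy + dy))
        y += 1
        if err < 0:
            err += 2 * y + 1          # midpoint inside circle: keep x
        else:
            x -= 1
            err += 2 * (y - x) + 1    # midpoint outside: step x inward
    return points
```

Testing the intensities along such a circle around a candidate point is also the core of the FAST corner test cited in the abstract.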

TR: Ahmed, S.M.H, Alexander, T.C., and Anagnostopoulos G.C. (2008) Real-time, Static and Dynamic Hand Gesture Recognition for Human-Computer Interaction, Technical Report TR-2008-01, The AMALTHEA REU Program, Summer 2008 [PDF]


Title: Testing and Improvement of the Triple Scoring Method for Applications of Wake-up Word Technology (2007)
By: Andrew Stiles, Brandon Schmitt & Tad Gertz
Graduate mentor(s): Tudor Klein
Faculty mentor(s): Dr. Veton Kepuska
Abstract: Constant monitoring of an individual's voice, with near-perfect recognition of a specific word and consistent rejection of all other words, can be realized by implementing Wake-Up Word (WUW) Speech Recognition (SR) technology. The algorithm shown here has the potential to add robustness even in a speaker-independent environment, and provides much better results for single-word recognition when compared to current industry and academic standards such as Microsoft SAPI and HTK, respectively. A Triple Scoring Method (TSM), implemented with Hidden Markov Models (HMMs) in the feature domain, yields WUW modeling results that are far superior in single-word recognition, providing a 15166.15% increase in correct recognition on the CallHome corpus over HTK and a 1303.78% increase over the Microsoft SDK.

TR: Stiles, A., Schmitt, B., Gertz, T., Klein, T., and Kepuska, V.Z. (2007) Testing and Improvement of the Triple Scoring Method for Applications of Wake-up Word Technology, Technical Report TR-2007-04, The AMALTHEA REU Program, Summer 2007 [PDF]


Title: NEAT Drummer: Interactive Evolutionary Computation for Drum Pattern Generation (2007)
By: Amy Hoover
Graduate mentor(s):
Faculty mentor(s): Dr. Kenneth Stanley
Abstract: A major challenge in computer-generated music is breaking the barrier between musical novelty and musical quality. Typically, computer music generators produce either genre-specific patterns that lack innovation or patterns that are given too much freedom and lack cohesion. In an attempt to both constrain the musical search space and produce novel rhythms, a program called NEAT Drummer is introduced. NEAT Drummer evolves neural networks with NeuroEvolution of Augmenting Topologies (NEAT) to produce compelling drum patterns. To constrain the musical search space, NEAT Drummer accepts a base rhythm or motif from the user and, through Interactive Evolutionary Computation (IEC), complexifies that pattern with each successive generation. This work discusses the concepts behind how NEAT Drummer understands and manipulates a base rhythm, which is either predefined by the user through a basic interface or defined by MIDI music file information.

TR: Hoover, A.K., and Stanley, K.O. (2007) NEAT Drummer: Interactive Evolutionary Computation for Drum Pattern Generation, Technical Report TR-2007-03, The AMALTHEA REU Program, Summer 2007 [PDF]

Poster by Kelvin Cardona; Summer 2007

Title: A Grid Based System for Data Mining Using MapReduce (2007)
By: Kelvin Cardona
Graduate mentor(s): Jimmy Secretan
Faculty mentor(s): Prof. Michael Georgiopoulos
Abstract: We discuss a Grid data mining system based on the MapReduce paradigm of computing. The MapReduce paradigm emphasizes system automation of fault tolerance and redundancy, while keeping the programming model for the user very simple. MapReduce is built closely on top of a distributed file system that allows efficient distributed storage of large data sets and allows computation to be scheduled close to this data. Many machine learning algorithms can be easily integrated into this environment. We explore the potential of the MapReduce paradigm for general large-scale data mining. We offer several modifications to the existing MapReduce scheduling system to bring it from a cluster environment to a campus grid that includes desktop PCs, servers and clusters. We provide an example implementation of a machine learning algorithm (the Probabilistic Neural Network) in MapReduce form. We also discuss a MapReduce simulator that can be used to develop further enhancements to the MapReduce system. We provide simulation results for two new proposed scheduling algorithms, designed to improve MapReduce processing on the grid; these scheduling algorithms provide increased storage efficiency and increased job processing speed when used in a heterogeneous grid environment. This work will be used in the future to produce a fully functioning implementation of the MapReduce runtime system for a grid environment that will enable easy, data-intensive parallel computing for machine learning with little to no additional hardware investment.
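A toy sketch of how Probabilistic Neural Network (PNN) classification fits the MapReduce mold (function names and the in-process "workers" are illustrative, not the report's implementation; a real deployment would run the map step on distributed data chunks): each mapper emits partial per-class Gaussian-kernel sums for its chunk of training data, and the reducer merges them and picks the class with the largest total activation.

```python
import math
from collections import defaultdict

def map_pnn(chunk, query, sigma):
    """Map step: emit (label, partial kernel sum) pairs for one chunk of
    (features, label) training samples, evaluated at the query point."""
    partial = defaultdict(float)
    for features, label in chunk:
        d2 = sum((f - q) ** 2 for f, q in zip(features, query))
        partial[label] += math.exp(-d2 / (2 * sigma ** 2))
    return list(partial.items())

def reduce_pnn(all_pairs):
    """Reduce step: merge partial sums across mappers and return the
    class with the largest total kernel activation."""
    totals = defaultdict(float)
    for label, value in all_pairs:
        totals[label] += value
    return max(totals, key=totals.get)
```

Because the per-chunk sums combine by simple addition, the computation parallelizes cleanly, which is what makes the PNN a natural first example for the MapReduce setting described above.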

TR: Cardona, K., Secretan, J., Georgiopoulos, M. and Anagnostopoulos G.C. (2007) A Grid Based System for Data Mining Using MapReduce, Technical Report TR-2007-02, The AMALTHEA REU Program, Summer 2007. [PDF]


Title: A Backward Adjusting Strategy for the C4.5 Decision Tree Classifier (2007)
By: Maria Garcia & Jason Beck
Graduate mentor(s): Mingyu Zhong
Faculty mentor(s): Prof. Michael Georgiopoulos
Abstract: In machine learning, decision trees are employed extensively in solving classification problems. In order to produce a decision tree classifier, two main steps need to be followed. The first step is to grow the tree using a set of data, referred to as the training set. The second step is to prune the tree; this step produces a smaller tree with better generalization (smaller error on unseen data). The goal of this project is to incorporate an additional adjustment phase, interjected between the growing and pruning phases of a well-known decision tree classifier, C4.5. This additional step reduces the error rate (improves the generalization of the tree) by adjusting the non-optimal splits created in the growing phase of the C4.5 classifier. As a byproduct of our work, we also discuss how the decision tree produced by C4.5 is affected by changes to its default parameters, such as CF (confidence factor) and MS (minimum number of split-off cases), and emphasize that CF and MS values different from the defaults lead to C4.5 trees of much smaller size and smaller error.
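For context on what the growing phase optimizes: C4.5 chooses splits via an entropy-based criterion. A minimal information-gain sketch for a categorical attribute follows (C4.5 itself uses the gain ratio and handles continuous attributes and missing values, so this is a simplification with illustrative names):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a class-label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain of splitting on a categorical attribute: parent entropy minus
    the size-weighted entropy of each resulting branch."""
    n = len(labels)
    branches = {}
    for v, y in zip(values, labels):
        branches.setdefault(v, []).append(y)
    child = sum(len(ys) / n * entropy(ys) for ys in branches.values())
    return entropy(labels) - child
```

A split chosen greedily by such a criterion can still be non-optimal once its subtree is built, which is the situation the backward adjusting phase described above is designed to correct.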

TR: Beck, J.R., Garcia, M.E., Zhong, M. Georgiopoulos, M., and Anagnostopoulos G.C. (2007) A Backward Adjusting Strategy for the C4.5 Decision Tree Classifier, Technical Report TR-2007-01, The AMALTHEA REU Program, Summer 2007 [PDF]