Author ORCID Identifier

0000-0001-6296-986X

Date of Award

Fall 12-13-2021

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computer Science

First Advisor

Dr. Robert Harrison

Abstract

Information retrieval on graphs has applications in diverse areas, including social network analysis, computer security, and the study of how drug resistance evolves in HIV. However, in each domain, there are relationships between data points that need to be preserved to extract meaningful information. The data sets in each area can be represented as data structures that can be embedded in graphs.

This dissertation applies feature engineering to information retrieval from two very different but very large datasets: genome sequences of HIV protease, used to study the virus’s resistance to HIV drugs, and text data from illicit groups on Telegram, a social network platform. The goal is to demonstrate that new, vital information, such as how HIV protease evolves or whether a given message contains illicit information, does not have to come from high-cost computing methods, and that the resulting information is still comparable to that of more expensive alternatives.

To understand the evolution of HIV under protease inhibitor drugs, this dissertation first feature-engineers HIV protease, a complex protein structure, as a translationally and rotationally invariant sparse vector representation that preserves the relative positions of its amino acids. It then demonstrates the effectiveness of this vector representation by modeling the evolution of HIV protease as a minimal spanning tree. Finally, by examining the branches of the minimal spanning trees that cover these vector representations of HIV protease through time, it draws new and important observations on the resistance of HIV.
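As a rough illustration of the spanning-tree step, the sketch below builds a minimal spanning tree over a handful of sparse vectors. It is only a sketch under stated assumptions: the abstract does not say which distance or which algorithm the dissertation uses, so Euclidean distance and Prim's algorithm are stand-ins, and `toy_vectors` is hypothetical data rather than real protease encodings.

```python
import math


def sparse_distance(u, v):
    """Euclidean distance between sparse vectors stored as {index: value} dicts.

    Assumption: the abstract does not name a metric; Euclidean is a stand-in.
    """
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))


def minimum_spanning_tree(vectors):
    """Prim's algorithm on the complete graph whose nodes are `vectors`.

    Returns (parent_index, child_index, distance) edges in the order added.
    """
    n = len(vectors)
    if n == 0:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known link from the tree to each node
    best_parent = [-1] * n       # tree node that provides that link
    best_dist[0] = 0.0           # start growing the tree from node 0
    edges = []
    for _ in range(n):
        # Pick the node outside the tree that is closest to it.
        node = min((i for i in range(n) if not in_tree[i]),
                   key=lambda i: best_dist[i])
        in_tree[node] = True
        if best_parent[node] != -1:
            edges.append((best_parent[node], node, best_dist[node]))
        # Relax the links from the newly added node to the remaining nodes.
        for other in range(n):
            if not in_tree[other]:
                d = sparse_distance(vectors[node], vectors[other])
                if d < best_dist[other]:
                    best_dist[other] = d
                    best_parent[other] = node
    return edges


# Hypothetical toy vectors; each key marks one active feature.
toy_vectors = [
    {0: 1.0},
    {0: 1.0, 3: 1.0},
    {0: 1.0, 3: 1.0, 7: 1.0},
    {5: 1.0},
]
tree_edges = minimum_spanning_tree(toy_vectors)
```

In this toy case the first three vectors differ by one feature at a time, so they form a chain in the tree, while the unrelated fourth vector attaches as a separate branch.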

To classify a message as illicit or licit, this dissertation first characterizes the nature of illicit texts in the financial fraud domain in terms of the special words used in known licit and illicit groups, which appear in different contexts and hence with different frequencies. It then feature-engineers the texts of these groups as sparse vectors by constructing a new bag of words comprising these informative words. Finally, it demonstrates the effectiveness of these sparse vector representations by applying shallow classifiers that determine which of two given groups a message belongs to.
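The bag-of-words pipeline described above might be sketched as follows. Everything here is an assumption made for illustration: the abstract names neither the word-selection rule nor the classifiers, so a raw count-gap filter and a multinomial naive Bayes model stand in for them, and the messages are invented toy data.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def informative_words(group_a, group_b, min_gap=1):
    """Keep words whose raw counts differ between the two groups.

    Assumption: a simple count-gap rule stands in for the actual selection.
    """
    counts_a = Counter(w for msg in group_a for w in tokenize(msg))
    counts_b = Counter(w for msg in group_b for w in tokenize(msg))
    words = set(counts_a) | set(counts_b)
    return sorted(w for w in words if abs(counts_a[w] - counts_b[w]) >= min_gap)


def to_sparse_vector(text, vocabulary):
    """Map a message to {vocabulary_index: count}, ignoring other words."""
    index = {w: i for i, w in enumerate(vocabulary)}
    return dict(Counter(index[w] for w in tokenize(text) if w in index))


def train_naive_bayes(vectors, labels, vocab_size):
    """Multinomial naive Bayes with add-one smoothing (a stand-in classifier)."""
    class_docs = Counter(labels)
    word_counts = {label: Counter() for label in class_docs}
    for vec, label in zip(vectors, labels):
        word_counts[label].update(vec)  # adds the per-index counts
    model = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        model[label] = {
            "log_prior": math.log(class_docs[label] / len(labels)),
            "log_likelihood": [
                math.log((word_counts[label][i] + 1) / (total + vocab_size))
                for i in range(vocab_size)
            ],
        }
    return model


def predict(model, vector):
    """Return the label with the highest log-posterior score."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            count * params["log_likelihood"][i] for i, count in vector.items()
        )
    return max(model, key=score)


# Invented toy messages, not taken from the dissertation's data.
illicit_msgs = ["selling fresh cards cheap", "fresh cards with balance"]
licit_msgs = ["meeting moved to friday", "lunch on friday with team"]

vocabulary = informative_words(illicit_msgs, licit_msgs)
train_texts = illicit_msgs + licit_msgs
train_labels = ["illicit"] * len(illicit_msgs) + ["licit"] * len(licit_msgs)
train_vectors = [to_sparse_vector(t, vocabulary) for t in train_texts]
model = train_naive_bayes(train_vectors, train_labels, len(vocabulary))
```

A new message is classified by vectorizing it against the same vocabulary, for example `predict(model, to_sparse_vector("cheap cards fresh", vocabulary))`. Words that occur equally often in both groups, such as "with" in the toy data, are dropped from the vocabulary because they carry no signal under this rule.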

Available for download on Wednesday, December 07, 2022
