|Negative selection of written language using character multiset statistics|
|Authors:||Matti Pöllä, Timo Honkela|
|Citation:||Journal of Computer Science and Technology 25 (6): 1256-1266. 2010.|
|Publication type:||Journal article|
|Google Scholar cites:||Citations|
|Added by Wikilit team:||Added on initial load|
|Article:||Google Scholar BASE PubMed|
|Other scholarly wikis:||AcaWiki Brede Wiki WikiPapers|
|Web search:||Bing Google Yahoo! — Google PDF|
We study the combination of symbol frequence analysis and negative selection for anomaly detection of discrete sequences where conventional negative selection algorithms are not practical due to data sparsity. Theoretical analysis on ergodic Markov chains is used to outline the properties of the presented anomaly detection algorithm and to predict the probability of successful detection. Simulations are used to evaluate the detection sensitivity and the resolution of the analysis on both generated artificial data and real-world language data including the English Wikipedia. Simulation results on large reference corpora are used to study the effects of the assumptions made in the theoretical model in comparison to real-world data.
"A common data mining problem is how to locate and analyze changes in the sequences in order to detect either intentional modifications or unintentional corruption in the data.
In the following, we present a theoretical analysis on locating modifications in a symbol sequence using a combination of negative selection and character frequency analysis as a preprocessing stage. Also, we analyze how a na¨ıve first-order statistical language model can be used to approximate the detection probability."
|Theory type:||Design and action|
|Wikipedia coverage:||Sample data|
|Theories:||"In Section 3, we present the
symbol frequency algorithm for anomaly detection in discrete sequences. In Section 4, we present two Markov chain models for analyzing the theoretical detection probabilities."
|Research design:||Mathematical modeling, Simulation, Statistical analysis|
|Data source:||Simulation results, Websites|
|Collected data time dimension:||Longitudinal|
|Unit of analysis:||Edit|
|Wikipedia data extraction:||Live Wikipedia|
|Wikipedia page type:||Article|
"Applying the NSA to sparse data such as written language requires special attention in terms of matching rules and representation of artificial antibodies due to the sparsity of data. This study has introduced a general method based on first-order statistics of character strings to implement a compact negative representation of text documents and for detecting changes in them. The distinctive feature of this method is to apply a special matching rule which compares the statistics of two strings instead of traditional bit-wise matching rules. As a side product this matching rule has the ability to detect deletions, which is a well known pitfall of many NSA algorithms.
We have shown that a reasonably simple Markov model and a prior character probability distribution can be effective in predicting the detection sensitivity of SF-NSA. Further, this model can be used in optimized versions of the detector generating algorithm."
"Edits between successive Article versions"