Last modified on January 30, 2014, at 20:30

Negative selection of written language using character multiset statistics

Authors: Matti Pöllä, Timo Honkela
Citation: Journal of Computer Science and Technology 25 (6): 1256-1266. 2010.
Publication type: Journal article
Peer-reviewed: Yes
Negative selection of written language using character multiset statistics is a publication by Matti Pöllä and Timo Honkela.

Abstract

We study the combination of symbol frequency analysis and negative selection for anomaly detection of discrete sequences where conventional negative selection algorithms are not practical due to data sparsity. Theoretical analysis on ergodic Markov chains is used to outline the properties of the presented anomaly detection algorithm and to predict the probability of successful detection. Simulations are used to evaluate the detection sensitivity and the resolution of the analysis on both generated artificial data and real-world language data including the English Wikipedia. Simulation results on large reference corpora are used to study the effects of the assumptions made in the theoretical model in comparison to real-world data.

Research questions

"A common data mining problem is how to locate and analyze changes in the sequences in order to detect either intentional modifications or unintentional corruption in the data.

In the following, we present a theoretical analysis on locating modifications in a symbol sequence using a combination of negative selection and character frequency analysis as a preprocessing stage. Also, we analyze how a naïve first-order statistical language model can be used to approximate the detection probability."
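The core idea of comparing character statistics rather than positions can be sketched roughly as follows. This is a hypothetical illustration, not the authors' implementation; the function names, the L1 distance, and the zero threshold are all assumptions for the sketch:

```python
from collections import Counter

def char_multiset_distance(a: str, b: str) -> int:
    """L1 distance between the character-count multisets of two strings."""
    ca, cb = Counter(a), Counter(b)
    return sum(abs(ca[ch] - cb[ch]) for ch in set(ca) | set(cb))

def is_anomalous(reference: str, candidate: str, threshold: int = 0) -> bool:
    """Flag the candidate as modified when its character statistics
    deviate from the reference by more than the threshold."""
    return char_multiset_distance(reference, candidate) > threshold

# A substitution changes the character counts even when length is unchanged:
print(is_anomalous("negative selection", "negative se1ection"))  # True
print(is_anomalous("negative selection", "negative selection"))  # False
```

Because the comparison is over counts, no alignment of the two sequences is needed, which is what makes the approach usable as a cheap preprocessing stage.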

Research details

Topics: Data mining
Domains: Computer science
Theory type: Design and action
Wikipedia coverage: Sample data
Theories: "In Section 3, we present the symbol frequency algorithm for anomaly detection in discrete sequences. In Section 4, we present two Markov chain models for analyzing the theoretical detection probabilities."
Research design: Mathematical modeling, Simulation, Statistical analysis
Data source: Simulation results, Websites
Collected data time dimension: Longitudinal
Unit of analysis: Edit
Wikipedia data extraction: Live Wikipedia
Wikipedia page type: Article
Wikipedia language: English

Conclusion

"Applying the NSA to sparse data such as written language requires special attention in terms of matching rules and representation of artificial antibodies due to the sparsity of data. This study has introduced a general method based on first-order statistics of character strings to implement a compact negative representation of text documents and for detecting changes in them. The distinctive feature of this method is to apply a special matching rule which compares the statistics of two strings instead of traditional bit-wise matching rules. As a side product this matching rule has the ability to detect deletions, which is a well known pitfall of many NSA algorithms.

We have shown that a reasonably simple Markov model and a prior character probability distribution can be effective in predicting the detection sensitivity of SF-NSA. Further, this model can be used in optimized versions of the detector generating algorithm."
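The conclusion's point about deletions can be made concrete with a small sketch. Assuming a hypothetical helper (not from the paper), multiset subtraction directly exposes characters that were removed, whereas a positional bit-wise matching rule over fixed-length windows can miss them:

```python
from collections import Counter

def detect_deletion(original: str, edited: str) -> Counter:
    """Characters missing from the edited string relative to the original.
    Counter subtraction keeps only positive counts, so any non-empty
    result indicates at least one deletion."""
    return Counter(original) - Counter(edited)

missing = detect_deletion("character multiset", "caracter multiset")
print(missing)  # Counter({'h': 1})
```

The detection is order-insensitive: it reports *what* was deleted, not *where*, which matches the role of character multiset statistics as a compact representation of the text.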

Comments

"Edits between successive Article versions"

Further notes