Biography
Han-joon Kim received his BS and MS degrees in Computer Science and Statistics from Seoul National University, Seoul, Korea, in 1994 and 1996, respectively. He received his PhD degree in Computer Science and Engineering from Seoul National University in 2002. He is currently a Professor at the School of Electrical and Computer Engineering, University of Seoul, Korea. His current research interests include Data/Text Mining, Database Systems, and Intelligent Information Retrieval.
Abstract
This paper describes a new way of producing more significant word networks from textual data by combining text clustering and keyword association techniques. A crucial aspect of text mining is the analysis of concept relationships, where concepts originate from keywords. The problem is to discover a more reasonable set of keywords and their relationships, called a ‘word network’. In general, word networks can be built from the co-occurrence frequency of indexed words. However, co-occurrence frequency alone is not enough to measure the strength of associations among words, because significant associations with relatively low frequency are ignored. To overcome this problem, we perform the word association task over the clustered results for incoming documents rather than over the whole document collection, since more meaningful word associations are likely to be extracted from clusters than from the entire set of documents. Our proposed method proceeds in two broad steps. First, a given document collection is partitioned into a set of clusters, each of which is represented as a minimum spanning tree by conducting a priori association mining. Each cluster includes a set of documents with similar word occurrence patterns, and thus it is expected to contain cluster-specific words and their strong associations. Next, our method iteratively computes weighted mutual information, which evaluates the degree of significance between two word nodes, and extracts the top-N significant words and their word associations hidden in each cluster.
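The second step can be illustrated with a minimal sketch. The abstract does not define the weighting, so the score used here is an assumption: pointwise mutual information weighted by the joint probability of the two words, with probabilities estimated from document-level occurrence inside one cluster. The function name `top_word_associations` and its parameters are hypothetical, and the clustering and minimum spanning tree steps are not shown.

```python
import math
from collections import Counter
from itertools import combinations


def top_word_associations(cluster_docs, top_n=3):
    """Rank word pairs in one cluster by a weighted mutual information score.

    Assumption (not taken from the abstract):
        score(x, y) = p(x, y) * log2(p(x, y) / (p(x) * p(y)))
    where each probability is the fraction of cluster documents
    containing the word (or both words).
    """
    n_docs = len(cluster_docs)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing each word pair
    for doc in cluster_docs:
        words = sorted(set(doc))  # count each word once per document
        word_df.update(words)
        pair_df.update(combinations(words, 2))

    scores = {}
    for (x, y), joint in pair_df.items():
        p_xy = joint / n_docs
        p_x = word_df[x] / n_docs
        p_y = word_df[y] / n_docs
        scores[(x, y)] = p_xy * math.log2(p_xy / (p_x * p_y))

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
```

Because the score multiplies the log ratio by the joint probability, a pair that co-occurs rarely but almost exclusively together can still rank above a frequent pair of otherwise common words, which is the behaviour the abstract argues raw co-occurrence frequency cannot provide.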
Biography
Wangjun He received his MSc degree in Geographic Information System from the Chinese Academy of Surveying and Mapping in 2012. He received his Bachelor’s degree from Tongji University in 2009. He is now an Assistant Professor at the Chinese Academy of Surveying and Mapping. He has published more than 8 papers in various journals. His research interests are in the areas of Spatial Data Visualization and Government Geographic Information Service.
Abstract
With the growth of geographical data, typical spatial overlay methods for vector data in current GIS platforms are unable to handle voluminous vector data. This paper therefore presents a novel spatial overlay method for vector data based on a distributed memory computing framework. First, following the map-and-reduce principle of distributed computing, the vector data are divided into several grids. This partitions the vector data for parallel computing and avoids unnecessary calculations between spatial objects that are far apart. Second, an STR-tree data structure is constructed in each grid to address the uneven distribution of data within the grid and to improve the efficiency of overlay operations in the same grid. A final comparison between this method and other typical methods shows that it can significantly improve the performance of overlay operations on large-scale vector data.
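The grid partitioning idea can be sketched on a single machine as follows. This is an illustration under stated assumptions, not the authors' implementation: geometries are simplified to axis-aligned bounding boxes `(min_x, min_y, max_x, max_y)`, the distributed framework is replaced by an in-process dictionary keyed by grid cell, and the per-grid STR-tree is replaced by a brute-force scan inside each cell. The names `cells_for`, `intersect`, and `grid_overlay` are hypothetical.

```python
from collections import defaultdict


def cells_for(box, cell_size):
    """Yield the (col, row) grid cells that a bounding box overlaps."""
    min_x, min_y, max_x, max_y = box
    for col in range(int(min_x // cell_size), int(max_x // cell_size) + 1):
        for row in range(int(min_y // cell_size), int(max_y // cell_size) + 1):
            yield (col, row)


def intersect(a, b):
    """Return the overlap of two boxes, or None if they miss or only touch."""
    min_x, min_y = max(a[0], b[0]), max(a[1], b[1])
    max_x, max_y = min(a[2], b[2]), min(a[3], b[3])
    if min_x < max_x and min_y < max_y:
        return (min_x, min_y, max_x, max_y)
    return None


def grid_overlay(layer_a, layer_b, cell_size):
    """Intersect two layers of boxes, comparing only boxes that share a cell."""
    # "Map" step: key every box index by the grid cells its box overlaps.
    grid = defaultdict(lambda: ([], []))
    for i, box in enumerate(layer_a):
        for cell in cells_for(box, cell_size):
            grid[cell][0].append(i)
    for j, box in enumerate(layer_b):
        for cell in cells_for(box, cell_size):
            grid[cell][1].append(j)

    # "Reduce" step: overlay within each cell only. Keying results by (i, j)
    # removes duplicates from boxes that span several cells.
    results = {}
    for ids_a, ids_b in grid.values():
        for i in ids_a:
            for j in ids_b:
                if (i, j) not in results:
                    overlap = intersect(layer_a[i], layer_b[j])
                    if overlap is not None:
                        results[(i, j)] = overlap
    return results
```

Objects in different cells are never compared, which is how the partitioning avoids calculations between distant objects. In a cell that holds many objects, the nested loop is the part a per-grid spatial index such as an STR-tree would replace.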