Scientific Program

Conference Series Ltd invites all the participants across the globe to attend the 2nd International Conference on Big Data Analysis and Data Mining in San Antonio, USA.

Day 2 :

  • Data Mining Applications

Session Introduction

E. Cabral Balreira

Trinity University, USA.

Title: Using an Oracle to improve the Quality of Predictions.
Speaker
Biography:

E. Cabral Balreira is an associate professor of mathematics at Trinity University in San Antonio, Texas. His main area of research is differential topology with applications to global invertibility and discrete dynamics. Motivated by his interest in mathematics and sports, Balreira and his colleagues at Trinity developed a mathematical method, the Oracle, to rank sports teams and predict tournament outcomes. Balreira received his Ph.D. from the University of Notre Dame in 2006.

Abstract:

There are many different approaches to ranking teams in the NFL and NBA, from which one can predict game outcomes and estimate postseason odds. In this project, we are interested in quantifying the accuracy and improving the quality of the predictions given by a particular ranking model. We introduce a method to address the over-confidence of game predictions that increases the quality of predictions as assessed by standard scoring metrics. In addition, we develop a method to better incorporate home-field advantage in a ranking method. We evaluate our predictions over the past 15 years of NFL and NBA games and show that our newly developed ranking method, called the Oracle, consistently outperforms currently available computer models in accuracy and quality of predictions.

Saima Kashif

NED University of Engineering and Technology, Pakistan

Title: Virtual Screening of Natural Compounds that can Inhibit HCV
Speaker
Biography:

Ms. Saima Kashif completed her Master's in Biomedical Engineering, with dissertation, with a CGPA of 3.83 from NED University of Engineering and Technology, Karachi, Pakistan. She is currently working as a Lecturer and research supervisor at the Department of Biomedical Engineering, NED University of Engineering and Technology, Karachi, Pakistan.

Abstract:

Hepatitis C virus (HCV) is highly prevalent in Pakistan, and its infection can lead to chronic liver disease and hepatocellular carcinoma. The currently available treatment for HCV infection is not effective in all patients, has adverse side effects, and is not easily affordable. The aim of this study was to search for natural compounds that can inhibit the replication of HCV. Fluoroquinolones are chemicals known to be potent active compounds that inhibit the replication of the HCV genome by targeting its helicase protein NS3. Using 40 fluoroquinolones as reference molecules, a data set of 4000 natural products bearing structural similarities to fluoroquinolones was screened. From this data set, a Random Forest classifier was used to predict active natural compounds that may have an inhibitory effect against HCV NS3 activity. The classifier builds a set of decision trees using 2080 molecular descriptors (0D, 1D, 2D, 3D, and others) computed for the two data sets, i.e., the training and testing sets. Compounds with an RF score > 0.5 are classified as active against HCV. Using this approach, 147 of the 4000 test molecules were predicted to be active against HCV NS3 helicase. These predicted active compounds can be analyzed further using in silico and in vitro experimental models to discover an effective drug against HCV. The approach described above is useful for discovering new, more potent, and affordable drugs for treating HCV infection.
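The scoring rule in the abstract (RF score = fraction of trees voting "active", compound kept when the score exceeds 0.5) can be sketched as follows. This is only an illustration: the three stub "trees" and the descriptor names below are invented stand-ins for trees learned from the 2080 real molecular descriptors.

```python
# Hedged sketch of the RF screening step: a compound's RF score is the
# fraction of trees voting "active"; compounds with score > 0.5 are kept.
# The stub trees and descriptor names (ring_count, logp, h_bond_donors)
# are hypothetical, not the descriptors used in the study.

def tree_1(d):  # stub decision tree over a descriptor dict d
    return 1 if d["ring_count"] >= 2 else 0

def tree_2(d):
    return 1 if d["logp"] < 3.0 else 0

def tree_3(d):
    return 1 if d["h_bond_donors"] >= 1 else 0

FOREST = [tree_1, tree_2, tree_3]

def rf_score(descriptors):
    """Fraction of trees in the ensemble voting 'active'."""
    votes = [tree(descriptors) for tree in FOREST]
    return sum(votes) / len(FOREST)

def screen(compounds, threshold=0.5):
    """Return names of compounds whose RF score exceeds the threshold."""
    return [name for name, d in compounds.items() if rf_score(d) > threshold]

compounds = {
    "cmpd_A": {"ring_count": 3, "logp": 2.1, "h_bond_donors": 2},  # 3/3 votes
    "cmpd_B": {"ring_count": 1, "logp": 4.5, "h_bond_donors": 0},  # 0/3 votes
}
hits = screen(compounds)  # only cmpd_A clears the 0.5 threshold
```

A production pipeline would train the forest on labeled actives/inactives (e.g. with a machine-learning library) rather than hand-writing the trees; only the thresholding logic is the point here.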

Biography:

Sadaf Naeem is currently working in the Biochemistry Department at the University of Karachi, Pakistan.

Abstract:

Data on the phytochemical constituents of herbs commonly used in traditional Indonesian medicine have been compiled as a database using ChemDBSoft software. This database (the Indonesian Herbal constituents Database, IHD) contains details on 1,242 compounds found in 33 different plants. For each entry, the IHD gives details of chemical structure, trivial and systematic name, CAS registry number, pharmacology (where known), toxicology (LD50), botanical species, the part(s) of the plant(s) where the compounds are found, typical dosage(s), and reference(s). A second database has also been compiled for plant-derived compounds with known activity against the enzyme aldose reductase (AR). This database (the Aldose Reductase Inhibitors Database, ARID) contains the same details as the IHD and currently comprises information on 112 different AR inhibitors. In the search for novel leads active against AR, to provide new forms of symptomatic relief for diabetic patients, virtual screening of all compounds in the IHD has been performed using (a) random forest (RF) modelling and (b) molecular docking. For the RF modelling, three sets of chemical descriptors (Constitutional, RDF, and 3D-MoRSE, computed using the DRAGON software) were employed to classify all compounds in the combined ARID and IHD databases as either active or inactive as AR inhibitors. The resulting RF models (which give misclassification rates of ~10%) were used to identify putative new AR inhibitors in the IHD, with such compounds being identified as those giving a mean RF score > 0.5. Virtual screening of the IHD was also performed using the docking software Molegro Virtual Docker (MVD). In the docking studies reported here, carboxyl-containing IHD compounds were docked into the active sites of 11 crystal structures of AR bound to different carboxyl-containing ligands. The averages of the 11 MVD re-rank scores were calculated as a means to identify anomalous docking results.
In-vitro assays were subsequently performed to determine the inhibitory activity against human recombinant AR for four of the compounds obtained as hits in the RF in-silico screening and for a further four compounds obtained as hits in the docking studies. All of the RF and docking hits were (as predicted) active as AR inhibitors, with IC50s in the micromolar range.

Ahmed M. Jihad al-Kubaisi

Department of Education Fallujah, Iraq

Title: Web Mapping Applications
Biography:

He holds a PhD from Tikrit University, Iraq (Faculty of Education, Department of Geography), specializing in maps and geographic information systems. He has been a member of the Iraqi Geographic Society since 2011, is Director of the Idrisi Center for Geographic Techniques in Fallujah, Iraq, and has been a member of the Union of Iraqi Writers since 2004. He holds the IC3 global certification in computers and information technology. He has published 4 research papers in Iraqi journals, 3 in Arabic journals, and 4 in international journals, 2 of them in journals with an impact factor, and has published 2 books, with 3 more in print.

Abstract:

This paper presents how to prepare an advanced database and employ web maps available on the Internet through the ArcGIS Online application, taking the city of Kirkuk as a case study. Designing effective maps on the web allows data to be displayed in a way that encourages the user to make decisions. The application was used to prepare land-use maps (Health, Education, Recreational, Other, Open Land) for the study area. Relying on open-source data, spatial layers for each land use were drawn individually using the spatial symbols available in the application interface and then assembled into a final composite map, which can be shared through other media and programs such as Google Earth and made available to decision-makers. The research aims to use web maps available on the Internet through the ArcGIS Online application, applied to the city of Kirkuk (the military district) as a case study; interactive maps are perhaps among the most sophisticated tools for presenting geographic data effectively. The research follows a practical working method, using the geographic and spatial data available on the Internet to prepare the land-use map of the study area, drawing and symbolizing its classes down to an interactive map that can be shared and disseminated.

Speaker
Biography:

Archana Shukla received her Ph.D. in Computer Science from Motilal Nehru National Institute of Technology, Allahabad. She is currently working as a Visiting Faculty member in the Computer Science Department.

Abstract:

  • Big Data Applications
Location: USA

Session Introduction

Alex V. Vasenkov

Multi Scale Solutions, USA

Title: Big Data for Research and Development
Speaker
Biography:

Dr. Vasenkov received a Ph.D. degree in Physics and Mathematics from the Russian Academy of Sciences in 1996. His research has been funded through peer-reviewed grants and contracts from major federal agencies, including the DOE, NSF, and DoD, and from private companies such as the Samsung Advanced Institute of Technology. In 2013, he co-founded a small-business company, Multi Scale Solutions Inc., focusing on the development and commercialization of scientific business intelligence software. Dr. Vasenkov is executive editor of the Journal of Nanomedicine & Nanotechnology. He has authored or co-authored over 30 publications in peer-reviewed journals, 1 book chapter, and over 50 presentations at leading scientific conferences.

Abstract:

This talk will focus on Big Data for Research and Development (R&D). There are several definitions of Big Data, which creates confusion about the subject. There is even more confusion about synthetic big data, which can be defined as a collection of research articles, Ph.D. theses, patents, test reports, and product description reports. Such data have the emerging attributes of high volume, high velocity, high variety, and veracity, which make the analysis of synthetic data difficult. There is an emergent need for a framework that can synergistically integrate search, or information retrieval (IR), with information extraction (IE). Traditional IR-based text searching can be used for quick exploration of large collections of synthetic data. However, this approach is incapable of finding specific R&D concepts in such collections and establishing connections between these concepts. Also, IR models lack the ability to learn concepts and the relationships between them. In contrast, IE models are too specific and typically require customization for a domain of interest. A novel framework will be presented and its feasibility for mining synthetic data will be shown. It was found possible to partially or fully automate the analysis of synthetic data to find labeled information and connect concepts. The present framework can help individuals identify non-obvious solutions to R&D problems, serve as an input for innovation, or categorize prior art relevant to a technological concept or a patent application in question.

Speaker
Biography:

Pengchu Zhang has more than ten years of experience in computer modeling/simulation, machine learning, data mining, and unstructured data analysis at Sandia National Laboratories. His recent interest is in developing and applying Deep Learning technologies to enterprise knowledge sharing and management.

Abstract:

A significant problem that reduces the effectiveness of enterprise search is query terms that do not exist in the enterprise data. Consequently, enterprise search either generates no results, or the answers match the exact query terms without taking related terms into account. This results in a high rate of false positives in terms of information relevance. Recent developments in neural language models (NLMs), specifically the word2vec model initiated by Google researchers, have drawn a great deal of attention in the last two years. This model uses multiple layers of neural networks to represent words in vector spaces. The vector representation of words carries both semantic and syntactic meaning: terms with semantic similarities are close together in the vector space, as measured by their Euclidean distances. Enterprise search may utilize these "contextual" relationships between words to intelligently increase the breadth and quality of search results. Application of the NLM in our enterprise search promises to significantly improve the findability and relevance of returned information. We expand the query term(s) into a set of related terms using term vectors trained on corporate data repositories as well as on Wikipedia. The expanded set of terms is used to search the indexed enterprise data, so the most relevant data rises in ranking, including documents that may not contain the original query terms. In this presentation, we will also discuss the potential and limitations of applying NLMs in search and other aspects of enterprise knowledge management.
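The query-expansion step described above can be illustrated with a toy example: each term is a vector, and a query term is expanded with its nearest neighbours by Euclidean distance. The 3-dimensional vectors below are invented for illustration; real word2vec embeddings have hundreds of dimensions and would be trained on the enterprise corpus.

```python
import math

# Hedged sketch of word-vector query expansion: expand a query term with
# its k nearest neighbours in the embedding space (Euclidean distance, as
# in the abstract).  All vectors here are invented toy values.

VECTORS = {
    "laptop":   (0.9, 0.1, 0.0),
    "notebook": (0.8, 0.2, 0.1),
    "computer": (0.7, 0.3, 0.0),
    "banana":   (0.0, 0.9, 0.8),
}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def expand_query(term, k=2):
    """Return the query term plus its k nearest neighbours in vector space."""
    q = VECTORS[term]
    others = [t for t in VECTORS if t != term]
    others.sort(key=lambda t: euclidean(q, VECTORS[t]))
    return [term] + others[:k]

expanded = expand_query("laptop")  # picks up "notebook" and "computer"
```

The expanded term list would then be issued against the enterprise index, so documents mentioning only "notebook" still rank for a "laptop" query.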

Speaker
Biography:

Ibrahim Abaker Targio Hashem is currently a Ph.D. candidate at the Department of Computer Systems, UM. He has been working on big data since 2013; his article on big data became the most downloaded paper of 2014 in the journal Information Systems (Elsevier). He has experience configuring Hadoop MapReduce in a multi-node cluster. His main research interests include big data, cloud computing, distributed computing, and networking.

Abstract:

Over the past few years, the continuous increase in computational capacity has produced an overwhelming flow of data, or big data, which exceeds the capabilities of conventional processing tools. Big data offers a new era in data exploration and utilization. The major enabler underlying many big data platforms is certainly the MapReduce computational paradigm. MapReduce is recognized as a popular programming model for the distributed and scalable processing of big data and is increasingly being used in different applications, mostly because of its important features, which include scalability, flexibility, ease of programming, and fault tolerance. Scheduling tasks in MapReduce across multiple nodes has been shown to be a multi-objective optimization problem. The problem is made even more complex by the use of virtualized clusters in cloud computing to execute a large number of tasks. The complexity lies in achieving multiple objectives that may be of a conflicting nature. For instance, scheduling tasks may require several tradeoffs among job performance, data locality, fairness, resource utilization, network congestion, and reliability. These conflicting requirements and goals are challenging to optimize due to the difficulty of predicting a new incoming job's behavior and its completion time. To address this complication, we introduce a multi-objective approach using genetic algorithms. The goal is to minimize two objectives: the execution time and the budget of each node executing the task in the cloud. The contribution of this research is a novel adaptive model that communicates with the task scheduler of the resource manager. The proposed model periodically queries for resource-consumption data and uses it to calculate how resources should be allocated to each task; it then passes this information to the task scheduler by adjusting task assignments to nodes accordingly. The model is evaluated in a scheduling-load simulator.
PingER, an Internet end-to-end performance measurement project, was chosen for the performance analysis of the model. We believe this proposed solution is timely and innovative, as it provides robust resource management with which users can perform better scheduling for big data processing in a seamless manner.
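The genetic-algorithm idea in the abstract (evolve task-to-node assignments to minimize both execution time and cost) can be sketched as below. This is not the authors' model: the task sizes, node speeds, and per-unit costs are invented, and the two objectives are combined as a simple weighted sum rather than a full Pareto treatment.

```python
import random

# Toy GA for multi-objective MapReduce-style scheduling: an individual is
# a list assigning each task to a node; fitness is a weighted sum of the
# two objectives named in the abstract (makespan and budget).  All numbers
# are invented for illustration.

random.seed(0)  # deterministic for reproducibility

TASKS = [4, 2, 6, 3]                      # task sizes (arbitrary units)
NODES = [                                 # per-unit (speed, cost) of each node
    {"speed": 1.0, "cost": 3.0},          # slow but expensive node
    {"speed": 2.0, "cost": 1.0},          # fast, cheap node
]

def objectives(assignment):
    """Return (makespan, total cost) for a task->node assignment."""
    node_time = [0.0] * len(NODES)
    cost = 0.0
    for size, n in zip(TASKS, assignment):
        node_time[n] += size / NODES[n]["speed"]
        cost += size * NODES[n]["cost"]
    return max(node_time), cost

def fitness(assignment, w=0.5):
    t, c = objectives(assignment)
    return w * t + (1 - w) * c            # lower is better

def evolve(generations=50, pop_size=20):
    pop = [[random.randrange(len(NODES)) for _ in TASKS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        survivors = pop[: pop_size // 2]  # elitism: keep the best half
        children = []
        for _ in range(pop_size - len(survivors)):
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, len(TASKS))
            child = a[:cut] + b[cut:]                 # one-point crossover
            if random.random() < 0.2:                 # point mutation
                child[random.randrange(len(TASKS))] = random.randrange(len(NODES))
            children.append(child)
        pop = survivors + children
    return min(pop, key=fitness)

best = evolve()
best_time, best_cost = objectives(best)
```

Because the best individual survives every generation, the returned schedule is never worse than scheduling everything on the slow, expensive node.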

David Sung

KyungHee University School of Management Graduate School, South Korea

Title: The Effects of Google Trends on Tourism Industry in South Korea
Speaker
Biography:

David Sung is a graduate student in the Master's Program for Information Management at Kyunghee University, Seoul, Korea. He has a B.A. in Economics and Applied Mathematics from Konkuk University. Sang-Hyun Park is also a graduate student in the Master's Program for Information Management at Kyunghee University, Seoul, Korea; he has a B.A. from the Department of Applied Organic Materials Engineering at Inha University.

Abstract:

This study deals with the effects of an online search engine on the level of tourism by analyzing changes in the amount of searched information. The purpose of the paper is to improve forecasts for the tourism industry in South Korea by utilizing Google Trends data, which are provided as relative query volumes in time-series format. This work analyzes Google Trends for three specific keywords, "hotels", "flights", and "tour", as searched by potential tourists around the world travelling to South Korea. We conduct the study through a series of multiple regression models with a one-month time lag in order to forecast performance. The findings suggest that Google Trends can be a good source for estimating tourism activity. Therefore, business strategists in the South Korean tourism industry can readily utilize Google Trends data in their future decisions. This work was supported by the National Research Foundation of Korea Grant funded by the Korean Government (NRF-20141823).
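The one-month-lag regression design described above can be sketched as follows: arrivals in month t are regressed on the search index of month t-1. This is a single-keyword, single-predictor illustration with invented numbers, not the study's actual multi-keyword model or real Korean tourism data.

```python
# Hedged sketch of a lagged regression: regress month-t arrivals on the
# month-(t-1) Google Trends index, then forecast next month.  The index
# values and arrival figures below are invented for illustration.

trends = [40, 45, 50, 55, 60, 65]         # search index, months 1..6
arrivals = [90, 90, 100, 110, 120, 130]   # tourist arrivals, months 1..6

def ols(x, y):
    """Ordinary least squares in closed form: return (slope, intercept)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

x = trends[:-1]            # predictor: last month's search volume
y = arrivals[1:]           # response: this month's arrivals
slope, intercept = ols(x, y)

# Forecast month 7 from the month-6 search index.
forecast_next = slope * trends[-1] + intercept
```

The full study would add the other keywords as extra regressors (multiple regression) and assess fit before trusting the forecast.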

  • Artificial Intelligence
Location: USA

Session Introduction

Mohamed M. Mostafa

Gulf University for Science and Technology, Kuwait

Title: Oil price forecasting using gene expression programming and artificial neural networks
Speaker
Biography:

Mohamed M. Mostafa received a PhD in Business from the University of Manchester, UK. He has also earned an MS in Applied Statistics from the University of Northern Colorado, USA, an MA in Social Science Data Analysis from Essex University, UK, an MSc in Functional Neuroimaging from Brunel University, UK, and an MBA and a BSc from Port Said University, Egypt. He has published over 60 research papers in several leading peer-reviewed academic journals.

Abstract:


Speaker
Biography:

SookYoung Son is a staff member at Hyundai Heavy Industries, the world's largest shipbuilding company. She completed her MS in Management Engineering at the Ulsan National Institute of Science and Technology (UNIST). Her research interests include process mining, manufacturing process analysis, data mining, and supply chain management.

Abstract:

Construction of an offshore plant is a huge project that involves many activities and resources. Furthermore, multiple projects are usually carried out simultaneously. For these reasons, it is difficult to manage offshore plant manufacturing processes. To improve current processes, it is important to analyze them and find problems while a project is ongoing, yet such analysis usually requires a few months to perform. To shorten the analysis period, this study proposes a method of analyzing offshore plant manufacturing processes using process mining. Process mining is a technique that derives useful process-related information from the event logs in information systems: it generates a process model from the event logs and measures the performance of processes, tasks, and resources. In the proposed method, a process model can be generated from the data of several simultaneous projects. The generated process model is used to analyze the overall processes in a company and can then be compared to the planned process model to find delays. Moreover, workload analysis can be conducted in terms of resource capacity: the actual workload of each department can be calculated and compared to its capacity. To verify the proposed process mining method, a case study was conducted in which the method was applied to analyze the offshore plant construction process at a heavy-industry company in Korea.
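The discovery step in the abstract (generating a process model from event logs) can be illustrated with the simplest kind of discovered model, a directly-follows graph. The case IDs and activity names below are invented; real logs would come from the shipyard's information systems and carry timestamps.

```python
from collections import defaultdict

# Minimal process-mining sketch: turn an event log of (case id, activity)
# pairs, assumed time-ordered within each case, into a directly-follows
# graph counting how often one activity immediately follows another.
# Activities and cases are invented for illustration.

event_log = [
    ("case1", "cut"), ("case1", "weld"), ("case1", "paint"),
    ("case2", "cut"), ("case2", "paint"),
]

def discover_model(log):
    """Return {(a, b): count} of direct successions across all cases."""
    traces = defaultdict(list)
    for case, activity in log:
        traces[case].append(activity)
    follows = defaultdict(int)
    for trace in traces.values():
        for a, b in zip(trace, trace[1:]):
            follows[(a, b)] += 1
    return dict(follows)

model = discover_model(event_log)
```

Comparing such a discovered model against the planned process model is what exposes skipped steps and delays in the ongoing projects.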

Ch. Mallikarjuna Rao

Gokaraju Rangaraju Institute of Engineering and Technology, India

Title: Segmentation Based Clustering for Knowledge Discovery from Spatial Data
Speaker
Biography:

Ch. Mallikarjuna Rao received his B.Tech degree in Computer Science and Engineering from Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, Maharashtra, India in 1998, and his M.Tech degree in Computer Science and Engineering from JNTU Anantapur, Andhra Pradesh, India in 2007. He is currently pursuing his Ph.D. degree at JNTU Anantapur, Andhra Pradesh. He is working as an Associate Professor in the Department of Computer Science and Engineering of Gokaraju Rangaraju Institute of Engineering and Technology, Hyderabad, Telangana state, India. His research interests include databases and data mining.

Abstract:

Global Positioning Systems (GPS) and other data acquisition systems have been collecting huge amounts of geographical data, and the volume is growing exponentially. Mining such data can extract unknown and latent information from spatial datasets that are characterized by complexity, dimensionality, and large size; however, doing so is challenging. Toward this end, geographical knowledge discovery through spatial data mining has emerged as an attractive field that provides methods to support useful applications. Remote sensing imagery is a rich source of geographical data, and analyzing it can provide actionable knowledge for making strategic decisions. In this paper we propose a methodology for performing clustering on remote sensing images. The data sets were collected through the World Wind application provided by NASA; the images have the .TIF extension. The methodology includes feature extraction, training, building a classifier, and cluster analysis. We built a prototype application that demonstrates the proof of concept. The implementation uses native method support from Fiji and Weka to realize the proposed methodology. The empirical results reveal that the spatial clustering is performed with high accuracy.
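The cluster-analysis stage described above can be sketched with a tiny k-means loop over pixel feature vectors (e.g. band intensities extracted from the .TIF imagery). The feature values and the choice of two clusters are invented for illustration; the paper's pipeline uses Fiji and Weka rather than a hand-rolled loop.

```python
# Hedged sketch of k-means clustering over pixel feature vectors, standing
# in for the cluster-analysis stage of the proposed methodology.  Feature
# values and starting centers are invented toy data.

pixels = [(0.1, 0.2), (0.15, 0.25), (0.9, 0.8), (0.85, 0.9)]

def kmeans(points, centers, iters=10):
    """Alternate assignment and center-update steps; return final state."""
    clusters = [[] for _ in centers]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            # assign each point to its nearest center (squared distance)
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        # move each center to the mean of its cluster (keep it if empty)
        centers = [
            tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
    return centers, clusters

centers, clusters = kmeans(pixels, centers=[(0.0, 0.0), (1.0, 1.0)])
```

With these toy features the two dark pixels and the two bright pixels separate cleanly into the two clusters, which is the behavior the land-cover clustering relies on at scale.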

Mallikarjuna Rao G.

Gokaraju Rangaraju Institute of Engineering and Technology, India

Title: Mining of Facial Features for Gender and Expression Classification
Speaker
Biography:

Mallikarjuna Rao G. attained his B.Tech degree in Electronics and Communication from Nagarjuna University in 1988. He received his first postgraduate degree, an M.E. in Digital Systems, from Osmania University in 1992. His second postgraduate degree, an M.Tech in Computer Science and Engineering, was obtained from Jawaharlal Nehru Technological University, Hyderabad in 2002. He is pursuing a Ph.D. from JNTUH. His research interests are pattern recognition, parallel computing, and artificial neural networks. Professor Rao has 12 publications in various international journals and conferences. He established the Parallel Computing and Operating Systems Lab at GRIET under the MODROB scheme. He proposed a scalable and portable feature extraction technique, the Local Active Pixel Pattern (LAPP).

Abstract:

Facial features if properly analyzed /mined are capable of providing information about the gender classification, expression classification and age approximation. Each one of the task is challenging for the researchers due to the involved computational effort and dynamism associated with environmental conditions of capturing domain. The recent trends in machine based human facial image processing are concerned with the topics about predicting age, classifying gender, race, and expressions from facial images, and so on. However, machine based approaches have not yet captured intelligence of mining process involved in human brain. More over researchers attempted each one as a separate problem and suggested data dependent solutions. Further the algorithms have not exploring available parallel computational power existing machines multi-core machines. Hence, in this paper we would like to propose unified scalable approach based on artificial neural network with a new pre-processing feature extraction technique, local active pixel pattern, LAPP. Intelligence extracted from pixel level mining is induced in the machine so that after training it is capable of classifying the mood of the person. Heterogeneity of excremental data sets have truly demonstrated the significance of our approach. FERET facial database has been used for gender classification , YALE, JAFFE, Face expression and CMU AMP Face Expression Databases have been used for expression classification and FGNET AGE data set has been used for age approximation.Results are encouraging with more the 90% accuracy.