Scientific Program| Data Mining Conferences | Big Data Conferences| USA | Asia Pacific | Middle East | 2017 |Big Data Analytics Events| Big Data Europe

Conference Series Ltd invites all the participants across the globe to attend 4th International Conference on BigData Analysis and Data Mining Paris, France.

Submit your Abstract

or e-mail to

events@conferenceseries.com
datamining@annualconferences.org
dataanalysis@conferenceseries.com

Day 2 : September 8, 2017

Keynote Forum

Mikhail Moshkov

King Abdullah University of Science and Technology (KAUST), Saudi Arabia

Keynote: Extensions of Dynamic Programming: Applications for Decision Trees

Time : 09:30-10:00

Conference Series Data Mining 2017 International Conference Keynote Speaker Mikhail Moshkov photo

Biography:

Mikhail Moshkov is professor in the CEMSE Division at King Abdullah University of Science and Technology, Saudi Arabia since October 1, 2008. He earned master’s degree from Nizhni Novgorod State University, received his doctorate from Saratov State University, and habilitation from Moscow State University. From 1977 to 2004, Dr. Moshkov was with Nizhni Novgorod State University. Since 2003 he worked in Poland in the Institute of Computer Science, University of Silesia, and since 2006 also in the Katowice Institute of Information Technologies. His main areas of research are complexity of algorithms, combinatorial optimization, and machine learning. Dr. Moshkov is author or coauthor of five research monographs published by Springer.

Abstract:

In the presentation, we consider extensions of dynamic programming approach to the investigation of decision trees as algorithms for problem solving, as a way for knowledge extraction and representation, and as classifiers which, for a new object given by values of conditional attributes, define a value of the decision attribute. These extensions allow us (i) to describe the set of optimal decision trees, (ii) to count the number of these trees, (iii) to make sequential optimization of decision trees relative to different criteria, (iv) to find the set of Pareto optimal points for two criteria, and (v) to describe relationships between two criteria. The applications include the minimization of average depth for decision trees sorting eight elements (this question was open since 1968), improvement of upper bounds on the depth of decision trees for diagnosis of 0-1-faults in read-once combinatorial circuits over monotone basis, existence of totally optimal (with minimum depth and minimum number of nodes) decision trees for Boolean functions, study of time-memory tradeoff for decision trees for corner point detection, study of relationships between number and maximum length of decision rules derived from decision trees, study of accuracy-size tradeoff for decision trees which allows us to construct enough small and accurate decision trees for knowledge representation, and decision trees that, as classifiers, outperform often decision trees constructed by CART. The end of the presentation is devoted to the introduction to KAUST.

Keynote Forum

Fuad Aleskerov

National Research University Higher School of Economics,Russia

Keynote: Effective choice and ranking of alternatives in search and recommendation problems

Time : 10:00-10:30

Conference Series Data Mining 2017 International Conference Keynote Speaker Fuad Aleskerov photo

Biography:

Professor Fuad Aleskerov is a leading scientist in mathematics and multicriterial choice and decision making theory. Fuad Aleskerov is the Head of the International Laboratory of Decision Choice and Analysis and the Head of the Department of Mathematics for Economics of the National Research University Higher School of Economics (Moscow, Russia). He has published 10 books, many articles in leading academic journals. He is a member of several scientific societies, member of editorial boards of several journals, founder and head of many conferences and workshops. He has been an invited speaker on numerous international conferences, workshops, and seminars.

Abstract:

The problem of the high computational complexity of most accurate algorithms in search, rank, and recommendation applications is critical when we deal with large datasets. Even the quadratic complexity may be unadmissible. Thus, the task is to develop efficient algorithms by consistent reduction of information and by the use of linear algorithms on the first steps.

The problem of whether functions of several variables can be expressed as superposition of functions of fewer variables was firstly formulated by Hilbert in 1900 as the Hilbert’s thirteens problem. The answer to this general question for the class of continuous functions was given in 1957 by Arnold and Kolmogorov. For the class of choice functions this matter was studied only by our team.

A new effective method for search, ranking, and recommendation problems in large datasets is proposed based on superposition of choice functions. The developed algorithms have low computational complexity so they can be applied on big data. One of the main features of the method is the ability to identify the set of efficient options when one deals with large number of options or criteria. Another feature of the method is the ability to adjust its computational complexity. The application of the developed algorithms to the Microsoft LETOR dataset showed 35% higher efficiency comparing to the standard techniques (for instance, SVM).

The proposed methods can be applied, for instance, for the selection of effective options in search and recommendation systems, decision support systems, Internet networks, traffic classification systems and other relevant fields.

Keynote Forum

Omar M Knio

King Abdullah University of Science and Technology, Saudi Arabia

Keynote: Data enabled approaches to sensitivity analysis, calibration and risk visualization in general circulation models

Time : 10:50-11:20

Conference Series Data Mining 2017 International Conference Keynote Speaker Omar M Knio photo

Biography:

Omar M Knio completed his PhD from MIT in 1990. He held a Post-doctoral position at MIT, before joining the Mechanical Engineering Faculty at Johns Hopkins University in 1991. In 2011, he joined Mechanical Engineering and Materials Science Department at Duke University. In 2013, he joined AMCS Program at KAUST, where he served as Deputy Director of the SRI Center for Uncertainty Quantification in Computational Science and Engineering. He has co-authored over 100 journal papers and two books.

Abstract:

This talk discusses the exploitation of large databases of model realizations for assessing model sensitivities to uncertain inputs and for calibrating physical parameters. Attention is focused on databases of individual realizations of ocean general circulation model, built through efficient sampling approaches. Attention is then focused on the use of sampling schemes to build suitable representations of the dependence of the model response on uncertain input data. Non-intrusive spectral projections and regularized regressions are used for this purpose. Bayesian inference formalism is then applied to update the uncertain inputs based on available measurements or observations. We illustrate the implementation of these techniques through extreme-scale applications, including inference physical parametrizations and quantitative assessment and visualization of forecast uncertainties.

Plenary Session

Session Introduction

Fairouz Kamareddine

Heriot-Watt University, UK

Title: Computerizing mathematical text, debugging proof assistants and object versus metal level

Biography:

Fairouz Kamareddine has been involved in a number of worldwide consultancy assignments, especially on education and research issues working for United Nations and the EU. Her research interest includes Interface of Mathematics, Logic and Computer Science. She has held numerous invited positions at universities worldwide. She has well over 100 published articles and a number of books. She has played leading roles in the development, administration and implementation of interdisciplinary, international, and academic collaborations and networks.

Abstract:

Mathematical texts can be computerized in many ways that capture differing amounts of the mathematical meaning. At one end, there is document imaging, which captures the arrangement of black marks on paper, while at the other end there are proof assistants (e.g., Mizar, Isabelle, Coq, etc.), which capture the full mathematical meaning and have proofs expressed in a formal foundation of mathematics. In between, there are computer type setting systems (e.g., LATEX and presentation MathML) and semantically oriented systems (e.g., Content MathML, OpenMath, OMDoc, etc.). In this talk, we advocate a style of computerization of mathematical texts which is flexible enough to connect the different approaches to computerization, which allows various degrees of formalization, and which is compatible with different logical frameworks (e.g., set theory, category theory, type theory, etc.) and proof systems. The basic idea is to allow a man-machine collaboration which weaves human input with machine computation at every step in the way. We propose that the huge step from informal mathematics to fully formalized mathematics be divided into smaller steps, each of which is a fully developed method in which human input is minimal. We also propose a method based on constraint generation and solving to debug proof assistants written in SML and reflect on the need and limitation of mixing the object and meta-level.

Bennett B. Borden

Drinker Biddle & Reath, Washington, DC

Title: Big Data Winners and Losers: The Ethics of Algorithms

Biography:

Bennett B. Borden is a partner at Drinker Biddle & Reath and its Chief Data Scientist, the only Chief Data Scientist who is also a practicing attorney. Bennett is a globally recognized authority on the legal, technology and policy implications of information. Bennett’s ground-breaking research into the use of machine-based learning and unstructured data for organizational insight is now being put to work in data-driven early warning systems for clients to detect and prevent corporate fraud and other misconduct. Bennett received his Masters of Science in Data Analytics at New York University and his JD from Georgetown University.

Abstract:

Analytic models are playing an increasing role in the development, delivery and availability of goods and services. Who gets access to what goods or services and at what price are increasingly influenced by algorithms. This may not matter when we’re talking about a $0.25 coupon for a candy bar, but what about public goods and services like education, healthcare, and energy distribution? What about predicting who will get a job or how we will police our society? In this session, we will explore the socioeconomic impact of algorithms, the ethics of big data, and how to work ethics into our analytics projects.

Sumit Pal

Independent Consultant (Architect, Developer) Big Data & Data Science, USA

Title: SQL On Big Data â€“ Technology, Architecture and Innovations

Biography:

Sumit has more than 22 years of experience in the Software Industry in various roles spanning companies from startups to enterprises.He is a big data, visualisation and data science consultant and a software architect and big data enthusiast and builds end-to-end data-driven analytic systems.Sumit has worked for Microsoft (SQL server development team), Oracle (OLAP development team) and Verizon (Big Data analytics team) in a career spanning 22 years.Currently, he works for multiple clients advising them on their data architectures and big data solutions and does hands on coding with Spark, Scala, Java and Python.Sumit has spoken at Big Data Conferences in Boston, Chicago, Las Vegas and Vancouver.He has extensive experience in building scalable systems across the stack from middletier, data tier to visualization for analytics applications, using BigData, NoSQL DB. Sumit has deep expertise in DataBase Internals, Data Warehouses, Dimensional Modeling, Data Science with Scala, Java and Python and SQL.Sumit started his career being part of SQLServer Development Team at Microsoft in 1996-97 and then as a Core Server Engineer for Oracle Corporation at their OLAP Development team in Boston, MA, USA.Sumit has also worked at Verizon as an Associate Director for Big Data Architecture, where he strategized, managed, architected and developed platforms and solutions for analytics and machine learning applications.Sumit has also served as Chief Architect at ModelN/LeapfrogRX (2006-2013)- where he architected the middle tier core Analytics Platform with open source olap engine (Mondrian) on J2EE and solved some complex Dimensional ETL, Modelling and performance optimization problems.

Abstract:

With the rapid adoption of Hadoop in the enterprise it has become all the more important to build SQL Engines on Hadoop for all kinds of workloads for almost all kind of end users and use cases. From low latency analytics based SQL to ACID based semantics on Hadoop for Operational Systems, to SQL for handling unstructured and streaming data, SQL is fast becoming the ligua-franca in the big data world too. The talk focuses on the exciting tools, technologies and innovations and their underlying architectures and the exciting road ahead in this space. This is a fiercely competitive landscape with vendors and innovators trying to capture mindshare and piece of the pie – with a whole suite of innovations like – index based SQL solutions in Hadoop to OLAP with Apache Kylin and Tajo to BlinkDB and MapD.

Topics :

Why SQL on Hadoop
Challenges of SQL on Hadoop
SQL on Hadoop Architectures for Low Latency Analytics ( Drill, Impala, Presto, SparkSQL, JethroData)
SQL on Hadoop Architecture for Semi-Structured Data
SQL on Hadoop Architecture for Streaming Data and Operational Analytics
Innovations ( OLAP on Hadoop, Probabilistic SQL Engines, GPU Based SQL Solutions )

Bertrand Hassani

University College London, UK

Title: Regulatory learning; what framework can be proposed to ensure that the financial environment can be controlled?

Biography:

Bertrand Hassani is an Associate Researcher at Paris 1 University and University College London. He wrote several articles dealing with Risk Measures, Data Combination, Scenario Analysis and Data Science. He was the Global Head of Research and Innovations of Santander Group for the risk division and is now the Chief Data Scientist of Capgemini Consulting. In this role, he aims at developing novel approaches to measure risk (financial and non-financial), to know the customers, to improve predictive analytics for supply chain (among others) relying on methodologies coming from the field of data science (data mining, machine learning, A.I., etc.).

Abstract:

The arrival of big data strategies is threatening the latest trends in financial regulation related to the simplification of models and the enhancement of comparability of approaches chosen by the various entities. Indeed, the intrinsic dynamic philosophy of big data strategies is almost incompatible with the current legal and regulatory framework in particular the one related to model risk governance. Besides, as presented in our application to credit scoring, the model selection may also evolve dynamically forcing both practitioners and regulators to develop libraries of models, strategies allowing to switch from one to the other and supervising approaches allowing financial institutions to innovate in a risk mitigated environment. The purpose of this paper is therefore to analyse the issues related to the big data environment and in particular to machine learning models and to propose solutions to regulators, analyzing the features of each and every algorithm implemented, for instance a logistic regression, a support vector machine, a neural network, a random forest and gradient boosting.

Pierre Hansen

GERAD and HEC MontrÃ©al,Canada

Title: Two applications of Variable Neighborhood Search in Data Mining

Biography:

Pierre Hansen is professor of Operations Research in the department of decision sciences of HEC Montréal. His research is focussed on combinatorial optimization, metaheuristics and graph theory. With Nenad Mladenovic, he has developed the Variable Neighborhood Search metaheuristic, a general framework for building heuristics for a variety of combinatorial optimization and graph theory problems. Pierre Hansen has received the EURO Gold Medal in 1986 as well as other prizes. He is a member of the Royal Society of Canada and the author or co-author of close to 400 scientific papers. His first paper on VNS was cited almost 3000 times.

Abstract:

Many problems can be expreesed as global or combinatorial optimization problems, however due the vast increase in the availability of data bases, realistically sized instances cannot be solved in reasonnable time. Therfore, one must be often content with approximate solutions obtained by heuristics. These heuristics can be studied systematically by some general frameworks or metaheuristics (genetic search, tabu search, simulated annealing, neuron networks, ant colonies and others). Variable Neighborhood Search (VNS) proceeds by systematic change of neighborhoods bith in the descent phase towards a local minimum and in a perturbation phase to get out of the corresponding valley. VNS heuristics have been developed for many classical problems such as TSP, quadratic assignment, p-median, and others. Instances of the latter problem with 89600 entities in the Euclidean plane have been solved with an ex-post error not larger than 3%.

In the last 2 decades, several discovery systems for graph theory have been proposed (Cvetkovic's Graph; Fajtlowicz's Graffiti; Caporossi and Hansen's AutoGraphiX (AGX)). AGX uses VNS to find systematically extremal graphs for many conjectures judged to be of interest. Aouchiche systematically studied relations between 20 graph invariants taken by pairs, considering the four basic operations (-, +, /, x). Conjectures were found in about 700 out of 1520 cases. A majority of these 1520 cases were easy and solved automatically by AGX. Examination of the extremal graphs found suggests open conjectures to be solved by graph theoretical means. This led to several tens of papers by various authors mainly from Serbia and from China.

Parameshwar P Iyer

Indian Institute of Science, India

Title: From knowledge to wealth: Intellectual property and technology licensing

Biography:

Parameshwar P Iyer holds a Bachelore degree from IIT Kharagpur, Master degree from University of Illinois, and a Doctorate from University of California. He has over 43 years of teaching, research, and consultancy experience in Engineering and Technology Management. He has held several important academic and administrative positions, including being the Founder Professor of an IIM in India. He has over 50 publications and four books to his credit. He has received several professional awards, including a prestigious Fellowship from Unversity of Illinois. Currently, he is a Principal Research Scientist at Indian Institute of Science.

Abstract:

The Office of Intellectual Property and Technology Licensing (IPTeL) at the Indian Institute of Science is geared towards identifying and sourcing the IP at IISc, facilitating the steps towards protection of the IP, continuously safeguarding and prosecuting the IP, and finally enabling the commercial and social utilization of the IP, through appropriate means of licensing and technology transfer.

The specific activities and roles of IPTeL are as follows:

• Identifying the sources of Intellectual Property at the Indian Institute of Science;

• Inviting Invention Disclosures from the Inventor(s);

• Evaluation and Assessment of the above Invention Disclosure(s);

• Assisting the Inventor(s) with prior art search;

• Assigning appropriate Patent Attorney(s) to assist the Inventor(s) in drafting claims and specifications for the Invention(s);

• Informing the Inventor(s) of the various stages of progress during the filing, prosecution, and maintenance of the Intellectual Property;

• Assisting in the efforts of the Inventor(s) towards Technology Licensing, and other forms of Technology Transfer to bring the Invention to “commercial” and/or “social” practice.

Some of the interesting “inventions” from the Institute, and the accompanying “technology transfer” practices are highlighted in this paper. These include:

• Portable Washing Machine for Rural Areas

• Flood-resistant Septic Tank

• Optical nano sensor

• Electrical Gradient Augmented Fluid Filtration Apparatus

• Patient Transfer Device

• Electrochemical Test Cell

Nada Philip

Kingston University London, UK

Title: Big data analytics and visualization for healthcare â€“ A glance into the AEGLE platform features via its presentation layer

Biography:

Nada Philip is an Associate Professor in the field of Mobile Health at Kingston University London. She is the Founder of Digital Media for Health Research Group. She has been involved in many national and international mHealth projects in the areas of “Chronic disease management (e.g. diabetes, COPD and Cancer), big data analytics for health, decision support systems, telehealth and telecare in developing countries, social robotics, medical video streaming, namely; OTELO, Delphi, LTE-Health, WELCOME and AEGLE”. She is the author and co-author of more than 50 journals, peer reviewed conferences and book chapters. She is a member of the review panels for many grants bodies, transactions, and conferences.

Abstract:

Recently there has been great interest in health data and on how to produce value out of it so that it can support optimized integrated clinical decision making about the health of individuals and reduce cost. Many initiatives exist at the European level and worldwide, pinpointing the importance and usefulness of healthcare big data. One of these initiatives is the AEGLE EU project which aims to generate value from the healthcare data value chain data with the vision to improve translational medicine and facilitate personalized and integrated care services and thus improving overall healthcare at all levels, to promote data-driven research across Europe and to serve as an enabler technology platform, enabling business growth in the field of big data analytics for healthcare. In this talk, we will be looking into the main technical features of the AEGLE platform through its presentation layer that act as the user interface to the Big Data analytics and visualization infrastructure and algorithms of the AEGLE platform. In particular, we will be looking into the type 2 diabetes (T2D) use case and present some of the initial results of scenarios that aims at using analytics and visualization tools to discover knowledge and the causing factors of diabetes complications and their correlation to medication modality as well as biological markers while providing the potential to act as a predictive tool of such complications.

Data Mining Applications in Science, Engineering, Healthcare and Medicine | Data Mining Methods and Algorithms | Data Mining Tools and Software | Big Data Applications | Data Mining Tasks and Processes | Data Privacy and Ethics | Big Data Technologies | Social network analysis | Business Analytics | Search and data mining | Clustering

Chair

Fionn Murtagh

University of Huddersfield, UK

Co-Chair

Narasimha Murty Yedidi

Electronic Arts, USA

Session Introduction

Knut Hinkelmann

FHNW University of Applied Sciences and Arts Northwestern, Switzerland

Title: An innovation framework for big data

Biography:

Knut Hinkelmann is Professor for Information Systems and Head of the Master of Science in Business Information Systems at the FHNW University of Applied Sciences and Arts Northwestern Switzerland. He is a Research Associate at the University of Pretoria (South Africa) and Adjunct Professor at the University of Camerino (Italy). Prior to that he has worked at the German Research Center for Artificial Intelligence (DFKI) and as Product Manager Innovation for Insiders Information Management

Abstract:

Big data has challenged existing business models and provided new business opportunities. Companies capture large volume and variety of transactional data, which contains information about their customers, suppliers and operations. Advanced data analysis techniques can help to gain knowledge about patterns, trends and user behaviors from such large datasets. This knowledge empowers businesses to innovate new products, services and business models. Scientific literature discusses use cases of value creation, data analysis techniques and technologies supporting big data. According to recent studies, the main challenge faced by companies is the proper utilization of the knowledge extracted via data analysis to create meaningful innovations. The current innovation frameworks like Google’s Design Sprint guide organizations to create innovative IT applications in an agile manner. Design thinking oriented innovation frameworks like the one from Beckman and Barry (2007) place a strong emphasis on observations, e.g. to understand customer behavior and to identify their (implicit) needs. In today’s digitalized world, however, the observation of such behavior requires analyzing digital traces of on-line transactions and combining it with data from different sources. We therefore propose to develop an innovation framework for big data that helps companies to exploit the knowledge generated from such data present within or outside the organization. This framework will provide the best practices, data analysis tools and technologies that can guide the companies to innovate from big data. In order to give meaning to the identified patterns, the data analysis is combined with background knowledge represented in ontology.

Alla Sapronova

Uni Research Computing, Norway

Title: Prune the inputs, increase data volume, or select a different classification method â€“ a strategy to improve accuracy of classification

Biography:

Dr. Alla Sapronova has completed her PhD at the age of 29 years from Moscow State University, Russia and postdoctoral studies from UniFob, University of Bergen, Norway. She is the Head of Data Science at Center for Big Data Analysis, Uni Research, a multidisciplinary research institute in Bergen, Norway. Last 5 years she has published more than 15 papers in reputed journals and has been serving as an external censor for University of Bergen, Norway and Nelson Mandela Municipal University, South Africa.

Abstract:

Classification, the process of assigning data into labeled groups, is one of the most common operation in data mining. Classification can be used in predictive modeling to learn the relation between desired feature-vector and labeled classes. When the data set contains arbitrary big number of missed data and/or the amount of data samples is not adequate to the data complexity, it is important to define a strategy that allows to reach highest possible classification accuracy. In this work authors present results on classification-based predictive model's accuracy for three different strategies: input pruning, semi-auto selection of various classification methods, and data volume increase. Authors suggest that a satisfactory level of model's accuracy can be reached when preliminary input pruning is used.

The presented model is connecting fishing data with environmental variables. Even with limited number of samples the model is able to resolve the type of the fish with up to 92% of accuracy.

The results of using various classification methods are shown and suggestions are made towards defining the optimal strategy to build an accurate predictive model, opposed to common trial-and-error method. Different strategies for input pruning that assure information's preservation are described.

Eleni Rozaki

Institute of Technology Tallaght, Ireland

Title: Business analytics applications in budget modelling to improve network performance

Biography:

Eleni Rozaki obtained an honours degree in economics and earned her M.Sc. degree in Quantitative methods and Informatics from University of Bari Italy. Her PhD degree is in the area of data mining and business analytics in telecommunication networks at Cardiff University, United Kingdom. She has experience as a data analyst in IT and in the telecommunication industry in Ireland. She is currently working as an associate lecturer in Institute of Technology Tallaght, National College of Ireland and Dublin Institute of Technology. Her current research interests include business analytics, predictive modelling, decision support systems and data mining techniques

Abstract:

Efficient and effective network performance monitoring is a vital part of telecommunications networks. Finding and correcting faults, however, can be very costly. In order to address network issues associated with financial losses a model in business analytics for budget planning is presented. The model that is presented is based on previous work of network fault detection, along with cost considerations and customer segmentation. This work focuses on data mining techniques to show the cost probability of the distribution of network alarms based on budget planning classification rules using predictive analytics to determine the minimum bandwidth costs that are possible with network optimisation. The results of the tests performed show that reductions in optimisation costs are possible; some test cases are clustered, of which the results were used to create a performance -based budget model. The results also find out the clients’ demographics, customers’ churn and simultaneously the financial cost of network optimisation in order to review an efficient budget process and improve expenditure prioritisation

Derrick K Rollins

Iowa State University, USA

Title: Powerful and novel multivariate statistical approaches in big data sets and in data mining with applications to bio-, medical-, and material-informatics

Biography:

Derrick K Rollins is a Professor of Chemical and Biological Engineering and Statistics. He completed his Graduation in Statistics and Chemical Engineering from Ohio State University. He worked at DuPont chemical company for seven years and four months as a Faculty Intern. He received various awards that include: The 2012 Tau Beta Pi McDonald Mentoring Award, the 1996 American Association for the Advancement of Science (AAAS) Mentor Award and National Science Foundation Presidential Faculty Fellows Award. His research areas include “Blood glucose monitoring, modeling and control, and (medical-, bio-, and material-) informatics”.

Abstract:

Advanced statistical methodologies have key roles to contribute in data mining and informatics for large data sets. Over the years, our research has developed a number of statistical techniques exploiting multivariate analysis and methodologies in many applications including bioinformatics, specifically, microarray data sets in a number of applications, medical informatics, including disease diagnosis and discovery, and in material informatics, including the development and evaluation of material properties and testing techniques. In this talk, we present the tools and methodologies that we have developed over the years and discuss their attributes and strengths. The two primary multivariate statistical methodologies that we have exploited have been principal component analysis (PCA) and cluster analysis (CA). This talk will break this technique down for the non-expert and then demonstrate their strengths in handling large data sets to extract critical information that can be exploited in analysis, inference, diagnosis and discovery.

Christophe Jouis

Title: Contextual Exploration (EC3): a strategy for the detection, extraction and visualization of target data

Biography:

Witold Dzwinel holds full professor position at the AGH University of Science and Technology, Department of Computer Science in Krakow. His research activities focuses on computermodeling and simulation by using discrete particles. Simultaneously, he is doing research in interactive visualization of big data and machine learning algorithms. Professor Dzwinel is the author and co-author of about 190 papers in computational science, computational intelligence and physics.

Abstract:

EC3 is intended to extract relevant information from large heterogeneous and multilingual text data, in particular in WEB 3.0. The project is based on an original method: contextual exploration. EC3 does not need syntactic analysis, statistical analysis or a "general" ontology. EC3 uses only small ontologies called "linguistic ontologies" that express the linguistic knowledge of a user who must concentrate on the relevant information from one point of view. This is why EC3 works very quickly on large corpus, whose components can be both whole books as well as "short texts": SMS to books. At the output, EC3 offers a visual representation of information using an original approach: the "Memory Islands".EC3 is implanted in the ACASA / LIP6 team. EC3 is tested on the large digitized corpus provided by the Labex OBVIL «Observatoire de la Vie Littéraire», in partnership with the Bibliothèque Nationale de France (http://obvil.paris-sorbonne.fr/). OBVIL intends to develop all the resources offered by digitization and computer applications to examine French literature from the sixteenth to the twentieth century, English and American literature, Italian literature, Spanish literature, in its most traditional formats and media Or the most innovative.

Giovanni Rossi

University of Bologna, Italy

Title: Objective function-based clustering via near-Boolean optimization

Biography:

Giovanni Rossi completed his PhD in Economics and joined Computer Science department at University of Bologna. His research interests include “Discrete applied mathematics, combinatorial geometries, lattice and pseudo-Boolean functions, network community structure and graph clustering”.

Abstract:

Objective function-based clustering is here looked as a maximum-weight set partitioning combinatorial optimization problem, with the instance given by a pseudo-Boolean (set) function assigning real-valued cluster scores (or costs, in case of minimization) to data subsets, while on every partition of data the global objective function takes the value given by the sum over clusters (or blocks) of their individual score. The instance may thus maximally consist of 2n reals, where n is the number of data, although in most cases the scores of singletons and pairs also fully determine the scores of larger clusters, in which the pseudo- Boolean function is quadratic. This work proposes to quantify the cluster score of fuzzy data subsets by means of the polynomial MLE (multi-linear extension) of pseudo-Boolean functions, thereby translating the original discrete optimization problem into a continuous framework. After analyzing in these terms, the modularity maximization problem, two further examples of quadratic cluster score functions for graph clustering are proposed, while also analyzing greedy (local and global) search strategies.

Abdul Basit

State Bank of Pakistan, Pakistan

Title: Data mining via entropy: Application on trade database

Biography:

Abdul Basit has completed his MS in Social Sciences at SZABIST, Pakistan. Currently, he is the PhD research scholar in the discipline of Statistics at National College of Business Administration & Economics Lahore, Pakistan. He is an Assistant Director of Statistics & DWH Department of State Bank of Pakistan. He has published four research papers in the journals and many articles were presented in national and international conferences.

Abstract:

Entropy is a mathematical tool to gather maximum information about understanding distribution, systems, surveys and databases. We are introducing entropy as a tool of data mining which provides the maximum information about the trade behavior in different regions. In this study, we also derived a new entropy measure for data mining. This study will lead us to explore the new avenues of business and investment in Pakistan. China is the biggest player of global trade from Asian region. To expand the scope of competitiveness, China is continuously investing in different projects around the world. The China–Pakistan Economic Corridor (CPEC) is one of the major projects. The corridor is considered to be an extension of China's economic ambition One Belt-One Road initiative (OBOR). In future China wants to expand her trade with the world using the CPEC to enhance the scope of competitiveness. Pakistan also believes in open trade and continuously trying to enhance trade with the world. To attain maximum advantage of CPEC, Pakistan needs to explore the opportunities for the investors and business communities. In this study, we will develop linkages between the trends of our industries, commodities and their future demand in different regions.

Young Researchers Forum

Session Introduction

Febriana Misdianti

University of Manchester School of Computer Science, UK

Title: Implementation and performance comparison of Wilson, holdout, and Multiedit algorithm for edited k-nearest neighbor

Biography:

Febriana Misdianti is a Post-graduate student at University of Manchester. She has completed her Bachelor degree of Computer Science at Universitas Indonesia and has two years of working experience in startup companies in Jakarta and Singapore. She has been winning a lot of competitions related to computer science and has published a paper about data security in a reputable journal.

Abstract:

K-nearest neighbor (k-NN) classifier is widely used for classifying data in various segments. However, k-NN classifier has high computational cost because it uses linear search through all training data. In naïve implementation, k-NN will go through all training set i to compute its distance d from input data (O(nd)). Then, it loops again for all training set n to find k smallest results (O(nk)). So, the overall time complexity for k-NN is O(nd+nk). Thus, it is not suitable to classify multidimensional data with a huge number of training set. Meanwhile, k-NN needs a large number of samples in order to work well. There have been several ideas proposed to increase the time performance of k-NN in predicting a test data. One of the popular ideas is by reducing the number of TS in a model so that it would cut the testing time as the number of data that need to be explored becomes smaller. The aim of this experiment is to implement k-NN editing algorithms that cut the number of training data so that it become faster in predicting an input. This experiment implements three editing algorithms, namely, Wilson’s editing, holdout editing, and Multiedit algorithm, and also compare the performance of them.

Sofie De Cnudde

Antwerp University, Belgium

Title: Deep learning classification for large behavioral data

Biography:

Sofie De Cnudde is currently finishing her PhD at the University of Antwerp in Belgium. Her PhD analyses and compares the performance of wide and deep classification techniques on large and sparse human behavioral data. She has given tutorials on the use of deep learning for large-scale classification and has currently finished a research stay at an e-commerce company in London. Her work has been published in Expert Systems with Applications, Decision Support Systems and has been presented at conferences such as IFORS.

Abstract:

Deep learning has demonstrated significant performance improvements over traditional state-of-the-art classification techniques in knowledge discovery fields such as computer vision and natural language processing. Moreover, the representation-learning nature of the deep learning architecture enables intuitive interpretation of these complex models. The predictive analysis of very high-dimensional human behavioral data (originating from contexts such as e-commerce) could highly benefit from a complex classification model on the one hand and from an intuitive insight in the many fine-grained features on the other hand. The research question we investigate is whether and how deep learning classification can extend its superior results to this omnipresent type of big data. We present the following three contributions. First and foremost, the results of applying deep learning on large, sparse behavioral data sets demonstrate as good as or better results compared to shallow classifiers. We find significant performance improvements over linear support vector machines, logistic regression with stochastic gradient descent and a relational classifier. Moreover, we shed light on hyper-parameter values to facilitate the adoption of deep learning techniques in practice. The results demonstrate that an unsupervised pre-training step does not improve classification performance and that a tanh non-linearity achieves the best predictive performance. Lastly, we entangle the meaning of the neurons in a manner that is intuitive for researchers and practitioners, and show that the separate neurons identify more nuances in the many fine-grained features compared to the shallow classifiers.

Video Presentation

Session Introduction

Joseph Bonello

University of Malta, Malta

Title: Automated dataset generator

Biography:

Joseph Bonello is an Assistant Lecturer at University of Malta. He has worked on various EU related projects. He worked on various IT projects for the Department of Customs. He has managed the development of the IRRIIS project for a local member of the consortium. He has experience in developing commercial applications including work for local banking firms and government. He completed his BSc (Hons) in IT in 2002 at University of Malta and Master’s degree at University of Malta on Automated Detection and Optimization of NP-Hard Scheduling Problems in Manufacturing.

Abstract:

Many IT organizations today spend more than 40% of the project’s budget on software testing (Capgemini, 2015; Tata, 2015). One way of making the testing process easier is to automatically generate synthetic data that would be similar to the data inputted in the real environment. The benefits offered by an automated process are invaluable since real datasets can be hard to obtain due to data sensitivity and confidentiality issues. Automated Dataset Generator (ADaGe) enables the end-user to define and tweak a dataset definition through an interface in a way that satisfies the system’s requirements. Our tool enables the automatic generation of small to large amount of synthesized data that can be outputted to a file or streamed to a real-time processing system.

Jose Manuel Lopez-Guede

University of the Basque Country (UPV/EHU), Spain

Title: Data Mining applied to Robotics

Biography:

Jose Manuel Lopez-Guede received the Ph.D. degree in Computer Sciences from Basque Country University. He got 3 investigation grants and worked in a company 4 years. Since 2004 he worked as full time Lecturer and since 2012 as assoc. prof. He has been involved in 24 competitive projects and published more than 100 papers, 25 on Educational Innovation and the remaining in specific research areas. He has 20 ISI JCR publications, more than 15 other journals and more than 40 conferences. He has belonged to more than 10 organizing committees of international conferences and to more than 15 scientific committees.

Abstract:

One of the key activities of Data Mining is to discover and make clear hidden relations and working rules in complex systems. Robotics is a complex scope in which first prinicples approaches have been used to solve straight problems, but that approach is not enought to deal with complex problems, where more intelligence-based approaches are needed. Data Mining can be used for autonomous learning of control algorithms for Linked Multicomponent Robotic Systems (L-MCRS), because it is an open research field. Single Robot Hose Transport (SRHT) is a limit case of this kind of systems, when one robot moves the tip of a hose to a desired position, while the other hose extreme is attached to a source position. Reinforcement Learning (RL) algorithms have been applied to learn autonomously the robot control in SRHT, specifically Q-Learning and TRQ-Learning have been applied with success. However storing the state-action value functional information in tabular form produces large and intractable data structures, and using the Data Mining approach the problem can be addressed by discivering and learning the state-action values of Q-table by Extreme Learning Machines (ELM), obtaining a data reduction because the number of ELM parameters is much less than the Q-table's size. Moreover, ELM implements a continuous map which can produce compact representations of the Q-table, and generalizations to increased space resolution and unknown situations.