Day 2:
University of Derby, UK
Fionn Murtagh is Professor of Data Science and was Professor of Computer Science, including Department Head, in many universities. Fionn was Editor-in-Chief of the Computer Journal (British Computer Society) for more than 10 years, and is an Editorial Board member of many journals. He has authored or edited over 300 refereed articles and 30 books. His fellowships and scholarly academies include: Fellow of the British Computer Society (FBCS), Institute of Mathematics and Its Applications (FIMA), International Association for Pattern Recognition (FIAPR), Royal Statistical Society (FRSS) and Royal Society of Arts (FRSA); Elected Member of the Royal Irish Academy (MRIA) and Academia Europaea (MAE); Senior Member of the IEEE.
The benefits and also the challenges of Big Data analytics can be addressed in innovative ways. Analytical focus is known to be important. As an analogy for our analytics, consider how a microscope or a telescope brings about observation and measurement at very fine scales and at very gross scales; we can associate that analogy with the resolution scale of our analysis. Another challenge is the bias in Big Data, but we may calibrate our analytical process with a Big Data framework or infrastructure. A further challenge, of an ethical nature, is how representativity replaces the individual, so we want "to rehabilitate the individual". Important opportunities arise from contextualization. That can be associated with the resolution scale of our analytics, and it can also be supported by taking full account of appropriate contexts. The innovation that stems from the different facets of our analytical procedures can be of great benefit. Here we seek to discuss many such themes, always in the context of interesting and important case studies. The main case studies here include the following: analytics of mental health and associated well-being; social media analytics based on Twitter; questionnaire and survey analytics with many respondents. Ultimately what is sought is not just scalability, but also new and insightful, revealing and rewarding, perspectives, returns and benefits. A book of ours is to be published in April 2017: Data Science Foundations: Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics.
National Research University Higher School of Economics, Russia
Professor Fuad Aleskerov is a leading scientist in mathematics and in multicriterial choice and decision making theory. He is the Head of the International Laboratory of Decision Choice and Analysis and the Head of the Department of Mathematics for Economics of the National Research University Higher School of Economics (Moscow, Russia). He has published 10 books and many articles in leading academic journals. He is a member of several scientific societies, a member of the editorial boards of several journals, and the founder and head of many conferences and workshops. He has been an invited speaker at numerous international conferences, workshops, and seminars.
The high computational complexity of the most accurate algorithms in search, ranking, and recommendation applications is a critical problem when we deal with large datasets. Even quadratic complexity may be inadmissible. Thus, the task is to develop efficient algorithms through consistent reduction of information and through the use of linear algorithms in the first steps.
The problem of whether functions of several variables can be expressed as a superposition of functions of fewer variables was first formulated by Hilbert in 1900 as Hilbert’s thirteenth problem. The answer to this general question for the class of continuous functions was given in 1957 by Arnold and Kolmogorov. For the class of choice functions this matter was studied only by our team.
A new effective method for search, ranking, and recommendation problems in large datasets is proposed, based on superposition of choice functions. The developed algorithms have low computational complexity, so they can be applied to big data. One of the main features of the method is the ability to identify the set of efficient options when one deals with a large number of options or criteria. Another feature of the method is the ability to adjust its computational complexity. The application of the developed algorithms to the Microsoft LETOR dataset showed 35% higher efficiency compared with standard techniques (for instance, SVM).
The proposed methods can be applied, for instance, to the selection of effective options in search and recommendation systems, decision support systems, Internet networks, traffic classification systems and other relevant fields.
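The idea of superposing choice functions can be illustrated with a minimal sketch. It is not the authors' algorithm: the two rules used here (a linear-time "at least the mean" filter and a quadratic Pareto rule) and all function names are assumptions chosen only to show how a cheap first step reduces the set that a costlier second step has to examine.

```python
from typing import Callable, List, Tuple

Option = Tuple[float, ...]
ChoiceFunction = Callable[[List[Option]], List[Option]]


def above_mean(criterion: int) -> ChoiceFunction:
    """Hypothetical linear-time rule: keep options whose value on one
    criterion is at least the mean of that criterion."""
    def choose(options: List[Option]) -> List[Option]:
        if not options:
            return []
        mean = sum(o[criterion] for o in options) / len(options)
        return [o for o in options if o[criterion] >= mean]
    return choose


def pareto(options: List[Option]) -> List[Option]:
    """Quadratic-time rule: keep options not dominated by any other option
    (larger values are assumed to be better on every criterion)."""
    def dominates(a: Option, b: Option) -> bool:
        return (all(x >= y for x, y in zip(a, b))
                and any(x > y for x, y in zip(a, b)))
    return [o for o in options if not any(dominates(p, o) for p in options)]


def superpose(*rules: ChoiceFunction) -> ChoiceFunction:
    """Superposition: each rule is applied to the output of the previous one."""
    def choose(options: List[Option]) -> List[Option]:
        for rule in rules:
            options = rule(options)
        return options
    return choose


# The cheap filter runs first, so the quadratic rule sees fewer options.
select = superpose(above_mean(0), pareto)
options = [(1, 5), (4, 4), (5, 1), (3, 3), (2, 2)]
result = select(options)  # [(4, 4), (5, 1)]
```

The order of the rules matters: applying the Pareto rule alone to the same options would also keep (1, 5), which the first step has already discarded.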
- Data Mining Tasks and Processes
Independent Consultant (Architect, Developer) Big Data & Data Science, USA
Sumit has more than 22 years of experience in the software industry in various roles spanning companies from startups to enterprises. He is a big data, visualisation and data science consultant, a software architect and a big data enthusiast, and builds end-to-end data-driven analytic systems. Sumit has worked for Microsoft (SQL Server development team), Oracle (OLAP development team) and Verizon (Big Data analytics team) in a career spanning 22 years. Currently, he works for multiple clients, advising them on their data architectures and big data solutions, and does hands-on coding with Spark, Scala, Java and Python. Sumit has spoken at Big Data conferences in Boston, Chicago, Las Vegas and Vancouver. He has extensive experience in building scalable systems across the stack, from middle tier and data tier to visualization for analytics applications, using Big Data and NoSQL databases. Sumit has deep expertise in database internals, data warehouses, dimensional modeling, and data science with Scala, Java, Python and SQL. Sumit started his career as part of the SQL Server development team at Microsoft in 1996-97 and then worked as a Core Server Engineer for Oracle Corporation in their OLAP development team in Boston, MA, USA. Sumit has also worked at Verizon as an Associate Director for Big Data Architecture, where he strategized, managed, architected and developed platforms and solutions for analytics and machine learning applications. Sumit has also served as Chief Architect at ModelN/LeapfrogRX (2006-2013), where he architected the middle tier core analytics platform with an open source OLAP engine (Mondrian) on J2EE and solved some complex dimensional ETL, modelling and performance optimization problems.
With the rapid adoption of Hadoop in the enterprise, it has become all the more important to build SQL engines on Hadoop for all kinds of workloads and for almost all kinds of end users and use cases. From low latency analytics based SQL, to ACID based semantics on Hadoop for operational systems, to SQL for handling unstructured and streaming data, SQL is fast becoming the lingua franca in the big data world too. The talk focuses on the exciting tools, technologies and innovations, their underlying architectures, and the exciting road ahead in this space. This is a fiercely competitive landscape, with vendors and innovators trying to capture mindshare and a piece of the pie through a whole suite of innovations, from index based SQL solutions in Hadoop, to OLAP with Apache Kylin and Tajo, to BlinkDB and MapD.
- Why SQL on Hadoop
- Challenges of SQL on Hadoop
- SQL on Hadoop Architectures for Low Latency Analytics (Drill, Impala, Presto, SparkSQL, JethroData)
- SQL on Hadoop Architecture for Semi-Structured Data
- SQL on Hadoop Architecture for Streaming Data and Operational Analytics
- Innovations (OLAP on Hadoop, Probabilistic SQL Engines, GPU Based SQL Solutions)
Drinker Biddle & Reath, Washington, DC
Bennett B. Borden is a partner at Drinker Biddle & Reath and its Chief Data Scientist, the only Chief Data Scientist who is also a practicing attorney. Bennett is a globally recognized authority on the legal, technology and policy implications of information. Bennett’s ground-breaking research into the use of machine-based learning and unstructured data for organizational insight is now being put to work in data-driven early warning systems for clients to detect and prevent corporate fraud and other misconduct. Bennett received his Master of Science in Data Analytics from New York University and his JD from Georgetown University.
Analytic models are playing an increasing role in the development, delivery and availability of goods and services. Who gets access to what goods or services and at what price are increasingly influenced by algorithms. This may not matter when we’re talking about a $0.25 coupon for a candy bar, but what about public goods and services like education, healthcare, and energy distribution? What about predicting who will get a job or how we will police our society? In this session, we will explore the socioeconomic impact of algorithms, the ethics of big data, and how to work ethics into our analytics projects.
Jenny Lundberg completed her PhD in Computer Science in 2011 at BTH, the most profiled IT university in Sweden. She is employed as a senior lecturer at Linnaeus University and as a researcher at Lund University. She has extensive experience of international research and education collaboration. Her research interest is in health applications, and she works in close cooperation with clinical researchers in healthcare with Big Data, e-health and m-health approaches and techniques. Taking an active approach to including computing competence in early education, and fostering technology for all, is another important focus area for her.
The Industry 4.0 era gives us extensive IoT opportunities to provide evidence-based and context-based data, opening the way for new approaches and methods to meet societal challenges. The handling of chronic diseases is a global concern and poses challenges to current health systems. The incidence of the chronic disease diabetes is of epidemic character. In 2015, 415 million people in the world had diabetes, and it is estimated that in 2040, 642 million will have diabetes (http://www.idf.org/about-diabetes/facts-figures). More specifically:
• Diabetes is a heterogeneous group of conditions that all result in, and are defined by, plasma glucose that is chronically above normal levels if not well treated. If untreated, death follows sooner or later; if later, with a lot of unpleasant complications over time.
• There are two main types of diabetes: type 1 (10% of all cases) and type 2 (85-90% of all cases).
Diabetes places extremely high demands on the individual in terms of self-care, and a lot of complications can occur. It is well known that this creates serious health conditions and high social costs. Potentially, this can be prevented with new methods for better support of self-care. Recent advances in mobile computing, sensor technology and Big Data can be used to better understand diabetes, its measurements and its data. To overcome some of the problems in this area, open data, social media, specially designed apps, sensors and wearables can be used to find proactive ways and methods of diabetes treatment.
Kazumi Nakamatsu received the Dr. Sc. from Kyushu University, Japan. He is a Professor at University of Hyogo, Japan. He contributed over 150 journal/conference papers and book chapters, and edited/authored 12 books published by prominent publishers. He has chaired various international conferences/workshops, and he has been a program committee member/chair of academic conferences. He serves as Editor-in-Chief of the Int’l J. Reasoning-based Intelligent Systems and an editorial member/associate editor of many international journals. He has contributed numerous invited lectures at conferences/academic organizations. He received some conference/best paper awards at some international conferences.
Paraconsistent logic is a well-known formal logic that can deal with contradiction consistently within the framework of a logical system. One of the paraconsistent logics, called annotated logic, was proposed by Prof. Newton da Costa et al., and its logic program was later developed by Prof. V.S. Subrahmanian et al. as a tool for dealing with data in knowledge bases. Some years later, a kind of paraconsistent annotated logic program was developed by Kazumi Nakamatsu for dealing with non-monotonic reasoning such as default reasoning. Recently, a paraconsistent annotated logic program called Extended Vector Annotated Logic Program with Strong Negation (abbr. EVALPSN), which can deal with conflict resolving, defeasible deontic reasoning, plausible reasoning, etc., has been developed and has already been applied to various intelligent control and safety verification systems such as pipeline valve control, traffic signal control, railway interlocking safety verification, etc. Furthermore, most recently, one specific version of EVALPSN called Before-after EVALPSN (abbr. Bf-EVALPSN), which can deal with before-after relations between processes (time intervals), has been developed.
In this lecture, I use a small example to introduce how EVALPSN and Bf-EVALPSN deal with contradictory data and how they can be applied to intelligent control or safety verification of sensed data.
University of Jyväskylä, Finland
Tommi Kärkkäinen completed his PhD at the University of Jyväskylä in 1995 and has worked as a full professor in the Faculty of Information Technology since 2002. He has served, and continues to serve, in many positions of administration and responsibility at the faculty and university level. He has published over 150 research papers, led dozens of R&D projects, and supervised over 20 PhD theses.
Clustering is the most common unsupervised, descriptive analysis technique for revealing hidden patterns and profiles in a dataset. There exists a large number of different clustering algorithms, but approaches that specifically address clustering of sparse datasets are still scarce, even though real-world datasets are often characterized by missing values with an unknown sparsity pattern. Typical approaches in the knowledge discovery process are either to completely omit the observations with missing values or to use some imputation method to “fill in the holes” in the data. However, the throw-data-away approach does not utilize all possible data, and imputation necessarily introduces assumptions about the unknown density of the data. Moreover, by the well-known curse-of-dimensionality results, such assumptions are no longer valid in high-dimensional spaces.
The purpose of this presentation is to describe and summarize a line of research that addresses sparse clustering problems with the available data strategy and robust prototypes. The strategy allows one to utilize all available data without any additional assumptions. The actual prototype-based clustering algorithm, k-SpatialMedians, relies on the computation of a robust prototype as the cluster centroid, motivated by non-Gaussian within-cluster error, in contrast to the classical k-means method. As with any prototype-based algorithm, the initialization step of the locally improving relocation algorithm has an important role and should be designed to handle sparse data. Such an approach is proposed, and the scalability of a distributed implementation of the whole algorithm is tested with openly available large and sparse datasets.
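As a rough illustration of the available data strategy combined with a robust prototype, the sketch below computes a spatial median of observations with missing values by a Weiszfeld-type iteration. It is a minimal sketch under stated assumptions, not the presented implementation: missing values are encoded as None, the function names are hypothetical, and the plain fixed-count iteration is chosen only for brevity.

```python
import math
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]


def available_distance(x: Row, c: Sequence[float]) -> float:
    """Euclidean distance computed over the observed components of x only."""
    return math.sqrt(sum((xj - cj) ** 2
                         for xj, cj in zip(x, c) if xj is not None))


def spatial_median(rows: List[Row], iterations: int = 100,
                   eps: float = 1e-9) -> List[float]:
    """Weiszfeld-type iteration for the spatial median of incomplete rows.

    Each row contributes only through its observed components, so nothing
    is imputed and no observation is thrown away.
    """
    dim = len(rows[0])
    # Start from the coordinate-wise mean of the available values.
    center = []
    for j in range(dim):
        observed = [r[j] for r in rows if r[j] is not None]
        center.append(sum(observed) / len(observed) if observed else 0.0)
    for _ in range(iterations):
        # Inverse-distance weights; eps guards against division by zero.
        weights = [1.0 / max(available_distance(r, center), eps) for r in rows]
        updated = []
        for j in range(dim):
            num = sum(w * r[j] for w, r in zip(weights, rows)
                      if r[j] is not None)
            den = sum(w for w, r in zip(weights, rows) if r[j] is not None)
            updated.append(num / den if den > 0 else center[j])
        center = updated
    return center


# Five observations, three of them incomplete, one an outlier.
rows = [(0.0, 0.0), (0.1, None), (-0.1, 0.1), (None, -0.1), (50.0, None)]
center = spatial_median(rows)
```

In this toy example the mean of the observed first coordinates is 12.5 because of the single outlier, whereas the spatial median stays close to the bulk of the data, which is the robustness property the abstract refers to.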
AGH University of Science and Technology, Poland
Witold Dzwinel holds a full professor position at the AGH University of Science and Technology, Department of Computer Science, in Krakow. His research activities focus on computer modeling and simulation using discrete particles. He also does research in interactive visualization of big data and in machine learning algorithms. Professor Dzwinel is the author and co-author of about 190 papers in computational science, computational intelligence and physics.
Data embedding (DE) and graph visualization (GV) methods are closely related tools used in Exploratory Data Analysis for visualization of complex data such as high-dimensional data and complex networks, respectively. However, the high computational complexity and memory loads of existing DE and GV algorithms (based on the t-SNE concept on the one hand, and on force-directed methods on the other) considerably hinder visualization of truly large and big data consisting of as many as M ~ 10^6+ data objects and N ~ 10^3+ dimensions. In this paper, we demonstrate the high computational efficiency and robustness of our approach to data embedding and interactive data visualization. We show that by employing only a small fraction of the distances between data objects one can obtain a very satisfactory reconstruction of the topology of N-D data in 2D in linear time O(M). The ivhd (interactive visualization of high-dimensional data) method quickly and properly reconstructs the N-D data topology in a fraction of the computational time required by state-of-the-art DE methods such as bh-SNE and all its clones. Our method can be used for both metric and non-metric (e.g. large graphs) data visualization. Moreover, we demonstrate that even poor approximations of the nn-nearest-neighbor graph representing high-dimensional data can yield acceptable data embeddings. Furthermore, some incorrectness in the nearest neighbor list can often be useful for improving the quality of data visualization. This robustness of ivhd, together with its high memory and time efficiency, perfectly meets the requirements of big and distributed data visualization, where finding the accurate nearest neighbor list represents a great computational challenge.
Nicola Wesner completed his PhD in Economics at Paris X Nanterre University in 2001, at the age of 27, and has been an Associate Actuary since 2011. He is the head of the Pension Department at Mazars Actuariat, an audit and advisory firm. He has published many papers in reputed journals and specialized periodicals on various subjects such as econometrics, quantitative finance, insurance and pensions, and data mining.
This paper presents a very simple and intuitive multi-objective optimization method that makes use of interactive visualization techniques. This approach stands midway between the brush and link technique, a visual method used in operational research for exploratory analysis of multidimensional data sets, and interactive multi-criteria decision methods that use the concept of a reference point. Multiple views of the potential solutions on scatterplots allow the user to search directly for acceptable solutions in bi-objective spaces, whereas a Venn diagram displays information about the relative scarcity of potential acceptable solutions under distinct criteria. These very intuitive data visualization techniques allow for comprehensive interpretation and make it possible to communicate the results efficiently. More generally, the combination of information visualization with data mining allows the user to specify what he is looking for, yields easily reportable results and respects human responsibility. An application to the visual steering of genetic algorithms in a multi-criteria strategic asset allocation optimization problem is presented.
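The counts that such a Venn diagram would display can be sketched as follows. This is only an illustration under assumptions: a solution is taken to be acceptable on a criterion when its value reaches the corresponding level of a reference point (larger is assumed to be better), and the function names and sample values are hypothetical rather than taken from the paper.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Set, Tuple


def acceptable_sets(solutions: Sequence[Sequence[float]],
                    reference_point: Sequence[float]) -> List[Set[int]]:
    """For each criterion, the indices of solutions that reach the
    reference level on that criterion."""
    return [{i for i, s in enumerate(solutions) if s[k] >= level}
            for k, level in enumerate(reference_point)]


def venn_counts(sets: List[Set[int]]) -> Dict[Tuple[int, ...], int]:
    """Number of solutions acceptable under every criterion in each
    non-empty combination of criteria (the region sizes of a Venn diagram)."""
    counts = {}
    for size in range(1, len(sets) + 1):
        for combo in combinations(range(len(sets)), size):
            common = set.intersection(*(sets[k] for k in combo))
            counts[combo] = len(common)
    return counts


# Hypothetical candidate solutions scored on three criteria.
solutions = [(0.08, 0.90, 0.5), (0.05, 0.95, 0.7),
             (0.10, 0.60, 0.9), (0.07, 0.92, 0.8)]
reference_point = (0.06, 0.90, 0.6)
counts = venn_counts(acceptable_sets(solutions, reference_point))
```

A small count for a combination of criteria signals that acceptable solutions are scarce there, which is the kind of information a user could use to relax or tighten the reference point.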
Institute of Technology Tallaght, Ireland
Eleni Rozaki obtained an honours degree in economics and earned her M.Sc. degree in Quantitative Methods and Informatics from the University of Bari, Italy. Her PhD degree, from Cardiff University, United Kingdom, is in the area of data mining and business analytics in telecommunication networks. She has experience as a data analyst in IT and in the telecommunication industry in Ireland. She is currently working as an associate lecturer at the Institute of Technology Tallaght, the National College of Ireland and the Dublin Institute of Technology. Her current research interests include business analytics, predictive modelling, decision support systems and data mining techniques.
Efficient and effective network performance monitoring is a vital part of telecommunications networks. Finding and correcting faults, however, can be very costly. In order to address network issues associated with financial losses, a business analytics model for budget planning is presented. The model is based on previous work on network fault detection, along with cost considerations and customer segmentation. This work focuses on data mining techniques to show the cost probability of the distribution of network alarms, based on budget planning classification rules, using predictive analytics to determine the minimum bandwidth costs that are possible with network optimisation. The results of the tests performed show that reductions in optimisation costs are possible; some test cases were clustered, and the results were used to create a performance-based budget model. The results also reveal the clients’ demographics, customer churn and, simultaneously, the financial cost of network optimisation, in order to support an efficient budget process and improve expenditure prioritisation.