Day 2:
King Abdullah University of Science and Technology (KAUST), Saudi Arabia
Time: 09:30-10:00
Mikhail Moshkov has been a professor in the CEMSE Division at King Abdullah University of Science and Technology, Saudi Arabia, since October 1, 2008. He earned his master's degree from Nizhni Novgorod State University, received his doctorate from Saratov State University, and his habilitation from Moscow State University. From 1977 to 2004, Dr. Moshkov was with Nizhni Novgorod State University. From 2003 he worked in Poland at the Institute of Computer Science, University of Silesia, and from 2006 also at the Katowice Institute of Information Technologies. His main areas of research are complexity of algorithms, combinatorial optimization, and machine learning. Dr. Moshkov is the author or coauthor of five research monographs published by Springer.
In the presentation, we consider extensions of the dynamic programming approach to the study of decision trees as algorithms for problem solving, as a means of knowledge extraction and representation, and as classifiers that, for a new object given by values of conditional attributes, define a value of the decision attribute. These extensions allow us (i) to describe the set of optimal decision trees, (ii) to count the number of these trees, (iii) to perform sequential optimization of decision trees relative to different criteria, (iv) to find the set of Pareto optimal points for two criteria, and (v) to describe relationships between two criteria. The applications include the minimization of average depth for decision trees sorting eight elements (a question that had been open since 1968); improved upper bounds on the depth of decision trees for diagnosis of 0-1-faults in read-once combinatorial circuits over a monotone basis; the existence of totally optimal (simultaneously minimum-depth and minimum-size) decision trees for Boolean functions; a study of the time-memory tradeoff for decision trees for corner point detection; a study of relationships between the number and maximum length of decision rules derived from decision trees; a study of the accuracy-size tradeoff for decision trees, which allows us to construct small yet accurate decision trees for knowledge representation; and decision trees that, as classifiers, often outperform decision trees constructed by CART. The end of the presentation is devoted to an introduction to KAUST.
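To make the dynamic programming idea concrete, here is a toy sketch (not the authors' software; the decision table and code below are invented for illustration): the minimum depth of a decision tree for a finite decision table is 0 if all rows share one decision, and otherwise 1 plus the best, over attributes, of the worst depth among the sub-tables that the attribute induces. This recursion over sub-tables can be memoized:

```python
from functools import lru_cache

# Toy decision table: each row is a tuple of attribute values, last entry
# is the decision. No single attribute separates 'a' from 'b' here.
ROWS = [
    (0, 0, 0, 'a'),
    (0, 1, 0, 'b'),
    (1, 0, 1, 'b'),
    (1, 1, 1, 'a'),
]
N_ATTRS = 3

def min_depth(rows):
    """Minimum depth of a decision tree that determines the decision
    for every row, computed by dynamic programming over sub-tables."""
    return _solve(frozenset(rows))

@lru_cache(maxsize=None)
def _solve(rows):
    decisions = {r[-1] for r in rows}
    if len(decisions) <= 1:      # all rows agree: a leaf suffices
        return 0
    best = float('inf')
    for a in range(N_ATTRS):
        values = {r[a] for r in rows}
        if len(values) <= 1:     # attribute does not split this sub-table
            continue
        # depth if attribute a is queried at the root of this subtree
        depth = 1 + max(
            _solve(frozenset(r for r in rows if r[a] == v)) for v in values
        )
        best = min(best, depth)
    return best

print(min_depth(ROWS))  # -> 2
```

The same memoized traversal of sub-tables is what the extended approach builds on: storing richer information than a single depth value at each sub-table is what enables counting optimal trees, sequential optimization, and Pareto analysis.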
National Research University Higher School of Economics, Russia
Professor Fuad Aleskerov is a leading scientist in mathematics and in multicriteria choice and decision-making theory. He is the Head of the International Laboratory of Decision Choice and Analysis and the Head of the Department of Mathematics for Economics at the National Research University Higher School of Economics (Moscow, Russia). He has published 10 books and many articles in leading academic journals. He is a member of several scientific societies, a member of the editorial boards of several journals, and the founder and organizer of many conferences and workshops. He has been an invited speaker at numerous international conferences, workshops, and seminars.
The high computational complexity of the most accurate algorithms in search, rank, and recommendation applications becomes critical when we deal with large datasets. Even quadratic complexity may be inadmissible. Thus, the task is to develop efficient algorithms through consistent reduction of information and the use of linear algorithms in the first steps.
The problem of whether functions of several variables can be expressed as superpositions of functions of fewer variables was first formulated by Hilbert in 1900 as Hilbert's thirteenth problem. The answer to this general question for the class of continuous functions was given in 1957 by Arnold and Kolmogorov. For the class of choice functions, this matter has been studied only by our team.
A new efficient method for search, ranking, and recommendation problems in large datasets is proposed, based on superpositions of choice functions. The developed algorithms have low computational complexity, so they can be applied to big data. One of the main features of the method is the ability to identify the set of efficient options when one deals with a large number of options or criteria. Another feature is the ability to adjust the method's computational complexity. The application of the developed algorithms to the Microsoft LETOR dataset showed 35% higher efficiency compared with standard techniques (for instance, SVM).
The proposed methods can be applied, for instance, for the selection of effective options in search and recommendation systems, decision support systems, Internet networks, traffic classification systems and other relevant fields.
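A minimal sketch of what a superposition of choice functions can look like (the functions, weights, and data below are invented for illustration and are not the authors' algorithm): a cheap linear choice function first shrinks the option set, and only then is a more expensive Pareto choice function applied to the survivors, keeping the overall complexity low:

```python
# Hypothetical two-stage superposition of choice functions.

def linear_choice(options, weights, k):
    """Cheap O(n log n) pre-filter: keep the k best by weighted sum."""
    score = lambda o: sum(w * x for w, x in zip(weights, o))
    return sorted(options, key=score, reverse=True)[:k]

def pareto_choice(options):
    """Keep options not weakly dominated in every criterion (quadratic,
    so it runs only on the small set surviving the first stage)."""
    def dominated(o):
        return any(all(p[i] >= o[i] for i in range(len(o))) and p != o
                   for p in options)
    return [o for o in options if not dominated(o)]

options = [(9, 1), (8, 8), (7, 7), (2, 3), (1, 1)]
shortlist = linear_choice(options, weights=(1, 1), k=3)
efficient = pareto_choice(shortlist)
print(efficient)  # -> [(8, 8), (9, 1)]; (7, 7) is dominated by (8, 8)
```

The parameter k is one knob for adjusting computational complexity: a smaller shortlist makes the expensive second stage cheaper at the risk of discarding some efficient options.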
King Abdullah University of Science and Technology, Saudi Arabia
Omar M Knio completed his PhD at MIT in 1990. He held a postdoctoral position at MIT before joining the Mechanical Engineering faculty at Johns Hopkins University in 1991. In 2011, he joined the Mechanical Engineering and Materials Science Department at Duke University. In 2013, he joined the AMCS Program at KAUST, where he served as Deputy Director of the SRI Center for Uncertainty Quantification in Computational Science and Engineering. He has co-authored over 100 journal papers and two books.
This talk discusses the exploitation of large databases of model realizations for assessing model sensitivities to uncertain inputs and for calibrating physical parameters. Attention is focused on databases of individual realizations of an ocean general circulation model, built through efficient sampling approaches, and on the use of sampling schemes to build suitable representations of the dependence of the model response on uncertain input data. Non-intrusive spectral projections and regularized regressions are used for this purpose. A Bayesian inference formalism is then applied to update the uncertain inputs based on available measurements or observations. We illustrate the implementation of these techniques through extreme-scale applications, including the inference of physical parametrizations and the quantitative assessment and visualization of forecast uncertainties.
- Plenary Session
Drinker Biddle & Reath, Washington, DC
Bennett B. Borden is a partner at Drinker Biddle & Reath and its Chief Data Scientist, the only Chief Data Scientist who is also a practicing attorney. Bennett is a globally recognized authority on the legal, technology, and policy implications of information. His ground-breaking research into the use of machine learning and unstructured data for organizational insight is now being put to work in data-driven early warning systems that help clients detect and prevent corporate fraud and other misconduct. Bennett received his Master of Science in Data Analytics from New York University and his JD from Georgetown University.
Analytic models are playing an increasing role in the development, delivery and availability of goods and services. Who gets access to what goods or services and at what price are increasingly influenced by algorithms. This may not matter when we’re talking about a $0.25 coupon for a candy bar, but what about public goods and services like education, healthcare, and energy distribution? What about predicting who will get a job or how we will police our society? In this session, we will explore the socioeconomic impact of algorithms, the ethics of big data, and how to work ethics into our analytics projects.
Independent Consultant (Architect, Developer) Big Data & Data Science, USA
Sumit has more than 22 years of experience in the software industry in roles spanning companies from startups to enterprises. He is a big data, visualization, and data science consultant and a software architect who builds end-to-end data-driven analytic systems. He has worked for Microsoft (SQL Server development team), Oracle (OLAP development team), and Verizon (Big Data analytics team). Currently, he advises multiple clients on their data architectures and big data solutions and does hands-on coding with Spark, Scala, Java, and Python. Sumit has spoken at big data conferences in Boston, Chicago, Las Vegas, and Vancouver. He has extensive experience in building scalable systems across the stack, from the middle tier and data tier to visualization for analytics applications, using big data and NoSQL databases, and deep expertise in database internals, data warehouses, dimensional modeling, data science with Scala, Java, and Python, and SQL. Sumit started his career on the SQL Server development team at Microsoft in 1996-97 and then worked as a core server engineer on Oracle's OLAP development team in Boston, MA, USA. He has also worked at Verizon as an Associate Director for Big Data Architecture, where he strategized, managed, architected, and developed platforms and solutions for analytics and machine learning applications. He also served as Chief Architect at ModelN/LeapfrogRX (2006-2013), where he architected the middle-tier core analytics platform with an open-source OLAP engine (Mondrian) on J2EE and solved complex dimensional ETL, modeling, and performance optimization problems.
With the rapid adoption of Hadoop in the enterprise, it has become all the more important to build SQL engines on Hadoop for all kinds of workloads, end users, and use cases. From low-latency analytical SQL, to ACID semantics on Hadoop for operational systems, to SQL for handling unstructured and streaming data, SQL is fast becoming the lingua franca of the big data world too. The talk focuses on the exciting tools, technologies, and innovations in this space, their underlying architectures, and the road ahead. This is a fiercely competitive landscape, with vendors and innovators trying to capture mindshare and a piece of the pie through a whole suite of innovations: index-based SQL solutions on Hadoop, OLAP with Apache Kylin and Tajo, BlinkDB, and MapD.
- Why SQL on Hadoop
- Challenges of SQL on Hadoop
- SQL on Hadoop Architectures for Low Latency Analytics (Drill, Impala, Presto, SparkSQL, JethroData)
- SQL on Hadoop Architecture for Semi-Structured Data
- SQL on Hadoop Architecture for Streaming Data and Operational Analytics
- Innovations (OLAP on Hadoop, Probabilistic SQL Engines, GPU-Based SQL Solutions)
University College London, UK
Bertrand Hassani is an Associate Researcher at Paris 1 University and University College London. He has written several articles dealing with risk measures, data combination, scenario analysis, and data science. He was the Global Head of Research and Innovations of the risk division at Santander Group and is now the Chief Data Scientist of Capgemini Consulting. In this role, he aims at developing novel approaches to measure risk (financial and non-financial), to understand customers, and to improve predictive analytics for supply chains (among others), relying on methodologies from the field of data science (data mining, machine learning, A.I., etc.).
The arrival of big data strategies is threatening the latest trends in financial regulation related to the simplification of models and the enhancement of comparability across the approaches chosen by the various entities. Indeed, the intrinsically dynamic philosophy of big data strategies is almost incompatible with the current legal and regulatory framework, in particular the part related to model risk governance. Besides, as presented in our application to credit scoring, the model selection may also evolve dynamically, forcing both practitioners and regulators to develop libraries of models, strategies for switching from one model to another, and supervisory approaches allowing financial institutions to innovate in a risk-mitigated environment. The purpose of this paper is therefore to analyse the issues related to the big data environment, in particular to machine learning models, and to propose solutions to regulators, analysing the features of each algorithm implemented, for instance logistic regression, support vector machines, neural networks, random forests, and gradient boosting.
GERAD and HEC Montréal,Canada
Pierre Hansen is professor of Operations Research in the department of decision sciences of HEC Montréal. His research is focused on combinatorial optimization, metaheuristics, and graph theory. With Nenad Mladenovic, he developed the Variable Neighborhood Search metaheuristic, a general framework for building heuristics for a variety of combinatorial optimization and graph theory problems. Pierre Hansen received the EURO Gold Medal in 1986, as well as other prizes. He is a member of the Royal Society of Canada and the author or co-author of close to 400 scientific papers. His first paper on VNS has been cited almost 3000 times.
Many problems can be expressed as global or combinatorial optimization problems; however, due to the vast increase in the availability of databases, realistically sized instances cannot be solved in reasonable time. Therefore, one must often be content with approximate solutions obtained by heuristics. These heuristics can be studied systematically within general frameworks, or metaheuristics (genetic search, tabu search, simulated annealing, neural networks, ant colonies, and others). Variable Neighborhood Search (VNS) proceeds by systematic change of neighborhoods, both in the descent phase towards a local minimum and in a perturbation phase to get out of the corresponding valley. VNS heuristics have been developed for many classical problems such as the TSP, quadratic assignment, the p-median problem, and others. Instances of the latter problem with 89,600 entities in the Euclidean plane have been solved with an ex-post error not larger than 3%.
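The VNS scheme (descend to a local minimum, shake into the k-th neighborhood, and widen the neighborhood when no improvement is found) can be sketched on a toy 0-1 minimization problem; the objective function and parameters here are invented for illustration:

```python
import random

random.seed(0)

# Toy objective on 0-1 vectors: the interaction term creates a local
# minimum at [1, 0, ..., 0, 1] that single-bit descent cannot escape.
def f(x):
    return sum(x) + 3 * (x[0] ^ x[-1])

def local_search(x):
    """Descent over 1-flip neighbours until no flip improves f."""
    improved = True
    while improved:
        improved = False
        for i in range(len(x)):
            y = x[:]
            y[i] ^= 1
            if f(y) < f(x):
                x, improved = y, True
    return x

def shake(x, k):
    """Random point in the k-th neighbourhood: k random bit flips."""
    y = x[:]
    for i in random.sample(range(len(x)), k):
        y[i] ^= 1
    return y

def vns(x, k_max=8, iters=5):
    x = local_search(x)
    for _ in range(iters):
        k = 1
        while k <= k_max:
            y = local_search(shake(x, k))
            if f(y) < f(x):   # success: move there, restart neighbourhoods
                x, k = y, 1
            else:             # failure: try a wider neighbourhood
                k += 1
    return x

best = vns([1] * 8)
print(best, f(best))  # reaches the global minimum [0]*8 with value 0
```

The key VNS mechanism is visible in `vns`: perturbations grow only as far as needed, so the search stays near good solutions while still being able to leave any valley.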
In the last two decades, several discovery systems for graph theory have been proposed (Cvetkovic's Graph, Fajtlowicz's Graffiti, and Caporossi and Hansen's AutoGraphiX (AGX)). AGX uses VNS to systematically find extremal graphs for many conjectures judged to be of interest. Aouchiche systematically studied relations between 20 graph invariants taken in pairs, considering the four basic operations (-, +, /, x). Conjectures were found in about 700 out of 1520 cases. A majority of these cases were easy and solved automatically by AGX. Examination of the extremal graphs found suggests open conjectures to be solved by graph-theoretical means. This has led to several tens of papers by various authors, mainly from Serbia and China.
Indian Institute of Science, India
Parameshwar P Iyer holds a Bachelor's degree from IIT Kharagpur, a Master's degree from the University of Illinois, and a Doctorate from the University of California. He has over 43 years of teaching, research, and consultancy experience in engineering and technology management. He has held several important academic and administrative positions, including being the founder professor of an IIM in India. He has over 50 publications and four books to his credit. He has received several professional awards, including a prestigious fellowship from the University of Illinois. Currently, he is a Principal Research Scientist at the Indian Institute of Science.
The Office of Intellectual Property and Technology Licensing (IPTeL) at the Indian Institute of Science is geared towards identifying and sourcing the IP at IISc, facilitating the steps towards protection of the IP, continuously safeguarding and prosecuting the IP, and finally enabling the commercial and social utilization of the IP, through appropriate means of licensing and technology transfer.
The specific activities and roles of IPTeL are as follows:
• Identifying the sources of Intellectual Property at the Indian Institute of Science;
• Inviting Invention Disclosures from the Inventor(s);
• Evaluation and Assessment of the above Invention Disclosure(s);
• Assisting the Inventor(s) with prior art search;
• Assigning appropriate Patent Attorney(s) to assist the Inventor(s) in drafting claims and specifications for the Invention(s);
• Informing the Inventor(s) of the various stages of progress during the filing, prosecution, and maintenance of the Intellectual Property;
• Assisting in the efforts of the Inventor(s) towards Technology Licensing, and other forms of Technology Transfer to bring the Invention to “commercial” and/or “social” practice.
Some of the interesting “inventions” from the Institute, and the accompanying “technology transfer” practices are highlighted in this paper. These include:
• Portable Washing Machine for Rural Areas
• Flood-resistant Septic Tank
• Optical nano sensor
• Electrical Gradient Augmented Fluid Filtration Apparatus
• Patient Transfer Device
• Electrochemical Test Cell
Kingston University London, UK
Nada Philip is an Associate Professor in the field of Mobile Health at Kingston University London. She is the Founder of the Digital Media for Health Research Group. She has been involved in many national and international mHealth projects, namely OTELO, Delphi, LTE-Health, WELCOME, and AEGLE, in the areas of chronic disease management (e.g. diabetes, COPD, and cancer), big data analytics for health, decision support systems, telehealth and telecare in developing countries, social robotics, and medical video streaming. She is the author or co-author of more than 50 journal papers, peer-reviewed conference papers, and book chapters. She is a member of the review panels for many grant bodies, transactions, and conferences.
Recently there has been great interest in health data and in how to produce value out of it, so that it can support optimized, integrated clinical decision making about the health of individuals and reduce cost. Many initiatives at the European level and worldwide pinpoint the importance and usefulness of healthcare big data. One of these initiatives is the AEGLE EU project, which aims to generate value from the healthcare data value chain, with the vision to improve translational medicine and facilitate personalized and integrated care services, thus improving overall healthcare at all levels; to promote data-driven research across Europe; and to serve as an enabler technology platform for business growth in the field of big data analytics for healthcare. In this talk, we will look into the main technical features of the AEGLE platform through its presentation layer, which acts as the user interface to the platform's big data analytics and visualization infrastructure and algorithms. In particular, we will look into the type 2 diabetes (T2D) use case and present some initial results from scenarios that aim to use analytics and visualization tools to discover knowledge about the causal factors of diabetes complications and their correlation with medication modality as well as biological markers, while providing the potential to act as a predictive tool for such complications.
- Poster Presentation
Knut Hinkelmann is Professor for Information Systems and Head of the Master of Science in Business Information Systems at the FHNW University of Applied Sciences and Arts Northwestern Switzerland. He is a Research Associate at the University of Pretoria (South Africa) and Adjunct Professor at the University of Camerino (Italy). Prior to that, he worked at the German Research Center for Artificial Intelligence (DFKI) and as Product Manager Innovation for Insiders Information Management.
Big data has challenged existing business models and provided new business opportunities. Companies capture a large volume and variety of transactional data, which contains information about their customers, suppliers, and operations. Advanced data analysis techniques can help to gain knowledge about patterns, trends, and user behaviors from such large datasets. This knowledge empowers businesses to innovate new products, services, and business models. The scientific literature discusses use cases of value creation, data analysis techniques, and technologies supporting big data. According to recent studies, the main challenge faced by companies is the proper utilization of the knowledge extracted via data analysis to create meaningful innovations. Current innovation frameworks such as Google's Design Sprint guide organizations in creating innovative IT applications in an agile manner. Design-thinking-oriented innovation frameworks, such as the one from Beckman and Barry (2007), place a strong emphasis on observation, e.g. to understand customer behavior and to identify customers' (implicit) needs. In today's digitalized world, however, observing such behavior requires analyzing digital traces of online transactions and combining them with data from different sources. We therefore propose to develop an innovation framework for big data that helps companies exploit the knowledge generated from such data, present within or outside the organization. This framework will provide best practices, data analysis tools, and technologies that can guide companies in innovating from big data. In order to give meaning to the identified patterns, the data analysis is combined with background knowledge represented in an ontology.
- Data Mining Applications in Science, Engineering, Healthcare and Medicine | Data Mining Methods and Algorithms | Data Mining Tools and Software | Big Data Applications | Data Mining Tasks and Processes | Data Privacy and Ethics | Big Data Technologies | Social network analysis | Business Analytics | Search and data mining | Clustering
University of Huddersfield, UK
Narasimha Murty Yedidi
Electronic Arts, USA
Uni Research Computing, Norway
Dr. Alla Sapronova completed her PhD at the age of 29 at Moscow State University, Russia, and postdoctoral studies at UniFob, University of Bergen, Norway. She is the Head of Data Science at the Center for Big Data Analysis, Uni Research, a multidisciplinary research institute in Bergen, Norway. In the last five years she has published more than 15 papers in reputed journals and has been serving as an external examiner for the University of Bergen, Norway, and Nelson Mandela Metropolitan University, South Africa.
Classification, the process of assigning data into labeled groups, is one of the most common operations in data mining. Classification can be used in predictive modeling to learn the relation between a desired feature vector and labeled classes. When the data set contains an arbitrarily large number of missing values and/or the number of samples is inadequate for the complexity of the data, it is important to define a strategy that allows the highest possible classification accuracy to be reached. In this work the authors present results on the accuracy of classification-based predictive models for three different strategies: input pruning, semi-automatic selection of various classification methods, and data volume increase. The authors suggest that a satisfactory level of model accuracy can be reached when preliminary input pruning is used.
The presented model connects fishing data with environmental variables. Even with a limited number of samples, the model is able to identify the type of fish with up to 92% accuracy.
The results of using various classification methods are shown, and suggestions are made towards defining an optimal strategy for building an accurate predictive model, as opposed to the common trial-and-error method. Different strategies for input pruning that preserve information are described.
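A crude illustration of the input-pruning idea (the scoring rule, data, and class labels below are invented for illustration, not the authors' method): each input feature is scored by how well it separates the classes, and only the strongest features are kept before a classifier is trained:

```python
# Hypothetical sketch of input pruning: rank each input feature by a
# simple class-separation score and drop the weakest ones.

def separation(values_a, values_b):
    """Distance between class means, scaled by the pooled value range."""
    mean = lambda v: sum(v) / len(v)
    pooled = values_a + values_b
    spread = (max(pooled) - min(pooled)) or 1.0
    return abs(mean(values_a) - mean(values_b)) / spread

# Rows: (feature vector, class label); two informative inputs, one noisy one.
data = [
    ([1.0, 10.0, 5.1], 'cod'), ([1.2, 11.0, 4.9], 'cod'),
    ([3.0, 20.0, 5.0], 'haddock'), ([3.1, 21.0, 5.2], 'haddock'),
]

def prune(data, keep):
    """Indices of the `keep` most class-separating features."""
    n = len(data[0][0])
    a = [x for x, y in data if y == 'cod']
    b = [x for x, y in data if y == 'haddock']
    scores = [(separation([r[i] for r in a], [r[i] for r in b]), i)
              for i in range(n)]
    return sorted(i for _, i in sorted(scores, reverse=True)[:keep])

print(prune(data, keep=2))  # -> [0, 1]: the noisy third feature is dropped
```

Pruning before training attacks both problems the abstract mentions: fewer inputs mean fewer opportunities for missing values to matter, and a smaller hypothesis space is easier to fit from few samples.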
State Bank of Pakistan, Pakistan
Abdul Basit completed his MS in Social Sciences at SZABIST, Pakistan. Currently, he is a PhD research scholar in Statistics at the National College of Business Administration & Economics, Lahore, Pakistan. He is an Assistant Director in the Statistics & DWH Department of the State Bank of Pakistan. He has published four research papers in journals and has presented many articles at national and international conferences.
Entropy is a mathematical tool for gathering maximum information about distributions, systems, surveys, and databases. We introduce entropy as a data mining tool that provides maximum information about trade behavior in different regions. In this study, we also derive a new entropy measure for data mining. This study will lead us to explore new avenues of business and investment in Pakistan. China is the biggest player in global trade from the Asian region. To expand the scope of its competitiveness, China is continuously investing in different projects around the world. The China-Pakistan Economic Corridor (CPEC) is one of the major projects. The corridor is considered an extension of China's economic ambition, the One Belt, One Road initiative (OBOR). In the future, China wants to expand its trade with the world using the CPEC to enhance the scope of its competitiveness. Pakistan also believes in open trade and is continuously trying to enhance trade with the world. To attain maximum advantage from CPEC, Pakistan needs to explore the opportunities for investors and business communities. In this study, we develop linkages between the trends of our industries and commodities and their future demand in different regions.
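As a small illustration of entropy as a descriptor of trade behavior (the figures below are invented, and this is the classical Shannon measure, not the new measure derived in the study): entropy distinguishes trade that is spread evenly across regions from trade concentrated in one region:

```python
from math import log2

def shannon_entropy(counts):
    """Shannon entropy (in bits) of a distribution given raw counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * log2(p) for p in probs)

# Hypothetical export volumes of one commodity across four regions.
uniform = [25, 25, 25, 25]       # trade spread evenly: maximal entropy
concentrated = [97, 1, 1, 1]     # trade dominated by one region

print(shannon_entropy(uniform))        # -> 2.0 (log2 of 4 regions)
print(round(shannon_entropy(concentrated), 3))
```

High entropy flags diversified trade patterns; low entropy flags concentration, which is exactly the kind of regional structure the study proposes to mine from trade data.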
Institute of Technology Tallaght, Ireland
Eleni Rozaki obtained an honours degree in economics and earned her M.Sc. degree in Quantitative Methods and Informatics from the University of Bari, Italy. Her PhD, from Cardiff University, United Kingdom, is in the area of data mining and business analytics in telecommunication networks. She has experience as a data analyst in IT and in the telecommunication industry in Ireland. She is currently working as an associate lecturer at the Institute of Technology Tallaght, the National College of Ireland, and the Dublin Institute of Technology. Her current research interests include business analytics, predictive modelling, decision support systems, and data mining techniques.
Efficient and effective network performance monitoring is a vital part of telecommunications networks. Finding and correcting faults, however, can be very costly. In order to address network issues associated with financial losses, a business analytics model for budget planning is presented. The model is based on previous work on network fault detection, along with cost considerations and customer segmentation. This work focuses on data mining techniques to show the cost probability of the distribution of network alarms based on budget planning classification rules, using predictive analytics to determine the minimum bandwidth costs that are possible with network optimisation. The results of the tests performed show that reductions in optimisation costs are possible; some test cases were clustered, and the results were used to create a performance-based budget model. The results also identify clients' demographics and customer churn and, simultaneously, the financial cost of network optimisation, in order to review an efficient budget process and improve expenditure prioritisation.
Derrick K Rollins is a Professor of Chemical and Biological Engineering and Statistics. He earned his degrees in Statistics and Chemical Engineering from Ohio State University. He worked at the DuPont chemical company for seven years and four months as a faculty intern. He has received various awards, including the 2012 Tau Beta Pi McDonald Mentoring Award, the 1996 American Association for the Advancement of Science (AAAS) Mentor Award, and a National Science Foundation Presidential Faculty Fellows Award. His research areas include blood glucose monitoring, modeling and control, and (medical-, bio-, and material-) informatics.
Advanced statistical methodologies have key roles to play in data mining and informatics for large data sets. Over the years, our research has developed a number of statistical techniques exploiting multivariate analysis in many applications: bioinformatics, specifically microarray data sets; medical informatics, including disease diagnosis and discovery; and materials informatics, including the development and evaluation of material properties and testing techniques. In this talk, we present the tools and methodologies that we have developed over the years and discuss their attributes and strengths. The two primary multivariate statistical methodologies that we have exploited are principal component analysis (PCA) and cluster analysis (CA). This talk will break these techniques down for the non-expert and then demonstrate their strengths in handling large data sets to extract critical information that can be exploited in analysis, inference, diagnosis, and discovery.
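As a minimal sketch of the cluster analysis (CA) step for non-experts (a toy implementation on invented 2-D data, not the speaker's software), Lloyd's k-means alternates between assigning each observation to its nearest center and moving each center to the mean of its cluster:

```python
# Bare-bones k-means (a standard cluster-analysis workhorse).
def kmeans(points, centers, iters=10):
    for _ in range(iters):
        # assignment step: attach each point to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        # update step: move each center to the mean of its cluster
        centers = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
    return centers, clusters

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, clusters = kmeans(points, centers=[(0, 0), (5, 5)])
print(sorted(len(c) for c in clusters))  # -> [3, 3]: two tight groups
```

In practice PCA is often run first to project a large data set onto a few principal components, and clustering such as the above is then applied in that reduced space.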
LIP6 (CNRS -UPMC Sorbonne Université), France
Witold Dzwinel holds a full professor position at the AGH University of Science and Technology, Department of Computer Science, in Krakow. His research activities focus on computer modeling and simulation using discrete particles. Simultaneously, he does research in interactive visualization of big data and machine learning algorithms. Professor Dzwinel is the author or co-author of about 190 papers in computational science, computational intelligence, and physics.
EC3 is intended to extract relevant information from large heterogeneous and multilingual text data, in particular on the Web 3.0. The project is based on an original method: contextual exploration. EC3 needs no syntactic analysis, statistical analysis, or "general" ontology. It uses only small ontologies, called "linguistic ontologies", that express the linguistic knowledge of a user who must concentrate on the information relevant from one point of view. This is why EC3 works very quickly on large corpora, whose components can range from "short texts" such as SMS messages to whole books. At the output, EC3 offers a visual representation of information using an original approach: the "Memory Islands". EC3 is developed within the ACASA/LIP6 team and is tested on the large digitized corpus provided by the Labex OBVIL «Observatoire de la Vie Littéraire», in partnership with the Bibliothèque Nationale de France (http://obvil.paris-sorbonne.fr/). OBVIL intends to exploit all the resources offered by digitization and computer applications to examine French literature from the sixteenth to the twentieth century, as well as English, American, Italian, and Spanish literature, in its most traditional formats and media as well as the most innovative.
University of Bologna, Italy
Giovanni Rossi completed his PhD in Economics and joined the Computer Science Department at the University of Bologna. His research interests include discrete applied mathematics, combinatorial geometries, lattice and pseudo-Boolean functions, network community structure, and graph clustering.
Objective-function-based clustering is here viewed as a maximum-weight set partitioning combinatorial optimization problem, with the instance given by a pseudo-Boolean (set) function assigning real-valued cluster scores (or costs, in the case of minimization) to data subsets, while on every partition of the data the global objective function takes the value given by the sum over clusters (or blocks) of their individual scores. The instance may thus maximally consist of 2^n reals, where n is the number of data points, although in most cases the scores of singletons and pairs fully determine the scores of larger clusters, in which case the pseudo-Boolean function is quadratic. This work proposes to quantify the cluster score of fuzzy data subsets by means of the polynomial MLE (multilinear extension) of pseudo-Boolean functions, thereby translating the original discrete optimization problem into a continuous framework. After analyzing the modularity maximization problem in these terms, two further examples of quadratic cluster score functions for graph clustering are proposed, and greedy (local and global) search strategies are analyzed.
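A small numeric sketch of the setting (the pair scores below are invented for illustration): with a quadratic cluster score given by pairwise similarities, the partition objective sums the scores of the blocks, and the multilinear extension evaluates the same score on fuzzy memberships by weighting each pair with the product of its members' membership degrees:

```python
# Hypothetical quadratic instance: scores for unordered pairs of 4 points.
S = {(0, 1): 2.0, (0, 2): -1.0, (1, 2): -1.0, (2, 3): 3.0}

def pair(i, j):
    return S.get((min(i, j), max(i, j)), 0.0)

def cluster_score(block):
    """Quadratic set function: sum of pair scores inside one cluster."""
    b = sorted(block)
    return sum(pair(i, j) for k, i in enumerate(b) for j in b[k + 1:])

def partition_score(blocks):
    """Global objective: sum of the blocks' individual scores."""
    return sum(cluster_score(b) for b in blocks)

def mle_score(membership):
    """Multilinear extension of the quadratic score: pair (i, j)
    contributes its score weighted by q_i * q_j."""
    q = membership
    return sum(s * q[i] * q[j] for (i, j), s in S.items())

print(partition_score([{0, 1}, {2, 3}]))  # -> 5.0 (2.0 + 3.0)
print(mle_score({0: 1.0, 1: 1.0, 2: 0.5, 3: 0.5}))  # -> 1.75
```

Crisp memberships (all q_i in {0, 1}) recover the original set function, so continuous search strategies can move through fuzzy intermediate points while remaining comparable with the discrete objective.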
- Video Presentation
Joseph Bonello is an Assistant Lecturer at the University of Malta. He has worked on various EU-related projects and on various IT projects for the Department of Customs. He managed the development of the IRRIIS project for a local member of the consortium. He has experience in developing commercial applications, including work for local banking firms and government. He completed his BSc (Hons) in IT at the University of Malta in 2002 and a Master's degree there on Automated Detection and Optimization of NP-Hard Scheduling Problems in Manufacturing.
Many IT organizations today spend more than 40% of a project's budget on software testing (Capgemini, 2015; Tata, 2015). One way of making the testing process easier is to automatically generate synthetic data similar to the data entered in the real environment. The benefits offered by an automated process are invaluable, since real datasets can be hard to obtain due to data sensitivity and confidentiality issues. Automated Dataset Generator (ADaGe) enables the end user to define and tweak a dataset definition through an interface in a way that satisfies the system's requirements. Our tool enables the automatic generation of small to large amounts of synthesized data that can be written to a file or streamed to a real-time processing system.
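The idea can be sketched as follows (a hypothetical mock-up: the schema format, field names, and functions below are invented and are not ADaGe's actual interface): a dataset definition maps column names to value generators, and the generator streams matching synthetic rows to CSV:

```python
import csv
import io
import random
import string

random.seed(42)

# Invented dataset definition: column name -> value generator.
SCHEMA = [
    ("customer_id", lambda: random.randint(10000, 99999)),
    ("name", lambda: "".join(random.choices(string.ascii_uppercase, k=6))),
    ("balance", lambda: round(random.uniform(0, 10000), 2)),
]

def generate(schema, n):
    """Lazily yield n synthetic rows following the dataset definition,
    so output can be streamed rather than held in memory."""
    for _ in range(n):
        yield [gen() for _, gen in schema]

def to_csv(schema, n):
    """Render n synthetic rows, with a header, as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([name for name, _ in schema])
    writer.writerows(generate(schema, n))
    return buf.getvalue()

print(to_csv(SCHEMA, 3))
```

Because `generate` is a lazy generator, the same definition can feed a file writer for small test fixtures or a streaming sink for large volumes, mirroring the file-or-stream output the abstract describes.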