Subject: DATA MINING (Academic Year 2020/2021)
Unit Data mining
Related or Additional Studies (lesson)
- Become familiar with data mining principles and tools for analysing high-dimensional, complex data and representing them graphically;
- Acquire knowledge of the research methodology and instruments of the experimental sciences, and of their use in a production context;
- Recognize the different kinds and structures of data. Assess coding, transformation and preprocessing issues and their relevance to information retrieval;
- Develop competence in the main multivariate methods in the following contexts: a) exploratory data analysis and outlier detection; b) cluster analysis; and c) linear modelling.
- Develop competence in building appropriate data models through practical exercises using commercial and/or open-source software.
- Acquire knowledge of, and competence with, tools for model validation and for the critical interpretation of data analysis results.
A few basic statistical concepts (mean, standard deviation, normal distribution, confidence intervals) and a basic knowledge of programming and algorithm structures.
- Course introduction. Motivation, context and fields of application of data mining, with a focus on the scientific field.
- The Data Driven Discovery paradigm in experimental sciences.
- Nature and peculiarities of scientific data sets in relation to the different methods of data acquisition (physical and chemical measurements, instrumental analysis, digital and hyperspectral imaging, hyphenated techniques, spatially and temporally varying data, etc.) and to the description of molecular structure.
- Understand and organize data.
- Statistical data exploration, univariate and multivariate. Graphical representation.
- Data Quality. Preprocessing. Outliers.
- Decomposition/projection techniques (PCA, with mention of others).
- Similarity/dissimilarity measures. Clustering: taxonomy and basics of the different methods and approaches.
- Linear modelling: introduction to class-modelling, classification/discrimination and calibration/regression.
- Model validation methodology.
- Most widely used algorithms.
Lectures with slide projection, supplemented by handwriting on the board/screen. For each main subject, the analysis of a specific data set will be illustrated as an example, and the main framework of a software package that implements these methods will be explained; the software is available to students via a campus licence (Matlab environment). Some exercises will be assigned for students to carry out independently on literature data sets. Students are also allowed to use open-source software in other environments or to develop their own routines. Due to the COVID-19 emergency, lessons will be held remotely in synchronous mode. Meetings will be organized in small groups, when requested by the students, to discuss the topics presented in class, to solve issues related to the exercises, and to discuss corrections to the students' reports.
Learning is assessed through a final oral exam. Depending on the evolution of the COVID-19 emergency, it may be held in person or remotely (Google Meet or similar platforms). During the course, exercises are assigned to each student to aid preparation for the final exam; an electronic report must be submitted for each exercise. The reports are evaluated according to the following criteria: organization, language and ability to synthesize; selection of appropriate methods of analysis; correct application; ability to describe and interpret the results. Final assessment: small groups (2 students) are assigned a project to analyse a data set with one or more of the methods presented in the course. In the final exam, students are required to give a presentation (slides) and to discuss orally the results obtained, the methods and the software used. During the discussion, the teacher asks each student questions, taking a cue from the presentation, in order to ascertain their preparation on the different topics covered in the course. The final score is based on: the correctness of the choice of methods and the knowledge of how they are implemented in the software used (30%); the ability to apply the acquired knowledge to the given data set (30%); communication skills (10%); the level of knowledge acquired (30%). The final mark is expressed out of thirty, with the possible award of honours.
Knowledge and understanding
- Familiarity with the data-driven discovery paradigm.
- Recognizing the different nature of data and coding issues, and how these are linked to the different experimental techniques and research methods.
- Knowledge of multivariate data analysis tools for information retrieval and processing of scientific data.
- Knowledge of basic algorithms and tools for data mining and data modelling.
- Knowledge of the main software tools for modelling scientific data.
Applying knowledge and understanding
- Application of acquired knowledge/methods to model and develop applications for problem solving in research and in practice.
- Understanding which methods are suitable for developing software applications, depending on the different scientific research methodologies.
Autonomy in judgement
- Ability to critically discuss and present the results obtained.
- Ability to suggest the most efficient data analysis methods and to develop fit-for-purpose software.
- Writing reports and documenting software applications.
- Understanding the needs and issues posed by the "customer".
- Ability to communicate data analysis results.
- Identifying the most suitable references and web resources to improve knowledge of information retrieval and processing of scientific data.
- Recognizing the most appropriate languages for developing software applications.
- Ability to deepen knowledge of aspects closely related or synergistic to the subjects of the course.
Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, Pearson International, 2006.
Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer Series in Statistics, Springer, 2008.
K. Varmuza, P. Filzmoser, Introduction to Multivariate Statistical Analysis in Chemometrics, CRC Press, 2009.
PLS-Toolbox Manual, http://www.eigenvector.com
R. Wehrens, Chemometrics with R, Springer, 2011.