Data Analysis in Big Data Environments	Code: M2.858 : 6


This is the course plan for the second semester of the academic year 2023/2024. To check whether the course is being run this semester, go to the More UOC / The University / Programmes of study section of the Virtual Campus. Once teaching starts, you'll be able to find it in the classroom. The course plan may be subject to change.

Description

This course is an introduction to Big Data systems and technologies. The first block addresses the technological structure behind Big Data projects, including aspects such as distributed computation and storage systems and the management of a cluster's hardware resources. The next block addresses the two main models of distributed processing, batch and stream processing, for simple and complex events. We will see the main functions and characteristics of the most widely used frameworks today, paying special attention to the two major standards: Apache Hadoop and Apache Spark. Finally, in the last block of the subject, we will review the main data analysis libraries, including machine learning, graph analysis and massive data visualization, paying special attention to how these methods are applied to Big Data problems.

The subject within the syllabus as a whole

This subject corresponds to an optional course within the University master's degree in Data Science and in Computational Engineering and Mathematics (joint URV, UOC).

Professional fields to which it applies

The course provides knowledge useful in different professional fields related to the development of software, data science, and machine learning on systems that require the use of big data technology. The course will also be useful for the management or consulting of projects based on Big Data systems, among others.

Prior knowledge

This course requires students to have basic to intermediate programming skills in Python, since 90% of the course is based on that language. The remaining 10% is done in Java. Basic to intermediate knowledge of data analysis, machine learning, and computer networking is also assumed.

The course also includes case studies and independent information research. It is therefore advisable for students to be familiar with searching information sources on the internet and with analysing quantitative and qualitative information, to be able to synthesize and draw conclusions, and to have adequate written communication skills.

Information prior to enrolment

None.

Learning objectives and results

The objectives that students will achieve through this course are the following:

Understand the concepts and formal definitions associated with Big Data and related concepts.
Identify the necessary technological elements in any project related to Big Data.
Be able to decide on the most appropriate methodologies for the implementation of Big Data systems.
Learn about the main tools available in the Big Data ecosystem, especially the Apache Hadoop and Apache Spark ecosystems.
Build machine learning models to generate knowledge as a result of Big Data analysis.
Know the basic operation of the main Big Data tools and frameworks, such as HDFS or Apache Spark.

Content

The course consists of 6 thematic blocks, each one supported by its own didactic material:

1) Introduction to Big Data: This first module introduces the concept of Big Data and discusses the conceptual change that it implies.

2) Typologies and architectures of a Big Data system: Understand the different typologies and architectures of Big Data systems, and be able to identify which architecture should be used according to the characteristics of each problem, including the specifics of the data. We will introduce the two main distributed computing systems, MapReduce and Apache Spark, emphasizing their strengths and weaknesses. Next, we will define the tasks that resource managers perform, focusing on Apache Mesos and YARN.

3) Capture, pre-processing and storage of big data: Know the basic characteristics of the data capture, pre-processing and storage processes in Big Data environments. That is, be able to understand the particularities that Big Data introduces in each of these phases of data analysis, and know the main Big Data tools and technologies that support them. We will discuss the storage and management of information, focusing on the HDFS distributed file system and NoSQL databases.

4) Big Data Analysis: Fundamental Techniques: Know and understand the principal techniques and tools of data mining and machine learning for big data. Know what differentiates them from traditional data mining techniques and tools, and when and how to use them. To reach the objectives of this block, we will start by reviewing the basic concepts behind algorithm design, complexity, and parallelization. We will then frame these concepts within the Apache Hadoop and Apache Spark ecosystems. We will conclude this block by reviewing the tools that Hadoop and Spark offer for machine learning development.

5) Big Data Analysis: Advanced Techniques: This module introduces advanced techniques related to data mining and machine learning. Specifically, it covers techniques for graph analysis (graph mining), text analysis (text mining) and streaming data processing.

6) Incremental learning: In this module we will review the opportunities that machine learning offers when data arrives as a stream. We will review supervised and unsupervised models, going into detail with two concrete examples: the K-means clustering model (unsupervised) and linear regression (supervised). Although students already know these widely used models, we will show that the way of working changes substantially when the data arrives as a stream. Finally, we will review several use cases that students can work on to reinforce the concepts covered in this final module.
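
As an illustration of the change described in block 6, the following is a minimal sketch in Python of how the two models can be updated one observation at a time. It is not taken from the course materials: the class names `OnlineKMeans` and `OnlineLinearRegression`, the learning rate, the treatment of the initial centroids, and the sample data are all assumptions made for the example. K-means is updated with a running-mean rule for the nearest centroid, and linear regression with stochastic gradient descent on the squared error. Neither model stores past observations.

```python
class OnlineKMeans:
    """Sequential K-means: centroids are updated one observation at a time."""

    def __init__(self, initial_centroids):
        # Assumption: each initial centroid counts as the first point of its cluster.
        self.centroids = [list(c) for c in initial_centroids]
        self.counts = [1] * len(self.centroids)

    def update(self, x):
        # Find the nearest centroid by squared Euclidean distance.
        distances = [
            sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in self.centroids
        ]
        j = distances.index(min(distances))
        # Move that centroid towards x so that it remains the mean of its points.
        self.counts[j] += 1
        self.centroids[j] = [
            ci + (xi - ci) / self.counts[j] for xi, ci in zip(x, self.centroids[j])
        ]
        return j


class OnlineLinearRegression:
    """Linear regression trained by stochastic gradient descent, one sample at a time."""

    def __init__(self, n_features, learning_rate=0.05):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate  # hypothetical default, chosen for the demo

    def predict(self, x):
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def update(self, x, y):
        # Gradient step on the squared error of this single observation.
        error = y - self.predict(x)
        self.weights = [
            w + self.learning_rate * error * xi for w, xi in zip(self.weights, x)
        ]
        self.bias += self.learning_rate * error
        return error


if __name__ == "__main__":
    # Simulated stream of 2-D points for the clustering model.
    kmeans = OnlineKMeans([[0.0, 0.0], [10.0, 10.0]])
    for point in [[2.0, 2.0], [8.0, 8.0]]:
        kmeans.update(point)
    print(kmeans.centroids)

    # Simulated stream of (x, y) pairs drawn from y = 2x + 1.
    regression = OnlineLinearRegression(n_features=1)
    for _ in range(2000):
        for x in [0.0, 1.0, 2.0, 3.0]:
            regression.update([x], 2.0 * x + 1.0)
    print(regression.weights, regression.bias)
```

In a real streaming setting the loops over lists would be replaced by a data source that delivers observations as they arrive; the update rules themselves would stay the same.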
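
The MapReduce model introduced in block 2 can also be illustrated with a short sketch. The code below is a single-machine simulation in plain Python of the three conceptual phases (map, shuffle, reduce) applied to a word count. It does not use Hadoop or Spark, and the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count` are hypothetical names chosen for this example. In a real cluster the map and reduce calls would run in parallel on different nodes, and the shuffle would move data across the network.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document is mapped independently, which is what allows parallelism.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data needs big clusters", "spark and hadoop process big data"]
    print(word_count(docs))
```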

View the UOC learning resources used in the subject


Introducción al big data	PDF
Tipologías y arquitecturas de un sistema big data	PDF
Captura, preprocesamiento y almacenamiento de datos masivos	PDF
Análisis de datos masivos	PDF
Análisis de datos masivos. Técnicas avanzadas	PDF
Vídeo presentación PLA 1.1. Introducción a los datos masivos (Big Data)	Audiovisual
Vídeo contenidos PLA 1.2. Introducción a los datos masivos (Big Data)	Audiovisual
Vídeo presentación PLA 2.1. Tipologías y arquitecturas de un sistema Big Data	Audiovisual
Vídeo contenidos PLA 2.2. Tipologías y arquitecturas de un sistema Big Data	Audiovisual
Vídeo presentación PLA 3.1. Captura, pre-procesado y almacenamiento de datos masivos	Audiovisual
Vídeo contenidos PLA 3.2. Captura, pre-procesado y almacenamiento de datos masivos	Audiovisual
Vídeo presentación PLA 4.1. Análisis de datos masivos	Audiovisual
Vídeo contenidos PLA 4.2. Análisis de datos masivos	Audiovisual
Vídeo presentación PLA 5.1. Análisis de datos masivos. Técnicas avanzadas	Audiovisual
Vídeo contenidos PLA 5.2. Análisis de datos masivos. Técnicas avanzadas	Audiovisual
Espacio de recursos de ciencia de datos	Web
Massive data analysis	PDF
Big data capture preprocessing and storage	PDF
Introduction to big data	PDF
Massive data analysis. Advanced techniques	PDF
Typologies and architectures of a big data system	PDF
Video presentation PLA 5.1. Massive data analysis. Advanced techniques	Audiovisual
Video content PLA 3.2. Capture, pre-process and store massive data	Audiovisual
Video content PLA 5.2. Massive data analysis. Advanced techniques	Audiovisual
Video presentation PLA 2.1. Typologies and architectures of a Big Data system	Audiovisual
Video content PLA 2.2. Typologies and architectures of a Big Data system	Audiovisual
Video presentation PLA 4.1. Massive data analysis	Audiovisual
Video content PLA 4.2. Massive data analysis	Audiovisual
Video presentation PLA 3.1. Capture, pre-process and store massive data	Audiovisual
Video content PLA 1.2. Introduction to massive data (Big Data)	Audiovisual
Video presentation PLA 1.1. Introduction to massive data (Big Data)	Audiovisual
Uso de dataframes con Apache Spark	Audiovisual
Uso de RDDs con Apache Spark	Audiovisual
Apache Flume. Documentación	Audiovisual
Apache Flume. Configuración	Audiovisual
Apache Flume. Implementación sources	Audiovisual
Apache Flume. Agente	Audiovisual

Guidelines on assessment at the UOC

The assessment process is based on the student's personal work and presupposes authenticity of authorship and originality of the exercises completed.

Lack of authenticity of authorship or originality of assessment tests, copying or plagiarism, the fraudulent attempt to obtain a better academic result, collusion to copy or concealing or abetting copying, use of unauthorized material or devices during assessment, inter alia, are offences that may lead to serious academic or other sanctions.

Firstly, you will fail the course (D/0) if you commit any of these offences when completing activities defined as assessable in the course plan, including the final tests. Offences considered to be misconduct include, among others, the use of unauthorized material or devices during the tests, such as social media or internet search engines, or the copying of text from external sources (internet, class notes, books, articles, other students' essays or tests, etc.) without including the corresponding reference.

And secondly, the UOC's academic regulations state that any misconduct during assessment, in addition to leading to the student failing the course, may also lead to disciplinary procedures and sanctions.

The UOC reserves the right to request that students identify themselves and/or provide evidence of the authorship of their work, throughout the assessment process, and by the means the UOC specifies (synchronous or asynchronous). For this purpose, the UOC may require students to use a microphone, webcam or other devices during the assessment process, and to make sure that they are working correctly.

The checking of students' knowledge to verify authorship of their work will under no circumstances constitute a second assessment.

View the assessment model

You can only pass the course if you participate in and pass the continuous assessment. Your final mark for the course will be the mark you received in the continuous assessment.