Data Analysis in Big Data Environments Code:  M2.858    :  6
This is the course plan for the first semester of the academic year 2024/2025. To check whether the course is being run this semester, go to More UOC / The University / Programmes of study on the Virtual Campus. Once teaching starts, you'll be able to find it in the classroom. The course plan may be subject to change.

This course is an introduction to Big Data systems and technologies. The first block addresses the technological structure behind Big Data projects, including relevant aspects such as distributed computation and storage and the management of the cluster's hardware resources. The next block addresses the two main models of distributed processing: batch processing and stream processing, for both simple and complex events. We will see the main functions and characteristics of the most widely used frameworks today, paying special attention to the two major standards: Apache Hadoop and Apache Spark. Finally, in the last block of the course, we will review the main data analysis libraries, covering machine learning, graph analysis, and massive data visualization, with special attention to how these methods are applied to Big Data problems.

This subject is an optional course within the University master's degree in Data Science and the University master's degree in Computational Engineering and Mathematics (joint URV, UOC).

The course provides knowledge useful in different professional fields related to the development of software, data science, and machine learning on systems that require the use of big data technology. The course will also be useful for the management or consulting of projects based on Big Data systems, among others.

This course requires students to have basic to intermediate programming skills in Python, since 90% of the course is based on that language. The remaining 10% is done in Java. Basic to intermediate knowledge of data analysis, machine learning, and computer networking is also assumed.


The course also includes case studies and independent information research. Students should therefore be familiar with searching information sources on the internet and analysing quantitative and qualitative information, be able to synthesize and draw conclusions, and have adequate written communication skills.

Information prior to enrolment: none.

The objectives that students will achieve through this course are the following:

  • Understand the formal definitions of Big Data and its related concepts.
  • Identify the necessary technological elements in any project related to Big Data.
  • Be able to decide about the most appropriate methodologies for the implementation of Big Data systems.
  • Learn about the main tools available in the Big Data ecosystem, especially the Apache Hadoop and Apache Spark ecosystems.
  • Build machine learning models to generate knowledge as a result of Big Data analysis.
  • Know the basic operation of the main Big Data tools and frameworks, such as HDFS or Apache Spark.

The course consists of 7 thematic modules, each of which is supported by educational material and a series of exercises. The content associated with each thematic module is detailed below:

1) Introduction to Big Data.

The first module introduces the concept of Big Data and discusses the conceptual change that it implies.

2) Typologies and architectures of a big data system: collection, pre-processing, and storage

This module covers understanding different Big Data system typologies and architectures, guiding students to select the appropriate architecture based on problem characteristics, including data specifics. It introduces MapReduce and Apache Spark while highlighting their strengths and weaknesses, and also delves into resource management tasks with a focus on Apache Mesos and YARN. Additionally, it explores the fundamental aspects of data capture, pre-processing, and storage in Big Data environments, emphasizing the unique challenges in each phase and discussing key tools and technologies, such as HDFS and NoSQL databases, for information storage and management.
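As an informal illustration of the MapReduce model mentioned above, the following minimal sketch simulates the map, shuffle, and reduce phases of a word count in plain Python on a single machine. It is not taken from the course materials and does not use Hadoop or Spark; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # In a real cluster each phase would run in parallel on many nodes;
    # here the phases simply run one after another in a single process.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input, used only to exercise the sketch.
counts = word_count(["big data systems", "big data analysis"])
print(counts)
```

In a distributed framework, the map and reduce functions are the parts the programmer writes, while the framework takes care of the shuffle and of distributing the work across the cluster.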

3) Resource managers for large-scale data processing

In this module, we'll focus on managing resources in shared Big Data systems, particularly using YARN as a resource manager. We'll examine how tasks are distributed across the cluster and explore the statistics that can help enhance performance. Resource allocation, including RAM, CPU, and network capacity, is crucial in complex Big Data systems with multiple users and tasks, and various resource managers, from basic concepts to widely used ones like Apache YARN, will be explored based on the complexity of tasks being coordinated.

4) Big Data Analysis: Fundamental Techniques

This module aims for students to know and understand the principal techniques and tools of data mining and machine learning for big data, what differentiates them from traditional data mining techniques and tools, and when and how to use them. To reach these objectives, we will start by reviewing the basic concepts behind algorithm design, complexity, and parallelization. We will then frame these concepts within the Apache Hadoop and Apache Spark ecosystems. We will conclude the block by reviewing the tools that Hadoop and Spark offer for machine learning.

5) Big Data Analysis: Advanced Techniques

This module introduces advanced techniques related to data mining and machine learning. Specifically, it covers techniques for graph analysis (graph mining), text analysis (text mining), and streaming data processing.

6) Incremental learning

In this module, we will review the opportunities that the field of machine learning offers when data arrives as a stream. We will review supervised and unsupervised models, going into detail with two concrete examples: K-means clustering (unsupervised) and linear regression (supervised). Although students already know these widely used models, we will show that the way of working changes substantially when data arrives as a stream. Finally, we will review several use cases that students can work on to reinforce the concepts seen in this module.
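To give a rough idea of how the two models change when data arrives one observation at a time, the sketch below shows one common incremental formulation of each: a sequential K-means update, where the nearest centroid moves towards each new point, and a linear regression trained by stochastic gradient descent. This is an assumption about how such models can be written, not the formulation used in the course materials; the class names, the `learn_one` and `predict_one` methods, the learning rate, and the sample streams are all hypothetical.

```python
class OnlineKMeans:
    """Sequential K-means: each centroid is the running mean of its points."""

    def __init__(self, initial_centroids):
        self.centroids = [list(c) for c in initial_centroids]
        self.counts = [0] * len(self.centroids)

    def learn_one(self, x):
        # Find the nearest centroid using squared Euclidean distance.
        distances = [
            sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in self.centroids
        ]
        j = distances.index(min(distances))
        # Move that centroid towards x; the step shrinks as it sees more points.
        self.counts[j] += 1
        step = 1.0 / self.counts[j]
        self.centroids[j] = [
            ci + step * (xi - ci) for xi, ci in zip(x, self.centroids[j])
        ]
        return j


class OnlineLinearRegression:
    """Linear regression updated by one gradient step per observation."""

    def __init__(self, n_features, learning_rate=0.05):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_one(self, x):
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def learn_one(self, x, y):
        # Gradient of the squared error for this single observation.
        error = self.predict_one(x) - y
        self.weights = [
            w - self.learning_rate * error * xi for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error


# Hypothetical stream of 2-D points for the clustering model.
kmeans = OnlineKMeans([[0.0, 0.0], [10.0, 10.0]])
for point in [(1.0, 1.0), (9.0, 9.0), (3.0, 1.0), (11.0, 9.0)]:
    kmeans.learn_one(point)

# Hypothetical noiseless stream following y = 2x + 1 for the regression model.
regression = OnlineLinearRegression(n_features=1)
for _ in range(2000):
    for x in (0.0, 0.25, 0.5, 0.75, 1.0):
        regression.learn_one([x], 2.0 * x + 1.0)
```

Neither model stores past observations: each one keeps only its current parameters and updates them as every new observation arrives, which is the main difference from the batch versions that require the full dataset.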

7) Analysis of Innovative Trends in Big Data

The aim of this module is for students to explore the latest trends in the field of Big Data, staying up-to-date in an ever-evolving area, and developing a deeper understanding of the possibilities of large-scale data analysis. Students work in teams to research and analyze an emerging trend in the Big Data field, identifying its relevance, current status, and applications, and presenting their findings collaboratively. This activity promotes understanding of innovations in the processing of massive data and their application in various contexts.

Introduction to big data (in Spanish) PDF
Typologies and architectures of a big data system (in Spanish) PDF
Big data capture, preprocessing and storage (in Spanish) PDF
Massive data analysis (in Spanish) PDF
Massive data analysis. Advanced techniques (in Spanish) PDF
Video presentation PLA 1.1. Introduction to massive data (Big Data) (in Spanish) Audiovisual
Video content PLA 1.2. Introduction to massive data (Big Data) (in Spanish) Audiovisual
Video presentation PLA 2.1. Typologies and architectures of a Big Data system (in Spanish) Audiovisual
Video content PLA 2.2. Typologies and architectures of a Big Data system (in Spanish) Audiovisual
Video presentation PLA 3.1. Capture, pre-process and store massive data (in Spanish) Audiovisual
Video content PLA 3.2. Capture, pre-process and store massive data (in Spanish) Audiovisual
Video presentation PLA 4.1. Massive data analysis (in Spanish) Audiovisual
Video content PLA 4.2. Massive data analysis (in Spanish) Audiovisual
Video presentation PLA 5.1. Massive data analysis. Advanced techniques (in Spanish) Audiovisual
Video content PLA 5.2. Massive data analysis. Advanced techniques (in Spanish) Audiovisual
Data science resource space (in Spanish) Web
Massive data analysis PDF
Big data capture preprocessing and storage PDF
Introduction to big data PDF
Massive data analysis. Advanced techniques PDF
Typologies and architectures of a big data system PDF
Video presentation PLA 5.1. Massive data analysis. Advanced techniques Audiovisual
Video content PLA 3.2. Capture, pre-process and store massive data Audiovisual
Video content PLA 5.2. Massive data analysis. Advanced techniques Audiovisual
Video presentation PLA 2.1. Typologies and architectures of a Big Data system Audiovisual
Video content PLA 2.2. Typologies and architectures of a Big Data system Audiovisual
Video presentation PLA 4.1. Massive data analysis Audiovisual
Video content PLA 4.2. Massive data analysis Audiovisual
Video presentation PLA 3.1. Capture, pre-process and store massive data Audiovisual
Video content PLA 1.2. Introduction to massive data (Big Data) Audiovisual
Video presentation PLA 1.1. Introduction to massive data (Big Data) Audiovisual
Using DataFrames with Apache Spark (in Spanish) Audiovisual
Using RDDs with Apache Spark (in Spanish) Audiovisual
Apache Flume. Documentation (in Spanish) Audiovisual
Apache Flume. Configuration (in Spanish) Audiovisual
Apache Flume. Implementing sources (in Spanish) Audiovisual
Apache Flume. Agent (in Spanish) Audiovisual

Assessment at the UOC is, in general, online, structured around the continuous assessment activities, the final assessment tests and exams, and the programme's final project.

Assessment activities and tests may take the form of written texts and/or video recordings, may use random questions, and may include synchronous or asynchronous oral tests, among other formats, as decided by each teaching team. The final project marks the end of the learning process and consists of an original and tutored piece of work to demonstrate that students have acquired the competencies worked on during the programme.

To verify students' identity and authorship in the assessment tests, the UOC reserves the right to use identity recognition and plagiarism detection systems. For these purposes, the UOC may make video recordings or use supervision methods or techniques while students carry out any of their academic activities.

The UOC may also require students to use electronic devices (microphones, webcams or other tools) or specific software during assessments. It is the student's responsibility to ensure that these devices work properly.

The assessment process is based on students' individual efforts, and the assumption that the student is the author of the work submitted for academic activities and that this work is original. The UOC's website on academic integrity and plagiarism has more information on this.

Submitting work that is not one's own or not original for assessment tests; copying or plagiarism; impersonation; accepting or obtaining any assignments, whether for compensation or otherwise; collaboration, cover-up or encouragement to copy; and using materials, software or devices not authorized in the course plan or instructions for the activity, including artificial intelligence and machine translation, among others, are examples of misconduct in assessments that may have serious academic and disciplinary consequences.

If students are found to be engaging in any such misconduct, they may receive a Fail (D/0) for the graded activities in the course plan (including final tests) or for the final grade for the course. This could be because they have used unauthorized materials, software or devices (such as artificial intelligence when it is not permitted, social media or internet search engines) during the tests; copied fragments of text from an external source (the internet, notes, books, articles, other students' work or tests, etc.) without the corresponding citation; purchased or sold assignments, or undertaken any other form of misconduct.

Likewise and in accordance with the UOC's academic regulations, misconduct during assessment may also be grounds for disciplinary proceedings and, where appropriate, the corresponding disciplinary measures, as established in the regulations governing the UOC community (Normativa de convivència).

In its assessment process, the UOC reserves the right to:

  • Ask students to provide proof of their identity as established in the UOC's academic regulations.
  • Ask students to prove the authorship of their work throughout the assessment process, in both continuous and final assessments, through a synchronous oral interview, of which a video recording or any other type of recording established by the UOC may be made. These methods seek to ensure verification of the student's identity, and their knowledge and competencies. If it is not possible to ensure the student's authorship, they may receive a D grade in the case of continuous assessment or a Fail grade in the case of the final assessment.

Artificial intelligence in assessments

The UOC understands the value and potential of artificial intelligence (AI) in education, but it also understands the risks involved if it is not used ethically, critically and responsibly. So, in each assessment activity, students will be told which AI tools and resources can be used and under what conditions. In turn, students must agree to follow the guidelines set by the UOC when it comes to completing the assessment activities and citing the tools used. Specifically, they must identify any texts or images generated by AI systems and they must not present them as their own work.

In terms of using AI, or not, to complete an activity, the instructions for assessment activities indicate the restrictions on the use of these tools. Bear in mind that using them inappropriately, such as using them in activities where they are not allowed or not citing them in activities where they are, may be considered misconduct. If in doubt, we recommend getting in touch with the course instructor and asking them before you submit your work.

You can only pass the course if you participate in and pass the continuous assessment. Your final mark for the course will be the mark you received in the continuous assessment.

 
