M-Learning: enfoque heurístico para recompensas diferidas en el aprendizaje por refuerzo

dc.contributor.advisor: Perdomo Charry, César Andrey
dc.contributor.author: Mora Cortés, Marlon Sneider
dc.contributor.author: Perdomo Charry, César Andrey
dc.contributor.author: Perdomo Charry, Oscar Julián
dc.contributor.orcid: Perdomo Charry, Cesar Andrey [0000-0001-7310-4618]
dc.date.accessioned: 2025-03-10T20:43:32Z
dc.date.available: 2025-03-10T20:43:32Z
dc.date.created: 2025-02-21
dc.description: The current design of reinforcement learning methods demands large computational resources. Algorithms such as Deep Q-Network (DQN) have achieved outstanding results in advancing this field. However, the need to tune thousands of parameters and run millions of training episodes remains a major challenge. This document proposes a comparative analysis between the Q-Learning algorithm, which laid the foundations for Deep Q-Learning, and our proposed method, called M-Learning. The comparison is carried out using Markov Decision Processes with delayed reward as the general framework of the test bench. First, this document gives a complete description of the main challenges in implementing Q-Learning, especially regarding its multiple parameters. Next, the foundations of our proposed heuristic are presented, including its formulation, and the algorithm is described in detail. The methodology used to compare the two algorithms consisted of training them in the Frozen Lake environment. The experimental results, together with an analysis of the best solutions, show that our proposal requires fewer episodes and exhibits lower variability in the results. Specifically, M-Learning trains agents 30.7% faster in the deterministic environment and 61.66% faster in the stochastic environment. It also achieves greater consistency, reducing the standard deviation of the scores by 58.37% and 49.75% in the deterministic and stochastic environments, respectively.
dc.description.abstract: The current design of reinforcement learning methods requires extensive computational resources. Algorithms such as Deep Q-Network (DQN) have obtained outstanding results in advancing the field. However, the need to tune thousands of parameters and run millions of training episodes remains a significant challenge. This document proposes a comparative analysis between the Q-Learning algorithm, which laid the foundations for Deep Q-Learning, and our proposed method, termed M-Learning. The comparison is conducted using Markov Decision Processes with delayed reward as a general test bench framework. Firstly, this document provides a full description of the main challenges related to implementing Q-Learning, particularly concerning its multiple parameters. Then, the foundations of our proposed heuristic are presented, including its formulation, and the algorithm is described in detail. The methodology used to compare both algorithms involved training them in the Frozen Lake environment. The experimental results, along with an analysis of the best solutions, demonstrate that our proposal requires fewer episodes and exhibits reduced variability in the outcomes. Specifically, M-Learning trains agents 30.7% faster in the deterministic environment and 61.66% faster in the stochastic environment. Additionally, it achieves greater consistency, reducing the standard deviation of scores by 58.37% and 49.75% in the deterministic and stochastic settings, respectively.
dc.format.mimetype: pdf
dc.identifier.uri: http://hdl.handle.net/11349/93453
dc.language.iso: spa
dc.publisher: Universidad Distrital Francisco José de Caldas
dc.relation.references: B. Cottier, R. Rahman, L. Fattorini, N. Maslej, and D. Owen, “The rising costs of training frontier AI models,” 2024.
dc.relation.references: V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” 2013.
dc.relation.references: V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529–533, Feb 2015.
dc.relation.references: A. K. Sadhu and A. Konar, “Improving the speed of convergence of multi-agent Q-learning for cooperative task-planning by a robot-team,” Robotics and Autonomous Systems, vol. 92, pp. 66–80, 2017.
dc.relation.references: L. Canese, G. C. Cardarilli, M. M. Dehghan Pir, L. Di Nunzio, and S. Spanò, “Design and development of multi-agent reinforcement learning intelligence on the robotarium platform for embedded system applications,” Electronics, vol. 13, no. 10, 2024.
dc.relation.references: J. Torres, Introducción al aprendizaje por refuerzo profundo: Teoría y práctica en Python. Direct Publishing, Independently Published, 2021.
dc.relation.references: M. Lapan, Deep Reinforcement Learning Hands-On. Birmingham, UK: Packt Publishing, 2018.
dc.relation.references: N. Balaji, S. Kiefer, P. Novotný, G. A. Pérez, and M. Shirmohammadi, “On the complexity of value iteration,” 2019.
dc.relation.references: R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. The MIT Press, second ed., 2018.
dc.relation.references: B. Jang, M. Kim, G. Harerimana, and J. W. Kim, “Q-learning algorithms: A comprehensive classification and applications,” IEEE Access, vol. 7, pp. 133653–133667, 2019.
dc.relation.references: S. Liu, X. Hu, and K. Dong, “Adaptive double fuzzy systems based Q-learning for pursuit-evasion game,” IFAC-PapersOnLine, vol. 55, no. 3, pp. 251–256, 2022. 16th IFAC Symposium on Large Scale Complex Systems: Theory and Applications LSS 2022.
dc.relation.references: A. G. d. Silva Junior, D. H. d. Santos, A. P. F. d. Negreiros, J. M. V. B. d. S. Silva, and L. M. G. Gonçalves, “High-level path planning for an autonomous sailboat robot using Q-learning,” Sensors, vol. 20, no. 6, 2020.
dc.relation.references: M. E. Çimen, Z. Garip, Y. Yalçın, M. Kutlu, and A. F. Boz, “Self adaptive methods for learning rate parameter of Q-learning algorithm,” Journal of Intelligent Systems: Theory and Applications, vol. 6, no. 2, pp. 191–198, 2023.
dc.relation.references: L. Zhang, L. Tang, S. Zhang, Z. Wang, X. Shen, and Z. Zhang, “A self-adaptive reinforcement-exploration Q-learning algorithm,” Symmetry, vol. 13, no. 6, 2021.
dc.relation.references: J. Huang, Z. Zhang, and X. Ruan, “An improved Dyna-Q algorithm inspired by the forward prediction mechanism in the rat brain for mobile robot path planning,” Biomimetics, vol. 9, no. 6, 2024.
dc.relation.references: S. Xu, Y. Gu, X. Li, C. Chen, Y. Hu, Y. Sang, and W. Jiang, “Indoor emergency path planning based on the Q-learning optimization algorithm,” ISPRS International Journal of Geo-Information, vol. 11, no. 1, 2022.
dc.relation.references: A. dos Santos Mignon and R. L. de Azevedo da Rocha, “An adaptive implementation of ϵ-greedy in reinforcement learning,” Procedia Computer Science, vol. 109, pp. 1146–1151, 2017. 8th International Conference on Ambient Systems, Networks and Technologies, ANT-2017 and the 7th International Conference on Sustainable Energy Information Technology, SEIT 2017, 16-19 May 2017, Madeira, Portugal.
dc.relation.references: M. Zhang, W. Cai, and L. Pang, “Predator-prey reward based Q-learning coverage path planning for mobile robot,” IEEE Access, vol. 11, pp. 29673–29683, 2023.
dc.relation.references: W. Jin, R. Gu, and Y. Ji, “Reward function learning for Q-learning-based geographic routing protocol,” IEEE Communications Letters, vol. 23, no. 7, pp. 1236–1239, 2019.
dc.relation.references: X. Ou, Q. Chang, and N. Chakraborty, “Simulation study on reward function of reinforcement learning in gantry work cell scheduling,” Journal of Manufacturing Systems, vol. 50, pp. 1–8, 2019.
dc.relation.references: Y. Li, H. Wang, J. Fan, and Y. Geng, “A novel Q-learning algorithm based on improved whale optimization algorithm for path planning,” PLOS ONE, vol. 17, no. 12, p. e0279438, 2022.
dc.relation.references: S. Mirjalili and A. Lewis, “The whale optimization algorithm,” Advances in Engineering Software, vol. 95, pp. 51–67, 2016.
dc.relation.references: H. Sowerby, Z.-H. Zhou, and M. L. Littman, “Designing rewards for fast learning,” ArXiv, vol. abs/2205.15400, 2022.
dc.relation.references: G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” 2016.
dc.rights.acceso: Open (Full Text)
dc.rights.accessrights: RestrictedAccess
dc.subject: Reinforcement learning
dc.subject: Exploration-exploitation dilemma
dc.subject: Q-Learning
dc.subject: Frozen Lake
dc.subject: Heuristic approach
dc.subject.keyword: Reinforcement learning
dc.subject.keyword: Exploration-exploitation dilemma
dc.subject.keyword: Q-Learning
dc.subject.keyword: Frozen Lake
dc.subject.keyword: Heuristic approach
dc.subject.lemb: Electronic Engineering -- Academic Theses and Dissertations
dc.subject.lemb: Data mining
dc.subject.lemb: Experiential learning
dc.subject.lemb: Discovery learning
dc.title: M-Learning: enfoque heurístico para recompensas diferidas en el aprendizaje por refuerzo
dc.title.titleenglish: M-Learning: heuristic approach for delayed rewards in reinforcement learning
dc.type: bachelorThesis
dc.type.coar: http://purl.org/coar/resource_type/c_7a1f
dc.type.degree: Academic Output
dc.type.driver: info:eu-repo/semantics/bachelorThesis

Files

Original bundle

Name:
MoraCortesMarlonSneider2025.pdf
Size:
3.97 MB
Format:
Adobe Portable Document Format
Description:
Degree thesis
Name:
Licencia de uso y publicacion.pdf
Size:
1.92 MB
Format:
Adobe Portable Document Format
Description:
License for use and publication

License bundle

Name:
license.txt
Size:
7 KB
Format:
Item-specific license agreed upon to submission
Description: