OAI-SDR: Framework Inteligente de Búsqueda y Descubrimiento para Repositorios OAI-PMH mediante Aprendizaje Automático y Análisis Semántico
| dc.contributor.author | León Sandoval, Jhon Helmit | |
| dc.date.accessioned | 2026-02-02T23:40:06Z | |
| dc.date.available | 2026-02-02T23:40:06Z | |
| dc.date.created | 2025-12-02 | |
| dc.description | OAI-SDR is a tool for finding information in digital repositories that expose the OAI-PMH protocol, such as university repositories and cultural archives. It combines a classic but efficient method, BM25, which matches keywords quickly, with newer artificial-intelligence techniques (sentence embeddings) that capture the actual meaning of a query. The pipeline is straightforward: it harvests document metadata, normalizes it, enriches it with additional details, and converts it into vector representations the models can process. The lexical and semantic result lists are then merged with Reciprocal Rank Fusion, which combines the strengths of both approaches, and an optional learning-to-rank model can refine the final ordering (see the fusion sketch at the end of this record). The framework responds to the need for fast, precise searches that place the most relevant results first, answer within milliseconds, and make discovering useful documents in digital libraries more practical. Result quality is assessed with nDCG@10, MAP@10, and Recall@10, together with user click-through rate (CTR). | |
| dc.description.abstract | OAI-SDR is a hybrid search framework for OAI-PMH repositories that combines lexical BM25 retrieval, which matches keywords efficiently, with sentence-embedding techniques that capture the semantic intent of a query. The framework harvests and normalizes repository metadata, enriches the content, and encodes it into vector representations. Lexical and semantic rankings are then merged with Reciprocal Rank Fusion, and an optional learning-to-rank stage refines the top results. Retrieval effectiveness is evaluated with nDCG@10, MAP@10, Recall@10, click-through rate (CTR), and the time users need to find a useful result (see the evaluation-metric sketch at the end of this record). | |
| dc.format.mimetype | ||
| dc.identifier.uri | http://hdl.handle.net/11349/100283 | |
| dc.relation.references | M. Agosti, N. Ferro, and G. Silvello, “Access and Exchange of Hierarchically Structured Resources on the Web,” IEEE Trans. Knowl. Data Eng., vol. 21, no. 8, pp. 1123–1137, Aug. 2009. DOI: 10.1109/TKDE.2008.257. | |
| dc.relation.references | Q. Pei, S. Zhou, and M. Pan, “Pseudo-Relevance-Driven Query Expansion Using BERT,” in Proc. IEEE/WIC/ACM Int. Conf. Web Intelligence and Intelligent Agent Technology (WI-IAT), Dec. 2024, pp. 736–740. DOI: 10.1109/WI-IAT62293.2024.00120. | |
| dc.relation.references | D. Sun et al., “Zero-shot Document Retrieval with Hybrid Pseudo-document Retriever,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Jun. 2025, pp. 1–5. DOI: 10.1109/ICASSP48485.2025.10447152. | |
| dc.relation.references | R. M. Richard and A. R. Villanueva, “SLM-Based Hybrid Retrieval for Resource Constrained Retrieval-Augmented Generation on Open Super-Large Crawled Data,” in Proc. IEEE Int. Conf. Signal Processing (ICSP), Mar. 2025, pp. 1157–1160. DOI: 10.1109/ICSP58787.2025.10792154. | |
| dc.relation.references | Y. Zheng, “An Analysis of the Technical Trend of Semantic Search in Natural Language Processing,” IEEE Access, vol. 8, pp. 188673–188688, 2020. DOI: 10.1109/ACCESS.2020.3030077. | |
| dc.relation.references | C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge, U.K.: Cambridge Univ. Press, 2008. DOI: 10.1017/CBO9780511809071. | |
| dc.relation.references | M. Kozlowski, “Hybrid Retrievers with Generative Re-Rankers for Polish Passage Retrieval,” in Proc. Federated Conf. Computer Science and Information Systems (FedCSIS), Sep. 2023, pp. 1271–1276. DOI: 10.15439/2023F5344. | |
| dc.relation.references | G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Information Processing & Management, vol. 24, no. 5, pp. 513–523, 1988. DOI: 10.1016/0306-4573(88)90021-0. | |
| dc.relation.references | N. Aitymbetov, M.-H. Lee, and N. A. Tu, “Multi-Document Question Answering with Lightweight Embeddings-based Document Reranker,” in Proc. IEEE Asia-Pacific Conf. Computer Science and Data Engineering (CSDE), Jul. 2024, pp. 707–712. DOI: 10.1109/CSDE60791.2024.10625377. | |
| dc.relation.references | G. Salton, A. Wong, and C. S. Yang, “A Vector Space Model for Automatic Indexing,” Commun. ACM, vol. 18, no. 11, pp. 613–620, Nov. 1975. DOI: 10.1145/361219.361220. | |
| dc.relation.references | T. Mikolov et al., “Efficient Estimation of Word Representations in Vector Space,” arXiv preprint arXiv:1301.3781, Jan. 2013. DOI: 10.48550/arXiv.1301.3781. | |
| dc.relation.references | J. Devlin et al., “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding,” in Proc. Conf. North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Jun. 2019, pp. 4171–4186. DOI: 10.18653/v1/N19-1423. | |
| dc.relation.references | N. Reimers and I. Gurevych, “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,” in Proc. Conf. Empirical Methods in Natural Language Processing and Int. Joint Conf. Natural Language Processing (EMNLP-IJCNLP), Nov. 2019, pp. 3982–3992. DOI: 10.18653/v1/D19-1410. | |
| dc.relation.references | N. Reimers and I. Gurevych, “Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation,” in Proc. Conf. Empirical Methods in Natural Language Processing (EMNLP), Nov. 2020, pp. 4512–4525. DOI: 10.18653/v1/2020.emnlp-main.365. | |
| dc.relation.references | J. Carbonell and J. Goldstein, “The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries,” in Proc. 21st Ann. Int. ACM SIGIR Conf. Research and Development in Information Retrieval, Aug. 1998, pp. 335–336. DOI: 10.1145/290941.291025. | |
| dc.relation.references | J. C. Platt, “Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods,” in Advances in Large Margin Classifiers, A. J. Smola et al., Eds. Cambridge, MA, USA: MIT Press, 1999, pp. 61–74. ISBN: 9780262161831. | |
| dc.relation.references | A. Niculescu-Mizil and R. Caruana, “Predicting good probabilities with supervised learning,” in Proc. 22nd Int. Conf. Machine Learning (ICML), Aug. 2005, pp. 625–632. DOI: 10.1145/1102351.1102430. | |
| dc.relation.references | N. Sinhababu and R. Khatun, “LEq: Large Language Models Generate Expanded Queries for Searching,” in Proc. IEEE Int. Conf. Contemporary Computing and Networking Technology (ICCCNT), Jul. 2024, pp. 1–6. DOI: 10.1109/ICCCNT61278.2024.10762129. | |
| dc.relation.references | L. Xiong et al., “Approximate nearest neighbor negative contrastive learning for dense text retrieval,” in Proc. Int. Conf. Learning Representations (ICLR), May 2021. DOI: 10.48550/arXiv.2007.00808. | |
| dc.relation.references | L. Wang et al., “Text Embeddings by Weakly-Supervised Contrastive Pre-training,” arXiv preprint arXiv:2212.03533, Dec. 2022. DOI: 10.48550/arXiv.2212.03533. | |
| dc.relation.references | G. V. Cormack, C. L. Clarke, and S. Buettcher, “Reciprocal rank fusion outperforms condorcet and individual rank learning methods,” in Proc. 32nd Int. ACM SIGIR Conf. Research and Development in Information Retrieval, Jul. 2009, pp. 758–759. DOI: 10.1145/1571941.1572114. | |
| dc.relation.references | C. J. Burges, “From RankNet to LambdaRank to LambdaMART: an Overview,” Microsoft Research, Redmond, WA, USA, Tech. Rep. MSR-TR-2010-82, Jun. 2010. Available: https://www.microsoft.com/en-us/research/publication/from-ranknet-to-lambdarank-to-lambdamart-an-overview/ (direct PDF: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/MSR-TR-2010-82.pdf). DOI not available. | |
| dc.relation.references | N. Thakur, N. Reimers, J. Daxenberger, and I. Gurevych, “BEIR: A Heterogeneous Benchmark for Zero-Shot Evaluation of Information Retrieval Models,” in Proc. NeurIPS Datasets and Benchmarks Track, 2021. DOI: 10.48550/arXiv.2104.08663. | |
| dc.relation.references | N. Craswell, B. Mitra, E. Yilmaz, D. Campos, and E. M. Voorhees, “Overview of the TREC 2019 Deep Learning Track,” in Proc. TREC, 2019. DOI: 10.48550/arXiv.2003.07820. | |
| dc.relation.references | O. Khattab and M. Zaharia, “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT,” in Proc. SIGIR, 2020, pp. 39–48. DOI: 10.1145/3397271.3401075. | |
| dc.relation.references | T. Formal, B. Piwowarski, and S. Clinchant, “SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking,” arXiv:2107.05720, 2021. DOI: 10.48550/arXiv.2107.05720. | |
| dc.relation.references | J. Johnson, M. Douze, and H. Jégou, “Billion-Scale Similarity Search with GPUs,” arXiv:1702.08734, 2017. DOI: 10.48550/arXiv.1702.08734. | |
| dc.relation.references | Y. A. Malkov and D. A. Yashunin, “Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 4, pp. 824–836, 2020. DOI: 10.1109/TPAMI.2018.2889473. | |
| dc.relation.references | G. A. Miller, “WordNet: A Lexical Database for English,” Commun. ACM, vol. 38, no. 11, pp. 39–41, 1995. DOI: 10.1145/219717.219748. | |
| dc.relation.references | A. Sinha et al., “An Overview of Microsoft Academic Service (MAS) and Applications,” in Proc. WWW Companion, 2015, pp. 243–246. DOI: 10.1145/2740908.2742839. | |
| dc.relation.references | K. Järvelin and J. Kekäläinen, “Cumulated Gain-Based Evaluation of IR Techniques,” ACM Trans. Inf. Syst., vol. 20, no. 4, pp. 422–446, 2002. DOI: 10.1145/582415.582418. | |
| dc.relation.references | T. Joachims, L. A. Granka, B. Pan, H. Hembrooke, and G. Gay, “Accurately Interpreting Clickthrough Data as Implicit Feedback,” in Proc. SIGIR, 2005, pp. 154–161. DOI: 10.1145/1076034.1076063. | |
| dc.relation.references | F. Zhang, W. Chen, M. Fu, F. Li, H. Qu, and Z. Yi, “An Attention-Based Interactive Learning-to-Rank Model for Document Retrieval,” IEEE Trans. Syst. Man Cybern. Syst., vol. 52, no. 9, pp. 5770–5782, Sep. 2022. DOI: 10.1109/TSMC.2021.3129839. | |
| dc.rights.acceso | Abierto (Texto Completo) | |
| dc.rights.accessrights | OpenAccess | |
| dc.subject | OAI-PMH | |
| dc.subject | Búsqueda híbrida | |
| dc.subject | BM25 | |
| dc.subject | Embeddings de oraciones | |
| dc.subject | Fusión de rangos recíprocos | |
| dc.subject | Aprendizaje por ranking | |
| dc.subject | Bibliotecas digitales | |
| dc.subject | Recuperación de información | |
| dc.subject.keyword | OAI-PMH | |
| dc.subject.keyword | Hybrid search | |
| dc.subject.keyword | BM25 | |
| dc.subject.keyword | Sentence embeddings | |
| dc.subject.keyword | Reciprocal Rank Fusion | |
| dc.subject.keyword | Learning to rank | |
| dc.subject.keyword | Digital libraries | |
| dc.subject.keyword | Information retrieval | |
| dc.title | OAI-SDR: Framework Inteligente de Búsqueda y Descubrimiento para Repositorios OAI-PMH mediante Aprendizaje Automático y Análisis Semántico | |
| dc.title.titleenglish | OAI-SDR: Intelligent Search and Discovery Framework for OAI-PMH Repositories using Machine Learning and Semantic Analysis | |
| dc.type | article |
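The description above says the BM25 and embedding result lists are merged with Reciprocal Rank Fusion. The following is a minimal, self-contained sketch of that fusion step, not code from OAI-SDR itself: the document IDs, the two example rankings, and the smoothing constant k = 60 are illustrative assumptions.

```python
# Minimal sketch (not the framework's actual code): Reciprocal Rank Fusion of a
# lexical (BM25) ranking and a semantic (embedding) ranking, as described above.
# Document IDs and the constant k = 60 are illustrative assumptions.
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document receives the sum over rankings of 1 / (k + rank), where rank
    starts at 1; documents with higher fused scores rank first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top results from the two retrievers for one query.
bm25_ranking = ["doc3", "doc1", "doc7", "doc2"]       # keyword-match order
embedding_ranking = ["doc1", "doc5", "doc3", "doc9"]  # semantic-similarity order

print(rrf_fuse([bm25_ranking, embedding_ranking]))
# doc1 and doc3 rise to the top because both retrievers agree on them.
```

Because RRF operates on ranks rather than raw scores, the BM25 and embedding similarity scales never need to be normalized against each other, which is one reason it is a common choice for this kind of hybrid setup.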
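The abstract lists nDCG@10, MAP@10, and Recall@10 as the offline ranking-quality measures. The sketch below shows one conventional way to compute them for a single query under binary relevance judgments; the function names and sample data are hypothetical and not taken from the OAI-SDR evaluation.

```python
# Minimal sketch of the offline ranking-quality metrics mentioned in the abstract
# (nDCG@10, MAP@10, Recall@10), assuming binary relevance judgments. Names and
# sample data are illustrative, not taken from the OAI-SDR evaluation itself.
import math

def dcg_at_k(ranked_ids, relevant, k=10):
    # Gain of 1 for each relevant document, discounted by log2 of its position.
    return sum(1.0 / math.log2(i + 2)
               for i, d in enumerate(ranked_ids[:k]) if d in relevant)

def ndcg_at_k(ranked_ids, relevant, k=10):
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg_at_k(ranked_ids, relevant, k) / ideal if ideal else 0.0

def average_precision_at_k(ranked_ids, relevant, k=10):
    hits, precision_sum = 0, 0.0
    for i, d in enumerate(ranked_ids[:k], start=1):
        if d in relevant:
            hits += 1
            precision_sum += hits / i
    return precision_sum / min(len(relevant), k) if relevant else 0.0

def recall_at_k(ranked_ids, relevant, k=10):
    return len(set(ranked_ids[:k]) & relevant) / len(relevant) if relevant else 0.0

ranked = ["doc1", "doc3", "doc5", "doc7", "doc2"]   # fused ranking for one query
relevant = {"doc1", "doc7", "doc8"}                  # judged-relevant documents

print(ndcg_at_k(ranked, relevant), average_precision_at_k(ranked, relevant),
      recall_at_k(ranked, relevant))
```

The remaining measures mentioned in the record, click-through rate (CTR) and the time users need to reach a useful result, are online measures derived from interaction logs rather than from relevance judgments.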