2020 |
Sanoja, Andrés; Garcia, Jean Manual-design of Blocks: Una Herramienta para Gestionar Segmentaciones Manuales de Páginas Web Journal Article ReVeCom, 06 (01), pp. 019-027, 2020, ISSN: 2244-7040. @article{SanojaGarcia2020, title = {Manual-design of Blocks: Una Herramienta para Gestionar Segmentaciones Manuales de Páginas Web}, author = {Andrés Sanoja and Jean Garcia}, editor = {Sociedad Venezolana de Informática}, url = {https://www.researchgate.net/publication/340221556_Manual-design_of_Blocks_Una_Herramienta_para_Gestionar_Segmentaciones_Manuales_de_Paginas_Web_Revista_Venezolana_de_Computacion}, issn = {2244-7040}, year = {2020}, date = {2020-03-01}, journal = {ReVeCom}, volume = {06}, number = {01}, pages = {019-027}, abstract = {Web page segmentation is an important task in Web page analysis. The objective is to divide a Web page into blocks, each one representing a coherent part (or segment) of the content. In this work we describe the development of the Manual-design of Blocks (MoB). At the same time we describe how to get a ground truth of segmentations and how to compute the “best manual segmentation”. The best manual segmentation is defined based on our experience and the data obtained; in this investigation we define one way to obtain it, but we do not consider it the only way to achieve this. The best segmentation is then available to be used in the evaluation process of segmentation algorithms using the Block-o-Matic framework. There is also a Web API and a Web repository for managing the data. Acceptance test results are presented in this document.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
2018 |
Garcia, Jean; Sanoja, Andres Desarrollo de una Herramienta Interactiva Para la Construcción de un "Ground Truth" de Segmentaciones de Páginas Web Technical Report 2018. @techreport{GARSAN2018, title = {Desarrollo de una Herramienta Interactiva Para la Construcción de un "Ground Truth" de Segmentaciones de Páginas Web}, author = {Jean Garcia and Andres Sanoja}, editor = {CCPD}, year = {2018}, date = {2018-10-10}, abstract = {In this paper we describe the results of our research, which show the importance of evaluating segmentation algorithms. The aim of our work is to obtain a “ground truth” of manual segmentations of a Web page, from which “the best segmentation” is then obtained; this can later be used to evaluate the segmentation algorithm.}, keywords = {}, pubstate = {published}, tppubtype = {techreport} } |
2017 |
Sanoja, Andrés; Gançarski, Stéphane Migrating Web Archives from HTML4 to HTML5: A Block-Based Approach and Its Evaluation Book Chapter Kirikova, Mārīte; Nørvåg, Kjetil; Papadopoulos, George A (Ed.): pp. 375–393, Springer International Publishing, Cham, 2017, ISBN: 978-3-319-66917-5. @inbook{Sanoja2017, title = {Migrating Web Archives from HTML4 to HTML5: A Block-Based Approach and Its Evaluation}, author = {Andrés Sanoja and Stéphane Gançarski}, editor = {Mārīte Kirikova and Kjetil Nørvåg and George A Papadopoulos}, url = {https://doi.org/10.1007/978-3-319-66917-5_25}, doi = {10.1007/978-3-319-66917-5_25}, isbn = {978-3-319-66917-5}, year = {2017}, date = {2017-09-24}, pages = {375--393}, publisher = {Springer International Publishing}, address = {Cham}, abstract = {Web archives (and the Web itself) are likely to suffer from format obsolescence. In a few years or decades, future Web browsers will no longer be able to properly render Web pages written in HTML4 format. Thus we propose a migration tool from HTML4 to HTML5. This is challenging, because it requires generating HTML5 semantic elements that do not exist in HTML4 pages. To solve this issue, we propose to use a Web page segmenter. Indeed, blocks generated by a segmenter are good candidates for being semantic elements, as both reflect the content structure of the page. We use an evaluation framework for Web page segmentation that helps define and compute relevant metrics to measure the quality of the migration process. We ran experiments on a sample of 40 pages. The migrated pages we produce are compared to a ground truth. The automatic labeling of blocks is quite similar to the ground truth, though its quality depends on the type of page we migrate. When comparing the rendering of the original page and the rendering of its migrated version, we note some differences, mainly due to the fact that rendering engines do not (yet) properly render the content of semantic elements.}, keywords = {}, pubstate = {published}, tppubtype = {inbook} } |
2015 |
Sanoja, Andrés; Gançarski, Stéphane Web page segmentation evaluation Inproceedings Proceedings of the 30th Annual ACM Symposium on Applied Computing, pp. 753–760, ACM 2015. @inproceedings{SanGan:SAC:2015, title = {Web page segmentation evaluation}, author = {Andrés Sanoja and Stéphane Gançarski}, url = {http://doi.acm.org/10.1145/2695664.2695786}, doi = {10.1145/2695664.2695786}, year = {2015}, date = {2015-03-01}, booktitle = {Proceedings of the 30th Annual ACM Symposium on Applied Computing}, pages = {753--760}, organization = {ACM}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Sanoja, Andrés Web Page Segmentation, Evaluation and Applications PhD Thesis Université Pierre et Marie Curie-Paris VI, 2015. @phdthesis{Sanoja:LIP6:2015, title = {Web Page Segmentation, Evaluation and Applications}, author = {Andrés Sanoja}, editor = {UPMC}, url = {https://hal.inria.fr/tel-01128002/}, year = {2015}, date = {2015-01-22}, address = {4 place Jussieu, 75005. Paris, France}, school = {Université Pierre et Marie Curie-Paris VI}, type = {thesis}, keywords = {}, pubstate = {published}, tppubtype = {phdthesis} } |
2014 |
Sanoja, Andrés; Gançarski, Stéphane Block-o-Matic: A web page segmentation framework Inproceedings International Conference on Multimedia Computing and Systems (ICMCS), pp. 595-600, Marrakesh, Morocco, 2014. @inproceedings{Sanoja:ICMCS:2014, title = {Block-o-Matic: A web page segmentation framework}, author = {Andrés Sanoja and Stéphane Gançarski}, url = {http://ieeexplore.ieee.org/document/6911249/}, doi = {10.1109/ICMCS.2014.6911249}, year = {2014}, date = {2014-04-01}, booktitle = {International Conference on Multimedia Computing and Systems (ICMCS)}, pages = {595-600}, address = {Marrakesh, Morocco}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
2013 |
Sanoja, Andrés; Gançarski, Stéphane Block-o-Matic: a Web Page Segmentation Tool and its Evaluation Miscellaneous 29ème journées "Base de données avancées", BDA'13, 2013, (Poster). @misc{sanoja:hal-00881693, title = {Block-o-Matic: a Web Page Segmentation Tool and its Evaluation}, author = {Andrés Sanoja and Stéphane Gançarski}, url = {https://hal.archives-ouvertes.fr/hal-00881693}, year = {2013}, date = {2013-01-01}, abstract = {In this paper we present our prototype for web page segmentation, called Block-o-matic, and its counterpart for manual segmentation, Block-o-manual. The main idea is to evaluate the correctness of the segmentation algorithm. Building a ground truth database for evaluation can take days or months depending on the collection size; we address this with our manual segmentation tool, which is intended to minimize the time needed to annotate blocks in web pages. Both tools implement the same segmentation rules: the manual version uses them to propose blocks to the assessor, and the automatic version uses them for block selection. We present our demonstration scenario with a collection of web pages organized in categories. After annotation, the pages are compared with their automatic segmentation, and a score and a visual comparison are given.}, howpublished = {29ème journées "Base de données avancées", BDA'13}, note = {Poster}, keywords = {}, pubstate = {published}, tppubtype = {misc} } |
2012 |
Sanoja, Andrés; Gançarski, Stéphane Yet Another Hybrid Segmentation Tool Miscellaneous iPRES 2012 -- 9th International Conference on Preservation of Digital Objects, 2012, (Poster). @misc{sanoja:hal-00770527, title = {Yet Another Hybrid Segmentation Tool}, author = {Andrés Sanoja and Stéphane Gançarski}, url = {https://hal.archives-ouvertes.fr/hal-00770527}, year = {2012}, date = {2012-01-01}, abstract = {In this paper we present an overview of a prototype we are developing in the context of web archives (page comparison, crawling and information retrieval). It analyses pages based on their DOM tree information and their visual rendering. This tool implements a modified version of VIPS with the aim of enhancing the precision of visual block extraction and the hierarchy construction. First, the visual rendering of a page, produced by several browsers, is segmented into rectangular blocks. Then, the extracted blocks are analysed looking for visual overlaps, which are resolved using an adapted version of the XY-Cut algorithm. As a result we may have blocks of different shapes, both rectangular and non-rectangular. Finally, the visual block tree, representing the layout of the page, is analysed in order to obtain a more coherent layout.}, howpublished = {iPRES 2012 -- 9th International Conference on Preservation of Digital Objects}, note = {Poster}, keywords = {}, pubstate = {published}, tppubtype = {misc} } |
2010 |
Sanoja, Andrés; León, Claudia; Torres, Gustavo Lineamientos para la Construcción de un Archivo Histórico de la Información Digital producida en Venezuela Inproceedings CLCAR 2010. Conferencia Latino Americana de Computación de Alto Rendimiento, 2010. @inproceedings{sanoja2010lineamientos, title = {Lineamientos para la Construcción de un Archivo Histórico de la Información Digital producida en Venezuela}, author = {Andrés Sanoja and Claudia León and Gustavo Torres}, year = {2010}, date = {2010-01-01}, booktitle = {CLCAR 2010. Conferencia Latino Americana de Computación de Alto Rendimiento}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
González, Zulma; Sanoja, Andrés; Rivas, Sergio Utilización de Métricas de Software para Apoyar la Selección de Frameworks Web para el Sistema de Gestión de Datos Académicos de la Facultad de Ciencias UCV Journal Article SABER. Revista Multidisciplinaria del Consejo de Investigación de la Universidad de Oriente, 22 (2), pp. 185–192, 2010. @article{gonzalez2010utilizacion, title = {Utilización de Métricas de Software para Apoyar la Selección de Frameworks Web para el Sistema de Gestión de Datos Académicos de la Facultad de Ciencias UCV}, author = {Zulma González and Andrés Sanoja and Sergio Rivas}, year = {2010}, date = {2010-01-01}, journal = {SABER. Revista Multidisciplinaria del Consejo de Investigación de la Universidad de Oriente}, volume = {22}, number = {2}, pages = {185--192}, publisher = {Universidad de Oriente}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
2008 |
Sanoja, Andrés; León, Claudia Overview of extratos: yet another service oriented information extraction system for the web Proceedings Universidad de Magallanes, 2008, ISBN: 978-956-319-507-1. @proceedings{Sanoja2008, title = {Overview of extratos: yet another service oriented information extraction system for the web}, author = {Andrés Sanoja and Claudia León}, editor = {Jornadas Chilenas de Computación}, url = {https://web.archive.org/web/20090219124306/http://lahuen.dcc.uchile.cl/~jcc2008/libro.pdf}, isbn = {978-956-319-507-1}, year = {2008}, date = {2008-11-15}, publisher = {Universidad de Magallanes}, abstract = {This article describes the design and implementation of Extratos, a Service Oriented Information Extraction System for web content sharing, based on web services as extractors and BPEL business process generation. Some insights from archaeological sciences are applied to the design of the system. It is organized in five subsystems (Xpathula, Lab, Node, Web Portal and Executor) and the external entities web browser, web page and orchestration engine. Our solution presents a web extraction process, from the perspective of users and software systems, with three phases: design, generation and execution. The goal of the design phase is to help the user "discover" text from web pages, transform it into Text References, form Pages and compose a Mashup, along with the corresponding extraction procedure. The goal of the generation phase is, based on the design, to produce the mashups as the result of a service oriented process. For the execution of the process, a service oriented infrastructure is provided that gives software clients access to the mashups using standard web services protocols.}, keywords = {}, pubstate = {published}, tppubtype = {proceedings} } |
2006 |
Carballo, Yusneyi; Cattafi, Ricardo; Sanoja, Andrés; Zambrano, Nancy Gobierno electrónico en Venezuela Technical Report 2006. @techreport{carballo2006gobierno, title = {Gobierno electrónico en Venezuela}, author = {Yusneyi Carballo and Ricardo Cattafi and Andrés Sanoja and Nancy Zambrano}, url = {http://www.computacion.ciens.ucv.ve}, year = {2006}, date = {2006-01-01}, journal = {UCV. Caracas. Venezuela. [Documento en línea]. Recuperado de: http://www.computacion.ciens.ucv.ve}, keywords = {}, pubstate = {published}, tppubtype = {techreport} } |
Cattafi, Ricardo; Sanoja, Andrés; Carballo, Yusneyi; Zambrano, Nancy Gobierno-e en América Latina Technical Report 2006. @techreport{cattafi2006gobierno, title = {Gobierno-e en América Latina}, author = {Ricardo Cattafi and Andrés Sanoja and Yusneyi Carballo and Nancy Zambrano}, editor = {UCV. Caracas. Venezuela. [Documento en línea]}, url = {http://www.computacion.ciens.ucv.ve}, year = {2006}, date = {2006-01-01}, journal = {Lecturas en Ciencias de la Computación}, pages = {1--22}, keywords = {}, pubstate = {published}, tppubtype = {techreport} } |
Sanoja, Andres; Cattafi, Ricardo; Carballo, Yusneyi; Zambrano, Nancy Gobierno Electrónico en el Sureste Asiático Technical Report 2006. @techreport{sanoja2006gobierno, title = {Gobierno Electrónico en el Sureste Asiático}, author = {Andres Sanoja and Ricardo Cattafi and Yusneyi Carballo and Nancy Zambrano}, url = {http://www.computacion.ciens.ucv.ve}, year = {2006}, date = {2006-01-01}, journal = {Caracas, Septiembre}, keywords = {}, pubstate = {published}, tppubtype = {techreport} } |