
JBE, vol. 27, no. 4, pp., July, 2022

DOI: https://doi.org/10.5909/JBE.2022.27.4.581

Dataset Search System Using Metadata-Based Ranking Algorithm

Wooyoung Choi and Jonghoon Chun


Abstract:

As demand for big data has grown, so has interest in the dataset search technology needed for data analysis. Unlike conventional text search, dataset search must make active use of metadata, yet dataset search systems of this kind have received little research attention. In this paper, we propose a search system tailored to datasets: it indexes the metadata of datasets and answers queries against those metadata indices. Search results are ranked by a newly devised algorithm that reflects the distinctive characteristics of datasets. The system can also retrieve additional datasets that correlate with those matched by the user's query, so that the multiple datasets needed for an analysis can be found at once.
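The abstract does not spell out the ranking algorithm, so the following is only a minimal sketch of the general idea, under stated assumptions: metadata fields are indexed separately, a query is scored with per-field weights, and related datasets are suggested by keyword overlap. The field names, the weights in `FIELD_WEIGHTS`, the Jaccard-based `related` function, and the sample records are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical field weights (assumption, not from the paper):
# a match in the title counts more than one in the description.
FIELD_WEIGHTS = {"title": 3.0, "keywords": 2.0, "description": 1.0}

# Hypothetical sample metadata records.
DATASETS = {
    "d1": {"title": "Seoul air quality",
           "keywords": "air pollution seoul",
           "description": "hourly pm10 measurements"},
    "d2": {"title": "Traffic volume",
           "keywords": "traffic seoul",
           "description": "daily traffic counts affecting air quality"},
    "d3": {"title": "Rainfall records",
           "keywords": "weather rain",
           "description": "monthly rainfall"},
}


def tokenize(text):
    """Lowercase whitespace tokenizer (deliberately simplistic)."""
    return text.lower().split()


def build_index(datasets):
    """Build an inverted index: (field, term) -> {dataset_id: term count}."""
    index = defaultdict(lambda: defaultdict(int))
    for ds_id, meta in datasets.items():
        for field in FIELD_WEIGHTS:
            for term in tokenize(meta.get(field, "")):
                index[(field, term)][ds_id] += 1
    return index


def search(index, query):
    """Score each dataset as the sum of field weight * term count."""
    scores = defaultdict(float)
    for term in tokenize(query):
        for field, weight in FIELD_WEIGHTS.items():
            for ds_id, count in index.get((field, term), {}).items():
                scores[ds_id] += weight * count
    # Highest score first; ties broken by dataset id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def related(datasets, ds_id, top_k=3):
    """Suggest other datasets by Jaccard overlap of their keyword sets."""
    base = set(tokenize(datasets[ds_id].get("keywords", "")))
    results = []
    for other_id, meta in datasets.items():
        if other_id == ds_id:
            continue
        other = set(tokenize(meta.get("keywords", "")))
        union = base | other
        similarity = len(base & other) / len(union) if union else 0.0
        if similarity > 0:
            results.append((other_id, similarity))
    return sorted(results, key=lambda kv: (-kv[1], kv[0]))[:top_k]


if __name__ == "__main__":
    idx = build_index(DATASETS)
    print(search(idx, "air quality"))
    print(related(DATASETS, "d1"))
```

A production system would more plausibly delegate indexing and scoring to a search engine rather than an in-memory dictionary; the sketch only illustrates how field-level metadata can drive both ranking and related-dataset suggestions.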



Keywords: Dataset, Search, Metadata, Ranking, Big data


Editorial Office
1108, New building, 22, Teheran-ro 7-gil, Gangnam-gu, Seoul, Korea
Homepage: www.kibme.org TEL: +82-2-568-3556 FAX: +82-2-568-3557
Copyrightⓒ 2012 The Korean Institute of Broadcast and Media Engineers
All Rights Reserved