1/9/2023

Fminer run code action

Building up new collections for digital libraries is a demanding task. Available data sets have to be extracted, which is usually done with the help of software developers because it involves custom data handlers or conversion scripts. In cases where the desired data is only available on the data provider's website, custom web scrapers are needed. This may be the case for small to medium-size publishers, research institutes, or funding agencies. Data curation is typically done by people with a library and information science background, who are usually proficient with XML technologies but are not full-stack programmers. We therefore present a web scraping tool that does not require digital library curators to program custom web scrapers from scratch: the open-source tool OXPath, an extension of XPath that allows the user to define the data to be extracted from websites in a declarative way. Taking one of our own use cases as an example, we guide you in more detail through the process of creating an OXPath wrapper for metadata harvesting. We also point out some practical things to consider when creating a web scraper (with OXPath). On top of that, we present a syntax highlighting plugin for the popular text editor Atom that we developed to further support OXPath users and to simplify the authoring process.

By Mandy Neumann, Jan Steinberg, and Philipp Schaer

1. Introduction and Motivation

Building up new collections for digital libraries is an expensive and demanding task. Digital content curators not only need to assess many different data sources intellectually but also need to invest a lot of time and effort to extract the available data sets. Usually this is done by coding custom data handlers or conversion scripts in languages like Perl or Python. While this might be a trivial task for programmers, librarians and content curators are likely to be overwhelmed by its complexity and pitfalls. One of the largest digital libraries leading the way in digitizing this data extraction process is the dblp computer science bibliography, whose process chain relies heavily on automatic metadata extraction from many different sources. Ley (2009) gave an excellent overview of, and insight into, all the traps one might fall into. Other examples are disciplinary open access repositories like the Social Science Open Access Repository (SSOAR), which gather available full-text items from different partner organizations such as publishers, research institutes, and individuals.
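As a rough illustration of the declarative idea mentioned above, the following Python sketch keeps the extraction rules as plain XPath expressions, separate from the code that applies them. It is not OXPath and does not show OXPath syntax. The sample markup, the class names, `FIELD_PATHS`, and `extract_records` are all hypothetical, and the standard library's `xml.etree.ElementTree` only handles well-formed markup and a limited XPath subset, so real web pages would likely need more than this.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment standing in for a provider's listing page.
SAMPLE_PAGE = """
<html>
  <body>
    <div class="record">
      <h2 class="title">Sample Working Paper A</h2>
      <span class="author">Doe, Jane</span>
      <span class="author">Roe, Richard</span>
    </div>
    <div class="record">
      <h2 class="title">Sample Working Paper B</h2>
      <span class="author">Poe, Pat</span>
    </div>
  </body>
</html>
"""

# Declarative part: which record nodes to visit and which fields to read from each.
# These paths are assumptions tied to the sample markup above.
RECORD_PATH = ".//div[@class='record']"
FIELD_PATHS = {
    "title": "./h2[@class='title']",
    "authors": "./span[@class='author']",
}


def extract_records(page_source):
    """Apply the XPath rules to the page and return one dict per record."""
    root = ET.fromstring(page_source)
    records = []
    for node in root.findall(RECORD_PATH):
        record = {}
        for field, path in FIELD_PATHS.items():
            # Every field is collected as a list, since a record may have several authors.
            record[field] = [el.text.strip() for el in node.findall(path)]
        records.append(record)
    return records


if __name__ == "__main__":
    for record in extract_records(SAMPLE_PAGE):
        print(record)
```

Changing what gets harvested then means editing the path table rather than the traversal code, which is the kind of separation a declarative wrapper language aims for.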