Search results

1 – 10 of 621
Open Access
Article
Publication date: 18 April 2024

Joseph Nockels, Paul Gooding and Melissa Terras

Abstract

Purpose

This paper focuses on image-to-text manuscript processing through Handwritten Text Recognition (HTR), a Machine Learning (ML) approach enabled by Artificial Intelligence (AI). With HTR now achieving high levels of accuracy, we consider its potential impact on our near-future information environment and knowledge of the past.

Design/methodology/approach

In undertaking a more constructivist analysis, we identified gaps in the current literature through a Grounded Theory Method (GTM). This guided an iterative process of concept mapping through writing sprints in workshop settings. We identified, explored and confirmed themes through group discussion and a further interrogation of relevant literature, until reaching saturation.

Findings

Catalogued as part of our GTM, 120 published texts underpin this paper. We found that HTR facilitates accurate transcription and dataset cleaning, while also widening access to a variety of historical material. HTR contributes to a virtuous cycle of dataset production and can inform the development of online cataloguing. However, current limitations include dependency on digitisation pipelines, potential archival history omission and entrenchment of bias. We also cite near-future HTR considerations. These include encouraging open access; integrating advanced AI processes and metadata extraction; legal and moral issues surrounding copyright and data ethics; crediting individuals’ transcription contributions; and HTR’s environmental costs.

Originality/value

Our research produces a set of best practice recommendations for researchers, data providers and memory institutions, surrounding HTR use. This forms an initial, though not comprehensive, blueprint for directing future HTR research. In pursuing this, the narrative that HTR’s speed and efficiency will simply transform scholarship in archives is deconstructed.

Abstract

Purpose

An overview of the current use of handwritten text recognition (HTR) on archival manuscript material, as provided by the EU H2020 funded Transkribus platform. It explains HTR, demonstrates Transkribus, gives examples of use cases, highlights the effect HTR may have on scholarship, and evidences this turning point in the advanced use of digitised heritage content. The paper aims to discuss these issues.

Design/methodology/approach

This paper adopts a case study approach, using the development and delivery of the one openly available HTR platform for manuscript material.

Findings

Transkribus has demonstrated that HTR is now a useable technology that can be employed in conjunction with mass digitisation to generate accurate transcripts of archival material. Use cases are demonstrated, and a cooperative model is suggested as a way to ensure sustainability and scaling of the platform. However, funding and resourcing issues are identified.

Research limitations/implications

The paper presents results from projects: further user studies could be undertaken involving interviews, surveys, etc.

Practical implications

Only HTR provided via Transkribus is covered: however, this is the only publicly available platform for HTR on individual collections of historical documents at time of writing and it represents the current state-of-the-art in this field.

Social implications

The increased access to information contained within historical texts has the potential to be transformational for both institutions and individuals.

Originality/value

This is the first published overview of how HTR is used by a wide archival studies community, reporting and showcasing current application of handwriting technology in the cultural heritage sector.

Article
Publication date: 31 July 2020

Zainab Akhtar, Jong Weon Lee, Muhammad Attique Khan, Muhammad Sharif, Sajid Ali Khan and Naveed Riaz

Abstract

Purpose

In artificial intelligence, optical character recognition (OCR) is an active research area, driven by well-known applications such as automating the conversion of printed documents into machine-readable text. In academia and banking, a major purpose of OCR is to achieve strong performance while saving storage space.

Design/methodology/approach

A novel technique is proposed for automated OCR based on multi-properties feature fusion and selection. The features are fused serially, and the output is passed to a partial least squares (PLS) based selection method. Selection is based on an entropy fitness function. The final features are classified by an ensemble classifier.

Findings

The presented method was extensively tested on two datasets, the authors' proposed dataset and the Chars74k benchmark, and achieved accuracies of 91.2% and 99.9%. Compared with existing techniques, the proposed method gives improved performance.

Originality/value

The technique presented in this work can help with license plate recognition and with converting printed documents into machine-readable text.

Details

Journal of Enterprise Information Management, vol. 36 no. 3
Type: Research Article
ISSN: 1741-0398

Article
Publication date: 10 April 2023

Evagelos Varthis and Marios Poulos

Abstract

Purpose

This study aims to present metaGraphos, a crowdsourcing system that aids in the transcription and semantic enhancement of scanned documents by using a pool of volunteers or people willing to participate in exchange for a financial reward.

Design/methodology/approach

metaGraphos can be used in circumstances where optical character recognition fails to produce satisfactory results, where semantic tagging or assigning thematic headings to texts is considered necessary, or where ground-truth data has to be collected in raw form.

Findings

The system automatically provides a Web-based interface comprising a static HTML page and JavaScript code that displays the scanned images of the document, coupled with the corresponding incomplete texts side by side, allowing users to correct or complete the texts in parallel.

Social implications

By assisting the parallel transcription and the semantic enhancement of difficult scanned documents, the system further reveals the hidden cultural wealth and aids in knowledge dissemination, a fact that contributes significantly to the academic-scientific dialog and feedback.

Originality/value

Individual researchers, libraries and organizations in general may benefit from the system because it is a cost-effective, practical and simple-to-set-up client–server architecture that provides a reliable way to transcribe texts or revise transcriptions on a large scale.

Details

Collection and Curation, vol. 42 no. 4
Type: Research Article
ISSN: 2514-9326

Open Access
Article
Publication date: 28 November 2017

Mansoor Alghamdi and William Teahan

Abstract

Purpose

The aim of this paper is to experimentally evaluate the effectiveness of the state-of-the-art printed Arabic text recognition systems to determine open areas for future improvements. In addition, this paper proposes a standard protocol with a set of metrics for measuring the effectiveness of Arabic optical character recognition (OCR) systems to assist researchers in comparing different Arabic OCR approaches.

Design/methodology/approach

This paper describes an experiment to automatically evaluate four well-known Arabic OCR systems using a set of performance metrics. The evaluation experiment is conducted on a publicly available printed Arabic dataset comprising 240 text images with a variety of resolution levels, font types, font styles and font sizes.

Findings

The experimental results show that the field of character recognition for printed Arabic still requires further research to reach an efficient text recognition method for Arabic script.

Originality/value

To the best of the authors’ knowledge, this is the first work that provides a comprehensive automated evaluation of Arabic OCR systems with respect to the characteristics of Arabic script and, in addition, proposes an evaluation methodology that can be used as a benchmark by researchers and therefore will contribute significantly to the enhancement of the field of Arabic script recognition.

Details

PSU Research Review, vol. 1 no. 3
Type: Research Article
ISSN: 2399-1747

Article
Publication date: 7 June 2021

Marco Humbel, Julianne Nyhan, Andreas Vlachidis, Kim Sloan and Alexandra Ortolja-Baird

Abstract

Purpose

By mapping out the capabilities, challenges and limitations of named-entity recognition (NER), this article aims to synthesise the state of the art of NER in the context of the early modern research field and to inform discussions about the kind of resources, methods and directions that may be pursued to enrich the application of the technique going forward.

Design/methodology/approach

Through an extensive literature review, this article maps out the current capabilities, challenges and limitations of NER and establishes the state of the art of the technique in the context of the early modern, digitally augmented research field. It also presents a new case study of NER research undertaken by Enlightenment Architectures: Sir Hans Sloane's Catalogues of his Collections (2016–2021), a Leverhulme funded research project and collaboration between the British Museum and University College London, with contributing expertise from the British Library and the Natural History Museum.

Findings

Currently, it is not possible to benchmark the capabilities of NER as applied to documents of the early modern period. The authors also draw attention to the situated nature of authority files, and current conceptualisations of NER, leading them to the conclusion that more robust reporting and critical analysis of NER approaches and findings is required.

Research limitations/implications

This article examines NER as applied to early modern textual sources, which are mostly studied by Humanists. As addressed in this article, detailed reporting of NER processes and outcomes is not necessarily valued by the disciplines of the Humanities, with the result that it can be difficult to locate relevant data and metrics in project outputs. The authors have tried to mitigate this by contacting projects discussed in this paper directly, to further verify the details they report here.

Practical implications

The authors suggest that a forum is needed where tools are evaluated according to community standards. Within the wider NER community, the MUC and CoNLL corpora are used for such experimental set-ups and are accompanied by a conference series; these may be seen as a useful model. The ultimate nature of such a forum must be discussed with the whole research community of the early modern domain.

Social implications

NER is an algorithmic intervention that transforms data according to certain rules, patterns or training data and ultimately affects how the authors interpret the results. The creation, use and promotion of algorithmic technologies like NER is not a neutral process, and neither is their output. This paper calls for a more critical understanding of the role and impact of NER on early modern documents and research, and for a focus on some of the data- and human-centric aspects of NER routines that are currently overlooked.

Originality/value

This article presents a state-of-the-art snapshot of NER, its applications and potential, in the context of early modern research. It also seeks to inform discussions about the kinds of resources, methods and directions that may be pursued to enrich the application of NER going forward. It draws attention to the situated nature of authority files, and current conceptualisations of NER, and concludes that more robust reporting of NER approaches and findings is urgently required. The Appendix sets out a comprehensive summary of digital tools and resources surveyed in this article.

Details

Journal of Documentation, vol. 77 no. 6
Type: Research Article
ISSN: 0022-0418

Open Access
Article
Publication date: 20 February 2024

Alenka Kavčič Čolić and Andreja Hari

Abstract

Purpose

The current predominant delivery format resulting from digitization is PDF, which is not appropriate for the blind, partially sighted and people who read on mobile devices. To meet the needs of both communities, as well as broader ones, alternative file formats are required. Drawing on the findings of research from the eBooks-On-Demand-Network Opening Publications for European Netizens (EODOPEN) project, this study aims to improve access to digitized content for these communities.

Design/methodology/approach

In 2022, the authors conducted research on the digitization experiences of 13 EODOPEN partners at their organizations. The authors distributed the same sample of scans in English with different characteristics and, in accordance with the Web Content Accessibility Guidelines, created 24 criteria to analyze the partners' digitization workflows, output formats and optical character recognition (OCR) quality.

Findings

In this contribution, the authors present the results of a trial implementation among EODOPEN partners regarding their digitization workflows, the delivery file formats used and the resulting OCR quality, depending on the type of digitization output file format. Partners using the OCR tool ABBYY FineReader Professional and producing scanning outputs in tagged PDF and PDF/UA formats achieved better results according to the set criteria.

Research limitations/implications

The trial implementations were limited to 13 project partners’ organizations only.

Originality/value

This research paper can be a valuable contribution to the field of massive digitization practices, particularly in terms of improving the accessibility of the output delivery file formats.

Details

Digital Library Perspectives, vol. 40 no. 2
Type: Research Article
ISSN: 2059-5816

Article
Publication date: 17 July 2020

Hrvoje Stančić and Željko Trbušić

Abstract

Purpose

The authors investigate optical character recognition (OCR) technology and discuss its implementation in the context of digitisation of archival materials.

Design/methodology/approach

The typewritten transcripts of the Croatian Writers' Society from the mid-1960s are used as the test data. The optimal digitisation setup is investigated in order to obtain the best OCR results. This was done by using a sample of 123 pages digitised at different resolution settings and binarisation levels.

Findings

A series of tests showed that different settings produce significantly different results. The best OCR accuracy achieved on the test sample of typewritten documents was 95.02%. The results show that resolution is significantly more important than the binarisation pre-processing procedure for achieving better OCR results.

Originality/value

Based on the research results, the authors give recommendations for achieving optimal digitisation process setup with the aim of increasing the quality of OCR results. Finally, the authors put the research results in the context of digitisation of cultural heritage in general and discuss further investigation possibilities.

Details

Aslib Journal of Information Management, vol. 72 no. 4
Type: Research Article
ISSN: 2050-3806

Article
Publication date: 31 January 2023

Mrinalini Luthra, Konstantin Todorov, Charles Jeurgens and Giovanni Colavizza

Abstract

Purpose

This paper aims to expand the scope and mitigate the biases of extant archival indexes.

Design/methodology/approach

The authors use automatic entity recognition on the archives of the Dutch East India Company to extract mentions of underrepresented people.

Findings

The authors release an annotated corpus and baselines for a shared task and show that the proposed goal is feasible.

Originality/value

Colonial archives are increasingly a focus of attention for historians and the public; broadening access to them is a pressing need for archives.

Details

Journal of Documentation, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 0022-0418

Abstract

Details

Reference Reviews, vol. 32 no. 2
Type: Research Article
ISSN: 0950-4125
