ICDAR2017 Competition on Multi-lingual scene text detection and script identification

RRC-MLT (Robust Reading Competition – Multi-Lingual Text)



Text detection and recognition in natural scenes is a key component of many applications, ranging from business card digitization to indexing shops in a street. This new competition aims at assessing the ability of state-of-the-art methods to detect text when the user is facing various scripts and languages, in a way which prevents relying on much a priori knowledge, as in modern cities where multiple cultures live and communicate together. This situation is also frequent when analyzing streams of content gathered on the Internet. This competition is therefore an extension of the existing Robust Reading Competition (RRC), which has been held since 2003 both at ICDAR and in an online context. The proposed competition will be added as a new challenge of the RRC. We believe such an initiative would not only encourage research on a topic of major interest, but also open new perspectives while remaining complementary to existing work.


In this proposed competition we try to answer the question of whether text detection methods (deep learning-based or otherwise) can handle different scripts without fundamental changes in the underlying algorithms/techniques, or whether we really need script-specific methods. The ultimate goal of robust reading is to be able to read the text that appears in any captured image, regardless of image source (type), image quality, text script or any other difficulty. Many research works have been devoted to solving this problem. The previous editions of the RRC competitions [9, 10, 11], among other works, have provided useful datasets to help researchers tackle each of these problems in order to robustly read text in natural scene images. In this competition, we extend state-of-the-art work further by tackling the problem of multi-lingual text detection and script identification. In other words, methods should be script-robust text detection methods.

Despite the datasets already available for scene text detection or script identification (see Section 1.2), our proposed dataset offers interesting novel aspects. The dataset is composed of complete scene images covering 9 languages that represent 6 different scripts. It combines text detection with script identification, and contains many more images than related datasets, with an equal number of images per script. This makes it a useful benchmark for the task of multi-lingual scene text detection. The dataset, along with its ground truth, contains all the information necessary to prepare text recognition systems as well. The considered languages are the following: Chinese, Japanese, Korean, English, French, Arabic, Italian, German and Indian.

Such a dataset is the natural extension of the RRC series, with more scripts and more images, while focusing only on intentional (or focused) text. It addresses the community's need for improved and robust scene text detection. As we review in Section 1.2, datasets following this idea are being created because they are needed by industry and regular users. However, we argue that such datasets cannot be considered suitable benchmarks for multi-script scene text detection.

The target audience of this dataset is obviously not only the ICDAR community, but also the wider computer vision community. In both communities, researchers work on scene analysis, scene text detection and recognition, text image quality, and script identification.


  • Multi-script text detection
  • Script and language identification
  • End-to-end text detection and language identification


The dataset will be comprised of natural scene images with embedded text and born-digital images, such as street signs, street advertisement boards, shop names, passing vehicles, user photos in microblogs, and web advertisements. This kind of image represents one of the most frequently encountered image types on the Internet: images with embedded text shared on social media.

We are in the process of collecting images for 9 languages. Our dataset has equal partitions of the following languages: Chinese, Japanese, Korean, Arabic, French, English, Indian, German and Italian.

We are collecting a minimum of 9000 images, with a minimum of 1000 images per script (we will try to reach 2000 images per script). The images will be divided as follows: 50% for training (500 ~ 1000 per script) and 50% for testing. Note that here we refer to full scene images, and each image may contain one or more words (usually more than one word appears per image). This means that the number of word images will be much larger than 9000 (the number of full images).
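The balanced per-script 50/50 split described above can be sketched as follows. This is an illustrative sketch only: the helper name, the file-naming scheme, and the use of 1000 images per script are assumptions for the example, not part of the competition tooling.

```python
import random


def split_per_script(image_ids_by_script, train_frac=0.5, seed=0):
    """Split images into train/test per script, so both halves stay
    balanced across scripts (each script contributes train_frac of its
    images to training and the rest to testing)."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, test = [], []
    for script, ids in image_ids_by_script.items():
        ids = list(ids)
        rng.shuffle(ids)
        cut = int(len(ids) * train_frac)
        train.extend(ids[:cut])
        test.extend(ids[cut:])
    return train, test


# Hypothetical example: the 9 languages with 1000 placeholder image ids each
languages = ["Chinese", "Japanese", "Korean", "Arabic", "French",
             "English", "Indian", "German", "Italian"]
data = {lang: [f"{lang}_{i:04d}.jpg" for i in range(1000)]
        for lang in languages}
train, test = split_per_script(data)
```

With 1000 images per script and a 50% training fraction, this yields 4500 training and 4500 test images, and the equal per-script proportions are preserved in both halves.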