This HTML page is an extract of the deliverable “Report on Backlist Data and gap analysis”, available to download as PDF (1.6 MB) or EPUB (800 KB).


The objective of this section of the report is to share the results of identifying recurrent accessibility issues, per category of ebook, that will need remediation to meet EAA requirements. We first define the target, our methodology and the scoring threshold established for this analysis, as well as the identified biases and limits. We then present the results we deemed useful for the next steps of the ABE Lab project, with a list of the recurrent accessibility issues detected. Lastly, we define the outcomes of this work and the ebook classification we developed, which will be used to test remediation tools and workflows.

Target

The European Accessibility Act (EAA) 1 requirements for ebooks are listed in Annex I, Section III and Section IV, point (f). EPUB Accessibility - EU Accessibility Act Mapping 2 is a W3C group note that shows how EPUB files conforming to the EPUB accessibility guidelines (EPUB Accessibility 1.1 and WCAG 2.1 AA) meet the European Accessibility Act requirements. These two documents help us define our target and the basis for establishing accessibility deficiencies.

Pre-paginated ebooks (such as PDFs and Fixed Layout EPUBs) do not comply with the EAA requirement for flexibility and choice in the presentation of the content 3, a key functionality for persons with cognitive difficulties (such as dyslexia) or sight impairments. As full remediation of these ebooks would mean a change of format, we chose to introduce an intermediate remediation target, which lets us study remediation within the current state of each format and the possibilities offered by remediation tools. The target for these documents is compliance with the Web Content Accessibility Guidelines (WCAG) 2.1 4. In addition, PDFs will have to reach PDF/UA conformity, a dedicated ISO standard 5.

Methodology

The backlist data analysis allowed us to define a wish list of categories of files to collect in order to represent the backlist composition in a small but consistent sample. Our target was 200 files from 5 countries. This objective was exceeded, with 351 files collected from 7 EU countries (Denmark, Finland, France, Germany, Italy, the Netherlands, Spain), plus some samples from the United Kingdom. We used the Thema codes as our reference for classification. Some samples had been assigned several Thema categories by the publisher. As it was not possible to separate the Thema categories, we chose to duplicate these samples (one unit per Thema code, i.e. a book with Thema codes D and F was analysed twice, once as D and once as F), resulting in a total of 376 units to analyse.
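
The duplication rule described above can be sketched as follows. This is a minimal illustration, not the project's script: the `Sample` class, the `expand_by_thema` function and the file names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Sample:
    """A collected file with the top-level Thema codes given by the publisher."""
    filename: str
    thema_codes: Tuple[str, ...]


def expand_by_thema(samples: List[Sample]) -> List[Tuple[str, str]]:
    """Return one (filename, Thema code) analysis unit per code of each sample."""
    return [(s.filename, code) for s in samples for code in s.thema_codes]


# Hypothetical input: one book classified as both D and F, one as M only.
samples = [
    Sample("book_a.epub", ("D", "F")),
    Sample("book_b.epub", ("M",)),
]
units = expand_by_thema(samples)  # book_a.epub yields two units, book_b.epub one
```

With this rule, the number of analysis units is always at least the number of collected files, which is why the unit count exceeds the file count.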

We added to this sample one accessible EPUB3 target file 6, to make sure that the gap emerging from our analysis reflected reality and that files already made accessible would not be considered as files with remediation needs. This target file was produced by LIA as born accessible in 2023.

We first established a list of key performance indicators (KPIs) we wanted to evaluate, and from them determined the data to extract from the samples. We verified which of these data were available from existing reporting tools (EPUBCheck, ACE, and RGTK for EPUB files; VeraPDF and PDFIX for PDF files; see the annex Tools used in the automated analysis for a brief description of these tools) and identified the missing ones. Fondazione LIA developed a script to extract the missing data, aggregate all the data, and export a unified report. The details of the tests are given in the document Detailed evaluation of the tests made on ebooks, available to project partners and contributing publishers. The script went through 15 iterations to refine the data extraction and the exported reports. We started from a large set of collected data and reduced it to the necessary minimum.

The report was then used to develop calculation methods to define remediation complexity indicators 7. Iterations were needed for this step as well, as the data visualisations produced helped us identify biases, gaps and irrelevant information. The results of the evaluation are presented and commented on within this document.

Scoring

Usually, providers of remediation services classify the ebooks per complexity: a book with more images, tables or pages will get a higher score. This method is relevant if the whole set of ebooks to classify is produced from a known production workflow. Looking at the European level, we know that publishers’ workflows differ in the quality of files they produce, which consequently may be totally different in terms of accessibility features, accessibility information and, therefore, remediation needs.

That is why in this project we established a new classification related to remediation complexity, considering that an ebook may be very complex but already produced in accordance with accepted accessibility standards, thus resulting in a very low remediation complexity score. To be sure that the scoring was truly reflecting the remediation needs, we referred to our target file known to be fully accessible and with no remediation needs. With some iterations on scoring, we made sure that the target got a score of zero.

Capturing remediation complexities in relation to different file formats was one of the main challenges of the process. PDFs and Fixed-layout EPUBs are known to be the most complex to remediate as the technologies and languages used to build them imply more complexities and a higher level of programmatic abstraction. That’s why we decided to represent them apart.

One bias we had to deal with is that the PDF format allows for less structure and metadata, leaving fewer possibilities for analysis, which resulted in abnormally low scores for files in this format. To address this bias, we had to establish a complementary scoring calculation to apply to these files.

Each format therefore has its own specificities, related to the content found in the files and to the accessibility features that are missing. To find the right marker, a threshold on the calculated key indicators was established through iterations.

Identified limits and bias

As noted above, files in PDF format do not offer the same accessibility possibilities as files in EPUB format. The comparison between the two formats must therefore be made with caution and should not lead to categorical conclusions.

Most of the publishers providing samples are already aware of accessibility, so the collection we have might be a biased representation of the backlist. A way to verify this would be to run a similar analysis on a large number of files not specifically selected for this type of test. This perspective has been discussed with three members of EDRLab (Beletrina, De Marque and Hachette Livres) and we hope to be able to provide it as a complementary ABE Lab publication in the future.

At the time of writing this report, some remediation needs cannot be spotted automatically, but as technological improvements are occurring very fast, we expect that a better gap analysis could be produced in the coming years. Examples of accessibility problems that cannot be automatically detected are incorrect, non-meaningful or insufficient image descriptions and wrong metadata claims, for which we were not able to establish a valid calculation method during this work.

Results

Per format

The sample contains 84% (316 files) of reflowable EPUB (RFL), 9% (33 files) of pre-paginated EPUB3 Fixed Layout (FXL) and 7% (26 files) of PDFs. This distribution does not properly represent any of the market segmentations observed in the backlist data analysis.
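
As a quick arithmetic check, the percentages quoted above can be recomputed from the three file counts. The sketch assumes the denominator is simply the sum of the three counts given in the text; the variable names are illustrative.

```python
# File counts per format, as quoted in the paragraph above.
counts = {"RFL": 316, "FXL": 33, "PDF": 26}

total = sum(counts.values())

# Share of each format, rounded to the nearest whole percent.
shares = {fmt: round(100 * n / total) for fmt, n in counts.items()}
```

The rounded shares come out at 84, 9 and 7 percent, matching the figures in the text.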

The low number of pre-paginated files in the sample limits the relevance of the analysis for these formats. It may reflect the interest of the publishers providing samples in an accurate analysis of the remediation needs of reflowable EPUB files rather than of PDF and EPUB3 FXL files, as many ebooks exist in both reflowable and pre-paginated formats.

The radar diagram and the data table below show the results of the scoring. We summarise here the main trends per format:

  • PDF scoring ranges from 29 to 68 with representation in Thema categories A (The Arts), J (Society and Social Sciences), K (Economics, Finance, Business and Management), L (Law), M (Medicine and Nursing), P (Mathematics and Science), T (Technology, Engineering, Agriculture, Industrial processes) and V (Health, Relationships and Personal development);

  • EPUB3 Fixed Layout (FXL) average scoring ranges from 24 to 64 with representation in Thema categories A (The Arts), C (Language and Linguistics), D (Biography, Literature and Literary studies), P (Mathematics and Science), S (Sports and Active outdoor recreation), T (Technology, Engineering, Agriculture, Industrial processes), W (Lifestyle, Hobbies and Leisure), X (Graphic novels, Comic books, Manga, Cartoons) and Y (Children’s, Teenage and Educational);

  • EPUB3 reflowable average scoring ranges from 4 to 77 with representation in all Thema categories except X (Graphic novels, Comic books, Manga, Cartoons).

This overview shows a clear difference in ranges: reflowable formats are almost all below a score of 50, while pre-paginated formats are all over 50. As noted before, the lack of information provided in PDF files might lead to underestimating the remediation complexity. We tried to compensate for that in our scoring threshold, but remediation testing will have to establish whether the compensation is sufficient or misleading.

We also found that pre-paginated formats are not represented in every Thema category, while reflowable files are missing only for category X (Graphic novels, Comic books, Manga, Cartoons). This suggests that, except for visual narratives, all types of books can be produced in a reflowable format.

From these results, it seems legitimate to treat the remediation of pre-paginated files separately from that of reflowable ones. This result will be represented in our remediation classification through the establishment of a first level of complexity related to file format.

Figure 6: Radar chart showing three curves for PDF (blue continuous line), EPUB Fixed-Layout (orange dashed line) and EPUB reflowable (violet dotted line)

 Radar chart

Table 4: Average score per format (rows) and Thema codes (columns). “-” represents absence of files in the sample collection for given format and Thema category
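Thema code | A | C | D | F | J | K | L | M | N | P | Q | R | S | T | U | V | W | X | Y
PDF | 66 | - | - | - | 50 | 63 | 57 | 67 | - | 68 | - | - | - | 73 | - | 67 | - | - | -
RFL | 35 | 43 | 30 | 19 | 24 | 37 | 17 | 48 | 26 | 33 | 20 | 50 | 46 | 36 | 35 | 32 | 29 | - | 32
FXL | 62 | 66 | 49 | - | - | - | - | - | - | 61 | - | - | 64 | 65 | - | - | 60 | 61 | 50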

Focus on reflowable EPUB3

As reflowable EPUB3 is the format allowing full compliance with the EAA requirements, we judged it essential to dive deeper into the analysis of the remediation complexity of files in this format. In the collected sample files we found scores from 4 to 73 points. The vast majority have a score between 10 and 30.

The following charts and tables give a full representation. We will summarise here the key information we found:

  • most of the files have a medium remediation complexity, but a good number of files also have high scores (fig. 7);

  • images to fix (meaning textual alternatives to write) are the heaviest error, strongly affecting all categories except L (Law) and F (Fiction) (fig. 8);

  • most of the categories show a wide spread of errors per file, meaning that the Thema category alone is not sufficient to establish a segmented average remediation cost (fig. 9).

Figure 7: Bar chart showing the number of files (y-axis) per score (x-axis).

Bar chart showing a bimodal distribution.

List data: number of files per score

  • Score 0: 1 file

  • Score 1: 0 files

  • Score 2: 0 files

  • Score 3: 0 files

  • Score 4: 1 file

  • Score 5: 6 files

  • Score 6: 0 files

  • Score 7: 3 files

  • Score 8: 4 files

  • Score 9: 5 files

  • Score 10: 13 files

  • Score 11: 4 files

  • Score 12: 1 file

  • Score 13: 5 files

  • Score 14: 2 files

  • Score 15: 6 files

  • Score 16: 18 files

  • Score 17: 12 files

  • Score 18: 12 files

  • Score 19: 15 files

  • Score 20: 24 files

  • Score 21: 18 files

  • Score 22: 11 files

  • Score 23: 15 files

  • Score 24: 7 files

  • Score 25: 4 files

  • Score 26: 5 files

  • Score 27: 0 files

  • Score 28: 1 file

  • Score 29: 6 files

  • Score 30: 3 files

  • Score 31: 3 files

  • Score 32: 6 files

  • Score 33: 5 files

  • Score 34: 3 files

  • Score 35: 3 files

  • Score 36: 5 files

  • Score 37: 1 file

  • Score 38: 2 files

  • Score 39: 4 files

  • Score 40: 7 files

  • Score 41: 5 files

  • Score 42: 5 files

  • Score 43: 4 files

  • Score 44: 0 files

  • Score 45: 3 files

  • Score 46: 4 files

  • Score 47: 3 files

  • Score 48: 3 files

  • Score 49: 1 file

  • Score 50: 9 files

  • Score 51: 4 files

  • Score 52: 3 files

  • Score 53: 4 files

  • Score 54: 2 files

  • Score 55: 4 files

  • Score 56: 3 files

  • Score 57: 2 files

  • Score 58: 2 files

  • Score 59: 3 files

  • Score 60: 8 files

  • Score 61: 6 files

  • Score 62: 19 files

  • Score 63: 8 files

  • Score 64: 2 files

  • Score 65: 1 file

  • Score 66: 5 files

  • Score 67: 5 files

  • Score 68: 4 files

  • Score 69: 2 files

  • Score 70: 1 file

  • Score 71: 0 files

  • Score 72: 3 files

  • Score 73: 5 files

Figure 8: Bar chart showing the level and distribution of errors per Thema category

Bar chart

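Table 5: level and distribution of errors per Thema category
Thema | Publications | Images to fix | Unique ACE issues | Possibly wrong language | Files without headings
A | 6 | 26.7 | 2.7 | 1.0 | 2.2
C | 4 | 30.0 | 4.3 | 2.3 | 3.8
D | 30 | 19.3 | 2.1 | 2.8 | 2.4
F | 68 | 10.0 | 2.4 | 1.1 | 2.3
J | 23 | 13.5 | 2.6 | 2.7 | 2.1
K | 26 | 24.6 | 4.1 | 3.0 | 2.0
L | 11 | 6.4 | 3.1 | 3.5 | 0.8
M | 20 | 39.0 | 3.3 | 2.0 | 1.9
N | 16 | 15.0 | 2.9 | 2.9 | 1.8
P | 22 | 21.4 | 3.9 | 2.6 | 2.1
Q | 11 | 10.9 | 2.1 | 3.3 | 0.8
R | 5 | 40.0 | 4.0 | 0.8 | 3.0
S | 4 | 35.0 | 4.5 | 0.0 | 3.8
T | 3 | 23.3 | 5.3 | 2.0 | 3.0
target | 1 | 0.0 | 0.0 | 0.0 | 0.0
U | 5 | 24.0 | 4.0 | 2.4 | 2.8
V | 14 | 22.1 | 3.1 | 2.2 | 1.4
W | 13 | 20.0 | 2.6 | 2.2 | 1.2
Y | 33 | 23.6 | 2.5 | 0.8 | 2.0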

Figure 9: Candlestick chart showing number of publications, minimum, average and maximum scores per Thema codes

Candlestick chart, see data below.

Table 6: number of publications, minimum, average and maximum scores per Thema codes.

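Thema code | Publications | Average Score | Standard Deviation | Minimum Score | Maximum Score
A | 6 | 35 | 17.87 | 16 | 58
C | 4 | 43 | 14.11 | 31 | 61
D | 30 | 30 | 17.10 | 5 | 65
F | 68 | 19 | 7.89 | 5 | 62
J | 23 | 24 | 15.05 | 5 | 66
K | 26 | 37 | 20.54 | 11 | 66
L | 11 | 17 | 8.98 | 5 | 37
M | 20 | 48 | 13.10 | 18 | 67
N | 16 | 26 | 16.76 | 5 | 63
P | 22 | 33 | 17.32 | 10 | 77
Q | 11 | 20 | 9.63 | 9 | 36
R | 5 | 50 | 16.77 | 20 | 61
S | 4 | 46 | 16.47 | 23 | 62
T | 3 | 36 | 10.44 | 24 | 43
target | 1 | 0 | 0.00 | 0 | 0
U | 5 | 35 | 21.25 | 4 | 59
V | 14 | 32 | 20.67 | 8 | 67
W | 13 | 29 | 15.74 | 10 | 52
Y | 33 | 32 | 13.98 | 9 | 62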
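
To make the spread within each Thema category easier to compare, the sketch below derives two simple measures from the Table 6 figures: the score range (maximum minus minimum) and the coefficient of variation (standard deviation divided by average). The coefficient of variation is our illustrative addition here, not an indicator used in the analysis, and the `spread` function is hypothetical. The target row is left out because its average is zero.

```python
# (publications, average, standard deviation, minimum, maximum) per Thema code,
# copied from Table 6; the single "target" file is omitted (average of 0).
TABLE_6 = {
    "A": (6, 35, 17.87, 16, 58),
    "C": (4, 43, 14.11, 31, 61),
    "D": (30, 30, 17.10, 5, 65),
    "F": (68, 19, 7.89, 5, 62),
    "J": (23, 24, 15.05, 5, 66),
    "K": (26, 37, 20.54, 11, 66),
    "L": (11, 17, 8.98, 5, 37),
    "M": (20, 48, 13.10, 18, 67),
    "N": (16, 26, 16.76, 5, 63),
    "P": (22, 33, 17.32, 10, 77),
    "Q": (11, 20, 9.63, 9, 36),
    "R": (5, 50, 16.77, 20, 61),
    "S": (4, 46, 16.47, 23, 62),
    "T": (3, 36, 10.44, 24, 43),
    "U": (5, 35, 21.25, 4, 59),
    "V": (14, 32, 20.67, 8, 67),
    "W": (13, 29, 15.74, 10, 52),
    "Y": (33, 32, 13.98, 9, 62),
}


def spread(rows):
    """Return the score range and coefficient of variation per Thema code."""
    result = {}
    for code, (_pubs, avg, std, lowest, highest) in rows.items():
        result[code] = {
            "range": highest - lowest,
            "cv": round(std / avg, 2),
        }
    return result


spread_by_code = spread(TABLE_6)
# Codes ordered from the widest to the narrowest score range.
widest_first = sorted(spread_by_code, key=lambda c: spread_by_code[c]["range"], reverse=True)
```

A wide range combined with a high coefficient of variation indicates a category where the average score says little about an individual file.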

Figure 10: Bubble chart showing average score (X axis) against standard deviation (Y axis), one bubble per Thema code; bubble size represents the number of publications in the sample.

Bubble Chart, see data below.

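Table 7: Average score, standard deviation and number of publications per Thema code.
Thema code | Publications | Average Score | Standard Deviation
A | 6 | 35 | 17.87
C | 4 | 43 | 14.11
D | 30 | 30 | 17.10
F | 68 | 19 | 7.89
J | 23 | 24 | 15.05
K | 26 | 37 | 20.54
L | 11 | 17 | 8.98
M | 20 | 48 | 13.10
N | 16 | 26 | 16.76
P | 22 | 33 | 17.32
Q | 11 | 20 | 9.63
R | 5 | 50 | 16.77
S | 4 | 46 | 16.47
T | 3 | 36 | 10.44
target | 1 | 0 | 0.00
U | 5 | 35 | 21.25
V | 14 | 32 | 20.67
W | 13 | 29 | 15.74
Y | 33 | 32 | 13.98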

Recurrent accessibility issues detected

As a complement to the gap analysis at Thema category level, we listed the main known accessibility issues and tried to identify occurrences of these issues in the collected files. The following table summarises our findings. Results on each accessibility issue are detailed in the following sections.

Table 8: occurrences of main accessibility issues identified in collected files
Accessibility issue | Concern | Number of files | In % of the sample
Missing Accessibility Metadata | EPUB files | 343 | 100
Non reflowable content | all formats | 59 | 16
Missing or bad textual alternative for non decorative graphical resources | all formats | 312 | 83
Missing or bad Language Tag | EPUB files | 227 | 66
ACE Issues | EPUB files | 319 | 93

Missing Accessibility Metadata

  • Issue: no accessibility metadata are present

  • Rule: EPUB Accessibility 1.1, section 2 ‘Discoverability’

  • Applies to: EPUB files

  • Problem: the reader cannot know features or limitations they may experience while reading and the publication can’t be discovered through filtering

  • Indicators: calculated as missing metadata, minus inferred metadata 8, minus 3 (conformance metadata are counted as missing by ACE, but are not required by the EAA), with a minimum of 0

  • Collected files affected: 100%
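
The indicator above can be written as a small function. This is a minimal sketch of the stated formula only; the function and constant names are hypothetical, and the input counts are assumed to come from the ACE and RGTK reports.

```python
# Conformance metadata are counted as missing by ACE but are not required
# by the EAA, so three entries are discounted from the ACE count.
CONFORMANCE_ALLOWANCE = 3


def metadata_to_add(missing: int, inferred: int) -> int:
    """Missing metadata, minus inferred metadata, minus 3, floored at 0."""
    return max(0, missing - inferred - CONFORMANCE_ALLOWANCE)


# Hypothetical file: ACE reports 8 missing entries, RGTK infers 2 of them.
example = metadata_to_add(missing=8, inferred=2)
```

The floor at zero keeps a file with few missing entries from receiving a negative indicator.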

Non Reflowable content

  • Issue: the presentation of the content can’t be adjusted to fit the reader’s needs

  • Rule: EAA, Annex I, Section IV, f

  • Applies to: all formats

  • Problem: fixed displays prevent correct visual adaptation of the content

  • Indicators: pre-paginated formats

  • Collected files affected: 16%
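
For EPUB files, one way to detect a pre-paginated publication is to look for the `rendition:layout` property in the package document (OPF), whose value is `pre-paginated` for Fixed Layout and defaults to reflowable when absent. The sketch below assumes the OPF content has already been read from the EPUB container; `is_pre_paginated` and the sample documents are hypothetical, and per-spine-item overrides are ignored. PDF files are pre-paginated by nature and need no such check.

```python
import xml.etree.ElementTree as ET

OPF_NS = "{http://www.idpf.org/2007/opf}"


def is_pre_paginated(opf_xml: str) -> bool:
    """Return True if the package metadata declare a pre-paginated layout."""
    root = ET.fromstring(opf_xml)
    for meta in root.iter(f"{OPF_NS}meta"):
        if meta.get("property") == "rendition:layout":
            return (meta.text or "").strip() == "pre-paginated"
    return False  # no declaration: EPUB 3 content is reflowable by default


# Hypothetical, heavily trimmed package documents.
FXL_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata>
    <meta property="rendition:layout">pre-paginated</meta>
  </metadata>
</package>"""

RFL_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata/>
</package>"""
```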

Missing or bad textual alternative for non decorative graphical resources

  • Issue: No textual alternative is provided for informative graphical contents or the alternative is recognized as not meaningful (file name or one word)

  • Rule: WCAG, Guideline 1.1 Text Alternatives, Success Criterion 1.1.1 Non-text Content, level A

  • Applies to: all formats

  • Problem: non-visual readers using TTS or assistive technologies will lose important information necessary to understand the content

  • Indicators: calculated as follows: content images – content images with alt-text (more than one word and not equal to the filename) – decorative content images

  • Collected files affected: 83%
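
The indicator above can be sketched as follows. The `Image` class, the `decorative` flag and the comparison against the filename without its extension are assumptions made for the example; how images are extracted and how decorative images are recognised is outside this sketch.

```python
import os
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Image:
    src: str
    alt: Optional[str]
    decorative: bool = False  # assumed to be determined upstream


def has_meaningful_alt(img: Image) -> bool:
    """Alt text counts only if it has more than one word and is not the filename."""
    alt = (img.alt or "").strip()
    if len(alt.split()) <= 1:
        return False
    filename = os.path.basename(img.src)
    stem = os.path.splitext(filename)[0]  # assumption: also reject the bare stem
    return alt.lower() not in {filename.lower(), stem.lower()}


def images_to_fix(images: List[Image]) -> int:
    """Content images, minus those with meaningful alt text, minus decorative ones."""
    decorative = sum(1 for i in images if i.decorative)
    described = sum(1 for i in images if not i.decorative and has_meaningful_alt(i))
    return len(images) - described - decorative


# Hypothetical publication with four images.
images = [
    Image("img/fig1.png", "fig1.png"),  # alt repeats the filename
    Image("img/map.jpg", "Map of the Danube basin with national borders"),
    Image("img/ornament.png", "", decorative=True),
    Image("img/portrait.png", None),  # no alt text at all
]
to_fix = images_to_fix(images)
```

In this example two images need a textual alternative: the one whose alt text repeats the filename and the one with no alt text.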

Missing or bad Language Tag

  • Issue: words in different languages from the one of the main content are not identified as such

  • Rule: WCAG, Guideline 3.1 Readable, Success Criterion 3.1.2 Language of Parts, level AA

  • Applies to: all formats, but no way was found to identify that in PDF

  • Problem: non-visual readers using TTS or assistive technologies will experience strange or not understandable reading because of mispronunciation, incorrect braille rendering and bad hyphenations

  • Indicators: wrong language assertions are detected by a dedicated algorithm, which targets two or more consecutive words in a sentence

  • Collected files affected: 66%
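
The "two or more consecutive words" criterion can be illustrated with a run-detection sketch. This is not the project's algorithm, which is not described here: the `foreign_runs` function is hypothetical, and the word-level check is a toy stand-in (a small hint list) for whatever language guesser is actually used.

```python
from typing import Callable, List


def foreign_runs(
    words: List[str],
    is_foreign: Callable[[str], bool],
    min_run: int = 2,
) -> List[List[str]]:
    """Return runs of at least `min_run` consecutive words flagged as foreign."""
    runs: List[List[str]] = []
    current: List[str] = []
    for word in words:
        if is_foreign(word):
            current.append(word)
            continue
        if len(current) >= min_run:
            runs.append(current)
        current = []
    if len(current) >= min_run:
        runs.append(current)
    return runs


# Toy classifier: a hard-coded hint list, for illustration only.
FRENCH_HINTS = {"la", "vie", "en", "rose"}


def toy_is_french(word: str) -> bool:
    return word.lower().strip(".,;:!?") in FRENCH_HINTS


sentence = "She hummed la vie en rose while walking home".split()
runs = foreign_runs(sentence, toy_is_french)
```

Requiring at least two consecutive words avoids flagging isolated loanwords, at the cost of missing single foreign words that would also need a language tag.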

ACE issues

Issues reported by ACE. The following table shows the number of files, and the corresponding percentage of the samples, containing errors per severity level. Very few files (only 5%) have critical issues, but 92% have serious issues which will need to be evaluated for remediation.

Table 9: number and percentage of collected files affected per ACE issue severity level.

ACE issue | Number of files | % of the samples
critical | 19 | 5
serious | 343 | 92
moderate | 132 | 35
minor | 172 | 46

A larger table of unique ACE issues has been produced for the use of the project and the building of testing files. The details of those errors are reported in the following tables. One shows the errors for which we proposed a detailed remediation complexity KPI, while the second shows the errors that are not addressed by a specific calculation.

Table 10: percentage of the sample affected by ACE errors for which a detailed remediation complexity KPI has been established.

ACE Issue | % of the samples affected
Epub-Lang:Serious | 82.83
Metadata-Accessmode:Serious | 46.81
Metadata-Accessmodesufficient:Moderate | 71.75
Metadata-Accessibilityfeature:Serious | 49.03
Metadata-Accessibilityhazard:Serious | 52.91
Metadata-Accessibilitysummary:Moderate | 71.47
Image-Alt:Critical | 23.82
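
The five metadata rules listed above correspond to schema.org accessibility properties declared in the EPUB 3 package document. As a rough illustration of what these rules look for, the sketch below lists which of the five properties are absent from an OPF. It is not a substitute for ACE: the function name and sample document are hypothetical, value validity is not checked, and EPUB 2 style `name`/`content` metadata are not handled.

```python
import xml.etree.ElementTree as ET

OPF_NS = "{http://www.idpf.org/2007/opf}"

EXPECTED_PROPERTIES = (
    "schema:accessMode",
    "schema:accessModeSufficient",
    "schema:accessibilityFeature",
    "schema:accessibilityHazard",
    "schema:accessibilitySummary",
)


def missing_accessibility_metadata(opf_xml: str) -> list:
    """Return the expected schema.org properties not declared in the package."""
    root = ET.fromstring(opf_xml)
    present = {meta.get("property") for meta in root.iter(f"{OPF_NS}meta")}
    return [prop for prop in EXPECTED_PROPERTIES if prop not in present]


# Hypothetical package document declaring only two of the five properties.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata>
    <meta property="schema:accessMode">textual</meta>
    <meta property="schema:accessibilityHazard">none</meta>
  </metadata>
</package>"""

missing = missing_accessibility_metadata(SAMPLE_OPF)
```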

Table 11: percentage of the sample affected by ACE errors for which no detailed remediation complexity KPI has been established.

ACE Issue | % of the samples affected
Empty-Table-Header:Minor | 45.19
Empty-Heading:Minor | 69.25
Heading-Order:Moderate | 47.33
Html-Has-Lang:Serious | 51.07
Link-In-Text-Block:Serious | 68.98
Color-Contrast:Serious | 27.01
Metadata-Accessibilityhazard-Invalid:Moderate | 28.88
Aria-Allowed-Role:Minor | 29.41
Aria-Roles:Minor | 23.80
Epub-Pagelist-Broken:Serious | 5.08
Epub-Type-Has-Matching-Role:Minor | 8.02
Landmark-Unique:Moderate | 22.99
Epub-Toc-Order:Serious | 27.54
Link-Name:Serious | 26.20
Document-Title:Serious | 15.24
Metadata-Accessibilityfeature-Invalid:Minor | 12.57

Potential accessibility issues undetectable through automated analysis

The following are issues that cannot be detected automatically and will require ad hoc human testing.

Reflowable restrictions

  • Issue: adjusting the presentation leads to letters or sentences overlapping or making the content visually unreadable in any way

  • Rule: EAA, Annex I, Section IV, f

  • Applies to: reflowable EPUB

  • Problem: fixed styles prevent correct visual adaptation of the content

Specific contents to be verified manually if found in files

Some very specialised content types, such as forms, scripts, maths, video and audio, are not commonly used in ebooks, but as they may occur, they will need to be included in remediation testing. The following table shows that very few occurrences were found in the sample collection.

Table 12: number and percentage of collected files per specific content.

Content | Number of files | In % of the samples
Forms | 1 | 0.3
Scripts | 17 | 5
Maths contents | 7 | 2
Video contents | 0 | 0
Audio contents | 7 | 2

Classification for remediation

The following classification lists the remediation workflows to test. Six workflows are spread across the different categories; they are summarised here:

  1. PDF to PDF/UA (compliant to WCAG 2.1, level AA)

  2. PDF to Reflowable EPUB3 (compliant EPUB Accessibility 1.1, WCAG 2.1, level AA)

  3. FXL to «accessible» FXL (compliant EPUB Accessibility 1.1, WCAG 2.1, level AA)

  4. FXL to Reflowable EPUB3 (compliant EPUB Accessibility 1.1, WCAG 2.1, level AA)

  5. EPUB2 to EPUB3 (compliant EPUB Accessibility 1.1, WCAG 2.1, level AA)

  6. Reflowable EPUB to Reflowable EPUB3 (compliant EPUB Accessibility 1.1, WCAG 2.1, level AA)
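
The six workflows above can be grouped by source format in a simple lookup table, for instance to select the test workflows applicable to a given file. The format keys (`PDF`, `FXL`, `EPUB2`, `EPUB3-RFL`) and the helper function are hypothetical labels chosen for this sketch.

```python
# Remediation workflows to test, grouped by source format (keys are assumptions).
WORKFLOWS = {
    "PDF": ["PDF to PDF/UA", "PDF to Reflowable EPUB3"],
    "FXL": ["FXL to accessible FXL", "FXL to Reflowable EPUB3"],
    "EPUB2": ["EPUB2 to EPUB3"],
    "EPUB3-RFL": ["Reflowable EPUB to Reflowable EPUB3"],
}


def workflows_for(source_format: str) -> list:
    """Return the remediation workflows to test for a given source format."""
    return WORKFLOWS.get(source_format, [])


pdf_options = workflows_for("PDF")
total_workflows = sum(len(options) for options in WORKFLOWS.values())
```

Pre-paginated formats each have two options (improve in place or convert to reflowable EPUB3), while reflowable files have a single path.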

PDFs

As a representation of the printed page, the PDF format has accessibility features that are limited in terms of flexibility and choice in the presentation of the content (for details about the format and its known limitations, refer to the Annex Ebooks files formats).

We see two possible remediation options for the PDF files:

  1. improve the file to reach the PDF/UA standard with WCAG 2.1 AA conformance, allowing the file to support the text zoom functionality provided by most reading applications. These files will not fully comply with the EAA requirements, but will provide state-of-the-art compliance.

  2. convert the file to a reflowable EPUB to reach full compliance with EAA requirements.

Fixed Layout EPUBs

Fixed layout EPUBs are subject to the same visual adjustment limitations as PDF: changing font type and spaces between letters, words, lines, or paragraphs is not possible. The possible remediations are:

  1. improve the file to reach WCAG 2.1 AA and EPUB accessibility 1.1. As in the case of PDF files, in Fixed Layout EPUB some accessibility features are supported and others are not. If the file is made according to the specifications, however, it must support text zoom functionality provided by most of the reading applications.

  2. convert the file to a reflowable EPUB to reach full compliance with EAA requirements.

Reflowable EPUBs

Reflowable EPUBs are known to be fully compliant with EAA requirements 9 if they conform to WCAG 2.1 AA (or higher) and EPUB Accessibility 1.1. We found different types of remediation needs:

  1. EPUB2 files need to be converted to reflowable EPUB3;

  2. Reflowable EPUB3 files need to become compliant with WCAG 2.1 AA and EPUB Accessibility 1.1.

Outcomes

From this gap analysis, we were able to establish a classification of remediation needs and build test files for each of the classifications.

Direct outcomes of this work are

  • a remediation complexity assessment methodology applicable to collections of files;

  • a view of the remediation complexity per Thema category;

  • a view of main accessibility issues detected.

The heavy presence of images and visual resources appears to be the main criterion separating the categories that will require more remediation effort (Medicine, Earth sciences and Sports) from those that will be easier to remediate (Fiction, Philosophy, Religion and Law).

For the next steps of the ABE Lab project, this work allows us to establish a testing classification and methodology, and to build meaningful files for testing remediation tools.


  1. A detailed list of Norms and Standards is available as an annex to this document.

  2. Available at https://www.w3.org/TR/epub-a11y-eaa-mapping/

  3. EAA Annex I, Section IV, f) iii), available at https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32019L0882#d1e32-100-1

  4. On October 5, 2023, version 2.2 of the WCAG was officially released as a W3C recommendation. This update does not impact or compromise the analysis, research work and testing carried out in the context of the ABE Lab project.

  5. PDF/UA standard, ISO 14289-1:2014: https://www.iso.org/standard/64599.html

  6. Target file publicly available at https://www.fondazionelia.org/wp-content/uploads/2023/10/European_Stories_2023_EUPL.epub

  7. Remediation complexity indicators are available to the publisher partners of the project.

  8. Inferred metadata are found by RGTK, meaning that an accessibility feature can be seen in the file even if the publisher did not declare it. No remediation is therefore needed other than declaring the feature, which RGTK already automates.

  9. See the EPUB Accessibility - EU Accessibility Act Mapping group note established by W3C.