The authors HPM and MF selected 95 candidate SAHL-D terms from a Dutch thesaurus of health terms (http://www.thesauruszorgenwelzijn.nl), of which 20 were related to medical specialties, tests and treatments (e.g. oncology, defibrillation), 15 to bodily functions and health behaviors (e.g. biorhythm, hygiene), 25 to the human body (e.g. pigment, pancreas) and 35 to diseases and symptoms (e.g. embolus, hemophilia). The chosen terms were potentially relevant to the general public. We avoided acronyms and terms referring to phenomena known only to medical professionals and particular patient groups. All terms were provided with a correct and an incorrect association word, using medical dictionaries when necessary. For example, ‘hemophilia’ could be associated with ‘clotting’ (correct) or ‘immunity’ (incorrect). The target word, the two associates and a ‘Do not know’ option were presented on cards, using large print.
Potential participants for the pretest were approached by undergraduate students (Language and communication) in the waiting room of the outpatient clinic of Internal Medicine at a large university hospital. Inclusion criteria were age ≥ 18 years and ability to communicate in Dutch. Those willing to participate signed an informed consent form, filled in a questionnaire and participated in a personal interview with one of the students.
The questionnaire assessed general vocabulary skills based on a written multiple choice vocabulary test used in the 8th grade of Dutch pre-vocational secondary education. Each item presents a sentence with one word underlined; the respondent has to choose the correct meaning of that word from the four possible meanings that are offered.
In the personal interview, the SAHL-D was administered by handing the participant the 95 cards, one by one. Word recognition was assessed by asking the participant to read the word out loud. The instructions for students contained information on correct phonetic pronunciation and the correct stress of each syllable in each word. Word comprehension was assessed by asking participants to choose the correct word associated with the ‘target’ word, or to use the ‘Do not know’ option; participants were encouraged not to guess the answer.
In the pretest we analyzed item scores and distributions of proportions correct to select the items with the best discriminative ability. Reliability of the set of 95 items was analyzed by Cronbach’s alpha. Analyses of variance (ANOVA) were used to assess relations between educational level and scores. The feasibility was assessed by noting the administration time for a subset of participants. Finally, we examined whether word features (such as opaque orthography and corpus frequency) were related to recognition and comprehension of each word.
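As an illustration of the reliability statistic mentioned above, the following is a minimal sketch of Cronbach’s alpha for dichotomously scored items. The function name and the toy score matrix are hypothetical and do not reproduce the study data or the software used by the authors.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents x items matrix of item scores."""
    n_items = len(scores[0])
    # Variance of each item across respondents
    item_vars = [pvariance([row[i] for row in scores]) for i in range(n_items)]
    # Variance of the total (sum) score across respondents
    total_var = pvariance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical toy data: 5 respondents x 4 items (1 = correct, 0 = incorrect)
toy_scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(toy_scores)
print(round(alpha, 2))
```

Because the same variance estimator is applied to the items and to the total score, using sample instead of population variances would give the same alpha.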
We selected a subset of the pretest item pool by rejecting items that were scored correctly for recognition or comprehension by at least 95% of the participants. This left 33 items that mainly refer to medical specialties, tests and treatments on the one hand, and diseases and symptoms on the other (Additional file 1). Most of the terms referring to body parts, bodily functions and health behaviors did not meet the inclusion criteria. We then constructed a more demanding semantic test component. To assess word comprehension, instead of presenting 2 associated words we decided to present 3 candidate meanings of each word (1 correct, 2 distractors), together with a ‘Do not know’ option. As illustrated in Additional file 2, each item presents a distractor that is more or less related and a distractor that is more obviously incorrect. Whereas the semantic test component in the pretest measured ‘surface-level familiarity’ (knowing which notions are related to the term and which are not), the SAHL-D aims to tap into ‘concept-level familiarity’ (knowing what the term actually refers to).
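The ceiling-based rejection rule can be sketched as follows. The sketch assumes that an item is rejected when either its recognition or its comprehension proportion reaches 0.95; the proportions shown are invented for illustration and are not pretest results.

```python
def select_items(props, ceiling=0.95):
    """Keep terms whose recognition AND comprehension proportions are below the ceiling.

    props maps each term to (proportion correct recognition,
    proportion correct comprehension).
    """
    return [
        term
        for term, (recognition, comprehension) in props.items()
        if recognition < ceiling and comprehension < ceiling
    ]


# Hypothetical proportions correct (not the pretest results)
toy_props = {
    "hygiene": (0.99, 0.98),     # too easy on both components
    "pancreas": (0.97, 0.90),    # too easy on recognition only
    "embolus": (0.80, 0.55),
    "hemophilia": (0.85, 0.70),
}
kept = select_items(toy_props)
print(kept)
```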
Participants for the validation study were drawn from a test panel of The Netherlands Institute for Health Services Research, which is a list of people who are periodically invited to participate in various health-related research studies. Inclusion criteria were age 18–75 years, and ability to read, write and converse in Dutch. Participants were approached by mail with an online questionnaire; participants were asked to indicate whether they were willing to participate in a telephone interview later on. Only data of consenting participants were used.
The following variables were assessed in the online questionnaire:
Background characteristics: Gender; age; educational attainment level; ethnic background; native language; whether they work(ed) in health care; and how often they had contact with a professional care provider in the past year. Following the International Standard Classification of Education (ISCED), educational level was categorized as low (level 0–2: early childhood; primary education; lower secondary education); intermediate (level 3–5: upper secondary; post secondary; short cycle tertiary); and high (level 6–8: bachelor; master; doctoral).
General vocabulary: In the absence of a brief vocabulary test for Dutch adults, we created a general vocabulary measure by selecting 50 terms typical of formal Dutch prose style, such as ‘interruption’ and ‘precarious’, and presenting 4 alternatives together with a ‘Do not know’ option for each item; participants were encouraged to choose this latter option in case of serious doubt. In the final scale we left out 2 of the 50 items with negative rest-item correlations (due to problems with the alternatives). For the resulting 48-item test, alpha was 0.87.
Prose literacy: In this study, we sought to validate our literacy measure by comparing it to a general test of higher-order reading skills, especially the contextual reconstruction of meaning in prose contexts (as opposed to word knowledge). Prose literacy was assessed by a subset of items from a reading comprehension test widely used for 9th graders in Dutch pre-university secondary education (total 16 items). The test does not require specific topic knowledge. Specifically, we used four reading passages and 16 multiple choice text comprehension items about argumentative relations, relations between sentences and paragraphs, and main ideas for texts or paragraphs. Two questions ask for sentence-level paraphrases. After dropping an item with a low rest-item correlation, Cronbach’s alpha was 0.75 for the remaining 15 items. We defined adequate and inadequate prose literacy with reference to the mean proportion for the lowest educational group (0.44). We stipulated that scores ≤ 6 (corresponding to a proportion of 0.4) reflect inadequate prose literacy and that scores of ≥ 7 reflect adequate prose literacy.
Health Literacy Survey-Europe Q16: A short version of the Health Literacy Survey-Europe (HLS-EU) was used to assess subjective health literacy. The HLS-EU was derived from a theoretical model that integrates health care, disease prevention and health promotion, and four information processing stages (access, understand, appraise and apply) related to health-relevant decision-making and tasks.
The HLS-EU-Q16 consists of 16 items scored on a 4-point scale (very difficult to very easy). For each item the option ‘Do not know’ was also provided.
The NVS-D and the SAHL-D were administered in a telephone interview. The tests were sent as pdf files by email, not beforehand but at the start of the interview. As soon as the mail arrived, the participant started working on the NVS-D, followed by the SAHL-D.
Newest Vital Sign (NVS): The NVS is a 6-question tool to assess an individual’s ability to find and interpret information (both text and numerical information) on an ice cream nutrition label. Earlier, Fransen et al. translated and tested the NVS in Dutch (NVS-D); the cross-cultural adaptation and validation of the NVS-D has been submitted for publication.
During the interview, we sent one file with the ice cream label and another one with the questions; respondents were asked to open both files on their screen. The interviewer read the questions out loud while the respondents read the questions and looked at the label on their screen.
SAHL-D: The SAHL-D started with a title page and provided a single word per page, with the candidate meanings underneath it. The participant proceeded page by page. The item order was maintained, except in rare cases when words were skipped accidentally (by pressing the arrow button more than once). In those cases, the interviewer steered the participant back to the omitted word after the current item had been completed. At any point during the test, the participant saw only a single target word on the screen. Upon opening a new page, participants were given 5 seconds to pronounce the word, after which a multiple choice option was to be chosen immediately. This procedure practically rules out the possibility of using dictionaries. The participants worked alone (possible consultations with others would have been overheard). Administration of the SAHL-D took (on average) 6.39 min.
In the validation study we assessed the proportions of correct answers and score distributions of the SAHL-D. Feasibility was assessed by calculating percentage refusals and acceptance and the time to complete the SAHL-D. Reliability was tested with Cronbach’s alpha.
To explore the possibility of a shorter SAHL-D, we created an item subset by first discarding recognition items with rest-item correlations of ≤ 0.10 in the 33-item reliability analysis and/or a proportion correct of ≥ 0.95. This left 22 recognition items. We included the shorter 22-item set (SAHL-D22) in the analyses to illustrate the potential for a briefer SAHL-D.
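The shortening rule can be illustrated with a small sketch that computes, for each item, the correlation between the item and the sum of the remaining items, and then applies the two thresholds. The helper names and the toy matrix are hypothetical; treating a constant column as having a correlation of zero is an assumption of this sketch, not something stated in the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant column counts as uncorrelated
    return sxy / sqrt(sxx * syy)


def shortlist(scores, min_r=0.10, ceiling=0.95):
    """Return indices of items with rest-item correlation > min_r
    and proportion correct < ceiling."""
    keep = []
    for i in range(len(scores[0])):
        item = [row[i] for row in scores]
        rest = [sum(row) - row[i] for row in scores]  # total score without item i
        p_correct = sum(item) / len(item)
        if pearson(item, rest) > min_r and p_correct < ceiling:
            keep.append(i)
    return keep


# Hypothetical toy data: 6 respondents x 4 recognition items
toy_scores = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
]
kept = shortlist(toy_scores)
print(kept)
```

In this toy matrix, item 0 is answered correctly by everyone and item 3 correlates negatively with the rest of the scale, so both are discarded.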
Construct validity was examined by analyzing association patterns of the SAHL-D, NVS-D, HLS-EU-Q16, educational level, prose literacy and vocabulary scores in relation to predefined expectations about the size and pattern of the associations.
The following hypotheses were formulated:
Regarding known-groups validity, we expected the SAHL-D to be able to distinguish between low, intermediate and high levels of education based on significant differences in the mean scores.
Because of partly overlapping constructs, we expected strong correlations between the SAHL-D and general vocabulary, prose literacy and the NVS-D.
We expected a significant (but not sizeable) correlation between the SAHL-D (objective measure) and the HLS-EU-Q16 (subjective measure).
Regarding associations with socio-demographic variables, earlier literacy research [22, 23] led us to expect a strong positive association between the SAHL-D and educational level, and a moderate negative correlation between SAHL-D and age; no significant gender difference was expected.
ANOVA with pairwise comparisons, Bonferroni-corrected for multiple testing, was used to test differences in the SAHL-D scores by educational level, age, gender, and profession (working in health care). The associations of the SAHL-D with general vocabulary, prose literacy, NVS-D, and HLS-EU-Q16 were tested with Pearson’s correlations and with stepwise linear regression analyses to correct for background variables.
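To make the multiple-testing correction concrete, the sketch below applies a Bonferroni adjustment to the three pairwise comparisons between educational levels. The raw p-values are invented for illustration; computing them would require a statistics package and is outside the scope of this sketch.

```python
from itertools import combinations


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


levels = ["low", "intermediate", "high"]
pairs = list(combinations(levels, 2))  # the three pairwise comparisons

# Hypothetical unadjusted p-values, one per pair (not study results)
raw_p = [0.004, 0.0001, 0.03]
adjusted = dict(zip(pairs, bonferroni(raw_p)))

for pair, p in adjusted.items():
    print(pair, round(p, 4), "significant" if p < 0.05 else "not significant")
```

With three comparisons, an unadjusted p-value of 0.03 no longer falls below 0.05 after correction, which is the conservative behavior the adjustment is meant to provide.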
We used receiver operating characteristic (ROC) curves with adequate prose literacy as the reference standard to determine optimal cut-off scores for identifying objective HL.
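A cut-off search of this kind can be sketched as follows. The text does not state which criterion defined the optimal cut-off; maximizing Youden’s J (sensitivity + specificity − 1) is assumed here purely for illustration, and all scores are hypothetical. The reference standard uses the prose literacy threshold given earlier (scores of ≥ 7 are adequate).

```python
def roc_points(scores, adequate):
    """Return (cutoff, sensitivity, specificity) for every observed score.

    A respondent is classified as adequate when score >= cutoff.
    """
    n_pos = sum(adequate)
    n_neg = len(adequate) - n_pos
    points = []
    for cutoff in sorted(set(scores)):
        true_pos = sum(1 for s, a in zip(scores, adequate) if a and s >= cutoff)
        true_neg = sum(1 for s, a in zip(scores, adequate) if not a and s < cutoff)
        points.append((cutoff, true_pos / n_pos, true_neg / n_neg))
    return points


def best_cutoff(points):
    """Pick the cutoff with the highest Youden's J (an assumed criterion)."""
    return max(points, key=lambda p: p[1] + p[2] - 1)[0]


# Hypothetical data: SAHL-D scores and prose literacy scores (0-15)
sahl_scores = [12, 15, 18, 20, 22, 25, 27, 30]
prose_scores = [3, 5, 8, 6, 7, 9, 11, 13]
adequate = [p >= 7 for p in prose_scores]  # reference standard from the text

cutoff = best_cutoff(roc_points(sahl_scores, adequate))
print(cutoff)
```

Other criteria, such as the point closest to the top-left corner of the ROC plot or a minimum required sensitivity, can yield a different cut-off on the same data.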