Data cleaning

In the first three data-collection waves, the research team in each country was responsible for cleaning the national data sets according to the ESPAD guidelines. Beginning with the 2007 wave, a central cleaning process was introduced: raw national data are delivered, merged into a joint database and then cleaned centrally. The major advantage of this arrangement is that questionnaires from all countries are treated in the same way, which improves comparability. National research teams can still highlight, but not discard, any questionnaires that they consider questionable. Those questionnaires are assigned a special code and are included in the national data sets sent for centralised data cleaning.

It has previously been concluded that the shift to a standardised common cleaning approach did not result in any major problems with comparability of data from previous ESPAD surveys, even though there might conceivably have been a minor effect on low-prevalence (about 1 %) behaviours (Hibell et al., 2012).

The standard cleaning procedure involved two phases: the logical substitution of missing values and the deletion of unusable cases. Only students born in 1999 (or equivalent) were considered in this process. Initially, all cases where information about gender was missing were excluded from the database. The other major reason for questionnaire exclusion was poor data quality. All questionnaires with responses to fewer than half of the core items were discarded, as were all questionnaires where the respondent appeared to have followed patterns involving repetitive marking of extreme values.
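The exclusion rules described above can be sketched in code. The following Python example is only an illustration under stated assumptions, not the ESPAD cleaning syntax: the item names, the coding of missing answers as None, the scale maxima and, in particular, the rule used to detect repetitive extreme marking (every answered core item sits at the top of its scale) are all hypothetical.

```python
from typing import Optional

# Hypothetical core items, each mapped to the assumed maximum of its response scale.
CORE_ITEMS = {"cigarettes": 7, "alcohol": 7, "intoxication": 7, "cannabis": 7}


def exclusion_reason(q: dict) -> Optional[str]:
    """Return why a questionnaire would be discarded, or None if it is kept."""
    # Rule 1: cases with missing gender are excluded first.
    if q.get("gender") is None:
        return "missing gender"
    # Rule 2: responses to fewer than half of the core items.
    answered = {item: q[item] for item in CORE_ITEMS if q.get(item) is not None}
    if len(answered) < len(CORE_ITEMS) / 2:
        return "fewer than half of core items answered"
    # Rule 3 (assumed operationalisation): every answered item is at its scale maximum.
    if all(value == CORE_ITEMS[item] for item, value in answered.items()):
        return "repetitive extreme responses"
    return None


def clean(questionnaires: list) -> list:
    """Keep only the questionnaires that pass all three rules."""
    return [q for q in questionnaires if exclusion_reason(q) is None]
```

Returning the reason rather than a plain yes/no makes it possible to tabulate how many cases each rule removes, which is the kind of breakdown summarised in Table C.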

Across all ESPAD countries, an average of 1.8 % (range 0.0-7.6 %) of the questionnaires were excluded because of poor data quality or missing information on gender (Table C). Relatively large proportions of the questionnaires from Cyprus, Norway and Austria were excluded (3.8-4.2 %), and a particularly large proportion was removed from the Latvian data (7.6 %). This indicates that the quality of the collected data in those countries, and in Latvia in particular, tended to be lower than in the average ESPAD country. If the ESPAD average is calculated without Latvia, it drops from 1.8 % to 1.6 %.

Roughly half of the countries used the opportunity to flag questionnaires considered to be of questionable quality. On average, 58 % of those questionnaires were later removed in the central cleaning process. Table D shows the impact of discarding questionnaires on eight different measures of lifetime substance use. For all eight measures the prevalence rates were reduced. This reduction was, however, very limited, ranging between 0.1 and 0.4 percentage points at the all-countries level. The three countries where the discarding of questionnaires had the most visible impact in terms of percentage points were Bulgaria, Cyprus and Latvia. In relative terms, at the all-countries level, the reduction was most obvious for the fake drug ‘relevin’. According to Table D, reported lifetime relevin use drops by more than a third when bad data are discarded. This indicates that the standardised syntax for deleting questionnaires targets students with less trustworthy responses relatively well.

Another part of the data-cleaning process is the logical substitution of missing values, which is carried out in a conservative fashion. Where students had indicated that they had never used a specific substance and then did not respond to further questions about such use, the missing values were substituted with ‘no use’ for that particular substance. However, no substitutions were made if there was any contrary indication of use.
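The conservative substitution rule can be illustrated with a short Python sketch. It is an assumption-laden example rather than the actual ESPAD procedure: the variable names, the grouping of follow-up items under a lifetime item, the code 1 for ‘no use’ and None for a missing answer are all hypothetical.

```python
NO_USE = 1  # assumed response code for "never" / "0 occasions"

# Hypothetical layout: a lifetime item and the follow-up items on the same substance.
FOLLOW_UPS = {
    "cannabis_lifetime": ["cannabis_12m", "cannabis_30d"],
}


def substitute_missing(q: dict) -> dict:
    """Return a copy of q with missing follow-ups set to NO_USE where that is safe."""
    out = dict(q)
    for lifetime, follow_ups in FOLLOW_UPS.items():
        # Only students who explicitly reported never having used the substance qualify.
        if out.get(lifetime) != NO_USE:
            continue
        answered = [out[f] for f in follow_ups if out.get(f) is not None]
        # Any answered follow-up that reports use is a contrary indication: leave the gaps.
        if any(value != NO_USE for value in answered):
            continue
        for f in follow_ups:
            if out.get(f) is None:
                out[f] = NO_USE
    return out
```

Working on a copy keeps the raw answers available, so non-response rates can be compared before and after substitution.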

Table E presents the non-response rates before the logical substitution of missing values and the impact of the substitution on those rates. For the seven substance use variables shown in the table, the average reduction of the non-response rates was rather small, ranging from 0.1 to 0.5 percentage points. With a few exceptions, the reduction was relatively limited for all seven variables in most countries. The single highest figure is found for Norway, where the non-response rate for lifetime inhalants use was reduced by 2.7 percentage points. Norway, the former Yugoslav Republic of Macedonia and Latvia were the countries where the logical substitution of missing values had the biggest impact. However, such small reductions of the non-response rates hardly have any effect on the final prevalence estimates.
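How a non-response rate and its reduction could be computed is sketched below in Python. The codes -1 (no response) and -2 (multiple responses, treated as -1) follow the note under Table E; the function name and the sample values are hypothetical and serve only to show the arithmetic.

```python
NO_RESPONSE = -1
MULTIPLE_RESPONSES = -2  # recoded to NO_RESPONSE for comparability


def non_response_rate(values: list) -> float:
    """Percentage of answers that count as non-response."""
    if not values:
        raise ValueError("no answers supplied")
    recoded = [NO_RESPONSE if v == MULTIPLE_RESPONSES else v for v in values]
    return 100 * sum(v == NO_RESPONSE for v in recoded) / len(recoded)


# Made-up answers to one lifetime item, before and after logical substitution.
before = [1, -1, -2, 2, 1, -1, 1, 1, 1, 1]
after = [1, 1, -2, 2, 1, -1, 1, 1, 1, 1]

# The reduction is a difference between two percentages, i.e. percentage points.
reduction = non_response_rate(before) - non_response_rate(after)
```

In this made-up sample, one of the three non-responses is resolved by substitution, so the rate falls by one answer in ten.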

On the whole, the standardised data-cleaning process did not greatly influence the lifetime-prevalence figures. The single largest decrease in relative terms (a drop of about one third) concerned students claiming to have used the dummy drug relevin, and it was accounted for by the discarding of questionnaires with repetitive extreme response patterns.
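The absolute and relative reductions can be recomputed from the AVERAGE row of Table D. The short Python calculation below uses only those published averages; because they are rounded to one decimal, the derived figures are approximate. The function and variable names are chosen for this example only.

```python
# AVERAGE row of Table D: (before deletion, final data), in per cent.
AVERAGE_LTP = {
    "Cigarettes": (46.4, 46.1),
    "Alcohol": (80.9, 80.8),
    "Been intoxicated": (37.1, 36.7),
    "Cannabis": (16.9, 16.5),
    "Inhalants": (7.8, 7.4),
    "Ecstasy": (2.7, 2.2),
    "Tranquillisers or sedatives": (6.7, 6.3),
    "Relevin": (1.1, 0.7),
}


def reductions(before: float, after: float) -> tuple:
    """Return (absolute reduction in percentage points, relative reduction in per cent)."""
    absolute = before - after
    relative = absolute / before * 100
    return absolute, relative


for substance, (before, after) in AVERAGE_LTP.items():
    absolute, relative = reductions(before, after)
    print(f"{substance:<28} {absolute:4.1f} pp  {relative:5.1f} %")
```

For relevin the relative reduction is (1.1 - 0.7) / 1.1, roughly 36 %, which is the drop of more than a third noted in the text. For alcohol the same calculation gives well under 1 %.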

Table D. Changes in lifetime prevalence (LTP) of different substances due to deletion of bad data a in students born in 1999 b. Percentages. ESPAD 2015

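| Country | Cigarettes LTP (before deletion) | Cigarettes LTP (final data) | Alcohol LTP (before deletion) | Alcohol LTP (final data) | Been intoxicated LTP (before deletion) | Been intoxicated LTP (final data) | Cannabis LTP (before deletion) | Cannabis LTP (final data) | Inhalants LTP (before deletion) | Inhalants LTP (final data) | Ecstasy LTP (before deletion) | Ecstasy LTP (final data) | Tranquillisers or sedatives (non-medical use) LTP (before deletion) | Tranquillisers or sedatives (non-medical use) LTP (final data) | Relevin LTP (before deletion) | Relevin LTP (final data) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Albania | 37.7 | 37.3 | 60.5 | 60.2 | 22.4 | 21.7 | 7.6 | 7.1 | 4.0 | 3.6 | 2.9 | 2.6 | 8.0 | 7.6 | 1.6 | 1.3 |
| Austria c | 53.7 | 53.2 | 88.6 | 88.6 | 49.7 | 49.2 | 20.0 | 19.5 | 9.8 | 9.5 | 2.2 | 2.0 | 4.4 | 4.1 | 0.5 | 0.3 |
| Belgium (Flanders) | 31.1 | 31.2 | 79.5 | 79.5 | 29.7 | 29.7 | 17.4 | 17.4 | 2.8 | 2.8 | 3.2 | 3.2 | 6.1 | 6.1 | 0.4 | 0.4 |
| Bulgaria | 56.0 | 55.5 | 86.4 | 86.4 | 45.6 | 44.7 | 27.9 | 26.9 | 4.1 | 3.0 | 6.5 | 5.2 | 4.8 | 3.6 | 3.6 | 2.6 |
| Croatia | 62.5 | 62.1 | 92.3 | 92.3 | 47.6 | 46.9 | 22.4 | 21.5 | 25.9 | 25.3 | 3.3 | 2.4 | 5.1 | 4.2 | 1.5 | 0.8 |
| Cyprus | 36.0 | 35.3 | 88.6 | 88.4 | 33.0 | 31.9 | 8.5 | 7.2 | 9.3 | 8.1 | 3.6 | 2.5 | 5.8 | 4.6 | 3.3 | 1.9 |
| Czech Republic | 63.7 | 63.5 | 96.2 | 96.1 | 51.8 | 51.4 | 36.9 | 36.5 | 5.7 | 5.5 | 2.8 | 2.7 | 16.0 | 15.8 | 0.5 | 0.3 |
| Denmark | 39.3 | 38.9 | 92.3 | 92.4 | 60.2 | 60.0 | 12.7 | 12.4 | 3.8 | 3.6 | 0.8 | 0.5 | 2.6 | 2.3 | 0.5 | 0.2 |
| Estonia | 59.9 | 59.8 | 86.4 | 86.4 | 37.9 | 37.7 | 25.8 | 25.5 | 13.1 | 12.9 | 2.8 | 2.5 | 9.1 | 8.9 | 0.6 | 0.4 |
| Faroes | 49.3 | 49.2 | 80.8 | 80.8 | 34.3 | 34.2 | 5.9 | 5.9 | 2.3 | 2.3 | 0.4 | 0.4 | 1.8 | 1.8 | 0.2 | 0.2 |
| Finland | 47.4 | 47.2 | 73.8 | 73.7 | 37.2 | 37.1 | 8.6 | 8.5 | 8.0 | 7.8 | 1.3 | 1.1 | 5.9 | 5.8 | 0.4 | 0.2 |
| FYR Macedonia d | 38.8 | 38.4 | 57.2 | 57.0 | 22.8 | 22.0 | 5.6 | 5.0 | 2.3 | 1.9 | 2.6 | 2.1 | 11.4 | 11.1 | 1.1 | 0.8 |
| France c | 57.0 | 56.7 | 87.9 | 87.9 | 40.3 | 39.9 | 33.8 | 33.4 | 6.5 | 6.3 | 2.6 | 2.3 | 9.6 | 9.3 | 0.8 | 0.6 |
| Georgia | 43.2 | 42.9 | 84.9 | 84.7 | 43.5 | 43.2 | 12.1 | 11.5 | 12.4 | 12.1 | 4.7 | 4.4 | 11.9 | 11.3 | 2.1 | 1.7 |
| Greece | 38.7 | 38.5 | 93.9 | 93.9 | 34.4 | 34.2 | 8.5 | 8.3 | 12.4 | 12.2 | 1.5 | 1.2 | 4.4 | 4.1 | 0.7 | 0.4 |
| Hungary | 55.4 | 55.2 | 92.7 | 92.7 | 53.6 | 53.4 | 13.3 | 13.0 | 6.9 | 6.6 | 2.3 | 2.0 | 7.5 | 7.2 | 0.8 | 0.5 |
| Iceland | 16.6 | 16.3 | 35.1 | 34.8 | 10.3 | 10.0 | 7.7 | 7.4 | 3.3 | 3.0 | 2.1 | 1.7 | 5.8 | 5.5 | 0.8 | 0.5 |
| Ireland | 32.8 | 32.1 | 73.9 | 73.6 | 34.3 | 33.7 | 19.7 | 18.9 | 11.3 | 10.5 | 4.5 | 3.7 | 4.4 | 3.4 | 1.8 | 1.0 |
| Italy | 58.0 | 57.6 | 84.4 | 84.4 | 34.6 | 33.9 | 28.2 | 27.4 | 4.3 | 3.4 | 3.6 | 2.6 | 6.3 | 5.4 | 2.3 | 1.4 |
| Latvia | 66.0 | 65.4 | 89.6 | 89.6 | 46.8 | 46.3 | 18.0 | 16.3 | 15.5 | 14.6 | 3.5 | 2.3 | 5.5 | 4.3 | 2.5 | 1.0 |
| Liechtenstein | 57.1 | 57.1 | 89.2 | 89.2 | 41.9 | 41.9 | 29.8 | 29.8 | 8.3 | 8.3 | 1.6 | 1.6 | 3.2 | 3.2 | 0.0 | 0.0 |
| Lithuania | 64.9 | 64.8 | 87.0 | 87.0 | 46.4 | 46.2 | 17.9 | 17.7 | 8.2 | 8.0 | 2.1 | 1.8 | 9.2 | 8.9 | 1.0 | 0.7 |
| Malta | 29.4 | 29.1 | 86.2 | 86.2 | 38.3 | 38.1 | 13.0 | 12.6 | 8.6 | 8.3 | 2.2 | 2.0 | 3.2 | 2.9 | 0.7 | 0.5 |
| Moldova | 33.4 | 33.2 | 82.2 | 82.3 | 25.2 | 24.8 | 4.8 | 4.5 | 1.7 | 1.4 | 1.8 | 1.5 | 1.5 | 1.2 | 0.6 | 0.3 |
| Monaco | 56.4 | 55.9 | 89.0 | 88.8 | 42.4 | 41.7 | 32.3 | 31.3 | 8.2 | 8.1 | 2.7 | 2.0 | 10.7 | 10.1 | 2.0 | 0.8 |
| Montenegro | 34.8 | 34.1 | 78.0 | 77.6 | 23.0 | 21.9 | 9.1 | 8.0 | 7.7 | 7.1 | 4.3 | 3.4 | 11.0 | 10.3 | 0.9 | 0.4 |
| Netherlands | 39.8 | 39.4 | 73.7 | 73.5 | 33.2 | 32.7 | 23.1 | 22.5 | 5.4 | 4.9 | 3.6 | 3.1 | 8.8 | 8.4 | 0.9 | 0.4 |
| Norway | 29.0 | 28.7 | 59.1 | 58.8 | 26.5 | 26.3 | 6.4 | 6.1 | 5.7 | 5.5 | 1.1 | 0.8 | 6.1 | 5.8 | 0.4 | 0.2 |
| Poland | 54.8 | 54.5 | 83.8 | 83.7 | 35.7 | 35.3 | 24.4 | 23.9 | 10.7 | 10.2 | 3.6 | 3.1 | 17.0 | 16.6 | 2.0 | 1.5 |
| Portugal | 37.0 | 36.9 | 71.4 | 71.4 | 26.4 | 26.2 | 15.4 | 15.3 | 4.5 | 4.5 | 2.0 | 1.9 | 5.2 | 5.1 | 0.6 | 0.5 |
| Romania | 51.7 | 51.6 | 77.9 | 77.9 | 32.4 | 32.2 | 8.3 | 8.1 | 3.8 | 3.6 | 2.3 | 2.1 | 2.1 | 2.0 | 0.7 | 0.5 |
| Slovakia | 61.8 | 61.6 | 90.7 | 90.7 | 44.6 | 44.4 | 26.5 | 26.3 | 8.4 | 8.1 | 3.6 | 3.3 | 7.0 | 6.8 | 0.7 | 0.4 |
| Slovenia | 47.5 | 47.3 | 89.0 | 89.0 | 43.2 | 42.8 | 25.1 | 24.8 | 14.1 | 14.0 | 2.3 | 2.2 | 3.2 | 3.1 | 0.5 | 0.4 |
| Sweden | 33.7 | 33.4 | 65.3 | 65.0 | 28.1 | 27.6 | 7.1 | 6.6 | 8.2 | 7.4 | 1.7 | 1.2 | 7.5 | 6.9 | 1.0 | 0.5 |
| Ukraine | 50.0 | 49.8 | 82.7 | 82.6 | 39.7 | 39.5 | 9.1 | 8.7 | 4.8 | 4.5 | 1.5 | 1.1 | 1.8 | 1.5 | 0.6 | 0.4 |
| AVERAGE | 46.4 | 46.1 | 80.9 | 80.8 | 37.1 | 36.7 | 16.9 | 16.5 | 7.8 | 7.4 | 2.7 | 2.2 | 6.7 | 6.3 | 1.1 | 0.7 |
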
a Cases are deleted due to missing gender, more than 50 % missing and repeated extreme responses.
b Results are based on cleaned unweighted data with only students born in 1999.
c Results refer to all students born in 1999, not only the ESPAD sample since further cases are removed when new weightings are introduced in the final data.
d Official name former Yugoslav Republic of Macedonia.

Table E. Non-response rates for lifetime prevalence (LTP) of different substances before logical substitution of missing values, and the reduction due to the substitution a, in students born in 1999 b. Percentages. ESPAD 2015

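| Country | Cigarettes LTP (before log. subst.) | Cigarettes LTP (reduction) | Alcohol LTP (before log. subst.) | Alcohol LTP (reduction) | Been intoxicated LTP (before log. subst.) | Been intoxicated LTP (reduction) | Cannabis LTP (before log. subst.) | Cannabis LTP (reduction) | Ecstasy LTP (before log. subst.) | Ecstasy LTP (reduction) | Inhalants LTP (before log. subst.) | Inhalants LTP (reduction) | Tranquillisers or sedatives (non-medical use) LTP (before log. subst.) | Tranquillisers or sedatives (non-medical use) LTP (reduction) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Albania | 0.8 | 0.6 | 2.6 | 0.1 | 2.2 | 0.5 | 1.4 | 0.4 | 1.5 | 1.1 | 0.7 | 0.3 | 1.1 | 0.6 |
| Austria c | 0.2 | 0.1 | 1.4 | 0.1 | 1.2 | 0.1 | 1.1 | 0.3 | 1.2 | 0.8 | 0.9 | 0.5 | 0.8 | 0.5 |
| Belgium (Flanders) | 0.3 | 0.0 | 0.9 | 0.0 | 0.8 | 0.0 | 0.3 | 0.1 | 0.2 | 0.2 | 0.2 | 0.1 | 0.3 | 0.2 |
| Bulgaria | 0.5 | 0.2 | 2.2 | 0.1 | 2.4 | 0.2 | 1.1 | 0.4 | 0.3 | 0.3 | 0.8 | 0.3 | 0.8 | 0.4 |
| Croatia | 0.5 | 0.1 | 1.3 | 0.1 | 1.0 | 0.0 | 0.4 | 0.2 | 0.3 | 0.3 | 0.6 | 0.2 | 0.5 | 0.1 |
| Cyprus | 0.5 | 0.2 | 1.4 | 0.0 | 1.1 | 0.0 | 0.4 | 0.2 | 0.5 | 0.3 | 0.9 | 0.6 | 0.4 | 0.2 |
| Czech Republic | 0.1 | 0.0 | 0.5 | 0.0 | 0.5 | 0.0 | 0.2 | 0.0 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| Denmark | 0.4 | 0.0 | 2.0 | 0.1 | 1.1 | 0.1 | 0.5 | 0.1 | 0.7 | 0.1 | 0.7 | 0.0 | 0.7 | 0.0 |
| Estonia | 0.1 | 0.0 | 1.2 | 0.0 | 0.7 | 0.0 | 0.2 | 0.0 | 0.1 | 0.0 | 0.2 | 0.1 | 0.2 | 0.1 |
| Faroes | 0.6 | 0.0 | 1.4 | 0.2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Finland | 0.3 | 0.1 | 0.6 | 0.1 | 0.8 | 0.1 | 0.2 | 0.1 | 0.5 | 0.5 | 0.6 | 0.5 | 0.6 | 0.5 |
| FYR Macedonia d | 1.6 | 1.0 | 3.6 | 0.4 | 2.4 | 1.1 | 1.7 | 1.1 | 1.3 | 0.9 | 1.1 | 0.5 | 1.2 | 0.8 |
| France c | 0.4 | 0.1 | 1.0 | 0.1 | 0.3 | 0.1 | 0.8 | 0.1 | 0.7 | 0.6 | 1.5 | 1.3 | 0.4 | 0.1 |
| Georgia | 0.5 | 0.3 | 3.2 | 0.0 | 4.9 | 0.4 | 2.0 | 1.1 | 1.6 | 0.8 | 1.1 | 0.4 | 1.3 | 0.6 |
| Greece | 0.0 | 0.0 | 0.4 | 0.0 | 0.2 | 0.0 | 0.2 | 0.0 | 0.3 | 0.3 | 0.2 | 0.2 | 0.2 | 0.1 |
| Hungary | 0.4 | 0.1 | 1.5 | 0.1 | 0.7 | 0.1 | 0.7 | 0.3 | 0.3 | 0.2 | 0.3 | 0.2 | 0.2 | 0.1 |
| Iceland | 0.2 | 0.2 | 0.7 | 0.1 | 0.3 | 0.1 | 0.5 | 0.1 | 0.3 | 0.2 | 0.4 | 0.2 | 0.3 | 0.2 |
| Ireland | 0.5 | 0.2 | 2.6 | 0.0 | 2.4 | 0.2 | 1.3 | 0.3 | 0.4 | 0.3 | 0.5 | 0.1 | 0.3 | 0.1 |
| Italy | 0.7 | 0.2 | 1.5 | 0.1 | 1.1 | 0.3 | 1.0 | 0.3 | 1.2 | 0.9 | 1.4 | 0.8 | 1.2 | 0.8 |
| Latvia c | 0.4 | 0.2 | 1.5 | 0.1 | 2.0 | 0.4 | 1.1 | 0.9 | 1.6 | 1.4 | 2.7 | 2.1 | 2.5 | 1.6 |
| Liechtenstein | 0.3 | 0.0 | 0.3 | 0.0 | 0.3 | 0.0 | 0.3 | 0.0 | 0.9 | 0.6 | 0.3 | 0.0 | 0.9 | 0.6 |
| Lithuania | 0.6 | 0.2 | 1.2 | 0.0 | 0.7 | 0.0 | 0.9 | 0.2 | 0.5 | 0.5 | 0.3 | 0.2 | 0.3 | 0.1 |
| Malta | 0.9 | 0.2 | 2.6 | 0.1 | 1.1 | 0.1 | 1.8 | 0.9 | 1.8 | 1.1 | 0.9 | 0.6 | 0.8 | 0.5 |
| Moldova | 1.1 | 0.4 | 0.4 | 0.1 | 0.6 | 0.1 | 0.6 | 0.3 | 0.5 | 0.4 | 0.5 | 0.4 | 0.4 | 0.3 |
| Monaco | 0.0 | 0.0 | 1.0 | 0.0 | 0.3 | 0.0 | 0.3 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.8 | 0.5 |
| Montenegro | 0.4 | 0.2 | 0.8 | 0.1 | 0.8 | 0.2 | 0.5 | 0.4 | 0.4 | 0.4 | 0.4 | 0.3 | 0.5 | 0.3 |
| Netherlands | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.1 | 0.0 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.0 |
| Norway | 1.8 | 0.8 | 2.5 | 0.6 | 3.7 | 1.8 | 2.1 | 1.2 | 2.9 | 1.7 | 3.9 | 2.7 | 2.7 | 1.5 |
| Poland | 0.3 | 0.1 | 1.2 | 0.1 | 0.7 | 0.1 | 0.7 | 0.1 | 0.6 | 0.5 | 0.3 | 0.1 | 0.2 | 0.1 |
| Portugal | 4.7 | 0.1 | 4.1 | 0.1 | 1.7 | 0.2 | 0.9 | 0.1 | 0.4 | 0.3 | 0.7 | 0.4 | 0.4 | 0.3 |
| Romania | 0.3 | 0.1 | 2.2 | 0.0 | 1.8 | 0.2 | 0.8 | 0.1 | 0.5 | 0.4 | 0.9 | 0.7 | 0.4 | 0.3 |
| Slovakia | 0.5 | 0.1 | 0.7 | 0.0 | 0.7 | 0.1 | 0.3 | 0.1 | 0.1 | 0.1 | 0.4 | 0.4 | 0.2 | 0.1 |
| Slovenia | 0.2 | 0.0 | 1.7 | 0.1 | 0.8 | 0.1 | 0.6 | 0.1 | 0.2 | 0.1 | 0.2 | 0.1 | 0.1 | 0.0 |
| Sweden | 0.4 | 0.1 | 1.4 | 0.2 | 1.3 | 0.3 | 1.1 | 0.4 | 0.8 | 0.5 | 1.4 | 1.0 | 0.9 | 0.4 |
| Ukraine | 0.6 | 0.2 | 0.8 | 0.1 | 2.0 | 0.4 | 1.4 | 0.4 | 1.4 | 1.1 | 0.3 | 0.0 | 0.3 | 0.1 |
| AVERAGE | 0.6 | 0.2 | 1.5 | 0.1 | 1.2 | 0.2 | 0.8 | 0.3 | 0.7 | 0.5 | 0.8 | 0.5 | 0.6 | 0.3 |
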
a The results are based on unweighted raw data, first without logical substitution of missing values and then with logical substitution applied. Cases have been deleted due to missing gender, more than 50 % missing and repeated extreme responses.
b When multiple responses are given on a single choice question, some countries code this –2 instead of –1 (no response). For comparability reasons all –2 are treated as –1.
c Frequencies differ from the final 1999 data since further cases are removed after weighting has been introduced.
d Official name former Yugoslav Republic of Macedonia.