Copyright Sociological Research Online, 1997



Using Metadata for Cross-National Comparisons

Introduction

1.1
Since the late 1980s, interest in metadata and statistical information systems (SIS) has grown at a significant pace. Much of this interest has been stimulated by Eurostat (the Statistical Office of the European Communities), which seeks to harmonize official statistics from the National Statistical Institutes of the European Union. Metadata can also be used to increase the efficiency of collecting and disseminating national statistics.

1.2
At present, the most sophisticated SISs are held within organizations whose operation can justify the expense of developing in-house methodologies and expertise. While these aid the organizations in disseminating their own statistics, there is little opportunity for others to utilize this power. Different institutes use different metadata in order to satisfy the needs of their own particular systems.

1.3
In order to use metadata for cross-national and cross-cultural analysis, a more systematic and rigorous approach is needed. The IDARESA project, which is discussed in this paper, aims to provide such an approach. In this project, we hope to apply a data model, developed in a DOSES (Development of Statistical Expert Systems) project, to an integrated statistical data processing environment, thus providing unified support for accessing information. School leavers surveys and historical macroeconomics time series will be used for testing.

The Background to the Project

2.1
A number of different activities in which CES was involved came together in the IDARESA project. A short description of each of these activities follows.

The CES

2.2
The Centre for Educational Sociology (CES) was founded in 1972 as a research centre at the University of Edinburgh. It conducts multi-disciplinary research on education, training and transitions to the labour market. It takes a holistic view of these transitions, covering the family, households and other transitions to adulthood. The themes of social change, the formation and impact of public policy, and the effects of social change on individual young people's lives have underpinned the Centre's research programme.

2.3
For most of its existence, the CES has designed, conducted and analyzed the Scottish Young People's Survey (SYPS), a regular postal survey of school leavers and, later, of school year groups.

2.4
Another key part of the Centre's activity has been the study of transitions from school to the labour force, and we have held several research grants for comparative studies of vocational education across Europe (Bruijn et al, 1993; Howieson et al, 1994; Bruijn & Howieson, 1995). This led to participation in the European Science Foundation (ESF) approved proposal for a Scientific Network on Transitions in Youth.

SYPS

2.5
The School Leavers' Survey (part of the SYPS series) is a biennial postal survey sent to 10% of young people leaving school from S4, S5 and S6 in alternate school sessions. Postal questionnaires of varying length were sent to the home address of each sample member approximately nine months after they had left school. The sample members were therefore aged between 16 and 19 at the time of contact. Topics covered have varied, but a core of questions has been maintained. These cover: present circumstances; qualifications; family background; attitude to school; and aspirations.

2.6
The core survey was administered every two years from 1977 through to 1991 to pupils leaving school in 1976 through to 1992.

ESF

2.7
In June 1993 the Executive Council of the ESF approved a proposal for a Scientific Network on Transitions in Youth. Discussions about the formation of a network began at the end of 1990, when some of the researchers working on regular and longitudinal surveys of young people - such as the French Observatoire EVA and the Scottish and Irish School Leavers Surveys - began to consider closer links. There had been relatively few formal links between the different national surveys, which tended to differ in their objectives, in their samples, in the data collected and in the classifications used. Nevertheless, it was felt that there was much to be gained, not only from methodological collaboration, but also from using the data for comparative research. Theoretical developments in these fields increasingly stressed the importance of societal-level variables. Comparative data are needed to allow these theoretical interests to be tested and developed through empirical analysis.

2.8
The long-term goal of the Network is 'to advance theoretical understanding of transitions in youth, and especially of the relationships between education/training and the labour market, through the comparative analysis of regular and longitudinal surveys of transitions'.

2.9
This is a very ambitious and long-term goal. The scale of the task is further increased by the diverse designs, concepts and theoretical underpinnings of the different transition surveys.

2.10
A principal activity of the network is to host three annual workshops, each with a specific theme. These have been:

2.11
At the 1994 workshop, the CES, jointly with ESRI and DESAN, presented a paper describing our experiences in building a very small dataset for the three countries (Scotland, The Netherlands and Ireland) (Hannan et al, 1994). The discussions we had over the formation of the combined variables provided a valuable insight into the problems of constructing comparative datasets. In particular, they highlighted the differences between the Scottish and the Dutch systems. A paper describing the initial work on IDARESA was presented to the 1996 workshop (Lamb et al, 1996).

DESAN

2.12
One of the participants in the ESF network, and an associated partner in the IDARESA project, is the survey research company DESAN Market Research, based in Amsterdam. DESAN has been responsible for the conduct of the school leavers survey in The Netherlands. This survey is based on a national sample of 20,000 school leavers each year. To complete this sample, a number of schools spread equally over the country and across the different types and levels of education are included. Every leaver from the selected schools receives a postal questionnaire, sent out by the school. To supplement this national sample, additional schools are included on the basis of funding by regional labour offices, organizations representing industrial sectors, provinces etc. (the annual sample is thereby boosted to between 40,000 and 50,000).

2.13
The Dutch secondary educational system is formally differentiated, with separate but parallel ladders up which students may climb. Pupils switch courses only during the first 2-3 years in secondary school; after that they stay in one of four courses. Subsequent destinations depend on the course followed.

2.14
Comparing the Dutch and Scottish systems highlights some of the differences that can be found between surveys:

CES and Metadata

2.15
Metadata is a term that has become increasingly common over the last few years. While there has been a large amount of discussion on its exact meaning, a useful working definition is as follows. The basic characteristics of metadata are:

2.16
This is a minimalist definition that is deliberately all-embracing, but the emphasis is on computerized data and on the fact that the metadata is either used by a computing system, or presented to a user to help in the use or interpretation of the data.

2.17
The CES involvement with the SYPS led to an interest in metadata, and a number of projects came from this: an ESRC research project developed a documentation database (Ritchie, 1991); we developed a metadata system linking questionnaire design, data capture and documentation (Lamb, 1993); and we were partners in one of the Eurostat-sponsored DOSES projects - the EISI project, which is described below.

DOSES

2.18
In 1989 the Council of Ministers formally approved the DOSES programme as a part of the Second Framework Programme of the European Community. The programme's primary aim was 'to carry out exploratory work to identify how official statistics could be helped by certain data processing techniques' (Yves Franchet, Director General of Eurostat in Drappier, 1993). This was a breakthrough. 'It was the first time there had been any talk of research in official statistics ... There is no problem if you want to talk about research in micro electronics, but it is another matter when you want to do the same in official statistics' (D. Defays in Drappier, 1993).

2.19
CES was involved in the DOSES initiative from the start. At the initial seminar in Luxembourg in December 1987, we presented a paper which identified the need to capture more of the metadata at the start of the survey process (Lamb, 1989). The contacts made during this initial stage led to collaboration on the EISI project.

EISI

2.20
EISI (Expert Interface to Statistical Information) was a project funded under Theme 2 of the DOSES initiative (Documentation of Data and Statistical Methods). The objectives of the topic were:

2.21
The project explored the difficult problem of identifying the information that experts in the field carry in their heads, and attempted to create a formalized structure using case-based reasoning (Epprecht, 1992) which would allow an expert to define an illustrative problem and a solution. Knowledge-based techniques could then be used to solve similar problems. The project also set out to advance techniques of information dissemination by examining and using methodologies for remote access to data stored in a knowledge base. As part of the project, an architecture for a statistical metadatabase was developed (de Vaney et al, 1992; 1993).

MMD

2.22
MMD (Modelling Metadata) was another project funded under Theme 2 of the DOSES initiative and was developed by partners from the Universities of Leuven, Leiden, Amsterdam and Vienna, INE (Portugal) and ESRI (Ireland). The objective of this project was to outline a general methodology for the study and modelling of metadata. The research focused on the development of a formal language for modelling metadata (Froschl, 1993). The project considered all the tasks of the statistical data process, and identified the relevant sub-tasks and the information required. The metadata associated with each task was presented in a structured manner. The model was applied to the Labour Force Survey, and the documentation of the survey was measured against the formal structure. The results of this project form the basis for the data model underlying IDARESA.

Statistical Metadata

2.23
The DOSES initiative was important because it brought together statisticians, informaticians and survey researchers to discuss the practical problems facing the National Statistical Institutes of Europe. The theme of metadata became very important, and a number of theoretical papers were presented at workshops organized by Eurostat (Hand, 1993; Sundgren, 1993). There were also a number of developments taking place in the statistical offices in Europe, of which DUVA is one.

DUVA

2.24
Statistisches Landesamt, the Statistical Office of Berlin (Appel, 1994), has been working on a metadata system. This system was started in late 1989 as a PC-aided evaluation system for the 1987 census data in Germany. The package which has been developed is called DUVA, which stands for Datenverarbeitungs Unterstützte Volkszählungs Auswertung (computer-aided population census evaluation). This original system has since been developed into a more generalized system for the evaluation of all statistics. Its purpose is to allow all metadata used during the entire statistical production process to be stored in an integrated form. This means that metadata need to be captured only once, as early as possible in the process. The system uses a thesaurus-based user interface to allow different users to have access to the same metadata. The system deals with both micro and macro level data. The micro-level data (the smallest units which cannot be derived by applying statistical methods) are stored in basic files. Using a macro file generator, new metadata relating to aggregates and derived variables, indices, indicators etc. can be generated. The system is based on hierarchical classifications, with the relations between the levels defined in reference tables. The thesaurus-based user interface makes an important contribution to IDARESA.
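
The relation between micro-level records, a hierarchical classification and its reference table can be illustrated with a minimal sketch in Python. This is not DUVA's implementation: the classification codes, the reference table and the function name are all invented for illustration, and the sketch assumes that each code has exactly one parent at the next level up.

```python
from collections import Counter

# Hypothetical reference table: each code points to its parent one level
# up in the classification. All codes are invented for illustration.
REFERENCE_TABLE = {
    "A11": "A1",
    "A12": "A1",
    "A21": "A2",
    "A1": "A",
    "A2": "A",
}


def aggregate(micro_codes, levels_up=1):
    """Count micro-level records after moving each code up the hierarchy."""
    counts = Counter()
    for code in micro_codes:
        for _ in range(levels_up):
            code = REFERENCE_TABLE[code]  # follow the link to the parent level
        counts[code] += 1
    return dict(counts)


# Four invented micro-level records, each carrying a detailed code.
micro = ["A11", "A12", "A12", "A21"]
level_one = aggregate(micro, levels_up=1)
top_level = aggregate(micro, levels_up=2)
```

Because the links between levels live in the reference table rather than in the data, the same micro-level file can be aggregated to any level without being recoded.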

The IDARESA Project

3.1
Figure 1 is a diagrammatic representation of how all of these elements relate to each other and to the IDARESA project.

Figure 1

A Description of the Project

3.2
IDARESA (Integrated Documentation And Retrieval Environment for Statistical Aggregates) is one of the approved research and development projects of Eurostat's DOSIS (Development of Statistical Information Systems) initiative. DOSIS is a special task of Esprit's Emerging Software Technologies track within the European Union's Fourth Framework Programme for Research and Technological Development.

3.3
The project aims to design and implement a metadata-based statistical information and processing system targeted particularly at the practical needs of statistical offices. In order to achieve this, special emphasis is laid on the harmonization of statistical data originating from different sources and contexts. The intended outcome of the project is a software system in a network-based client-server architecture, such that clients can define data requests in a natural language irrespective of the actual storage structure and physical distribution of the data.

3.4
Two test areas will be used in the implementation of the system - school leavers surveys and time-series in public financing of research and development.

List of Partners

Work Plan

3.5
The first part of the work plan is to analyze the two test areas in the countries involved. For CES and DESAN, this involves analysis of the School Leavers Surveys as they currently exist in Scotland and The Netherlands. From this will emerge a definition of an idealized survey, which will feed into the design of a formal framework for school leavers surveys, known as a Quality Frame. The next step will be to map the existing surveys onto this Quality Frame. Thereafter, once the Quality Frame has been finalized, it will be implemented in a formal query system and made accessible via the natural language (thesaurus) interface.
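
The mapping step can be pictured with a small sketch in Python. It is an illustration under stated assumptions, not the project's design: the conceptual variable, the survey variable names and every code shown are hypothetical, and none are taken from the Scottish or Dutch questionnaires.

```python
# Hypothetical conceptual variable in the quality frame.
CONCEPT = "main_activity"

# Hypothetical mappings from each survey's own variable and codes to the
# harmonized categories. None of these codes come from the actual surveys.
MAPPINGS = {
    "scotland": {
        "variable": "q1",
        "codes": {1: "employed", 2: "education", 3: "unemployed", 4: "education"},
    },
    "netherlands": {
        "variable": "v12",
        "codes": {"A": "education", "B": "employed", "C": "unemployed"},
    },
}


def harmonize(survey, record):
    """Translate one respondent record into the conceptual variable."""
    mapping = MAPPINGS[survey]
    raw = record.get(mapping["variable"])
    # Codes with no agreed equivalent are flagged rather than guessed.
    return {CONCEPT: mapping["codes"].get(raw, "unmapped")}


scottish = harmonize("scotland", {"q1": 4})
dutch = harmonize("netherlands", {"v12": "B"})
unknown = harmonize("netherlands", {"v12": "Z"})
```

Keeping the mappings as data rather than as program logic means a further survey could be added by supplying one more entry, provided its categories can be reconciled with the conceptual variable.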

Research Method

3.6
Whilst carrying out the analysis of the school leavers surveys, we discovered that the area was far more complex than we had at first anticipated. We found that the opportunities available to the individual school leaver are dependent on institutional frameworks which differ across countries. A full analysis of these differences needs to be carried out. One step towards this goal was to identify the elements in real life which constrain or influence the individual. A preliminary outline of some of these elements is shown in Figure 2.

Figure 2

3.7
These elements need to be represented in a formal setting. Individual questions from both the Scottish and Dutch surveys are then specified in relation to this representation, paying particular attention to such attributes as code lists, ranges, missing values etc. In order to get a grasp of the complexity of the relations, we use a formal object-oriented software design tool. This enables us to identify all the influences explicitly. This exercise highlights the fact that a simple instrument such as a questionnaire is used to capture information about a complex world.
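
As a rough illustration of what specifying a question with its code list, range and missing values might look like, the following Python sketch defines a hypothetical question description. The class, its field names and all codes are assumptions made for this example; they do not reproduce the project's object model or the coding used in either survey.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class QuestionMetadata:
    """Hypothetical description of one survey question."""

    name: str
    text: str
    code_list: dict = field(default_factory=dict)  # code -> label
    valid_range: Optional[tuple] = None            # (low, high) for numeric answers
    missing_values: frozenset = frozenset()        # codes that mean "no answer"

    def classify(self, value):
        """Return 'missing', 'valid' or 'invalid' for a raw answer."""
        if value in self.missing_values:
            return "missing"
        if self.code_list:
            return "valid" if value in self.code_list else "invalid"
        if self.valid_range is not None:
            low, high = self.valid_range
            return "valid" if low <= value <= high else "invalid"
        return "valid"


# Invented code list and missing-value code, for illustration only.
status = QuestionMetadata(
    name="status",
    text="What are you doing now?",
    code_list={1: "in a job", 2: "in education", 3: "unemployed"},
    missing_values=frozenset({-9}),
)

# Invented numeric question using a range instead of a code list.
age = QuestionMetadata(name="age", text="Age at contact", valid_range=(16, 19))
```

Holding these attributes alongside the question text lets a system check raw answers and present labels without referring back to the paper questionnaire.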

3.8
Figure 3 illustrates the different elements that are associated with just one question: 'What are you doing now?'. The analysis of this complex world will enable us to match survey questions against the real world, and to use this to match variables from different surveys. The conceptual variables in the quality frame will be constructed from this real world model.

Figure 3

Conclusions

4.1
The IDARESA project aims to develop and use IT techniques to bring more semantic meaning to survey data, as well as offering a natural language interface which will allow users to conduct comparative analysis in a comfortable environment. It also offers an opportunity to examine and formalize the extremely complex world confronting the school leaver. This work can be generalized in two ways: first, the formal description of a country's educational system will be relevant to other areas of research; second, the model can be extended to other institutional frameworks such as health, judiciary, social services etc.

Karen Brannen and Joanne Lamb
Centre for Educational Sociology, University of Edinburgh

References

APPEL, G. C. (1994) 'Management Aspects of a Statistical Meta-Information System', Conference of European Statisticians, Geneva, 22-25 November 1994, United Nations Development Programme.

BRUIJN, E. de, FROISSART, C., GONZALEZ TIRADOS, R. M., HOWIESON, C., MANNING, S., ORTEGA GARCIA, P., RAFFE, D. & SPENCE, J. (1993) Current Issues in Modular Training: An Interview Study with Trainers in Six European Countries. Edinburgh: Centre for Educational Sociology, University of Edinburgh.

BRUIJN, E. de & HOWIESON, C. (1995) 'Modular Vocational Education and Training in Scotland and The Netherlands: Between Specificity and Coherence', Journal of Comparative Education, vol. 31, no. 1, pp. 83 - 99.

De VANEY, C., FLEMING, M., LAMB, J., MERSCH, G. & SANCHEZ, J. (1992) 'The Expert Interface to Statistical Information: Rationale, Techniques and Experiences', seminar on New Techniques and Technologies for Statistics, Bonn, 1992.

De VANEY, C., FLEMING, M., LAMB, J., MERSCH, G. & SANCHEZ, J. (1993) 'Expert Interface to Statistical Information', EEC DOSES programme project no. B34, Final report, 1993.

DRAPPIER, J. (1993) DOSES - Its Evaluation Its Results Its Future. Luxembourg: Office for Official Publications of the European Communities.

EPPRECHT, E. (1992) 'An Approach for Knowledge Representation and Reasoning with Contextual Strategies in Knowledge-Based Statistical Consultancy Systems', unpublished PhD thesis, Facultes Universitaires Notre Dame de la Paix, 1992.

FROSCHL, K. A. (1993) 'Towards an Operative View of Semantic Metadata', Proceedings of the Statistical Meta-Information Systems Workshop, Luxembourg.

HAND, D. J. (1993) 'Data, Metadata and Information', Proceedings of the Statistical Meta-Information Systems Workshop, Luxembourg.

HANNAN, D., LAMB, J. M., PAGRACH, K., RAFFE, D. & RUTJES, J. J. (1994) Building a Cross-National Dataset on Transitions in Youth: An exploration using Data from Ireland, Scotland and the Netherlands. Scientific Network on Transitions in Youth Working Paper. Strasbourg: European Science Foundation.

HOWIESON, C., HURLEY, N., JONES, G. & RAFFE, D. (1994) Determining the need for Vocational Counselling among different Target Groups of Young People under 28 years of age in the European Community: Young People in Full-Time Employment and Homeless Young People in the United Kingdom. A CEDEFOP Panorama National Report. Berlin: The European Centre for the Development of Vocational Training (CEDEFOP).

LAMB, J. M. (1989) 'Putting Semantics into Data Capture', paper presented to the Development of Statistical Tools seminar on 'The development of statistical expert systems', Luxembourg, December 1987. Eurostat News Special Edition 1989. Luxembourg: Office for Official Publications of the European Communities, catalogue number CA-AB-89-005-EN-C.

LAMB, J. M. (1993) 'Metadata in Survey Processing', Proceedings of the Statistical Meta-Information Systems Workshop, Luxembourg.

LAMB, J. M., RUTJES, J. J. & PAGRACH, K. (1996) Quality Frames for a European Survey on Transitions in Youth. Paper for the European Science Foundation workshop on 'Linking Theory with Empirical Analysis in the study of Transitions in Youth', Marseilles.

RITCHIE, P. (1991) 'Design of a Conceptual Model for a Documentation Database', ESRC R000231293, Final report.

SUNDGREN, B. (1993) 'Modelling Meta-Information Systems', Proceedings of the Statistical Meta-Information Systems Workshop, Luxembourg.
