Introduction

This User Reference describes the TNSMS multi-file structure for the listing, household, individual, married women, and community surveys, and provides guidelines on how to link files and subfiles of interest.  This note is intended as an easy reference and introduction to the TNSMS database.

Survey Overview

Sampling

Community sampling frame: Our primary sample units are 400 rural (village) and urban communities in Tamil Nadu.  Our rural sampling frame was the approximately 17,000 census villages enumerated in the 2001 Indian Census. We randomly sampled 200 villages for our rural communities.  As there was no pre-existing sampling frame of urban communities, our sample frame was created by consolidating regional data from the National Sample Survey Organization's Tamil Nadu offices. We randomly sampled 200 urban communities from our master sample frame of approximately 40,000 “Urban Frame Sample” (UFS) blocks, which are located across more than 90 towns, cities, and metropolitan areas.

Household sampling frame: In each of the villages and urban communities, a complete door-to-door census or listing was taken. The list of all households served as the sampling frame for the random sample of 25 households in each village or community.
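The household selection step can be sketched as follows. This is a minimal illustration that assumes simple random sampling without replacement from the listing. The function name, the seed, and the rule for communities with fewer than 25 listed households are hypothetical and are not taken from the TNSMS field protocol.

```python
import random


def sample_households(listing_ids, n=25, seed=None):
    """Draw a simple random sample of n households from one community's listing."""
    rng = random.Random(seed)
    ids = sorted(listing_ids)
    # Assumption: if fewer than n households are listed, all of them are taken.
    if len(ids) <= n:
        return ids
    return sorted(rng.sample(ids, n))


# Hypothetical listing of 180 households in one community.
listing = list(range(1, 181))
sample = sample_households(listing, n=25, seed=1)
print(len(sample))
```

Fixing the seed makes the draw reproducible, which is useful when the selection has to be documented or audited.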

Survey Instruments

A very broad range of data is collected to maximize the usefulness of this research design. Many of the component survey modules and codes are comparable to major international and Indian surveys to facilitate comparison with other datasets. The rural and urban survey instruments cover the same topics with minor changes for context. The survey consists of three complementary survey tools:

  • Census: The census or listing of all households in the community provides the sample frame for random selection of households, and permits measurement of the broad characteristics of the community.  The detailed village census collects information on the household’s earnings from various sources, land ownership, housing conditions, and socio-economic characteristics (age, caste, primary and secondary occupation, education) of all household members. In addition, since the village census includes limited variables not captured in the household questionnaire, users may wish to link the data from the sample households to the data captured in the census.  In urban communities, a simplified census collects information on the size of the household, and education and occupation of the head of the household.
  • Household-Level Questionnaires: For the 10,000 sample households, three instruments were administered:
    • Household questionnaire: The household questionnaire collects household-level data on key socioeconomic aspects of the household, including residence patterns, financial behaviour, cultivation, household enterprise, income, asset holdings, housing characteristics, and consumption. The household questionnaire was administered to the household members most knowledgeable about the various topics.
    • Individual questionnaire: The individual questionnaire collected details of education, health, migration, labour allocation, and political participation and was administered to all individuals over the age of 14 in the household. Proxy questionnaires were administered for persons under the age of 14, for persons temporarily unable to respond (due to migration) or unable to respond (due to disability).
    • Married women questionnaires: The married women questionnaires cover details of marital and fertility history, reproductive health, and time use, and were administered to all ever-married women between the ages of 15 and 60. Proxy questionnaires were not administered.

Community Questionnaires

The community facility inventories, administered to key informants in the community, document a broad range of community characteristics such as infrastructure, housing and settlement patterns, employment and wage rates, local enterprise, agricultural and livestock patterns, political organization,  the presence of development programs, education and health facilities, and access to financial services.

Complete survey instruments, organized by region, questionnaire and theme, can be found in Folder 2_Documentation> Questionnaires>Questionnaires by Module.

Additional Survey Features

  • Drinking water quality testing: Testing for presence of fecal coliform is performed for the three primary drinking water sources in each community.
  • Anthropometric measurements: Height, weight, and upper arm measurement are taken for all household members in the sample.
  • Individual expectations and predictions: Household members are asked their expectations of lifespan, relative changes in rural and urban wages, and their predicted use of funds in the event of a positive capital shock.
  • Social networks: For a large number of economic and social transactions, the transaction partner characteristics (relationship or kinship, residence, and caste) are recorded.
  • Cognitive and achievement tests:
    • Digit span memory test is administered to all individuals over 5 years of age.
    • Raven's test of spatial aptitude is administered to all individuals over 5 years of age.
    • Reading, math, and comprehension tests are administered to all children between the ages of 5 and 13 years.

 Additional survey features and testing tools can be found in Folder 2_Documentation> Questionnaires>Additional Survey Features.

Data Entry and Cleaning

The data were transcribed from the survey instruments into the data entry program CS-Pro. All data were double-entered, and data entry errors are extremely rare.  The entered files were converted into Stata for use in the data cleaning process and the preparation of a public use version of the data.

In formulating a strategy for data processing, we focused on those data cleaning activities that required access to privacy-protected information. The aim was to meet basic user needs while making the data available to the research community in a reasonable time frame.  For example, the highest priority was given to the cleaning of location, household, and person identifiers.

The only changes that have been incorporated into the data are those for which we had reliable information on the correct value. Known problems, largely resulting from interviewer error in transcription, have been corrected so that identifiers now link properly. In the majority of other cases, decisions on how to handle particular issues belong in the hands of individual users. For privacy protection, collected information that makes it possible to identify villages, communities, or persons has been dropped. After public release, cleaning efforts will continue, and updated versions and notes will be made available to the TNSMS user community.

Locations: Minor interviewer recording errors in psu have been corrected. In the rural database, interviewer miscoding of taluk and block has been corrected.

Households: A small number of duplicate hhid were corrected in the urban household database. One unresolved duplicate remains in the rural database.  Users may wish to link the sample household data to data collected on the same household during the census or listing. This is more likely to be the case with the rural census, where additional data is collected about the household that is not collected in the household or individual questionnaires. In both rural and urban communities, there were a number of interviewer transcription errors in recording the census or listing identifiers (listing_id) on the sample household surveys.  In villages, we were able to make use of the extensive information collected in the census to correctly match listing and sample households; the current rate of mismatch is well under 5%.  A similar process was applied to the urban data, but for approximately 10% of sample households we did not have adequate information in the listing to make a reasonably accurate match.

Persons: When working with the person identifiers in the urban individual and married-women databases, we discovered a number of duplicates in hhid and memid, largely due to interviewer error in transcription of member identifiers from the rosters.  To resolve these errors, we returned to the original questionnaires to assign the correct identifiers.  In the rural database, where the numbers of duplicates were fewer, we used a strategy of referring to various personal identifiers to correctly assign members to their respective sample households.
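Users who want to check for remaining duplicate identifiers, such as the unresolved duplicate hhid noted for the rural database, could use something like the sketch below. It assumes a subfile has already been loaded as a list of dictionaries with one dictionary per record. The function name and the example rows are illustrative only and are not TNSMS data.

```python
from collections import Counter


def find_duplicate_keys(records, key_fields):
    """Return the identifier combinations that occur more than once."""
    counts = Counter(tuple(r[f] for f in key_fields) for r in records)
    return sorted(k for k, c in counts.items() if c > 1)


# Illustrative person-level rows (made up for this example).
persons = [
    {"hhid": 101, "memid": 1},
    {"hhid": 101, "memid": 2},
    {"hhid": 101, "memid": 2},  # repeated member identifier
    {"hhid": 102, "memid": 1},
]

duplicates = find_duplicate_keys(persons, ("hhid", "memid"))
print(duplicates)
```

In a household-level subfile the same function would be called with ("hhid",) as the key, and in a person-level subfile with ("hhid", "memid").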

File Structure, Format, and Conventions

This section presents the structure and format of the TNSMS public use database.

FILE STRUCTURE: The rural and urban TNSMS databases each consist of five separate databases: the listing or census database, the household database, the individual database, the married women database, and the community-facility database. The census or listing is a single file; the household database contains 98 subfiles; the individual database contains 17 subfiles; the married women database contains 12 subfiles; the rural community database contains 63 subfiles and the urban community database contains 34 subfiles.  The TNSMS data were split into these separate subfiles to facilitate construction of subsequent analysis files.

DATA FORMAT: The files and subfiles of the TNSMS database are available in Stata format.

DATA CONVENTIONS USED IN TNSMS: Although the structure of the TNSMS database is complex, it is relatively straightforward in terms of data conventions. The items of note are the main identifier variables, non-response codes, variable names, and file names.

Identifier variables: The two main identifier variables in the TNSMS Household-Level Data are:

  • hhid: Main household identifier
  • memid: Person number from the household roster

The main identifiers for the TNSMS community-level data are:

  • psu: Primary Sampling Unit
  • region: Rural/Urban Indicator

Non-response codes

Non-response codes are straightforward and are present in all cases where an individual did not provide a response (for example, “doesn’t know” or “refused to answer”). These codes have been labelled in the database. Where the questionnaire instructed the interviewer to skip questions or entire sections, the fields were left blank in the data entry program. These skips appear as blanks or “.” and denote “not applicable.”
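The distinction between non-response and "not applicable" can be sketched as below. The numeric codes 98 and 99 are hypothetical placeholders. The actual codes vary and should be read from the value labels of each variable in the database.

```python
# Hypothetical non-response codes: replace with the labelled codes
# found in the relevant TNSMS subfile.
NON_RESPONSE_CODES = {98: "doesn't know", 99: "refused to answer"}


def classify_response(value, non_response_codes=NON_RESPONSE_CODES):
    """Classify a raw value as 'not applicable', 'non-response', or 'valid'."""
    # Blanks and "." mark questions skipped by the questionnaire's skip pattern.
    if value is None or value == "" or value == ".":
        return "not applicable"
    if value in non_response_codes:
        return "non-response"
    return "valid"


print([classify_response(v) for v in (5, 99, ".", None)])
```

Keeping the two categories separate matters for analysis: a skipped question should usually be excluded from the denominator, while a refusal is an observed outcome for a respondent who was asked.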

Variable names

Variable names originate from the data entry program and generally reflect the question number associated with the variable.  An Excel sheet of all variables, presented by subfile, can be found in Folder 2_Documentation>File descriptions and variable lists.

Subfile names and contents

The structure of the database subfiles is a product of the structure of the data entry program. Subfile names reflect the questionnaire from which the data originate, and the questionnaire section. Subfiles generally contain the same unit of observation and are a continuation of the same type of information.  An Excel sheet providing descriptions of subfiles and their summary content can be found in Folder 2_Documentation>File descriptions and variable lists.

Identifying samples, households, individuals within samples

  • Households: Households are identified by the variable hhid. All records in the rural or urban database with the same value of hhid belong to the same household.
  • Members: Members are identified by the variables hhid and memid together. All records in the individual and married women subfiles of the rural or urban database with the same values of hhid and memid belong to the same person.
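The following sketch illustrates why memid has to be combined with hhid. Because memid is the person number from the household roster, the same memid can be expected to recur across households. The rows below are invented for illustration and the function name is hypothetical.

```python
from collections import defaultdict


def members_by_household(records):
    """Group person-level records into {hhid: sorted list of memid}."""
    households = defaultdict(set)
    for r in records:
        households[r["hhid"]].add(r["memid"])
    return {hhid: sorted(memids) for hhid, memids in households.items()}


# Illustrative rows: memid 1 appears in both households.
rows = [
    {"hhid": 201, "memid": 1},
    {"hhid": 201, "memid": 2},
    {"hhid": 202, "memid": 1},
]

roster = members_by_household(rows)
print(roster)
```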

Linking files
The rural and urban databases each consist of five separate component databases: the Census or Listing, Community, Household, Individual, and Ever-Married Women databases. As mentioned, each component database corresponds to a separate survey instrument.

TNSMS Databases and Number of Subfiles

Database           | Rural Subfiles | Urban Subfiles
-------------------|----------------|---------------
Listing            | 1              | 1
Community          | 63             | 34
Household          | 98             | 98
Individual         | 17             | 17
Ever-Married Women | 12             | 12

Linking files within a database

Each database is split up into multiple Stata files for easy access and manageability. Most users will work only with their desired files, rather than the whole database at one time.  Data files can be found in Folder 1_Rural and Urban Data.  

Each subfile corresponds to a PDF document of the relevant pages from the corresponding questionnaire, which can be found in Folder 2_Documentation>Questionnaires>Questionnaires by Stata subfile.

To combine data from various subfiles in each database, the user would (in most cases) need to use a combination of the main identifiers corresponding to that dataset.

Table 1 Main identifier variables

Description                  | Variable name
-----------------------------|--------------
Type of region (Rural/Urban) | region
Primary Sampling Unit        | psu
Households                   | hhid
Persons                      | hhid, memid

For example, the household is the unit of observation in most subfiles of the household database, so hhid uniquely identifies each record in those subfiles. hhid is therefore the household ID that correctly links observations across household database subfiles (sech01a, sech01b, etc.).
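A one-to-one link between two household subfiles can be sketched as follows. The sketch assumes each subfile has been loaded as a list of dictionaries. The variables q1 and q2 and all values are invented, and the "_merge" labels only imitate the match indicator that Stata's merge command produces.

```python
def index_unique(rows, key):
    """Index rows by key, failing if the key is not unique."""
    indexed = {}
    for row in rows:
        if row[key] in indexed:
            raise ValueError(f"{key}={row[key]!r} appears more than once")
        indexed[row[key]] = row
    return indexed


def merge_one_to_one(master, using, key="hhid"):
    """1:1 merge that keeps unmatched records and flags the match status."""
    m, u = index_unique(master, key), index_unique(using, key)
    merged = []
    for k in sorted(set(m) | set(u)):
        # On name clashes the master file's value is kept.
        row = {**u.get(k, {}), **m.get(k, {})}
        if k in m and k in u:
            row["_merge"] = "matched"
        elif k in m:
            row["_merge"] = "master only"
        else:
            row["_merge"] = "using only"
        merged.append(row)
    return merged


# Invented stand-ins for two household subfiles.
sech01a = [{"hhid": 1, "q1": 4}, {"hhid": 2, "q1": 6}]
sech01b = [{"hhid": 1, "q2": "own"}, {"hhid": 3, "q2": "rent"}]

merged = merge_one_to_one(sech01a, sech01b)
print([(r["hhid"], r["_merge"]) for r in merged])
```

Raising an error on a repeated hhid is a deliberate choice here: a 1:1 merge is only meaningful when the key is unique in both files, so a duplicate should be resolved before linking.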

Linking files across databases 

To link information across databases, the user needs to use the appropriate identifier for each level (community, household, or person). For example, to merge data between the rural Community and rural Household databases, the user would use the merge key “psu”. Note that the variable “psu” uniquely identifies observations in the Community database but not in the Household database (where it corresponds to approximately 25 households per psu). Details on merge keys between the various databases are given in Tables 2 and 3.
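The Community to Household link described above can be sketched as a many-to-one lookup on psu. The variable has_bank and all values are hypothetical, and the data are assumed to be loaded as lists of dictionaries.

```python
def merge_many_to_one(many, one, key="psu"):
    """Attach the single matching `one` record to every `many` record."""
    lookup = {}
    for row in one:
        if row[key] in lookup:
            raise ValueError(f"{key}={row[key]!r} is not unique in the 'one' file")
        lookup[row[key]] = row
    merged = []
    for row in many:
        match = lookup.get(row[key])
        out = dict(match or {})
        out.update(row)  # household values take precedence on name clashes
        out["matched"] = match is not None
        merged.append(out)
    return merged


# Invented community and household records.
community = [{"psu": 10, "has_bank": 1}, {"psu": 11, "has_bank": 0}]
households = [
    {"hhid": 1001, "psu": 10},
    {"hhid": 1002, "psu": 10},
    {"hhid": 1101, "psu": 11},
]

linked = merge_many_to_one(households, community)
print([(r["hhid"], r["has_bank"]) for r in linked])
```

The result has one row per household, with the community characteristics repeated for every household in the same psu.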

Table 2 To merge across databases (RURAL)

From       | To         | Merge | Match on        | Data being combined
-----------|------------|-------|-----------------|--------------------
Listing    | Community  | m:1   | psu             | Listing data with community characteristics
Listing    | Household  | 1:1   | psu, listing_id | Listing data for sample households with household characteristics
Listing    | Individual | 1:m   | psu, listing_id | Listing data for sample households with data from the individual database
Listing    | EMW        | 1:m   | psu, listing_id | Listing data for sample households with data from the EMW database
Community  | Household  | 1:m   | psu             | Community characteristics with household-level data
Community  | Individual | 1:m   | psu             | Community characteristics with data from the individual database
Community  | EMW        | 1:m   | psu             | Community characteristics with data from the EMW database
Household  | Individual | 1:m   | hhid            | Household characteristics with data from the individual database
Household  | Individual | 1:1   | hhid, memid     | Household roster with individuals
Household  | EMW        | 1:m   | hhid            | Household characteristics with data from the EMW database
Household  | EMW        | 1:1   | hhid, memid     | Household roster with EMW
Individual | EMW        | 1:1   | hhid, memid     | Data from the individual database with the EMW database

EMW: Ever-Married Women database.

Table 3 To merge across databases (URBAN)

From       | To         | Merge | Match on        | Data being combined
-----------|------------|-------|-----------------|--------------------
Listing    | Community  | m:1   | psu             | Listing data with community characteristics
Listing    | Household  | 1:1   | psu, listing_id | Listing data for sample households with household characteristics
Listing    | Individual | 1:m   | psu, listing_id | Listing data for sample households with data from the individual database
Listing    | EMW        | 1:m   | psu, listing_id | Listing data for sample households with data from the EMW database
Community  | Household  | 1:m   | psu             | Community characteristics with household-level data
Community  | Individual | 1:m   | psu             | Community characteristics with data from the individual database
Community  | EMW        | 1:m   | psu             | Community characteristics with data from the EMW database
Household  | Individual | 1:m   | hhid            | Household characteristics with data from the individual database
Household  | Individual | 1:1   | hhid, memid     | Household roster with individuals
Household  | EMW        | 1:m   | hhid            | Household characteristics with data from the EMW database
Household  | EMW        | 1:1   | hhid, memid     | Household roster with EMW
Individual | EMW        | 1:1   | hhid, memid     | Data from the individual database with the EMW database

EMW: Ever-Married Women database.
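Because some sample households could not be matched to the listing, it may help to measure the match rate on the composite key (psu, listing_id) before merging. The sketch below assumes both files are loaded as lists of dictionaries. The function name and all values are invented for illustration.

```python
def match_report(household_rows, listing_rows, keys=("psu", "listing_id")):
    """Return sample households absent from the listing, and their share."""
    listing_keys = {tuple(r[k] for k in keys) for r in listing_rows}
    unmatched = [
        r for r in household_rows
        if tuple(r[k] for k in keys) not in listing_keys
    ]
    rate = len(unmatched) / len(household_rows) if household_rows else 0.0
    return unmatched, rate


# Invented listing and sample-household records.
listing = [
    {"psu": 10, "listing_id": 1},
    {"psu": 10, "listing_id": 2},
    {"psu": 10, "listing_id": 3},
    {"psu": 11, "listing_id": 1},
]
sample_households = [
    {"hhid": 1001, "psu": 10, "listing_id": 2},
    {"hhid": 1002, "psu": 10, "listing_id": 9},  # listing_id not in the listing
    {"hhid": 1101, "psu": 11, "listing_id": 1},
    {"hhid": 1102, "psu": 10, "listing_id": 3},
]

unmatched, rate = match_report(sample_households, listing)
print(len(unmatched), rate)
```

Note that listing_id alone is not sufficient as a key in this example, since listing_id 1 occurs in both psu 10 and psu 11.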