# 4. Data collection

## 4.1 Background

It is important that government funds are invested in areas that provide the greatest return. Capital investment in transport infrastructure projects must be underpinned by good information on travel demand patterns (how, why, when and where people travel). Effective allocation of resources to manage and operate transport systems requires good information on the transport system performance. This information can only be obtained from comprehensive and regularly updated surveys of travel activity and demand.

The availability of reliable existing travel demand data, together with the costs involved in collecting new data, may dictate the specification and structure of the transport modelling system. Being able to establish a valid Reference (Base) Year demand is critical in undertaking the modelling of any major transport infrastructure proposal. Attempts should always be made to make best use of available demand data. The appropriateness of available data (for example, its currency, coverage, robustness and reliability) should be ascertained early in any model development and application undertaking.

## 4.2 Travel demand surveys

The collection of travel demand data usually requires large-scale travel surveys using either a mail out/mail back self-completion survey or a household personal interview survey.

The mail out/mail back self-completion survey questionnaire is mailed to a household and mailed back to the survey firm or agency after all questions are answered by all members of the surveyed household. Postage costs are usually borne by the survey firm or agency. The Victorian Activity and Travel Survey (VATS) is an example of a mail out/mail back survey.

The household personal interview survey involves face-to-face personal interviews and records all responses by all members of the surveyed household. Personal interview surveys have, to date, provided the major form of data collection for developing and updating transport models. Household personal interview surveys generally have high response rates (in the order of 70–80%) and can be undertaken over a much shorter time period than mail out/mail back surveys.

Other forms of travel demand survey may involve a combination of the mail out/mail back and face-to-face interview surveys, as well as computer aided telephone interview (CATI) surveys. Increasing use is being made of GPS devices to track individuals and assist in the collection of activity or travel diary information.

One critical issue to be addressed in designing a travel demand survey is the survey sample size. Generally, the more detailed the travel demand model, the larger the survey sample size required to obtain statistically reliable estimates of the model parameters. Funding limitations will, to some extent, limit the survey sample size and will dictate the level of detail in the travel demand model. One way of dealing with this issue is to conduct relatively small annual travel demand surveys that accumulate to increasing sample sizes over the ensuing years, making it possible to develop a travel demand model that becomes more detailed over time.

## 4.3 Person travel demand data

The travel demand data collected by the above-mentioned survey approaches represent a snapshot of travel patterns on a particular day and may include the following:

• Household information:
  • dwelling type
  • ownership status of dwelling
  • household size
  • number of registered motor vehicles by type
  • number of bicycles
• Data about people in the household:
  • age
  • sex
  • relationship to head of household
  • employment status
  • resident or visitor
  • licence holding
  • occupation
  • industry of employment
  • personal income
  • if currently studying – primary, secondary, tertiary
  • undertaking other activities
• Travel data for all travel made on the travel day, on a ‘stop’ basis:
  • Travel origin
  • Time of travel, including departure time and arrival time
  • Purpose for the travel
  • Location of destination
  • Mode of transport used
  • If the travel was made by vehicle:
    • vehicle used
    • number of occupants
    • any toll paid and by whom
    • parking location, any parking fee paid and by whom
  • If travel was made by public transport:
    • type of ticket
    • type of zone ticket
    • type of fare paid
  • reason for not travelling on the travel day.

GPS-based household travel surveys are becoming more prevalent in Europe and North America. Such surveys require each independent traveller in a household to carry a GPS device, such as a logger, or to use a phone-based application. These surveys, particularly in combination with prompted recall interviews, enable the collection of more accurate and precise personal travel behaviour data. Further information will be provided in the next update.

## 4.4 Other data sources

Other data sources may include:

• Up-to-date traffic counts by hour and by direction, covering as far as is practicable the main highway sections included in the model. Consideration should be given to establishing a regime of screenline traffic counts to provide information for model validation.
• Traffic signal count data.
• Bluetooth data for developing origin-destination movements and observed travel times.
• Smart card system data is an alternative source of public transport patronage data and may eliminate possible bias in survey design and conduct. Further detail on the advantages and possible limitations of this data is provided at the end of the section.
• On-board surveys or surveys at stations can provide data and information on boardings and alightings, loadings, and origins and destinations. These surveys may be used to augment household interview surveys or to provide detailed public transport patronage and demand data for specific areas of interest.
• Automatic number plate recognition (ANPR), matching vehicles passing distinct locations, both to provide information on travel times and, where the locations are organised in screenlines, a geographically coarse sector based identification of demand patterns. There are examples of similar use of Bluetooth detectors, although consideration should be given to potential bias from the vehicles and travellers sampled. Research, particularly in the US, seeks to exploit vehicle weight detection and the magnetic impulse characteristics of individual vehicles to match vehicles passing different locations.
• Automated Vehicle Location (AVL) data.
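
The number plate matching described in the list above can be sketched as follows. This is a minimal illustration only, assuming each site yields a list of (plate, time in seconds) records. The function name `match_travel_times`, the record layout and the `max_seconds` cut-off are hypothetical and are not drawn from any particular ANPR product.

```python
from statistics import median

def match_travel_times(upstream, downstream, max_seconds=3600):
    """Match plates seen at two sites and return travel times in seconds.

    upstream, downstream: lists of (plate, time_in_seconds) tuples.
    Assumption: the first downstream sighting of a plate after its earliest
    upstream sighting, within max_seconds, is treated as the same journey.
    """
    first_seen = {}
    for plate, t in upstream:
        # Keep the earliest upstream sighting of each plate
        if plate not in first_seen or t < first_seen[plate]:
            first_seen[plate] = t

    times = {}
    for plate, t in sorted(downstream, key=lambda rec: rec[1]):
        if plate in first_seen and plate not in times:
            elapsed = t - first_seen[plate]
            if 0 < elapsed <= max_seconds:
                times[plate] = elapsed
    return times

# Hypothetical records from two camera sites
upstream = [("ABC123", 0), ("XYZ789", 30), ("JKL456", 60)]
downstream = [("ABC123", 300), ("JKL456", 420), ("QRS000", 450)]

matched = match_travel_times(upstream, downstream)
print(matched)                    # {'ABC123': 300, 'JKL456': 360}
print(median(matched.values()))   # 330.0
```

Only vehicles observed at both sites contribute, so the matched set is itself a sample. The cautions above about potential bias in the vehicles sampled apply equally to the travel times derived in this way.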

There is emerging work to collate data tracking GPS devices (based on direct location data) and mobile phone devices (based on the location of the phone masts to which the device is linked at different times). A number of products are available that provide information on travel times by tracking GPS-related devices. These can be linked to in-vehicle devices (such as those for vehicle theft or route finding) and can provide vehicular journey times and speeds. It should be recognised that the source data may be biased (e.g. comprising a large proportion of commercial vehicles) and suitable care should be taken in drawing on these data sources. There is also emerging evidence of these data sources being used to establish trip matrices, based on a range of assumptions to interpret the data for this purpose. One fundamental issue here relates to sampling and expansion. Most GPS-based sources involve particularly small and biased samples that may render them unsuitable as a primary source of data for travel patterns. While mobile operators generally have access to a large sample of the population, there are significant variations in market share, and quite different behaviours (e.g. in older and younger individuals), that require careful consideration. Privacy legislation must also be considered. Careful consideration of the processing methods and assumptions, together with direct verification of the outcomes, is needed in exploring the usefulness of these data.

### Use of Smart Card System Data

Potential advantages of using smart card system data are:

• The collection of large effective sample sizes, compared to household travel survey data
• Travel by individuals can be analysed over time
• Boardings and alightings can be accumulated to accurately estimate the passenger load on any segment of a public transport service
• As smart card data is usually timestamped at stops, it can also be used to estimate the speed and the reliability of public transport services even without an Automatic Vehicle Location (AVL) system (refer Austroads National Performance Indicators for Public Transport in Australia)
• Data is generally available within days of collection
• Low cost to acquire
• As the data is collected continuously, it can be analysed and adjusted for daily variability and seasonality. Variability and seasonality are key issues in understanding network reliability and crowding.
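
The accumulation of boardings and alightings into segment loads, noted in the list above, can be illustrated with a short sketch. It assumes that both a tap-on and a tap-off are recorded at each stop, which is not the case in every smart card system. The function name `segment_loads` and all figures are hypothetical.

```python
def segment_loads(stops, boardings, alightings):
    """Return the passenger load on each segment between consecutive stops.

    Load leaving stop i = load arriving at stop i - alightings + boardings.
    """
    loads = []
    on_board = 0
    for i in range(len(stops) - 1):
        on_board += boardings[i] - alightings[i]
        loads.append((stops[i], stops[i + 1], on_board))
    return loads

# Hypothetical counts for one service run with four stops
stops = ["A", "B", "C", "D"]
boardings = [20, 15, 5, 0]
alightings = [0, 5, 10, 25]

for origin, destination, load in segment_loads(stops, boardings, alightings):
    print(f"{origin} -> {destination}: {load} passengers")
```

Where only tap-on data is available, the alighting counts would first have to be inferred, which is one of the limitations listed below.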

Potential limitations of using smart card system data are:

• The absence of information on travel purpose and on traveller characteristics.  Inference is required to determine these details
• The difficulty in reliably identifying trips involving interchanging.  Inference is required
• The difficulty in allocating origin and destination of travel to specific stops and transport zones.  Again inference is required.

## 4.5 Survey Methodology and Data Requirements

Models and analytical procedures need data. The data needs to be relevant, current and accurate if useful results are to be gained from modelling and analysis.

Data collection is expensive, time consuming and not always straightforward, so care is needed in the planning, design and conduct of surveys. Without this attention, resources – time, people and money – can easily be wasted for little gain. High quality and relevant data are essential for analysis and serve to support policy formulation and decision-making. Poor quality or inappropriate data are to the detriment of informed decision-making.

One useful way to approach data collection is to view the survey process from the systems perspective. Figure 8 below provides one such process model. This figure represents a transport survey data collection as a process, starting with the specification of objectives of the survey and running through to the archiving of results. Note the existence within Figure 8 of various feedback loops indicating that survey design is not a purely sequential process; for example, analysts must be prepared to modify their survey instruments and sample frames in the light of the outcome of the pilot survey.

Figure 8: Survey process model

Source: Taylor, Bonsall and Young 2002, p.138

This process model identifies a number of steps and stages in the collection and analysis of data. These steps may be grouped into three broad stages:

1. Preliminary planning, in which the purpose and specific objectives of the survey are identified, specifications of the requirements for new data are determined (in light of existing data sources) and resources available or required for the survey are identified.
2. Survey planning and design, in which the appropriate survey instrument is selected and the sample design (including target population, sampling frame, sampling method and sample size) is undertaken, leading to a survey plan and the conduct of a pilot survey to test all aspects of the plan and to ensure that it works and provides the required data, and that they are compatible with the proposed analysis. This is an iterative stage, in which pilot survey outcomes may lead to revisions in the survey plan. Good and successful surveys necessarily pay significant attention to getting this stage right.
3. Survey conduct, in which the full survey is undertaken, data extracted and analysed, study reports prepared and databases archived for future reference.

This process is fully explained in Taylor, Bonsall and Young (2000, pp.137–145).

## 4.6 Survey techniques

Surveys are used to obtain data, which are then used to estimate model parameters for predicting the behaviour of transport users in order to make demand forecasts and to estimate the economic and financial values of projects. Most transport data surveys are sample surveys, in that only a small fraction of the overall population (for example, travellers, vehicles, network links or customers) is surveyed to provide data that are then extrapolated to provide a description of the total population. Either data are collected at a few locations taken to represent transport activity, travel movement and traffic flow across the study area, or a sample of individual travellers, customers or operators is surveyed, because it is infeasible, impractical or uneconomic to survey the entire population. This means that survey data often need expansion from the sample to represent the full population. As discussed in the following sections, care in survey organisation and attention to detail are needed to ensure that survey data can properly represent their parent population.
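
One common way of expanding sample data to the full population is to apply an expansion factor within each population group, equal to the number of units in the population divided by the number surveyed. The sketch below illustrates this under the assumption that population totals by group are known from an external source such as a census. The group definitions, the function name `expansion_factors` and all figures are hypothetical.

```python
def expansion_factors(population, sample):
    """Expansion factor per group = population count / sample count."""
    return {group: population[group] / sample[group] for group in sample}

# Hypothetical figures: households by size band
population_households = {"1 person": 40000, "2 persons": 50000, "3+ persons": 60000}
sampled_households = {"1 person": 400, "2 persons": 400, "3+ persons": 400}
sampled_daily_trips = {"1 person": 1200, "2 persons": 2400, "3+ persons": 4400}

factors = expansion_factors(population_households, sampled_households)

# Each sampled trip represents `factor` trips in the population
expanded_trips = sum(sampled_daily_trips[g] * factors[g] for g in factors)

print(factors)          # {'1 person': 100.0, '2 persons': 125.0, '3+ persons': 150.0}
print(expanded_trips)   # 1080000.0
```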

There are two broad approaches to data collection:

1. Observational (passive) surveys – where surveyors (human or mechanical) record the occurrence (and often time of occurrence) of specified transport events or phenomena, such as the passage of vehicles past a point on the road, the arrival of trucks at a warehouse, or the number of passengers exiting from a railway platform in a specified time interval
2. Interview (active) surveys – where the surveyors make contact with the individual travellers, customers or decision makers to seek information directly from them

The information gathered in active surveys can be much richer than information available from passive surveys because:

• Observations are limited in scope to the direct area under study. For instance, the arrival of a vehicle at a cordon line indicates the point at which the vehicle entered or left the study area, but provides little information on the actual origin or the ultimate destination of the trip, nor the frequency with which the vehicle makes that trip or the purpose for which it is made. An active interview or questionnaire survey could obtain this additional information.
• Observational surveys are limited to the study of actual behaviour at the study site. They provide information on ‘revealed demand’, the actual behaviour that is occurring under the environmental conditions pertaining to the study area at the time of the survey. Revealed demand is the observed use of an area or facility. Environmental states, such as traffic congestion or lack of parking, and seasonal conditions (including time of day) may restrict the ability of some individuals to access the specific site or facility, or may lead them to choose an alternative (for example, another destination). This phenomenon is known as ‘latent demand’ and its extent cannot be gauged using observational surveys. An active survey method could seek to determine the existence and extent of latent demand, especially if the survey is designed and applied to include ‘non-users’ of the facility or service as well as the users.[1]

The passive surveys aim to make no interference with the normal operation of the survey site and to not disturb the behaviour of the individuals under observation. The active surveys cannot avoid some interference and may even create disturbances that could affect the behaviour of the respondents. Great care is needed in the survey design for both observational and active surveys to ensure any interference is minimised and that significant bias is not introduced into the survey results because of how the data were collected.

Active (‘interview’) surveys may be conducted in three alternative ways: direct personal interviews, questionnaire surveys and remote interviews (generally conducted by telephone, but also possible over the internet).

### 4.6.1 Direct personal interviews

Personal interviews may be conducted in a variety of locations. Interviews in people’s homes have been widely used for collecting detailed data on the travel behaviour of households and individuals. Most metropolitan areas and other large cities have databases of personal travel conducted using ‘household interviews’. Household interview surveys were conducted in Adelaide in 1999 and Perth in 2002–03, among other cities, while Sydney has a rolling cycle of home interview surveys running continuously.

Direct interviews can be used to collect detailed data about businesses, households and individuals and their travel behaviour, and about other traits such as attitudes and perceptions. In some cases, direct interviews may be conducted in a laboratory setting as well as ‘in the field’. Stated preference studies are often undertaken in a laboratory where specialised resources can be used (for example, to create a simulated environment for the respondent to be immersed in the situational context behind the survey). Hensher, Brotchie and Gunn (1989) described a methodology for surveying rail passengers using active survey techniques. Hensher and Golob (1998) described an interview survey of shoppers and freight forwarders conducted in Sydney in 1996.

One problem with the direct interview is that it can take a considerable period of time to complete, which may cause significant inconvenience to the (volunteer) interviewee. A second problem is that the interviewer must make direct contact with the survey respondent. This may involve considerable time spent travelling by the interviewer to visit the respondents and the need for multiple ‘call backs’ if the respondent is not ‘at home’.

### 4.6.2 Questionnaire surveys

One solution to overcome the time constraints associated with direct interviews is to use questionnaire surveys, often to be returned through the post (‘mail back’) at some future time. Examples of these surveys are roadside or in-vehicle surveys. These can be attempted by direct interview, but individual travellers may be delayed and inconvenienced in the process, or may reach their destination before the conclusion of the interview. It may be more reasonable and effective to distribute a written questionnaire to the travellers, asking them to complete and return the survey once the journey is finished.

The questionnaire can contain questions similar to those posed in an interview, but there are many limitations. For example, the questions must be clear and unambiguous as they stand, as there is no opportunity for an interviewer to offer an explanation. Fewer questions can be asked, as excessively long questionnaires reduce the number of completed responses. There is also the possibility for the respondent to offer false or misleading answers that an interviewer would recognise as such, but that are much harder to detect on a written form. However, the major problem with questionnaire surveys of this type is the likelihood of a low response rate. While response rates of the order of 10–20% may be acceptable in some areas (such as general market research on consumer goods), there is a considerable body of transport research that suggests that much higher rates of response (80% or better) may be necessary to properly gauge true levels of travel activity in a population. It is necessary to recognise that the problem of low response rates may introduce sample bias. To overcome this problem, the survey process needs to incorporate a random procedure to select respondents and require the interviewer to visit a household multiple times to meet the selected respondent for interview. Richardson, Ampt and Meyburg (1995) provide a full discussion of this issue, as well as detailed advice on the conduct of interview and questionnaire surveys.
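
The effect of response rate on sample bias can be illustrated with simple arithmetic. If a proportion r of the selected sample responds, the true mean is r times the respondents' mean plus (1 − r) times the non-respondents' mean, so the respondents' mean overstates the true mean by (1 − r) times the difference between the two groups. The sketch below assumes, purely for illustration, that respondents average 4.0 trips per person per day and non-respondents 3.0. These figures and the function name `nonresponse_bias` are hypothetical.

```python
def nonresponse_bias(response_rate, mean_respondents, mean_nonrespondents):
    """Bias of the respondents' mean relative to the true population mean."""
    true_mean = (response_rate * mean_respondents
                 + (1 - response_rate) * mean_nonrespondents)
    return mean_respondents - true_mean

# Hypothetical trip rates (trips per person per day)
for rate in (0.2, 0.5, 0.8):
    bias = nonresponse_bias(rate, mean_respondents=4.0, mean_nonrespondents=3.0)
    print(f"response rate {rate:.0%}: bias = {bias:+.2f} trips per person per day")
```

Under these assumed figures the bias falls from 0.8 trips at a 20% response rate to 0.2 trips at 80%. In practice the behaviour of non-respondents is unknown, which is why a high response rate matters.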

### 4.6.3 Telephone and internet surveys

The third alternative is the use of telephone (or internet-based) surveys. These are similar to the direct interview except the interview is conducted remotely over a telecommunications network, with telephone interviews usually using random dial-up access. The advantages of this survey technique are cost and convenience. The interviewer stays in the one place (a call centre) and can make contact with a large number of respondents in a short time. Data entry can also be automated, as responses are directly entered into a computer database during the interview.

The disadvantages of the technique include the relatively short length of the interview that is normally possible over the telephone and perhaps a growing ‘consumer resistance’ to the telephone interview, given the method is widely used in general social and market research surveys and in direct marketing. Richardson, Ampt and Meyburg (1995), among many other transport survey researchers, maintain that good quality data on travel behaviour (at least of quality commensurate with that obtainable from face-to-face interviews and questionnaires) cannot be collected using telephone interviews.

Internet-based surveys are more like observational surveys in that respondents to the surveys generally find the survey website of their own volition, rather than through active encouragement by surveyors. This technique is being used, for instance, in studies on driver route choice, where detailed information is required that is quite difficult to obtain through more conventional survey approaches (see Abdel-Aty 2003). However, bias in sampling is quite likely to be an issue in these surveys and the field is as yet relatively unexplored.

### 4.6.4 Costs

Direct interview surveys are generally the most expensive, followed by questionnaire surveys (especially with regard to the number of valid completed questionnaires returned) and then telephone surveys. Observational surveys can be relatively inexpensive, at least in terms of elemental costs such as hourly wage rates or where automatic data loggers can be employed (as for automatic vehicle counts). However, large-scale observational surveys (such as vehicle number plate surveys that require a large number of observed vehicles) can prove very expensive and are sometimes not particularly efficient in terms of the collection of usable, quality data.

### 4.6.5 Revealed versus stated preferences

A further distinction in interview surveys should be drawn between the collection of revealed preference data (what people are seen to do or record that they have done) and stated preference data (what people say they would do in different circumstances, such as when faced with changes in transport fares, services or costs).

Generally, the household travel surveys concentrate on revealed preference data. They record historical data on travel behaviour, which then form a snapshot of travel activity in an area at one particular point in time. These data provide little information on how people might change their behaviour in response to new transport policies or to changing travel environments or to the availability of new modes or services. Stated preference data may be used for these purposes and stated preference experimental methods provide powerful tools in this regard. At the same time, there are considerable problems in ensuring that stated preference information is valid and reliable. Louviere, Hensher and Swait (2000) provide a full coverage of this survey methodology and its use. It should be noted that stated preference methods are a rich source of information for the development of discrete choice models of travel behaviour.

For further reading on transport survey methods see Taylor, Bonsall and Young (2000), Louviere, Hensher and Swait (2000) and Richardson, Ampt and Meyburg (1995).

## 4.7 Sample size estimation

As indicated in Section 4.6, most transport data surveys are sample surveys. Sampling is usually necessary because it is too expensive to survey all members of the population (for example, to obtain travel diaries from all inhabitants of a metropolitan area) or it is physically impossible to do so (such as testing the roadworthiness of all vehicles) or because the survey testing process would be destructive (such as determining the strength of railway sleepers).

Almost all transport surveys involve observing some members of a target population to infer something about the characteristics of that population. In this sense they are statistical sampling surveys. As the effectiveness of the survey is dependent upon choosing an appropriate sample, sample design is a fundamental part of the overall survey process. Sample design involves:

• Definition of target population
• Definition of sampling unit
• Selection of sampling frame
• Choice of sample method
• Consideration of likely sampling errors and biases
• Determination of sample size.

Two main methods exist for selecting samples from a target population: judgement sampling and random sampling. In random sampling, all members of the target population have a chance of being selected in the sample, whereas judgement sampling uses personal knowledge, expertise and opinion to identify sample members.

Judgement samples have a certain convenience. They may have a particular role, such as ‘case studies’ of particular phenomena or behaviours. The difficulty is that because judgement samples have no statistical meaning, they cannot represent the target population. Statistical techniques cannot be applied to these samples to produce useful results as they are almost certainly biased.

There is a particular role for judgement sampling in exploratory or pilot surveys where the intention is to examine the possible extremes of outcomes with minimal resources. However, to go beyond such an exploration, the investigator cannot attempt to select ‘typical’ members or exclude ‘atypical’ members of a population, or to seek sampling by convenience or desire (choosing sample members on the basis of ease or pleasure of observation). Rather, a random sampling scheme should be adopted, to ensure the sample taken is statistically representative.

Random samples may be taken by one of four basic methods (Cochran 1977): simple random sampling, systematic sampling, stratified random sampling and cluster sampling. Taylor, Bonsall and Young (2000 pp. 155–58) describe each of these sampling methods and their applications, as do Richardson, Ampt and Meyburg (1995).

Simple random sampling allows each possible sample to have an equal probability of being chosen, and each unit in the target population has an equal probability of being included in any one sample. Sampling may be either ‘with replacement’ (any member may be selected more than once in any sample draw) or ‘without replacement’ (after selection in one sample, that unit is removed from the sampling frame for the remainder of the draw for that sample). Selection of the sample is by way of computerised randomisation techniques such as random number generation. The methods of statistical inference applied to sample data analysis are predicated on the sample having been chosen by simple random sampling. Data collected using other sampling methods need to be analysed using known techniques that include corrections to approximate simple random samples.
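
The two forms of simple random sampling can be sketched using Python's standard `random` module. The sampling frame of 1,000 household identifiers is hypothetical, and the fixed seed is used only so that the draw can be reproduced.

```python
import random

rng = random.Random(42)  # fixed seed so the draw can be reproduced

# Hypothetical sampling frame: 1,000 household identifiers
sampling_frame = [f"HH{i:04d}" for i in range(1, 1001)]

# Without replacement: no household can be selected twice
without_replacement = rng.sample(sampling_frame, k=50)

# With replacement: the same household may be drawn more than once
with_replacement = rng.choices(sampling_frame, k=50)

print(len(set(without_replacement)))  # always 50 distinct households
print(len(set(with_replacement)))     # 50 or fewer distinct households
```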

There is always a possibility that a sample may not adequately reflect the nature of the parent population. Random fluctuations (‘errors’), which are inherent in the sampling process, are not serious because they can be quantified and allowed for using statistical methods.[2] However, if due to poor experimental design or survey execution there is a systematic pattern to the errors, this will introduce bias into the data and, unless it can be detected, it will distort the analysis. A principal objective of statistical theory is to infer valid conclusions about a population from unbiased sample data, bearing in mind the inherent variability introduced by sampling. Bias and sampling errors are two, quite different, sources of error in experimental observations. As described in Richardson, Ampt and Meyburg (1995, pp. 96–101), bias (or systematic error) needs to be removed from sample data before statistical analysis can be attempted, for statistical theory treats all errors as sampling errors.
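
The difference between sampling error and bias can be illustrated with a small simulation. The sketch below compares samples drawn from a complete frame with samples drawn from a frame that omits people who made no trips, as might happen if only travellers were intercepted. The population, the frame definitions and the sample sizes are all hypothetical.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed for reproducibility

# Hypothetical population: daily trips per person (0 to 9), 1,000 people at each value
population = [trips for trips in range(10) for _ in range(1000)]
true_mean = mean(population)  # 4.5

# A deficient frame that misses everyone who made no trips
travellers_only = [trips for trips in population if trips > 0]

def sample_mean(frame, n):
    """Mean of a simple random sample of size n drawn without replacement."""
    return mean(rng.sample(frame, n))

results = {n: (sample_mean(population, n), sample_mean(travellers_only, n))
           for n in (100, 5000)}

for n, (unbiased, biased) in results.items():
    print(f"n={n}: full frame {unbiased:.2f}, travellers-only frame {biased:.2f}, "
          f"true mean {true_mean:.2f}")
```

With the larger sample the estimate from the full frame settles close to the true mean, whereas the estimate from the deficient frame settles close to a different value. Increasing the sample size reduces sampling error but does nothing to remove bias.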

A distribution of all the possible means of samples drawn from a target population is known as a sampling distribution. It can be partially described by its mean and standard deviation. The standard deviation of the sampling distribution is known as the standard error. It takes account of the anticipated amount of random variation inherent in the sampling process and can therefore be used to determine the precision of a given estimate of a population parameter from the sample.

Surveys for specific investigations usually attempt to provide data for the estimation of particular population parameters or to test statistical hypotheses about a population. In either case, the size of the sample selected will be an important element and the reliability of the estimate will increase as sample size increases. However, the cost of gathering the data will also increase with increased sample size, an important consideration in sample design. A trade-off may have to occur and the additional returns from an increase in sample size will need to be evaluated against the additional costs incurred. If the target population may be taken as infinite, then the standard error ($s_{\bar{X}}$) of the mean of variable X is given by:

$s_{\bar{X}} = \frac{s}{\sqrt{n}}$

[EQ 4.1]

where s is the estimated standard deviation of the population and n is the sample size, assuming that the sampling distribution is normal. Even when the sampling distribution is not normal, this method may still apply because of the Central Limit Theorem, which states that the mean of n random variables from the same distribution will, in the limit as n approaches infinity, have a normal distribution even if the parent distribution is not normal. The standard deviation of the mean is inversely proportional to √n. The implication of equation 4.1 is that as sample size increases, standard error decreases in inverse proportion to the square root of n. This is an important result: the extra precision of a larger sample should be traded off against the cost of collecting that amount of data. To double the precision of an estimate will require the collection of four times as much data.
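
The trade-off between precision and sample size can be checked numerically. The sketch below applies the relationship in equation 4.1 (standard error equals s divided by the square root of n) and its rearrangement for n. The standard deviation of 2.5 trips per person per day is a hypothetical value, and the function names are illustrative only.

```python
import math

def standard_error_of_mean(s, n):
    """Standard error of the sample mean: s / sqrt(n)."""
    return s / math.sqrt(n)

def sample_size_for_mean(s, target_se):
    """Smallest n whose standard error does not exceed target_se."""
    return math.ceil((s / target_se) ** 2)

s = 2.5  # hypothetical standard deviation of daily trips per person

print(standard_error_of_mean(s, 400))     # 0.125
print(standard_error_of_mean(s, 1600))    # 0.0625
print(sample_size_for_mean(s, 0.0625))    # 1600
```

Halving the standard error from 0.125 to 0.0625 requires the sample to grow from 400 to 1,600, that is, four times as much data.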

Similar results are found for other statistical parameters. For instance, the standard error ($s_p$) of a proportion p (e.g. a measure such as ‘the proportion of households owning one vehicle’) is given by:

$s_p = \sqrt{\frac{p(1 - p)}{n}}$

[EQ 4.2]

The practical application of these results requires some prior knowledge of the population, such as a prior estimate of the sample standard deviation (s) in the case of the mean value of variable X or an initial estimate of the proportion p. The results of previous surveys, or the pilot survey, may be used to provide this knowledge.
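
As an illustration of how a prior estimate feeds into sample size, the sketch below rearranges the standard error of a proportion, the square root of p(1 − p)/n, to give the sample size needed for a target standard error. The prior value of 0.3 stands in for a pilot survey result and is hypothetical, as are the target standard error and the function names. The calculation assumes simple random sampling from a population large enough to be treated as infinite.

```python
import math

def standard_error_of_proportion(p, n):
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

def sample_size_for_proportion(p, target_se):
    """Approximate sample size needed to reach target_se, given a prior estimate p."""
    return math.ceil(p * (1 - p) / target_se ** 2)

p_prior = 0.3      # hypothetical pilot estimate of households owning one vehicle
target_se = 0.02   # hypothetical target standard error (two percentage points)

n_required = sample_size_for_proportion(p_prior, target_se)
print(n_required)
print(standard_error_of_proportion(p_prior, n_required))

# With no prior knowledge, p = 0.5 gives the largest p * (1 - p) and so the most cautious n
print(sample_size_for_proportion(0.5, target_se))
```

The result is sensitive to the prior estimate, so it is prudent to test a range of plausible values of p before fixing the survey budget.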

[1] For example, on-board surveys conducted on bus, train or tram are often used to collect data on public transport users, but could not indicate much about those travellers who are potential users of public transport, but are currently using some other mode. This is one reason for the use of home interviews in general travel surveys.

[2] Noting that minimisation of experimental errors is of course important in improving the precision of parameter estimates based on survey data.