National Notifiable Diseases Data


     The American Statistical Association (ASA) Section on Statistical Graphics 
     is sponsoring a special exposition entitled "Statistics in Public Health 
     Surveillance," as a poster session at the August 1991 Joint Statistical 
     Meetings in Atlanta.  Its purpose is to provide a forum for ASA members 
     to present innovative graphical and analytic techniques for addressing
     questions of importance to public health control and prevention
     efforts.  For the exposition, everyone will have access to the following
     data files concerning tuberculosis and mumps incidence in the US.

=========================================================================

The following is a quick inventory of the files and their sizes:

  BYTES       LINES        FILE NAME        Contents

  16541         326        README           text (this file)
  38151        3141        census_70        county populations, 1970
  38294        3137        census_80        county populations, 1980
  57207        3141        fips_70          state & county codes, 1970
  57051        3137        fips_80          state & county codes, 1980
 112651       10342        mumps_ms         mumps cases by month & state
  83476        7000        mumps_yc         mumps cases by year & county
  26235        1606        mumps_ys         mumps cases by year & state
   3079          60        notice           text (call for presentations)
    650          54        start            when states started reporting
 201556       11338        tb               tuberculosis cases
 ------       -----
 634891       43282        TOTAL

=========================================================================

                           Getting the Data

You should have obtained this README file by sending an email letter to
statlib@temper.stat.cmu.edu containing the single line
"send README from disease".

If you were unable to get the README file from statlib in this manner,
then there may be some sort of problem with the way your mail address
makes it through the network.  Statlib is run by a computer program that is 
sometimes unable to figure out how to reply to your mail.  If you have
requested the README file and haven't gotten it, then don't try to
request anything else.  Instead, send a mail message to
statlib-request@temper.stat.cmu.edu, requesting assistance.

After scanning through the information below, if you decide you want
the entire data collection to try your hand at analyzing it,
here are instructions for getting it.

If you request the entire disease data set you will receive a lot of
mail.  Please make sure that your system can accept about a megabyte of
mail.  Some computer systems have only a limited amount of disk space
allocated for incoming mail.  If this could be a problem on your
machine, please check first with the system administrator, or obtain the
disease data in smaller chunks.

Please be patient when you request the data.  Some parts of the data
may arrive quite quickly, while other parts may take much longer to
arrive.  There are many reasons why mail can be delayed, ranging from
temporary network problems to intermediary computers being unavailable.
So please wait up to a day before you reissue a request to statlib.

1) If you are on a UNIX system, you can get the rest of the data
   as a self-unbundling shar archive by sending another email letter to
   statlib@temper.stat.cmu.edu containing the single line
   "send shar from disease".  It will come in several pieces, with easy
   instructions for reassembling them.

1a) If you have the uudecode and uncompress commands on your system,
   you can ask the statlib server to "send shar.uu from disease" to 
   retrieve a uuencoded, compressed version of the shar file.
   When you receive the pieces of this file, put them together, execute
   uudecode on the resulting file, and then uncompress the result of the
   uudecode operation.  This ends up sending about 320K characters as opposed
   to the 660K characters occupied by the entire shar file.

2) If you do not have access to a UNIX system, send another email letter
   to statlib@temper.stat.cmu.edu containing the lines

   send census_70 from disease
   send census_80 from disease
   send fips_70 from disease
   send fips_80 from disease
   send mumps_ms from disease
   send mumps_yc from disease
   send mumps_ys from disease
   send start from disease
   send tb from disease
   
   The data will be sent to you in a number of email letters that you will have
   to hand edit to trim away the mail headers and recreate the data files.

Some network mailers cannot accept or forward very large pieces of
mail.  Hence the statlib system breaks up large messages into several
smaller messages.  Each time a message is broken up, a copy of a
small piece of software for reassembling the data (called statchop) is
also sent.  Thus you may receive multiple copies of statchop.

=========================================================================

BACKGROUND

     Reporting of cases of communicable disease is necessary for planning
     and evaluation of disease prevention and control programs (e.g. for
     vaccine-preventable diseases such as mumps), for the assurance of
     appropriate medical therapy (e.g. tuberculosis), and for the detection
     of outbreaks (e.g. the recent increase in tuberculosis in young
     adults).  Systematic reporting of diseases in the United States began
     in 1874 when the Massachusetts State Board of Health inaugurated
     weekly voluntary reporting of diseases by physicians.  The authority
     to require notification of cases of disease now resides in the
     respective State legislatures, State epidemiologists, or boards of
     health.  The Centers for Disease Control in partnership with the
     Council of State and Territorial Epidemiologists (CSTE) operates the
     National Notifiable Diseases Surveillance System (NNDSS) to provide
     weekly provisional information on the occurrence of diseases that are
     defined as "notifiable" by CSTE.  The NNDSS data are based on reports
     by State epidemiologists, who themselves receive reports from a
     variety of sources, such as individual practitioners, hospitals,
     laboratories, and health departments.  Reports are received from all
     States, Washington, D.C., New York City, and 5 United States
     territories (Puerto Rico, Virgin Islands, American Samoa, Guam, and
     the Commonwealth of the Northern Mariana Islands).

     Tools for this surveillance system are continually improving.  The
     National Electronic Telecommunications System for Surveillance
     (NETSS) is a computer-based system begun in 1984 for reporting
     disease surveillance information to CDC.  The computerized system
     allows more case detail and analytic capability than previously, when
     only aggregate case counts were available by telephone; disease
     distribution can now be mapped by county, onset dates of disease can
     be examined, and comparative information on the distribution of age,
     race, and sex of case patients is available.

     The usefulness of surveillance data varies with the disease, but
     generally such data are used to monitor trends, alert health
     professionals to important aberrations from historical patterns,
     estimate the effect of morbidity, portray natural history of disease,
     develop and test hypotheses, evaluate control measures, monitor
     changes in infectious agents, detect changes in health practices, and
     facilitate planning.  Surveillance provides information on case
     patients for more detailed examination, thus facilitating
     epidemiologic research at the local level and the follow-up of
     individuals, resulting in the initiation of appropriate therapy.
     Surveillance data also provide policymakers the basis for planning
     and implementing prevention and control programs.

     By participating in the surveillance process, physicians and other
     health care providers ensure that public health resources are
     used effectively.  Completeness of reporting, however, varies
     considerably by location and disease.

     Reports are considered provisional and subject to updating when
     more specific information becomes available.

     Reference: "Mandatory Reporting of Infectious Diseases by Clinicians",
     Journal of the American Medical Association, December 1, 1989.


TUBERCULOSIS DATA

     Tuberculosis is caused by bacteria that are transmitted from
     person to person primarily through the air.  It is estimated
     that over 90% of persons reported to have clinically apparent
     disease have had latent TB infection for a year or longer.  The
     number of persons with latent infection in the U.S. is
     estimated to be from 10 to 15 million.  Questions of public
     health importance include whether incidence of TB varies with
     age, sex, race, or ethnicity.  One hypothesis is that overcrowding
     is associated with tuberculosis incidence.  If this were true,
     you might expect, for example, to see higher incidence rates
     during the winter, when people are together indoors,
     and lower rates during the summer.  Another possibility is that
     incidence would be higher in cities like New York and Washington, DC.
     There is a hypothesis that tuberculosis incidence is increasing
     in groups which are infected with the human immunodeficiency virus
     (HIV).  This might lead to increased tuberculosis rates, for example,
     in males in their 20s to 40s.  Do medically underserved groups
     (e.g. blacks, Hispanics, and Native Americans) have higher tuberculosis
     rates?


MUMPS DATA

     Mumps is of current public health interest, because of large
     outbreaks which occurred in 1986-1987 and in 1989, primarily
     among unvaccinated adolescents and young adults in states
     without requirements for mumps vaccination.  
  
     The data supplied for mumps consist of 3 files with different
     spatial and temporal resolutions.  Because of differences in reporting
     processes, results from the various files may not be directly
     comparable.  Each of these files may be appropriate for use in
     answering different questions.

     Questions of public health importance include:
     Can periodicity be demonstrated for mumps in the United States,
     1953-1989?
     Can geographic spread be demonstrated?

DATASETS

     Seven datasets are provided here.  Datasets 1 through 5 were provided
     by the Centers for Disease Control; datasets 6 and 7 are from the U.S.
     Bureau of the Census.
     All fields are separated by "#" characters.

     1.  tb:     Reported tuberculosis cases.

           The dataset consists of 11,338 individual cases of tuberculosis
           sampled randomly from 113,417 cases reported to CDC during 1985-1989.
           For its archival purposes, CDC may assign cases to an earlier
           year due to lags in case reporting.  
           Because States joined NETSS at varying times during this period,
           some variation in reported cases may be due to varying 
           participation rates rather than variation in disease incidence
           (see dataset 2 below for times when States started reporting).

           Each record includes 7 fields:
           STATE:     State Federal Information Processing Standard (FIPS)
                      code (2 digits, leading 0's; see also dataset 7).
           YEAR:      Year the case was counted by CDC (last 2 digits).
           MONTH:     Month the case was counted by CDC (1-12).
           AGE:       Age in years (98=over 97, 99=unknown)
           SEX:       Sex (1=male, 2=female, 9=unknown)
           RACE:      Race (1=white, 2=black, 3=American Indian/Alaskan
                      Native, 4=Asian/Pacific Islander, 9=unknown)
           ETHNICITY: Ethnicity (1=Hispanic, 2=non-Hispanic, 9=unknown)
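
           The record layout above can be read with a few lines of
           code.  What follows is a minimal sketch in Python, not part
           of the distributed data.  It assumes one record per line,
           exactly 7 "#"-separated fields, and that every two-digit
           YEAR belongs to the 1900s.  The names TbCase and
           parse_tb_line are hypothetical.

```python
from typing import NamedTuple, Optional

# Code tables copied from the field descriptions above.
SEX = {1: "male", 2: "female", 9: "unknown"}
RACE = {
    1: "white",
    2: "black",
    3: "American Indian/Alaskan Native",
    4: "Asian/Pacific Islander",
    9: "unknown",
}
ETHNICITY = {1: "Hispanic", 2: "non-Hispanic", 9: "unknown"}


class TbCase(NamedTuple):
    state: str            # FIPS code kept as text so leading 0's survive
    year: int             # full year (assumption: all years are 19xx)
    month: int            # 1-12
    age: Optional[int]    # None when unknown (99); 98 means "over 97"
    sex: str
    race: str
    ethnicity: str


def parse_tb_line(line: str) -> TbCase:
    """Split one '#'-separated tb record and decode its coded fields."""
    fields = [f.strip() for f in line.strip().split("#")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    state, yy, month, age, sex, race, ethnicity = fields
    age_years = int(age)
    return TbCase(
        state=state,
        year=1900 + int(yy),
        month=int(month),
        age=None if age_years == 99 else age_years,
        sex=SEX[int(sex)],
        race=RACE[int(race)],
        ethnicity=ETHNICITY[int(ethnicity)],
    )
```

           An unexpected code raises KeyError, so a malformed record
           is noticed rather than silently miscounted.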

     2.  start:       When did states start reporting tb cases?

           The data in file tb was first reported from various states at
           different times.  This file gives those dates.

           Each record has 3 fields:
           STATE:     State Federal Information Processing Standard (FIPS)
                      code (2 digits, leading 0's; see also dataset 7).
           POSTAL:    2-letter postal code for State.
           MONTHYR:   Month and year when reporting began (2 digits month
                      with leading zero, "/", two digit year).  Note,
                      VI has not yet started reporting.
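
           The MONTHYR field can be turned into a (year, month) pair
           and compared against the date of a tb case.  The sketch
           below is illustrative only.  This README does not say how
           the entry for VI is written in the file, so the code
           assumes that anything not of the form MM/YY means "not yet
           reporting" and returns None.  The function names are
           hypothetical.

```python
import re
from typing import Optional, Tuple

_MONTHYR = re.compile(r"^(\d{2})/(\d{2})$")


def parse_monthyr(text: str) -> Optional[Tuple[int, int]]:
    """Return (year, month) for an 'MM/YY' string, or None if it is not one."""
    match = _MONTHYR.match(text.strip())
    if match is None:
        return None
    month, yy = int(match.group(1)), int(match.group(2))
    if not 1 <= month <= 12:
        return None
    return 1900 + yy, month   # assumption: all years are 19xx


def was_reporting(case_year: int, case_month: int,
                  start: Optional[Tuple[int, int]]) -> bool:
    """True if the State had started reporting by the given year and month."""
    return start is not None and (case_year, case_month) >= start
```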

     3.  mumps_ys:    Reported Mumps Cases (Year/State)

           The dataset consists of the number of cases of mumps reported
           from each State annually, 1953-1988. Data
           are not available for all States for the entire period, since
           mumps has become reportable at differing times in the States.

           The file contains 1606 records.  Each record consists of 3
           fields.

           YEAR:      Year (last two digits)
           STATE:     State (or other reporting area) name (no embedded
                      spaces).  UpstateNY is New York, excluding New York City.
           COUNT:     Number of cases reported.

     4.  mumps_yc:    Reported Mumps Cases (Year/County)

           The dataset consists of the annual number of cases of mumps
           reported from each county, for every third year from 1970 to
           1988.  Data are not available for all counties for the
           entire period, since mumps has become reportable at differing
           times in the States.  Due to confidentiality considerations,
           counties reporting fewer than 4 cases in any year are assigned
           the unknown county designation (999).

           The file contains 7000 records.  Each record consists of 4
           fields:

           STATE:     State Federal Information Processing Standard (FIPS)
                      code (2 digits, leading 0's; see also dataset 7).
           YEAR:      Year the case was reported to CDC (last 2 digits).
           COUNTY:    County (FIPS) code (1 or 3 digits, no leading 0's,
                      may be 999; see also dataset 7).
           COUNT:     Number of cases reported for the given year and
                      county.
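
           COUNTY is written here without leading 0's, while the census
           and fips files (datasets 6 and 7) use 3 digits with leading
           0's.  One way to match records across files is to build a
           5-character State+county key.  This is a sketch under that
           assumption; county_key is a hypothetical helper name.

```python
from typing import Optional

UNKNOWN_COUNTY = "999"   # suppressed or unknown county, per the text above


def county_key(state: str, county: str) -> Optional[str]:
    """Build a 5-character State+county key, or None for unknown counties."""
    county = county.strip()
    if county == UNKNOWN_COUNTY:
        return None
    # Pad both parts so the key agrees with the zero-padded census/fips codes.
    return state.strip().zfill(2) + county.zfill(3)
```

           Records whose key is None cannot be matched to a county
           population and need separate treatment, for example at the
           State level.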

     5.  mumps_ms:    Reported Mumps Cases (month/State)

           The dataset consists of the number of cases of mumps reported
           from each State monthly, 1968-1988.  Data
           are not available for all States for the entire period, since
           mumps has become reportable at differing times in the States.

           The file contains 10,342 records.  Each record consists of 4
           fields:

           STATE:     State FIPS code (2 digits, leading 0's; see also
                      dataset 7).
           YEAR:      Year the case was reported to CDC (last 2 digits).
           MONTH:     Month the case was reported to CDC (1-12, 13=unknown).
           COUNT:     Number of cases reported for the given month/year
                      and State.
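
           Because MONTH may be 13 (unknown), a monthly series and an
           annual total built from this file will not always agree.
           The sketch below keeps both: unknown-month counts go into
           the annual total but are left out of the monthly series.
           It assumes one record per line and years in the 1900s.  The
           function name is hypothetical, and the sample records in
           the usage line are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

UNKNOWN_MONTH = 13


def summarize_mumps_ms(
    lines: Iterable[str],
) -> Tuple[Dict[Tuple[str, int, int], int], Dict[Tuple[str, int], int]]:
    """Return (monthly, annual) case counts keyed by State, year[, month]."""
    monthly = defaultdict(int)
    annual = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        state, yy, month, count = (f.strip() for f in line.split("#"))
        year, month, count = 1900 + int(yy), int(month), int(count)
        annual[(state, year)] += count
        if month != UNKNOWN_MONTH:
            monthly[(state, year, month)] += count
    return dict(monthly), dict(annual)


# Made-up records, not taken from the real file.
monthly, annual = summarize_mumps_ms(["01#68#1#10", "01#68#2#5", "01#68#13#3"])
```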

     6.  census:      Census Population (census_70 and census_80)

           These files contain decennial population totals by county (or
           other reporting unit) for 1970 and for 1980 respectively.
           Although intercensal population estimates are provided by the
           census, decennial counts are generally used for calculation of
           rates (e.g. cases per 100,000 population).

           The fields in each file are:

           STATE:     State FIPS code  (2 digits, leading 0's).
           COUNTY:    County FIPS code (3 digits, leading 0's).
           POP:       Total county population.
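
           A rate per 100,000 population is the case count divided by
           the population, times 100,000.  The sketch below loads a
           census file into a lookup table and computes such a rate.
           It assumes one "#"-separated record per line; the function
           names are hypothetical.

```python
from typing import Dict, Iterable


def load_census(lines: Iterable[str]) -> Dict[str, int]:
    """Map a 5-character State+county FIPS key to its population."""
    populations = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        state, county, pop = (f.strip() for f in line.split("#"))
        populations[state.zfill(2) + county.zfill(3)] = int(pop)
    return populations


def rate_per_100k(cases: int, population: int) -> float:
    """Cases per 100,000 population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 100_000 * cases / population
```

           For instance, 25 cases in a county of 50,000 people is a
           rate of 50 per 100,000.  Which census (1970 or 1980) to use
           for a given year is left to the analyst.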

     7.  fips:   FIPS Codebook

           Two datasets, fips_70 and fips_80, contain FIPS codes for States
           and counties for 1970 and 1980 respectively (the FIPS
           classification changed between 1970 and 1980).

           The fields in each file are:

           STATE:     State FIPS code  (2 digits, leading 0's).
           COUNTY:    County FIPS code (3 digits, leading 0's).
           POSTAL:    2-letter postal code for State
           NAME:      Name of county (or other reporting unit). May have
                      embedded spaces.  All upper case!
============================= END OF README ==============================