Data Mining: Federal Efforts Cover a Wide Range of Uses
(04-MAY-04, GAO-04-548).
Both the government and the private sector are increasingly using
"data mining"--that is, the application of database technology
and techniques (such as statistical analysis and modeling) to
uncover hidden patterns and subtle relationships in data and to
infer rules that allow for the prediction of future results. As
has been widely reported, many federal data mining efforts
involve the use of personal information that is mined from
databases maintained by public as well as private sector
organizations. GAO was asked to survey data mining systems and
activities in federal agencies. Specifically, GAO was asked to
identify planned and operational federal data mining efforts and
describe their characteristics.
-------------------------Indexing Terms-------------------------
REPORTNUM: GAO-04-548
ACCNO: A09947
TITLE: Data Mining: Federal Efforts Cover a Wide Range of Uses
DATE: 05/04/2004
SUBJECT: Counterterrorism
Crime prevention
Data collection
Federal agencies
Fraud
Information technology
Personnel management
Planning
Statistical methods
Data mining
Personal information
******************************************************************
** This file contains an ASCII representation of the text of a **
** GAO Product. **
** **
** No attempt has been made to display graphic images, although **
** figure captions are reproduced. Tables are included, but **
** may not resemble those in the printed version. **
** **
** Please see the PDF (Portable Document Format) file, when **
** available, for a complete electronic file of the printed **
** document's contents. **
** **
******************************************************************
GAO-04-548
United States General Accounting Office
GAO Report to the Ranking Minority Member, Subcommittee on Financial Management,
the Budget, and International Security, Committee on Governmental Affairs, U.S.
Senate
May 2004
DATA MINING
Federal Efforts Cover
a Wide Range of Uses
Highlights of GAO-04-548, a report to the Ranking Minority Member,
Subcommittee on Financial Management, the Budget, and International
Security, Committee on Governmental Affairs, U.S. Senate
Both the government and the private sector are increasingly using "data
mining"--that is, the application of database technology and techniques
(such as statistical analysis and modeling) to uncover hidden patterns and
subtle relationships in data and to infer rules that allow for the
prediction of future results. As has been widely reported, many federal
data mining efforts involve the use of personal information that is mined
from databases maintained by public as well as private sector
organizations.
GAO was asked to survey data mining systems and activities in federal
agencies. Specifically, GAO was asked to identify planned and operational
federal data mining efforts and describe their characteristics.
Federal agencies are using data mining for a variety of purposes, ranging
from improving service or performance to analyzing and detecting terrorist
patterns and activities. Our survey of 128 federal departments and
agencies on their use of data mining shows that 52 agencies are using or
are planning to use data mining. These departments and agencies reported
199 data mining efforts, of which 68 are planned and 131 are operational.
The figure here shows the most common uses of data mining efforts as
described by agencies. Of these uses, the Department of Defense reported
the largest number of efforts aimed at improving service or performance,
managing human resources, and analyzing intelligence and detecting
terrorist activities. The Department of Education reported the largest
number of efforts aimed at detecting fraud, waste, and abuse. The National
Aeronautics and Space Administration reported the largest number of
efforts aimed at analyzing scientific and research information. For
detecting criminal activities or patterns, however, efforts are spread
relatively evenly among the agencies that reported having such efforts.
In addition, out of all 199 data mining efforts identified, 122 used
personal information. For these efforts, the primary purposes were
improving service or performance; detecting fraud, waste, and abuse;
analyzing scientific and research information; managing human resources;
detecting criminal activities or patterns; and analyzing intelligence and
detecting terrorist activities.
Agencies also identified efforts to mine data from the private sector and
data from other federal agencies, both of which could include personal
information. Of 54 efforts to mine data from the private sector (such as
credit reports or credit card transactions), 36 involve personal
information. Of 77 efforts to mine data from other federal agencies, 46
involve personal information (including student loan application data,
bank account numbers, credit card information, and taxpayer identification
numbers).
Top Six Purposes of Data Mining Efforts in Departments and Agencies
www.gao.gov/cgi-bin/getrpt?GAO-04-548
To view the full product, including the scope and methodology, click on
the link above. For more information, contact Linda Koontz at (202)
512-6240 or [email protected].
Contents
Letter
  Results in Brief
  Background
  Agencies Identified Numerous Data Mining Efforts with Various Aims
  Summary
Appendixes
  Appendix I: Objective, Scope, and Methodology
  Appendix II: Surveyed Departments and Agencies
  Appendix III: Departments and Agencies Reporting No Data Mining Efforts
  Appendix IV: Inventories of Efforts
Tables
  Table 1: Top Six Purposes of Data Mining Efforts in Departments and
    Agencies and Number of Efforts Reported
  Table 2: Department of Agriculture's Inventory of Data Mining Efforts
  Table 3: Department of Commerce's Inventory of Data Mining Efforts
  Table 4: Department of Defense's Inventory of Data Mining Efforts
  Table 5: Department of Education's Inventory of Data Mining Efforts
  Table 6: Department of Energy's Inventory of Data Mining Efforts
  Table 7: Department of Health and Human Services' Inventory of Data
    Mining Efforts
  Table 8: Department of Homeland Security's Inventory of Data Mining
    Efforts
  Table 9: Department of the Interior's Inventory of Data Mining Efforts
  Table 10: Department of Justice's Inventory of Data Mining Efforts
  Table 11: Department of Labor's Inventory of Data Mining Efforts
  Table 12: Department of State's Inventory of Data Mining Efforts
  Table 13: Department of Transportation's Inventory of Data Mining Efforts
  Table 14: Department of the Treasury's Inventory of Data Mining Efforts
  Table 15: Department of Veterans Affairs' Inventory of Data Mining Efforts
  Table 16: Environmental Protection Agency's Inventory of Data Mining
    Efforts
  Table 17: Export-Import Bank of the United States' Inventory of Data
    Mining Efforts
  Table 18: Federal Deposit Insurance Corporation's Inventory of Data
    Mining Efforts
  Table 19: Federal Reserve System's Inventory of Data Mining Efforts
  Table 20: National Aeronautics and Space Administration's Inventory of
    Data Mining Efforts
  Table 21: Nuclear Regulatory Commission's Inventory of Data Mining Efforts
  Table 22: Office of Personnel Management's Inventory of Data Mining
    Efforts
  Table 23: Pension Benefit Guaranty Corporation's Inventory of Data Mining
    Efforts
  Table 24: Railroad Retirement Board's Inventory of Data Mining Efforts
  Table 25: Small Business Administration's Inventory of Data Mining Efforts
Figures
  Figure 1: Top Six Purposes of Data Mining Efforts That Involve Personal
    Information
  Figure 2: Top Six Purposes of Data Mining Efforts That Involve Private
    Sector Data
  Figure 3: Top Six Purposes of Data Mining Efforts That Involve Data from
    Other Federal Agencies
Abbreviations
CARDS    Counterintelligence Analytical Research Data System
CG       Coast Guard
CI-AIMS  Counterintelligence Automated Investigative Management System
DHHS     Department of Health and Human Services
DOD      Department of Defense
DOE      Department of Energy
DOT      Department of Transportation
EFTPS    Electronic Federal Tax Payment System
EOS      Earth Observing System
FARS     Fatality Analysis Reporting System
FDA      Food and Drug Administration
GENESIS  Global Environmental and Earth Science Information System
GSFC     Goddard Space Flight Center
HR       Human Resources
HRSA     Health Resources and Services Administration
MATRIX   Multistate Anti-terrorism Information Exchange System
NASA     National Aeronautics and Space Administration
NVO      National Virtual Observatory
OIG      Office of Inspector General
OLAP     On-line Analytical Processing
RSST     Real Estate Stress Test
SAA      Spectral Analysis Automation
SAS      Safety Automated System
SMARTS   Statistical Management Analysis and Reporting Tool System
SWC      Space Warfare Center
TIMS     Technical Information Management System
TOP      Treasury Offset Program
VA       Veterans Affairs
VHA      Veterans Health Administration
VISN     Veterans Integrated Service Network
This is a work of the U.S. government and is not subject to copyright
protection in the United States. It may be reproduced and distributed in
its entirety without further permission from GAO. However, because this
work may contain copyrighted images or other material, permission from the
copyright holder may be necessary if you wish to reproduce this material
separately.
United States General Accounting Office
Washington, D.C. 20548
May 4, 2004
The Honorable Daniel K. Akaka
Ranking Minority Member
Subcommittee on Financial Management, the Budget, and International Security
Committee on Governmental Affairs
United States Senate
Dear Senator Akaka:
Data mining--a technique for extracting knowledge from large volumes of
data--is increasingly being used by government and by the private sector.
As has been widely reported, many federal data mining efforts involve the
use of personal information1 that is mined from public as well as private
sector organizations.
This report responds to your request that we identify and describe
operational and planned data mining systems and activities in federal
agencies. In a follow-up report, we plan to perform an in-depth review of
selected federal data mining efforts.
The term "data mining" has a number of meanings. For purposes of this
work, we define data mining as the application of database technology and
techniques--such as statistical analysis and modeling--to uncover hidden
patterns and subtle relationships in data and to infer rules that allow
for the prediction of future results. We based this definition on the most
commonly used terms found in a survey of the technical literature. In our
initial survey of chief information officers, these officials found the
definition sufficient to identify agency data mining efforts.
1As used in this report, personal information is all information
associated with an individual and includes both identifying information
and nonidentifying information. Identifying information, which can be used
to locate or identify an individual, includes name, aliases, Social
Security number, e-mail address, driver's license number, and
agency-assigned case number. Nonidentifying personal information includes
age, education, finances, criminal history, physical attributes, and
gender.
To address our objective to identify and describe operational and planned
data mining systems and activities in federal agencies, we surveyed chief
information officers or comparable officials at 128 federal departments
and agencies to determine whether the agencies had operational and planned
data mining systems or activities.2 We then conducted telephone interviews
with the reported system managers to obtain information on the
characteristics of the identified data mining efforts. To verify the
information we received, we sent follow-up letters to agencies that
responded as well as to those that did not respond, we asked responsible
officials to verify the information, and we performed random assessments
of the means that these officials used to verify the information.
In addition, we conducted a search of technical literature and periodicals
to develop a comprehensive list of federal government data mining efforts
and then compared these efforts with data mining efforts reported by
federal agencies. If the data mining efforts on our lists were not
reported on the survey, we contacted the appropriate chief information
officers and, with their concurrence, added the efforts.
We performed our work from May 2003 to April 2004 in accordance with
generally accepted government auditing standards. Additional details on
our scope and methodology are provided in appendix I.
Results in Brief
Federal agencies are using data mining for a variety of
purposes, ranging from improving service or performance to analyzing and
detecting terrorist patterns and activities. Our survey of 128 federal
departments and agencies on their use of data mining shows that 52
agencies are using or are planning to use data mining. These departments
and agencies reported 199 data mining efforts, of which 68 were planned
and 131 were operational. The most common uses of data mining efforts were
described by agencies as
o improving service or performance;
o detecting fraud, waste, and abuse;
o analyzing scientific and research information;
o managing human resources;
o detecting criminal activities or patterns; and
o analyzing intelligence and detecting terrorist activities.
2That is, we asked about both systems explicitly dedicated to data mining
and activities using automated tools to "mine" databases that are part of
other systems. In this report, we use the word "efforts" to refer to both
systems and activities, unless otherwise specified.
The Department of Defense reported having the largest number of data
mining efforts aimed at improving service or performance and at managing
human resources. Defense was also the most frequent user of efforts aimed
at analyzing intelligence and detecting terrorist activities, followed by
the Departments of Homeland Security, Justice, and Education.
The Department of Education reported the largest number of efforts aimed
at detecting fraud, waste, and abuse, while the National Aeronautics and
Space Administration targets most of its data mining efforts (21 out of
23) toward analyzing scientific and research information. Data mining
efforts for detecting criminal activities or patterns, however, were
spread relatively evenly among the reporting agencies.
In addition, out of all 199 data mining efforts identified, 122 used
personal information. For these efforts, the primary purposes were
detecting fraud, waste, and abuse; detecting criminal activities or
patterns; analyzing intelligence and detecting terrorist activities; and
increasing tax compliance.
Agencies also identified efforts to mine data from the private sector and
data from other federal agencies, both of which could include personal
information. Of 54 efforts to mine data from the private sector (such as
credit reports or credit card transactions), 36 involve personal
information. Of 77 efforts to mine data from other federal agencies, 46
involve personal information (including student loan application data,
bank account numbers, credit card information, and taxpayer identification
numbers).
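The proportions implied by these counts can be worked out directly. The following is a minimal sketch in Python that uses only the counts reported above; the variable and function names are illustrative and do not come from the survey.

```python
# Counts reported in the survey: (efforts involving personal information,
# efforts in the group).
counts = {
    "all efforts": (122, 199),
    "private sector data": (36, 54),
    "other federal agency data": (46, 77),
}


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Percentage of each group of efforts that involves personal information.
shares = {name: share(part, whole) for name, (part, whole) in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}% involve personal information")
```

On these figures, roughly three in five efforts in each group involve personal information.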
Background
Data mining enables corporations and government agencies to
analyze massive volumes of data quickly and relatively inexpensively. The
use of this type of information retrieval has been driven by the
exponential growth in the volumes and availability of information
collected by the public and private sectors, as well as by advances in
computing and data storage capabilities. In response to these trends,
generic data mining tools are increasingly available for--or built
into--major commercial database applications. Today, mining can be
performed on many types of data, including those in structured, textual,
spatial, Web, or multimedia forms.
Data mining is becoming a big business; Forrester Research has estimated
that the data mining market is passing the billion dollar mark.
Although the use and sophistication of data mining have increased in both
the government and the private sector, data mining remains an ambiguous
term. According to some experts, data mining overlaps a wide range of
analytical activities, including data profiling, data warehousing, online
analytical processing, and enterprise analytical applications.3 Some of
the terms used to describe data mining or similar analytical activities
include "factual data analysis" and "predictive analytics." We surveyed
technical literature and developed a definition of data mining based on
the most commonly used terms found in this literature. Based on this
search, we define data mining as the application of database technology
and techniques--such as statistical analysis and modeling--to uncover hidden
patterns and subtle relationships in data and to infer rules that allow
for the prediction of future results. We used this definition in our
initial survey of chief information officers; these officials found the
definition sufficient to identify agency data mining efforts.
Data mining has been used successfully for a number of years in the
private and public sectors in a broad range of applications. In the
private sector, these applications include customer relationship
management, market research, retail and supply chain analysis, medical
analysis and diagnostics, financial analysis, and fraud detection. In the
government, data mining was initially used to detect financial fraud and
abuse. For example, data mining has been an integral part of GAO audits
and investigations of federal government purchase and credit card
programs.4 Data mining and related technologies are also emerging as key
tools in Department of Homeland Security initiatives.
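To make the idea of mining transaction data for signs of fraud or abuse concrete, the sketch below flags unusually large purchase card transactions with a modified z-score, a common robust outlier test based on the median. It is an illustration only and is not the method used in any GAO audit or agency system. The transaction amounts are hypothetical, and the constant 0.6745 and the cutoff of 3.5 are the conventional choices for this test.

```python
from statistics import median


def flag_outliers(amounts, threshold=3.5):
    """Return indices of amounts whose modified z-score exceeds threshold.

    The modified z-score uses the median and the median absolute
    deviation (MAD), so one extreme value does not distort the baseline.
    """
    med = median(amounts)
    mad = median(abs(a - med) for a in amounts)
    if mad == 0:
        # All values are (nearly) identical; nothing can be ranked as unusual.
        return []
    return [
        i
        for i, a in enumerate(amounts)
        if 0.6745 * abs(a - med) / mad > threshold
    ]


# Hypothetical purchase card transaction amounts, in dollars.
amounts = [42.10, 55.00, 38.75, 61.20, 47.30, 2499.99, 52.80, 44.15]
flagged = flag_outliers(amounts)
print("Transactions to review:", [amounts[i] for i in flagged])
```

A flagged transaction is only a lead for an analyst to review; it is not evidence of fraud.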
3Lou Agosta, "Data Mining Is Dead--Long Live Predictive Analytics!"
(Forrester Research, Oct. 30, 2003),
http://www.forrester.com/Research/LegacyIT/0,7208,33030,00.html
(downloaded Jan. 26, 2004).
4For more information on the uses of data mining in GAO audits, see U.S.
General Accounting Office, Data Mining: Results and Challenges for
Government Programs, Audits, and Investigations, GAO-03-591T (Washington,
D.C.: Mar. 25, 2003).
Data Mining Poses Privacy Challenge
Since the terrorist attacks of September 11, 2001, data mining has been
seen increasingly as a useful tool to help detect terrorist threats by
improving the collection and analysis of public and private sector data.
In a recent report on information sharing and analysis to address the
challenges of homeland security, it was noted that agencies at all levels
of government are now interested in collecting and mining large amounts of
data from commercial sources.5 The report noted that agencies may use such
data not only for investigations of known terrorists, but also to perform
large-scale data analysis and pattern discovery in order to discern
potential terrorist activity by unknown individuals. Such use of data
mining by federal agencies has raised public and congressional concerns
regarding privacy.
One example of a large-scale development effort launched in the wake of
the September 11 attacks is the Multistate Anti-terrorism Information
Exchange System, known as MATRIX. MATRIX, currently used in five states,6
provides the capability to store, analyze, and exchange sensitive
terrorism-related and other criminal intelligence data among agencies
within a state, among states, and between state and federal agencies.
Information in MATRIX databases includes criminal history records,
driver's license data, vehicle registration records, incarceration
records, and digitized photographs. Public awareness of MATRIX and of
similar large-scale data mining or data mining-like projects has led to
concerns about the government's use of data mining to conduct a mass
"dataveillance"7--a surveillance of large groups of people--to sift through
vast amounts of personally identifying data to find individuals who might
fit a terrorist profile.
5Creating a Trusted Information Network for Homeland Security (New York
City: The Markle Foundation, December 2003),
http://www.markletaskforce.org/Report2_Full_Report.pdf (downloaded Mar. 8,
2004).
6Five states are currently participating in the MATRIX pilot project:
Connecticut, Florida, Michigan, Ohio, and Pennsylvania.
7Roger Clarke, "Information Technology and Dataveillance," Communications
of the ACM, vol. 31, issue 5 (New York City: ACM Press, May 1988),
http://www.anu.edu.au/people/Roger.Clarke/DV/CACM88.html (downloaded Mar.
5, 2004). Clarke defines mass dataveillance as the systematic use of
personal data systems in the investigation or monitoring of the actions or
communications of groups of people.
Mining government and private databases containing personal information
creates a range of privacy concerns. Through data mining, agencies can
quickly and efficiently obtain information on individuals or groups by
exploiting large databases containing personal information aggregated from
public and private records. Information can be developed about a specific
individual or about unknown individuals whose behavior or characteristics
fit a specific pattern. Before data aggregation and data mining came into
use, personal information contained in paper records stored at widely
dispersed locations, such as courthouses or other government offices, was
relatively difficult to gather and analyze. As one expert noted, data
mining technologies that provide for easy access and analysis of
aggregated data challenge the concept of privacy protection afforded to
individuals through the inherent inefficiency of government agencies
analyzing paper, rather than aggregated, computer records.8
Privacy concerns about mined or analyzed personal data also include
concerns about the quality and accuracy of the mined data; the use of the
data for other than the original purpose for which the data were collected
without the consent of the individual; the protection of the data against
unauthorized access, modification, or disclosure; and the right of
individuals to know about the collection of personal information, how to
access that information, and how to request a correction of inaccurate
information.9
8K.A. Taipale, "Data Mining and Domestic Security: Connecting the Dots to
Make Sense of Data," The Columbia Science and Technology Law Review, vol.
V, 2003-2004 (New York City: Columbia Law School, 2004),
http://www.stlr.org/cite.cgi?volume=5&article=2 (downloaded Mar. 18,
2004).
9These privacy concerns are reflected in the Fair Information Practices
proposed in 1980 by the Organization for Economic Cooperation and
Development and endorsed by the U.S. Department of Commerce in 1981. These
practices govern collection limitation, purpose specification, use
limitation, data quality, security safeguards, openness, individual
participation, and accountability.
Agencies Identified Numerous Data Mining Efforts with Various Aims
Of 128 federal departments and agencies surveyed for information on their
planned and operational data mining efforts (listed in app. II), 52
agencies reported 199 data mining efforts, and 69 agencies reported that
they were not engaged in data mining and were not planning such efforts
(listed in app. III). Of the 199 data mining efforts, 68 were planned and
131 were operational. Seven agencies did not respond to our survey.10
Appendix IV lists the 199 data mining efforts reported, along with key
characteristics.
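The survey counts in this paragraph are internally consistent, as the short Python check below shows. It uses only the reported figures; the derived response rate and the average number of efforts per reporting agency are simple arithmetic and are not stated in the report.

```python
# Figures reported for the survey.
surveyed = 128
reported_efforts = 52   # agencies reporting data mining efforts
reported_none = 69      # agencies reporting no current or planned efforts
no_response = 7
planned, operational = 68, 131
total_efforts = 199

# The three groups of agencies account for every agency surveyed.
agencies_accounted_for = reported_efforts + reported_none + no_response
# Planned and operational efforts account for every effort reported.
efforts_accounted_for = planned + operational

# Derived figures (not stated in the report).
response_rate = round(100 * (surveyed - no_response) / surveyed, 1)
efforts_per_agency = round(total_efforts / reported_efforts, 1)

print(f"Agencies accounted for: {agencies_accounted_for} of {surveyed}")
print(f"Efforts accounted for: {efforts_accounted_for} of {total_efforts}")
print(f"Response rate: {response_rate}%")
print(f"Average efforts per reporting agency: {efforts_per_agency}")
```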
Agencies described the most common purposes of data mining efforts as
o improving service or performance;
o detecting fraud, waste, and abuse;
o analyzing scientific and research information;
o managing human resources;
o detecting criminal activities or patterns; and
o analyzing intelligence and detecting terrorist activities.
As shown in table 1, the Department of Defense reported the largest number
of efforts aimed at improving service or performance (with 19 out of 65
reported efforts) and at managing human resources (with 14 out of 17
efforts). Defense was also the most frequent user of efforts aimed at
analyzing intelligence and detecting terrorist activities, with 5 of 14
efforts, followed by the Departments of Homeland Security and Justice,
with 4 and 3 efforts, respectively. The Department of Education has the
largest number of efforts aimed at detecting fraud, waste, and abuse (9
out of 24 efforts reported). The National Aeronautics and Space
Administration accounts for 21 of the 23 identified efforts for analyzing
scientific and research information. Efforts are spread relatively evenly
among the agencies that reported using data mining efforts for detecting
criminal activities or patterns. Table 1 summarizes the top six uses of
data mining efforts among the responding agencies.
10Agencies that did not respond to our survey are (1) the Central
Intelligence Agency; (2) the Corporation for National and Community
Service; (3) the Department of the Army, Department of Defense; (4) the
Equal Employment Opportunity Commission; (5) the National Park Service,
Department of the Interior; (6) the National Security Agency, Department
of Defense; and (7) the Rural Utilities Service, Department of
Agriculture.
Table 1: Top Six Purposes of Data Mining Efforts in Departments and Agencies and
Number of Efforts Reported
Column headings: Department or agency; Improving service or performance;
Detecting fraud, waste, and abuse; Analyzing scientific and research
information; Managing human resources; Detecting criminal activities or
patterns; Analyzing intelligence and detecting terrorist activities
Department of Agriculture 8 1
Department of Commerce
Department of Defense 19 1 1 14 1
Department of Education 6 9 3
Department of Energy 3
Department of Health and Human Services 4 1
Department of Homeland Security 5 2 2
Department of the Interior 1
Department of Justice 1 1 3
Department of Labor 3 1
Department of State 2
Department of Transportation 1
Department of the Treasury 4 1 2
Department of Veterans Affairs 5 5 1
Environmental Protection Agency 1
Export-Import Bank of the United States 1
Federal Deposit Insurance Corporation 1
Federal Reserve System 1
National Aeronautics and Space Administration 1 1 21
Nuclear Regulatory Commission 1
Office of Personnel Management 1
Pension Benefit Guaranty Corporation 2
Railroad Retirement Board 1
Small Business Administration 1
Total 65 24 23 17 15 14
Source: GAO analysis of agency-provided data.
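The column totals in table 1 can be compared with the 199 efforts reported overall. The sketch below, in Python, sums the totals from the "Total" row. The remainder it computes is an inference rather than a reported figure: it assumes that each effort is counted under a single purpose, which the table does not state.

```python
# Column totals from the "Total" row of table 1.
top_six_totals = {
    "improving service or performance": 65,
    "detecting fraud, waste, and abuse": 24,
    "analyzing scientific and research information": 23,
    "managing human resources": 17,
    "detecting criminal activities or patterns": 15,
    "analyzing intelligence and detecting terrorist activities": 14,
}
ALL_EFFORTS = 199  # total efforts reported in the survey

top_six_sum = sum(top_six_totals.values())

# Assumption (not stated in the report): each effort is counted under one
# purpose only, so the remainder is the number of efforts whose purpose
# falls outside the top six.
outside_top_six = ALL_EFFORTS - top_six_sum

leading_purpose = max(top_six_totals, key=top_six_totals.get)

print(f"Efforts in the top six purposes: {top_six_sum}")
print(f"Efforts with other purposes (under the assumption): {outside_top_six}")
print(f"Most frequently cited purpose: {leading_purpose}")
```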
Some data mining purposes focus on human activities and therefore are
inherently likely to involve personal information; examples of these
purposes are detecting fraud, waste, and abuse; detecting criminal
activities or patterns; managing human resources; and analyzing
intelligence. The following are examples of data mining efforts for each
of these purposes:
o Detecting fraud, waste, and abuse. The Veterans Benefits
Administration's C & P Payment Data Analysis effort mines veterans'
compensation and pension data for evidence of fraud.
o Detecting criminal activities or patterns. The Department of
Education's Title IV Identity Theft Initiative effort focuses on identity
theft cases involving education loans.
o Managing human resources. The U.S. Air Force's Oracle HR (Human
Resources) uses data mining to provide information on promotions, pay
grades, clearances, and other information relevant to human resources
planning.
o Analyzing intelligence and detecting terrorist activities. The Defense
Intelligence Agency's Verity K2 Enterprise mines data from the
intelligence community and Internet sources to identify foreign terrorists
or U.S. citizens connected to foreign terrorism activities.
On the other hand, other categories of efforts do not necessarily focus on
human activities or involve personal information, such as many of the
efforts aimed at analyzing scientific and research information. The
National Aeronautics and Space Administration, for example, mines large,
complex earth science data sets to find patterns and relationships to
detect hidden events (the system is called Machine Learning and Data
Mining for Improved Data Understanding of High Dimensional Earth Sensed
Data).
Similarly, many efforts aimed at improving service or performance (the
most frequently cited purpose of data mining efforts) do not involve
personal information. For example, the Department of the Navy's Supply
Management System Multidimensional Cubes system includes a data warehouse
containing data on every ship part that has been ordered since the 1980s,
with multidimensional information on each part. The Navy uses data mining
to calculate failure rates and identify needed improvements; according to
the Navy, this system reduces downtime on ships by improving parts
replacement.
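A calculation of this kind can be sketched as follows. The Python example below is a simplified illustration, not the Navy's system: the part names, order records, and installed quantities are hypothetical, and the "failure rate" here is simply the number of replacement orders per installed unit.

```python
from collections import Counter


def failure_rates(orders, installed):
    """Return replacement orders per installed unit for each part.

    orders    -- iterable of (part_id, year) replacement-order records
    installed -- dict mapping part_id to the number of units in service
    """
    order_counts = Counter(part_id for part_id, _year in orders)
    return {
        part_id: order_counts[part_id] / units
        for part_id, units in installed.items()
        if units > 0
    }


# Hypothetical warehouse extract: one record per replacement order.
orders = [
    ("pump-A", 1998), ("pump-A", 1999), ("valve-B", 1999),
    ("pump-A", 2001), ("valve-B", 2002), ("seal-C", 2003),
]
# Hypothetical number of units of each part installed across the fleet.
installed = {"pump-A": 10, "valve-B": 40, "seal-C": 100}

rates = failure_rates(orders, installed)
worst_part = max(rates, key=rates.get)

for part_id, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{part_id}: {rate:.2f} replacement orders per installed unit")
print("Candidate for improvement:", worst_part)
```

Ranking parts by rate rather than by raw order count keeps widely installed parts from appearing unreliable simply because there are more of them in service.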
However, some efforts aimed at improving service or performance do involve
personal information. For example, the Veterans Administration's VISN
(Veterans Integrated Service Network) 16 Data Warehouse is mined for a
variety of information, including patient visits, laboratory tests, and
pharmacy records, to provide management with health care system
performance information.
Overall, 122 of the 199 data mining efforts involve personal information.
Figure 1 shows the top six purposes of these efforts, as well as their
distribution.
Figure 1: Top Six Purposes of Data Mining Efforts That Involve Personal
Information
[Figure not reproduced. Horizontal axis: number of data mining efforts
(0 to 40). Purposes shown: increasing tax compliance; analyzing
intelligence and detecting terrorist activities; detecting criminal
activities or patterns; managing human resources; detecting fraud, waste,
and abuse; and improving service or performance (33 efforts).]
Source: GAO analysis of agency data.
Of the 199 data mining efforts, 54 use or plan to use data from the
private sector. Of these, 36 involve personal information. The personal
information from the private sector included credit reports and credit
card transaction records. Figure 2 shows the distribution of the top six
purposes of the 54 efforts involving data from the private sector.
Figure 2: Top Six Purposes of Data Mining Efforts That Involve Private
Sector Data
[Figure not reproduced. Horizontal axis: number of data mining efforts
(0 to 40). Purposes shown: improving safety; detecting criminal
activities or patterns; analyzing scientific and research information;
analyzing intelligence and detecting terrorist activities; detecting
fraud, waste, and abuse; and improving service or performance (14
efforts).]
Source: GAO analysis of agency data.
Of the 199 data mining efforts, 77 efforts use or plan to use data from
other federal agencies. Of the 77 efforts, 46 involve personal
information. The personal information from other federal agencies included
student loan application data, bank account numbers, credit card
information, and taxpayer identification numbers. Figure 3 shows the top
six uses for the 77 efforts involving data from other federal agencies and
their distribution.
Figure 3: Top Six Purposes of Data Mining Efforts That Involve Data from
Other Federal Agencies
[Figure not reproduced. Horizontal axis: number of data mining efforts
(0 to 40). Purposes shown: managing human resources; detecting fraud,
waste, and abuse; detecting criminal activities or patterns; analyzing
intelligence and detecting terrorist activities; analyzing scientific and
research information; and improving service or performance (20 efforts).]
Source: GAO analysis of agency data.
Summary
Driven by advances in computing and data storage capabilities and
by growth in the volumes and availability of information collected by the
public and private sectors, data mining enables government agencies to
analyze massive volumes of data. Our survey shows that data mining is
increasingly being used by government for a variety of purposes, ranging
from improving service or performance to analyzing and detecting terrorist
patterns and activities.
Although this survey provides a broad overview of the emerging uses of
data mining in the federal government, more work is needed to shed light
on the privacy implications of these efforts. In future work, we plan to
examine selected federal data mining efforts and their implications.
As agreed with your office, unless you publicly announce the contents of
the report earlier, we plan no further distribution until 30 days from the
report date. At that time, we will send copies of this report to the
Chairmen and Ranking Minority Members of the House Committee on Government
Reform; Subcommittee on Civil Service and Agency Organization, House
Committee on Government Reform; Select Committee on Homeland Security,
House of Representatives; Senate Committee on Governmental
Affairs; and the Subcommittee on Oversight of Government Management, the
Federal Workforce and the District of Columbia, Senate Committee on
Governmental Affairs. We will also make copies available to others on
request. In addition, this report will be available at no charge on the
GAO Web site at http://www.gao.gov.
If you have any questions concerning this report, please call me at (202)
512-6240 or Mirko J. Dolak, Assistant Director, at (202) 512-6362. We can
also be reached by e-mail at [email protected] and [email protected],
respectively. Key contributors to this report were Camille M. Chaires,
Barbara S. Collier, Orlando O. Copeland, Nancy E. Glover, Stuart M.
Kaufman, Lori D. Martinez, Morgan F. Walts, and Marcia C. Washington.
Sincerely yours,
Linda D. Koontz
Director, Information Management Issues
Appendix I
Objective, Scope, and Methodology
Our objective was to identify and describe planned and operational federal
data mining efforts. As a first step in addressing this objective, we
developed a definition of "data mining." Because this expression has a
range of meanings, we surveyed the technical literature to develop a
definition based on the most commonly used terms found in this literature.
We defined data mining as the application of database technology and
techniques--such as statistical analysis and modeling--to uncover hidden
patterns and subtle relationships in data and to infer rules that allow
for the prediction of future results. In our initial survey of chief
information officers, these officials found the definition sufficient to
identify agency data mining efforts.
We then surveyed chief information officers or comparable officials at 128
federal departments and agencies (see app. II) and asked them to identify
whether their agency had operational and planned data mining efforts. We
achieved a 95 percent response rate. Of the 121 agencies that responded,
69 reported that they did not have any data mining efforts (see app. III).
We followed up with these 69 agencies and gave them another opportunity to
report data mining efforts.
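
The response figures above can be cross-checked with simple arithmetic. The following is a minimal sketch in Python; the counts come from the text above, while the variable names are illustrative and are not part of GAO's methodology.

```python
# Counts taken from the methodology text above; names are illustrative.
agencies_surveyed = 128
agencies_responded = 121
agencies_reporting_none = 69

# Response rate as a fraction of agencies surveyed.
response_rate = agencies_responded / agencies_surveyed

# Responding agencies that reported at least one data mining effort.
agencies_reporting_efforts = agencies_responded - agencies_reporting_none

print(f"Response rate: {response_rate:.1%}")
print(f"Agencies reporting efforts: {agencies_reporting_efforts}")
```

The unrounded rate is about 94.5 percent, which rounds to the 95 percent reported, and the difference of 52 matches the number of departments and agencies that reported data mining systems.
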
To obtain information on the characteristics of the identified operational
or planned data mining efforts, we conducted structured telephone
interviews1 with the identified system owners or activity managers. The
interviews were designed to obtain detailed information about each data
mining system, including the purpose and size, the use of personal
information, and the use of data from the private sector or other federal
organizations. We pretested the structured interview to ensure relevance
and clarity.
We aggregated these data by agency and sent them back to the chief
information officer, comparable official, or their designee and asked that
they review the characteristics for completeness and accuracy. One of the
52 departments and agencies that reported data mining systems--the
Department of Homeland Security--has not responded to our request to review
the reported data for completeness and accuracy.
Footnote 1: In a structured interview, the interviewer asks the same
questions of numerous individuals or individuals representing numerous
organizations in a precise manner, offering each interviewee the same set
of possible responses.
We performed random assessments of the means that these officials used to
verify the information. Based on these assessments, we concluded that the
agencies' verification methods were reasonable and that as a result, we
could rely on the accuracy of the reported data. We also conducted a
search of technical literature and periodicals to develop a list of
federal government data mining efforts and then compared the efforts on
this list with the data mining efforts reported by federal agencies. If
the data mining efforts on our list were not reported on the survey, we
contacted the chief information officer or comparable official to
determine whether that data mining effort should be included in our
survey.
Because this was not a sample survey, there are no sampling errors.
However, the practical difficulties of conducting any survey may introduce
errors, commonly referred to as nonsampling errors. For example,
difficulties in how a particular question is interpreted, in the sources
of information that are available to respondents, or in how the data are
entered into a database or analyzed can introduce unwanted
variability into the survey results. We took steps in the development of
the structured interview, the data collection, and the data analysis to
minimize these nonsampling errors. Among these steps, we pretested the
structured interview instrument, contacted nonresponding agencies as well
as agencies not identifying data mining efforts, and sent the aggregated
data to the agency chief information officer for review.
We conducted our work from May 2003 to April 2004 in accordance with
generally accepted government auditing standards.
Appendix II
Surveyed Departments and Agencies
Department of Agriculture
o Agricultural Marketing Service
o Agricultural Research Service
o Animal and Plant Health Inspection Service
o Cooperative State Research, Education, and Extension Service
o Farm Service Agency
o Food and Nutrition Service
o Food Safety and Inspection Service
o Foreign Agricultural Service
o Forest Service
o National Agricultural Statistics Service
o Natural Resources Conservation Service
o Risk Management Agency
o Rural Utilities Service
Department of Commerce
o Bureau of the Census
o Economic Development Administration
o International Trade Administration
o National Oceanic and Atmospheric Administration
o U.S. Patent and Trademark Office
Department of Defense
o Missile Defense Agency
o Defense Advanced Research Projects Agency
o Defense Commissary Agency
o Defense Contract Audit Agency
o Defense Contract Management Agency
o Defense Information Systems Agency
o Defense Intelligence Agency
o Defense Legal Services Agency
o Defense Logistics Agency
o Defense Security Cooperation Agency
o Defense Security Service
o Defense Threat Reduction Agency
o Department of the Air Force
o Department of the Army
o Department of the Navy
o National Geospatial-Intelligence Agency
o National Security Agency
o U.S. Marine Corps
Department of Education
Department of Energy
o Bonneville Power Administration
o Southeastern Power Administration
o Southwestern Power Administration
o Western Area Power Administration
Department of Health and Human Services
o Administration for Children and Families
o Agency for Healthcare Research and Quality
o Centers for Disease Control and Prevention
o Centers for Medicare and Medicaid Services
o Food and Drug Administration
o Health Resources and Services Administration
o Indian Health Service
o National Institutes of Health
o Program Support Center
Department of Homeland Security
o Border and Transportation Security Directorate
o Bureau of Citizenship and Immigration Services
o Emergency Preparedness and Response Directorate
o Information Analysis and Infrastructure Protection Directorate
o Management Directorate
o Science and Technology Directorate
o U.S. Coast Guard
o U.S. Secret Service
Department of Housing and Urban Development
Department of the Interior
o Bureau of Indian Affairs
o Bureau of Land Management
o Bureau of Reclamation
o Minerals Management Service
o National Park Service
o Office of Surface Mining Reclamation and Enforcement
o U.S. Fish and Wildlife Service
o U.S. Geological Survey
Department of Justice
o Bureau of Alcohol, Tobacco, Firearms, and Explosives
o Drug Enforcement Administration
o Federal Bureau of Investigation
o Federal Bureau of Prisons
o U.S. Marshals Service
Department of Labor
Department of State
Department of Transportation
o Federal Aviation Administration
o Federal Highway Administration
o Federal Motor Carrier Safety Administration
o Federal Railroad Administration
o Federal Transit Administration
o National Highway Traffic Safety Administration
Department of the Treasury
o Bureau of Engraving and Printing
o Bureau of the Public Debt
o Financial Management Service
o Internal Revenue Service
o Office of the Comptroller of the Currency
o Office of Thrift Supervision
o U.S. Mint
Department of Veterans Affairs
o Veterans Benefits Administration
o Veterans Health Administration
Agency for International Development
Central Intelligence Agency
Corporation for National and Community Service
Environmental Protection Agency
Equal Employment Opportunity Commission
Executive Office of the President
Export-Import Bank of the United States
Federal Deposit Insurance Corporation
Federal Energy Regulatory Commission
Federal Reserve System
Federal Retirement Thrift Investment Board
General Services Administration
Legal Services Corporation
National Aeronautics and Space Administration
National Credit Union Administration
National Labor Relations Board
National Science Foundation
Nuclear Regulatory Commission
Office of Management and Budget
Office of Personnel Management
Peace Corps
Pension Benefit Guaranty Corporation
Railroad Retirement Board
Securities and Exchange Commission
Small Business Administration
Smithsonian Institution
Social Security Administration
U.S. Postal Service
Appendix III
Departments and Agencies Reporting No Data Mining Efforts
The following 69 departments and agencies reported that they have no
operational or planned data mining efforts:
Department of Agriculture
o Agricultural Marketing Service
o Agricultural Research Service
o Animal and Plant Health Inspection Service
o Cooperative State Research, Education, and Extension Service
o Farm Service Agency
o Foreign Agricultural Service
o Forest Service
o National Agricultural Statistics Service
o Food Safety and Inspection Service
Department of Commerce
o Economic Development Administration
o Bureau of the Census
o International Trade Administration
o Department of Commerce Headquarters
o National Oceanic and Atmospheric Administration
Department of Defense
o Defense Contract Audit Agency
o Missile Defense Agency
o Defense Legal Services Agency
o Defense Security Service
o Defense Threat Reduction Agency
o Defense Logistics Agency
o Defense Advanced Research Projects Agency
o Defense Contract Management Agency
o Defense Security Cooperation Agency
Department of Energy
o Bonneville Power Administration
o Southeastern Power Administration
o Southwestern Power Administration
o Western Area Power Administration
Department of Health and Human Services
o Centers for Medicare and Medicaid Services
o Administration for Children and Families
o National Institutes of Health
o Indian Health Service
Department of Homeland Security
o Science and Technology Directorate
o Management Directorate
o Bureau of Citizenship and Immigration Services
o Department of Homeland Security Headquarters
Department of Housing and Urban Development
Department of the Interior
o Bureau of Reclamation
o Bureau of Land Management
o U.S. Geological Survey
o Fish and Wildlife Service
o Office of Surface Mining Reclamation and Enforcement
o Bureau of Indian Affairs
o Department of the Interior Headquarters
Department of Justice
o Bureau of Alcohol, Tobacco, Firearms, and Explosives
Department of Transportation
o Federal Aviation Administration
o Federal Transit Administration
o Federal Railroad Administration
o Federal Motor Carrier Safety Administration
o Federal Highway Administration
Department of the Treasury
o Comptroller of the Currency
o Bureau of the Public Debt
o Office of Thrift Supervision
o Department of the Treasury Headquarters
o Bureau of Engraving and Printing
Agency for International Development
Executive Office of the President
Federal Energy Regulatory Commission
Federal Retirement Thrift Investment Board
General Services Administration
Legal Services Corporation
National Credit Union Administration
National Labor Relations Board
National Science Foundation
Office of Management and Budget
Peace Corps
Securities and Exchange Commission
Smithsonian Institution
Social Security Administration
U.S. Postal Service
Appendix IV
Inventories of Efforts
The following tables present selected information from our survey of 128
major federal departments and agencies on their use of data mining. The
tables list the purpose of each data mining effort, whether the system is
planned or operational, and whether the system uses personal information,
data from the private sector, or data from other federal agencies. The
survey shows that 52 departments and agencies are using or are planning to
use data mining. These departments and agencies reported 199 data mining
efforts, of which 68 were planned and 131 were operational.
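
The totals above can be checked for internal consistency with a few lines of arithmetic. The sketch below is illustrative only; the counts come from the text above, and the helper `share` and the variable names are assumptions introduced for this example.

```python
# Counts taken from the appendix text above; names are illustrative.
planned_efforts = 68
operational_efforts = 131
reported_total = 199


def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole, rounded to one decimal."""
    return round(100 * part / whole, 1)


# The planned and operational counts should sum to the reported total.
assert planned_efforts + operational_efforts == reported_total

print(f"Planned: {share(planned_efforts, reported_total)} percent")
print(f"Operational: {share(operational_efforts, reported_total)} percent")
```

On these figures, roughly one-third of the reported efforts were planned and roughly two-thirds were operational.
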
Table 2: Department of Agriculture's Inventory of Data Mining Efforts
Each entry gives the organization or system name, description, purpose,
status, and features (use of personal information, private sector data,
and other agency data).
Organizations: Department of Agriculture Headquarters; Food and Nutrition
Service
System: Travel Data Mart
Description: Will consolidate employee travel information from financial
and travel systems. Will allow for a governmentwide e-travel system and
provide the department with information on the financial ramifications of
its travel.
Purpose: Improving service or performance
Status: Planned
Personal information: Yes; Private sector data: No; Other agency data: No
System: Financial Statements Data Warehouse
Description: Is used in the production of consolidated financial
statements. Provides information for products that are used to satisfy
external reporting requirements, such as Office of Management and Budget
and Department of the Treasury requirements.
Purpose: Financial management
Status: Operational
Personal information: No; Private sector data: No; Other agency data: No
System: Financial Data Warehouse
Description: Is the department's internal financial management reporting
system. Data mining is done for ad hoc and on-demand reports.
Purpose: Financial management
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No
System: Grantee Monitoring Activities-Southeast Regional Office
Description: Assists in monitoring the financial status of grant holders.
Grantees are required to provide expenditure reports, and analysis is
performed quarterly that matches stated draws to the actual draws from the
U.S. Treasury.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No
System: Grantee Monitoring Activities-Mountain Plains Regional Office
Description: Assists in monitoring the management and distribution of
Indian funds for major food benefit programs, such as food stamps, in 10
grantee states.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No
System: Grantee Monitoring Activities-Southwest Regional Office
Description: Maximizes on-site monitoring efforts by confirming the
accuracy of grantee accounting. Reduces on-site time, maximizes time to
complete reviews, and has achieved a 50 percent travel savings.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No
System: Grantee Monitoring Activities-Midwest Regional Office
Description: Will be a reporting system to provide reports and automate
the audit process. Plans are to acquire data mining tools to review and
compare budgets, reports, and plans.
Purpose: Improving service or performance
Status: Planned
Personal information: No; Private sector data: No; Other agency data: Yes
System: Grantee Monitoring Activities-Northeast Regional Office
Description: Supports on-site reviews of analyses to confirm financial
report information.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: Yes; Other agency data: No
System: Integrated Program Accounting System Data Integrity
Description: Will create ad-hoc reporting centers to validate accounting
information.
Purpose: Improving service or performance
Status: Planned
Personal information: No; Private sector data: No; Other agency data: No
Organizations: Natural Resources Conservation Service; Risk Management
Agency
System: National Resource Inventory Used for Statistical Analysis of Past
Soil Survey Databases
Description: Is a trending database that tracks more than 200 resource
issues such as monitoring erosion. Also processes statistical technology.
Purpose: Improving service or performance
Status: Operational
Personal information: No; Private sector data: No; Other agency data: No
System: CAE
Description: Is part of a congressionally mandated project to assist the
Risk Management Agency in controlling fraud, waste, and abuse in the
Federal Crop Insurance Corporation program.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Personal information: Yes; Private sector data: Yes; Other agency data: Yes
Source: Department of Agriculture.
Table 3: Department of Commerce's Inventory of Data Mining Efforts
Each entry gives the organization or system name, description, purpose,
status, and features (use of personal information, private sector data,
and other agency data).
Organization: U.S. Patent and Trademark Office
System: Compensation Projection Model in the Enterprise Data Warehouse
Description: Generates and makes available compensation projection data,
both salary and benefits, on current employees and on planned hires. It
also accounts for planned attritions.
Purpose: Managing human resources
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: Yes
Source: Department of Commerce.
Table 4: Department of Defense's Inventory of Data Mining Efforts
Each entry gives the organization or system name, description, purpose,
status, and features (use of personal information, private sector data,
and other agency data).
Organizations: Defense Commissary Agency; Defense Information Systems
Agency
System: DeCA Electronic Records Management and Archive System
Description: Will be a corporate information system for managing
unstructured data. It will allow for electronic record keeping, document
management, and automated receipt processes.
Purpose: Improving service or performance
Status: Planned
Personal information: Yes; Private sector data: Yes; Other agency data: Yes
System: Corporate Decision Support System/Commissary Operations
Management System
Description: Mines data to produce analytical data on commissary
operations. Provides information such as what items stores are selling and
helps determine whether cashiers are being honest.
Purpose: Improving service or performance
Status: Operational
Personal information: No; Private sector data: No; Other agency data: No
System: Enterprise Business Intelligence System
Description: Will replace the current management information environment,
which includes operations, reporting, billing, statistics, and other
management information activities.
Purpose: Improving service or performance
Status: Planned
Personal information: No; Private sector data: No; Other agency data: No
Organizations: Defense Intelligence Agency; Department of the Air Force
System: Insight Smart Discovery
Description: Will be a data mining knowledge discovery tool to work
against unstructured text. Will categorize nouns (names, locations,
events) and present information in images.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Planned
Personal information: Yes; Private sector data: No; Other agency data: Yes
System: Verity K2 Enterprise
Description: Mines data from the intelligence community and Internet
searches to identify foreign terrorists or U.S. citizens connected to
foreign terrorism activities.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Operational
Personal information: Yes; Private sector data: Yes; Other agency data: Yes
System: PATHFINDER
Description: Is a data mining tool developed for analysts that provides
the ability to analyze government and private sector databases rapidly. It
can compare and search multiple large databases quickly.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: Yes
System: Autonomy
Description: Is a large search engine tool that is used to search hundreds
of thousands of word documents. Is used for the organization and knowledge
discovery of intelligence.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Operational
Personal information: No; Private sector data: No; Other agency data: Yes
System: ANG Data Warehouse-Guardian
Description: Will be used to measure military readiness. It incorporates
information on all disciplines to provide management information needed to
assess military readiness.
Purpose: Measuring military readiness
Status: Planned
Personal information: Yes; Private sector data: No; Other agency data: No
System: Integrated Space Warfare Center (SWC) Information System
Description: Will be an internal database containing information on all
development/execution activities within the SWC. Will be used by all
management and analyst personnel to track and align the center's
activities to warfighter needs, report on execution status, financial
status, schedule status, and performance measurements.
Purpose: Improving service or performance
Status: Planned
Personal information: Yes; Private sector data: No; Other agency data: No
System: Safety Automated System (SAS)
Description: Will query databases to find automation mishaps. Governed by
Directive 920124 and will allow for the investigation and reporting of
identified automation mishaps.
Purpose: Improving safety
Status: Planned
Personal information: Yes; Private sector data: No; Other agency data: No
System: Enterprise Business System
Description: Will support strategic planning, assist in building
scientific and technical budgets for the Air Force, and serve as a launch
point for all new programs. Research and development case files will be
maintained for 75 years; the activity indexes, catalogs, and tracks these
files.
Purpose: Improving service or performance
Status: Planned
Personal information: No; Private sector data: No; Other agency data: Yes
System: Genomic and Proteomic Results Analysis
Description: Analyzes National Institutes of Health's genetic data.
Purpose: Analyzing scientific and research information
Status: Operational
Personal information: No; Private sector data: No; Other agency data: Yes
System: IG Corporate Information System
Description: Enhances combat readiness and mission capabilities for Air
Combat Command units and commanders. It assists in preparing for and
conducting inspections.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No
System: Computer Network Defense System
Description: Evaluates network activities to create rules for intrusion
detection system signature sets.
Purpose: Improving information security
Status: Operational
Personal information: No; Private sector data: No; Other agency data: No
System: FAME
Description: Will serve as a central repository for Air Force manpower
information. Will track manpower and unit authorization funding.
Purpose: Managing human resources
Status: Planned
Personal information: No; Private sector data: No; Other agency data: Yes
System: Resource Wizard
Description: Serves as a manpower tracking system. Tracks positions and
captures data for specific funding purposes.
Purpose: Improving service or performance
Status: Operational
Personal information: No; Private sector data: No; Other agency data: No
System: Government Purchase Card
Description: Is used in overseeing purchases made by Air Force personnel
with government-provided credit cards.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Personal information: Yes; Private sector data: Yes; Other agency data: No
System: Ambulatory Data System Queries
Description: Tracks the initial diagnosis of patients with the results of
further testing and diagnosis. Allows for early notification of diseases
and injuries.
Purpose: Monitoring public health
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No
System: Modus Operandi Database
Description: Is an investigative tool used to identify and track trends in
criminal behavior. It links characteristics of crimes and provides details
on crime scenes and other crime factors.
Purpose: Detecting criminal activities or patterns
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No
System: Executive Decision Support System
Description: Takes data from all functional metric balances. Processes
charts and graphs to identify trends and to make sure goals are
accomplished.
Purpose: Improving service or performance
Status: Operational
Personal information: No; Private sector data: No; Other agency data: No
System: Inspire
Description: Is a tool that assists in providing a narrative description
of all research and development that is being conducted within the Air
Force. Provides cost and milestone information on research and development
projects.
Purpose: Performing strategic planning
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: Yes
System: Discoverer
Description: Is used to manage personnel records, including individual
aliases and histories.
Purpose: Managing human resources
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No
System: Requirements and Concepts System
Description: Will serve as a repository for new system projects and system
requirements. It will be available for consultation for information on all
project requests and identified requirements.
Purpose: Improving service or performance
Status: Planned
Personal information: No; Private sector data: No; Other agency data: No
System: Business Objects
Description: Is a commercial off-the-shelf tool that is used to analyze
and report on human resources activities.
Purpose: Managing human resources
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: Yes
System: THRMIS
Description: Uses commercial off-the-shelf software to maintain a data
warehouse of integrated inventory and manpower data for the Total Force:
active duty (officer and enlisted), Air Force Reserve, Air National Guard,
and civilians. Is used to assess and analyze the health of the Air Force.
Purpose: Managing human resources
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No
System: SAS
Description: Is a Web-enabled personnel data system that gives authorized
users worldwide the ability to tabulate demographic data on recruitment,
promotion, and retention.
Purpose: Managing human resources
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No
System: Oracle HR
Description: Is a personnel management system that manages information for
promotions, pay grades, clearances, and other information relevant to
human resources.
Purpose: Managing human resources
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No
System: Health Modeling and Informatics Division Data Mart
Description: Provides information and decision support to the Air Force
headquarters' surgeon general for decision making, policy development, and
resource allocation. It also provides performance information and analysis
to medical field units in support of performance measurement objectives.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No
System: FIRST EDV (BRIO)
Description: Will deal with Air Force budgets and other components of its
financial environment. Historical analyses and trend analyses will be
performed on the budget process.
Purpose: Improving service or performance
Status: Planned
Personal information: No; Private sector data: Yes; Other agency data: No
System: IG World
Description: Is used to store and track data and requirements, such as
lodging and augmentee requirements, for the PAC inspector general.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No
Organizations: Department of Defense Headquarters; Department of the Navy
System: Automated Continuing Evaluation System
Description: Will be used to improve personnel security continuing
evaluation efforts within Department of Defense (DOD) by identifying
issues of security concern between the normal reinvestigation cycle for
those who hold DOD security clearances and have signed a consent form that
is still in effect.
Purpose: Managing human resources
Status: Operational
Personal information: Yes; Private sector data: Yes; Other agency data: Yes
System: Human Resource Trend Analysis
Description: Is used to improve Navy readiness. Data on personnel manning
levels are mined to ensure that each Navy unit has the correct number of
training personnel aboard.
Purpose: Managing human resources
Status: Operational
Personal information: No; Private sector data: No; Other agency data: No
System: U.S. Naval Academy
Description: Allows for the assessment of academic performance of
midshipmen. It includes demographic information, information on grades,
participation in sports, leadership positions, etc. It is an extension of
the registrar's system and is mined for comparisons and trends.
Purpose: Managing human resources
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No
System: Navy Training Master Planning System
Description: Provides overall Navy training information to assist in
delivering Navy training in the most efficient manner. Pertinent data from
multiple databases are consolidated into a single database that is mined.
Purpose: Managing human resources
Status: Operational
Personal information: Yes; Private sector data: Yes; Other agency data: No
System: DHAMS Multidimensional Cubes
Description: Is a database that contains information on the time and
attendance of 3,000 mariners across 120 ships. Allows managers to look at
what people were doing at a particular time and to look across the fleet
as a whole and compare ship activities.
Purpose: Improving service or performance
Status: Operational
Personal information: No; Private sector data: No; Other agency data: No
System: National Cargo Tracking Plan Cargo Tracking Division
Description: Is used to conduct predictive analysis for counterterrorism,
small weapons of mass destruction proliferation, narcotics, alien
smuggling, and other high-interest activities involving container shipping
activity.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Operational
Personal information: No; Private sector data: Yes; Other agency data: No
System: Supply Management System Multidimensional Cubes
Description: Reduces downtime on ships by allowing for the analysis of
ship parts information. The data warehouse contains data on every part
that has been ordered since the 1980s, and has multidimensional
information on each part. Failure rates can be calculated and improvements
can be identified.
Purpose: Improving service or performance
Status: Operational
Personal information: No; Private sector data: No; Other agency data: No
System: Type Commanders Readiness Management System
Description: Is designed to provide a fully integrated environment for
online analytical processing of readiness indicators. Examples of
readiness indicators include status of supplies available, equipment in
operation, health status, and capabilities of the crew.
Purpose: Measuring military readiness
Status: Operational
Personal information: No; Private sector data: No; Other agency data: Yes
System: FATHOM (APMC-Human Resources)
Description: Will be an internal program and project tool used to improve
staffing, recruiting, and managing day-to-day operations.
Purpose: Managing human resources
Status: Planned
Personal information: Yes; Private sector data: No; Other agency data: No
System: Navy Training Quota Management System
Description: Is used for planning and forecasting training needs based on
skill requirements.
Purpose: Improving service or performance
Status: Operational
Personal information: No; Private sector data: No; Other agency data: Yes
Organization: National Geospatial-Intelligence Agency
System: OLAP (On-Line Analytical Processing)
Description: Will provide aggregations of imagery system performance data
for management officers and senior source decision makers to characterize
system performance and contribution to intelligence issues of national
priority.
Purpose: Improving service or performance
Status: Planned
Personal information: No; Private sector data: No; Other agency data: No
System: CITO Data Mining
Description: Will evaluate and identify imagery system performance trends
for optimization, monitoring, or reengineering.
Purpose: Improving service or performance
Status: Planned
Personal information: No; Private sector data: No; Other agency data: No
System: Information Relevance Prototype
Description: Will establish an information relevancy prototype to serve as
a framework for community evaluation of commercial information relevance
approaches, methods, and technology. The term information relevance refers
to the ability of users to receive or extract, then display and describe,
information with measurable satisfaction according to their need.
Purpose: Improving service or performance
Status: Planned
Personal information: No; Private sector data: No; Other agency data: No
Organization: U.S. Marine Corps
System: Operational Data Store Enterprise
Description: Is used for workforce planning.
Purpose: Managing human resources
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No
System: Global Combat Support Systems-Marine Corps
Description: Will be a physical implementation of the IT enterprise
architecture designed to support both improved and enhanced marine
air/ground task force combat service support functions and commander and
combatant commander joint task force combatant support information
requirements. Data mining will allow for interoperability with legacy
Marine Corps systems and allow for a shared data environment.
Purpose: Improving service or performance
Status: Planned
Personal information: No; Private sector data: Yes; Other agency data: No
System: Total Force Data Warehouse
Description: Is a system whose primary purpose is workforce planning and
workforce policy decision making. It contains current (after 30 days) and
historical workforce data.
Purpose: Managing human resources
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No
System: Marine Corps Recruiting Information Support System
Description: Is a Web-based information system used for managing assets
and tracking enlisted and officer accessions into the Marine Corps.
Purpose: Managing human resources
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No
Source: Department of Defense.
Table 5: Department of Education's Inventory of Data Mining Efforts
Each entry gives the organization or system name, description, purpose,
status, and features (use of personal information, private sector data,
and other agency data).
System: Citizenship of PLUS Loan Borrowers-National Student Loan Data
Systems
Description: Looks for issues regarding citizenship among its PLUS loan
borrowers. Flags records based on selected criteria and requests
additional information from schools.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: Yes; Other agency data: Yes
System: Foreign Schools Initiatives National Student Loan Data
System/Central Processing
Description: Is a proactive investigation effort that looks at whether
financial aid was granted individuals attending foreign institutions
during periods of nonenrollment.
Purpose: Detecting criminal activities or patterns
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: Yes
System: Professional Judgment Practices: Title IV Pell Grants, National
Student Loan Data
Description: Used to determine when professional judgment has been
exercised for "special" situations where families cannot afford college
expenses.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: Yes; Other agency data: Yes
System: Title IV Applicant-Death Database Match
Description: Compares Department of Education data with the Social
Security Administration's death database to detect fraud or criminal
activity.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: Yes
System: Title IV Loans with No Applications
Description: Will compare information from the Free Application for
Federal Student Aid Program with the Federal Family Education Loan Program
to identify fraud.
Purpose: Detecting fraud, waste, and abuse
Status: Planned
Personal information: Yes; Private sector data: No; Other agency data: No
System: OIG-Project Strikeback
Description: Compares Department of Education and Federal Bureau of
Investigation data for anomalies. Also verifies personal identifiers.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: Yes
System: Accuracy of U.S. Department of Education Personal Data
Description: Audits and verifies personal information that is contained in
the Department of Education's personal data system.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: Yes
System: Impact of Cohort Default Rate Redefinition-National Student Loan
Data System
Description: Audits data to determine the impact of legislation that
extended the college loan repayment default period from 180 to 270 days.
Purpose: Legislative impact
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No

System name: CheckFree Software/Purchase Card Program
Description: Takes monthly billing information from the Bank of America to create reports on purchases, purchase quantity, and frequency of purchases. Data are mined for instances of fraud or abuse.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Features: Personal information: Yes; Private sector data: Yes; Other agency data: No

System name: Improper Pell Grant Payment Activity
Description: Will compare Pell Grants issued with the amounts received and look at the eligibility of grant recipients.
Purpose: Detecting fraud, waste, and abuse
Status: Planned
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System name: Title IV Identity Theft Initiative
Description: Helps identify patterns and trends in identity theft cases involving loans for education. Provides an investigative resource for victims of identity theft.
Purpose: Detecting criminal activities or patterns
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System name: Title IV Applicant-Use of Multiple Addresses/Central Processing System
Description: Reviews addresses listed on Title IV applications to see if they are valid. For example, jails or employment addresses are not considered valid addresses.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes

System name: Lapsed Funds/Improper Draw of Federal Grant Proceeds
Description: Identifies funds that remain in the grants and payment processing system beyond the time period for allocating the funds.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: No; Private sector data: No; Other agency data: No

System name: Decision Support System with Online Analytical Processing Query
Description: Will support the department's performance-based initiative. Will allow custom queries of schools from state and local databases for demographics and test scores.
Purpose: Improving service or performance
Status: Planned
Features: Personal information: No; Private sector data: No; Other agency data: No

System name: Grant Administration and Payment System
Description: Assists in managing grant activities and aids in detecting instances of fraud or abuse in grant activities.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Features: Personal information: Yes; Private sector data: Yes; Other agency data: Yes

System name: Budget Execution Support
Description: Uses information in the National Student Loan Data System and a sample drawn from it to estimate cohort distributions for financial activities related to the Federal Family Education Loan Program pursuant to the Credit Reform Act.
Purpose: Financial management
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System name: Pell Grant Model Assumptions
Description: Provides estimates on the total cost of the Pell Grant program. It uses data from previous years and makes assumptions for future years.
Purpose: Financial management
Status: Operational
Features: Personal information: No; Private sector data: No; Other agency data: No

System name: National Student Loan Data System
Description: Compiles student loan information from the guaranteeing agencies. Is used for eligibility tracking and to calculate default rates.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes

System name: Loan Model Assumptions
Description: Estimates the cost of loan programs. Also analyzes loan default behavior.
Purpose: Financial management
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes

System name: Office of the Inspector General (OIG) Projects: Tumbleweed/Snowball
Description: Is part of an OIG investigation to determine potential fraud of financial aid grants primarily in New Hampshire.
Purpose: Detecting criminal activities or patterns
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes

System name: Central Processing System
Description: Processes applications for student aid. Contains data on more than 13 million applications. Data are mined for demographic trends.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System name: Direct Loan Services System
Description: Is used to track the life of student direct loans and to monitor loan repayments.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: Yes; Private sector data: Yes; Other agency data: Yes

System name: CheckFree Software/Travel Card Program
Description: Uses monthly billing information from Bank of America to create reports on travel expenditures to look for improper use of travel cards.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Features: Personal information: Yes; Private sector data: Yes; Other agency data: No
Source: Department of Education.
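
The Title IV Applicant-Death Database Match entry in Table 5 describes comparing Department of Education data with the Social Security Administration's death database. A minimal Python sketch of that kind of match follows. The record layout, field names, and sample identifiers are hypothetical and are not taken from either agency's systems; the report does not describe the actual matching procedure.

```python
# Hypothetical applicant records; the real layout is not described in the report.
applicants = [
    {"applicant_id": "A-001", "ssn": "000-00-0001"},
    {"applicant_id": "A-002", "ssn": "000-00-0002"},
    {"applicant_id": "A-003", "ssn": "000-00-0003"},
]

# Hypothetical identifiers taken from a death database extract.
death_records = {"000-00-0002", "000-00-0009"}


def flag_matches(applicants, death_records):
    """Return applicant IDs whose identifier also appears in the death extract."""
    return [a["applicant_id"] for a in applicants if a["ssn"] in death_records]


flagged = flag_matches(applicants, death_records)
print(flagged)
```

A flagged record in a sketch like this is only a lead for review; an exact identifier match does not by itself establish fraud.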
Table 6: Department of Energy's Inventory of Data Mining Efforts

System name: Counterintelligence Automated Investigative Management System (CI-AIMS)
Description: Is an investigative management system used by Department of Energy (DOE) field sites to track investigative cases on individuals or countries that threaten DOE assets. Information stored in this database is also used to support federal and state law enforcement agencies in support of national security.
Purpose: Detecting criminal activities or patterns
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System name: Autonomy
Description: Will be used to mine a myriad intelligence-related databases within the intelligence community to uncover criminal or terrorist activities relating to DOE assets.
Purpose: Detecting criminal activities or patterns
Status: Planned
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System name: Counterintelligence Analytical Research Data System (CARDS)
Description: Is used to log briefings and debriefings given to DOE employees who travel to foreign countries or interact with foreign visitors to DOE facilities. Data are mined to identify potential threats to DOE assets.
Purpose: Detecting criminal activities or patterns
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes
Source: Department of Energy.
Table 7: Department of Health and Human Services' Inventory of Data Mining Efforts

Agency for Healthcare Research and Quality

System name: National Patient Safety Network
Description: Will contain reports on adverse medical events that are filed by hospitals. The planned network's purpose is to take out patient personal identifiers and other items that may violate certain rules and create a warehouse that can be used by registered and unregistered users to evaluate and implement patient safety and quality measures. The network will be used to create tools that hospitals can use for making quality improvements.
Purpose: Improving service or performance
Status: Planned
Features: Personal information: No; Private sector data: No; Other agency data: No

Centers for Disease Control and Prevention; Department of Health and Human Services Headquarters; Food and Drug Administration

System name: BioSense
Description: Enhances the nation's capability to rapidly detect bioterrorism events.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Operational
Features: Personal information: No; Private sector data: Yes; Other agency data: Yes

System name: DHHS Blood Monitoring Program
Description: Monitors the country's blood supply by keeping an inventory on red blood cells and platelets and monitors blood supply shortages, the nature of the shortage, and size of the shortages.
Purpose: Monitoring public health
Status: Operational
Features: Personal information: No; Private sector data: Yes; Other agency data: No

System name: Mission Accomplishment and Regulatory Compliance Services System
Description: Is a comprehensive redesign and reengineering of two core mission-critical legacy systems at Food and Drug Administration (FDA) that support the regulatory functions that primarily take place in FDA's field offices.
Purpose: Monitoring food or drug safety
Status: Operational
Features: Personal information: No; Private sector data: Yes; Other agency data: Yes

System name: Turbo Establishment Inspection Report
Description: Provides a standardized database of citations of regulations and statutes, and helps investigators in preparing reports. It will collect data on specific observations uncovered during inspections and provide a more uniform format nationwide that will allow for electronic searches and statistical analysis to be performed by citation.
Purpose: Improving safety
Status: Operational
Features: Personal information: No; Private sector data: Yes; Other agency data: No

System name: Phonetic Orthographic Computer Analysis
Description: Is a search engine that provides results indicating how similar two drug names are on a phonetic and orthographic basis. Its purpose is to help in the safety evaluation of proposed proprietary names to reduce drug name confusion after an application is approved by the FDA.
Purpose: Improving safety
Status: Operational
Features: Personal information: No; Private sector data: Yes; Other agency data: No

System name: MPRIS Data Warehouse
Description: Will provide data to support end user ad-hoc query analysis and standard reporting needs. It will provide the foundation for a central reporting repository that can be used to populate business-specific data marts.
Purpose: Improving service or performance
Status: Planned
Features: Personal information: No; Private sector data: No; Other agency data: No

System name: Development and Deployment of Advanced Analytical Tools for Drug Safety Risk Assessment
Description: Will develop advanced software tools for quantitative analysis of drug safety data. Medical officers and safety evaluators will use these advances in software tools.
Purpose: Analyzing scientific and research information
Status: Planned
Features: Personal information: Yes; Private sector data: Yes; Other agency data: Yes

System name: Add data mining capability to CFSAN Adverse Event Reporting System
Description: Is a comprehensive system for tracking, reviewing, and reporting adverse event incidences involving foods, cosmetics, and dietary supplements. Integrating and centralizing the system and eliminating patchwork systems make information on these adverse events available to federal, state, and local governments as well as to industry and the public in a more timely and efficient manner.
Purpose: Monitoring food or drug safety
Status: Planned
Features: Personal information: Yes; Private sector data: Yes; Other agency data: Yes

Health Resources and Services Administration

System name: HRSA Geospatial Data Warehouse
Description: Data warehouse that primarily collects programmatic, demographic, and statistical data.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: No; Private sector data: Yes; Other agency data: Yes

Program Support Center

System name: Employee Assistance Program Analysis
Description: Uses information from a database of employee assistance program case information that does not contain client personal identifiers. Data are mined for quality assurance and program management information that is used to enhance the quality and cost effectiveness of services.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: No; Private sector data: No; Other agency data: No
Source: Department of Health and Human Services.
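
The Phonetic Orthographic Computer Analysis entry in Table 7 describes scoring how similar two drug names are on a phonetic and an orthographic basis. The Python sketch below illustrates the general idea only. The consonant classes, the scoring functions, and the sample names are assumptions made for illustration; they are not the FDA system's algorithm, which the report does not describe.

```python
from difflib import SequenceMatcher

# Crude consonant classes for a simplified phonetic key.
# This mapping is an illustrative assumption, not the FDA system's method.
_CLASSES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def phonetic_key(name):
    """Map a name to consonant-class digits, skipping unmapped letters
    (vowels, h, w, y) and collapsing consecutive identical codes."""
    key = []
    for ch in name.lower():
        code = _CLASSES.get(ch)
        if code and (not key or key[-1] != code):
            key.append(code)
    return "".join(key)


def orthographic_similarity(a, b):
    """Spelling similarity in [0, 1] based on matching character runs."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def phonetic_similarity(a, b):
    """Similarity in [0, 1] between the simplified phonetic keys."""
    return SequenceMatcher(None, phonetic_key(a), phonetic_key(b)).ratio()


# Hypothetical names, invented for this example.
print(orthographic_similarity("Zantrol", "Xantrol"))
print(phonetic_similarity("Zantrol", "Xantrol"))
```

Keeping the two scores separate lets a reviewer see whether a pair of names is risky because it looks alike, sounds alike, or both.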
Table 8: Department of Homeland Security's Inventory of Data Mining Efforts

Border and Transportation Security Directorate

System name: Workforce Profile Data Mart
Description: Contains payroll and personnel data and is mined for workforce trends.
Purpose: Managing human resources
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes

System name: Customs Integrated Personnel Payroll System Data Mart
Description: Is a Customs data mart contained within Department of Homeland Security's workforce profile data mart. Personnel and payroll data are mined for workforce trends.
Purpose: Managing human resources
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes

System name: Internal Affairs Treasury Enforcement Communications System Audit Data Mart
Description: Assists the Internal Affairs group by mining criminal activity data to ascertain how Customs' employees are using the Treasury Enforcement System.
Purpose: Detecting criminal activities or patterns
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes

System name: Operations Management Reports Data Mart
Description: Assists in managing the operation of all ports of entry for incoming carriers, people, and cargo. Helps in making resource (people and equipment) allocation and operational improvement decisions.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: No; Private sector data: No; Other agency data: Yes

System name: Automated Export System Data Mart
Description: Mines data on export trade in the U.S. and produces reports on historical shipping and receiving trends.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: No; Private sector data: Yes; Other agency data: Yes

System name: Seized Property/Forfeitures, Penalties, and Fines Case Management Data Mart
Description: Mines data to ensure data quality and review work assignments. System has two components: one that processes legal cases like a law firm, and a second that serves as property and inventory control by tracking property seized.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System name: Incident Data Mart
Description: Will look through incident logs for patterns of events. An incident is an event involving a law enforcement or government agency for which a log was created (e.g., traffic ticket, drug arrest, or firearm possession). The system may look at crimes in a particular geographic location, particular types of arrests, or any type of unusual activity.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Planned
Features: Personal information: Yes; Private sector data: Yes; Other agency data: Yes

System name: Case Management Data Mart
Description: Assists in managing law enforcement cases, including Customs cases. Reviews case loads, status, and relationships among cases.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Operational
Features: Personal information: Yes; Private sector data: Yes; Other agency data: Yes

Emergency Preparedness and Response Directorate

System name: Enterprise Data Warehouse
Description: Will take data from multiple, disparate systems and integrate the data into one reporting environment. The objective of the effort is to allow for the reduction of data within the agency and to provide an enterprise view of information necessary to drive critical business processes and decisions. Data on internal human resources, all aspects of disaster management, infrastructure, equipment location, etc., will be used.
Purpose: Disaster response and recovery
Status: Planned
Features: Personal information: Yes; Private sector data: Yes; Other agency data: Yes

Information Analysis and Infrastructure Protection Directorate

System name: Analyst Notebook I2
Description: Correlates events and people to specific information.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Operational
Features: Personal information: Yes; Private sector data: Yes; Other agency data: No

System name: Automatic Message Handling System (Verity)
Description: Automatically takes messages from external agencies and routes them to appropriate recipients.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Planned
Features: Personal information: No; Private sector data: No; Other agency data: Yes

U.S. Coast Guard

System name: Readiness Management System
Description: Assists in ensuring readiness for all Coast Guard missions.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System name: CG Info
Description: Provides one-stop shopping for Coast Guard information. It is the central location and common interface for the entire Coast Guard to gain near real-time access to data from multiple, disparate Coast Guard information systems. It provides a single interface for users to view mission-critical support data.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes

U.S. Secret Service

System name: Criminal Investigation Division Data Mining
Description: Mines data in suspicious activity reports received from banks to find commonalities in data to assist in strategically allocating resources.
Purpose: Detecting criminal activities or patterns
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes
Source: Department of Homeland Security.
Table 9: Department of the Interior's Inventory of Data Mining Efforts

Minerals Management Service

System name: Data Mining of the Technical Information Management System (TIMS) Database
Description: Is a corporate database for oil and gas leases. The database is mined in support of policy development. One area of data mining is identification of leases that will be abandoned in the near future. Data mining has shown that leases with six or more producing wells in 1 year are almost never abandoned in the next year. Another application of data mining is the safety of oil and gas operations. For example, data mining has shown that accidents have a peak rate on Thursday mornings.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: Yes; Private sector data: Yes; Other agency data: No
Source: Department of the Interior.
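
The TIMS entry in Table 9 reports a mined pattern: leases with six or more producing wells in 1 year are almost never abandoned in the next year. The Python sketch below shows one way such a pattern could be checked, by comparing next-year abandonment rates for leases at or above a well-count threshold against those below it. The lease records are invented for illustration and are not TIMS data; only the threshold of six comes from the table.

```python
from collections import defaultdict

# Hypothetical records: (lease_id, producing_wells_in_year, abandoned_next_year).
LEASES = [
    ("A", 0, True),
    ("B", 2, True),
    ("C", 2, False),
    ("D", 6, False),
    ("E", 9, False),
    ("F", 1, True),
]


def abandonment_rate_by_group(leases, threshold=6):
    """Return the next-year abandonment rate for leases at or above the
    producing-well threshold and for leases below it."""
    counts = defaultdict(lambda: [0, 0])  # group -> [abandoned, total]
    for _lease_id, wells, abandoned in leases:
        group = "at_or_above" if wells >= threshold else "below"
        counts[group][0] += int(abandoned)
        counts[group][1] += 1
    return {group: a / t for group, (a, t) in counts.items()}


rates = abandonment_rate_by_group(LEASES)
print(rates)
```

A large gap between the two rates on real data would be consistent with the pattern the table describes; the sample above is too small to show anything on its own.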
Table 10: Department of Justice's Inventory of Data Mining Efforts

Department of Justice Headquarters; Drug Enforcement Administration; Federal Bureau of Investigation

System name: Drug/Financial Fusion Center
Description: Will contain data from, and be used by, Organized Crime and Drug Enforcement Task Force agencies. The system will permit the collection and cross case analysis of all drug and related financial investigative data.
Purpose: Detecting criminal activities or patterns
Status: Planned
Features: Personal information: Yes; Private sector data: Yes; Other agency data: Yes

System name: Statistical Management Analysis and Reporting Tool System (SMARTS)/SPSS
Description: Is a query analysis and reporting tool that pulls data from many systems. It allows for statistical analyses of drug cases Drug Enforcement Administration's statistical reporting.
Purpose: Detecting criminal activities or patterns
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes

System name: TOLLS
Description: Is a database of telephone calls from court ordered and approved wiretaps and Title III investigations. Information such as telephone numbers, time and date of calls, and call duration is captured. Data are mined for patterns to give leads in investigations of drug trafficking.
Purpose: Detecting criminal activities or patterns
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System name: Secure Collaborative Operational Prototype Environment/Investigative Data Warehouse
Description: Allows the FBI to search multiple data sources through one interface to uncover terrorist and criminal activities and relationships. Data sources are a combination of structured and unstructured text.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes

System name: Foreign Terrorist Tracking Task Force Activity
Description: Supports the Foreign Terrorist Tracking Task Force that seeks to prevent foreign terrorists from gaining access to the United States. Data from the Department of Homeland Security, Federal Bureau of Investigation, and public data sources are put into a data mart and mined to determine unlawful entry and to support deportations and prosecutions.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Operational
Features: Personal information: Yes; Private sector data: Yes; Other agency data: Yes

System name: FBI Intelligence Community Data Marts
Description: Is intended to take a subset of approved data from a data warehouse and make it available to the intelligence community.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Planned
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes

Federal Bureau of Prisons; U.S. Marshals Service

System name: Business Information Warehouse
Description: Will be a warehouse designed to provide information on manufacturing by Federal Prison Industries, which runs 100 factories in various prisons. Data will be mined for information on the manufacturing environment (such as information on material on hand, scheduling, and the production process) and financial activities.
Purpose: Improving service or performance
Status: Planned
Features: Personal information: No; Private sector data: No; Other agency data: Yes

System name: USMS Workload Modeling
Description: Will seek to develop a workforce model that will support budget formulation, execution, and resource analysis. Will be a planning and execution activity that will be used to help determine the quantity and location of required resources.
Purpose: Managing human resources
Status: Planned
Features: Personal information: Yes; Private sector data: No; Other agency data: No
Source: Department of Justice.
Table 11: Department of Labor's Inventory of Data Mining Efforts

System name: Dashboard Display
Description: Provides links to programs throughout the Department of Labor's Employment Training Administration to provide reports or information on financial activities.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System name: Enforcement Management System, Case Opening, and Results Analysis
Description: Is used to track investigations of violations of Title I and other criminal laws pertaining to pension and welfare rights.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: Yes; Private sector data: Yes; Other agency data: No

System name: Employee Retirement Income Security Act Data System
Description: Is used to monitor compliance with Title I of the Employee Retirement Income Security Act.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System name: Mine Safety and Health Administration Teradata Data Store
Description: Mines data from a data store of information on safety and health enforcement and demographic data for mine operations, along with miner accidents, injury, and illness data.
Purpose: Improving safety
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes

System name: Mathematical Statistics Research Center
Description: Will look at data from economic surveys to compare rates of nonresponse for Bureau of Labor Statistics.
Purpose: Improving service or performance
Status: Planned
Features: Personal information: No; Private sector data: No; Other agency data: No
Source: Department of Labor.
Table 12: Department of State's Inventory of Data Mining Efforts

System name: Citibank's Ad Hoc Reporting System
Description: Enables purchase card managers to track trends related to the usage of credit cards by employees in purchasing supplies and services for official use. Purchase card program is worldwide, and spending patterns and purchases are monitored for potential misuse or fraud.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Features: Personal information: Yes; Private sector data: Yes; Other agency data: No

System name: Purchase Card Management System
Description: Will involve the automation of internal workflow processes (system is in the early phases of development). Will use internal data and bank data to track trends and anomalies in the Department of State's worldwide purchase card program.
Purpose: Detecting fraud, waste, and abuse
Status: Planned
Features: Personal information: Yes; Private sector data: Yes; Other agency data: No
Source: Department of State.
Table 13: Department of Transportation's Inventory of Data Mining Efforts

Department of Transportation Headquarters

System name: DOT IT Security Management System
Description: Will collect information to allow management to assess its IT security infrastructure.
Purpose: Detecting fraud, waste, and abuse
Status: Planned
Features: Personal information: Yes; Private sector data: No; Other agency data: No

National Highway Traffic Safety Administration

System name: State Data System
Description: Analyzes, mines, and researches automotive crash data, such as statistics from rollovers of SUVs, from 22 states to improve highway safety and lessen fatalities. Policies can be set based on the data.
Purpose: Improving safety
Status: Operational
Features: Personal information: No; Private sector data: No; Other agency data: No

System name: Fatality Analysis Reporting System (FARS)
Description: Helps to evaluate the effectiveness of motor vehicle safety standards and highway safety programs. Data are collected from all 50 states, the District of Columbia, and Puerto Rico and are used to evaluate and support highway safety.
Purpose: Improving safety
Status: Operational
Features: Personal information: Yes; Private sector data: Yes; Other agency data: Yes

System name: National Automotive Sampling System
Description: Collects and mines information on automotive crashes. System is related to the Federal Motor Vehicle Safety Standards that regulate vehicle compliance items such as seat belts, air bags, and the stopping distance of brakes.
Purpose: Improving safety
Status: Operational
Features: Personal information: Yes; Private sector data: Yes; Other agency data: No
Source: Department of Transportation.
Table 14: Department of the Treasury's Inventory of Data Mining Efforts

Financial Management Service

System name: Treasury Offset Program (TOP) Cleanup
Description: Mines data to reduce the number of debts listed in TOP.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes

System name: Electronic Federal Tax Payment System (EFTPS) Marketing
Description: Is a free service offered by the Department of the Treasury for individuals and business taxpayers who pay their federal taxes electronically. Mining activity tracks enrollment, tax payment history, and usage trends.
Purpose: Increasing tax compliance
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

Internal Revenue Service

System name: Planning, Analysis, and Decision Support System
Description: Will be a component of the Custodial Accounting Program, which is the warehouse that is used to query transactional data and produce reports. This activity is meant to improve reporting and use decision support tools.
Purpose: Improving service or performance
Status: Planned
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes

System name: Abusive Corporate Tax Shelter Detection Model
Description: Will model characteristics of corporate tax shelters and use models to predict corporate tax shelter abuse and to assess compliance risk in the corporate taxpayer population.
Purpose: Increasing tax compliance
Status: Planned
Features: Personal information: Yes; Private sector data: Yes; Other agency data: No

System name: K-1 Link Analysis
Description: Will be used to detect potential tax evasion.
Purpose: Increasing tax compliance
Status: Planned
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System name: Research on the Population of Taxpayers Who Receive Earned Income Tax Credit
Description: Will be used to research data on taxpayers who receive the EITC.
Purpose: Detecting fraud, waste, and abuse
Status: Planned
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System name: Issue Based Management Information System
Description: Will provide access to a variety of data sources within IRS. Will assist in research and case work.
Purpose: Increasing tax compliance
Status: Planned
Features: Personal information: No; Private sector data: Yes; Other agency data: No

System name: Electronic Fraud Detection System
Description: Mines data to evaluate and rate potentially fraudulent individual tax returns.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System name: Reveal
Description: Will be used to detect financial criminal activity such as tax evasion.
Purpose: Detecting criminal activities or patterns
Status: Planned
Features: Personal information: Yes; Private sector data: Yes; Other agency data: No

System name: Oracle Model 22 Partnership Return Scoring System
Description: Takes information from individual tax returns and attempts to replicate judgments made by taxpayers to detect the likelihood of material errors.
Purpose: Increasing tax compliance
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System name: SPSS Form 1120-S Return Scoring System
Description: Will automate the classification of certain corporate tax returns.
Purpose: Increasing tax compliance
Status: Planned
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System name: Oracle Model 33 Partnership Scoring Model
Description: Will identify noncompliance in partnership returns.
Purpose: Increasing tax compliance
Status: Planned
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System name: Compliance Laboratory
Description: Will identify taxpayer noncompliance by looking at groups of returns.
Purpose: Increasing tax compliance
Status: Planned
Features: Personal information: Yes; Private sector data: Yes; Other agency data: Yes

U.S. Mint

System name: Information Technology Intrusion Detection System
Description: Collects information on potential intrusions to U.S. Mint systems. Looks for trends in information reported by sensors to determine if illicit activity has occurred. Minimizes false positives.
Purpose: Improving information security
Status: Operational
Features: Personal information: No; Private sector data: No; Other agency data: No

System name: E-Commerce Fraud Analysis Activity
Description: Attempts to identify and stop fraudulent activity involving stolen credit cards to order products over the Internet or via telephone. Fraud rating identifiers are used to identify areas where fraud has occurred and to determine the likelihood of fraud. Allows for orders to be stopped or for orders over a certain dollar limit to be stopped.
Purpose: Detecting criminal activities or patterns
Status: Operational
Features: Personal information: Yes; Private sector data: Yes; Other agency data: Yes

System name: Data Warehouse
Description: Will be an integrated, scalable, expandable data warehouse that will support business functions by grouping the data in subject-oriented data marts. Each warehouse data mart will be defined to integrate both internal and external data to provide the necessary information to perform both historical and predictive analysis and support numerous calculations.
Purpose: Improving service or performance
Status: Planned
Features: Personal information: No; Private sector data: No; Other agency data: No
Source: Department of the Treasury.
Table 15: Department of Veterans Affairs' Inventory of Data Mining Efforts

Department of Veterans Affairs Headquarters; Veterans Benefits Administration; Veterans Health Administration

System name: Veterans Affairs Central Incident Response Center
Description: Is used to monitor and manage intrusion detection and firewalls. Scripts are written for forensic analysis to go through data collected from system and network logs.
Purpose: Detecting criminal activities or patterns
Status: Operational
Features: Personal information: Yes; Private sector data: Yes; Other agency data: No

System name: Purchase Card Data Mining (SAS) Reports
Description: Will identify patterns in purchase card use to identify fraud and misuse and to maintain good internal controls.
Purpose: Detecting fraud, waste, and abuse
Status: Planned
Features: Personal information: Yes; Private sector data: Yes; Other agency data: No

System name: Travel Card Data Mining (SAS) Reports
Description: Will be used to look for patterns in the use of travel credit cards that indicate misuse or fraud and to maintain good internal controls.
Purpose: Detecting fraud, waste, and abuse
Status: Planned
Features: Personal information: Yes; Private sector data: Yes; Other agency data: No

System name: Office of Inspector General (OIG)
Description: Analyzes and matches (within the guidelines of the law) Veterans Affairs (VA) files, pertaining to both VA-provided benefits and health care services to detect patterns of waste, fraud, or abuse.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System name: C & P Payment Data Analysis
Description: Analyzes compensation and pension data to detect fraud, waste, and abuse.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes

System name: C & P Large Payment Verification Process
Description: Serves as an internal control intended to make sure that payments over a certain dollar threshold are reviewed to detect potential fraud or abuse.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System name: Primary Analysis and Classification
Description: Is used mainly to discover trends, incidents/events, and vulnerabilities that may exist in VA hospitals.
Purpose: Improving safety
Status: Operational
Features: Personal information: No; Private sector data: No; Other agency data: No

System name: Allocation Resource Center Database
Description: Is used in making resource allocation decisions based on the analysis of patient workload and cost data.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System name: Veteran's Health Administration (VHA) Financial and Clinical Data Mart
Description: Integrates patient, clinical, and financial data to present a unified management perspective and enable consistent reporting.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System name: Decision Support System
Description: Is used to identify patterns of care and patient outcomes linked to resource consumption and costs associated with each patient encounter.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System name: Top 50 Standardization Listing/Managed Inventory System
Description: Is used to standardize medical and hospital supplies and equipment to (1) improve VHA's bargaining position when soliciting bids and (2) facilitate the ability to move doctors among hospitals.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: No; Private sector data: Yes; Other agency data: No

System name: VISN 16 Data Warehouse
Description: Provides unified view of the VISN 16 VA region, composed of 10 medical centers and 30 outpatient clinics. The system gives a view of the enterprise for management purposes. It is mined for a variety of types of information such as patient encounters, lab tests, pharmacy records, etc.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No
Source: Department of Veterans Affairs.
Table 16: Environmental Protection Agency's Inventory of Data Mining Efforts

System name: Conceptual Plans to Design an Approach and System to Review Financial Data
Description: Will regularly review financial data systems for contracts, bank cards, and small purchases and other financial databases for misuse or fraud of Environmental Protection Agency's assets.
Purpose: Detecting fraud, waste, and abuse
Status: Planned
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System name: Drinking Water Data Warehouse
Description: Integrates and analyzes drinking water information from state, regional, and headquarters sources. Includes data on water systems, compliance, sample analytical results, and audit data.
Purpose: Monitoring public health
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes
Source: Environmental Protection Agency.
Table 17: Export-Import Bank of the United States' Inventory of Data Mining Efforts

System name: Integrated Information System Data Warehouse Mining for Financial Risk Information
Description: Is used to generate reports that describe bank lending activities and exposure trends.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No
Source: Export-Import Bank of the United States.
Table 18: Federal Deposit Insurance Corporation's Inventory of Data Mining
Efforts

System name: Real Estate Stress Test (RSST)
Description: Is used to measure real estate risk. Bank examiners use data
from the system as part of a pre-examination planning process to assist in
identifying risk concentrations.
Purpose: Detecting risk in financial systems
Status: Operational
Personal information: No; Private sector data: No; Other agency data: Yes

System name: Determination of Insured Deposits
Description: Will support the development of a new system for implementing
the deposit insurance claims.
Purpose: Improving service or performance
Status: Planned
Personal information: Yes; Private sector data: No; Other agency data: No

System name: Statistical CAMELS Offsite Review
Description: Is used to rate financial institutions' performance and risk
management practices.
Purpose: Detecting risk in financial systems
Status: Operational
Personal information: No; Private sector data: No; Other agency data: Yes

System name: Growth Monitoring System
Description: Is used to identify financial institutions that have experienced
significant growth. Serves as an early warning system for detecting financial
institutions that might pose financial risk to FDIC.
Purpose: Detecting risk in financial systems
Status: Operational
Personal information: No; Private sector data: No; Other agency data: Yes
Source: Federal Deposit Insurance Corporation.
Table 19: Federal Reserve System's Inventory of Data Mining Efforts

System name: Office of the Inspector General (OIG), Audit Services
Description: Will support audits and evaluations. Using ACL, queries will be
run against the board's financial and personnel systems to detect fraud,
waste, and abuse, or to provide information supporting any aspect of an OIG
project.
Purpose: Detecting fraud, waste, and abuse
Status: Planned
Personal information: Yes; Private sector data: No; Other agency data: No
Source: Federal Reserve System.
Table 20: National Aeronautics and Space Administration's Inventory of Data
Mining Efforts

System name: Archiving of Web Information at National Aeronautics and Space
Administration (NASA) and Goddard Space Flight Center (GSFC)
Description: Will gather metadata on the GSFC Web site at NASA to preserve
NASA legacy information.
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: No; Private sector data: No; Other agency data: Yes

System name: My Goddard Search-Mining of Goddard's Web environment
Description: Will allow Web mining of scientific data at Goddard Space
Center. It is referred to as "Google for Goddard."
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: No; Private sector data: Yes; Other agency data: No

System name: NetContext
Description: Will monitor network traffic for the purpose of identifying
bandwidth use, fraud, abuse, and IT security-related activities.
Purpose: Detecting fraud, waste, and abuse
Status: Planned
Personal information: Yes; Private sector data: No; Other agency data: No

System name: Geophysics Time Series Analysis
Description: Will develop a set of algorithms to identify patterns within
temporal activities. The data will be trajectories of objects and movement of
objects within images.
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: No; Private sector data: No; Other agency data: Yes

System name: "Simmarizer" (Simulation-Based Summary/Discovery of Knowledge)
Description: Uses data mining techniques to extract knowledge from simulators
to understand conditions and scenarios regarding space missions.
Purpose: Analyzing scientific and research information
Status: Operational
Personal information: No; Private sector data: No; Other agency data: No

System name: Global Environmental and Earth Science Information System
(GENESIS)
Description: Is used to obtain information about global climate changes.
Purpose: Analyzing scientific and research information
Status: Operational
Personal information: No; Private sector data: No; Other agency data: Yes

System name: Machine Learning and Data Mining for Improved Intelligent Data
Understanding of High Dimensional Earth Sensed Data
Description: Will find patterns and relationships in large, complex earth
science data sets, specifically for rare and small events hidden in larger
data signals. Will build new capabilities to understand NASA science data.
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: No; Private sector data: No; Other agency data: Yes
System name: Distributed Data Mining Techniques for Object Discovery in the
National Virtual Observatory (NVO)
Description: Involves using a data mining tool set for space science
research. Incorporates a small number of targeted data mining techniques in
order to address specific NASA space science research programs. In
particular, the data mining environment will be used to explore NASA's large
space science data collections. These techniques are being applied to
astronomical object discovery, identification, classification, and
interpretation across large multiple distributed astronomy data collections.
Purpose: Analyzing scientific and research information
Status: Operational
Personal information: No; Private sector data: No; Other agency data: Yes

System name: Diamond Eye (System for Mining Images)
Description: Analyzes large sets of images looking for specific features.
Purpose: Analyzing scientific and research information
Status: Operational
Personal information: No; Private sector data: Yes; Other agency data: Yes

System name: Data Mining of 3-D Numerical Model Forecast Output and Its
Application to Atmospheric Research
Description: Will automate the analysis of weather model output, observation,
and satellite data to allow for a better understanding of the science of
weather dynamics and to predict future weather events.
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: No; Private sector data: No; Other agency data: Yes

System name: Ecological Forecasting
Description: Will develop an adaptable system that can be used to mine large
volumes of scientific data, identify novel causal relationships in the data
about earth system processes, and rapidly incorporate discoveries with
biospheric models to generate now-casts and forecasts of biospheric events
and conditions.
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: No; Private sector data: No; Other agency data: Yes
System name: Distributed Data Mining for Large NASA Databases (Earth Science
Earth Observing System Data)
Description: Will research changes, trends, and relationships in Earth
Observing System (EOS) data. The major feature of this activity is that it
will allow for different data to be mined in parts and then merged. The
capability is needed for instances when scientific data are at different
locations. A research quality software will be used to allow for a
communication system and run-time environment for applying a collective data
analysis approach not bound to any specific platform, learning algorithm, or
representation of knowledge.
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: No; Private sector data: No; Other agency data: Yes

System name: Discovery of Changes from the Global Carbon Cycle and Climate
System Using Data Mining Activity
Description: Will detect patterns in scientific data that are geospatial and
dynamic and represented as raster data (gridded cells of surfaces such as the
sun's or earth's surfaces). Mining capabilities are being developed for
future NASA-relevant data and science.
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: No; Private sector data: No; Other agency data: Yes

System name: "AutoSciProd" (Automatic Generation of Science Products from
Large Image Data Sets)
Description: Uses statistical and image data to determine and improve science
products.
Purpose: Analyzing scientific and research information
Status: Operational
Personal information: No; Private sector data: No; Other agency data: No

System name: Near Archive Data Mining of Earth Science Data
Description: Pulls data from an archive of earth science data and applies
scientists' analyses and algorithms to the data.
Purpose: Analyzing scientific and research information
Status: Operational
Personal information: No; Private sector data: No; Other agency data: No

System name: Spectral Analysis Automation (SAA) System
Description: Will improve the collection, identification, and evaluation of
spectral data to better meet scientists' requirements.
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: No; Private sector data: No; Other agency data: No

System name: Multiple Sensor Image Registration, Image Fusion and Dimension
Reduction Using Wavelets
Description: Will be used for collaborative preprocessing of data and
research on wavelets. Will comprise research software that looks at different
technologies such as image processing and dimensions.
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: No; Private sector data: No; Other agency data: No
System name: GMSEC Event Message Data Mining Task
Description: Will be used to determine health of and reasons for problems
with satellite systems.
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: No; Private sector data: No; Other agency data: No

System name: Intrusion Detection System
Description: Looks at all traffic that traverses NASA's networks' borders.
Purpose: Improving information security
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No

System name: AvSP/ASMM Foreign Object Detection Toolset
Description: Is used with simulations to identify foreign object damage
indicators for commercial jet engines.
Purpose: Analyzing scientific and research information
Status: Operational
Personal information: No; Private sector data: Yes; Other agency data: No

System name: Mission and Science Measurement and Discovery Systems
Description: Will be a basic technology research program that will also
support infusion of resulting technologies into NASA missions. Purpose of the
program is to solve the research challenge in extracting the most scientific
knowledge from NASA's space missions and data archives.
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: No; Private sector data: No; Other agency data: No

System name: StarTool: Solar Active Region Detection
Description: Is used for recognition of solar activity in sequences of
multiband solar images.
Purpose: Analyzing scientific and research information
Status: Operational
Personal information: No; Private sector data: No; Other agency data: No

System name: "Toogle" (Times-Series Search Engine)
Description: Searches for time-series data. Is similar to a Google search
engine.
Purpose: Improving safety
Status: Operational
Personal information: No; Private sector data: No; Other agency data: No

System name: Use of Data Mining, Remote Sensing, and Geographic Information
Systems for Wildfire Detection and Prediction
Description: Will help the National Oceanic and Atmospheric Administration
automate its fire detection systems and improve the accuracy of fire
detection systems.
Purpose: Improving service or performance
Status: Planned
Personal information: No; Private sector data: No; Other agency data: Yes

System name: Knowledge Discovery and Data Mining Based on Hierarchical Image
Segmentation
Description: Will mine data using software that has been developed to exploit
information from a hierarchical image segmentation process.
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: No; Private sector data: Yes; Other agency data: Yes
Source: National Aeronautics and Space Administration.
Table 21: Nuclear Regulatory Commission's Inventory of Data Mining Efforts

System name: Licensee Event Report Data
Description: Identifies nuclear safety trends and patterns in commercial
nuclear power events.
Purpose: Improving safety
Status: Operational
Personal information: No; Private sector data: Yes; Other agency data: No

System name: Centralized Information Delivery
Description: Will consolidate and standardize reporting for nuclear reactor
regulations.
Purpose: Improving service or performance
Status: Planned
Personal information: Yes; Private sector data: No; Other agency data: No
Source: Nuclear Regulatory Commission.
Table 22: Office of Personnel Management's Inventory of Data Mining Efforts

System name: CRIS Retirement Data Mining Activity
Description: Mines federal employee benefits data such as information on
retirement and life insurance to assist in managing federal employee
eligibilities and entitlements.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: Yes
Source: Office of Personnel Management.
Table 23: Pension Benefit Guaranty Corporation's Inventory of Data Mining
Efforts

System name: Corporate Performance Indicators and Analytics
Description: Will streamline access to management and operational performance
measures and permit the correlation of performance and output measures.
Purpose: Improving service or performance
Status: Planned
Personal information: No; Private sector data: No; Other agency data: Yes

System name: Corporate Policy and Research Department's Forecasting System
Description: Is a stochastic simulation model that incorporates historic
equity and interest rates and bankruptcy possibilities to forecast scenarios
for more than 300 pension plans and their related corporate sponsors.
Purpose: Improving service or performance
Status: Operational
Personal information: No; Private sector data: Yes; Other agency data: Yes
Source: Pension Benefit Guaranty Corporation.
Table 24: Railroad Retirement Board's Inventory of Data Mining Efforts

System name: Railroad Retirement Board Data Stores
Description: Consists of two major databases (payment and entitlement history
and employment data maintenance) that are mined by actuaries to produce
annual actuarial reports and for audit support and quality control.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: Yes
Source: Railroad Retirement Board.
Table 25: Small Business Administration's Inventory of Data Mining Efforts

System name: Loan Monitoring System
Description: Helps to identify, measure, and manage the risk of Small
Business Administration's portfolio. Business credit scores are used but
individual credit scores are not.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: Yes; Other agency data: No

System name: MONSTER and Econometric Models
Description: Mines data from database that includes all transactions for each
loan that affects SBA subsidy costs, to assist in determining credit subsidy
rates for SBA's various credit programs.
Purpose: Financial management
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No
Source: Small Business Administration.
GAO's Mission
The General Accounting Office, the audit, evaluation and
investigative arm of Congress, exists to support Congress in meeting its
constitutional responsibilities and to help improve the performance and
accountability of the federal government for the American people. GAO
examines the use of public funds; evaluates federal programs and policies;
and provides analyses, recommendations, and other assistance to help
Congress make informed oversight, policy, and funding decisions. GAO's
commitment to good government is reflected in its core values of
accountability, integrity, and reliability.
Obtaining Copies of GAO Reports and Testimony
The fastest and easiest way to obtain copies of GAO documents at no cost
is through the Internet. GAO's Web site (www.gao.gov) contains abstracts
and full-text files of current reports and testimony and an expanding
archive of older products. The Web site features a search engine to help
you locate documents using key words and phrases. You can print these
documents in their entirety, including charts and other graphics.
Each day, GAO issues a list of newly released reports, testimony, and
correspondence. GAO posts this list, known as "Today's Reports," on its
Web site daily. The list contains links to the full-text document files.
To have GAO e-mail this list to you every afternoon, go to www.gao.gov and
select "Subscribe to e-mail alerts" under the "Order GAO Products"
heading.
Order by Mail or Phone
The first copy of each printed report is free. Additional copies are $2
each. A check or money order should be made out to the Superintendent of
Documents. GAO also accepts VISA and Mastercard. Orders for 100 or more
copies mailed to a single address are discounted 25 percent. Orders should
be sent to:
U.S. General Accounting Office
441 G Street NW, Room LM
Washington, D.C. 20548
To order by Phone:
Voice: (202) 512-6000
TDD: (202) 512-2537
Fax: (202) 512-6061
To Report Fraud, Waste, and Abuse in Federal Programs
Contact:
Web site: www.gao.gov/fraudnet/fraudnet.htm
E-mail: [email protected]
Automated answering system: (800) 424-5454 or (202) 512-7470
Public Affairs
Jeff Nelligan, Managing Director, [email protected] (202) 512-4800
U.S. General Accounting Office, 441 G Street NW, Room 7149
Washington, D.C. 20548
*** End of document. ***