Data Mining: Federal Efforts Cover a Wide Range of Uses (04-MAY-04, GAO-04-548).

Both the government and the private sector are increasingly using "data mining"--that is, the application of database technology and techniques (such as statistical analysis and modeling) to uncover hidden patterns and subtle relationships in data and to infer rules that allow for the prediction of future results. As has been widely reported, many federal data mining efforts involve the use of personal information that is mined from databases maintained by public as well as private sector organizations. GAO was asked to survey data mining systems and activities in federal agencies. Specifically, GAO was asked to identify planned and operational federal data mining efforts and describe their characteristics.

-------------------------Indexing Terms-------------------------
REPORTNUM: GAO-04-548
ACCNO: A09947
TITLE: Data Mining: Federal Efforts Cover a Wide Range of Uses
DATE: 05/04/2004
SUBJECT: Counterterrorism; Crime prevention; Data collection; Federal agencies; Fraud; Information technology; Personnel management; Planning; Statistical methods; Data mining; Personal information

GAO-04-548
United States General Accounting Office
Report to the Ranking Minority Member, Subcommittee on Financial Management, the Budget, and International Security, Committee on Governmental Affairs, U.S. Senate
May 2004

DATA MINING
Federal Efforts Cover a Wide Range of Uses

Highlights of GAO-04-548

Federal agencies are using data mining for a variety of purposes, ranging from improving service or performance to analyzing and detecting terrorist patterns and activities. Our survey of 128 federal departments and agencies on their use of data mining shows that 52 agencies are using or are planning to use data mining. These departments and agencies reported 199 data mining efforts, of which 68 are planned and 131 are operational. The figure here shows the most common uses of data mining efforts as described by agencies. Of these uses, the Department of Defense reported the largest number of efforts aimed at improving service or performance, managing human resources, and analyzing intelligence and detecting terrorist activities. The Department of Education reported the largest number of efforts aimed at detecting fraud, waste, and abuse.
The National Aeronautics and Space Administration reported the largest number of efforts aimed at analyzing scientific and research information. For detecting criminal activities or patterns, however, efforts are spread relatively evenly among the agencies that reported having such efforts.

In addition, out of all 199 data mining efforts identified, 122 used personal information. For these efforts, the primary purposes were improving service or performance; detecting fraud, waste, and abuse; analyzing scientific and research information; managing human resources; detecting criminal activities or patterns; and analyzing intelligence and detecting terrorist activities.

Agencies also identified efforts to mine data from the private sector and data from other federal agencies, both of which could include personal information. Of 54 efforts to mine data from the private sector (such as credit reports or credit card transactions), 36 involve personal information. Of 77 efforts to mine data from other federal agencies, 46 involve personal information (including student loan application data, bank account numbers, credit card information, and taxpayer identification numbers).

[Figure: Top Six Purposes of Data Mining Efforts in Departments and Agencies]

To view the full product, including the scope and methodology, see www.gao.gov/cgi-bin/getrpt?GAO-04-548. For more information, contact Linda Koontz at (202) 512-6240 or [email protected].
Contents

Letter
  Results in Brief
  Background
  Agencies Identified Numerous Data Mining Efforts with Various Aims
  Summary

Appendixes
  Appendix I: Objective, Scope, and Methodology
  Appendix II: Surveyed Departments and Agencies
  Appendix III: Departments and Agencies Reporting No Data Mining Efforts
  Appendix IV: Inventories of Efforts

Tables
  Table 1: Top Six Purposes of Data Mining Efforts in Departments and Agencies and Number of Efforts Reported
  Table 2: Department of Agriculture's Inventory of Data Mining Efforts
  Table 3: Department of Commerce's Inventory of Data Mining Efforts
  Table 4: Department of Defense's Inventory of Data Mining Efforts
  Table 5: Department of Education's Inventory of Data Mining Efforts
  Table 6: Department of Energy's Inventory of Data Mining Efforts
  Table 7: Department of Health and Human Services' Inventory of Data Mining Efforts
  Table 8: Department of Homeland Security's Inventory of Data Mining Efforts
  Table 9: Department of the Interior's Inventory of Data Mining Efforts
  Table 10: Department of Justice's Inventory of Data Mining Efforts
  Table 11: Department of Labor's Inventory of Data Mining Efforts
  Table 12: Department of State's Inventory of Data Mining Efforts
  Table 13: Department of Transportation's Inventory of Data Mining Efforts
  Table 14: Department of the Treasury's Inventory of Data Mining Efforts
  Table 15: Department of Veterans Affairs' Inventory of Data Mining Efforts
  Table 16: Environmental Protection Agency's Inventory of Data Mining Efforts
  Table 17: Export-Import Bank of the United States' Inventory of Data Mining Efforts
  Table 18: Federal Deposit Insurance Corporation's Inventory of Data Mining Efforts
  Table 19: Federal Reserve System's Inventory of Data Mining Efforts
  Table 20: National Aeronautics and Space Administration's Inventory of Data Mining Efforts
  Table 21: Nuclear Regulatory Commission's Inventory of Data Mining Efforts
  Table 22: Office of Personnel Management's Inventory of Data Mining Efforts
  Table 23: Pension Benefit Guaranty Corporation's Inventory of Data Mining Efforts
  Table 24: Railroad Retirement Board's Inventory of Data Mining Efforts
  Table 25: Small Business Administration's Inventory of Data Mining Efforts

Figures
  Figure 1: Top Six Purposes of Data Mining Efforts That Involve Personal Information
  Figure 2: Top Six Purposes of Data Mining Efforts That Involve Private Sector Data
  Figure 3: Top Six Purposes of Data Mining Efforts That Involve Data from Other Federal Agencies

Abbreviations
  CARDS    Counterintelligence Analytical Research Data System
  CG       Coast Guard
  CI-AIMS  Counterintelligence Automated Investigative Management System
  DHHS     Department of Health and Human Services
  DOD      Department of Defense
  DOE      Department of Energy
  DOT      Department of Transportation
  EFTPS    Electronic Federal Tax Payment System
  EOS      Earth Observing System
  FARS     Fatality Analysis Reporting System
  FDA      Food and Drug Administration
  GENESIS  Global Environmental and Earth Science Information System
  GSFC     Goddard Space Flight Center
  HR       Human Resources
  HRSA     Health Resources and Services Administration
  MATRIX   Multistate Anti-terrorism Information Exchange System
  NASA     National Aeronautics and Space Administration
  NVO      National Virtual Observatory
  OIG      Office of Inspector General
  OLAP     On-line Analytical Processing
  RSST     Real Estate Stress Test
  SAA      Spectral Analysis Automation
  SAS      Safety Automated System
  SMARTS   Statistical Management Analysis and Reporting Tool System
  SWC      Space Warfare Center
  TIMS     Technical Information Management System
  TOP      Treasury Offset Program
  VA       Veterans Affairs
  VHA      Veterans Health Administration
  VISN     Veterans Integrated Service Network

This is a work of the U.S. government and is not subject to copyright protection in the United States. It may be reproduced and distributed in its entirety without further permission from GAO.
However, because this work may contain copyrighted images or other material, permission from the copyright holder may be necessary if you wish to reproduce this material separately.

United States General Accounting Office
Washington, D.C. 20548

May 4, 2004

The Honorable Daniel K. Akaka
Ranking Minority Member
Subcommittee on Financial Management, the Budget, and International Security
Committee on Governmental Affairs
United States Senate

Dear Senator Akaka:

Data mining--a technique for extracting knowledge from large volumes of data--is increasingly being used by government and by the private sector. As has been widely reported, many federal data mining efforts involve the use of personal information[1] that is mined from public as well as private sector organizations. This report responds to your request that we identify and describe operational and planned data mining systems and activities in federal agencies. In a follow-up report, we plan to perform an in-depth review of selected federal data mining efforts.

The term "data mining" has a number of meanings. For purposes of this work, we define data mining as the application of database technology and techniques--such as statistical analysis and modeling--to uncover hidden patterns and subtle relationships in data and to infer rules that allow for the prediction of future results. We based this definition on the most commonly used terms found in a survey of the technical literature. In our initial survey of chief information officers, these officials found the definition sufficient to identify agency data mining efforts.

[1] As used in this report, personal information is all information associated with an individual and includes both identifying information and nonidentifying information. Identifying information, which can be used to locate or identify an individual, includes name, aliases, Social Security number, e-mail address, driver's license number, and agency-assigned case number.
Nonidentifying personal information includes age, education, finances, criminal history, physical attributes, and gender.

To address our objective to identify and describe operational and planned data mining systems and activities in federal agencies, we surveyed chief information officers or comparable officials at 128 federal departments and agencies to determine whether the agencies had operational and planned data mining systems or activities.[2] We then conducted telephone interviews with the reported system managers to obtain information on the characteristics of the identified data mining efforts. To verify the information we received, we sent follow-up letters to agencies that responded as well as to those that did not respond, we asked responsible officials to verify the information, and we performed random assessments of the means that these officials used to verify the information.

In addition, we conducted a search of technical literature and periodicals to develop a comprehensive list of federal government data mining efforts and then compared these efforts with data mining efforts reported by federal agencies. If the data mining efforts on our lists were not reported on the survey, we contacted the appropriate chief information officers and, with their concurrence, added the efforts. We performed our work from May 2003 to April 2004 in accordance with generally accepted government auditing standards. Additional details on our scope and methodology are provided in appendix I.

Results in Brief

Federal agencies are using data mining for a variety of purposes, ranging from improving service or performance to analyzing and detecting terrorist patterns and activities. Our survey of 128 federal departments and agencies on their use of data mining shows that 52 agencies are using or are planning to use data mining. These departments and agencies reported 199 data mining efforts, of which 68 were planned and 131 were operational.
Agencies described the most common uses of data mining efforts as

o improving service or performance;
o detecting fraud, waste, and abuse;
o analyzing scientific and research information;
o managing human resources;
o detecting criminal activities or patterns; and
o analyzing intelligence and detecting terrorist activities.

[2] That is, we asked about both systems explicitly dedicated to data mining and activities using automated tools to "mine" databases that are part of other systems. In this report, we use the word "efforts" to refer to both systems and activities, unless otherwise specified.

The Department of Defense reported having the largest number of data mining efforts aimed at improving service or performance and at managing human resources. Defense was also the most frequent user of efforts aimed at analyzing intelligence and detecting terrorist activities, followed by the Departments of Homeland Security, Justice, and Education. The Department of Education reported the largest number of efforts aimed at detecting fraud, waste, and abuse, while the National Aeronautics and Space Administration targets most of its data mining efforts (21 out of 23) toward analyzing scientific and research information. Data mining efforts for detecting criminal activities or patterns, however, were spread relatively evenly among the reporting agencies.

In addition, out of all 199 data mining efforts identified, 122 used personal information. For these efforts, the primary purposes were detecting fraud, waste, and abuse; detecting criminal activities or patterns; analyzing intelligence and detecting terrorist activities; and increasing tax compliance.

Agencies also identified efforts to mine data from the private sector and data from other federal agencies, both of which could include personal information. Of 54 efforts to mine data from the private sector (such as credit reports or credit card transactions), 36 involve personal information.
Of 77 efforts to mine data from other federal agencies, 46 involve personal information (including student loan application data, bank account numbers, credit card information, and taxpayer identification numbers).

Background

Data mining enables corporations and government agencies to analyze massive volumes of data quickly and relatively inexpensively. The use of this type of information retrieval has been driven by the exponential growth in the volumes and availability of information collected by the public and private sectors, as well as by advances in computing and data storage capabilities. In response to these trends, generic data mining tools are increasingly available for--or built into--major commercial database applications. Today, mining can be performed on many types of data, including those in structured, textual, spatial, Web, or multimedia forms. Data mining is becoming a big business; Forrester Research has estimated that the data mining market is passing the billion dollar mark.

Although the use and sophistication of data mining have increased in both the government and the private sector, data mining remains an ambiguous term. According to some experts, data mining overlaps a wide range of analytical activities, including data profiling, data warehousing, online analytical processing, and enterprise analytical applications.[3] Some of the terms used to describe data mining or similar analytical activities include "factual data analysis" and "predictive analytics."

We surveyed technical literature and developed a definition of data mining based on the most commonly used terms found in this literature. Based on this search, we define data mining as the application of database technology and techniques--such as statistical analysis and modeling--to uncover hidden patterns and subtle relationships in data and to infer rules that allow for the prediction of future results.
We used this definition in our initial survey of chief information officers; these officials found the definition sufficient to identify agency data mining efforts.

Data mining has been used successfully for a number of years in the private and public sectors in a broad range of applications. In the private sector, these applications include customer relationship management, market research, retail and supply chain analysis, medical analysis and diagnostics, financial analysis, and fraud detection. In the government, data mining was initially used to detect financial fraud and abuse. For example, data mining has been an integral part of GAO audits and investigations of federal government purchase and credit card programs.[4] Data mining and related technologies are also emerging as key tools in Department of Homeland Security initiatives.

[3] Lou Agosta, "Data Mining Is Dead--Long Live Predictive Analytics!" (Forrester Research, Oct. 30, 2003), http://www.forrester.com/Research/LegacyIT/0,7208,33030,00.html (downloaded Jan. 26, 2004).

[4] For more information on the uses of data mining in GAO audits, see U.S. General Accounting Office, Data Mining: Results and Challenges for Government Programs, Audits, and Investigations, GAO-03-591T (Washington, D.C.: Mar. 25, 2003).

Data Mining Poses Privacy Challenge

Since the terrorist attacks of September 11, 2001, data mining has been seen increasingly as a useful tool to help detect terrorist threats by improving the collection and analysis of public and private sector data.
A recent report on information sharing and analysis to address the challenges of homeland security noted that agencies at all levels of government are now interested in collecting and mining large amounts of data from commercial sources.[5] The report added that agencies may use such data not only for investigations of known terrorists, but also to perform large-scale data analysis and pattern discovery in order to discern potential terrorist activity by unknown individuals. Such use of data mining by federal agencies has raised public and congressional concerns regarding privacy.

One example of a large-scale development effort launched in the wake of the September 11 attacks is the Multistate Anti-terrorism Information Exchange System, known as MATRIX. MATRIX, currently used in five states,[6] provides the capability to store, analyze, and exchange sensitive terrorism-related and other criminal intelligence data among agencies within a state, among states, and between state and federal agencies. Information in MATRIX databases includes criminal history records, driver's license data, vehicle registration records, incarceration records, and digitized photographs. Public awareness of MATRIX and of similar large-scale data mining or data mining-like projects has led to concerns about the government's use of data mining to conduct a mass "dataveillance"[7]--a surveillance of large groups of people--to sift through vast amounts of personally identifying data to find individuals who might fit a terrorist profile.

[5] Creating a Trusted Information Network for Homeland Security (New York City: The Markle Foundation, December 2003), http://www.markletaskforce.org/Report2_Full_Report.pdf (downloaded Mar. 8, 2004).

[6] Five states are currently participating in the MATRIX pilot project: Connecticut, Florida, Michigan, Ohio, and Pennsylvania.

[7] Roger Clarke, "Information Technology and Dataveillance," Communications of the ACM, vol. 31, issue 5 (New York City: ACM Press, May 1988), http://www.anu.edu.au/people/Roger.Clarke/DV/CACM88.html (downloaded Mar. 5, 2004). Clarke defines mass dataveillance as the systematic use of personal data systems in the investigation or monitoring of the actions or communications of groups of people.

Mining government and private databases containing personal information creates a range of privacy concerns. Through data mining, agencies can quickly and efficiently obtain information on individuals or groups by exploiting large databases containing personal information aggregated from public and private records. Information can be developed about a specific individual or about unknown individuals whose behavior or characteristics fit a specific pattern. Before data aggregation and data mining came into use, personal information contained in paper records stored at widely dispersed locations, such as courthouses or other government offices, was relatively difficult to gather and analyze. As one expert noted, data mining technologies that provide for easy access and analysis of aggregated data challenge the concept of privacy protection afforded to individuals through the inherent inefficiency of government agencies analyzing paper, rather than aggregated, computer records.[8]

Privacy concerns about mined or analyzed personal data also include concerns about the quality and accuracy of the mined data; the use of the data for other than the original purpose for which the data were collected without the consent of the individual; the protection of the data against unauthorized access, modification, or disclosure; and the right of individuals to know about the collection of personal information, how to access that information, and how to request a correction of inaccurate information.[9]

[8] K.A. Taipale, "Data Mining and Domestic Security: Connecting the Dots to Make Sense of Data," The Columbia Science and Technology Law Review, vol. V, 2003-2004 (New York City: Columbia Law School, 2004), http://www.stlr.org/cite.cgi?volume=5&article=2 (downloaded Mar. 18, 2004).

[9] These privacy concerns are reflected in the Fair Information Practices proposed in 1980 by the Organization for Economic Cooperation and Development and endorsed by the U.S. Department of Commerce in 1981. These practices govern collection limitation, purpose specification, use limitation, data quality, security safeguards, openness, individual participation, and accountability.

Agencies Identified Numerous Data Mining Efforts with Various Aims

Of 128 federal departments and agencies surveyed for information on their planned and operational data mining efforts (listed in app. II), 52 agencies reported 199 data mining efforts, and 69 agencies reported that they were not engaged in data mining and were not planning such efforts (listed in app. III). Of the 199 data mining efforts, 68 were planned and 131 were operational. Seven agencies did not respond to our survey.[10] Appendix IV lists the 199 data mining efforts reported, along with key characteristics.

Agencies described the most common purposes of data mining efforts as

o improving service or performance;
o detecting fraud, waste, and abuse;
o analyzing scientific and research information;
o managing human resources;
o detecting criminal activities or patterns; and
o analyzing intelligence and detecting terrorist activities.

As shown in table 1, the Department of Defense reported the largest number of efforts aimed at improving service or performance (with 19 out of 65 reported efforts) and at managing human resources (with 14 out of 17 efforts). Defense was also the most frequent user of efforts aimed at analyzing intelligence and detecting terrorist activities, with 5 of 14 efforts, followed by the Departments of Homeland Security and Justice, with 4 and 3 efforts, respectively.
The Department of Education has the largest number of efforts aimed at detecting fraud, waste, and abuse (9 out of 24 efforts reported). The National Aeronautics and Space Administration accounts for 21 of the 23 identified efforts for analyzing scientific and research information. Efforts are spread relatively evenly among the agencies that reported using data mining efforts for detecting criminal activities or patterns. Table 1 summarizes the top six uses of data mining efforts among the responding agencies.

[10] Agencies that did not respond to our survey are (1) the Central Intelligence Agency; (2) the Corporation for National and Community Services; (3) the Department of Army, Department of Defense; (4) the Equal Employment Opportunity Commission; (5) the National Park Service, Department of the Interior; (6) the National Security Agency, Department of Defense; and (7) the Rural Utilities Service, Department of Agriculture.

Table 1: Top Six Purposes of Data Mining Efforts in Departments and Agencies and Number of Efforts Reported

Column headings (left to right):
  Improving service or performance
  Detecting fraud, waste, and abuse
  Analyzing scientific and research information
  Managing human resources
  Detecting criminal activities or patterns
  Analyzing intelligence and detecting terrorist activities

Department or agency, with the number of efforts reported (values appear in printed order; blank cells are not shown):
  Department of Agriculture: 8, 1
  Department of Commerce: (none listed)
  Department of Defense: 19, 1, 1, 14, 1
  Department of Education: 6, 9, 3
  Department of Energy: 3
  Department of Health and Human Services: 4, 1
  Department of Homeland Security: 5, 2, 2
  Department of the Interior: 1
  Department of Justice: 1, 1, 3
  Department of Labor: 3, 1
  Department of State: 2
  Department of Transportation: 1
  Department of the Treasury: 4, 1, 2
  Department of Veterans Affairs: 5, 5, 1
  Environmental Protection Agency: 1
  Export-Import Bank of the United States: 1
  Federal Deposit Insurance Corporation: 1
  Federal Reserve System: 1
  National Aeronautics and Space Administration: 1, 1, 21
  Nuclear Regulatory Commission: 1
  Office of Personnel Management: 1
  Pension Benefit Guaranty Corporation: 2
  Railroad Retirement Board: 1
  Small Business Administration: 1
  Total: 65, 24, 23, 17, 15, 14

Source: GAO analysis of agency-provided data.

Some data mining purposes focus on human activities and therefore are inherently likely to involve personal information; examples of these purposes are detecting fraud, waste, and abuse; detecting criminal activities or patterns; managing human resources; and analyzing intelligence. The following are examples of data mining efforts for each of these purposes:

o Detecting fraud, waste, and abuse. The Veterans Benefits Administration's C & P Payment Data Analysis effort mines veterans' compensation and pension data for evidence of fraud.
o Detecting criminal activities or patterns. The Department of Education's Title IV Identity Theft Initiative effort focuses on identity theft cases involving education loans.
o Managing human resources. The U.S. Air Force's Oracle HR (Human Resources) uses data mining to provide information on promotions, pay grades, clearances, and other information relevant to human resources planning.
o Analyzing intelligence and detecting terrorist activities. The Defense Intelligence Agency's Verity K2 Enterprise mines data from the intelligence community and Internet sources to identify foreign terrorists or U.S. citizens connected to foreign terrorism activities.

On the other hand, other categories of efforts do not necessarily focus on human activities or involve personal information, such as many of the efforts aimed at analyzing scientific and research information. The National Aeronautics and Space Administration, for example, mines large, complex earth science data sets to find patterns and relationships to detect hidden events (the system is called Machine Learning and Data Mining for Improved Data Understanding of High Dimensional Earth Sensed Data).
Similarly, many efforts aimed at improving service or performance (the most frequently cited purpose of data mining efforts) do not involve personal information. For example, the Department of the Navy's Supply Management System Multidimensional Cubes system includes a data warehouse containing data on every ship part that has been ordered since the 1980s, with multidimensional information on each part. The Navy uses data mining to calculate failure rates and identify needed improvements; according to the Navy, this system reduces downtime on ships by improving parts replacement.

However, some efforts aimed at improving service or performance do involve personal information. For example, the Veterans Administration's VISN (Veterans Integrated Service Network) 16 Data Warehouse is mined for a variety of information, including patient visits, laboratory tests, and pharmacy records, to provide management with health care system performance information.

Overall, 122 of the 199 data mining efforts involve personal information. Figure 1 shows the top six purposes of these efforts, as well as their distribution.

Figure 1: Top Six Purposes of Data Mining Efforts That Involve Personal Information
[Bar chart not reproduced. Axis: number of data mining efforts (0 to 40). Purposes listed: Increasing tax compliance; Analyzing intelligence and detecting terrorist activities; Detecting criminal activities or patterns; Managing human resources; Detecting fraud, waste, and abuse; Improving service or performance. Data label recovered: 33 (bar not identified).]
Source: GAO analysis of agency data.

Of the 199 data mining efforts, 54 use or plan to use data from the private sector. Of these, 36 involve personal information. The personal information from the private sector included credit reports and credit card transaction records. Figure 2 shows the distribution of the top six purposes of the 54 efforts involving data from the private sector.
Figure 2: Top Six Purposes of Data Mining Efforts That Involve Private Sector Data
[Bar chart not reproduced. Axis: number of data mining efforts (0 to 40). Purposes listed: Improving safety; Detecting criminal activities or patterns; Analyzing scientific and research information; Analyzing intelligence and detecting terrorist activities; Detecting fraud, waste, and abuse; Improving service or performance. Data label recovered: 14 (bar not identified).]
Source: GAO analysis of agency data.

Of the 199 data mining efforts, 77 efforts use or plan to use data from other federal agencies. Of the 77 efforts, 46 involve personal information. The personal information from other federal agencies included student loan application data, bank account numbers, credit card information, and taxpayer identification numbers. Figure 3 shows the top six uses for the 77 efforts involving data from other federal agencies and their distribution.

Figure 3: Top Six Purposes of Data Mining Efforts That Involve Data from Other Federal Agencies
[Bar chart not reproduced. Axis: number of data mining efforts (0 to 40). Purposes listed: Managing human resources; Detecting fraud, waste, and abuse; Detecting criminal activities or patterns; Analyzing intelligence and detecting terrorist activities; Analyzing scientific and research information; Improving service or performance. Data label recovered: 20 (bar not identified).]
Source: GAO analysis of agency data.

Summary

Driven by advances in computing and data storage capabilities and by growth in the volumes and availability of information collected by the public and private sectors, data mining enables government agencies to analyze massive volumes of data. Our survey shows that data mining is increasingly being used by government for a variety of purposes, ranging from improving service or performance to analyzing and detecting terrorist patterns and activities. Although this survey provides a broad overview of the emerging uses of data mining in the federal government, more work is needed to shed light on the privacy implications of these efforts.
In future work, we plan to examine selected federal data mining efforts and their implications.

As agreed with your office, unless you publicly announce the contents of the report earlier, we plan no further distribution until 30 days from the report date. At that time, we will send copies of this report to the Chairmen and Ranking Minority Members of the House Committee on Government Reform; Subcommittee on Civil Service and Agency Organization, House Committee on Government Reform; Select Committee on Homeland Security, House of Representatives; Senate Committee on Governmental Affairs; and the Subcommittee on Oversight of Government Management, the Federal Workforce and the District of Columbia, Senate Committee on Governmental Affairs. We will also make copies available to others on request. In addition, this report will be available at no charge on the GAO Web site at http://www.gao.gov.

If you have any questions concerning this report, please call me at (202) 512-6240 or Mirko J. Dolak, Assistant Director, at (202) 512-6362. We can also be reached by e-mail at [email protected] and [email protected], respectively. Key contributors to this report were Camille M. Chaires, Barbara S. Collier, Orlando O. Copeland, Nancy E. Glover, Stuart M. Kaufman, Lori D. Martinez, Morgan F. Walts, and Marcia C. Washington.

Sincerely yours,

Linda D. Koontz
Director, Information Management Issues

Appendix I: Objective, Scope, and Methodology

Our objective was to identify and describe planned and operational federal data mining efforts. As a first step in addressing this objective, we developed a definition of "data mining." Because this expression has a range of meanings, we surveyed the technical literature to develop a definition based on the most commonly used terms found in this literature.
We defined data mining as the application of database technology and techniques (such as statistical analysis and modeling) to uncover hidden patterns and subtle relationships in data and to infer rules that allow for the prediction of future results. In our initial survey of chief information officers, these officials found the definition sufficient to identify agency data mining efforts.

We then surveyed chief information officers or comparable officials at 128 federal departments and agencies (see app. II) and asked them to identify whether their agency had operational and planned data mining efforts. We achieved a 95 percent response rate. Of the 121 agencies that responded, 69 reported that they did not have any data mining efforts (see app. III). We followed up with these 69 agencies and gave them another opportunity to report data mining efforts.

To obtain information on the characteristics of the identified operational or planned data mining efforts, we conducted structured telephone interviews[1] with the identified system owners or activity managers. The interviews were designed to obtain detailed information about each data mining system, including the purpose and size, the use of personal information, and the use of data from the private sector or other federal organizations. We pretested the structured interview to ensure relevance and clarity. We aggregated these data by agency and sent them back to the chief information officer, comparable official, or their designee and asked that they review the characteristics for completeness and accuracy. One of the 52 departments and agencies that reported data mining systems, the Department of Homeland Security, has not responded to our request to review the reported data for completeness and accuracy. We performed random assessments of the means that these officials used to verify the information. Based on these assessments, we concluded that the agencies' verification methods were reasonable and that, as a result, we could rely on the accuracy of the reported data.

[1] In a structured interview, the interviewer asks the same questions of numerous individuals or individuals representing numerous organizations in a precise manner, offering each interviewee the same set of possible responses.

We also conducted a search of technical literature and periodicals to develop a list of federal government data mining efforts and then compared the efforts on this list with the data mining efforts reported by federal agencies. If the data mining efforts on our list were not reported on the survey, we contacted the chief information officer or comparable official to determine whether that data mining effort should be included in our survey.

Because this was not a sample survey, there are no sampling errors. However, the practical difficulties of conducting any survey may introduce errors, commonly referred to as nonsampling errors. For example, difficulties in how a particular question is interpreted, in the sources of information that are available to respondents, or in how the data are entered into a database or analyzed can introduce unwanted variability into the survey results. We took steps in the development of the structured interview, the data collection, and the data analysis to minimize these nonsampling errors. Among these steps, we pretested the structured interview instrument, contacted nonresponding agencies as well as agencies not identifying data mining efforts, and sent the aggregated data to the agency chief information officer for review.

We conducted our work from May 2003 to April 2004 in accordance with generally accepted government auditing standards.
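The counts reported in this appendix and elsewhere in this report can be cross-checked with simple arithmetic. The short Python sketch below is illustrative only and was not part of our survey methodology; the figures are those stated in this report, while the variable and function names are assumptions introduced for the example.

```python
# Figures below are taken from this report; all names are illustrative
# assumptions and do not correspond to any GAO tool or data set.
SURVEYED_AGENCIES = 128
RESPONDING_AGENCIES = 121
AGENCIES_REPORTING_NONE = 69
PLANNED_EFFORTS = 68
OPERATIONAL_EFFORTS = 131
TOTAL_EFFORTS = 199
EFFORTS_WITH_PERSONAL_INFO = 122


def percent(part: int, whole: int) -> int:
    """Return part/whole as a percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


def survey_summary() -> dict:
    """Derive the headline figures from the raw counts."""
    return {
        "response_rate_pct": percent(RESPONDING_AGENCIES, SURVEYED_AGENCIES),
        "agencies_with_efforts": RESPONDING_AGENCIES - AGENCIES_REPORTING_NONE,
        "efforts_total": PLANNED_EFFORTS + OPERATIONAL_EFFORTS,
        "personal_info_share_pct": percent(EFFORTS_WITH_PERSONAL_INFO, TOTAL_EFFORTS),
    }


if __name__ == "__main__":
    for name, value in survey_summary().items():
        print(f"{name}: {value}")
```

Rounding the response rate to a whole number reproduces the 95 percent figure, and subtracting the 69 agencies with no efforts from the 121 respondents gives the 52 agencies that reported data mining efforts.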
Appendix II: Surveyed Departments and Agencies

Department of Agriculture
  o Agricultural Marketing Service
  o Agricultural Research Service
  o Animal and Plant Health Inspection Service
  o Cooperative State Research, Education, and Extension Service
  o Farm Service Agency
  o Food and Nutrition Service
  o Food Safety and Inspection Service
  o Foreign Agricultural Service
  o Forest Service
  o National Agricultural Statistics Service
  o Natural Resources Conservation Service
  o Risk Management Agency
  o Rural Utilities Service
Department of Commerce
  o Bureau of the Census
  o Economic Development Administration
  o International Trade Administration
  o National Oceanic and Atmospheric Administration
  o U.S. Patent and Trademark Office
Department of Defense
  o Missile Defense Agency
  o Defense Advanced Research Projects Agency
  o Defense Commissary Agency
  o Defense Contract Audit Agency
  o Defense Contract Management Agency
  o Defense Information Systems Agency
  o Defense Intelligence Agency
  o Defense Legal Services Agency
  o Defense Logistics Agency
  o Defense Security Cooperation Agency
  o Defense Security Service
  o Defense Threat Reduction Agency
  o Department of the Air Force
  o Department of the Army
  o Department of the Navy
  o National Geospatial-Intelligence Agency
  o National Security Agency
  o U.S. Marine Corps
Department of Education
Department of Energy
  o Bonneville Power Administration
  o Southeastern Power Administration
  o Southwestern Power Administration
  o Western Area Power Administration
Department of Health and Human Services
  o Administration for Children and Families
  o Agency for Healthcare Research and Quality
  o Centers for Disease Control and Prevention
  o Centers for Medicare and Medicaid Services
  o Food and Drug Administration
  o Health Resources and Services Administration
  o Indian Health Service
  o National Institutes of Health
  o Program Support Center
Department of Homeland Security
  o Border and Transportation Security Directorate
  o Bureau of Citizenship and Immigration Services
  o Emergency Preparedness and Response Directorate
  o Information Analysis and Infrastructure Protection Directorate
  o Management Directorate
  o Science and Technology Directorate
  o U.S. Coast Guard
  o U.S. Secret Service
Department of Housing and Urban Development
Department of the Interior
  o Bureau of Indian Affairs
  o Bureau of Land Management
  o Bureau of Reclamation
  o Minerals Management Service
  o National Park Service
  o Office of Surface Mining Reclamation and Enforcement
  o U.S. Fish and Wildlife Service
  o U.S. Geological Survey
Department of Justice
  o Bureau of Alcohol, Tobacco, Firearms, and Explosives
  o Drug Enforcement Administration
  o Federal Bureau of Investigation
  o Federal Bureau of Prisons
  o U.S. Marshals Service
Department of Labor
Department of State
Department of Transportation
  o Federal Aviation Administration
  o Federal Highway Administration
  o Federal Motor Carrier Safety Administration
  o Federal Railroad Administration
  o Federal Transit Administration
  o National Highway Traffic Safety Administration
Department of the Treasury
  o Bureau of Engraving and Printing
  o Bureau of the Public Debt
  o Financial Management Service
  o Internal Revenue Service
  o Office of the Comptroller of the Currency
  o Office of Thrift Supervision
  o U.S. Mint
Department of Veterans Affairs
  o Veterans Benefits Administration
  o Veterans Health Administration
Agency for International Development
Central Intelligence Agency
Corporation for National and Community Service
Environmental Protection Agency
Equal Employment Opportunity Commission
Executive Office of the President
Export-Import Bank of the United States
Federal Deposit Insurance Corporation
Federal Energy Regulatory Commission
Federal Reserve System
Federal Retirement Thrift Investment Board
General Services Administration
Legal Services Corporation
National Aeronautics and Space Administration
National Credit Union Administration
National Labor Relations Board
National Science Foundation
Nuclear Regulatory Commission
Office of Management and Budget
Office of Personnel Management
Peace Corps
Pension Benefit Guaranty Corporation
Railroad Retirement Board
Securities and Exchange Commission
Small Business Administration
Smithsonian Institution
Social Security Administration
U.S. Postal Service

Appendix III: Departments and Agencies Reporting No Data Mining Efforts

The following 69 departments and agencies reported that they have no operational or planned data mining efforts:

Department of Agriculture
  o Agricultural Marketing Service
  o Agricultural Research Service
  o Animal and Plant Health Inspection Service
  o Cooperative State Research, Education, and Extension Service
  o Farm Service Agency
  o Foreign Agricultural Service
  o Forest Service
  o National Agricultural Statistics Service
  o Food Safety and Inspection Service
Department of Commerce
  o Economic Development Administration
  o Bureau of the Census
  o International Trade Administration
  o Department of Commerce Headquarters
  o National Oceanic and Atmospheric Administration
Department of Defense
  o Defense Contract Audit Agency
  o Missile Defense Agency
  o Defense Legal Services Agency
  o Defense Security Service
  o Defense Threat Reduction Agency
  o Defense Logistics Agency
  o Defense Advanced Research Projects Agency
  o Defense Contract Management Agency
  o Defense Security Cooperation Agency
Department of Energy
  o Bonneville Power Administration
  o Southeastern Power Administration
  o Southwestern Power Administration
  o Western Area Power Administration
Department of Health and Human Services
  o Centers for Medicare and Medicaid Services
  o Administration for Children and Families
  o National Institutes of Health
  o Indian Health Service
Department of Homeland Security
  o Science and Technology Directorate
  o Management Directorate
  o Bureau of Citizenship and Immigration Services
  o Department of Homeland Security Headquarters
Department of Housing and Urban Development
Department of the Interior
  o Bureau of Reclamation
  o Bureau of Land Management
  o U.S. Geological Survey
  o Fish and Wildlife Service
  o Office of Surface Mining Reclamation and Enforcement
  o Bureau of Indian Affairs
  o Department of the Interior Headquarters
Department of Justice
  o Bureau of Alcohol, Tobacco, Firearms, and Explosives
Department of Transportation
  o Federal Aviation Administration
  o Federal Transit Administration
  o Federal Railroad Administration
  o Federal Motor Carrier Safety Administration
  o Federal Highway Administration
Department of the Treasury
  o Comptroller of the Currency
  o Bureau of the Public Debt
  o Office of Thrift Supervision
  o Department of the Treasury Headquarters
  o Bureau of Engraving and Printing
Agency for International Development
Executive Office of the President
Federal Energy Regulatory Commission
Federal Retirement Thrift Investment Board
General Services Administration
Legal Services Corporation
National Credit Union Administration
National Labor Relations Board
National Science Foundation
Office of Management and Budget
Peace Corps
Securities and Exchange Commission
Smithsonian Institution
Social Security Administration
U.S. Postal Service

Appendix IV: Inventories of Efforts

The following tables present selected information from our survey of 128 major federal departments and agencies on their use of data mining. The tables list the purpose of each data mining effort, whether the system is planned or operational, and whether the system uses personal information, data from the private sector, or data from other federal agencies. The survey shows that 52 departments and agencies are using or are planning to use data mining. These departments and agencies reported 199 data mining efforts, of which 68 were planned and 131 were operational.

Table 2: Department of Agriculture's Inventory of Data Mining Efforts

Organization: Department of Agriculture Headquarters

System: Travel Data Mart
Description: Will consolidate employee travel information from financial and travel systems. Will allow for a governmentwide e-travel system and provide the department with information on the financial ramifications of its travel.
Purpose: Improving service or performance
Status: Planned
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System: Financial Statements Data Warehouse
Description: Is used in the production of consolidated financial statements. Provides information for products that are used to satisfy external reporting requirements, such as Office of Management and Budget and Department of the Treasury requirements.
Purpose: Financial management
Status: Operational
Features: Personal information: No; Private sector data: No; Other agency data: No

System: Financial Data Warehouse
Description: Is the department's internal financial management reporting system. Data mining is done for ad hoc and on-demand reports.
Purpose: Financial management
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

Organization: Food and Nutrition Service

System: Grantee Monitoring Activities-Southeast Regional Office
Description: Assists in monitoring the financial status of grant holders. Grantees are required to provide expenditure reports, and analysis is performed quarterly that matches stated draws to the actual draws from the U.S. Treasury.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System: Grantee Monitoring Activities-Mountain Plains Regional Office
Description: Assists in monitoring the management and distribution of Indian funds for major food benefit programs, such as food stamps, in 10 grantee states.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System: Grantee Monitoring Activities-Southwest Regional Office
Description: Maximizes on-site monitoring efforts by confirming the accuracy of grantee accounting. Reduces on-site time, maximizes time to complete reviews, and has achieved a 50 percent travel savings.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System: Grantee Monitoring Activities-Midwest Regional Office
Description: Will be a reporting system to provide reports and automate the audit process. Plans are to acquire data mining tools to review and compare budgets, reports, and plans.
Purpose: Improving service or performance
Status: Planned
Features: Personal information: No; Private sector data: No; Other agency data: Yes

System: Grantee Monitoring Activities-Northeast Regional Office
Description: Supports on-site reviews of analyses to confirm financial report information.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: Yes; Private sector data: Yes; Other agency data: No

System: Integrated Program Accounting System Data Integrity
Description: Will create ad-hoc reporting centers to validate accounting information.
Purpose: Improving service or performance
Status: Planned
Features: Personal information: No; Private sector data: No; Other agency data: No

Organization: Natural Resources Conservation Service

System: National Resource Inventory Used for Statistical Analysis of Past Soil Survey Databases
Description: Is a trending database that tracks more than 200 resource issues such as monitoring erosion. Also processes statistical technology.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: No; Private sector data: No; Other agency data: No

Organization: Risk Management Agency

System: CAE
Description: Is part of a congressionally mandated project to assist the Risk Management Agency in controlling fraud, waste, and abuse in the Federal Crop Insurance Corporation program.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Features: Personal information: Yes; Private sector data: Yes; Other agency data: Yes

Source: Department of Agriculture.

Table 3: Department of Commerce's Inventory of Data Mining Efforts

Organization: U.S. Patent and Trademark Office

System: Compensation Projection Model in the Enterprise Data Warehouse
Description: Generates and makes available compensation projection data, both salary and benefits, on current employees and on planned hires. It also accounts for planned attritions.
Purpose: Managing human resources
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes

Source: Department of Commerce.

Table 4: Department of Defense's Inventory of Data Mining Efforts

Organization: Defense Commissary Agency

System: DeCA Electronic Records Management and Archive System
Description: Will be a corporate information system for managing unstructured data. It will allow for electronic record keeping, document management, and automated receipt processes.
Purpose: Improving service or performance
Status: Planned
Features: Personal information: Yes; Private sector data: Yes; Other agency data: Yes

System: Corporate Decision Support System/Commissary Operations Management System
Description: Mines data to produce analytical data on commissary operations. Provides information such as what items stores are selling and helps determine whether cashiers are being honest.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: No; Private sector data: No; Other agency data: No

Organization: Defense Information Systems Agency

System: Enterprise Business Intelligence System
Description: Will replace the current management information environment, which includes operations, reporting, billing, statistics, and other management information activities.
Purpose: Improving service or performance
Status: Planned
Features: Personal information: No; Private sector data: No; Other agency data: No
Organization: Defense Intelligence Agency

System: Insight Smart Discovery
Description: Will be a data mining knowledge discovery tool to work against unstructured text. Will categorize nouns (names, locations, events) and present information in images.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Planned
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes

System: Verity K2 Enterprise
Description: Mines data from the intelligence community and Internet searches to identify foreign terrorists or U.S. citizens connected to foreign terrorism activities.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Operational
Features: Personal information: Yes; Private sector data: Yes; Other agency data: Yes

System: PATHFINDER
Description: Is a data mining tool developed for analysts that provides the ability to analyze government and private sector databases rapidly. It can compare and search multiple large databases quickly.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes

System: Autonomy
Description: Is a large search engine tool that is used to search hundreds of thousands of word documents. Is used for the organization and knowledge discovery of intelligence.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Operational
Features: Personal information: No; Private sector data: No; Other agency data: Yes

Organization: Department of the Air Force

System: ANG Data Warehouse-Guardian
Description: Will be used to measure military readiness. It incorporates information on all disciplines to provide management information needed to assess military readiness.
Purpose: Measuring military readiness
Status: Planned
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System: Integrated Space Warfare Center (SWC) Information System
Description: Will be an internal database containing information on all development/execution activities within the SWC. Will be used by all management and analyst personnel to track and align the center's activities to warfighter needs, report on execution status, financial status, schedule status, and performance measurements.
Purpose: Improving service or performance
Status: Planned
Features: Personal information: Yes; Private sector data: No; Other agency data: No
System: Safety Automated System (SAS)
Description: Will query databases to find automation mishaps. Governed by Directive 920124 and will allow for the investigation and reporting of identified automation mishaps.
Purpose: Improving safety
Status: Planned
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System: Enterprise Business System
Description: Will support strategic planning, assist in building scientific and technical budgets for the Air Force, and serve as a launch point for all new programs. Research and development case files will be maintained for 75 years; the activity indexes, catalogs, and tracks these files.
Purpose: Improving service or performance
Status: Planned
Features: Personal information: No; Private sector data: No; Other agency data: Yes

System: Genomic and Proteomic Results Analysis
Description: Analyzes National Institutes of Health's genetic data.
Purpose: Analyzing scientific and research information
Status: Operational
Features: Personal information: No; Private sector data: No; Other agency data: Yes

System: IG Corporate Information System
Description: Enhances combat readiness and mission capabilities for Air Combat Command units and commanders. It assists in preparing for and conducting inspections.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System: Computer Network Defense System
Description: Evaluates network activities to create rules for intrusion detection system signature sets.
Purpose: Improving information security
Status: Operational
Features: Personal information: No; Private sector data: No; Other agency data: No

System: FAME
Description: Will serve as a central repository for Air Force manpower information. Will track manpower and unit authorization funding.
Purpose: Managing human resources
Status: Planned
Features: Personal information: No; Private sector data: No; Other agency data: Yes

System: Resource Wizard
Description: Serves as a manpower tracking system. Tracks positions and captures data for specific funding purposes.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: No; Private sector data: No; Other agency data: No

System: Government Purchase Card
Description: Is used in overseeing purchases made by Air Force personnel with government-provided credit cards.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Features: Personal information: Yes; Private sector data: Yes; Other agency data: No
System: Ambulatory Data System Queries
Description: Tracks the initial diagnosis of patients with the results of further testing and diagnosis. Allows for early notification of diseases and injuries.
Purpose: Monitoring public health
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System: Modus Operandi Database
Description: Is an investigative tool used to identify and track trends in criminal behavior. It links characteristics of crimes and provides details on crime scenes and other crime factors.
Purpose: Detecting criminal activities or patterns
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System: Executive Decision Support System
Description: Takes data from all functional metric balances. Processes charts and graphs to identify trends and to make sure goals are accomplished.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: No; Private sector data: No; Other agency data: No

System: Inspire
Description: Is a tool that assists in providing a narrative description of all research and development that is being conducted within the Air Force. Provides cost and milestone information on research and development projects.
Purpose: Performing strategic planning
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes

System: Discoverer
Description: Is used to manage personnel records, including individual aliases and histories.
Purpose: Managing human resources
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System: Requirements and Concepts System
Description: Will serve as a repository for new system projects and system requirements. It will be available for consultation for information on all project requests and identified requirements.
Purpose: Improving service or performance
Status: Planned
Features: Personal information: No; Private sector data: No; Other agency data: No

System: Business Objects
Description: Is a commercial off-the-shelf tool that is used to analyze and report on human resources activities.
Purpose: Managing human resources
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes

System: THRMIS
Description: Uses commercial off-the-shelf software to maintain a data warehouse of integrated inventory and manpower data for the Total Force: active duty (officer and enlisted), Air Force Reserve, Air National Guard, and civilians. Is used to assess and analyze the health of the Air Force.
Purpose: Managing human resources
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System: SAS
Description: Is a Web-enabled personnel data system that gives authorized users worldwide the ability to tabulate demographic data on recruitment, promotion, and retention.
Purpose: Managing human resources
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System: Oracle HR
Description: Is a personnel management system that manages information for promotions, pay grades, clearances, and other information relevant to human resources.
Purpose: Managing human resources
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System: Health Modeling and Informatics Division Data Mart
Description: Provides information and decision support to the Air Force headquarters' surgeon general for decision making, policy development, and resource allocation. It also provides performance information and analysis to medical field units in support of performance measurement objectives.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System: FIRST EDV (BRIO)
Description: Will deal with Air Force budgets and other components of its financial environment. Historical analyses and trend analyses will be performed on the budget process.
Purpose: Improving service or performance
Status: Planned
Features: Personal information: No; Private sector data: Yes; Other agency data: No

System: IG World
Description: Is used to store and track data and requirements, such as lodging and augmentee requirements, for the PAC inspector general.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

Organization: Department of Defense Headquarters

System: Automated Continuing Evaluation System
Description: Will be used to improve personnel security continuing evaluation efforts within Department of Defense (DOD) by identifying issues of security concern between the normal reinvestigation cycle for those who hold DOD security clearances and have signed a consent form that is still in effect.
Purpose: Managing human resources
Status: Operational
Features: Personal information: Yes; Private sector data: Yes; Other agency data: Yes

Organization: Department of the Navy

System: Human Resource Trend Analysis
Description: Is used to improve Navy readiness. Data on personnel manning levels are mined to ensure that each Navy unit has the correct number of training personnel aboard.
Purpose: Managing human resources
Status: Operational
Features: Personal information: No; Private sector data: No; Other agency data: No

System: U.S. Naval Academy
Description: Allows for the assessment of academic performance of midshipmen. It includes demographic information, information on grades, participation in sports, leadership positions, etc. It is an extension of the registrar's system and is mined for comparisons and trends.
Purpose: Managing human resources
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System: Navy Training Master Planning System
Description: Provides overall Navy training information to assist in delivering Navy training in the most efficient manner. Pertinent data from multiple databases are consolidated into a single database that is mined.
Purpose: Managing human resources
Status: Operational
Features: Personal information: Yes; Private sector data: Yes; Other agency data: No

System: DHAMS Multidimensional Cubes
Description: Is a database that contains information on the time and attendance of 3,000 mariners across 120 ships. Allows managers to look at what people were doing at a particular time and to look across the fleet as a whole and compare ship activities.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: No; Private sector data: No; Other agency data: No

System: National Cargo Tracking Plan Cargo Tracking Division
Description: Is used to conduct predictive analysis for counterterrorism, small weapons of mass destruction proliferation, narcotics, alien smuggling, and other high-interest activities involving container shipping activity.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Operational
Features: Personal information: No; Private sector data: Yes; Other agency data: No

System: Supply Management System Multidimensional Cubes
Description: Reduces downtime on ships by allowing for the analysis of ship parts information. The data warehouse contains data on every part that has been ordered since the 1980s, and has multidimensional information on each part. Failure rates can be calculated and improvements can be identified.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: No; Private sector data: No; Other agency data: No
System: Type Commanders Readiness Management System
Description: Is designed to provide a fully integrated environment for online analytical processing of readiness indicators. Examples of readiness indicators include status of supplies available, equipment in operation, health status, and capabilities of the crew.
Purpose: Measuring military readiness
Status: Operational
Features: Personal information: No; Private sector data: No; Other agency data: Yes

System: FATHOM (APMC-Human Resources)
Description: Will be an internal program and project tool used to improve staffing, recruiting, and managing day-to-day operations.
Purpose: Managing human resources
Status: Planned
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System: Navy Training Quota Management System
Description: Is used for planning and forecasting training needs based on skill requirements.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: No; Private sector data: No; Other agency data: Yes

Organization: National Geospatial-Intelligence Agency

System: OLAP (On-Line Analytical Processing)
Description: Will provide aggregations of imagery system performance data for management officers and senior source decision makers to characterize system performance and contribution to intelligence issues of national priority.
Purpose: Improving service or performance
Status: Planned
Features: Personal information: No; Private sector data: No; Other agency data: No

System: CITO Data Mining
Description: Will evaluate and identify imagery system performance trends for optimization, monitoring, or reengineering.
Purpose: Improving service or performance
Status: Planned
Features: Personal information: No; Private sector data: No; Other agency data: No

System: Information Relevance Prototype
Description: Will establish an information relevancy prototype to serve as a framework for community evaluation of commercial information relevance approaches, methods, and technology. The term information relevance refers to the ability of users to receive or extract, then display and describe, information with measurable satisfaction according to their need.
Purpose: Improving service or performance
Status: Planned
Features: Personal information: No; Private sector data: No; Other agency data: No

Organization: U.S. Marine Corps

System: Operational Data Store Enterprise
Description: Is used for workforce planning.
Purpose: Managing human resources
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System: Global Combat Support Systems-Marine Corps
Description: Will be a physical implementation of the IT enterprise architecture designed to support both improved and enhanced marine air/ground task force combat service support functions and commander and combatant commander joint task force combatant support information requirements. Data mining will allow for interoperability with legacy Marine Corps systems and allow for a shared data environment.
Purpose: Improving service or performance
Status: Planned
Features: Personal information: No; Private sector data: Yes; Other agency data: No

System: Total Force Data Warehouse
Description: Is a system whose primary purpose is workforce planning and workforce policy decision making. It contains current (after 30 days) and historical workforce data.
Purpose: Managing human resources
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System: Marine Corps Recruiting Information Support System
Description: Is a Web-based information system used for managing assets and tracking enlisted and officer accessions into the Marine Corps.
Purpose: Managing human resources
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

Source: Department of Defense.

Table 5: Department of Education's Inventory of Data Mining Efforts

System: Citizenship of PLUS Loan Borrowers-National Student Loan Data Systems
Description: Looks for issues regarding citizenship among its PLUS loan borrowers. Flags records based on selected criteria and requests additional information from schools.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: Yes; Private sector data: Yes; Other agency data: Yes

System: Foreign Schools Initiatives National Student Loan Data System/Central Processing
Description: Is a proactive investigation effort that looks at whether financial aid was granted individuals attending foreign institutions during periods of nonenrollment.
Purpose: Detecting criminal activities or patterns
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes
System: Professional Judgment Practices: Title IV Pell Grants, National Student Loan Data
Description: Used to determine when professional judgment has been exercised for "special" situations where families cannot afford college expenses.
Purpose: Improving service or performance
Status: Operational
Features: Personal information: Yes; Private sector data: Yes; Other agency data: Yes

System: Title IV Applicant-Death Database Match
Description: Compares Department of Education data with the Social Security Administration's death database to detect fraud or criminal activity.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes

System: Title IV Loans with No Applications
Description: Will compare information from the Free Application for Federal Student Aid Program with the Federal Family Education Loan Program to identify fraud.
Purpose: Detecting fraud, waste, and abuse
Status: Planned
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System: OIG-Project Strikeback
Description: Compares Department of Education and Federal Bureau of Investigation data for anomalies. Also verifies personal identifiers.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes

System: Accuracy of U.S. Department of Education Personal Data
Description: Audits and verifies personal information that is contained in the Department of Education's personal data system.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: Yes

System: Impact of Cohort Default Rate Redefinition-National Student Loan Data System
Description: Audits data to determine the impact of legislation that extended the college loan repayment default period from 180 to 270 days.
Purpose: Legislative impact
Status: Operational
Features: Personal information: Yes; Private sector data: No; Other agency data: No

System: CheckFree Software/Purchase Card Program
Description: Takes monthly billing information from the Bank of America to create reports on purchases, purchase quantity, and frequency of purchases. Data are mined for instances of fraud or abuse.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Features: Personal information: Yes; Private sector data: Yes; Other agency data: No
System: Improper Pell Grant Payment Activity
Description: Will compare Pell Grants issued with the amounts received and look at the eligibility of grant recipients.
Purpose: Detecting fraud, waste, and abuse
Status: Planned
Personal information: Yes; Private sector data: No; Other agency data: No

System: Title IV Identity Theft Initiative
Description: Helps identify patterns and trends in identity theft cases involving loans for education. Provides an investigative resource for victims of identity theft.
Purpose: Detecting criminal activities or patterns
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No

System: Title IV Applicant-Use of Multiple Addresses/Central Processing System
Description: Reviews addresses listed on Title IV applications to see if they are valid. For example, jails or employment addresses are not considered valid addresses.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: Yes

System: Lapsed Funds/Improper Draw of Federal Grant Proceeds
Description: Identifies funds that remain in the grants and payment processing system beyond the time period for allocating the funds.
Purpose: Improving service or performance
Status: Operational
Personal information: No; Private sector data: No; Other agency data: No

System: Decision Support System with Online Analytical Processing Query
Description: Will support the department's performance-based initiative. Will allow custom queries of schools from state and local databases for demographics and test scores.
Purpose: Improving service or performance
Status: Planned
Personal information: No; Private sector data: No; Other agency data: No

System: Grant Administration and Payment System
Description: Assists in managing grant activities and aids in detecting instances of fraud or abuse in grant activities.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Personal information: Yes; Private sector data: Yes; Other agency data: Yes

System: Budget Execution Support
Description: Uses information in the National Student Loan Data System and a sample drawn from it to estimate cohort distributions for financial activities related to the Federal Family Education Loan Program pursuant to the Credit Reform Act.
Purpose: Financial management
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No

System: Pell Grant Model Assumptions
Description: Provides estimates on the total cost of the Pell Grant program. It uses data from previous years and makes assumptions for future years.
Purpose: Financial management
Status: Operational
Personal information: No; Private sector data: No; Other agency data: No
System: National Student Loan Data System
Description: Compiles student loan information from the guaranteeing agencies. Is used for eligibility tracking and to calculate default rates.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: Yes

System: Loan Model Assumptions
Description: Estimates the cost of loan programs. Also analyzes loan default behavior.
Purpose: Financial management
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: Yes

System: Office of the Inspector General (OIG) Projects: Tumbleweed/Snowball
Description: Is part of an OIG investigation to determine potential fraud of financial aid grants primarily in New Hampshire.
Purpose: Detecting criminal activities or patterns
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: Yes

System: Central Processing System
Description: Processes applications for student aid. Contains data on more than 13 million applications. Data are mined for demographic trends.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No

System: Direct Loan Services System
Description: Is used to track the life of student direct loans and to monitor loan repayments.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: Yes; Other agency data: Yes

System: CheckFree Software/Travel Card Program
Description: Uses monthly billing information from Bank of America to create reports on travel expenditures to look for improper use of travel cards.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Personal information: Yes; Private sector data: Yes; Other agency data: No

Source: Department of Education.

Table 6: Department of Energy's Inventory of Data Mining Efforts

System: Counterintelligence Automated Investigative Management System (CI-AIMS)
Description: Is an investigative management system used by Department of Energy (DOE) field sites to track investigative cases on individuals or countries that threaten DOE assets. Information stored in this database is also used to support federal and state law enforcement agencies in support of national security.
Purpose: Detecting criminal activities or patterns
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No

System: Autonomy
Description: Will be used to mine a myriad intelligence-related databases within the intelligence community to uncover criminal or terrorist activities relating to DOE assets.
Purpose: Detecting criminal activities or patterns
Status: Planned
Personal information: Yes; Private sector data: No; Other agency data: No

System: Counterintelligence Analytical Research Data System (CARDS)
Description: Is used to log briefings and debriefings given to DOE employees who travel to foreign countries or interact with foreign visitors to DOE facilities. Data are mined to identify potential threats to DOE assets.
Purpose: Detecting criminal activities or patterns
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: Yes

Source: Department of Energy.

Table 7: Department of Health and Human Services' Inventory of Data Mining Efforts

Organization: Agency for Healthcare Research and Quality

System: National Patient Safety Network
Description: Will contain reports on adverse medical events that are filed by hospitals. The planned network's purpose is to take out patient personal identifiers and other items that may violate certain rules and create a warehouse that can be used by registered and unregistered users to evaluate and implement patient safety and quality measures. The network will be used to create tools that hospitals can use for making quality improvements.
Purpose: Improving service or performance
Status: Planned
Personal information: No; Private sector data: No; Other agency data: No

Organizations: Centers for Disease Control and Prevention; Department of Health and Human Services Headquarters; Food and Drug Administration

System: BioSense
Description: Enhances the nation's capability to rapidly detect bioterrorism events.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Operational
Personal information: No; Private sector data: Yes; Other agency data: Yes

System: DHHS Blood Monitoring Program
Description: Monitors the country's blood supply by keeping an inventory on red blood cells and platelets and monitors blood supply shortages, the nature of the shortage, and size of the shortages.
Purpose: Monitoring public health
Status: Operational
Personal information: No; Private sector data: Yes; Other agency data: No

System: Mission Accomplishment and Regulatory Compliance Services System
Description: Is a comprehensive redesign and reengineering of two core mission-critical legacy systems at Food and Drug Administration (FDA) that support the regulatory functions that primarily take place in FDA's field offices.
Purpose: Monitoring food or drug safety
Status: Operational
Personal information: No; Private sector data: Yes; Other agency data: Yes

System: Turbo Establishment Inspection Report
Description: Provides a standardized database of citations of regulations and statutes, and helps investigators in preparing reports. It will collect data on specific observations uncovered during inspections and provide a more uniform format nationwide that will allow for electronic searches and statistical analysis to be performed by citation.
Purpose: Improving safety
Status: Operational
Personal information: No; Private sector data: Yes; Other agency data: No

System: Phonetic Orthographic Computer Analysis
Description: Is a search engine that provides results indicating how similar two drug names are on a phonetic and orthographic basis. Its purpose is to help in the safety evaluation of proposed proprietary names to reduce drug name confusion after an application is approved by the FDA.
Purpose: Improving safety
Status: Operational
Personal information: No; Private sector data: Yes; Other agency data: No

System: MPRIS Data Warehouse
Description: Will provide data to support end user ad-hoc query analysis and standard reporting needs. It will provide the foundation for a central reporting repository that can be used to populate business-specific data marts.
Purpose: Improving service or performance
Status: Planned
Personal information: No; Private sector data: No; Other agency data: No
System: Development and Deployment of Advanced Analytical Tools for Drug Safety Risk Assessment
Description: Will develop advanced software tools for quantitative analysis of drug safety data. Medical officers and safety evaluators will use these advances in software tools.
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: Yes; Private sector data: Yes; Other agency data: Yes

System: Add data mining capability to CFSAN Adverse Event Reporting System
Description: Is a comprehensive system for tracking, reviewing, and reporting adverse event incidences involving foods, cosmetics, and dietary supplements. Integrating and centralizing the system and eliminating patchwork systems make information on these adverse events available to federal, state, and local governments as well as to industry and the public in a more timely and efficient manner.
Purpose: Monitoring food or drug safety
Status: Planned
Personal information: Yes; Private sector data: Yes; Other agency data: Yes

Organization: Health Resources and Services Administration

System: HRSA Geospatial Data Warehouse
Description: Data warehouse that primarily collects programmatic, demographic, and statistical data.
Purpose: Improving service or performance
Status: Operational
Personal information: No; Private sector data: Yes; Other agency data: Yes

Organization: Program Support Center

System: Employee Assistance Program Analysis
Description: Uses information from a database of employee assistance program case information that does not contain client personal identifiers. Data are mined for quality assurance and program management information that is used to enhance the quality and cost effectiveness of services.
Purpose: Improving service or performance
Status: Operational
Personal information: No; Private sector data: No; Other agency data: No

Source: Department of Health and Human Services.
Table 8: Department of Homeland Security's Inventory of Data Mining Efforts

Organization: Border and Transportation Security Directorate

System: Workforce Profile Data Mart
Description: Contains payroll and personnel data and is mined for workforce trends.
Purpose: Managing human resources
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: Yes

System: Customs Integrated Personnel Payroll System Data Mart
Description: Is a Customs data mart contained within Department of Homeland Security's workforce profile data mart. Personnel and payroll data are mined for workforce trends.
Purpose: Managing human resources
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: Yes

System: Internal Affairs Treasury Enforcement Communications System Audit Data Mart
Description: Assists the Internal Affairs group by mining criminal activity data to ascertain how Customs' employees are using the Treasury Enforcement System.
Purpose: Detecting criminal activities or patterns
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: Yes

System: Operations Management Reports Data Mart
Description: Assists in managing the operation of all ports of entry for incoming carriers, people, and cargo. Helps in making resource (people and equipment) allocation and operational improvement decisions.
Purpose: Improving service or performance
Status: Operational
Personal information: No; Private sector data: No; Other agency data: Yes

System: Automated Export System Data Mart
Description: Mines data on export trade in the U.S. and produces reports on historical shipping and receiving trends.
Purpose: Improving service or performance
Status: Operational
Personal information: No; Private sector data: Yes; Other agency data: Yes

System: Seized Property/Forfeitures, Penalties, and Fines Case Management Data Mart
Description: Mines data to ensure data quality and review work assignments. System has two components: one that processes legal cases like a law firm, and a second that serves as property and inventory control by tracking property seized.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No
System: Incident Data Mart
Description: Will look through incident logs for patterns of events. An incident is an event involving a law enforcement or government agency for which a log was created (e.g., traffic ticket, drug arrest, or firearm possession). The system may look at crimes in a particular geographic location, particular types of arrests, or any type of unusual activity.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Planned
Personal information: Yes; Private sector data: Yes; Other agency data: Yes

System: Case Management Data Mart
Description: Assists in managing law enforcement cases, including Customs cases. Reviews case loads, status, and relationships among cases.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Operational
Personal information: Yes; Private sector data: Yes; Other agency data: Yes

Organization: Emergency Preparedness and Response Directorate

System: Enterprise Data Warehouse
Description: Will take data from multiple, disparate systems and integrate the data into one reporting environment. The objective of the effort is to allow for the reduction of data within the agency and to provide an enterprise view of information necessary to drive critical business processes and decisions. Data on internal human resources, all aspects of disaster management, infrastructure, equipment location, etc., will be used.
Purpose: Disaster response and recovery
Status: Planned
Personal information: Yes; Private sector data: Yes; Other agency data: Yes

Organization: Information Analysis and Infrastructure Protection Directorate

System: Analyst Notebook I2
Description: Correlates events and people to specific information.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Operational
Personal information: Yes; Private sector data: Yes; Other agency data: No

System: Automatic Message Handling System (Verity)
Description: Automatically takes messages from external agencies and routes them to appropriate recipients.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Planned
Personal information: No; Private sector data: No; Other agency data: Yes

Organization: U.S. Coast Guard

System: Readiness Management System
Description: Assists in ensuring readiness for all Coast Guard missions.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No

System: CG Info
Description: Provides one-stop shopping for Coast Guard information. It is the central location and common interface for the entire Coast Guard to gain near real-time access to data from multiple, disparate Coast Guard information systems. It provides a single interface for users to view mission-critical support data.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: Yes

Organization: U.S. Secret Service

System: Criminal Investigation Division Data Mining
Description: Mines data in suspicious activity reports received from banks to find commonalities in data to assist in strategically allocating resources.
Purpose: Detecting criminal activities or patterns
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: Yes

Source: Department of Homeland Security.

Table 9: Department of the Interior's Inventory of Data Mining Efforts

Organization: Minerals Management Service

System: Data Mining of the Technical Information Management System (TIMS) Database
Description: Is a corporate database for oil and gas leases. The database is mined in support of policy development. One area of data mining is identification of leases that will be abandoned in the near future. Data mining has shown that leases with six or more producing wells in 1 year are almost never abandoned in the next year. Another application of data mining is the safety of oil and gas operations. For example, data mining has shown that accidents have a peak rate on Thursday mornings.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: Yes; Other agency data: No

Source: Department of the Interior.
Table 10: Department of Justice's Inventory of Data Mining Efforts

Organizations: Department of Justice Headquarters; Drug Enforcement Administration; Federal Bureau of Investigation

System: Drug/Financial Fusion Center
Description: Will contain data from, and be used by, Organized Crime and Drug Enforcement Task Force agencies. The system will permit the collection and cross case analysis of all drug and related financial investigative data.
Purpose: Detecting criminal activities or patterns
Status: Planned
Personal information: Yes; Private sector data: Yes; Other agency data: Yes

System: Statistical Management Analysis and Reporting Tool System (SMARTS)/SPSS
Description: Is a query analysis and reporting tool that pulls data from many systems. It allows for statistical analyses of drug cases Drug Enforcement Administration's statistical reporting.
Purpose: Detecting criminal activities or patterns
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: Yes

System: TOLLS
Description: Is a database of telephone calls from court ordered and approved wiretaps and Title III investigations. Information such as telephone numbers, time and date of calls, and call duration is captured. Data are mined for patterns to give leads in investigations of drug trafficking.
Purpose: Detecting criminal activities or patterns
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No

System: Secure Collaborative Operational Prototype Environment/Investigative Data Warehouse
Description: Allows the FBI to search multiple data sources through one interface to uncover terrorist and criminal activities and relationships. Data sources are a combination of structured and unstructured text.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: Yes

System: Foreign Terrorist Tracking Task Force Activity
Description: Supports the Foreign Terrorist Tracking Task Force that seeks to prevent foreign terrorists from gaining access to the United States. Data from the Department of Homeland Security, Federal Bureau of Investigation, and public data sources are put into a data mart and mined to determine unlawful entry and to support deportations and prosecutions.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Operational
Personal information: Yes; Private sector data: Yes; Other agency data: Yes

System: FBI Intelligence Community Data Marts
Description: Is intended to take a subset of approved data from a data warehouse and make it available to the intelligence community.
Purpose: Analyzing intelligence and detecting terrorist activities
Status: Planned
Personal information: Yes; Private sector data: No; Other agency data: Yes

Organizations: Federal Bureau of Prisons; U.S. Marshals Service

System: Business Information Warehouse
Description: Will be a warehouse designed to provide information on manufacturing by Federal Prison Industries, which runs 100 factories in various prisons. Data will be mined for information on the manufacturing environment (such as information on material on hand, scheduling, and the production process) and financial activities.
Purpose: Improving service or performance
Status: Planned
Personal information: No; Private sector data: No; Other agency data: Yes

System: USMS Workload Modeling
Description: Will seek to develop a workforce model that will support budget formulation, execution, and resource analysis. Will be a planning and execution activity that will be used to help determine the quantity and location of required resources.
Purpose: Managing human resources
Status: Planned
Personal information: Yes; Private sector data: No; Other agency data: No

Source: Department of Justice.

Table 11: Department of Labor's Inventory of Data Mining Efforts

System: Dashboard Display
Description: Provides links to programs throughout the Department of Labor's Employment Training Administration to provide reports or information on financial activities.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No
System: Enforcement Management System, Case Opening, and Results Analysis
Description: Is used to track investigations of violations of Title I and other criminal laws pertaining to pension and welfare rights.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: Yes; Other agency data: No

System: Employee Retirement Income Security Act Data System
Description: Is used to monitor compliance with Title I of the Employee Retirement Income Security Act.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No

System: Mine Safety and Health Administration Teradata Data Store
Description: Mines data from a data store of information on safety and health enforcement and demographic data for mine operations, along with miner accidents, injury, and illness data.
Purpose: Improving safety
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: Yes

System: Mathematical Statistics Research Center
Description: Will look at data from economic surveys to compare rates of nonresponse for Bureau of Labor Statistics.
Purpose: Improving service or performance
Status: Planned
Personal information: No; Private sector data: No; Other agency data: No

Source: Department of Labor.

Table 12: Department of State's Inventory of Data Mining Efforts

System: Citibank's Ad Hoc Reporting System
Description: Enables purchase card managers to track trends related to the usage of credit cards by employees in purchasing supplies and services for official use. Purchase card program is worldwide, and spending patterns and purchases are monitored for potential misuse or fraud.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Personal information: Yes; Private sector data: Yes; Other agency data: No

System: Purchase Card Management System
Description: Will involve the automation of internal workflow processes (system is in the early phases of development). Will use internal data and bank data to track trends and anomalies in the Department of State's worldwide purchase card program.
Purpose: Detecting fraud, waste, and abuse
Status: Planned
Personal information: Yes; Private sector data: Yes; Other agency data: No

Source: Department of State.
Table 13: Department of Transportation's Inventory of Data Mining Efforts

Organization: Department of Transportation Headquarters

System: DOT IT Security Management System
Description: Will collect information to allow management to assess its IT security infrastructure.
Purpose: Detecting fraud, waste, and abuse
Status: Planned
Personal information: Yes; Private sector data: No; Other agency data: No

Organization: National Highway Traffic Safety Administration

System: State Data System
Description: Analyzes, mines, and researches automotive crash data, such as statistics from rollovers of SUVs, from 22 states to improve highway safety and lessen fatalities. Policies can be set based on the data.
Purpose: Improving safety
Status: Operational
Personal information: No; Private sector data: No; Other agency data: No

System: Fatality Analysis Reporting System (FARS)
Description: Helps to evaluate the effectiveness of motor vehicle safety standards and highway safety programs. Data are collected from all 50 states, the District of Columbia, and Puerto Rico and are used to evaluate and support highway safety.
Purpose: Improving safety
Status: Operational
Personal information: Yes; Private sector data: Yes; Other agency data: Yes

System: National Automotive Sampling System
Description: Collects and mines information on automotive crashes. System is related to the Federal Motor Vehicle Safety Standards that regulate vehicle compliance items such as seat belts, air bags, and the stopping distance of brakes.
Purpose: Improving safety
Status: Operational
Personal information: Yes; Private sector data: Yes; Other agency data: No

Source: Department of Transportation.

Table 14: Department of the Treasury's Inventory of Data Mining Efforts

Organization: Financial Management Service

System: Treasury Offset Program (TOP) Cleanup
Description: Mines data to reduce the number of debts listed in TOP.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: Yes

System: Electronic Federal Tax Payment System (EFTPS) Marketing
Description: Is a free service offered by the Department of the Treasury for individuals and business taxpayers who pay their federal taxes electronically. Mining activity tracks enrollment, tax payment history, and usage trends.
Purpose: Increasing tax compliance
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No

Organization: Internal Revenue Service

System: Planning, Analysis, and Decision Support System
Description: Will be a component of the Custodial Accounting Program, which is the warehouse that is used to query transactional data and produce reports. This activity is meant to improve reporting and use decision support tools.
Purpose: Improving service or performance
Status: Planned
Personal information: Yes; Private sector data: No; Other agency data: Yes

System: Abusive Corporate Tax Shelter Detection Model
Description: Will model characteristics of corporate tax shelters and use models to predict corporate tax shelter abuse and to assess compliance risk in the corporate taxpayer population.
Purpose: Increasing tax compliance
Status: Planned
Personal information: Yes; Private sector data: Yes; Other agency data: No

System: K-1 Link Analysis
Description: Will be used to detect potential tax evasion.
Purpose: Increasing tax compliance
Status: Planned
Personal information: Yes; Private sector data: No; Other agency data: No

System: Research on the Population of Taxpayers Who Receive Earned Income Tax Credit
Description: Will be used to research data on taxpayers who receive the EITC.
Purpose: Detecting fraud, waste, and abuse
Status: Planned
Personal information: Yes; Private sector data: No; Other agency data: No

System: Issue Based Management Information System
Description: Will provide access to a variety of data sources within IRS. Will assist in research and case work.
Purpose: Increasing tax compliance
Status: Planned
Personal information: No; Private sector data: Yes; Other agency data: No

System: Electronic Fraud Detection System
Description: Mines data to evaluate and rate potentially fraudulent individual tax returns.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No

System: Reveal
Description: Will be used to detect financial criminal activity such as tax evasion.
Purpose: Detecting criminal activities or patterns
Status: Planned
Personal information: Yes; Private sector data: Yes; Other agency data: No

System: Oracle Model 22 Partnership Return Scoring System
Description: Takes information from individual tax returns and attempts to replicate judgments made by taxpayers to detect the likelihood of material errors.
Purpose: Increasing tax compliance
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No

System: SPSS Form 1120-S Return Scoring System
Description: Will automate the classification of certain corporate tax returns.
Purpose: Increasing tax compliance
Status: Planned
Personal information: Yes; Private sector data: No; Other agency data: No

System: Oracle Model 33 Partnership Scoring Model
Description: Will identify noncompliance in partnership returns.
Purpose: Increasing tax compliance
Status: Planned
Personal information: Yes; Private sector data: No; Other agency data: No

System: Compliance Laboratory
Description: Will identify taxpayer noncompliance by looking at groups of returns.
Purpose: Increasing tax compliance
Status: Planned
Personal information: Yes; Private sector data: Yes; Other agency data: Yes

Organization: U.S. Mint

System: Information Technology Intrusion Detection System
Description: Collects information on potential intrusions to U.S. Mint information systems. Looks for trends in security information reported by sensors to determine if illicit activity has occurred. Minimizes false positives.
Purpose: Improving
Status: Operational
Personal information: No; Private sector data: No; Other agency data: No

System: E-Commerce Fraud Analysis Activity
Description: Attempts to identify and stop fraudulent activity involving stolen credit cards to order products over the Internet or via telephone. Fraud rating identifiers are used to identify areas where fraud has occurred and to determine the likelihood of fraud. Allows for orders to be stopped or for orders over a certain dollar limit to be stopped.
Purpose: Detecting criminal activities or patterns
Status: Operational
Personal information: Yes; Private sector data: Yes; Other agency data: Yes

System: Data Warehouse
Description: Will be an integrated, scalable, expandable data warehouse that will support business functions by grouping the data in subject-oriented data marts. Each warehouse data mart will be defined to integrate both internal and external data to provide the necessary information to perform both historical and predictive analysis and support numerous calculations.
Purpose: Improving service or performance
Status: Planned
Personal information: No; Private sector data: No; Other agency data: No

Source: Department of the Treasury.

Table 15: Department of Veterans Affairs' Inventory of Data Mining Efforts

Organizations: Department of Veterans Affairs Headquarters; Veterans Benefits Administration; Veterans Health Administration

System: Veterans Affairs Central Incident Response Center
Description: Is used to monitor and manage intrusion detection and firewalls. Scripts are written for forensic analysis to go through data collected from system and network logs.
Purpose: Detecting criminal activities or patterns
Status: Operational
Personal information: Yes; Private sector data: Yes; Other agency data: No

System: Purchase Card Data Mining (SAS) Reports
Description: Will identify patterns in purchase card use to identify fraud and misuse and to maintain good internal controls.
Purpose: Detecting fraud, waste, and abuse
Status: Planned
Personal information: Yes; Private sector data: Yes; Other agency data: No

System: Travel Card Data Mining (SAS) Reports
Description: Will be used to look for patterns in the use of travel credit cards that indicate misuse or fraud and to maintain good internal controls.
Purpose: Detecting fraud, waste, and abuse
Status: Planned
Personal information: Yes; Private sector data: Yes; Other agency data: No

System: Office of Inspector General (OIG)
Description: Analyzes and matches (within the guidelines of the law) Veterans Affairs (VA) files, pertaining to both VA-provided benefits and health care services to detect patterns of waste, fraud, or abuse.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No

System: C & P Payment Data Analysis
Description: Analyzes compensation and pension data to detect fraud, waste, and abuse.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: Yes

System: C & P Large Payment Verification Process
Description: Serves as an internal control intended to make sure that payments over a certain dollar threshold are reviewed to detect potential fraud or abuse.
Purpose: Detecting fraud, waste, and abuse
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No

System: Primary Analysis and Classification
Description: Is used mainly to discover trends, incidents/events, and vulnerabilities that may exist in VA hospitals.
Purpose: Improving safety
Status: Operational
Personal information: No; Private sector data: No; Other agency data: No
System: Allocation Resource Center Database
Description: Is used in making resource allocation decisions based on the analysis of patient workload and cost data.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No

System: Veteran's Health Administration (VHA) Financial and Clinical Data Mart
Description: Integrates patient, clinical, and financial data to present a unified management perspective and enable consistent reporting.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No

System: Decision Support System
Description: Is used to identify patterns of care and patient outcomes linked to resource consumption and costs associated with each patient encounter.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No

System: Top 50 Standardization Listing/Managed Inventory System
Description: Is used to standardize medical and hospital supplies and equipment to (1) improve VHA's bargaining position when soliciting bids and (2) facilitate the ability to move doctors among hospitals.
Purpose: Improving service or performance
Status: Operational
Personal information: No; Private sector data: Yes; Other agency data: No

System: VISN 16 Data Warehouse
Description: Provides unified view of the VISN 16 VA region, composed of 10 medical centers and 30 outpatient clinics. The system gives a view of the enterprise for management purposes. It is mined for a variety of types of information such as patient encounters, lab tests, pharmacy records, etc.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No

Source: Department of Veterans Affairs.
Table 16: Environmental Protection Agency's Inventory of Data Mining Efforts

System: Conceptual Plans to Design an Approach and System to Review Financial Data
Description: Will regularly review financial data systems for contracts, bank cards, and small purchases and other financial databases for misuse or fraud of Environmental Protection Agency's assets.
Purpose: Detecting fraud, waste, and abuse
Status: Planned
Personal information: Yes; Private sector data: No; Other agency data: No

System: Drinking Water Data Warehouse
Description: Integrates and analyzes drinking water information from state, regional, and headquarters sources. Includes data on water systems, compliance, sample analytical results, and audit data.
Purpose: Monitoring public health
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: Yes

Source: Environmental Protection Agency.

Table 17: Export-Import Bank of the United States' Inventory of Data Mining Efforts

System: Integrated Information System Data Warehouse Mining for Financial Risk Information
Description: Is used to generate reports that describe bank lending activities and exposure trends.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes; Private sector data: No; Other agency data: No

Source: Export-Import Bank of the United States.

Table 18: Federal Deposit Insurance Corporation's Inventory of Data Mining Efforts

System: Real Estate Stress Test (RSST)
Description: Is used to measure real estate risk. Bank examiners use data from the system data as part of a pre-examination planning process to assist in identifying risk concentrations.
Purpose: Detecting risk in financial systems
Status: Operational
Personal information: No; Private sector data: No; Other agency data: Yes

System: Determination of Insured Deposits
Description: Will support the development of a new system for implementing the deposit insurance claims.
Purpose: Improving service or performance
Status: Planned
Personal information: Yes; Private sector data: No; Other agency data: No
System: Statistical CAMELS Offsite Review
Description: Is used to rate financial institutions' performance and risk management practices.
Purpose: Detecting risk in financial systems
Status: Operational
Personal information: No; Private sector data: No; Other agency data: Yes

System: Growth Monitoring System
Description: Is used to identify financial institutions that have experienced significant growth. Serves as an early warning system for detecting financial institutions that might pose financial risk to FDIC.
Purpose: Detecting risk in financial systems
Status: Operational
Personal information: No; Private sector data: No; Other agency data: Yes

Source: Federal Deposit Insurance Corporation.

Table 19: Federal Reserve System's Inventory of Data Mining Efforts

System: Office of the Inspector General (OIG), Audit Services
Description: Will support audits and evaluations. Using ACL, queries will be run against the board's financial and personnel systems to detect fraud, waste, and abuse, or to provide information supporting any aspect of an OIG project.
Purpose: Detecting fraud, waste, and abuse
Status: Planned
Personal information: Yes; Private sector data: No; Other agency data: No

Source: Federal Reserve System.

Table 20: National Aeronautics and Space Administration's Inventory of Data Mining Efforts

System: Archiving of Web Information at National Aeronautics and Space Administration (NASA) and Goddard Space Federal Center (GSFC)
Description: Will gather metadata on the GSFC Web site at NASA to preserve NASA legacy information.
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: No; Private sector data: No; Other agency data: Yes

System: My Goddard Search-Mining of Goddard's Web environment
Description: Will allow Web mining of scientific data at Goddard Space Center. It is referred to as "Google for Goddard."
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: No; Private sector data: Yes; Other agency data: No

System: NetContext
Description: Will monitor network traffic for the purpose of identifying bandwidth use, fraud, abuse, and IT security-related activities.
Purpose: Detecting fraud, waste, and abuse
Status: Planned
Personal information: Yes; Private sector data: No; Other agency data: No

Organization/system name: Geophysics Time Series Analysis
Description: Will develop a set of algorithms to identify patterns within temporal activities. The data will be trajectories of objects and movement of objects within images.
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: No
Private sector data: No
Other agency data: Yes

Organization/system name: "Simmarizer" (Simulation-Based Summary/Discovery of Knowledge)
Description: Uses data mining techniques to extract knowledge from simulators to understand conditions and scenarios regarding space missions.
Purpose: Analyzing scientific and research information
Status: Operational
Personal information: No
Private sector data: No
Other agency data: No

Organization/system name: Global Environmental and Earth Science Information System (GENESIS)
Description: Is used to obtain information about global climate changes.
Purpose: Analyzing scientific and research information
Status: Operational
Personal information: No
Private sector data: No
Other agency data: Yes

Organization/system name: Machine Learning and Data Mining for Improved Intelligent Data Understanding of High Dimensional Earth Sensed Data
Description: Will find patterns and relationships in large, complex earth science data sets, specifically for rare and small events hidden in larger data signals. Will build new capabilities to understand NASA science data.
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: No
Private sector data: No
Other agency data: Yes

Organization/system name: Distributed Data Mining Techniques for Object Discovery in the National Virtual Observatory (NVO)
Description: Involves using a data mining tool set for space science research. Incorporates a small number of targeted data mining techniques in order to address specific NASA space science research programs. In particular, the data mining environment will be used to explore NASA's large space science data collections. These techniques are being applied to astronomical object discovery, identification, classification, and interpretation across large multiple distributed astronomy data collections.
Purpose: Analyzing scientific and research information
Status: Operational
Personal information: No
Private sector data: No
Other agency data: Yes

Organization/system name: Diamond Eye (System for Mining Images)
Description: Analyzes large sets of images looking for specific features.
Purpose: Analyzing scientific and research information
Status: Operational
Personal information: No
Private sector data: Yes
Other agency data: Yes

Organization/system name: Data Mining of 3-D Numerical Model Forecast Output and Its Application to Atmospheric Research
Description: Will automate the analysis of weather model output, observation, and satellite data to allow for a better understanding of the science of weather dynamics and to predict future weather events.
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: No
Private sector data: No
Other agency data: Yes

Organization/system name: Ecological Forecasting
Description: Will develop an adaptable system that can be used to mine large volumes of scientific data, identify novel causal relationships in the data about earth system processes, and rapidly incorporate discoveries with biospheric models to generate now-casts and forecasts of biospheric events and conditions.
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: No
Private sector data: No
Other agency data: Yes

Organization/system name: Distributed Data Mining for Large NASA Databases (Earth Science Earth Observing System Data)
Description: Will research changes, trends, and relationships in Earth Observing System (EOS) data. The major feature of this activity is that it will allow for different data to be mined in parts and then merged. The capability is needed for instances when scientific data are at different locations. Research-quality software will be used to allow for a communication system and run-time environment for applying a collective data analysis approach not bound to any specific platform, learning algorithm, or representation of knowledge.
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: No
Private sector data: No
Other agency data: Yes

Organization/system name: Discovery of Changes from the Global Carbon Cycle and Climate System Using Data Mining Activity
Description: Will detect patterns in scientific data that are geospatial and dynamic and represented as raster data (gridded cells of surfaces such as the sun's or earth's surfaces). Mining capabilities are being developed for future NASA-relevant data and science.
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: No
Private sector data: No
Other agency data: Yes

Organization/system name: "AutoSciProd" (Automatic Generation of Science Products from Large Image Data Sets)
Description: Uses statistical and image data to determine and improve science products.
Purpose: Analyzing scientific and research information
Status: Operational
Personal information: No
Private sector data: No
Other agency data: No

Organization/system name: Near Archive Data Mining of Earth Science Data
Description: Pulls data from an archive of earth science data and applies scientists' analyses and algorithms to the data.
Purpose: Analyzing scientific and research information
Status: Operational
Personal information: No
Private sector data: No
Other agency data: No

Organization/system name: Spectral Analysis Automation (SAA) System
Description: Will improve the collection, identification, and evaluation of spectral data to better meet scientists' requirements.
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: No
Private sector data: No
Other agency data: No

Organization/system name: Multiple Sensor Image Registration, Image Fusion and Dimension Reduction Using Wavelets
Description: Will be used for collaborative preprocessing of data and research on wavelets. Will comprise research software that looks at different technologies such as image processing and dimensions.
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: No
Private sector data: No
Other agency data: No

Organization/system name: GMSEC Event Message Data Mining Task
Description: Will be used to determine health of and reasons for problems with satellite systems.
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: No
Private sector data: No
Other agency data: No

Organization/system name: Intrusion Detection System
Description: Looks at all traffic that traverses NASA's networks' borders.
Purpose: Improving information security
Status: Operational
Personal information: Yes
Private sector data: No
Other agency data: No

Organization/system name: AvSP/ASMM Foreign Object Detection Toolset
Description: Is used with simulations to identify foreign object damage indicators for commercial jet engines.
Purpose: Analyzing scientific and research information
Status: Operational
Personal information: No
Private sector data: Yes
Other agency data: No

Organization/system name: Mission and Science Measurement and Discovery Systems
Description: Will be a basic technology research program that will also support infusion of resulting technologies into NASA missions. Purpose of the program is to solve the research challenge in extracting the most scientific knowledge from NASA's space missions and data archives.
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: No
Private sector data: No
Other agency data: No

Organization/system name: StarTool: Solar Active Region Detection
Description: Is used for recognition of solar activity in sequences of multiband solar images.
Purpose: Analyzing scientific and research information
Status: Operational
Personal information: No
Private sector data: No
Other agency data: No

Organization/system name: "Toogle" (Times-Series Search Engine)
Description: Searches for time-series data. Is similar to a Google search engine.
Purpose: Improving safety
Status: Operational
Personal information: No
Private sector data: No
Other agency data: No

Organization/system name: Use of Data Mining, Remote Sensing, and Geographic Information Systems for Wildfire Detection and Prediction
Description: Will help the National Oceanic and Atmospheric Administration automate its fire detection systems and improve the accuracy of fire detection systems.
Purpose: Improving service or performance
Status: Planned
Personal information: No
Private sector data: No
Other agency data: Yes

Organization/system name: Knowledge Discovery and Data Mining Based on Hierarchical Image Segmentation
Description: Will mine data using software that has been developed to exploit information from a hierarchical image segmentation process.
Purpose: Analyzing scientific and research information
Status: Planned
Personal information: No
Private sector data: Yes
Other agency data: Yes

Source: National Aeronautics and Space Administration.

Table 21: Nuclear Regulatory Commission's Inventory of Data Mining Efforts

Organization/system name: Licensee Event Report Data
Description: Identifies nuclear safety trends and patterns in commercial nuclear power events.
Purpose: Improving safety
Status: Operational
Personal information: No
Private sector data: Yes
Other agency data: No

Organization/system name: Centralized Information Delivery
Description: Will consolidate and standardize reporting for nuclear reactor regulations.
Purpose: Improving service or performance
Status: Planned
Personal information: Yes
Private sector data: No
Other agency data: No

Source: Nuclear Regulatory Commission.

Table 22: Office of Personnel Management's Inventory of Data Mining Efforts

Organization/system name: CRIS Retirement Data Mining Activity
Description: Mines federal employee benefits data such as information on retirement and life insurance to assist in managing federal employee eligibilities and entitlements.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes
Private sector data: No
Other agency data: Yes

Source: Office of Personnel Management.

Table 23: Pension Benefit Guaranty Corporation's Inventory of Data Mining Efforts

Organization/system name: Corporate Performance Indicators and Analytics
Description: Will streamline access to management and operational performance measures and permit the correlation of performance and output measures.
Purpose: Improving service or performance
Status: Planned
Personal information: No
Private sector data: No
Other agency data: Yes

Organization/system name: Corporate Policy and Research Department's Forecasting System
Description: Is a stochastic simulation model that incorporates historic equity and interest rates and bankruptcy possibilities to forecast scenarios for more than 300 pension plans and their related corporate sponsors.
Purpose: Improving service or performance
Status: Operational
Personal information: No
Private sector data: Yes
Other agency data: Yes

Source: Pension Benefit Guaranty Corporation.

Table 24: Railroad Retirement Board's Inventory of Data Mining Efforts

Organization/system name: Railroad Retirement Board Data Stores
Description: Consists of two major databases (payment and entitlement history and employment data maintenance) that are mined by actuaries to produce annual actuarial reports and for audit support and quality control.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes
Private sector data: No
Other agency data: Yes

Source: Railroad Retirement Board.

Table 25: Small Business Administration's Inventory of Data Mining Efforts

Organization/system name: Loan Monitoring System
Description: Helps to identify, measure, and manage the risk of Small Business Administration's portfolio. Business credit scores are used but individual credit scores are not.
Purpose: Improving service or performance
Status: Operational
Personal information: Yes
Private sector data: Yes
Other agency data: No

Organization/system name: MONSTER and Econometric Models
Description: Mines data from database that includes all transactions for each loan that affects SBA subsidy costs, to assist in determining credit subsidy rates for SBA's various credit programs.
Purpose: Financial management
Status: Operational
Personal information: Yes
Private sector data: No
Other agency data: No

Source: Small Business Administration.

GAO's Mission

The General Accounting Office, the audit, evaluation and investigative arm of Congress, exists to support Congress in meeting its constitutional responsibilities and to help improve the performance and accountability of the federal government for the American people. GAO examines the use of public funds; evaluates federal programs and policies; and provides analyses, recommendations, and other assistance to help Congress make informed oversight, policy, and funding decisions. GAO's commitment to good government is reflected in its core values of accountability, integrity, and reliability.

Obtaining Copies of GAO Reports and Testimony

The fastest and easiest way to obtain copies of GAO documents at no cost is through the Internet. GAO's Web site (www.gao.gov) contains abstracts and full-text files of current reports and testimony and an expanding archive of older products. The Web site features a search engine to help you locate documents using key words and phrases. You can print these documents in their entirety, including charts and other graphics.

Each day, GAO issues a list of newly released reports, testimony, and correspondence.
GAO posts this list, known as "Today's Reports," on its Web site daily. The list contains links to the full-text document files. To have GAO e-mail this list to you every afternoon, go to www.gao.gov and select "Subscribe to e-mail alerts" under the "Order GAO Products" heading.

Order by Mail or Phone

The first copy of each printed report is free. Additional copies are $2 each. A check or money order should be made out to the Superintendent of Documents. GAO also accepts VISA and Mastercard. Orders for 100 or more copies mailed to a single address are discounted 25 percent. Orders should be sent to:

U.S. General Accounting Office
441 G Street NW, Room LM
Washington, D.C. 20548

To order by Phone:
Voice: (202) 512-6000
TDD: (202) 512-2537
Fax: (202) 512-6061

To Report Fraud, Waste, and Abuse in Federal Programs

Contact:
Web site: www.gao.gov/fraudnet/fraudnet.htm
E-mail: [email protected]
Automated answering system: (800) 424-5454 or (202) 512-7470

Public Affairs

Jeff Nelligan, Managing Director, [email protected] (202) 512-4800
U.S. General Accounting Office, 441 G Street NW, Room 7149
Washington, D.C. 20548

*** End of document. ***