b"<html>\n<title> - NEXT GENERATION COMPUTING AND BIG DATA ANALYTICS</title>\n<body><pre>[House Hearing, 113 Congress]\n[From the U.S. Government Publishing Office]\n\n\n\n \n                       NEXT GENERATION COMPUTING \n                         AND BIG DATA ANALYTICS \n\n=======================================================================\n\n                             JOINT HEARING\n\n                               BEFORE THE\n\n                       SUBCOMMITTEE ON RESEARCH &\n                       SUBCOMMITTEE ON TECHNOLOGY\n\n              COMMITTEE ON SCIENCE, SPACE, AND TECHNOLOGY\n                        HOUSE OF REPRESENTATIVES\n\n                    ONE HUNDRED THIRTEENTH CONGRESS\n\n                             FIRST SESSION\n\n                               __________\n\n                       WEDNESDAY, APRIL 24, 2013\n\n                               __________\n\n                           Serial No. 113-22\n\n                               __________\n\n Printed for the use of the Committee on Science, Space, and Technology\n\n       Available via the World Wide Web: http://science.house.gov\n\n\n                               ----------\n                         U.S. GOVERNMENT PRINTING OFFICE \n\n80-561 PDF                       WASHINGTON : 2013 \n\n              COMMITTEE ON SCIENCE, SPACE, AND TECHNOLOGY\n\n                   HON. LAMAR S. SMITH, Texas, Chair\nDANA ROHRABACHER, California         EDDIE BERNICE JOHNSON, Texas\nRALPH M. HALL, Texas                 ZOE LOFGREN, California\nF. JAMES SENSENBRENNER, JR.,         DANIEL LIPINSKI, Illinois\n    Wisconsin                        DONNA F. EDWARDS, Maryland\nFRANK D. LUCAS, Oklahoma             FREDERICA S. WILSON, Florida\nRANDY NEUGEBAUER, Texas              SUZANNE BONAMICI, Oregon\nMICHAEL T. McCAUL, Texas             ERIC SWALWELL, California\nPAUL C. BROUN, Georgia               DAN MAFFEI, New York\nSTEVEN M. 
PALAZZO, Mississippi       ALAN GRAYSON, Florida\nMO BROOKS, Alabama                   JOSEPH KENNEDY III, Massachusetts\nRANDY HULTGREN, Illinois             SCOTT PETERS, California\nLARRY BUCSHON, Indiana               DEREK KILMER, Washington\nSTEVE STOCKMAN, Texas                AMI BERA, California\nBILL POSEY, Florida                  ELIZABETH ESTY, Connecticut\nCYNTHIA LUMMIS, Wyoming              MARC VEASEY, Texas\nDAVID SCHWEIKERT, Arizona            JULIA BROWNLEY, California\nTHOMAS MASSIE, Kentucky              MARK TAKANO, California\nKEVIN CRAMER, North Dakota           ROBIN KELLY, Illinois\nJIM BRIDENSTINE, Oklahoma\nRANDY WEBER, Texas\nCHRIS STEWART, Utah\nVACANCY\n                                 ------                                \n\n                        Subcommittee on Research\n\n                   HON. LARRY BUCSHON, Indiana, Chair\nSTEVEN M. PALAZZO, Mississippi       DANIEL LIPINSKI, Illinois\nMO BROOKS, Alabama                   ZOE LOFGREN, California\nSTEVE STOCKMAN, Texas                AMI BERA, California\nCYNTHIA LUMMIS, Wyoming              ELIZABETH ESTY, Connecticut\nJIM BRIDENSTINE, Oklahoma            EDDIE BERNICE JOHNSON, Texas\nLAMAR S. SMITH, Texas\n                                 ------                                \n\n                       Subcommittee on Technology\n\n                  HON. THOMAS MASSIE, Kentucky, Chair\nJIM BRIDENSTINE, Oklahoma            FREDERICA S. WILSON, Florida\nRANDY HULTGREN, Illinois             SCOTT PETERS, California\nDAVID SCHWEIKERT, Arizona            DEREK KILMER, Washington\n                                     EDDIE BERNICE JOHNSON, Texas\nLAMAR S. SMITH, Texas\n\n\n\n                            C O N T E N T S\n\n                       Wednesday, April 24, 2013\n\n                                                                   Page\nWitness List.....................................................     
2\n\nHearing Charter..................................................     3\n\n                           Opening Statements\n\nStatement by Representative Larry Bucshon, Chairman, Subcommittee \n  on Research, Committee on Science, Space, and Technology, U.S. \n  House of Representatives.......................................     8\n    Written Statement............................................     9\n\nStatement by Representative Daniel Lipinski, Ranking Minority \n  Member, Subcommittee on Research, Committee on Science, Space, \n  and Technology, U.S. House of Representatives..................    10\n    Written Statement............................................    11\n\nStatement by Representative Thomas Massie, Chairman, Subcommittee \n  on Technology, Committee on Science, Space, and Technology, \n  U.S. House of Representatives..................................    12\n    Written Statement............................................    13\n\n\nStatement by Representative Frederica S. Wilson, Ranking Minority \n  Member, Subcommittee on Technology, Committee on Science, \n  Space, and Technology, U.S. House of Representatives...........    13\n    Written Statement............................................    14\n\n                               Witnesses:\n\nDr. David McQueeney, Vice President, Technical Strategy and \n  Worldwide Operations, IBM Research\n    Oral Statement...............................................    16\n    Written Statement............................................    18\n\nDr. Michael Rappa, Director, Institute for Advanced Analytics, \n  Distinguished University Professor, North Carolina State \n  University\n    Oral Statement...............................................    26\n    Written Statement............................................    28\n\nDr. 
Farnam Jahanian, Assistant Director for the Computer and \n  Information Science and Engineering (CISE) Directorate, \n  National Science Foundation\n    Oral Statement...............................................    36\n    Written Statement............................................    38\n\nDiscussion.......................................................    55\n\n             Appendix I: Answers to Post-Hearing Questions\n\nDr. Michael Rappa, Director, Institute for Advanced Analytics, \n  Distinguished University Professor, North Carolina State \n  University.....................................................    76\n\nDr. Farnam Jahanian, Assistant Director for the Computer and \n  Information Science and Engineering (CISE) Directorate, \n  National Science Foundation....................................    79\n\n            Appendix II: Additional Material for the Record\n\nIDC IVIEW report, The Digital Universe in 2020: Big Data, Bigger \n  Digital Shadows, and Biggest Growth in the Far East, submitted \n  by Representative Derek Kilmer, Subcommittee on Technology, \n  Committee on Science, Space, and Technology, U.S. House of \n  Representatives................................................    86\n\n\n            NEXT GENERATION COMPUTING AND BIG DATA ANALYTICS\n\n                              ----------                              \n\n\n                       WEDNESDAY, APRIL 24, 2013\n\n                  House of Representatives,\n                                 Subcommittee on Research &\n                                 Subcommittee on Technology\n               Committee on Science, Space, and Technology,\n                                                   Washington, D.C.\n\n    The Subcommittees met, pursuant to call, at 10:04 a.m., in \nRoom 2318 of the Rayburn House Office Building, Hon. Larry \nBucshon [Chairman of the Subcommittee on Research] presiding.\n\n    Chairman Bucshon. All right. 
This joint hearing of the \nSubcommittee on Research and the Subcommittee on Technology \nwill come to order.\n    Good morning, and welcome to today's joint hearing entitled \n``Next Generation Computing and Big Data Analytics.'' In front \nof you are packets containing the written testimony, \nbiographies and Truth in Testimony disclosures for today's \nwitnesses.\n    Before I get started, since this is a joint hearing \ninvolving two Subcommittees, I want to explain how we will \noperate procedurally so all Members understand how the \nquestion-and-answer period will be handled. As always, we will \nalternate rounds of questioning between majority and minority \nMembers. The Chairmen and Ranking Members of the Research and \nTechnology Subcommittees will be recognized first. Then we will \nrecognize Members present at the gavel in order of seniority on \nthe full Committee and those coming in after the gavel will be \nrecognized in order of their arrival. I now recognize myself \nfor five minutes for an opening statement.\n    Again, I would like to welcome everyone to today's hearing \nwhere we will examine how advancements in information \ntechnology and data analytics enable private and public sector \norganizations to provide greater value to their customers and \ncitizens. Industry, academia, and government are all interested \nin determining how to extract value, gain insights, and make \nbetter decisions based on the wealth of data that is generated \ntoday. In recent years, ``big data'' has become the popular \nterm used to encompass this phenomenon.\n    TechAmerica, an information technology trade association, \ndefines big data as ``large volumes of high-velocity, complex \nand variable data that require advanced techniques and \ntechnologies to enable the capture, storage, distribution, \nmanagement, and analysis of the information.''\n    Big data offers a range of opportunities for private \nindustry to reduce costs and increase profitability. 
It can \nenable scientists to make discoveries on a previously \nunreachable scale. And it can allow governments to identify \nways to serve their citizens more efficiently.\n    The McKinsey Global Institute predicts that effective \ninformation management can provide $300 billion in annual value \nto the U.S. health care sector alone. TechAmerica released a \nreport last year highlighting how big data initiatives can \nimprove the efficiency and effectiveness of government \nservices, and through the use of advanced computing power and \nanalytic techniques, universities and Federal laboratories can \ndrive new research initiatives that will significantly increase \nour scientific knowledge base.\n    There are also various challenges associated with big data \nthat the Committee will explore today. McKinsey has estimated \nthat the U.S. will face a shortfall of 140,000 to 190,000 \nprofessionals with significant technical depth in data \nanalytics, and a further shortfall of an additional 1.5 million \nmanagers and analysts who can work effectively with big data \nanalysis by 2018. Committee Members will be interested to learn \nhow industry, academia, and government are addressing this \nshortfall.\n    While the term ``big data'' is relatively new, public and \nprivate organizations have been investing in computing power \nand data analytics for a number of years. In March of last \nyear, the Obama Administration announced a Big Data Research \nand Development Initiative, including $200 million in new \nfunding across six different government departments and \nagencies. I am interested to learn how effectively these \nprograms are being coordinated across the different Federal \nagencies to ensure that taxpayer dollars are being leveraged \neffectively. 
Finally, privacy and security are major concerns \nwhen private and public organizations are collecting, \nanalyzing, and disseminating massive data sets.\n    We have an excellent panel of witnesses ranging across \nindustry, academia, and government. I would like to extend my \nappreciation to each of our witnesses for taking the time and \neffort to appear before us today. We look forward to your \ntestimony.\n    [The prepared statement of Mr. Bucshon follows:]\n\n Prepared Statement of Subcommittee on Research Chairman Larry Bucshon\n\n    Good morning, I would like to welcome everyone to today's hearing \nwhere we will examine how advancements in information technology and \ndata analytics enable private and public sector organizations to \nprovide greater value to their customers and citizens.\n    Industry, academia, and government are all interested in \ndetermining how to extract value, gain insights, and make better \ndecisions based on the wealth of data that is generated today. In \nrecent years, ``Big Data'' has become the popular term used to \nencompass this phenomenon.\n    TechAmerica, an information technology trade association, defines \nBig Data as ``large volumes of high velocity, complex and variable data \nthat require advanced techniques and technologies to enable the \ncapture, storage, distribution, management, and analysis of the \ninformation.''\n    Big Data offers a range of opportunities for private industry to \nreduce costs and increase profitability. It can enable scientists to \nmake discoveries on a previously unreachable scale. And it can allow \ngovernments to identify ways to serve their citizens more efficiently.\n    The McKinsey Global Institute predicts that effective information \nmanagement can provide $300 billion in annual value to the US health \ncare sector alone. TechAmerica released a report last year highlighting \nhow Big Data initiatives can improve the efficiency and effectiveness \nof government services. 
And, through the use of advanced computing \npower and analytics techniques, universities and federal laboratories \ncan drive new research initiatives that will significantly increase our \nscientific knowledge-base.\n    There are also various challenges associated with Big Data that the \nCommittee will explore today. McKinsey has estimated that the US will \nface a shortfall of 140,000 to 190,000 professionals with significant \ntechnical depth in data analytics, and a further shortfall of an \nadditional 1.5 million managers and analysts who can work effectively \nwith big data analysis by 2018. Committee members will be interested to \nlearn how industry, academia, and government are addressing this \nshortfall.\n    While the term Big Data is relatively new, public and private \norganizations have been investing in computing power and data analytics \nfor a number of years. In March of last year, the Obama Administration \nannounced a ``Big Data Research and Development Initiative,'' including \n$200 million in new funding across six different federal departments \nand agencies. I am interested to learn how effectively these programs \nare being coordinated across the different federal agencies to ensure \nthat taxpayer dollars are being leveraged effectively.\n    Finally, privacy and security are major concerns when private and \npublic organizations are collecting, analyzing, and disseminating \nmassive data sets. We have an excellent panel of witnesses ranging \nacross industry, academia and government. I'd like to extend my \nappreciation to each of our witnesses for taking the time and effort to \nappear before us today. We look forward to your testimony.\n\n    Chairman Bucshon. I will now yield to Mr. Lipinski for his \nopening statement.\n    Mr. Lipinski. Thank you. I want to thank you, Chairman \nBucshon, and I want to thank Chairman Massie for holding this \nhearing. 
I want to welcome and thank the witnesses for being \nhere.\n    Today's hearing gives us an opportunity to talk about the \nnew tools and analytics that are being developed for big data. \nAs Chairman Bucshon stated, big data can be thought of as large \nvolumes of complex and diverse types of data that change \nrapidly with time.\n    In basic scientific research in national security as well \nas in economic sectors ranging from energy to health care, big \ndata challenges are becoming fundamentally important. \nEffectively dealing with big data can impact how we do business \nand how we think about the world.\n    As a Member of the Research Subcommittee for several years, \nI have watched as the amount and complexity of data has grown \nby leaps and bounds. The field of astronomy is a great example. \nWhen the Sloan Digital Sky Survey started work in 2000, its \ntelescope in New Mexico collected more data in a few weeks than \nhad been collected in the history of astronomy, and that \ntelescope will be surpassed when the Large Synoptic Survey \nTelescope begins scientific operations in 2020. LSST will \nphotograph the entire sky every few days, producing data at a \nrate almost 100 times greater than the Sloan Survey. But data \nis useless without the means to store and analyze it in an \nefficient manner.\n    The types of data are changing as well. Data has gone from \nbeing mostly numbers entered into Excel spreadsheets to data \ncoming from sensors, cell phone cameras and millions of email \nmessages. In fact, it is estimated that over 85 percent of data \ngenerated today are these kinds of unstructured data, data like \nvideos and emails. The change in the volume and variety of data \nas well as how fast data is being produced and changed creates \nalmost limitless opportunities. 
For example, since \ncybersecurity data is massive, varied, and changing quickly, \nbig data technologies have the potential to detect and prevent \ncyber attacks before they happen. I know that organizations \nlike IBM are developing technologies to do just that. \nAdditionally, big data could be used to establish new business \nmodels, create transparency, improve decision-making and reduce \ninefficiencies within businesses and government.\n    But along with the opportunities, there are a number of \nchallenges. We need new tools and software packages to manage, \norganize, and analyze all these different kinds of data. \nAdditionally, we will need an analytic workforce to ensure the \ngains of big data. These challenges necessitate involvement \nfrom government, academia and the private sector. That is why I \nam happy to see all those sectors represented here today.\n    The government has and will continue to play an \ninstrumental role in this area. For instance, the Networking \nand Information Technology Research and Development program, or \nNITRD, created an interagency big data group that is \ncoordinating Federal efforts in technologies, research, \ncompetitions, and workforce development for big data. We had a \nhearing on the NITRD program back in February, and I expect \nthat we will be able to take a broader look at many of the same \nissues in today's hearing.\n    In some cases, agencies have teamed up to issue joint \nsolicitations. For example, NSF and NIH have a joint big data \ngrant program that awarded nearly $15 million of grants to \neight teams of researchers last year. These first award grants \nwent to projects focused on designing new tools for big data \nand new data analytic approaches. We will be hearing more about \nthese and other interagency activities from Dr. Jahanian in his \ntestimony. 
We will also learn more about specific programs at \nNSF, one of the leading agencies in Federal big data efforts on \nboth the analytics side and the computational resources side.\n    As I mentioned before, one of the areas being coordinated \nthrough NITRD is workforce development for big data. Several \nagencies, including NSF, have education activities to support a \nnew generation of big data researchers. As we will likely hear \nfrom all of the witnesses, we face a looming shortage of \nworkers with the skills needed to analyze and manage large, \ncomplex and high-velocity data sets. There is some overlap with \nthe broader STEM skills we so often speak about in this \ncommittee, but there are also unique skills required to address \nthe big challenges of big data. We need to consider how to \nbuild those skills into STEM curricula, especially at the \nundergraduate and graduate levels. I look forward to hearing \nfrom our witnesses about the current educational efforts and \nwhat additional initiatives may be necessary.\n    And finally, since big data involves different types of \ndata that can be produced and transferred quickly, there are \nconcerns over privacy. We need to ensure that we strike the \nright balance between exploring and implementing all of the \npotential benefits of big data while also protecting \nindividuals' personal information.\n    I look forward to hearing the witnesses' testimony and our \ndiscussion today, and I yield back the balance of my time.\n    [The prepared statement of Mr. Lipinski follows:]\n\n             Prepared Statement of Subcommittee on Research\n                Ranking Minority Member Daniel Lipinski\n\n    Thank you, Chairmen Bucshon and Massie for holding this hearing on \nexamining the next generation of computing and big data analytics. 
I \nwant to welcome and thank the witnesses for being here today.\n    Today's hearing gives us an opportunity to talk about the new tools \nand analytics that are being developed for big data. Big data can be \nthought of as large volumes of complex and diverse types of data that \nare also high velocity--meaning they change rapidly with time.\n    As a member of the Research Subcommittee for several years now, I \nhave watched as the amount and complexity of data has grown by leaps \nand bounds. The field of astronomy is a great example. When the Sloan \nDigital Sky Survey started work in 2000, its telescope in New Mexico \ncollected more data in a few weeks than had been collected in the \nhistory of astronomy. And that telescope will be surpassed when the \nLarge Synoptic Survey Telescope goes online in about 2020. LSST will \nphotograph the entire sky every few days. That's difficult for any of \nus to wrap our heads around.\n    The types of data are changing as well. Data has gone from being \nmostly numbers entered in excel spreadsheets to data coming from \nsensors, cellphone cameras, and millions of email messages. In fact, it \nis estimated that over 85 percent of data generated today are these \nkinds of unstructured data--data like videos or emails.\n    The change in the volume and variety of data as well as how fast \ndata is being produced and changed creates almost limitless \nopportunities. For example, since cybersecurity data is massive, \nvaried, and changing quickly, big data technologies have the potential \nto detect and prevent cyber attacks before they even happen. I know \nthat organizations like IBM are developing technologies to do just \nthat. Additionally, big data could be used to establish new business \nmodels, create transparency, improve decision-making, and reduce \ninefficiencies within businesses and government.\n    But along with the opportunities, there are a number of challenges. 
\nWe need new tools and software packages to manage, organize, and \nanalyze all these different kinds of data. Additionally, we will need \nan analytic workforce to ensure the gains of big data. These challenges \nnecessitate involvement from government, academia, and the private \nsector. That is why I am happy to see all those sectors represented \ntoday.\n    The government has and will continue to play an instrumental role \nin this area. For instance, the Networking and Information Technology \nResearch and Development--or NITRD--program created an interagency big \ndata group that is coordinating federal efforts in technologies, \nresearch, competitions, and workforce development for big data.\n    In some cases, agencies have teamed up to issue joint \nsolicitations. For example, NSF and NIH have a joint big data grant \nprogram that awarded nearly $15 million of grants to eight teams of \nresearchers last year. These first awarded grants went to projects \nfocused on designing new tools for big data and new data analytic \napproaches. We will hear more about these and other interagency \nactivities from Dr. Jahanian in his testimony. We will also learn more \nabout specific programs at NSF, one of the leading agencies in federal \nbig data efforts on both the analytics side and the computational \nresources side.\n    As I mentioned before, one of the areas being coordinated through \nNITRD is the workforce development needs for big data. Several \nagencies, including NSF, have education activities to support a new \ngeneration of big data researchers. As you will likely hear from all of \nthe witnesses, we face a looming shortage of workers with the skills \nneeded to analyze and manage large, complex, and high-velocity data \nsets. There is some overlap with the broader STEM skills we often speak \nof in this committee. But there are also some unique skills required to \naddress the challenges of big data. 
We need to consider how to build \nthose skills into STEM curricula, especially at the undergraduate and \ngraduate levels. I look forward to hearing from our witnesses about the \ncurrent educational efforts and what additional initiatives may be \nnecessary.\n    Finally, since big data involves different types of data that can \nbe produced and transferred quickly, there are concerns over privacy. \nWe need to ensure that we strike the right balance between exploring \nand implementing all of the potential benefits of big data while also \nprotecting individuals' personal information.\n    I look forward to hearing the witnesses' testimonies and to our \ndiscussion today.\n\n    Chairman Bucshon. Thank you, Mr. Lipinski. The Chair now \nrecognizes the Chairman of the Subcommittee on Technology, Mr. \nMassie, for five minutes for his opening statement.\n    Mr. Massie. Thank you, Chairman.\n    Good morning. Today we are examining an issue that we hear \na lot about. ``Big data'' is a popular new term that can mean a \nlot of different things. The scientific community, though, has \ngenerated and used big data before there was the term ``big \ndata.'' In fact, in 1991 this Committee authored the High \nPerformance Computing Act, which organized the Federal agency \nresearch, development, and training efforts in support of \nadvanced computing.\n    Individual researchers have always been faced with \ndifficult decisions about their data: what to keep, what to \ntoss, what to verify with additional experiments. And as our \ncomputing power has increased, so has the luxury of storing \nmore data. Incorporating computer power to process more \nscientific data is transforming laboratories across the \ncountry.\n    At the same time, the ability to analyze large amounts of \ndata across multiple networked platforms is also transforming \nthe private sector. 
Through big data applications, businesses \nhave not only revealed previously hidden efficiency \nimprovements in their internal operations, but, more \nimportantly, also uncovered entirely new types of businesses \nbuilt around data that was previously not accessible due to its \nsize and complexity.\n    Today's hearing will examine the hype around big data. Is \nthe United States the most innovative Nation in big data? Is \nour regulatory system creating any burdens on businesses? Could \npublic-private partnerships with the Federal agencies be \nimproved to allow for more data innovations?\n    I thank our witnesses for their participation today, \nand I look forward to hearing their testimony. Thank you. I \nyield back.\n    [The prepared statement of Mr. Massie follows:]\n\n            Prepared Statement of Subcommittee on Technology\n                         Chairman Thomas Massie\n\n    Good Morning. Today we are examining an issue that we hear a lot \nabout. ``Big Data'' is a popular new term that can mean a lot of \ndifferent things.\n    The scientific community has generated and used Big Data before \nthere was Big Data. In fact, in 1991 this Committee authored the High \nPerformance Computing Act, which organized the federal agency research, \ndevelopment and training efforts in support of advanced computing.\n    Individual researchers have always been faced with difficult \ndecisions about their data: what to keep, what to toss, what to verify \nwith additional experiments. As our computing power has increased, so \nhas the luxury of storing more data. Today, managing this data allows \nfor better-informed experiments, more exact metrics, and perhaps \nsignificantly longer doctoral theses. 
Incorporating computer power to \nprocess more scientific data is transforming laboratories across the \ncountry.\n    At the same time, the ability to analyze large amounts of data \nacross multiple networked platforms is also transforming the private \nsector. Through Big Data applications, businesses have not only \nrevealed previously hidden efficiency improvements in their internal \noperations, but also uncovered entirely new types of business built \naround data that was previously not accessible due to its size and \ncomplexity.\n    Today's hearing will examine the hype around Big Data. Is the \nUnited States the most innovative nation in Big Data? Is our regulatory \nsystem creating any burdens on businesses? Could public-private \npartnerships with the federal agencies be improved to allow for more \ndata innovations?\n    I thank our witnesses for their participation today and look \nforward to hearing their testimony.\n\n    Chairman Bucshon. Thank you, Mr. Massie. The Chair now \nrecognizes Ms. Wilson for five minutes for her opening \nstatement.\n    Ms. Wilson. First of all, I would like to thank both \nChairman Bucshon and Chairman Massie for holding this joint \nhearing, and thank you all to our witnesses for being here \ntoday. Welcome.\n    This morning's hearing provides us with the opportunity to \ndiscuss one of the newest buzzwords in Washington, and you know \nwe have many buzzwords here. This one: big data. This buzzword \nis not an exaggeration. A computer that used to take up the \nspace of this entire room now fits in the palm of your hand. It \nis remarkable.\n    Just as computers have gotten immensely smaller, they have \nalso gotten immensely more powerful. Instead of talking about \nmegabytes, we are now talking about petabytes and zettabytes--\nquadrillions and sextillions of units of information. It \nboggles the mind. 
Collecting and storing this huge volume of \ndata would have been impossible just a few years ago.\n    I am looking forward to your testimony and learning more \nabout the benefits of big data to society. As I understand it, \nbig data has the potential to improve nearly all sectors of \nsociety. The National Cancer Institute is funding a prototype \nin biological big data that could lead to new advances in \ncancer treatment. Companies and agencies are using big data to \nrun controlled experiments that improve decision-making. \nScientists at Florida International University in my district \nare using big data to advance understanding of topics including \ncybersecurity, social networks and cloud computing.\n    But there are challenges. In order to reap all the benefits \nof complex and broadly available data, we need new technologies \nand software. We also need a workforce, a workforce with the \nskills necessary to analyze data of such great volume and \ncomplexity. A recent study estimates that the United States is \nin need of 190,000 additional data scientists.\n    In thinking about this hearing on big data, I couldn't help \nbut think about the tragic events last week in Boston. The \nmarathon bombings may be one of the most photographed attacks \nin history. The Massachusetts State Police asked the public to \nshare the photos and videos taken on that awful day. Now all of \nthis digital information has been and is being used by the \nBoston Police Department and the FBI in their investigation. It \nappears that this data has been instrumental in helping to \nidentify the individuals who were involved.\n    Examples like this one demonstrate how important it is that \nwe develop and attain the tools and the skills people need to \nanalyze tremendous amounts of complex data. 
Big data can not \nonly lead to amazing scientific discoveries; it can also save \nlives.\n    As we learn more about these opportunities and challenges \ntoday, I hope our witnesses will offer recommendations on how \nthe Federal Government can help create the new tools, software \nand workforce needed to realize the full potential of big data.\n    Chairman Bucshon, Chairman Massie, thank you again for \nholding this hearing, and I yield back the balance of my time.\n    [The prepared statement of Ms. Wilson follows:]\n\n            Prepared Statement of Subcommittee on Technology\n              Ranking Minority Member Frederica S. Wilson\n    I'd like to thank both Chairman Bucshon and Chairman Massie for \nholding this joint hearing. And thank you to all of our witnesses for \nbeing here today.\n    This morning's hearing provides us with the opportunity to discuss \none of the newest buzz-words in Washington and around the world--``big \ndata.''\n    This buzz-word is not an exaggeration: A computer that used to take \nup the space of this entire room now fits in the palm of your hand. It \nis remarkable.\n    Just as computers have gotten immensely smaller, they have also \ngotten immensely more powerful. Instead of talking about megabytes, we \nare now talking about petabytes and zettabytes--quadrillions and \nsextillions of units of information. It boggles the mind. Collecting \nand storing this huge volume of data would have been impossible just a \nfew years ago.\n    I'm looking forward to the testimony of today's witnesses and \nlearning more about the benefits of ``big data'' to society.\n    As I understand it, big data has the potential to improve nearly \nall sectors of society. The National Cancer Institute is funding a \nprototype in biological ``big data'' that could lead to new advances in \ncancer treatment. Companies and agencies are using ``big data'' to run \ncontrolled experiments that improve decision-making. 
Scientists at \nFlorida International University--in my district--are using ``big \ndata'' to advance understanding of topics including cybersecurity, \nsocial networks, and cloud computing.\n    But there are challenges. In order to reap all the benefits of \ncomplex and broadly available data, we need new technologies and \nsoftware. We also need a workforce with the skills necessary to analyze \ndata of such great volume and complexity. A recent study estimates that \nthe United States is in need of 190,000 additional data scientists.\n    In thinking about this hearing on ``big data,'' I couldn't help but \nthink about the tragic events last week in Boston. The marathon \nbombings may be one of the most photographed attacks in history. The \nMassachusetts State Police asked the public to share the photos and \nvideos taken on that awful day. Now, all of this digital information \nhas been and is being used by the Boston Police Department and the FBI \nin their investigation. It appears that this data has been instrumental \nin helping to identify the individuals who were involved.\n    Examples like this one demonstrate how important it is that we \ndevelop and attain the tools and the skilled people needed to analyze \ntremendous amounts of complex data. Big data can not only lead to \namazing scientific discoveries--It can also save lives.\n    As we learn more about these opportunities and challenges today, I \nhope our witnesses will offer recommendations on how the federal \ngovernment can help create the new tools, software, and workforce \nneeded to realize the full potential of ``big data.''\n\n    Chairman Bucshon. Thank you, Ms. Wilson.\n    If there are Members who wish to submit additional opening \nstatements, your statements will be added to the record at this \npoint.\n    It is now time to introduce our panel of witnesses. Our \nfirst witness is Dr. 
David McQueeney, the Vice President of \nTechnical Strategy and Worldwide Operations at IBM Research. In \nthis capacity, he is responsible for setting the direction of \nIBM's overall research strategy across 12 worldwide labs and \nleading the global operations and information systems teams. \nDr. McQueeney's background covers a wide range of disciplines, \nspending about half of his career as a researcher and research \nexecutive and half in IBM's customer-focused areas. He holds an \nM.S. and Ph.D. in solid-state physics from Cornell University \nand an A.B. in physics from Dartmouth College. Welcome.\n    Our second witness is Dr. Michael Rappa, the Executive \nDirector of the Institute for Advanced Analytics and Faculty \nMember of the Department of Computer Science at North Carolina \nState University. Dr. Rappa has 25 years of experience as a \nprofessor working across academic disciplines at the \nintersection of management and computing. He began his teaching \ncareer at the University of Minnesota where he earned his \ndoctorate degree. Welcome.\n    And our final witness is Dr. Farnam Jahanian, the Assistant \nDirector for the Computer and Information Science and \nEngineering Directorate at the National Science Foundation and \na frequent visitor to our Subcommittee. He oversees the CISE's \nmission to uphold the Nation's leadership in computer and \ninformation science and engineering. He also serves as Co-chair \nof the Networking and Information Technology Research and \nDevelopment, or NITRD, Subcommittee of the National Science and \nTechnology Council Committee on Technology, providing overall \ncoordination for the activities of 14 government agencies. Dr. \nJahanian holds a master's degree and a Ph.D. in computer \nscience from the University of Texas at Austin. Welcome again.\n    As our witnesses should know, spoken testimony is limited \nto five minutes each after which Members of the Committee have \nfive minutes each to ask questions. 
Your written testimony will \nbe included in the record of the hearing.\n    I now recognize our first witness, Dr. McQueeney, for five \nminutes for his testimony.\n\n       TESTIMONY OF DR. DAVID MCQUEENEY, VICE PRESIDENT,\n\n          TECHNICAL STRATEGY AND WORLDWIDE OPERATIONS,\n\n                          IBM RESEARCH\n\n    Dr. McQueeney. Good morning, Mr. Chairman, Ranking Members, \nMembers of the Subcommittees. Thank you for the opportunity to \ntestify today. My written testimony covers next-generation \ncomputing, big data and analytics, workforce development and \nthe role of government. In my five minutes, I will focus on \nareas where I can offer critical insights from my personal \nexperience.\n    Computing today is undergoing profound change. We are \nmoving from computing based on processors that are programmed \nto follow a predesigned sequence of instructions to cognitive \ncomputing systems based on massive amounts of data evolving \ninto systems that can learn. This new approach will require new \nstrategies in hardware and in software and improved skills to \nmaintain U.S. leadership. Cognitive systems will digest and \nexploit massive data volumes. Tools such as mobile phones, \nvideos and social networks generate as much data in two days in \n2013 as in all of human history prior to 2003.\n    Advanced analytics can be thought of as tools for infusing \nall this data to make decisions on facts rather than intuition. \nThe challenge is to transform latent data into actionable \ninformation to decide what to do next. For example, the Memphis \nPolice Department is using data analytics to map crime hotspots \nand find patterns. As a result, they have been able to reduce \ncrime by 30 percent with no increase in overall police \nmanpower.\n    To run advanced analytics, it is essential to have the most \npowerful computing systems. 
However, current supercomputing \nsystems are reaching performance levels that will stagnate \nwithout significant innovation. We must move to the next \ngeneration of large-scale computing called exascale computing, \na thousand times faster than today's petascale machines.\n    The United States needs to invest now in the research and \ndevelopment for exascale systems to maintain strategic and \neconomic leadership. Government-funded research on domain \nskills, especially at our national laboratories, should target \nsystems for modeling, simulation, and analytics on big data.\n    Before 2005, the United States had a clear lead in the \nglobal supercomputing race. Today, we are still ahead but the \nrest of the world is catching up rapidly. To stay ahead will \nrequire new skills and knowledge and new types of decision-\nmaking. Nearly two million IT jobs will be created by 2015 in \nthe United States to support big data, and the job candidates \nwith analytic skills will get these jobs.\n    Industry is developing many collaborative skills programs, \nas enumerated in my testimony. I highlight our announcement \ntoday with Rensselaer Polytechnic Institute to offer a graduate \ndegree program in the fall of 2013, the Master of Science in \nBusiness Analytics.\n    Privacy must be considered in the design of big data \nsystems. Big data does not require the sacrifice of personal \nprivacy. When personal information is used, design-in processes \nsuch as IBM's Privacy By Design can protect privacy. When \npeople understand how information is used, they have the \nability to set data usage policies and enjoy benefits of the \nanalysis, they tend not to have privacy concerns.\n    The government's role should focus on research and skills. \nFirst, Federal research investment in high-performance \ncomputing is critical to big data. 
Industry needs university-\nbased exploratory research into numerous areas including system \ndesign, flexible software-defined environments, and IT \ninfrastructure.\n    Second, IBM strongly supports the reauthorization of the \nDepartment of Energy High End Computing Revitalization Act of \n2004 to be offered by Representative Hultgren. This bill will \nimprove high-end computing R&D at the DOE and strengthen \ngovernment-industry partnerships for exascale platforms. IBM \nhas a long history of successful partnerships with DOE. This \npartnership established computational simulation as an \nessential tool for scientific inquiry and led to world \nleadership in the United States in high-performance computing. \nThe challenge ahead is to continue this growth. Past Federal \ninvestments in HPC-related research, particularly at DOE's \nnational labs, have underpinned mission-critical supercomputers \nat DOD, NASA, NOAA, and in the intelligence agencies.\n    Third, the professional science masters program supported \nby NSF is particularly relevant as it provides advanced \ntraining in science or mathematics and develops workplace \nskills valued by employers. Finally, Congress should \nreauthorize the Carl D. Perkins Act and the Federal work-study \nprogram and restructure them to align workforce needs and big \ndata.\n    In conclusion, there exists today a tremendous abundance of \ndata about our world. New cognitive computing capabilities will \nhelp determine which countries and businesses will thrive. The \nUnited States should support advanced computing and build its \nworkforce to seize the future.\n    Thank you, and I welcome your questions.\n    [The prepared statement of Dr. McQueeney follows:]\n\n    [GRAPHIC(S) NOT AVAILABLE IN TIFF FORMAT]\n    \n    Chairman Bucshon. Thank you, Dr. McQueeney.\n    I now recognize Dr. Rappa for five minutes for his \ntestimony.\n\n           TESTIMONY OF DR. 
MICHAEL RAPPA, DIRECTOR,\n\n               INSTITUTE FOR ADVANCED ANALYTICS,\n\n              DISTINGUISHED UNIVERSITY PROFESSOR,\n\n                NORTH CAROLINA STATE UNIVERSITY\n\n    Dr. Rappa. Good morning, Chairman Bucshon, Chairman Massie, \nRanking Member Lipinski, Ranking Member Wilson and other \nMembers of the Subcommittee. I appreciate the opportunity to be \nhere this morning to speak with you about data analytics and \nthe role institutions of higher learning can play in advancing \nthe field.\n    I am going to draw this morning's testimony on my own \nbehalf as a professor and director of a research institute, \neducational institute for over the past 25 years.\n    I think it is important to start with the fact that the \nworld is changing around data very rapidly and our ability to \nproductively use it becomes a very central part of what we do \nas a society today, as has been heard already. A generation \nago, data was scarce, expensive, time consuming to collect and \ndifficult to analyze. Today, data is everywhere.\n    Advances in computer technology and powerful analytic tools \nmake it possible not only to collect vast quantities of data \nbut also analyze and draw insights from data to solve pressing \nproblems from increasing operational efficiency to combating \nfraud, to better health care, to protecting national security. \nData is everywhere. The question is, how well are we prepared \nto use it? We have the data, the technology, the methods and \ntools, all of which continue to advance. The national \nchallenge, in my view, going forward will be our ability to \neducate a data-savvy workforce that has the analytical skills \nto put data into action. Estimates of the talent gap as we have \nheard are large and growing.\n    This is a dire but solvable problem. As we have shown at NC \nState, working closely with employers and focusing on their \nneeds, we can produce the kind of talent that is so desperately \nneeded today. 
We do it quickly in just 10 months with a \ndomestic student population ranging from their early 20s to \ntheir late 50s, many of whom are returning to school. We have \ndone this now for six years economically with consistently high \nstudent outcomes using a sustainable and scalable business \nmodel based on self-financed tuition.\n    What it comes down to is creative innovation, how we \norganize graduate education, allowing us to engage with \nemployers more productively to yield high-quality results in \nthe skills and readiness of our graduates.\n    I encourage the Committee to focus its attention on \nworkforce needs, to encourage the government to seek out \ninnovation in higher education and to promote new and novel \nlearning models. This is a solvable problem. With the proper \nincentives, focused resources, open collaboration with \nindustry, we can produce the analytics professionals needed to \nextract value from big data and to move the economy forward. As \nI said, we have done this ourselves now for six straight years to \ngreat effect. We will graduate a class in a matter of another \nweek, 80 students in the Master of Science in Analytics \nProgram, with already 95 percent of them placed in jobs. They \nare literally the most sought-after and highest-paid graduates \nof the university.\n    So we can do this. It is a solvable problem. Thank you \nagain for your time. I will be glad to answer any questions.\n    [The prepared statement of Dr. Rappa follows:]\n\n    [GRAPHIC(S) NOT AVAILABLE IN TIFF FORMAT]\n    \n    Chairman Bucshon. Thank you for your testimony.\n    I now recognize our final witness, Dr. Jahanian, for five \nminutes for his testimony.\n\n               TESTIMONY OF DR. FARNAM JAHANIAN,\n\n            ASSISTANT DIRECTOR FOR THE COMPUTER AND\n\n           INFORMATION SCIENCE AND ENGINEERING (CISE)\n\n            DIRECTORATE, NATIONAL SCIENCE FOUNDATION\n\n    Dr. Jahanian. 
Good morning, Chairman Massie, Chairman \nBucshon, Ranking Members Wilson and Lipinski, and Members of \nthe Subcommittee. It is my pleasure to be back here to discuss \nthe next generation of computing and big data analytics.\n    Today we live in an era of data and information enabled by \nadvanced technologies that surround us. Data is generated by \nmodern experimental methods, scientific instruments such as \ntelescopes and particle accelerators, large-scale simulators, \nInternet transactions, email, video images, clickstreams, and \nwidespread deployment of sensors everywhere. Approximately 90 \npercent of the data in the world today were created in the last \ntwo years alone. However, when we talk about big data, it is \nimportant to emphasize not only the enormous volume of data \nbeing generated but also the velocity, heterogeneity and \ncomplexity of data that now confronts us.\n    Why is big data important? Several others have alluded to \nthis already. Data represents a transformative new currency. \nBig data is increasingly important to all facets of our \nNation's discovery and innovation ecosystem. First, insights \nand more accurate predictions from large and complex \ncollections of data are creating opportunities in new markets, \ndriving the creation of IT products and services and boosting \nthe productivity of businesses. Second, advances in our ability \nto store, integrate, and extract meaning and information from \ndata are accelerating the pace of discovery in almost every \nscience and engineering discipline. Third, big data has the \npotential to solve many of the Nation's most pressing \nchallenges from health care and education to cybersecurity and \npublic safety, yielding enormous societal benefits and ensuring \nsustained U.S. competitiveness.\n    Let me share with you just a few examples of the promise of \nbig data. 
These are all grounded in research that is funded by \nthe Federal Government or by the private sector, the work that \nis done in the private sector. By integrating biomedical, \nclinical and scientific data, we can predict the onset of \ndiseases and identify unwanted drug interactions. By coupling \nroadway sensors, traffic cameras, individual GPS devices, we \ncan reduce traffic congestion and generate significant savings \nin time and fuel. By accurately predicting natural disasters \nsuch as hurricanes and tornadoes, we can employ lifesaving and \npreventative measures that mitigate their potential impact. By \ncorrelating disparate data streams through text mining, image \nanalysis and face recognition, we can enhance public safety and \npublic security. By integrating emerging technologies such as \nMOOCs and inverted classrooms with knowledge from research \nabout how people learn, we can transform formal and informal \neducation.\n    What does this mean for scientific discovery? Data-driven \ndiscovery, also called the fourth paradigm, is revolutionizing \nscientific exploration and engineering innovations. It enables \nextraction of new knowledge, provides novel approaches to \ndriving discovery and decision-making, yields increasingly \naccurate predictions and provides deeper understanding of \ncausal relationship based on advanced data analysis.\n    What is government doing to ensure we harness this \npotential? As it was mentioned already, in 2011 U.S. Networking \nand Information Technology Research and Development Program, \nalso called NITRD, formed a big data senior steering group to \nidentify, initiate and coordinate big data research and \ndevelopment activities across the government to ensure that \nFederal agencies, the scientific research enterprise, and \npublic maximally benefit from data-driven discovery. 
In March \n2012, the National Big Data R&D Initiative was launched, \nfocusing the steering committee group's focus on the tools, \ntechnologies and human capital needed to move from data to \nknowledge to action. We see exciting new partnership \nopportunities with the private sector, state and local \ngovernments, academia and nonprofits.\n    At NSF, we have identified four major investment areas that \naddress current challenges and promise to serve as the \nfoundation of comprehensive long-term agenda: first, investment \nin foundational research to advance big data techniques and \ntechnologies; second, support for building new \ninterdisciplinary research communities; third, investment in \neducation and workforce development; and finally, development \nand deployment of cyber infrastructure to capture, manage, and \nanalyze and share digital data.\n    I should add that NSF's investment in cyber infrastructure \nincludes advanced computational resources that support data-\nenabled science. In particular, the newly dedicated Blue \nWaters, Stampede and Yellowstone supercomputers will expand our \nNation's computational capabilities significantly.\n    In summary, big data represents enormous opportunities for \nour Nation. Investments in big data research and education will \nadvance the frontier of knowledge, further fostering \ninnovation, creating new economic opportunities, and yielding \nnew approaches to addressing national priorities.\n    Thank you again for this opportunity. I would be happy to \nanswer any questions.\n    [The prepared statement of Dr. Jahanian follows:]\n\n    [GRAPHIC(S) NOT AVAILABLE IN TIFF FORMAT]\n    \n    Chairman Bucshon. Thank you for your testimony. I would \nlike to thank all the witnesses for their testimony. I am \nreminding Members that Committee rules limit questioning to \nfive minutes, and the Chair at this point will recognize \nhimself for five minutes to start the questions.\n    First, Dr. 
Jahanian, the Administration announced their Big \nData Research and Development Initiative in March 2012 \nincluding $200 million in new commitments for big data research \ninitiatives. However, the National Science Foundation, \nDepartment of Defense, Department of Energy, and other agencies \nhave had significant research programs and data analytics that \npredated the initiative. How has the Administration's \ninitiative changed the ways these agency research programs are \ncoordinated and are we effectively managing and leveraging our \nresearch investments across agencies?\n    Dr. Jahanian. Thank you for your question. You are \nabsolutely right that it is not that suddenly last March we \nwoke up and said boy, data is really important, we need to do \nsomething about it. There has been significant investment by \nthe Federal sector and private sector in areas having to do \nwith data. The challenges we face are many--stewardship of \ndigital data and software, for example. Many data sets, as was \nmentioned, are too poorly organized or also unstructured. Many \ndata sets are heterogeneous. The utility of data is also \nlimited by our ability to interpret them. Many data are being \ncollected at a scale that we can't even store them, let alone \nanalyze them. Also, large and linked data sets may be exploited \nto identify individuals and so there are also the privacy \nissues. So there are enormous challenges that we face.\n    As you alluded to, on March 29, 2012, OSTP in concert with \na number of Federal agencies launched the national Big Data \nResearch Initiative. 
It expands the scope of our activities in \nseveral directions, for example, state-of-the-art core \ntechnologies that we need to collect, store, preserve, manage \nand analyze data, harnessing these technologies to accelerate \npace of discovery, supporting responsible stewardship, for \nexample, and sustainable business models for big data.\n    There are a number of cross-coordination efforts taking \nplace under NITRD. Let me start with NSF. All NSF directorates, \nfor example, are participating in this. A multidisciplinary \npanel of experts are making a recommendation on funding of \nthis. Furthermore, big data is being coordinated through a \nsenior steering group that reports to the assistant directors \nat NSF for all the coordination because it involves every \nscience and engineering discipline.\n    As far as the Federal Government is concerned, the Big Data \nR&D Initiative is coordinated through the NITRD Subcommittee. \nAs you know, I Chair the Subcommittee. There is a senior \nsteering group that regularly meets to coordinate the \nactivities on many of the fronts that I alluded to. There are \nalso enormous opportunities not only in terms of joint \nsolicitations but there are a number of workshops that we are \nholding jointly with other agencies including NIH, NIST, DOE, \nDOD to advance the frontiers of knowledge and exploration in \nbig data.\n    I should also mention that when it comes to this \ninitiative, we can't forget that the private sector plays a \nsignificant role. When we think about innovation and discovery \necosystems, not only are we talking about universities, we are \ntalking about scientists and engineers, you know, a rich, \ntalented labor force, investments in research and education, \nand of course, a vibrant private sector. So there are a number \nof programs that we have at NSF that attempt to connect the \ndots when it comes to transfer of knowledge.\n    Chairman Bucshon. Thank you. 
I am glad to hear there is \nquite a bit of coordination at the Federal level because I \nthink all of us are concerned about that, and again, investing \nthe taxpayer dollar wisely.\n    Dr. Rappa, I also serve on the Education and Workforce \nCommittee, and I have got children age 9 through 20, four of \nthem, and I have a really strong interest in how we get young \npeople interested in different fields of study, and obviously \nwe have a tremendous challenge not only with this area but many \nothers, and do you think that--what are your ideas on how we \nengage young people in understanding what opportunities there \nare in this area and what the jobs of the future might hold? I \nmean, how do we do that? Because, you know, when you go to a \nhigh-school class, and I talk to a lot of high-school class, \npeople say, you know, not many people come up when you ask them \nwhat they want to be, you know, they want to analyze big data. \nSo how do you do that? What is your recommendation?\n    Dr. Rappa. Well, thank you very much for your question, and \nI understand exactly what you are saying, and I think that \nthings are changing. You know, I think it is exactly true that \nyour average 8-year-old doesn't say they want to grow up, for \nexample, to be a statistician. It is not common, unless they \nare really interested in sports. Then you see a sort of nexus \nthere because of the relationship. But I think what is changing \nis that it is really about producing education, in my case, at \nthe graduate level, reaching further into the pipeline down \ninto undergraduate education and even touching upon high school \nwhere people begin--where students begin to understand how data \nis really used in action. So it is really about creating, not \njust sort of creating knowledge or understanding but also \napplying that knowledge. 
And when our students--our whole \neducation is driven around the application of that knowledge, \nand so students really understand, and increasingly \nundergraduates understand that this kind of graduate education \nis going to lead them to a very interesting, compelling \nprofessional life.\n    Chairman Bucshon. Well, thank you, because I think that we \ndo--you know, we do need to have this type of information \ngravitate down, even to middle-school kids to get them \ninterested, and there is a program in Indianapolis called \nProject Lead the Way who I know very well that is beginning to \ndo that at the high-school level, and it is showing some \nsuccess.\n    But my time is expired, so I would love to talk more about \nthat but at this point I am going to yield to Ms. Wilson for \nfive minutes for her questions.\n    Ms. Wilson. Thank you, Mr. Chair.\n    Along those lines, can you tell me either one of you what \nskills are necessary for the big data workforce? I heard you \nsay something about an analytical something. And also as you \nare speaking, I would like to hear from you what role can \ncommunity colleges play in preparing the next-generation \nworkforce for big data.\n    Dr. Rappa. Thank you very much for your question. I would \nlike to try my hand at that. So what is sort of interesting and \nnovel about what we have done around the education, we really \nstarted from scratch in building an entire new graduate degree \nprogram, and we really wanted to address this question of what \nskills were needed, and we focused ourselves really looking at \nthe employer as the customer in a sense, the person, the \nindividuals who buy our product and the students and really \ntried to understand the skills that they need, and really where \nthat brings you is that there is these technical skills which \nare important in programming, in math and statistics, but \nemployers really want much more than that. 
They want \nindividuals who can work well in teams, who can communicate \nthese insights to decision makers, who can actually use the \ntools and apply the knowledge in an organizational context, and \nso we have structured the whole education to build a very \nbalanced set of skills as opposed to what I think is really the \nconventional approach in graduate education and to some extent \nundergraduate education to focus on the technical skills almost \nexclusively. And so really what we need to do is sort of \napproach the whole student. Now, I think community colleges can \nplay a very important role because you can really begin to \nchannel pipelines where students can go and get the \nprerequisite knowledge that they need, the early levels of math \nand statistics, before they go on to graduate education. Thank \nyou.\n    Dr. McQueeney. I would just like to comment that a lot of \nthe focus in the past has been on the graduate level of \neducation, as Dr. Rappa just talked about, and while we \ncontinue to have a strong need for Ph.D.'s in computer science \nand electrical engineering and mathematics, the biggest skill gap \nthat we see is at the masters level, quite frankly, of people \nwho may not have the mathematical skills to create an entire \nnew type of analysis of data but who have more than basic IT \nskills who actually can understand the implications of using \ndifferent analytical techniques given a problem, given a data \nset with certain statistical properties, what would be the \nappropriate analytical technique to use, and when you apply \nthat technique, how could you be sure that the results would be \nreliable and proper, and so a lot of our focus has been on \ncreating an intermediate level of skill that has the basic \nunderstanding of how to use these tools even if it would fall \non someone with more of a Ph.D. level of training to create new \nanalytical approaches.\n    Dr. Jahanian. 
Representative Wilson, I want to echo \nsomething that has been said. If you think about big data, let \nus just step back. There are three related problems that go \nbeyond big data. It includes all of our IT workforce, computer \nscience, computational science and so on. These problems have \nto do with underproduction, which everybody recognizes, \nunderrepresentation and then pipeline issues. Chairman Bucshon \nalready alluded to this, that we need to worry about our high \nschools, we need to worry about the pipeline. I have three \nkids, and I know where we lose our kids, it is not in masters \nor Ph.D., we lose the interest of our kids in high schools and \nmiddle schools, so that has to be fixed, and there are a number \nof programs that we have initiated, pilot programs that try to \naddress that issue.\n    Let me share with you one anecdotal sort of evidence that \nprovides data on this. Annualized Bureau of Labor Statistics \ndata predicts that each year we need about 140,000 job \nopenings. We will have 140,000 job openings in computing and \nbroadly speaking IT-related jobs but we are only producing \nabout 100,000 qualified individuals including masters, Ph.D., \nundergraduate and community colleges. In fact, many of these \njobs would be available to individuals who have two year or \nfour year degrees.\n    Another data point that I want to share with you is that 62 \npercent of all newly created STEM job openings between 2010 and \n2020 will be in computing and IT. Let us not forget that. And \nthat includes data, that includes computational skills and many \nof the other skills that the other witnesses alluded to. Thank \nyou.\n    Ms. Wilson. Just in my 16--oh, 10, 9, 8--what would you \nsuggest that we begin to--how do we begin to get children \ninterested in these sort of skills? I know every little child \nhas an iPad. They can work these computers better than adults. 
\nWhat do you think we can do to stimulate that all the way from \nK-12 and into the community colleges so we will have more IT \ngraduates? Do you suggest we buy each one--we outfit classrooms \nwith iPads, or what do you think?\n    Dr. McQueeney. I think there is an intrinsic curiosity in \nyounger folks about a lot of the tools they use to communicate \nwith each other that have tremendously greater scalability than \nthe tools that I use to communicate with my friends.\n    Ms. Wilson. Right.\n    Dr. McQueeney. So the essence of what is a large \ncommunity's opinion on a topic of interest could involve the \nopinions of thousands or millions of people and so I think a \nlot of the young folks I talk to when I visit K-12 programs or, \nyou know, in programs like eWeek, they have an intrinsic sense \nnot only of the device and the technology but they have a sense \nof the reach of that device and technology which is the \nbeginning of an appreciation of really what we are talking \nabout with big data, that there are trends that they can reach \nwith that device, and I think that fires their imagination in a \nvery powerful way.\n    Chairman Bucshon. Thank you. I will now recognize Mr. \nMassie, Chairman Massie, for his questioning.\n    Mr. Massie. Thank you, Chairman.\n    So one of the questions that I have as we deal with the \ninterface between government and private industry here is, are \nyou aware of any government data sets that we need to get more \ninto the public domain for usage? For instance, I think we have \ndone a pretty good job about getting some of the mapping stuff \nout there but some of that map information is old, goes back to \nthe 1940s and 1950s, and I know the government has been paying \nfor LIDAR mapping, which is a high-resolution terrain mapping, \nand I am kind of concerned that that is not getting out there. 
\nAre you aware of that, and are there any other data sets that \nwould be useful to the public that the public has paid for that \nwe might want to work on getting out to the public?\n    Dr. McQueeney. I think the government has done an excellent \njob and had many initiatives that were very focused on getting \nthat valuable data out so it could be used. You mentioned \nLIDAR. I know that one of the uses that is very promising for \nLIDAR is to do something like an inventory of the forests in \nthe country, to actually be able to conduct a definitive \ninventory. Right now, the agencies that are responsible for \nthat use a statistical sampling technique but in a world where \nyou can take LIDAR images and process that enormous data \nvolume, you are able to move then from a statistical sampling \nbasis, which is all we could do before, to a more definitive \napproach to get a very, very good picture of one of the more \nvaluable natural resources that needs tremendous amounts of \nstewardship. So I think that is an example of a data set that \ncould be extremely valuable. But I think in general, the \ngovernment is very well and properly focused on getting those \nvaluable data sources out. Weather would be another--basic \nweather data would be another good example that can be built on \nto add extra value.\n    Mr. Massie. Are the other witnesses aware of any data sets \nthat we need to promote more?\n    Dr. Jahanian. I want to highlight a couple of things. I am \nsure you are aware of data.gov, which is a Web site that makes \na lot of government data sets available, and the goal here is \nto increase public access to high-value machine readable data \nsets that are generated by the government. Hopefully it will \ncreate new economic values. There are also a number of \nactivities in encouraging the private sector, entrepreneurs to \ndevelop applications on top of that data. 
It is not just making \nthe data available but also making the data valuable so there \nare a number of essential activities related to that.\n    There was a recent Wall Street Journal article actually \nthat highlighted at least a dozen different kinds of government \ndata sets that have been made available from labor and health \nviolations to flu incidents, energy prices, offshore activities, \nsolar information, and so on and so on that are interesting. \nFrom the National Science Foundation's point of view, I should \nmention that as you may know, we have a number of large \nfacilities--LSST was mentioned, NEON, which is another facility \nthat collects a lot of data, will be collecting a lot of data. \nThe science and engineering community needs that data, and many \nFederal agencies are working very hard to make that data \navailable. There are a number of issues having to do with open \naccess that go beyond the scope of this question.\n    Mr. Massie. Let me ask a follow-up question to that. So big \ndata like any other data could be misused, altered, hacked, \nillegally accessed, and sometimes it may just be an honest \nmistake. We share data that we probably shouldn't have, for \ninstance, where some farm data that got out there and it could \nreally compromise our food safety if people know where all the \nfood sources are. How do we balance the desire for privacy, \nactually the constitutional right to privacy, with sharing all \nof this data now that everybody is under a microscope?\n    Dr. Rappa. 
I thank you for your question, and I would like \nto sort of just turn it a little bit because we do work--each \nyear we work with about 16, 17 organizations that share data \nunder a confidentiality agreement including three government \nagencies in which we put teams of students working on very \ncomplex analytics projects, and so while I applaud, and I think \nit is very important and I do think the government is doing a \ngood job at sharing data openly, it is a very important thing \nto do, I think there is also an opportunity to engage the \nacademic community in other ways to help understand that data, \nwhich might mitigate some of these issues around the privacy \nelement.\n    Mr. Massie. Dr. McQueeney?\n    Dr. McQueeney. Yes, that is an excellent question. Thank \nyou for that. One of the things that we can do is to get data \nabout the data. We call it metadata. So we analyze the data and \nwe don't just look at what information we can get from the data \nbut we describe the data perhaps in terms of its sensitivity--\nis this more or less sensitive from a point of view of privacy \nor security or secrecy--and we can then tag those data sets \nwith metadata that describes the implications of using that \ndata and then we can build into the systems that handle the \ndata policies that look not only at the data but the metadata \nthat describes what are the contents and what are the \nimplications of sharing and combining that data and so we can \nactually build into the foundation of big data systems the \nability to interpret policies that we have set in a very \nconscious and clear-eyed way and as they process the data they \ncan be respectful of that metadata. The medical community has \nactually done a lot of very good work around patient \nconfidentiality while still getting very good pattern analysis \nof different kinds of outcomes.\n    Mr. Massie. Thank you very much. My time expired. 
I \nappreciate your answer and concern for that question, Mr. \nChairman.\n    Mr. Bucshon. Thank you, Mr. Massie. I now recognize Dr. \nBera for five minutes for his questions.\n    Mr. Bera. Thank you, Mr. Chairman, and thank you for the \nseries of hearings that we have had on the Subcommittee. It has \nbeen great.\n    You know, big data is incredibly important and how we \nmanage data and with the rapidity of how the world is changing. \nI mean, when I think back to being a high-school student, and \nfor me it was going and looking at the index cards, walking \ndown and looking in the encyclopedia. Now, when my daughter, \nyou know, she has vast access, or when I do rounds in the \nhospital, we would have to race down to the library to get \ninformation but now before we are even finished presenting, the \nmedical students or the residents can just look at the latest \ndata on, you know, a device like this and get access to the \nmost accurate and timely information. So it is incredibly \nimportant that we make these investments to not only manage the \ndata, to sort that data and then to make sure it is accessible. \nIt is a critical priority that we have that workforce both at \nthe professional level but then also at the management level \nand I think the number that I read was we need about 1.5 \nmillion managers. So there is a huge need but also a huge \nopportunity.\n    When I think back to the talent that has been impacted in \nthe last four years in the recession, you know, there are a \nlarge number of extremely intelligent and talented individuals \nin their 30s and 40s who have been hit hard. These are folks \nlike myself that were trained for a 20th-century workforce but \nnow we find ourselves in a 21st-century economy.\n    Dr. Rappa, are there some best practices--and these aren't \nindividuals that need to get a graduate degree, you know, they \nare talented individuals--where we could take them and quickly \ntrain them for this new economy? 
Are there examples?\n    Dr. Rappa. Right. So we do offer it as a graduate degree \nbut we do this in 10 months, and indeed, a good, fairly \nsubstantial, larger portion of our population are people who \nare returning from--or coming from the workforce to go through \nthis and some of them are in exactly the position that you say. \nThey were transitioning, their companies were faltering. And so \nthe key really with this is short duration. Ten months is \nactually a very reasonably good time because you could build \nthe skills that you need. If it is too short, you can't \naccumulate the skills but the key thing is that you have really \ndemonstrated ROI on that education because that person who is \ncoming in to do that has to know that they have a very high \nprobability of getting a job when they leave and at a \nparticular salary rate so that they can justify the investment \nand time, and that is really what we have done.\n    Mr. Bera. Dr. McQueeney, are there potentially any \nexamples--you know, again, a lot of these folks are also paying \ntheir mortgage, they have to continue to foot their bills--of \npossibly even doing an advanced work-study type of program \nwhere you recruit this talent and they are getting on-the-job \ntraining as opposed to a traditional school model?\n    Dr. McQueeney. Yes. In fact, there is a related topic here \nthat I think is quite interesting, which is the application of \nbig data and analytics back on to the educational process \nitself. You have seen the great upsurge in videos that attempt \nto replace traditional brick-and-mortar classroom attendance, \ncoursework. You have seen a number of startup companies formed \nin this space. If you look at the education process, each of us \nreally learns quite differently. 
Some of us may learn more from \nhearing or from seeing or from working problems, and great \nteachers, great professors are sensitive to how their different \nstudents learn and are capable of presenting material in \nalternate ways to make sure they reach all the students. With \nelectronic delivery of course materials and monitoring of \nstudent progress, we generate digital exhaust, if you will, \nthat describes how that student is learning, how that student \nresponds to the instruction, and for the parts of the \ninstruction that are delivered electronically, we actually have \nthe ability to do analytics and to do an optimization process \nso that each of us on the panel might not get the same length \nof lecture on five different topics. It might be adjusted to \nour historical learning patterns.\n    So we have worked with a number of universities and other, \nyou know, non-traditional educational institutions to apply the \nbig data and analytics techniques to the education and training \nprocess itself.\n    Mr. Bera. Great. In my last 30 seconds, so we have access \nto data. I think one element that we should also be conscious \nof is the quality of the data because there certainly is very \ngood-quality data but at the same time there is very poor-\nquality data that is out there and, you know, any of you who \nwant to comment on how we monitor quality?\n    Dr. Rappa. I think most data starts off as bad data, for \nthe most part, unless it is being collected in a highly careful \nway. And so it is, you know--I think just as we hear about big \ndata today, we are going to hear about bad data in the future. \nMost projects start out where you have enormous front end to \nthem to really understanding cleaning and cultivating that data \nto make it useful, and that is an important part of the \neducational process.\n    Dr. Jahanian. 
I would just add that there are a number of \ntechniques that have been developed and are in development \ndealing with data exploration, data cleaning and so on. \nFurthermore, when we talk about large-scale data sets, there \nare statistical techniques that are being applied that really \ntake care of the noise, take care of some of these \ninconsistencies, and that is one of the attractions of big \ndata.\n    Mr. Bera. Great. Thank you.\n    Chairman Massie. [Presiding] Thank you, Mr. Bera. I now \nrecognize Mr. Schweikert from Arizona for five minutes.\n    Mr. Schweikert. Thank you, Mr. Chairman.\n    This is one of those types of conversations, you know, we \ncould all sit around and buy you some well-caffeinated coffee \nand talk for hours and still have no idea if we made any \nprogress.\n    Doctor, is it McQueeney?\n    Dr. McQueeney. Yes.\n    Mr. Schweikert. First, you are with IBM?\n    Dr. McQueeney. Yes.\n    Mr. Schweikert. In your testimony, help me do a little \nferreting out here. Hardware technology or IT talent, what is \nyour biggest bottleneck right now?\n    Dr. McQueeney. There are bottlenecks in a number of areas. \nIf I looked at the hardware itself, the biggest challenge \ngetting from the petascale to the exascale is actually the \npower dissipation of the systems. The new technology work that \nwe are doing is to get the computations more efficient in terms \nof floating point operations per watt so that if you assembled \na system a thousand times bigger than today's supercomputers you \ncould house it and cool it.\n    Mr. Schweikert. You don't want to take down the power grid?\n    Dr. McQueeney. The power grid may not in fact be able to \nsupply enough power if we didn't make some innovations. That is \na good point.\n    Mr. Schweikert. But hasn't your company actually been one \nof the leaders at producing some of those breakthroughs?\n    Dr. McQueeney. 
In fact, we have, and in fact, a lot of that \nhistory goes back to work that started with the Department of \nEnergy many years ago, and this bears on an interesting \nhistorical point. In a time when we are concerned about making \ninvestments efficiently, if I go back to the beginning of the \nASCI program with the Department of Energy to do the nuclear \nweapons stockpile stewardship program, the Department of Energy \nscientists did a very careful analysis of what were the core \nalgorithms, the core analytics, if you will, in today's \nlanguage, that needed to be done at a certain level to provide \nthe mission that they needed to provide, and they found that \nthe current path at that time of supercomputing was going to \ntake five years to produce a machine that they needed in one or \ntwo years. The analysis they did was thorough enough to reveal \nthat there weren't bottlenecks everywhere but at that time \nthere were bottlenecks mostly in the interprocessor \ncommunication. So they made a very thoughtful, very surgical \ninvestment in accelerating just the piece that was needed to \nclose their mission gap, which was the beginning of a very long \nrun of government-industry collaboration.\n    Mr. Schweikert. But you are in some ways heading towards \nwhere my question is. So if that bottleneck, in today's world, \ndo I find the technology if I went out to the private sector \naround the world that is competing and producing high-end \nsupercomputing or is it coming out of a government lab? And I \nknow the pop culture terminology is ``public-private \npartnership'' but the reality, they do operate in pretty \nsubstantially different silos.\n    Dr. McQueeney. The real forcing function for a breakthrough \nis a critical mission need. So in the case of high-performance \ncomputing, it has often been a government agency with a \ncritical mission that----\n    Mr. Schweikert. 
But they were doing a specific request for \nhow they wanted to manage their data?\n    Dr. McQueeney. That is correct, and once that technology is \navailable, it can be consumed very rapidly in lots of other \napplications that could take great advantage of it but didn't \nhave a compelling enough need to get over that hurdle. That is \nwhen the dispersal of technology starts.\n    Mr. Schweikert. Just as an aside, only because I had some \nacquaintances who were--I used to be an old SQL programmer so I \nam way out of date now. IBM was actually running a fascinating \nlarge data project where they were doing sweeping data sets \nthrough the world's social media and gathering it and looking \nfor trends. Can you in 30 seconds or so tell me your knowledge \non that?\n    Dr. McQueeney. Yeah, we have analyzed the public social \nmedia sources with several of our customers and we can gain a \nlot of insights. For example, you know, retailers can gain \ninsights about trends and their clients. Transportation \nagencies can gain insights about likely traffic congestion. \nThere are many sources of public data, both social media and \nother forms that can be analyzed to reveal patterns about how \npeople conduct their daily activities that are very useful for \noptimizing the public infrastructure.\n    Mr. Schweikert. Forgive me, I am blind as a bat without \nthese. Is it Dr. Rappa?\n    Dr. Rappa. Yes.\n    Mr. Schweikert. Isn't my single biggest problem in big data \nright now is noise that when I put data set after data set \nafter data set and build on it, that just small incremental \nerrors actually create really bad decisions on the end?\n    Dr. Rappa. Well, I think part of the education around \nhandling big data deals very squarely with the quality of the \ndata and how to clean it and cultivate it to reduce the noise, \nto----\n    Mr. Schweikert. 
But you and I can go over a long series of \npublic policies, both state, national, you know, military, \nothers, where we built it on really gigantic analyzed data sets \nand it was wrong.\n    Dr. Rappa. Well, I think that, you know, the challenge here \nis education. So as I alluded to earlier, we have teams of \nstudents----\n    Mr. Schweikert. Is it education or developing educational \nskepticism?\n    Dr. Rappa. It is developing the education around how to \nsquarely understand the inherent challenges in data. Data is \nnot born clean. It isn't born ready to be analyzed.\n    Mr. Schweikert. And when you and I build our model, the way \nwe weight, you know, because we start plugging in human factors \nthat, you know, you and I bring our biases and we----\n    Dr. Rappa. And this is why we really need a focused \neducation squarely around how do you draw insights from data \nbecause there are these inherent problems in data, especially \nas you scale them up, as you combine different data sets, as \nyou combine different types of data.\n    Mr. Schweikert. Thank you, Doctor, and Mr. Chairman, thank \nyou for tolerating. It is just one of my great fears. And look, \nI am a data freak. I mean, you have got to see the servers and \nstuff I have at home. But I have learned when we make big-time \npublic policy on something we all know is right, we keep making \nhuge, very costly mistakes.\n    Chairman Massie. Thank you, Mr. Schweikert. I now recognize \nMr. Hultgren from Illinois for five minutes.\n    Mr. Hultgren. Thank you, Mr. Chairman. Thank you all for \nbeing here. First of all, I just want to thank Dr. McQueeney \ntoo. I appreciate your mention and your support for the \nexascale computing bill I am currently authoring. 
I am very \nexcited about the potential there and see some huge shift in \nour national computing capabilities and I am very excited about \nthat, so I appreciate your mention and support of that.\n    I do have a few questions, and first I guess I would \naddress this one to Dr. McQueeney and also Dr. Jahanian. Is \nthat right? I am sorry. I wonder if you could comment briefly \non where the United States stands in your opinion in worldwide \ncomputing leadership? I know the metric of the fastest \nsupercomputer is one metric but what do you use as a metric for \nbig data to determine which countries are using it most \neffectively?\n    Dr. McQueeney. The common thing that is cited in these \ndiscussions is the top 500 supercomputers list. That is \nsomething that is compiled twice a year, as you well know, and \nwe have usually been at the top of that list. We have continued \nto be the majority of the systems on that list but other \ncountries have noticed the success that we had in, you know, \ngovernment leading the way on high-performance computing \nbreakthroughs. Once those systems are built, they find hundreds \nand thousands of other applications, each with a client that \nmight not have been able to fund that breakthrough themselves \nbut can certainly utilize it. Other countries have popped up on \nthe top of that list because they are interested in emulating \nthe success we have had in leading the way with innovation and \nthen seeing that innovation used broadly across the commercial \nsector. 
So the top 500 list is a very technical, perhaps very \ngeeky measure of who is on top, and I would say that we are \nstill in a leadership position there but it has been stronger \nin the past than it is today.\n    If you turn to more of a business view, you would want to \nlook at the companies that were taking the best advantage of \ndata sources, either to drive value in their companies or to \nprovide benefits such as public safety or health benefits, and \nthere again I think we are in a good position but it is a very \ndifferent kind of skill, a conversation we didn't quite finish \nbefore about the skill to build these large systems is a very \nfocused, very large-scale, very capital-intensive activity but \nthe skills to use these systems are more focused on creativity \nand are actually better done by large groups of small teams. In \nfact, you know, the NSF has been a leader in fostering that \nkind of innovation where thousands and thousands of groups can \nbuild innovative applications and take advantage of these \nsystems.\n    Mr. Hultgren. Thanks. Dr. Jahanian?\n    Dr. Jahanian. Yes, just a couple of quick comments. There \nis no question that we continue to maintain our leadership \nworldwide in this area, and there is no doubt that continued \ninvestment in this area is extremely important to the future of \nthe country. As I mentioned just a few minutes ago, NSF's \ninvestment in Blue Waters, Stampede, as well as the Yellowstone \nsupercomputing centers represent a range of investments that we \nmake in high-performance computing, addressing the needs of not \nonly the top five percent of applications that have \nexceptionally high computational needs but also a broad \nspectrum of researchers across the country in science and \nengineering who would need computational resources.\n    A couple of comments. Just look at Blue Waters, for \nexample, which is at the University of Illinois. A couple of data \npoints about it. 
It has--if you could--just the computing power \nof it, if you could multiply two numbers together every second, \nit would take 32 million years to do what Blue Waters does in \none second. That is astonishing power, for example, of Blue \nWaters. In terms of storage capacity, memory capacity and so \non, there is a similar kind of scale.\n    The second point that I want to make is, we view \ncomputation and data to be two sides of the same coin. You \nreally need to address both. So when we talk about \ncomputational capabilities, we also have to worry about cyber \ninfrastructure to manage, to curate, to serve data to science \nand engineering community, and the investment in cyber \ninfrastructure has to be balanced between the computation side \nof it as well as management and curation of data.\n    Mr. Hultgren. Let me have--my time is running out but I \nhave a follow-up question to the two of you as well if you \ncould both comment in the time I have. It seems to me that \nexascale computing is focused on solving discrete problems that \nnecessitate massive computing power and speed. Are these \ndifferent problems than those we are addressing through big \ndata analytical tools and how do these two terms, how are they \ndifferent, how are they similar?\n    Dr. McQueeney. Historically, we have tended to talk about \nthem differently, but as we project how the exascale systems \nwill be built and how they will be used and we look at the \ngrowing importance of big data analytic systems, we see that \nthe platforms on which these systems will both depend will be \nmuch more common than separate, and in fact, we see that there \nis no conflict between investments in classically what we have \ncalled HPC and what we are now calling big data analytics, and \nboth are changing actually. The way we use an exascale system \nwill not be the same way that we use a petascale system. 
There \nisn't time here to go into it, but it actually morphs into a \ndirection that is much more common with what we will do in big \ndata and analytics.\n    Dr. Jahanian. I would just add that many of the problems \nthat the business community needs, the science and engineering \ncommunity needs are being addressed today through different \nkinds of computational architectures that don't necessarily \nrequire today to have exascale computing including weather \nmodeling, a number of other applications that have been \nmentioned. So it is really important to consider the investment \nin exascale computing in the spectrum of investment that we \nmake to support computational and data needs of the entire \nscience and engineering community and of course the private \nsector.\n    Mr. Hultgren. Thank you so much. Chairman, thank you. I \nyield back.\n    Chairman Massie. I now recognize Mr. Lipinski from Illinois \nfor five minutes.\n    Mr. Lipinski. Thank you, Mr. Chairman. I am glad that Dr. \nJahanian mentioned Blue Waters there. We were just there not \nthat long ago, but since you covered that, I can move on to a \ndifferent area.\n    Dr. McQueeney, in your testimony you talk about how the \nFederal Government needs to invest in big data if the U.S. is \ngoing to maintain its leadership and competitive edge in this \narea. The needs and potential benefits of big data for the \nFederal Government align closely with those of private industry \nin a number of areas. If that is the case, how can the Federal \nGovernment more effectively partner with industry to achieve \ncommon goals and do you believe that industry has sufficient \ninput in the Federal Government's research agenda as it relates \nto big data?\n    Dr. McQueeney. I do think we have sufficient input. I think \nwe have excellent dialogs with the relevant agencies and \nnational laboratories, and I think the roles are complementary. 
\nI go back to the story about the early days of the ASCI \nprogram where through a collaboration we realized that the key \npiece of a supercomputing system that needed to be accelerated \nwas not the entire investment. We could ride on the commercial \ninvestments for most of the components of the supercomputing \nsystems at that time except for one, which was the high-\nbandwidth switching between processors. And so that kind of \nthoughtful connection between the leaders in commercial \ncomputing and the leaders on the government side has been able \nhistorically to identify which areas are critical to attain \ngovernment mission imperatives and where we can leverage \ncommercial technology and where we need to accelerate that in a \nsurgical fashion. So it has, in our view, been a very good \npartnership based on very high-bandwidth technical \ncommunications, understanding of applications and knowing when \nthe government should be leveraging commercial investments and \nwhen they need to accelerate parts of that investment to attain \nunique mission goals, and again, as I have said before, once \nthose barriers are crossed in terms of either the scalability \nof the system or the internal bandwidth of the system, it opens \nup thousands of new applications where there were ready \nproblems to be analyzed but those applications weren't large \nenough to drive that breakthrough. So that is how the effect \nworks of the leadership coming from some of the government \nagencies and then being realized broadly across industry. That \nis the essence of where this leadership has come from so \nsuccessfully over the years.\n    Mr. Lipinski. I want to follow up with Dr. Rappa on that. \nDr. Rappa, you discussed the importance of public-private \npartnerships to realizing the benefits of big data and stated \nspecifically that we must intensify and accelerate the national \ninvestment in proven models. 
What characteristics make a \npublic-private partnership successful and what models should we \nbe investing in? What were you referring to there?\n    Dr. Rappa. Well, I think first of all, we have been doing \nthis now for six years and so I think we do have a fairly \ninteresting, novel model for producing talent in this field \nwith a kind of proven track record based on data, based on \nmarket value of the graduates, but I think it comes really, you \nknow, partly from the university community, partly from the \nacademic community. Obviously we have a set of missions to \neducate students but we need to also, I think, do that by \ntrying to really understand the employer, what are they looking \nfor when they hire talent, what are the kinds of skills that \nthey need in order to be effective on the job, and I think \nemployers need to sort of be open to working with the academic \ncommunity. You know, there is a certain amount of dissonance \nthat naturally occurs because there are two different worlds \nwith different missions but I think it is really--I think we \nhave shown that it is possible with organizational innovation, \nwith a focused effort, with a sense of openness to engage the \nprivate sector in a very positive way, not just at NC State but \nat other universities. There are many, many examples now that I \nhope we are providing some leadership on but that other \nuniversities are working with our model but also pursuing other \ncreative models to do this. There are probably about two dozen \naround the country already.\n    Mr. Lipinski. Thank you. Dr. Jahanian, anything you want to \nadd about public-private partnerships?\n    Dr. Jahanian. Yes, indeed. There is no question that when \nwe think about the innovation ecosystem in this country, it \nincludes academia, it includes the private sector, it includes \ngovernment investment and a talent-rich workforce. The private \nsector is investing heavily in cloud computing, as you know. 
It \nis investing heavily in making computational resources also \navailable. I think there are opportunities for the Federal \ninvestment to leverage that and make some of that available. Of \ncourse that is commercially available today to our researchers, \nto our scientists and engineers who could rely on those \nsystems. We have announced a number of partnerships, one with \nIBM and Google, another one with Microsoft that make some of \nthose resources available to the research community.\n    Dr. McQueeney already mentioned this, that there is high-\nbandwidth communication between the private sector and various \nFederal agencies. I can tell you from NSF's perspective, it is \na very, very rich collaboration. On my advisory committee, I \nhave a number of senior leaders from the private sector who \nserve on my advisory committee advising us on our portfolio, on \nour investments in addition to academics who serve on my \nadvisory committee.\n    The final comment that I want to make is, there are a \nnumber of programs at NSF, and I know you are familiar with all \nof them, including SBIR, including I-Corps and so on that focus \non transfer of knowledge from lab to practice. Federal \nGovernment invests heavily in advancing frontiers of knowledge. \nFor us to accelerate those programs such as I-Corps, SBIR and \nso on serves a tremendous purpose, and here again, there are \nopportunities to engage the private sector and accelerate the \ntransfer of knowledge to practice to benefit the Nation. Thank \nyou.\n    Mr. Lipinski. Thank you.\n    Chairman Massie. Thank you, Mr. Lipinski. I now recognize \nMr. Bridenstine from Oklahoma for five minutes.\n    Mr. Bridenstine. Thank you, Mr. 
Chairman.\n    I also serve on the House Armed Services Committee, and I \nam aware that the Department of Defense is moving towards \ncloud-based computing solutions, and this of course creates \nsome consternation about security issues, cyber hacking, other \ncyber crimes, and I was wondering if any of your organizations \nare involved in helping the Department of Defense work through \nthese issues and what those solutions might be, if you could \nshare with us on that?\n    Dr. McQueeney. Sure, if I could start? You are quite right \nto raise the concern about security for any systems used by the \nDefense Department especially, although it would be true for \nall Federal agencies. And when you move to a cloud computing \nmodel, there is an extra imperative to be concerned about \nsecurity, and if you think of it in terms of the DOD might \nthink of it, if that environment should be compromised by an \nenemy, it is a bigger piece of resource than an individual \nmachine so it requires special vigilance. Now, the good news \ntechnically is, the way we handle virtualization, which is the \nfoundation of how cloud computing is delivered from a compute \nvirtualization point of view, there are actually sophisticated \ntechniques that can provide additional security in a \nvirtualized environment that we can provide even when using \nthings running on bare metal. We have additional abilities to \ninstrument the operation of that cloud and to very rapidly \ndetect any kind of pattern or behavior that is indicative of a \nthreat.\n    We did a project a number of years ago with the U.S. Air \nForce and they graciously let us write a short press release on \nit where we built a cloud computing environment that was at the \ncutting edge a few years ago. 
We instrumented it very \nthoroughly with watching the packets flowing on the \ninterconnected network that built the cloud in question and we \nvery carefully isolated it from the rest of the world, \nintroduced known cyber attacks into it and were able to show \nthat if we knew the patterns of command and control, as the \ndefense folks might say, of these cyber attacks, we could \nactually spot them assembling themselves and interrupt them \nbefore they had a chance to launch. So having tremendous \ncontrol over the environment out of which we were getting \ncompute resources gave us abilities to do additional security \nand additional monitoring, even if we assumed the security was \nnot perfect and could be breached, could we essentially in real \ntime detect that breach and interrupt it before it stopped. I \nthought that was a very forward-looking piece of work that was \ndriven by the Air Force CIO's office.\n    Mr. Bridenstine. Excellent. Go ahead.\n    Dr. Jahanian. As you alluded to, these new environments, \nwhether it is mobile platforms or cloud computing, are \nintroducing new challenges, and we recognize that attackers and \ndefenders are coevolving and there are enormous challenges to \nprotecting our critical infrastructure and our cyber \ninfrastructure.\n    I wanted to mention NSF's Secure and Trustworthy Cyberspace \nprogram, which is a research program addressing many of the \nchallenges that we alluded to, and this is a research program \nthat addresses not only the technology issues but also \ntransition to practice. Furthermore, the NITRD research and \ndevelopment subcommittee has a working group that focuses on \ncoordination of activity across various agencies on \ncybersecurity and there is rich dialog involving various \nagencies on that issue.\n    Mr. Bridenstine. Excellent. 
Are there any other things that \nthe Department of Defense could do to help you guys with the \nobjective of securing cloud computing for the Department of \nDefense?\n    Dr. Rappa. So I am currently co-directing a project with a \ncolleague at NC State, which is the science of security project \nthat is done in collaboration with Carnegie-Mellon University \nand University of Illinois, and we are trying to bring together \nlarge groups, multidisciplinary groups of faculty to really try \nto understand the underpinning of the security problem and how \nto produce science around it. It is a very long-term challenge \nbut it is one which I think has to start with getting the \nfaculty across different disciplines focused on it and \ncertainly I think it has been a tremendous opportunity and I \nlook forward to moving into the future.\n    Dr. McQueeney. Yeah, Dr. Rappa makes a very interesting \npoint, to close the loop here. The cybersecurity problem is \nitself a big data and fast-data problem, and in fact, with some \nof the advanced persistent threats that we see today, which \ndepend on breaching an infrastructure and then laying dormant \nfor several months, what the attacker is trying to do is to \nwait out how long you keep your log file data so that when they \nlaunch themselves, it is difficult to do forensics, and so what \nwe have learned is that these log files are actually the \nessence of the big data you need to do pattern analysis, \npattern discovery on forensics, you know, should any attack \noccur. So in fact, most of the science behind big data \nincluding data at rest and large-scale computation and fast-\ndata that are eating very high-speed streams is directly \nrelevant to the subject of cyber defense.\n    Mr. Bridenstine. Thank you.\n    Chairman Massie. Thank you, Mr. Bridenstine. If the Ranking \nMember is amenable to this, I think we will do another round of \nquestions?\n    Ms. Wilson. Yes.\n    Chairman Massie. 
Did you have something to introduce into \nthe record?\n    Ms. Wilson. I do. Thank you, Mr. Chair. Mr. Kilmer has lots \nof conflicts. As we saw him come to the meeting, he had to \nleave, and I want to ask unanimous consent on behalf of Mr. \nKilmer to introduce a report on big data from IDC into the \nrecord, and then I have a question.\n    Chairman Massie. Without objection, so ordered. It will be \nset into the record.\n    [The information appears in Appendix II]\n    Ms. Wilson. Thank you. This question is for everyone.\n    We have all had several discussions lately about the value \nof NSF-funded research to society and how we might certify that \nvalue based on the grant proposal. I think we might use big \ndata instructively here. It is an incredibly interdisciplinary \nfield where tools are developed in the pursuit of one narrow \nresearch question, let us say in the social sciences might have \nprofound applications across many fields of science and even in \nmany sectors of the economy that can't possibly be anticipated \nat the time of the proposal. What is the potential for data \nanalytics being developed in one little seemingly irrelevant \ncorner having unintended benefits to other fields and societal \napplications? And if you have concrete examples, that would be \neven better for us to understand. Thank you.\n    Dr. Jahanian. Okay. I guess I will start. There is no \nquestion there are all sorts of explorations that we are doing \nin the area of big data that we can't even begin to see the \npotential impact of it. I will give you an example. NSF has \nbeen investing and other agencies with the private sector in \nwhat is known as the area of machine learning. These \ninvestments have taken place for at least 20 or 30 years. In \nfact, IBM has also led efforts in this area. 
I can tell you \nthat it is investments of the last 20 or 30 years that have \ncome to fruition such that these machine learning algorithms \nessentially allow us to look at these large data sets and \nidentify trends and be able to adapt. Essentially, they have a \nbroad range of applications from weather forecasting to \nfinancial modeling to biomedical research and so on that have \nhad tremendous, tremendous impact and now we use these \ntechniques as if they are off-the-shelf solutions available \nthat you can buy. These are through years of investment that we \nhave made that have come to fruition, so that is an example of \nthat.\n    We are investing in all sorts of areas in natural language \nunderstanding, in information retrieval, in various algorithms \nand approaches to automated scalable approaches to reasoning \nthat could be applied to understanding relationship between \ngene sequence structure and biological functions. These are all \nessentially the kinds of investments that we are making that \nsome of us we could see how it comes to fruition. Some of it \nrelies on decades of investment that we have already made in \ncomputational techniques and data-intensive techniques.\n    Dr. McQueeney. If I could offer you an example from the \nmedical world, one of the critical problems in medicine is the \nloss of premature infants due to infections, and physicians \nhave struggled for a long time with identifying the onset of an \ninfection at a very early point because as these infections can \ngrow exponentially, the earlier you can intercept them, the \nmore likely you are to have a lifesaving benefit for someone \nwho is very vulnerable such as a premature infant. 
We have done \nwork with the Toronto Hospital for Sick Kids where a physician \nup there had an idea that all the instrumentation in the NICU \nthat is--you know, you have probably been in a hospital room or \nintensive-care room, all the instruments around the bed, \nsomeone comes in every half an hour and writes down those \nnumbers but the instruments are producing readings \ncontinuously, and this physician had the idea that if we kept \nall that data and we stored all that data as it came out of the \nmachines in real time, which was a tremendous aggregation from \na velocity of data point of view and correlated with the \neventual issues that these premature infants had, we might be \nable to detect patterns using techniques such as machine \nlearning that we were just hearing about that would give us an \nearly identification of an upcoming infection, the ability to \ntreat it before it got out of control, and her theories were \nabsolutely correct. There were signatures in the data that gave \nup to 24 hours advance notice of an onset of an infection that \nwas time for the doctors to in many cases provide some kind of \nlifesaving therapy. So there is an example of very, very deep \nmathematics, computer science being applied to a problem where \nthe data was being produced every day by these instruments and \nit wasn't being captured and it wasn't being looked at and it \nwasn't being correlated with results to produce a fantastic \noutcome.\n    Dr. Rappa. I would just sum up by saying that really big \ndata is part of a decades-long process that really started with \ncomputerization in the 1940s and 1950s and eventually got \ninterconnected through the Internet in the 1970s, 1980s and \n1990s that the world that we are turning into, data is going to \nbe everywhere. It is going to affect exactly what happens here. 
\nIt is going to affect hospitals, universities, every corner of \nthe economy literally, and so we need to take approaches to \nthat to try to develop understanding around big data, how it is \napplied, how the tools of analytics are applied across, you \nknow, virtually every sector of the economy, and so I would \ntake a very broad view, not looking at it as specifically, you \nknow, a realm of computer technology or some other sort of \nisolated realm but looking at it as, you know, unfortunately as \nthe big thing it is.\n    Dr. Jahanian. May I offer another example as I was thinking \nabout it? I am reminded of the work by Daphne Koller and her \ncollaborators at Stanford on classifying breast cancer via \nimage analysis. As you know, 40,000 women die from this disease \neach year. By extending essentially image analysis techniques \nto hundreds of, I should say thousands and thousands of biopsy \nimages, they were able to identify a subset of cellular \nfeatures. Out of 6,000 possible features, they were able to \nessentially identify a few of them that were predictive of \nsurvival time among breast cancer patients. What is really \nsurprising is that the feature that they identified, it wasn't \njust from--the best feature, I should say, that is a predictor \nof survival, was not from the cancerous tissue itself but it \nwas from the surrounding tissue, and that has led to new kinds \nof treatments. It has led to new kinds of diagnosis techniques \nand also a very personalized treatment that could aim to \nimprove survival times in patients. That is a very, very \nconcrete example.\n    Another example is the work that Google had done during \nH1N1 virus. I will be very brief about this. Before they \nactually discovered a vaccine, we wanted to track the spread of \ndisease. 
Google engineers used data that had nothing to do with \nthe virus directly from billions of essentially web searches \nfrom around the world together from publicly available, \nessentially historic data on flu trends, to predict the spread \nof flu virus down to small regions in the country--or across \nthe world, rather. This is a remarkable essentially application \nof data that one would have never thought could be applicable \nto something like H1N1 virus.\n    Ms. Wilson. Thank you very much.\n    Chairman Massie. Thank you, Ms. Wilson. Thank you for that \nvery excellent example of how we can use--a private company can \nfind information in the data.\n    We got a little bit out of order so the last question is \ngoing to be mine. I reserve five minutes for myself. And the \nquestion I want to ask is, we have heard about banks that are \ntoo big to fail, and we also know that the Internet is now too \nbig to fail. We recently in the House passed a CISPA bill which \nis somewhat controversial but some people felt it was necessary \nto do because the Internet was so big and pervasive in our \nlives. So my question to you is, are there any big data sets \nthat are too big to fail? In other words, are there ones that \nare pervasive that we have let through osmosis become--we have \nbecome too dependent upon or maybe not too dependent but we are \ndependent upon these data sets, for instance, weather, you \nknow, and early warning systems? Not all of those, I imagine, \nare government systems. Some of them are private but possibly \nthe government is relying on these systems and so I would be \nremiss if I didn't ask this question now before something \nfails, but tell us what is too big to fail right now? What \nwould we bail out, and is there sufficient redundancy in the \ncollection, storage and access of these data sets? Thank you.\n    Dr. McQueeney. 
Well, first, I would just like to say that \nwe were delighted to support that cyber bill, and I \ncongratulate you on such broad bipartisan support in the House \nfor getting that acted upon.\n    Data sets have the property that they can often be \nsubdivided and often be replicated, and so we have a lot of \ntechniques by which we can assure the continuity of data if we \ntake the time to do it, and if there were very valuable \nhistorical records on things like long-term weather trends that \nwere only stored in one place, that actually could be a concern \nbecause that is literally irreplaceable data. But essentially \nall of the IT techniques needed to take those large data sets \nand segment them and replicate them in different secure places \nso they could be re-created do exist but I think you raise an \ninteresting point, that it is worthwhile to periodically check \nthat we are being appropriately vigilant with the digital \narchives that are so valuable.\n    Chairman Massie. Dr. Jahanian?\n    Dr. Jahanian. I don't have a specific example. What I can \ntell you is that similar to the issue of cybersecurity, as \nNation's critical infrastructure and more generally the \nInternet is playing a vital role in integrating the economic, \nyou know, political, societal fabric of our society, we are \ngoing to become more and more dependent on data, and data is \ngoing to play an increasingly significant role in our day-to-\nday lives, and for that reason, I think the same sort of issues \nthat apply to all sorts of IT solutions that we take for \ngranted will increasingly be applied to data.\n    From a research and engineering community's point of view, \nit is not just failure of the data but making that data \naccessible and also making the data accessible to broad \ncommunity of scientists and engineers is an issue that we are \nquite concerned about.\n    Chairman Massie. Thank you very much. 
I was part of the \nbipartisan on CISPA, opposing CISPA actually, but that is okay.\n    I want to thank the witnesses for their valuable testimony \nand the Members for their questions today. The Members in the \nCommittee may have additional questions for you, and we will \nask that you respond to those in writing. The record will \nremain open for two weeks for additional comments and written \nquestions from the Members.\n    The witnesses are excused and this hearing is adjourned.\n    [Whereupon, at 11:35 a.m., the Subcommittees were \nadjourned.]\n\n\n\n                               Appendix I\n\n                              ----------                              \n\n\n                   Answers to Post-Hearing Questions\n\nResponses by Dr. Michael Rappa\n\n[GRAPHIC(S) NOT AVAILABLE IN TIFF FORMAT]\n\nResponses by Dr. Farnam Jahanian\n\n[GRAPHIC(S) NOT AVAILABLE IN TIFF FORMAT]\n\n                              Appendix II\n\n                              ----------                              \n\n\n                   Additional Material for the Record\n\n   IDC IVIEW, The Digital Universe in 2020: Big Data, Bigger Digital \n       Shadows, and Biggest Growth in the Far East, submitted by \n                      Representative Derek Kilmer\n\n[GRAPHIC(S) NOT AVAILABLE IN TIFF FORMAT]\n\n                                 <all>\n</pre></body></html>\n"