[Federal Register Volume 89, Number 4 (Friday, January 5, 2024)]
[Notices]
[Pages 783-786]
From the Federal Register Online via the Government Publishing Office [www.gpo.gov]
[FR Doc No: 2024-00036]


-----------------------------------------------------------------------

GENERAL SERVICES ADMINISTRATION

[Notice-MY-2023-03; Docket No. 2023-0002; Sequence No. 37]


Office of Shared Solutions and Performance Improvement (OSSPI); 
Chief Data Officers (CDO) Council; Request for Information--Synthetic 
Data Generation

AGENCY: Federal Chief Data Officers (CDO) Council; General Services 
Administration (GSA).

ACTION: Notice.

-----------------------------------------------------------------------

SUMMARY: The Federal CDO Council was established by the Foundations for 
Evidence-Based Policymaking Act. The Council's vision is to improve 
government mission achievement and increase benefits to the nation 
through improving the management, use, protection, dissemination, and 
generation of data in government decision-making and operations. The 
CDO Council is publishing this Request for Information (RFI) for the 
public to provide input on key questions concerning synthetic data 
generation. Responses to this RFI will inform the CDO Council's work to 
establish best practices for synthetic data generation.

DATES: We will consider comments received by February 5, 2024.


Targeted Audience

    This RFI is intended for Chief Data Officers, data scientists, 
technologists, data stewards, and data- and evidence-building-related 
subject matter experts from the public, private, and academic sectors.

ADDRESSES: Respondents should submit comments identified by Notice-MY-
2023-03 via the Federal eRulemaking Portal at https://www.regulations.gov and follow the instructions for submitting 
comments. All public comments received are subject to the Freedom of 
Information Act and will be posted in their entirety at 
regulations.gov, including any personal and/or business confidential 
information provided. Do not include any information you would not like 
to be made publicly available.
    Written responses should not exceed six pages, inclusive of a one-
page cover page as described below. Please respond concisely, in plain 
language, and specify which question(s) you are responding to. You may 
also include links to online materials or interactive presentations, 
but please ensure all links are publicly available. Each response 
should include:
    • The name of the individual(s) and/or organization 
responding.
    • A brief description of the responding individual(s) or 
organization's mission and/or areas of expertise.
    • The section(s) that your submission and materials are 
related to.
    • A contact for questions or other follow-up on your 
response.
    By responding to the RFI, each participant (individual, team, or 
legal entity) warrants that they are the sole author or owner of, or 
have the right to use, any copyrightable works that the submission 
comprises; that the works are wholly original (or are improved 
versions of existing works that the participant has sufficient rights 
to use and improve); and that the submission does not infringe any 
copyright or any other rights of any third party of which the 
participant is aware.
    By responding to the RFI, each participant (individual, team, or 
legal entity) consents to the contents of their submission being made 
available to all Federal agencies and their employees on an internal-
to-government website accessible only to agency staff persons.
    Participants will not be required to transfer their intellectual 
property rights to the CDO Council, but participants must grant to the 
Federal Government a nonexclusive license to apply, share, and use the 
materials that are included in the submission. To participate in the 
RFI, each participant must warrant that there are no legal obstacles to 
providing the above-referenced nonexclusive licenses of participant 
rights to the Federal Government. Interested parties who respond to 
this RFI may be contacted for follow-on questions or discussion.

FOR FURTHER INFORMATION CONTACT: Issues regarding submission or 
questions can be sent to Ken Ambrose and Ashley Jackson, Senior 
Advisors, Office of Shared Solutions and Performance Improvement, 
General Services Administration, at 202-215-7330 (Kenneth Ambrose) and 
202-538-2897 (Ashley Jackson), or [email protected].

SUPPLEMENTARY INFORMATION:

Background

    Pursuant to the Foundations for Evidence-Based Policymaking Act of 
2018,\1\ the CDO Council is charged with establishing best practices 
for the use, protection, dissemination, and generation of data in the 
Federal Government. In reviewing existing activities and literature 
from across the Federal Government, the CDO Council has determined 
that:
---------------------------------------------------------------------------

    \1\ H.R. 4174, 115th Congress (2017-2018): Foundations for 
Evidence-Based Policymaking Act of 2018, https://www.congress.gov/bill/115th-congress/house-bill/4174/text.
---------------------------------------------------------------------------

    • the Federal Government would benefit from developing 
consensus around a more formalized definition of synthetic data 
generation,
    • synthetic data generation has wide-ranging applications, 
and
    • there are challenges and limitations with synthetic data 
generation.
    The CDO Council is interested in consolidating feedback and input 
from qualified experts to gain additional insight and assist with 
establishing a best practice guide around synthetic data generation. 
The CDO Council has preliminarily drafted a working definition of 
synthetic data generation and several key questions to better inform 
its work.

Information and Key Questions

Section 1: Defining Synthetic Data Generation

    Synthetic data generation is an important part of modern data 
science work. In the broadest sense, synthetic data generation involves 
the creation of a new synthetic or artificial dataset using 
computational methods. Synthetic data generation can be contrasted with 
real-world data collection. Real-world data collection involves 
gathering data from a first-hand source, such as through surveys, 
observations, interviews, forms, and other methods. Synthetic data 
generation is a broad field that employs varied techniques and can be 
applied to many different kinds of problems. Data may be fully or 
partially synthetic. A fully synthetic dataset wholly consists of 
points created using computational methods, whereas a partially 
synthetic dataset may involve a mix of first-hand and computationally 
generated synthetic data.
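    As an illustrative sketch only (not part of the formal notice), the 
distinction between fully and partially synthetic data might be expressed 
as follows, assuming a simple normal-distribution model of the source 
data; all data values are hypothetical:

```python
import random
import statistics

random.seed(0)

# First-hand (real-world) data, e.g. values collected via a survey.
real_data = [52.1, 48.7, 50.3, 49.9, 51.5, 47.8, 50.6, 49.2]

# Fit a simple known distribution (here, a normal distribution).
mu = statistics.mean(real_data)
sigma = statistics.stdev(real_data)

# Fully synthetic: every point is computationally generated.
fully_synthetic = [random.gauss(mu, sigma) for _ in range(len(real_data))]

# Partially synthetic: some values (here, every other one, standing in
# for sensitive fields) are replaced with generated values, while the
# remaining first-hand values are kept.
partially_synthetic = [
    random.gauss(mu, sigma) if i % 2 == 0 else x
    for i, x in enumerate(real_data)
]
```

In practice, agencies would use far richer models than a single Gaussian; 
the sketch only illustrates the fully/partially synthetic distinction.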
    Throughout this RFI, we use the following definitions:
    • data--recorded information, regardless of form or the 
media on which the data is recorded; \2\
---------------------------------------------------------------------------

    \2\ 44 U.S.C. 3502(16).
---------------------------------------------------------------------------

    • data asset--a collection of data elements or data sets 
that may be grouped together; \3\
---------------------------------------------------------------------------

    \3\ 44 U.S.C. 3502(17).
---------------------------------------------------------------------------

    • open government data asset--a public data asset that is 
(A) machine-readable; (B) available (or could be made available) in an 
open format; (C) not encumbered by restrictions, other than 
intellectual property rights, including under titles 17 and 35, that 
would impede the use or reuse of such asset; and (D) based on an 
underlying open standard that is maintained by a standards 
organization.\4\
---------------------------------------------------------------------------

    \4\ 44 U.S.C. 3502(20).
---------------------------------------------------------------------------

    The National Institute of Standards and Technology (NIST) defines 
synthetic data generation as ``a process in which seed data is used to 
create artificial data that has some of the statistical characteristics 
as the seed data''.\5\
---------------------------------------------------------------------------

    \5\ https://csrc.nist.gov/glossary/term/synthetic_data_generation.
---------------------------------------------------------------------------

    The CDO Council believes that this definition of synthetic data 
generation includes techniques such as using statistics to create data 
from a known distribution, generative adversarial networks (GANs),\6\ 
variational autoencoding (VAE),\7\ building test data for use in 
software development,\8\ privacy-preserving synthetic data 
generation,\9\ and others.
---------------------------------------------------------------------------

    \6\ 15 U.S.C. 9204.
    \7\ A useful definition of this technique is available in the 
abstract of this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8774760/.
    \8\ This technique is described in the Department of Defense 
DevSecOps Fundamentals Guidebook https://dodcio.defense.gov/Portals/0/Documents/Library/DevSecOpsTools-ActivitiesGuidebook.pdf, page 23.
    \9\ NIST Special Publication 800-188, Section 4.4 https://doi.org/10.6028/NIST.SP.800-188.
---------------------------------------------------------------------------

    The CDO Council also believes that it is important to draw 
contrasts between synthetic data generation and other activities. For 
example, synthetic data generation does not include collection of data 
without any inference. Synthetic data generation does not 
include signal processing, such as automated differential translations 
of global positioning satellite data. Synthetic data generation also 
does not include enriching data during data analysis--such intermediate 
steps that involve augmenting or enhancing existing data but do not 
involve the creation of artificial data.
    Other analysis techniques, such as distribution fitting and 
parametric modeling, are closely related to synthetic data generation. 
The CDO Council believes the key difference, however, is the purpose of 
the computational methods. Synthetic data generation seeks to create 
wholly new data points based on the statistical properties of a 
dataset, whereas distribution fitting seeks to `fill in' a dataset 
based on a known distribution. Notably, the fitted distribution can be 
used to generate points that are not part of the original dataset--
which is an application of synthetic data generation.
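    As an illustrative sketch only (not part of the formal notice), the 
contrast drawn above might be expressed as follows: distribution fitting 
estimates the parameters of a known distributional form, while synthetic 
data generation uses the fitted distribution to create wholly new points. 
All data values here are hypothetical:

```python
from statistics import NormalDist

# Observed (first-hand) data; values are hypothetical.
observed = [3.1, 2.9, 3.4, 3.0, 2.8, 3.2]

# Distribution fitting: estimate the parameters (mean, standard
# deviation) of a known distributional form from the observed data.
fitted = NormalDist.from_samples(observed)

# Synthetic data generation: use the fitted distribution to create
# wholly new data points that were never observed.
new_points = fitted.samples(4, seed=1)
```

The first step alone is analysis; only the sampling step creates 
artificial data, which is the distinction the Council draws above.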
Questions
    • Are there any limitations to relying on the NIST 
definition to describe the field of synthetic data generation? How 
should it be improved?
    • How well does the CDO Council's list of examples and 
contrasts improve understanding? How should these be improved?

Section 2: Applying Synthetic Data Generation

    Synthetic data generation can enable the creation of larger and 
more diverse datasets, enhance model performance, and protect 
individual privacy. The CDO Council's review of potential applications 
of synthetic data generation found examples in:
    • Data augmentation.\10\ This application involves creating 
new data points or datasets from existing data. This application can be 
particularly useful in developing training datasets for machine 
learning and advanced analytics.
---------------------------------------------------------------------------

    \10\ This application is briefly described at https://frederick.cancer.gov/initiatives/scientific-standards-hub/ai-and-data-science, Section 4.
---------------------------------------------------------------------------

    • Data synthesis.\11\ This application involves using an 
existing dataset to create a new dataset, sharing similar statistical 
properties with the original dataset, to protect individual privacy. 
Generating such datasets has wide-ranging applications including, but 
not limited to, facilitating reproducible investigation of clinical 
data while preserving individual privacy.
---------------------------------------------------------------------------

    \11\ A definition of this technique is available in the abstract 
of this paper https://par.nsf.gov/servlets/purl/10187206.
---------------------------------------------------------------------------

    • Modeling and simulation.\12\ This application involves 
setting assumptions, parameters and rules to develop data for further 
analysis. The synthetic dataset can be used for developing insights, 
testing hypotheses, and/or understanding a model's behavior. This 
application supports the conduct of controlled experiments, predicting 
potential future outcomes from current conditions, generating scenarios 
for rare or extreme events, and validating or calibrating a model.
---------------------------------------------------------------------------

    \12\ A definition of a computer simulation is proposed at https://builtin.com/hardware/computer-simulation.
---------------------------------------------------------------------------

    • Software development.\13\ This application involves using 
existing database schemas to simulate real-world scenarios and ensure 
that a software application can handle different types of data and 
errors effectively. This application assists in the creation of 
representative data, makes it easier to generate edge cases, protects 
individual privacy, and improves testing efficiency.
---------------------------------------------------------------------------

    \13\ DoD DevSecOps Fundamentals, ibid.
---------------------------------------------------------------------------
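    As an illustrative sketch only (not part of the formal notice), the 
software-development application described above might look like the 
following: generating test records from a table schema so that no real 
personal data is used. The schema, column names, and type handling here 
are entirely hypothetical:

```python
import random
import string

random.seed(2)

# A simplified, hypothetical table schema (column name -> type).
schema = {"id": "int", "name": "str", "balance": "float"}

def synthetic_value(col_type):
    """Generate one synthetic value for a schema column type."""
    if col_type == "int":
        return random.randint(1, 10_000)
    if col_type == "str":
        return "".join(random.choices(string.ascii_letters, k=8))
    if col_type == "float":
        # Deliberately include edge-case values useful for testing.
        return random.choice([0.0, -1.0, round(random.uniform(0, 1e6), 2)])
    raise ValueError(f"unsupported type: {col_type}")

def synthetic_rows(schema, n):
    """Create n synthetic records matching the schema, with no real data."""
    return [{col: synthetic_value(t) for col, t in schema.items()}
            for _ in range(n)]

rows = synthetic_rows(schema, 5)
```

Because the records are generated from the schema rather than copied 
from production data, the test dataset can cover edge cases while 
protecting individual privacy.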

    Notably, the CDO Council believes that not all applications of 
modeling and simulation would meet the definition of synthetic data 
generation. For example, weather forecasting applies numerical models 
and a complex mix of data analysis, meteorological science, and 
computational methods but does not involve the creation of synthetic or 
artificial data points. Instead, the purpose of these models is to 
predict future conditions.
Questions
    • How are these examples representative of synthetic data 
generation? How should they be revised?
    • What other examples of synthetic data generation should 
the CDO Council know about?
    • What are the key advantages of using synthetic data 
generation?

Section 3: Challenges and Limitations in Synthetic Data Generation

    The CDO Council recognizes that synthetic data generation can be a 
valuable technique. However, the technique has some challenges and 
limitations. For example, it can be difficult to generate data that 
realistically simulates the real world and the diversity of real data. 
Additionally, evaluating the quality of a synthetic dataset can be 
extremely challenging.
    Synthetic data generation is also subject to challenges commonly 
facing any statistical method, such as overfitting and imbalances in 
the source data. These challenges reduce the utility of the generated 
synthetic data, which may not be properly representative and may fail 
to represent rare classes.
Questions
    • What other challenges and limitations are important to 
consider in synthetic data generation?
    • What tools or techniques are available for effectively 
communicating the limitations of generated synthetic data?
    • What are best practices for CDOs to coordinate with 
statistical officials on synthetic data?
    • What approaches can CDOs consider to help address these 
challenges?

Section 4: Ethics and Equity Considerations in Synthetic Data 
Generation

    Synthetic data generation techniques hold great promise, but also 
introduce questions of ethics and equity. Consistent with Federal 
privacy practices,\14\ any data generation technique involving 
individuals must respect their privacy rights and obtain informed 
consent before using real-world data to generate synthetic data. As 
noted in Section 3, synthetic data generation is also subject to 
challenges commonly facing any statistical method and has the 
potential to introduce and encode errors or bias, potentially leading 
to discriminatory outcomes.
---------------------------------------------------------------------------

    \14\ OMB Circular A-130, Appendix II https://www.whitehouse.gov/wp-content/uploads/legacy_drupal_files/omb/circulars/A130/a130revised.pdf.
---------------------------------------------------------------------------

    Uses of generated synthetic data must also be carefully considered. 
The context and quality of the generated synthetic data will determine 
its practical utility and impact. Assessing and understanding the fitness 
of a generated synthetic dataset is essential. For instance, a 
generated synthetic dataset may not sufficiently represent the 
diversity of the source dataset. In addition, a generated synthetic 
dataset may not contain sufficient variables to fully represent the 
system and the drivers of differences in the phenomenon it is meant to 
represent.
Questions
    • What techniques are available to facilitate transparency 
around generated synthetic data?
    • What are best practices for CDOs to coordinate with 
privacy officials on ethics and equity matters related to synthetic 
data generation?
    • How can we apply the Federal Data Ethics Framework \15\ to 
address these ethics and equity concerns?
---------------------------------------------------------------------------

    \15\ https://resources.data.gov/assets/documents/fds-data-ethics-framework.pdf.
---------------------------------------------------------------------------

Section 5: Synthetic Data Generation and Evidence-Building

    Synthetic data generation can enable the production of evidence for 
use in policymaking. Applications such as simulation or modeling can 
help policymakers explore scenarios and their potential impacts. 
Likewise, policymakers can conduct controlled experiments of potential 
policy interventions to better understand their impacts. Data synthesis 
may help policymakers make more data publicly available to spur 
research and other foundational fact-finding activities that can inform 
policymaking.
Questions
    • What other applications of synthetic data generation 
support evidence-based policymaking? \16\
---------------------------------------------------------------------------

    \16\ OMB Memorandum M-19-23.
---------------------------------------------------------------------------

    • What is the relationship between synthetic data generation 
and open government data? \17\
---------------------------------------------------------------------------

    \17\ 44 U.S.C. 3502(20).
---------------------------------------------------------------------------

    • How can CDOs and Evaluation Officers best collaborate on 
synthetic data generation to support evidence-building? \18\ What about 
other evidence officials? \19\
---------------------------------------------------------------------------

    \18\ OMB Memorandum M-19-23, Appendix A.
    \19\ OMB Memorandum M-19-23, Section II (Key Senior Officials).

Kenneth Ambrose,
Senior Advisor CDO Council, Office of Shared Solutions and Performance 
Improvement, General Services Administration.
[FR Doc. 2024-00036 Filed 1-4-24; 8:45 am]
BILLING CODE 6820-69-P