Data Tool Kit - Open Source Code Handbook
This handbook provides guidance on publishing open source code and on starting, maintaining, or contributing to a successful open source project. Its aim is to give individuals a consistent and repeatable set of guidelines and procedures to follow. It is a living resource that will be updated as new issues are identified or new workflows are created.
What is open source code?
Source code is the fundamental component of a computer program. It is the set of instructions or statements that defines how a program works. Everything from the applications installed on your phone to the operating system running on your computer has source code associated with it. Source code can be proprietary or open. Proprietary source code is kept private by the company or individual that owns it and is not shared with others. Examples of proprietary software include Microsoft Office and Adobe Photoshop. Open source code is code that is made freely available for anyone to use, modify, and share under the terms of an open source license. Examples of open source software include MySQL, Firefox, and WordPress.
Why is open source important?
Open source software is widely used today by companies in a broad range of industries. Many of these companies, including Facebook and Google, also develop and publish their own open source software. There are many reasons why a person or organization would want to open source their code. Some of these reasons include:
- Innovation – Modern technology is built upon open source software. Giving back to the open source community drives innovation.
- Collaboration – Accessible code builds community and enables broad collaboration across the world.
- Cost savings – Open source software is a low-cost alternative to proprietary software. Sharing and reusing open source code can help cut costs and save time.
- Transparency – Anyone can inspect the source code for errors or inconsistencies, improving openness, transparency, and accountability.
California joins the United States Government and a growing number of governments worldwide that are adopting an open approach to technology. In May 2018, the California Department of Technology (CDT) released a technology letter (TL 18-02) announcing the Open Source and Code Reuse Policy (SAM Section 4984). This policy establishes a public repository for enterprise code and directs California agencies to publish all new custom-developed code and documentation to the repository for sharing and reuse across state government. TL 18-02 states:
Currently, when Agencies/state entities produce custom-developed source code, they do not make their new code broadly available for state government-wide reuse. These challenges have resulted in duplicative acquisitions for substantially similar code and the inefficient use of taxpayer dollars. Enhanced reuse of custom-developed code across state government can have significant benefits for taxpayers, including decreasing duplicative costs for the same code and reducing vendor lock-in.
While the benefits of custom developed code reuse are significant, additional benefits can accrue when custom-developed code is also made available to the public for inspection, improvement, and reuse as OSS. When possible, making source code available as OSS can enable continual improvement of state software development efforts as a result of a broader user community implementing the code for its own purposes and publishing improvements. This collaborative atmosphere can make it easier to conduct software peer review and security testing, to reuse existing solutions, and to share technical knowledge.
How do I make my code open source?
No matter the size of your project, from a single script to enterprise-level software, you can make your code open source. This section outlines the requirements of open source software and guides you through the process of preparing and publishing your code.
Exemptions and Approval
Some projects are more sensitive than others based on the purpose of the project and the type of data or information used in the project. In some cases, it is not appropriate to make a project open source. The Open Source and Code Reuse Policy (SAM Section 4984.2) identifies certain exemptions to the state open source policy, as follows:
- The sharing of the source code is restricted by law or regulation, including—but not limited to—patent or intellectual property law, the Export Asset Regulations, the International Traffic in Arms Regulation, and the Federal laws and regulations governing classified information;
- The sharing of the source code would create an identifiable risk to the detriment of national security, confidentiality of Government information, or individual privacy;
- The sharing of the source code would create an identifiable risk to the stability, security, or integrity of the Agency/state entity’s systems or personnel;
- The sharing of the source code would create an identifiable risk to the Agency/state entity’s mission, programs, or operations.
In addition to the exemptions listed above, you should review your project, including all code, data, and documentation, to ensure that they do not contain any sensitive information, including but not limited to the following:
- Keys, passwords, credentials, and login details
- Personally identifiable information (PII)
- Sensitive location data, such as the locations of private wells
- Database queries, device IDs, IP addresses
- Algorithms used to detect fraud
- Unreleased or unannounced policy
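As a first pass before a manual review, a short script can flag lines that look like they contain some of the items listed above. The following is a minimal sketch in Python, not an official tool: the pattern names, the regular expressions, and the function names scan_text and scan_project are illustrative assumptions, and the patterns will both miss real secrets and flag harmless lines.

```python
import re
from pathlib import Path

# Illustrative patterns only (assumptions, not an exhaustive or official list).
SENSITIVE_PATTERNS = {
    "password or key assignment": re.compile(
        r"(?i)\b(password|passwd|secret|api[_-]?key|token)\b\s*[:=]\s*\S+"
    ),
    "IPv4 address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "email address": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def scan_text(text):
    """Return (line_number, label) pairs for lines that match a pattern."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SENSITIVE_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, label))
    return findings


def scan_project(root):
    """Scan every readable text file under root and collect findings per file."""
    results = {}
    for path in Path(root).rglob("*"):
        # Skip directories and Git's internal metadata.
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        findings = scan_text(text)
        if findings:
            results[str(path)] = findings
    return results
```

Calling scan_project(".") from the project folder returns a dictionary that maps each flagged file to the line numbers and pattern labels found in it. Treat the output as a list of places to look, not as proof that a project is safe to publish.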
If your project contains sensitive information that should not be released to the public, take measures to redact or anonymize the sensitive data. Anonymizing data involves replacing sensitive information with fake information in a way that still maintains the structure and analytical value of the original dataset. For larger and more complex datasets, however, it is usually safer to generate an entirely new (fake) dataset or exclude the dataset from your project altogether. Whatever you decide to do, make a note of your approach in your project’s documentation so that others who view or use your files are aware that the files have been modified.
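One simple way to replace sensitive values while keeping the structure of a table is to swap each distinct value for a stable placeholder. The sketch below assumes the data is already loaded as a list of dictionaries (one per row); the function name pseudonymize and the column names in the sample are hypothetical, and the sample rows are invented. Replacing identifiers alone may not prevent re-identification when other columns are revealing, which is one reason a fully synthetic dataset is often the safer choice.

```python
def pseudonymize(rows, sensitive_columns):
    """Replace values in sensitive columns with stable placeholder labels.

    The same original value always maps to the same placeholder, so counts
    and groupings on the column still work after anonymization.
    The input rows are not modified; new dictionaries are returned.
    """
    mappings = {column: {} for column in sensitive_columns}
    anonymized = []
    for row in rows:
        new_row = dict(row)  # copy so the original row stays untouched
        for column in sensitive_columns:
            mapping = mappings[column]
            original = row[column]
            if original not in mapping:
                # Number placeholders in order of first appearance.
                mapping[original] = f"{column}_{len(mapping) + 1:04d}"
            new_row[column] = mapping[original]
        anonymized.append(new_row)
    return anonymized


# Invented sample rows, for illustration only.
sample_rows = [
    {"owner_name": "Pat Lee", "well_id": "W-17", "nitrate_mg_l": "4.2"},
    {"owner_name": "Sam Ortiz", "well_id": "W-22", "nitrate_mg_l": "9.8"},
    {"owner_name": "Pat Lee", "well_id": "W-31", "nitrate_mg_l": "3.1"},
]
clean_rows = pseudonymize(sample_rows, ["owner_name", "well_id"])
```

In clean_rows, both rows that belonged to the same owner carry the same placeholder, while the measurement column is unchanged.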
If you are a Water Boards staff member, we strongly encourage you to obtain approval from your management before proceeding further. Depending on the nature of your project, you may also want to meet with representatives from IT, legal, and OIMA, as well as any external stakeholders who might be affected by the release of your code. Use your best judgment and thoroughly review your project files before publishing them online.
Source Code
Source code is the set of instructions or statements, written in a programming language (e.g., Python, C++, PHP), that defines how a program works. For your project to be considered open source, you must make your source code fully available to the public. We recommend that you prepare your source code based on the guidelines outlined below.
- Save your code in the original language and common file format for that language. Do not run your code through a translator, compressor/minifier, or obfuscator.
- Structure your code in a clear and logical way so that others can read and modify it. Clean your code so that it does not contain duplication or code that is irrelevant to the functioning of the program.
- Use consistent code conventions and clear function, method, and variable names. Every programming language has recommended style conventions and standards that, when followed, improve the readability of the code and make software easier to maintain. Awesome-guidelines, a user-created repository on GitHub, is a list of coding style conventions and standards organized by programming language and development environment/platform.
- Include inline comments that explain or clarify parts of the code that might not be clear to others. Remember that every programmer has their own style of writing code. Even experienced programmers need some guidance reading and understanding other people’s code.
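The guidelines above can be illustrated with a short example. This is a sketch in Python that follows that language's common style conventions (descriptive lowercase function names, a named constant in capitals, a docstring, and a comment that explains why a choice was made). The function and constant names are invented for illustration.

```python
# Conversion factor: one acre-foot is about 325,851 US gallons.
GALLONS_PER_ACRE_FOOT = 325_851


def acre_feet_to_gallons(acre_feet):
    """Convert a water volume from acre-feet to US gallons."""
    if acre_feet < 0:
        # A negative volume almost always signals a data-entry error
        # upstream, so fail loudly instead of returning a misleading number.
        raise ValueError("acre_feet must be non-negative")
    return acre_feet * GALLONS_PER_ACRE_FOOT
```

Compare this with an equivalent one-line function named f that multiplies its argument by an unexplained number: both run, but only the first tells a new reader what the number means and why negative input is rejected.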
Documentation
Documentation is broadly defined as any separate documents or materials related to the development or use of the program. It is an essential component of open source software. Without documentation explaining how your software works, it will be difficult for others to use and adopt it.
Documentation should state the intended purpose of the software and give any relevant technical details on how to build, install, or use it. It is highly recommended that you also include any other materials that could help others understand your program better. The four main types of documentation are:
- Requirements documentation - Describes what the software does and how it is intended to operate.
- Architecture design documentation - Describes the components of the system and how they relate to each other. Includes the conceptual, logical, and physical designs of the system (e.g., network diagrams, entity relationship diagrams).
- Technical documentation - Describes the code, algorithms, known issues/bugs, dependencies, etc.
- User documentation - Describes how to install and use the software. Includes coding examples and user guides.
The amount of documentation to include depends on the size and scope of your project. You might not need all four types listed above. At a minimum, you should include a README file and a CONTRIBUTING file, and then add other files as appropriate.
- README.txt / README.md - Provides basic information about your project, including what the project is, the goal of the project, how to use the code/software, and any dependencies (if applicable).
- CONTRIBUTING.txt - Provides information about how others can contribute to your project. It explains the type of contributions needed and outlines the process by which others can submit their contributions.
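A quick check before publishing can confirm that these two files exist in the top level of the project folder. The sketch below is one possible approach in Python; the function name missing_docs is hypothetical, and accepting CONTRIBUTING.md as an alternative to CONTRIBUTING.txt is an assumption based on common practice rather than a requirement stated in this handbook.

```python
from pathlib import Path

# Accepted file names for each required document.
# CONTRIBUTING.md is an assumed alternative, not named in this handbook.
REQUIRED_DOCS = {
    "README": ("README.md", "README.txt"),
    "CONTRIBUTING": ("CONTRIBUTING.md", "CONTRIBUTING.txt"),
}


def missing_docs(project_dir):
    """Return the labels of required documents not found in project_dir."""
    root = Path(project_dir)
    missing = []
    for label, candidates in REQUIRED_DOCS.items():
        # A document counts as present if any accepted file name exists.
        if not any((root / name).is_file() for name in candidates):
            missing.append(label)
    return missing
```

Calling missing_docs(".") from the project folder returns an empty list when both files are present. The check only confirms that the files exist; it says nothing about whether their content is complete.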
Open Source License
A core component of open source software is the open source license. An open source license is a type of license that allows the source code and design of a program to be shared and used under defined terms and conditions. The license guards against restrictions on the use of the software, allows the software to be freely modified and redistributed without additional permission, and disclaims liability for any damages claimed in connection with the software.
Specifying the license is important for securing the rights to publish code as open source software and for extending those same rights to others who use your code.
This section is a work in progress! Stay tuned for further developments.
Publishing Your Code
The most widely used platform for publishing open source code is GitHub. GitHub is a website and cloud-based service that provides hosting for software development and version control using Git. It has more than 40 million registered users around the world, including established companies such as Adobe and Twitter. Other websites with similar functionality to GitHub include GitLab and Bitbucket. The following guidance is specific to GitHub, but the same concepts apply across platforms.
To get started with GitHub, sign up for a free account on the GitHub website. If you are new to Git and GitHub, you will need to familiarize yourself with both systems. Git/GitHub can be complicated to learn for someone brand new to the technology, but there are many free resources available online, including guides and video tutorials. Some are listed below, but we encourage you to explore other resources beyond what is presented here.
Websites like GitHub, GitLab, and Bitbucket can host code for projects of all sizes, including single scripts and enterprise-level software. OIMA maintains a GitHub organization account, the California Water Board Data Center, that is open to all Water Boards staff. If you are interested in joining the California Water Board Data Center organization, contact email@example.com.
If you are publishing enterprise-level software, we recommend that you submit your code to the California Open Source Portal. This portal, which was established by CDT in May 2018, is a repository of open source enterprise code produced by California state agencies, departments, and partners. It was designed to support state government-wide reuse and encourage public collaboration.
There are additional requirements for submitting code to the California Open Source Portal. Because the code is specifically for enterprise use, the instructions on the Portal website ask that contributors upload their code to a GitHub organization account, not a personal GitHub account. The username of this organization account should reference the State agency/entity, not a specific employee (e.g., CADeptofTechnology). The Water Boards do not yet have a dedicated GitHub organization account for uploading enterprise code to the California Open Source Portal, but OIMA can help create and maintain a new organization account if there is interest. For more information, contact firstname.lastname@example.org.