A Journey at Data.gov by nickumia
Disclaimer: The thoughts and ideas conveyed in this blog are those of the author and do not reflect those of GSA, REI, Data.gov or any other organization that the author is part of.
It's been quite a journey over the last 2 years. From my onboarding ticket until now, a lot has changed about the world, Data.gov, life in general and about me as well. Let me start by saying that this is not a "farewell" or "final words" post because I will continue to be active (in varying capacities) in Data.gov. In the days to follow, that will mostly mean guidance on issues and background research on design choices that the Data.gov team will face. However, long-term, I would love to contribute in an open-source fashion once some additional work is done to solidify the direction. Data.gov is a complex program. Even if you consider the metadata problem that Data.gov addresses to be easy, the non-technical aspects of connecting people with data introduce sooo many development intricacies. I'm proud to say that I have supported Data.gov and I know Data.gov is super important in fostering collaboration on a national level. In recent times, the challenges of supporting Data.gov have revolved around fortifying the underlying infrastructure and processes to ensure adequate time and energy can be spent on the real work.
We've tried to organize the wealth of knowledge and experience in the Data.gov Wiki, the Data.gov Google Drive and across sooo many repos; however, the nature of the problem is that there's just so much to know and be familiar with. I'll make lots of references throughout this post and I hope that most of them will be preserved into the future; however, please reach out to @nickumia if anything is broken and I can help find things too!
A good place to start is the list of issues I've contributed to over the years. I spent 'some' (read: a long) time revisiting all of my issues and labeling them to give context and to surface the bits of knowledge that are embedded in the foundation of the Data.gov codebase. There is a natural hierarchy of systems of systems.
mindmap
  root((Data.gov))
    Data Applications
      CKAN
        catalog.data.gov
        inventory.data.gov
      SSB
        Solr
        EKS
        SMTP
      Static Site
        data.gov
    Ideals
      Mission & Vision
    Tools
      Security
        Compliance
        Use Latest
      Notifications
      O&M
      Logging
    Development
      Testing
      Feature
      Bug
      CI/CD
I debated the best representation of the label hierarchy, but I think the mindmap above works well. To give more detail about the types of work:
- Categories of Work:
- 🥇 Features: Aspects of development that represent some real value presented to one of our stakeholders. They can be tiny (such as Creating a Homepage Message) or pretty big (such as Designing Solr Infrastructure).
- 🥈 Bugs: We deal with a lot of bugs... I would say some features have been historically hidden as bugfixes, but that's just the nature of data.gov systems. Features existed and then broke over time or with security changes and the "bugfix" revives functionality.
- 🖇️ Compliance and (special mention, ATO Condition): These have typically been working with GSA IT on various security findings. We have instances of limiting features because solving the compliance burden was too costly.
- 🌊 Use Latest: This has a slight crossover with `compliance`; however, this is more about keeping our code modern, relevant and up-to-date. It helps keep things smooth and functioning as software depreciates rapidly by the second.
- 🧪 Testing: This is a tiny mashup of tickets. It includes testing out features of programming languages, new technologies to see if they work for us. It also includes creating real tests for existing technology or software. It can be summarized as uncertainties that we've highlighted and needed to address directly.
- Aspects of Work:
- 🦖 CKAN: Because CKAN has been such an integral part of operations, there's a lot of work that is CKAN-specific. As others on the team know, getting spun up with CKAN is a dedicated role by itself, let alone understanding how we use CKAN. There are a LOT more tickets related to CKAN that we haven't necessarily tagged historically. I think the ones I've worked on should be a good sampling of the variety of work involved there though.
- 🍽️ CI/CD: One of the biggest things that data.gov needs consistently is infrastructure evaluations and improvements. I called this work "CI/CD" instead of "Infrastructure" because infrastructure can exist in isolation and we don't manage infrastructure directly as much as we integrate with LOTS of different types of infrastructure.
- 📓 Logging and Notifications: These have generally been lacking as an area of development. Having meaningful messages in the right context makes development, operations, maintenance and every other aspect of our jobs easier and more effective. Knowing how to manage all of the information has been a struggle though and processing it in a reliable way is difficult.
- 👽 O&M: Leading from the last group... having personally seen the downfalls of two very different deployment stacks, this role has been the most annoying... Knowing what to do and how to make sure things are running smoothly is a struggle. Data.gov needs to figure out what's important to it, so that it can do proper Site Reliability Engineering and have proper SLAs/SLOs and structure releases...
- 🤗 UI/UX: This is a fun category. If you're curious how people interact with data.gov and how successful we've been making a relevant UI/UX, this is the category for you. These typically deal with accessibility, data presentation and data usability.
- 🗺️ Explore: This is probably the most interesting category in the sense that it spotlights work that shows the growth potential of data.gov (from AI/NLP analysis to open-source collaborations to just learning about interesting integrations between data.gov and the world). This is lacking in substance because it wasn't part of our process historically. The ones that exist (at the time of writing) are just a subset of things that I touched during my time.
- 📶 Mission & Vision: This is, by far, my favorite category. It speaks to how we are advancing the mission of data.gov (see links below).
- Applications
- 🌎 catalog.data.gov: Our flagship application. Too much to describe, Full-Stack DevSecOps to maintain everything that is "catalog". Contains public and private elements.
- 📨 inventory.data.gov: Our secondary application, mainly for data providers. Completely private application (except for existing on the public internet 😒).
- 👾 Harvesting: One of the energy drains from the data.gov team. We're supposed to be harvesting... but it feels like we are being harvested 😢
- 🦯 Egress Proxy: This exists because of a compliance gap in cloud.gov. It's not a big thing.. except when it breaks things..
- 🔍 Solr: I don't even want to talk about it 🤐
- 📦 SSB: This is a combination of the SSB Managed service that orchestrates the usage of various Brokers. Bret liked to draw a distinction between the "managed" boundary and the "application" boundary; however, we never made tags for `eks` or `smtp`, so... I grouped them in here (Sorry, not sorry, Bret 😅) This is another energy drain of the data.gov team though... sooo much abstraction in soooo many repos with soooo many technologies bundled together 😭
- 📜 Data.gov Homepage: The static site has evolved a lot over the years. I think the current 11ty platform is very robust in offering what we need. But this is just everything related to having the landing page and being able to summarize what we do.
- That is a private slack link of a message from me; however some snippets that I think are powerful:
- One of our biggest problems at Data.gov stems from our need to serve many different stakeholders. At the minimum, we've got (1) Agencies publishing data, (2) Agencies assessing their open data footprint (OMB and whoever else needs the reporting dashboard), (3) End users who are actually trying to retrieve data. As long as I've been around the static site has only been highlighting one aspect of data.gov (catalog.data.gov).
- To facilitate maximum collaboration, I think it's only fair to be transparent about all of the groups involved in the process.
- ...creating a brand for data.gov that revolves about our entire pipeline as opposed to the most "visible" part [catalog.data.gov] would help everyone to understand where the data comes from and where it is going and how to make it better.
- Data Providers, Data Consumers and Policymakers
I didn't even know that I was going to write so much about our different "workstreams" (to borrow Tyler's word 😉). I couldn't have even imagined all of this when I attempted to track my onboarding experience. And even this is not the complete story. My personal GitHub (@nickumia) will remain active and I am very excited to be called on for knowledge and expertise about ushering Data.gov into a better future. For what it's worth right now, here's a list of important documents or references that help unify some of the disparate knowledge that makes up Data.gov:
- 🌕 Table of Systems: List of applications/services/codebases that make up data.gov
- 🌑 Cloud.gov APP Tracker: List of applications per space with memory usage to check on cloud.gov quota
- 🌑 Data.gov Harvesting Requirements: One of my babies at data.gov ❤️ I'm hoping this can be made public one day (whether informally OR if it gains traction in the government and becomes official)
- 🌑 Airflow Preparation Doc: Comments on Harvesting 2.0
- 🌕 Data.gov Actions Tracker: General state of main CI/CD pipelines
- 🌕 `datagov-brokerpak-eks` Docs
- 🌕 `datagov-brokerpak-solr` Cleanup Script
- 🌕 ADR Comments: `ckanext-geodatagov` example
- 🌑 Ancient SSB Anatomy Video
- 🌑 "Recent" Knowledge Sharing
- 🌑 (Initial Data.gov Mission and Responsibilities)
- 🌑 (2023 - Data.gov Mission/Strategy links)
- 🌑 (Vision/Roadmap)
- I'm cheating and giving a sneak peek, but this is 🔥 ❤️ 🔥 ❤️ 🔥 ❤️ 🔥 ❤️ 🔥 ❤️
- A world where users can easily discover and use any U.S. government dataset intended for the public.
- ... (I may edit this section and add more at a later date)
I've gained a wealth of experience working on the Data.gov program and words cannot describe my respect for the Program Director, Hyon, and the close connections I cherish with developers on the team. All government systems share the struggle of being such a large operating body. Understanding the best way to make a contribution is hard. Sometimes, direction is absolutely needed from the top-down... Other times, constant effort can erode barriers from the bottom-up. I've put in my fullest effort to grow alongside Data.gov and the memories are invaluable. I can't solve all of the tech problems in the world; however, there's so much that can be achieved with effective collaboration, communication and knowledge sharing and I'm going to make that a reality in whatever way I can. One of the most unique things about Data.gov, compared to other projects I've worked on, is its open-source nature. I'm not entirely sure who this document is for; however, I'm hoping that it can be used as inspiration and motivation to achieve more together than anyone can achieve individually.