02.21.2022 - By Tobias Macey
Summary
One of the most persistent challenges faced by organizations of all sizes is the recording and distribution of institutional knowledge. In technical teams this is exacerbated by the need to incorporate technical review feedback and manage access to data before publishing. When faced with this problem as an early data scientist at Airbnb, Chetan Sharma helped create the Knowledge Repo project as a solution. In this episode he shares the story behind its creation and growth, how and why it was released as open source, and the features that make it a compelling option for your own team’s knowledge management journey.
Announcements
Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle-tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Your host as usual is Tobias Macey and today I’m interviewing Chetan Sharma about Knowledge Repo, an open source framework for managing documentation for technical users.
Interview
Introductions
How did you get introduced to Python?
EE + CS/AI + Stats degrees
Airbnb working on ML models
Knowledge Repo itself
Can you describe what Knowledge Repo is and the story behind it?
We started seeing interviewees use IPython notebooks, thought they were great
Wanted to push more people to use notebooks, but they weren’t very shareable, vettable
Existing notebook hosting services weren’t very good, and weren’t built for people who aren’t data stakeholders. They handled images especially poorly, and the cell blocks were annoying
Made a simple post processor to remove cell blocks, push the images to S3, and host on Flask
Once we were pushing notebooks into a GitHub repo for hosting on a Flask app, so many things became possible
Review cycles
Shareability / collaboration features
Indexing / searching
Concurrently, great work was happening on developing internal R packages / Python libraries to provide consistent, branded aesthetics
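The simple post processor described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the standard nbformat v4 JSON layout, and the function name notebook_to_markdown and the image naming scheme are made up for the example. The upload to S3 is left out; the function only returns the image bytes that would be uploaded.

```python
from __future__ import annotations

import base64


def notebook_to_markdown(notebook: dict) -> tuple[str, dict[str, bytes]]:
    """Render a notebook (nbformat v4 dict) as Markdown without code input.

    Returns the Markdown text and a mapping of image file name -> PNG bytes.
    Hypothetical helper for illustration; not part of Knowledge Repo's API.
    """
    parts: list[str] = []
    images: dict[str, bytes] = {}

    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        # nbformat allows source to be a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)

        if cell.get("cell_type") == "markdown":
            parts.append(source)
        elif cell.get("cell_type") == "code":
            # Drop the code itself; keep only PNG image outputs.
            for output in cell.get("outputs", []):
                png = output.get("data", {}).get("image/png")
                if png is None:
                    continue
                name = f"image_{len(images)}.png"
                images[name] = base64.b64decode(png)
                # Reference the extracted file instead of inlining base64.
                parts.append(f"![{name}]({name})")

    return "\n\n".join(parts), images
```

In a real deployment the returned images would be pushed to cloud storage and the references rewritten to the resulting URLs before the Markdown is served.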
What are some of the approaches that teams typically take for recording and sharing institutional knowledge?
Copy and paste to Google Docs, slides
Facebook was using Facebook photo albums
Untrustworthy, not discoverable, divorced from the code
What are the unique requirements that are introduced when attempting to record and distribute learnings related to data such as A/B experiments, analytical methods, data sets, etc.?
Reproducibility is a big one
Making sure the learnings are trustworthy (good data? no bugs?)
Distributing widely, across the org and across time
Experimentation
Experimentation is at the end of a research-design-build-measure cycle, strategic analysis is often before
Capturing all of the context
Can you describe how the Knowledge Repo project is architected?
Repositories: a store of posts, most commonly a GitHub repo
Markdown as original lingua franca, eventually a KR specific “KR post” concept (which is still basically markdown)
Post processors
Convert whatever upstream file to markdown / KR post (Jupyter notebook, R Markdown, markdown were the original ones)
Handle images and other large assets, usually pushing them to cloud storage
Evolved to handle PDFs, Google Docs, Keynote presentations
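One way to picture the post processor layer is a registry that maps each upstream file type to a converter producing a common Markdown-based post. The sketch below is an assumption about the general shape, not Knowledge Repo's real class hierarchy: Post, CONVERTERS, converter, and convert are hypothetical names, and the plain-text converter is an invented stand-in for a richer format.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable


@dataclass
class Post:
    """A converted post: Markdown body plus any extracted assets."""
    markdown: str
    assets: dict[str, bytes] = field(default_factory=dict)


# Maps a file extension to the converter that handles it.
CONVERTERS: dict[str, Callable[[str], Post]] = {}


def converter(extension: str):
    """Register a function as the converter for one file extension."""
    def register(func):
        CONVERTERS[extension] = func
        return func
    return register


@converter(".md")
def from_markdown(text: str) -> Post:
    # Markdown is already the target format, so pass it through.
    return Post(markdown=text)


@converter(".txt")
def from_plain_text(text: str) -> Post:
    # Invented extra format: each blank-line-separated block is a paragraph.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return Post(markdown="\n\n".join(paragraphs))


def convert(filename: str, text: str) -> Post:
    """Pick a converter based on the file extension and run it."""
    extension = Path(filename).suffix.lower()
    if extension not in CONVERTERS:
        raise ValueError(f"no converter registered for {extension!r}")
    return CONVERTERS[extension](text)
```

Supporting a new upstream format then means registering one more function, which is roughly why formats could be added incrementally.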
What were the motivating factors for making it available as an open source project?
It was such a common problem. Even incredibly sophisticated data teams at Uber, Facebook, etc. were begging us to share the system.
What is the workflow for creating, sharing, and discovering information in an installation of Knowledge Repo?
Create a GitHub repo for hosting strategic analysis
Use the KR script to create a stub/template for whatever format you’re working in
Do your work in Jupyter, etc.
Instead of using git commands (git add), use the knowledge commands (knowledge add), which are basically the git commands with postprocessors
Do typical GitHub workflows
See the result in the hosted knowledge repo app
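The stub created in the second step is essentially a Markdown file with a metadata header. The sketch below generates one. The field list (title, authors, tags, created_at, updated_at, tldr) follows the YAML header that Knowledge Repo posts carry, but treat it as an assumption and check the project's own templates; make_stub is a hypothetical helper, not the project's script.

```python
from __future__ import annotations

from datetime import date


def make_stub(title: str, authors: list[str], tags: list[str], tldr: str,
              today: date | None = None) -> str:
    """Build a Markdown stub with a YAML-style metadata header.

    Values are written as-is, so text containing YAML special
    characters (such as a colon) would need quoting by the caller.
    """
    today = today or date.today()
    lines = ["---", f"title: {title}", "authors:"]
    lines += [f"- {author}" for author in authors]
    lines.append("tags:")
    lines += [f"- {tag}" for tag in tags]
    lines += [
        f"created_at: {today.isoformat()}",
        f"updated_at: {today.isoformat()}",
        f"tldr: {tldr}",
        "---",
        "",
        "Write the analysis here.",
    ]
    return "\n".join(lines) + "\n"
```

Because the header travels with the post, the hosted app can index authors and tags without a separate database entry step.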
What are some of the options available for extending or customizing an installation of Knowledge Repo?
More postprocessors! Google Docs, presentations, UX research, anything can be done in KR with a simple postprocessor to turn it to markdown/images/PDF
Tying the system to your internal data tools. For example, an experimentation system like Eppo or whatever you use for marketing campaigns
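As an illustration of tying posts to an internal tool, a custom postprocessor could turn experiment references in the Markdown into links. Everything specific here is an assumption made for the example: the experiment:1234 token convention, the URL, and the function name link_experiments are invented, and this is not a Knowledge Repo or Eppo API.

```python
import re

# Hypothetical internal URL; replace with your own experimentation tool.
EXPERIMENT_URL = "https://experiments.example.internal/{id}"

# Matches tokens such as "experiment:1234" (a made-up convention).
EXPERIMENT_REF = re.compile(r"\bexperiment:(\d+)\b")


def link_experiments(markdown: str) -> str:
    """Replace experiment tokens with Markdown links to the internal tool."""
    def to_link(match: re.Match) -> str:
        exp_id = match.group(1)
        return f"[experiment {exp_id}]({EXPERIMENT_URL.format(id=exp_id)})"

    return EXPERIMENT_REF.sub(to_link, markdown)
```

Run as the last step after conversion to Markdown, a function like this works the same regardless of which upstream format the post came from.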
If you were to start over today, what are some of the ways that you might approach the solution to knowledge management differently?
Think of it more holistically:
What are the most interesting, innovative, or unexpected ways that you have seen Knowledge Repo used?
UX research
Writing up a guide for acquihiring
Demonstrating capabilities, data framework
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Knowledge Repo?
Strategic analysis needs to be elevated, this leads to paradigm changes
Organizational problems are helped by tools like KR, e.g. promotions
Meeting people’s tools/workflows where they are is powerful
When is Knowledge Repo the wrong choice?
Keep In Touch
@chesharma87
Picks
Tobias
Learning Guitar
Chetan
Underrated cooking ingredients: chickpea flour, butter fried kimchi (in grilled cheese, nachos)
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Links
Eppo
Data Engineering Podcast Episode
Knowledge Repo
IPython
Jupyter
Flask
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA