The Python Podcast.__init__

Scaling Knowledge Management For Technical Teams With Knowledge Repo

02.21.2022 - By Tobias Macey

Summary

One of the most persistent challenges faced by organizations of all sizes is the recording and distribution of institutional knowledge. In technical teams this is exacerbated by the need to incorporate technical review feedback and manage access to data before publishing. When faced with this problem as an early data scientist at Airbnb, Chetan Sharma helped create the Knowledge Repo project as a solution. In this episode he shares the story behind its creation and growth, how and why it was released as open source, and the features that make it a compelling option for your own team’s knowledge management journey.

Announcements

Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.

When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Your host as usual is Tobias Macey, and today I’m interviewing Chetan Sharma about Knowledge Repo, an open source framework for managing documentation for technical users.

Interview

Introductions

How did you get introduced to Python?

EE + CS/AI + Stats degrees

Airbnb working on ML models

Knowledge Repo itself

Can you describe what Knowledge Repo is and the story behind it?

We started seeing interviewees use IPython notebooks, thought they were great

Wanted to push more people to use notebooks, but they weren’t very shareable or vettable

Existing notebook hosting services weren’t very good, and weren’t built for people who aren’t data stakeholders. They were especially poor with images and cluttered with annoying cell blocks

Made a simple post processor to remove cell blocks, push the images to S3, and host on Flask

Once we were pushing notebooks into a GitHub repo for hosting on a Flask app, so many things became possible

Review cycles

Shareability / collaboration features

Indexing / searching

Concurrently, great work was happening on developing internal R packages / Python libraries to provide consistent, branded aesthetics
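The post processor described above (drop the code cells, pull the images out so they can live elsewhere) can be sketched roughly as follows. This is a minimal illustration and not Knowledge Repo's actual code: the function name `notebook_to_markdown`, the `images/figure_N.png` naming, and the choice to return image bytes instead of uploading them to S3 are all assumptions made for the example. It relies only on the public nbformat 4 JSON layout (`cells`, `cell_type`, `source`, `outputs`, `data`).

```python
import base64


def _text(source):
    # nbformat stores text either as one string or as a list of lines.
    return source if isinstance(source, str) else "".join(source)


def notebook_to_markdown(notebook, image_prefix="images"):
    """Return (markdown_text, images) for a parsed nbformat 4 notebook dict.

    Code cell sources are dropped; markdown cells and PNG outputs are kept.
    """
    parts = []
    images = {}  # hypothetical asset name -> raw PNG bytes
    for cell in notebook.get("cells", []):
        cell_type = cell.get("cell_type")
        if cell_type == "markdown":
            parts.append(_text(cell.get("source", "")))
        elif cell_type == "code":
            # Skip the code itself, but keep any PNG the cell produced.
            for output in cell.get("outputs", []):
                png = output.get("data", {}).get("image/png")
                if png is None:
                    continue
                name = f"{image_prefix}/figure_{len(images) + 1}.png"
                images[name] = base64.b64decode(_text(png))
                parts.append(f"![figure]({name})")
    return "\n\n".join(parts), images
```

A caller would parse the `.ipynb` file with `json.load`, pass the resulting dict in, and then decide where the returned image bytes go (cloud storage in the setup described in the episode).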

What are some of the approaches that teams typically take for recording and sharing institutional knowledge?

Copy and paste into Google Docs or slides

Facebook was using Facebook photo albums

Untrustworthy, not discoverable, divorced from the code

What are the unique requirements that are introduced when attempting to record and distribute learnings related to data such as A/B experiments, analytical methods, data sets, etc.?

Reproducibility is a big one

Making sure the learnings are trustworthy (good data? no bugs?)

Distributing widely, across the org and across time

Experimentation

Experimentation is at the end of a research-design-build-measure cycle, strategic analysis is often before

Capturing all of the context

Can you describe how the Knowledge Repo project is architected?

Repositories: a store of posts, most commonly a GitHub repo

Markdown as original lingua franca, eventually a KR specific “KR post” concept (which is still basically markdown)

Post processors

Convert whatever upstream file to markdown / KR post (Jupyter notebook, R Markdown, markdown were the original ones)

Handle images and other large assets, usually pushing them to cloud storage

Evolved to handle PDFs, Google Docs, Keynote files
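The architecture outlined above (a repository that stores posts, plus per-format post processors that turn upstream files into markdown) might be modeled along these lines. Every name here (`Post`, `Repository`, the `converters` table, the `.txt` converter) is hypothetical and chosen for illustration; none of it is Knowledge Repo's real API, and the in-memory dict stands in for the GitHub repo mentioned in the notes.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class Post:
    path: str      # location of the post inside the repository
    markdown: str  # body after conversion to markdown


def markdown_passthrough(raw):
    # Markdown is already the target format.
    return raw


def plain_text_to_markdown(raw):
    # Stand-in converter: treat the first line as the title.
    title, _, body = raw.partition("\n")
    return f"# {title.strip()}\n\n{body.strip()}"


class Repository:
    """In-memory stand-in for a store of posts."""

    def __init__(self):
        # One post processor per upstream file type.
        self.converters = {
            ".md": markdown_passthrough,
            ".txt": plain_text_to_markdown,
        }
        self.posts = {}

    def add(self, path, raw):
        suffix = PurePosixPath(path).suffix
        if suffix not in self.converters:
            raise ValueError(f"no post processor registered for {suffix!r}")
        post = Post(path=path, markdown=self.converters[suffix](raw))
        self.posts[path] = post
        return post

    def search(self, term):
        # Naive substring search over converted posts.
        term = term.lower()
        return sorted(
            path for path, post in self.posts.items()
            if term in post.markdown.lower()
        )
```

Supporting a new format in this sketch means adding one entry to `converters`, which mirrors the idea in the notes that new sources are handled by writing another post processor.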

What were the motivating factors for making it available as an open source project?

It was such a common problem. Even incredibly sophisticated data teams at Uber, Facebook, etc. were begging us to share the system.

What is the workflow for creating, sharing, and discovering information in an installation of Knowledge Repo?

Create a GitHub repo for hosting strategic analysis

Use the KR script to create a stub/template for whatever format you’re working in

Do your work in Jupyter, etc.

Instead of using the git commands (git add), use the knowledge commands (knowledge add), which are basically the git commands with postprocessors

Do typical GitHub workflows

See the result in the hosted knowledge repo app
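The "git add plus postprocessors" step in the workflow above can be pictured as a thin wrapper: run the post processors over the file, write the converted result, then stage it with git. The sketch below is an assumption-heavy illustration, not the real `knowledge add` command; the `knowledge_add` function, the `.kr.md` output name, and the injectable `stage` callable are all hypothetical. Only `git add` itself is a real command.

```python
import subprocess
from pathlib import Path


def git_add(path):
    # Plain `git add`; must be run inside a git working tree.
    subprocess.run(["git", "add", str(path)], check=True)


def knowledge_add(source, post_processors, stage=git_add):
    """Post-process a source file, write the result next to it, then stage it."""
    source = Path(source)
    text = source.read_text(encoding="utf-8")
    for process in post_processors:
        # Each post processor takes text and returns transformed text.
        text = process(text)
    # Hypothetical naming convention for the converted post.
    target = source.with_name(source.stem + ".kr.md")
    target.write_text(text, encoding="utf-8")
    stage(target)
    return target
```

Passing a different `stage` callable (for example a list's `append`) lets the wrapper be exercised without a git repository; after staging, the usual commit, pull request, and review steps apply unchanged.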

What are some of the options available for extending or customizing an installation of Knowledge Repo?

More postprocessors! Google Docs, presentations, UX research: anything can be done in KR with a simple postprocessor to turn it into markdown/images/PDF

Tying the system to your internal data tools. For example, an experimentation system like Eppo or whatever you use for marketing campaigns

If you were to start over today, what are some of the ways that you might approach the solution to knowledge management differently?

Think of it more holistically:

What are the most interesting, innovative, or unexpected ways that you have seen Knowledge Repo used?

UX research

Writing up a guide for acquihiring

Demonstrating capabilities, data framework

What are the most interesting, unexpected, or challenging lessons that you have learned while working on Knowledge Repo?

Strategic analysis needs to be elevated; this leads to paradigm changes

Organizational problems are helped by tools like KR, e.g. promotions

Meeting people’s tools/workflows where they are is powerful

When is Knowledge Repo the wrong choice?

Keep In Touch

LinkedIn

@chesharma87

Picks

Tobias

Learning Guitar

Chetan

Underrated cooking ingredients: chickpea flour, butter fried kimchi (in grilled cheese, nachos)

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.

Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.

If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

To help other people find the show please leave a review on iTunes and tell your friends and co-workers.

Links

Eppo

Data Engineering Podcast Episode

Knowledge Repo

IPython

Jupyter

Flask

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
