Tools like Eclipse's Java Browsing Perspective (https://querix.com/go/beginner/Content/05_workbench/01_ls/02_interface/01_perspectives/java_browsing.htm) and Visual Studio's Object Browser make it easier to browse a codebase by its components.
What are the ways you do it? Is there anything similar available for functional programming?
1. I go straight to the entry point, main(), and then follow how the initial configuration, flow of data, sanitisation, and routing are done.
2. I look for bugs. Fixing bugs reveals the complexity as you need to look for side effects of the fixes when you don't yet know the system. Writing tests for those fixes also helps understand the system.
3. I look for the least changed parts. I find these are usually the oldest and most core parts of how the program works, whereas more recent changes are business logic and feature additions.
But of these, the first yields the greatest initial understanding and allows me to change things with less fear.
How do you think of #1 in the context of a web app? Each page is essentially a different entry point into the code.
I guess the landing page or authentication page could be considered the equivalent, but I’m not sure those would hit your goals to understand flow of data, etc. ?
Yup, I've stopped looking for the entry point in SPAs and apps. Often the navigation flow can be extremely convoluted.
Android introduced navigation graphs, which were meant to solve this problem. But what happened in reality was that instead of arrows pointing to each screen, there were multiple islands teleported in from random places.
It makes even more sense for web apps. It will tell you how it came to be that a web page received the request, what context it has, how the cookie is resolved to an auth context, how to access any query args, and what is available to support the page.
These are great.
A 4th way I would add is: If you need to make a minor change or understand how one specific function is expected to work, search for its unit tests and start there.
#1 is my method too. I also take it as a sign learning the code base is going to be difficult if I have to ripgrep to find main because they couldn't put it in a cleanly named file like main.c =)
1st is exactly what I do while reminding myself that I don't need to understand everything. Somehow for me it's easy to think like that in OO code and not FP.
Try to profile the code and see where it's spending time.
Get a good flame graph up, and you'll have a really solid visual representation of what's going on.
Bonus: on almost any project, nobody has done a profiling pass in at least a few months, so you'll probably discover some extremely easy performance improvements and you'll look like a goddamn hero when you speed up e.g. the test suite by a factor of 3 in your first week on the job.
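The flame-graph tooling really is language-specific, but the basic pass is the same everywhere: wrap the workload in a profiler, sort by cumulative time, and read the top of the report. A minimal sketch using Python's built-in cProfile (the `slow_sum` workload is made up for illustration):

```python
import cProfile
import io
import pstats

# a deliberately slow function standing in for real application code
def slow_sum(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100_000)
profiler.disable()

# sort by cumulative time: the top entries are where the program lives
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(3)
report = stream.getvalue()
print(report)
```

For the actual flame-graph picture, a sampling profiler like py-spy (`py-spy record -o profile.svg -- python app.py`) generates the SVG directly; the cProfile report above is the dependency-free version of the same idea.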
Hey, I'd like to get a grasp of how to do what you're explaining, would you have any link or resource by chance? thanks!
This is my favourite explanation of flame graphs: https://rbspy.github.io/profiling-guide/using-flamegraphs.ht...
Good resource! Alas, the specifics really depend on the language.
Definitely good advice. Profiling is something I haven't looked into, and should, right after debugging.
I did this and flamegraphs in Rust are not great due to rayon :(
The best way is probably to make changes to it, because that forces you to really understand the code. If you just read it without making changes, it's too easy to pretend that you understand it. If there is one part of the code that you are trying to understand, I can recommend writing a replacement for it; then you at least learn how it works, and you might even get a better solution.
(This might be more abstract than you wanted)
Hopefully there's an overview of the code base in an `ARCHITECTURE.md` file; read through it, then the respective documentation and tests for the main modules it mentions.
If you assume their tests cover the important business logic / stuff "they want to keep" (ref. "Beyonce Rule"), they should inform you about the most important stuff.
“Hopefully” … 0% chance
Someone can explain a code path, what it should do, what the bug is and with that you can get familiar with a path through the application.
Ask around about the big things everyone would like to change and where the scary code is that nobody wants to touch, and with those things in mind, fix some bugs.
The initial questions will tell you what to avoid initially. Longer term, if you can fix them you'll look like a rock star.
Fixing bugs has worked out for me, although I had to pull my hair out.
Ideally you have someone experienced in the codebase who can give a lay of the land.
1. Find a senior dev and ask them for existing pointers to good documentation for self-learning.
2. Give that a go, and make note of all the questions you have.
3. Then have a session with that dev for a platform walkthrough. Take lots of notes and ask your questions.
4. Offer to update docs where you found errata, missing steps, or even whole topics not mentioned.
5. Suggest to the team anything about onboarding that can be improved.
This is exactly what I would do, ideally, except I couldn't. I can understand that working in startups you overlook a lot of theory, and due to the hectic nature, low-quality calls (code/architecture explanations) are not appreciated.
And that's why the first thing I ask during a technical interview is "Do you have internal documentation?".
Just mapping the whole codebase without a specific goal in mind seems counterproductive to me.
Instead, I get myself a couple specific tiny bugfixes/features to do first. Just finding out where those are, one by one, tells you a lot and may not be as simple as it sounds.
I was once hired to help with polishing a code base for imminent shipping. I fixed one bug. The fix was one line, but not trivial at all. It took me a whole week of reading code. The customer was ecstatic. There were like 12-15 years' worth of layers of code to read.
If the codebase is ginormous and hard to decipher, then you could use the magic of source control to go back in time to an early point in the codebase. It's probably going to be easier to understand a codebase that is 3 months old vs 6 years old, so you could go and check out that version, understand it, and then jump forward a few years. This also gives you the benefit of understanding the evolution of the code: why it is the way it is, not just what it is.
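The git commands for this time travel are short. A self-contained sketch (it builds a throwaway repo so the example runs anywhere; in a real codebase you'd just run the last three lines in your checkout):

```shell
set -e
# throwaway repo standing in for a real project with years of history
repo=$(mktemp -d) && cd "$repo" && git init -q
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }
g commit -q --allow-empty -m "v1: core engine"
g commit -q --allow-empty -m "v2: six years of features"

first=$(git rev-list --max-parents=0 HEAD)  # the root commit
git checkout -q "$first"                    # the codebase at its simplest
git log --oneline | cat                     # history visible from here: just v1
```

From the early checkout, `git log --oneline <first>..<default-branch>` shows everything you'll be jumping forward through.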
As with most things depends on the code base, but if documentation in whatever form is available, I'd start there rather than just jumping into code.
I'd start from the highest level abstraction of the code and work downwards until I reach a domain I'm either interested in or asked to work on and specialize on that vertical for a while (this can be anything from a few weeks to say 6 months). I then repeat the process on other verticals if needed/wanted.
So going from highest level of abstraction down to actual code:
1. read docs or converse with others around what the value proposition/s are of the product/service/app.
2. Understand the main use-cases or if not obvious, read product brochures or w/e you have in terms of "sales" material for end-users.
3. Try to map the main use-cases back to high-level architecture diagrams (if available).
4. After doing above steps if there are multiple domains I would pick one based on either personal interest or assigned work.
5. When starting with a business domain (meaning some high level grouping of code based on their business function), I tend to focus first on the design of the persistence layers as its usually less dense and less sprawling than other parts of a code base and can give you some idea of state management.
6. From here I generally start up the service/s or apps related to the domain I'm studying and try to play around with it, trying to tie previous steps together in my mind with what I'm observing in my interactions.
7. At this point I would generally have documented my findings (whatever means / form it is done) and ask for a session with someone that's familiar with this domain and ask their opinion of my documentation, making corrections where needed.
8. After this it's generally best in my opinion to just jump into work.
9. Personally I find doing support work fixing bugs for about 6 months gives you a very good lay of the land and people.
Jumping straight into feature work is not optimal in my experience as it's less likely to provide as wide an array of exposure as support.
This obviously only fits certain scenarios, but for your garden variety product/s this is how I'd go about understanding the code base.
Oh, commit history is also a very very rich source of info if there's an established culture of good commit messages.
This is very well written. 1-3 describes my experience of tinkering in Pharo except I haven't actually built anything with it.
I scan everything that is closely or remotely related to the code base: the code, commit logs, diagrams, bug reports, change requests, user manuals, tests, technical logs, databases, other storage, cloud infra, ...
Usually at least one of them stands out, so I at least read this through (usually diagonally).
I might also pick different things based on my goals.
Once I think I have a grasp of the high level aspects, I start pairing or validate with tiny feedback loops.
Update: I also create my own (naive) helicopter view diagram of the context and validate it with people on different levels.
We have a tool that was built by a third party. They did an excellent job but for various reasons we needed to change vendors. I didn’t hire the old or new dev teams, so it wasn’t my role to tell them how to come up to speed. Early on they said they wanted to redo all the test cases, which seemed off to me (it’s too abstract, and why redo test cases for parts of the code that are unlikely to be modified). I said something but didn’t push it.
Someone on my team has been giving the dev team demos of the functionality and thinking behind the product a few days a week. My one request at the beginning was that they should learn enough about the product to be able to give a demo back to us. It took them about 2-3 weeks (maybe eight 45-minute overview sessions from my team, which owns the product requirements), but it showed that they know what it is the tool is supposed to do.
They spent another 3 weeks "getting comfortable" (6 weeks from start) before they finally felt comfortable enough to start implementing small features and bug fixes. I'd have preferred that they start fixing bugs right away (it might take 2-3 weeks to fix the first bug because they need to figure out how to get access to systems, documentation, deployment, etc.) because it's more tangible, but I know I'm impatient and let them do it this way. It seems to work ok so far, but it will be another month or so before I can decide whether or not they are actually competent. I guess the good news for them is they (a team of 10 in Eastern Europe) aren't being bugged by the client, and if they actually are good, they should be enjoying the freedom to do things their way and implement their own processes.
Besides all the great steps below, I like to browse the git repo to find the files that have been changed most recently and most often.
Projects, especially messy ones, often behave like lava flows where there is an active and ever expanding edge where changes are currently being made. Beneath this are layer upon layer of nearly impenetrable and often implicitly deprecated code from former developers.
This practice came from a time when I was brought in midway through a rewrite to get rid of unmaintainable code from some offshore contractors. I saw a repository where half the code lacked any organizing principles and had massive security issues. The second half was textbook (pedantic even) OOP, the kind taught in Java textbooks. It was beautifully executed except for using a few outdated tricks to do OOP in early versions of PHP (no longer needed in the version used for this project).
Because I didn't look at the dates, I assumed the neat OOP code was the result of the rewrite. I was wrong.
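The "most often changed files" query above is a one-liner over `git log`. A self-contained sketch (the throwaway repo just makes it runnable; in a real project you'd run only the final pipeline):

```shell
set -e
# throwaway repo with one stable file and one frequently touched file
repo=$(mktemp -d) && cd "$repo" && git init -q
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }
echo a > core.py && g add -A && g commit -qm "add core"
echo b > edge.py && g add -A && g commit -qm "add edge"
echo c >> edge.py && g add -A && g commit -qm "tweak edge"

# the lava-flow edge: files ranked by how often they've been committed
git log --format= --name-only | grep . | sort | uniq -c | sort -rn
```

Adding `--since="6 months ago"` to the `git log` narrows it to the currently active edge rather than all-time churn.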
If you are lucky (i.e., working with a mature codebase), tests are my number one go-to when getting started. I need to know the inputs/outputs of things. I work in the ML space, so YMMV. This allows me to make small changes and check my assumptions as I gain more confidence around the codebase.
In addition to already mentioned steps,
0) Read the code base docs (or README).
1) Pair with someone with knowledge of the code base or ask them to walk you through the code base.
2) Identify the public interface to interact with the app/api. How do consumers use the software. Play around with the app or api to get a sense of how things link up.
3) Identify the various tools used in the code base (db, messaging, external APIs, etc.). Now you know each tool is set up somewhere and used in one or more places.
4) Identify the patterns and conventions used (CQRS, mediator, dependency injection, middleware, pipelines, logging, etc). Now map the flow of each public interface using this knowledge.
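Step 4 above is easier once you can see the pattern in miniature. A toy sketch of a middleware pipeline wrapping a public endpoint (all names here are made up for illustration):

```python
# each middleware wraps a handler and returns a new handler
def logging_middleware(handler):
    def wrapped(request):
        print(f"-> {request}")          # observe every request passing through
        return handler(request)
    return wrapped

def auth_middleware(handler):
    def wrapped(request):
        if not request.get("user"):     # reject unauthenticated requests early
            return {"status": 401}
        return handler(request)
    return wrapped

def get_orders(request):
    return {"status": 200, "orders": []}

# the wiring: the public interface is the composed pipeline, not the raw handler
endpoint = logging_middleware(auth_middleware(get_orders))
print(endpoint({"user": "alice"}))
print(endpoint({}))
```

Once you recognize this shape in a codebase, "map the flow of each public interface" means finding where the composition happens and reading outward-in.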
If there is one, I often like to look at the database. The stuff that is stored and the names of the tables should give you a good idea of how the application works.
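A quick way to do this survey is to list every table with its row count; the big tables usually point at the core entities. A sketch against a throwaway in-memory SQLite schema (the `users`/`orders` tables are invented for the example):

```python
import sqlite3

# throwaway schema standing in for a real application database
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         user_id INTEGER REFERENCES users(id));
    INSERT INTO users (name) VALUES ('ada'), ('lin');
    INSERT INTO orders (user_id) VALUES (1), (1), (2);
""")

# every table with its row count, biggest first
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
counts = {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
          for t in tables}
for t, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(t, n)
```

The equivalent catalog query exists in every database (`information_schema.tables` in Postgres/MySQL), only the spelling differs.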
Sort of unrelated, but I've got a story about a project I was looking through that confused the hell out of me. It was a C# library that would allow you to render an element from a shockwave flash file (it was either .swf or .fla).
I spent ages digging through the code. The example worked really well, but I couldn't get it to work with one of my files.
Eventually I contacted the author and he told me the library used reflection to get the name of your variable and would look for that variable name in the flash document.
Two things - First - I learn how the data flows from the source to the end. That teaches me to navigate the codebase entirely. (User action to database, or source data to end data etc.
Second - I learn how different components are wired together.
So indexing all the components and finding out all the interactions between them. This is exactly what class browsers do, they index all the classes and messages. The interactions could be described using an example/documentation.
This reminds me of Pharo which does all of the above, indexes classes, messages and has a rich documentation support.
Checkout the repository and run the unit tests. If none exists, write the first one.
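Writing that first test can be tiny; the point is to pin down one observable behavior of one function. A sketch, where `slugify` is a hypothetical stand-in for whatever function you pick first:

```python
# hypothetical function under study; in practice this lives in the codebase
def slugify(title):
    return "-".join(title.lower().split())

# the first test records one behavior you've verified by reading or running it
def test_slugify_spaces_become_dashes():
    assert slugify("Hello World") == "hello-world"

test_slugify_spaces_become_dashes()
print("1 test passed")
```

With a test runner like pytest, the `test_` function is discovered automatically; each new thing you learn about the codebase can become another small test.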
I often write out callstacks/dependency chains on paper. I find that makes it stick.
Try actually using the program as an end user would.
Read error messages, read code, make predictions about what the code does, find out if your predictions are true.
I like to focus on the main business element. If it's a SaaS for sharing videos with comments, for instance, I'd take a longer look at the video and comment models, their relations, and the call chain from API endpoint to model.
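For the video-sharing example, the models and their relation might boil down to something like this (all names hypothetical; the point is the shape you're looking for when reading the real models):

```python
from dataclasses import dataclass, field

@dataclass
class Comment:
    id: int
    video_id: int     # the relation worth mapping: each comment belongs to a video
    body: str

@dataclass
class Video:
    id: int
    title: str
    comments: list = field(default_factory=list)

# stand-in for the call chain's last hop: API endpoint -> model lookup
def get_video_with_comments(videos, video_id):
    return next(v for v in videos if v.id == video_id)

videos = [Video(1, "intro", [Comment(1, 1, "nice!")])]
print(get_video_with_comments(videos, 1).title)
```

In a real app the same relation usually shows up three times: as a foreign key in the schema, as an association in the ORM model, and as a join or nested field in the API response. Spotting all three is the mapping exercise.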
Another strategy I like is picking parts of the codebase and trying to refactor them. You don't even need to commit anything if you're not supposed to go around changing things: just by spending some time moving things around, seeing what breaks and so on will give you a better understanding of the code and what it does.
Not using things like go to definition and any fancy tools, just manually forcing myself to work through files to understand how things fit together and using basic tools like grep
Very surprising! IIUC, you consider "Go To Definition"/"Go To References" and other "LSP assists" unhelpful (or worse) when familiarizing yourself with a new codebase. I personally find them indispensable. Could you say more to help me understand your position?
If I start with these tools I get a feeling I understand how things fit together, but when I want to add something new I realize my understanding is pretty shallow. I force myself to manually open files where things are so I get an understanding of the conventions and principles in the project. By far my most valuable tool is grep (it’s super fast in vim) and I grep the code to see where certain functions and such are used. I use the lsp tools later when I’m used to the project.
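The grep workflow here is just two questions per symbol: where is it defined, and who calls it. A self-contained sketch (the temp files and the `process_payment` name are invented so the commands have something to hit):

```shell
set -e
# throwaway source tree so the greps below are runnable
demo=$(mktemp -d)
cat > "$demo/billing.py" <<'EOF'
def process_payment(amount):
    return amount > 0
EOF
cat > "$demo/checkout.py" <<'EOF'
from billing import process_payment
process_payment(10)
EOF

grep -rn "def process_payment" "$demo"   # where it's defined
grep -rn "process_payment(" "$demo"      # every call site
```

`ripgrep` (`rg`) does the same with saner defaults (recursive, respects .gitignore), which is why it comes up so often in these threads.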
I'd give the complete opposite advice.
I use "go to definition"/"find references" to go all around the code base and at the same time try to figure out how each files interact with each others.
I would prefer LSP assists as mentioned above. I love having things within reach. I believe programming should be as painless as possible.
The Pharo (Smalltalk) IDE is one of the best IDEs I've ever used. It even has a feature where you give it an input object and an output object and it tells you what messages you need to send to get from input to output.
Static analysis suites are well suited to this. Things like Understand, NDepend, and Source Insight.
1: https://www.scitools.com/spaghetti-code 2: https://blog.ndepend.com/visualize-code-with-software-architecture-diagrams/ 3: https://www.sourceinsight.com/#call-graphs
Run the code. This sounds so trivial, but it's not always that easy. Once it gets running, I discuss with users of the code (be they customers or developers) and try to understand what they use it for, then look for that same functionality in the code. Usually by this point I know where everything is and how to make changes if needed.
Yep, underrated advice. Run it, read the code, fiddle some bits and make sure what you fiddled matches your mental model.
Honestly, learning to read code and execute it in your head is a super power
Even better than running code is running tests.
Unless there are millions of files, I like to go one by one and write the javadoc/comment about what each does (if it doesn't already have one, which is usually the case; if it does, just check that it's still correct). This way you can see what elements you have, and sometimes even find minor bugs or improvements.
Get it running.
I find the only way for me is to actually run the code locally, play around with it until I understand the data flow.
Run it under debugger and go thru whole codebase a few times and you'll start getting proficient at it
I recently did this on a huge, very very specific codebase, and after 2-3 months I understood it (what to add where, not just what's happening where) relatively OK.
With both lambdas and @ComeFrom instructions, it's impossible. I'm impressed that a feat of modern programming (starting with, say, Java 8) makes the code this difficult to understand.
I have never heard of COMEFROM.
I love debugging and I would prefer this even if it's slower but there are cases where it's not easy to setup a debugger.
Refactor something unimportant with tests and a functional change, improve the documentation in the README, and push straight to the repo without a feature branch.
Reading the codebase does surprisingly little for me, you essentially have to change it and see what happens.
It's counterintuitive, but stepping through a very complex codebase with a debugger and taking notes along the way was great in my case.
Also, attempting to draw a sequence diagram and fixing it as you go trains the brain to handle a mental model of a large process.
Apart from reading, and trying to understand the architecture, these are some things you can do:
1. Fix a couple of bugs
2. Add a small feature
3. Refactor a small piece of it
Always start with few small things and keep increasing the complexity of the things you do. Working on a codebase is the best way to understand it.
Get your team to produce easy tasks (usually refactors) to help you ramp up on the codebase
Learn the domain. Learn as much as you can about the business logic, the problems the code solves. It's so much easier to do this outside of the code. I've spent days trying to fix bugs only to find the business logic was wrong.
I have used the technique described by M. Feathers as "Characterization tests", in his book "Working Effectively with Legacy Code", with good success.
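A characterization test in Feathers' sense asserts what the code does today, not what you believe it should do. A sketch, where `legacy_discount` is a made-up stand-in for opaque legacy logic you want to pin down before touching it:

```python
# stand-in for legacy code whose behavior you need to preserve while refactoring
def legacy_discount(price, tier):
    if tier == "gold":
        return round(price * 0.8, 2)
    return price

# record CURRENT behavior; surprising values here are findings, not bugs to fix yet
assert legacy_discount(100, "gold") == 80.0
assert legacy_discount(100, "silver") == 100
print("current behavior pinned down")
```

The workflow is: write an assertion with a guessed value, run it, then replace your guess with whatever the code actually returned. Once a net of these exists, you can refactor with the tests as a tripwire.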
Learn the main scenarios of the product and try to walk through them with a debugger.
Also try to read tests, because they can show a lot about how the components are used and their properties.
Depending on the codebase of course. But for me, mapping out the datastructures and understanding their relations helps the most.
I tend to use the thing, find a behavior I don't like, then build familiarity pulling at that specific thread
Just start working on features/bugfixes and learn as you go, and only the things you need. You'll be really slow on the first couple of features/bugs but will get faster as you go. Set this expectation with your manager.
Old school - print it out, use a pencil to make notes, and a highlighter as needed.
Pick an interesting part and start writing unit tests.
Read the design docs.
Background: I have worked for several companies with systems that have been actively used and developed for 30+ years. The databases have been complex, containing thousands of unique tables and views. These databases often have hundreds of in-database user-defined functions that can be thousands of lines long. The code bases have been in multiple languages, using multiple technology stacks and multiple git repositories. These projects have been a mix of monolithic and micro-service architectures. Oftentimes there have been multiple front-ends (to meet different types of users' needs, e.g. internal company users vs. external public users).

The user applications have typically had ~100-200 pages to do specific types of actions. It is common for the UI pages to have to comply with laws and government regulations that have changed over time; these laws and regulations comprise hundreds to thousands of pages of text. It is common for a single button on a UI page to do very complex actions (e.g. batch jobs), including building out large data structures and touching (selecting, inserting, updating, deleting) 10-50+ database tables. It is common for UI pages to only be used at certain parts of the year, or once every few years, which makes it much harder to "know about" those parts of the system. It is very common for UI button actions to have to "know" and enforce details found within the laws and government regulations.

Aside from the application code, we also have large ETL (extract, transform, load) processes to build up data warehouses. There have also typically been a large number of batch jobs to do regular data pulls and updates from other systems.
Given this background, “What are the ways you go about getting comfortable with a new codebase?” Here is some of what I feel/think/do:
1. In these companies, it is "understood" that a new hire will take about a year+ before they know enough to be useful. These systems took people's entire careers to learn about and build. It is "understood" that a new hire will not learn the system "overnight".
2. Even knowing that it is "impossible" to know the system quickly, I have a personal desire to contribute to the company (i.e. be of use) as quickly as possible. It is very important to have reasonable personal expectations. Chill out and allow yourself time to learn. You can't avoid paying the TIME/exposure cost of learning a complex system; if you try to skip it, you are fooling yourself into thinking you "know" something you do not. It will show.
3. Get access. You need access to the code base repository(s). Get access to the database(s). Get usernames and passwords (if allowed in your company). Get login credentials to your applications. Get access to the documentation store(s), ticket tracking system(s), applicable laws and regulation(s) locations, etc.
4. Get a development environment running. Checkout the code. Get it running on your local machine. I want to be able to run in debug mode where possible.
5. Define what it means for you to feel like you know the system. For me, I feel like I "know" the system when:
   a. All the names and acronyms are familiar (every company and system has its own "language").
   b. I can look at a random page in the system and know what database tables are referenced when the page loads, and what table(s) are changed when the UI buttons on the page are pressed.
   c. I have a mental model of the data and business validation that will occur when a given UI button is pressed.
   d. I am comfortable enough in the language(s) and technology stack(s) to make changes, review code, and deploy code.
   This standard is pretty easy for me to "say" but takes YEARS of hard work to achieve.
6. Commit to keeping a log of your daily activities. Writing what you are doing, seeing and learning can act as a form of team coding with yourself. As you write, you are forcing yourself to “teach” and “explain” what you are doing and why. This activity helps me internalize the system quicker. This can help a lot if you are in a company that requires you change problem contexts a lot during the day or week. Also take LOTS of screen shots in your log (using Snagit or ShareX, etc).
7. Commit to creating system documentation. This can be in markdown, confluence, word, etc. It can be in a git repository or file system. It must support screenshots and page links to related pages. I tend to do UML diagrams in Drawio or Visio (exported as images) but if your tool supports UML all the better. I like to organize my system documentation by UI Page. For every major UI Page, I have a system documentation page of the same name. It may have many child pages (depending on how complex the page is). This simple structure means that I can quickly find the documentation that I have 6 months later when working on the same page again. Having a documentation space to put documentation makes it easier for me to create more documentation.
8. Depending on the company, you may ask to job shadow another developer for a period of time. This can mean spending an hour a day or 8 hours a day with them in a team coding environment. Ask another developer to do this. If they say yes, you can learn a lot more quickly about the system.
9. Depending on your company you may have leeway in how you use your time. You may be given issue tickets on day one or you may be sheltered from the storm of requests and given time to learn for a month to a year+.
10. If you are given leeway to research and learn (which is rare), one of the best ways for me to learn is to try to build a mimic system. I look at the existing system and try to build my own system to do what it does. If I can produce a screen that loads data that looks like the original system, then I have a high degree of confidence that I "know" how it works. If I can produce a button that saves data (and does all of the complex validation, database changes, etc.) that the original system's save button does, then I have a much higher degree of confidence that I "know" how it works. As I mimic the system, I attempt to document what I am learning about the original system. These notes can become invaluable later (when in a time crunch to fix some critical issue). Building a system mimic makes it much harder to "fool yourself" into thinking you know what is going on. Let the compiler and screen comparison be your mentors. They can be brutally honest but effective teachers.
11. If you are not given such leeway to research and learn (which is common; people pay you to produce), you can learn a lot by reviewing previously completed issue-tracking tickets. You can see the most common topics and rough patches in the system (for that time period in the year; problem hotspots change over the year(s)). The existing system documentation (if any exists) may be very helpful. If there are unit tests, these can be very valuable learning aids as well. As you work issue-tracking tickets, you will be forced to learn details about that specific area of the code. It may take longer to get a "holistic" view of the system this way (as opposed to building a mimic system), but it will eventually get you to a similar level of understanding, via a different route.
12. Some of these companies' architectures have and support unit tests. Count yourself lucky. Make use of them and add to them. Other companies' architectures do not (or make it really hard).
13. Spend as much time as you can playing with the application UI. Pretend to be an end user: loading pages, entering data, clicking through the process, and saving. You will see a lot of validation errors. You can learn a lot about the workflow and the pain points in the workflow. You can learn a lot about how the front end affects the database. Systems tend to get better if you (as a developer) "eat your own dogfood" that you make your end users eat. Be in and use the application(s) on a daily basis. This will help you learn the company language and help you communicate with your users. It will also help you see why users are having trouble with a process and perhaps see ways to make the workflow better. It will also expose bugs and give you an opportunity to fix those bugs and learn more about that particular part of the system.
14. One of the most useful things I can do to learn a new system is to build “UNDO” features. I am not certain why this isn't more 'popular'. It feels like it grants SUPER POWERS to me. If I am looking at an Application UI button that does a complex Batch process, the best way to “know” that I know what it does is to build a script (and another button) to undo what that Batch button did. Many of the systems I have worked with do not have an UNDO button for complex processes. This means that everyone is scared to run them because they do so much. This means they do not get run or tested as often as they should. This builds up additional fear of clicking the button. If I can create an UNDO script, I will have to document everything it does. I have to “Know” every database record that it changes. Once I have the UNDO script, it becomes easy to run the button to do the complex batch process and it becomes easy to get back to a virgin state again. The easier it is to run, the more roundtrip testing you can do and the more likely it is for you to make the code better. This all can greatly reduce fears of that process in the future and can greatly speed up supporting the feature in the future.
I am sure more could be added to my list. Of course, every company and every position is different. You'll have to figure out what makes sense in your situation.
Hopefully it gives you some ideas. Good luck learning your system(s).
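The UNDO idea in point 14 above can be sketched concretely: snapshot the rows a batch job will touch before it runs, and the snapshot doubles as documentation of what the job changes. A toy version against an in-memory SQLite database (the `accounts`/fee schema is invented for illustration):

```python
import sqlite3

# throwaway schema standing in for real application tables a batch job touches
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])

def batch_apply_fee(conn):
    # snapshot BEFORE the change; writing this forces you to enumerate
    # every row the batch touches
    undo = conn.execute("SELECT id, balance FROM accounts").fetchall()
    conn.execute("UPDATE accounts SET balance = balance - 5")
    return undo

def undo_batch(conn, undo):
    # restore the snapshot, returning the data to its pre-batch state
    conn.executemany("UPDATE accounts SET balance = ? WHERE id = ?",
                     [(bal, row_id) for row_id, bal in undo])

undo = batch_apply_fee(conn)
undo_batch(conn, undo)
print(conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall())
```

With a cheap round trip like this, you can run the scary batch button as many times as you need while learning what it does.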
Here are a few techniques that I use, in addition to requesting talks from people who know the code base, asking for code walkthroughs if they know the code, and just generally learning the terminology for various things in the domain (i.e., getting familiar with the Nouns and Verbs of the system):
- Check out the source code and look at how the codebase is laid out. i.e., the directory names. This often gives me an understanding of what is vendor code or third-party dependencies, test, and 'core' source code.
- How to build the product, and what artifacts are produced. This will tell you a lot about where to start looking.
- Take a small bug where further information is requested to help triage/solve the bug. This could be reproducing the issue by clicking around or performing a workflow by issuing commands, or rooting through logs and database entries to ascertain the state of the system. This is often an instructive exercise because it requires that you understand how the system is deployed, which is always a good thing to know, and how to troubleshoot an area that you know little about, which you will often be called upon to do. Reading the logs gives you a pretty good idea of the system initialization sequence, particularly in a cloud product. Ditto with looking at the core metrics or example traces of the system. Traces, if available, instantly lay bare a 30000 ft view of the system before you.
- For web apps, the router file. There is usually a file that contains various route definitions and the entrypoints to them. This is a great start for figuring out what links to what. Something basic like a simple GET of a collection or a health check is a great way to get your feet wet.
- For web apps and others that use a database, the database schema. Often I just do a COUNT(*) of tables and look at the schema tables that contain the largest number of entries.
- Unit Tests. For a particular functional area, these are an excellent aid to understanding what the expectations are and how they are tested. I also write unit tests for areas that do not have them as a way to get familiar with the codebase.
- Results of Smoke Test and Integration Test runs. These often give you a bigger picture idea of what the system does in relation to others that surround it, and the major 'compartments' of the system as it were.
- Fixing bugs of varying complexity. This is an excellent way to instantly get familiar with building, testing, code reviewing, and putting through standard precommit testing your change. The change itself is incidental; the things that you learn during this process will help you get productive quickly.
- Writing small features that are very self-contained and interact with one or two areas of the system. This helps you understand a few system areas inside-out, and you can slowly grow your understanding of the system by doing features that touch newer areas that you'd like to know.
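The router-file point above is worth seeing in miniature: whatever the framework, a router boils down to a map from (method, path) to handler, and reading that map is reading the app's table of contents. A framework-free sketch (handler names are invented):

```python
# minimal model of what a web framework's router file expresses
def health(_request):
    return 200, "ok"

def list_videos(_request):
    return 200, "[]"

ROUTES = {
    ("GET", "/health"): health,    # a trivial route: a good first thread to pull
    ("GET", "/videos"): list_videos,
}

def dispatch(method, path, request=None):
    handler = ROUTES.get((method, path))
    if handler is None:
        return 404, "not found"
    return handler(request)

print(dispatch("GET", "/health"))
```

In real codebases the same table hides behind decorators (`@app.get("/videos")` in Flask/FastAPI-style frameworks) or a `urls.py`/`routes.rb` file, but it's the same structure: find it, and every entry point is enumerated for you.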