Git Basics for Data Professionals: A Beginner’s Guide to Local Version Control

Felipe Leite
Feb 20

Hey there, data enthusiasts! 👋 Whether you're a data analyst, data scientist, or data engineer, version control is a skill you need in your toolkit. Git is the go-to tool for tracking changes in your code, scripts, and even data pipelines. In this blog post, I’ll walk you through the basics of Git, step by step, with practical examples tailored to your workflow. Let’s dive in!

Why Should Data Professionals Care About Git?

As a data professional, you’re likely working with scripts, notebooks, and pipelines that evolve over time. Git helps you:

Track changes to your code and data workflows.
Collaborate with teammates without stepping on each other’s toes.
Experiment with new ideas in isolated branches without breaking your main project.
Recover from mistakes by reverting to previous versions.

Credits: https://blog.programster.org/git-workflows

Think of Git as your project’s safety net. Whether you’re cleaning data, building models, or deploying pipelines, Git ensures you never lose your work and can always trace back your steps.

Today, we’ll focus on local version control, which means everything happens on your computer. No need to worry about remote repositories (like GitHub) just yet. Let’s start with the basics!

Getting Started: Setting Up Git

Before we jump into commands, let’s make sure Git is installed and set up on your machine.

Installing Git

Ubuntu: Open your terminal and type:

sudo apt install git

Windows: Download Git from git-scm.com and install it. Once installed, open Git Bash to use Git commands.
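To confirm the installation worked on either system, you can ask Git for its version. The exact number printed will differ from machine to machine.

```shell
# Prints something like "git version 2.x.y" if Git is installed and on your PATH
git --version
```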

Configuring Git

Once Git is installed, you’ll want to set up your name and email. This helps Git track who made changes. Run these commands in your terminal:

git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"

You can check your settings with:

git config -l
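If you only want to look up one setting instead of the whole list, git config --get should do it. Here's a small sketch that runs in a throwaway folder and sets a repository-level name (no --global flag), so it won't touch your real settings. The name "Demo User" and the temporary folder are just placeholders.

```shell
# Work in a throwaway repository so your real settings stay untouched
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q

# Without --global, the setting applies to this repository only
git config user.name "Demo User"

# Read back one setting instead of listing everything
git config --get user.name    # prints: Demo User
```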

Basic Bash Commands to Know

Before we get into Git-specific commands, let’s cover some basic Bash commands to help you navigate your file system:

pwd                   # Shows your current directory (where you are).
ls                    # Lists files and folders in the current directory.
ls -l                 # Shows a detailed list of files.
ls -a                 # Shows hidden files (like the .git folder).
mkdir <folder_name>   # Creates a new folder.
touch <file_name>     # Creates a new file.
cd <folder_name>      # Moves you into a folder.
cd ..                 # Moves you up one directory.
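Put together, a short session with these commands might look like the sketch below. The folder name my_project and the file name notes.txt are just examples, and everything happens in a scratch location so nothing important gets touched.

```shell
# Start from a scratch location
cd "$(mktemp -d)"

mkdir my_project      # create a new folder
cd my_project         # move into it
pwd                   # shows the full path, ending in /my_project
touch notes.txt       # create an empty file
ls -l                 # notes.txt appears in the detailed listing
cd ..                 # back up one level
ls                    # prints: my_project
```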

Starting a Git Repository

Now that you’re comfortable with basic commands, let’s create your first Git repository!

Navigate to the folder where you want to start your project:

cd path/to/your/project

Initialize Git in the folder:

git init

(This creates a hidden .git folder, which Git uses to track changes.)

Create a new file:

touch data_cleaning_script.py
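Curious what git init actually did? A quick check is to list the hidden files and ask Git whether you're inside a repository. This sketch uses a throwaway folder instead of your real project path.

```shell
project_dir="$(mktemp -d)"
cd "$project_dir"

git init -q                           # -q just hides the confirmation message
touch data_cleaning_script.py

ls -a                                 # the listing now includes .git
git rev-parse --is-inside-work-tree   # prints: true
```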

Tracking and Committing Changes

Once you’ve created a file, you’ll want to tell Git to track it. Here’s how:

Check the status of your repository:

git status

You’ll see data_cleaning_script.py listed as an untracked file.

Stage the file (this tells Git to start tracking it and to include it in your next commit):

git add data_cleaning_script.py

Commit the file with a message:

git commit -m "First commit: added data cleaning script"

Congrats! You’ve made your first commit. 🎉
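Here's the whole status, stage, commit loop as one sketch you can paste into a terminal. It assumes a throwaway folder and a placeholder identity ("Demo User") set for that repository only. The --short flag is simply a compact form of git status.

```shell
repo_dir="$(mktemp -d)"
cd "$repo_dir"
git init -q
# Repository-level identity so the commit works even without global settings
git config user.name "Demo User"
git config user.email "demo@example.com"

touch data_cleaning_script.py
git status --short    # prints: ?? data_cleaning_script.py   (untracked)

git add data_cleaning_script.py
git status --short    # prints: A  data_cleaning_script.py   (staged)

git commit -q -m "First commit: added data cleaning script"
git log --oneline     # one line: a short hash followed by the commit message
```

After the commit, git status --short prints nothing, which is Git's way of saying there is nothing new to record.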

Working with Branches

Branches are like parallel universes for your code. They let you work on new features or experiments without messing up your main codebase.

Create a new branch:

git checkout -b feature/new_model

Make changes and commit them:

touch model_training.py
git add model_training.py
git commit -m "Added initial version of model training script"

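Switch back to the main branch (depending on your Git version and settings, it may be called master instead of main; run git branch to see which one you have):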

git checkout main

Merge your new feature into the main branch:

git merge feature/new_model

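Delete the branch (if you no longer need it). The lowercase -d is the safe choice because it only deletes branches that have already been merged, while -D forces the deletion even if the work is unmerged:

git branch -d feature/new_model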
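To see the full branch workflow end to end, here's a sketch in a throwaway repository. A few things are my assumptions rather than part of the steps above: the placeholder identity, the first commit (a branch needs at least one commit to start from), and git branch -M main, which renames the default branch to main in case your Git created master. The cleanup step uses the lowercase -d, which refuses to delete a branch that hasn't been merged.

```shell
repo_dir="$(mktemp -d)"
cd "$repo_dir"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

# A branch needs at least one commit to start from
touch data_cleaning_script.py
git add data_cleaning_script.py
git commit -q -m "First commit: added data cleaning script"
git branch -M main                # make sure the default branch is called main

# Do the experimental work on its own branch
git checkout -q -b feature/new_model
touch model_training.py
git add model_training.py
git commit -q -m "Added initial version of model training script"

# Back on main, the new file is absent until the merge
git checkout -q main
ls                                # only data_cleaning_script.py
git merge -q feature/new_model
ls                                # now both scripts are listed

git branch -d feature/new_model   # -d only deletes fully merged branches
git branch                        # prints: * main
```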

Large Datasets

Git isn’t designed for large files. Instead, use tools like DVC (Data Version Control) or store your data in cloud storage (e.g., S3, Google Cloud Storage) and reference it in your repository.
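One common companion to this advice is a .gitignore file, which keeps local copies of your data out of the repository. The sketch below is only an illustration: the data/ folder and the file raw_export.csv are hypothetical names, and your own layout will likely differ.

```shell
repo_dir="$(mktemp -d)"
cd "$repo_dir"
git init -q

# Pretend this is a large local dataset (hypothetical path)
mkdir data
touch data/raw_export.csv

# Tell Git to ignore everything under data/
echo "data/" > .gitignore

git status --short    # prints: ?? .gitignore   (nothing from data/ shows up)
git check-ignore -q data/raw_export.csv && echo "ignored"   # prints: ignored
```

You would then commit .gitignore itself, so the rule travels with the project.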

Undoing Mistakes

We all make mistakes, and Git has your back! Here’s how to undo changes:

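Mixed reset (the default when you pass no flag): Move your branch back to an earlier commit and unstage the changes made since then, without losing them. The changes stay in your working directory. (A true soft reset needs the --soft flag, which keeps those changes staged.)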

git reset <commit_hash>

Hard reset: Move your branch back to an earlier commit and discard everything after it, including uncommitted changes to tracked files (use with caution!):

git reset --hard <commit_hash>
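To get a feel for the difference, here's a sketch you can run in a throwaway repository. Run without a flag, git reset uses Git's default mode (called --mixed), which keeps your edits in the working directory; --hard throws them away. The file notes.txt and the identity are placeholders, and the commit hash is captured in a variable instead of being typed by hand.

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "step 1" > notes.txt
git add notes.txt
git commit -q -m "First version"
first_commit="$(git rev-parse HEAD)"   # remember the commit to go back to

echo "step 2" >> notes.txt
git add notes.txt
git commit -q -m "Second version"

# Default (mixed) reset: the branch moves back, the edit survives unstaged
git reset -q "$first_commit"
git status --short    # prints:  M notes.txt
cat notes.txt         # still shows both "step 1" and "step 2"

# Hard reset: the leftover edit is discarded as well
git reset -q --hard "$first_commit"
cat notes.txt         # prints: step 1
```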

Wrapping Up

And there you have it: a beginner-friendly guide to using Git for local version control, tailored for data professionals! Whether you’re cleaning data, training models, or building pipelines, these commands will help you stay organized and in control of your work.

If you found this guide helpful, feel free to share it with your colleagues. And if you have any questions or want to dive deeper into Git, drop a comment below. Happy coding! 😊