
Git Basics for Data Professionals: A Beginner’s Guide to Local Version Control

  • Writer: Felipe Leite
  • Feb 20
  • 3 min read

Hey there, data enthusiasts! 👋 Whether you're a data analyst, data scientist, or data engineer, version control is a skill you need in your toolkit. Git is the go-to tool for tracking changes in your code, scripts, and even data pipelines. In this blog post, I’ll walk you through the basics of Git, step by step, with practical examples tailored to your workflow. Let’s dive in!

Why Should Data Professionals Care About Git?

As a data professional, you’re likely working with scripts, notebooks, and pipelines that evolve over time. Git helps you:

  • Track changes to your code and data workflows.

  • Collaborate with teammates without stepping on each other’s toes.

  • Experiment with new ideas in isolated branches without breaking your main project.

  • Recover from mistakes by reverting to previous versions.



Think of Git as your project’s safety net. Whether you’re cleaning data, building models, or deploying pipelines, Git ensures you never lose your work and can always trace back your steps.


Today, we’ll focus on local version control, which means everything happens on your computer. No need to worry about remote repositories (like GitHub) just yet. Let’s start with the basics!

Getting Started: Setting Up Git

Before we jump into commands, let’s make sure Git is installed and set up on your machine.

Installing Git

  • Ubuntu: Open your terminal and type:

sudo apt install git
  • Windows: Download Git from git-scm.com and install it. Once installed, open Git Bash to use Git commands.
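To confirm the installation worked on either system, a quick check like this one should print the installed version. The exact version number will differ on your machine.

```shell
# Any "git version x.y.z" output means Git is installed and on your PATH.
git --version
```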


Configuring Git


Once Git is installed, set your name and email. Git records them in every commit you make, so your history shows who made each change. Run these commands in your terminal:


git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"

You can check your settings with:


git config -l
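The --global flag applies your identity to every repository on your machine. As a small sketch, assuming you want a different identity for one project (say, a work email), the same keys can be set without --global inside that repository. The temporary folder and the name and email values are placeholders.

```shell
# Create a throwaway repository so this example does not touch your real projects.
demo_repo="$(mktemp -d)"
cd "$demo_repo"
git init --quiet

# Without --global, the setting applies to this repository only.
git config user.name "Your Name"
git config user.email "your.email@example.com"

# Read back a single value instead of listing everything.
git config --get user.email
```

Inside a repository, a local value takes precedence over the global one, so the last command prints the email set here.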

Basic Bash Commands to Know


Before we get into Git-specific commands, let’s cover some basic Bash commands to help you navigate your file system:


pwd                   # Shows your current directory (where you are).
ls                    # Lists files and folders in the current directory.
ls -l                 # Shows a detailed list of files.
ls -a                 # Shows hidden files (like the .git folder).
mkdir <folder_name>   # Creates a new folder.
touch <file_name>     # Creates a new file.
cd <folder_name>      # Moves you into a folder.
cd ..                 # Moves you up one directory.
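Here is how those commands might fit together in practice. This is only a sketch: the folder name demo_project and the file name notes.txt are made up, and everything happens inside a temporary folder.

```shell
# Work inside a temporary folder so nothing on your machine is overwritten.
workdir="$(mktemp -d)"
cd "$workdir"

mkdir demo_project    # Create a project folder (hypothetical name).
cd demo_project       # Move into it.
touch notes.txt       # Create an empty file.
pwd                   # Confirm where you are.
ls -l                 # notes.txt appears in the detailed listing.
cd ..                 # Go back up one level.
ls                    # demo_project appears here.
```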

Starting a Git Repository


Now that you’re comfortable with basic commands, let’s create your first Git repository!

  1. Navigate to the folder where you want to start your project:

cd path/to/your/project
  2. Initialize Git in the folder:

git init

(This creates a hidden .git folder, which Git uses to track changes.)


  3. Create a new file:

touch data_cleaning_script.py
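To check that git init did what the steps above describe, a sketch like this lists the hidden folder and asks Git whether the current folder belongs to a repository. It runs in a temporary folder instead of your real project path.

```shell
# Temporary stand-in for path/to/your/project.
project_dir="$(mktemp -d)"
cd "$project_dir"
git init --quiet
touch data_cleaning_script.py

ls -a                                 # The .git folder shows up among the hidden entries.
git rev-parse --is-inside-work-tree   # Prints "true" inside a repository.
```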

Tracking and Committing Changes


Once you’ve created a file, you’ll want to tell Git to track it. Here’s how:

  1. Check the status of your repository:

git status

You’ll see data_cleaning_script.py listed as an untracked file.


  2. Stage the file (tell Git to include it in your next commit):

git add data_cleaning_script.py
  3. Commit the file with a message:

git commit -m "First commit: added data cleaning script"

Congrats! You’ve made your first commit. 🎉
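To watch the file move from untracked to staged to committed, the same steps can be run end to end with a compact status view. This sketch uses a temporary folder and a repository-local placeholder identity, so it works even if you skipped the global configuration.

```shell
repo_dir="$(mktemp -d)"
cd "$repo_dir"
git init --quiet
# Repository-local identity (placeholder values) so the commit succeeds anywhere.
git config user.name "Your Name"
git config user.email "your.email@example.com"

touch data_cleaning_script.py
git status --short    # "?? data_cleaning_script.py" means untracked.
git add data_cleaning_script.py
git status --short    # "A  data_cleaning_script.py" means staged.
git commit --quiet -m "First commit: added data cleaning script"

git log --oneline     # One line per commit: short hash plus message.
```

git log --oneline is a handy way to review your history later, since every commit you add shows up as one more line.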


Working with Branches


Branches are like parallel universes for your code. They let you work on new features or experiments without messing up your main codebase.


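  1. Create a new branch and switch to it:

git checkout -b feature/new_model
  2. Make changes and commit them:

touch model_training.py
git add model_training.py
git commit -m "Added initial version of model training script"
  3. Switch back to the main branch:

git checkout main

(Depending on your Git version and settings, the default branch may be called master instead of main. Run git branch to see the name.)

  4. Merge your new feature into the main branch:

git merge feature/new_model
  5. Delete the branch (if you no longer need it):

git branch -d feature/new_model

(The lowercase -d refuses to delete a branch that hasn’t been merged yet. The uppercase -D deletes it regardless, so use that one with care.)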
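The whole branch workflow can be tried safely in a throwaway repository. The sketch below makes one assumption explicit: it reads the default branch name from Git instead of hard-coding main, because a fresh git init may produce either main or master depending on your setup.

```shell
repo_dir="$(mktemp -d)"
cd "$repo_dir"
git init --quiet
git config user.name "Your Name"                 # Placeholder identity.
git config user.email "your.email@example.com"

# Record the default branch name ("main" or "master", depending on your setup).
default_branch="$(git symbolic-ref --short HEAD)"

touch data_cleaning_script.py
git add data_cleaning_script.py
git commit --quiet -m "First commit: added data cleaning script"

git checkout --quiet -b feature/new_model
touch model_training.py
git add model_training.py
git commit --quiet -m "Added initial version of model training script"

git checkout --quiet "$default_branch"
git merge --quiet feature/new_model   # Fast-forward: the default branch had no new commits.
git branch -d feature/new_model       # -d only deletes branches that are already merged.
git branch                            # Only the default branch remains.
```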

Large Datasets


Git isn’t designed for large files: it keeps every version of every committed file, so big datasets quickly bloat the repository. Instead, use tools like DVC (Data Version Control) or store your data in cloud storage (e.g., S3, Google Cloud Storage) and reference it in your repository.
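One common way to keep local data files out of a repository, shown here as a sketch, is a .gitignore file. The data/ folder and the CSV file name are hypothetical, so adapt the pattern to wherever your datasets live.

```shell
repo_dir="$(mktemp -d)"
cd "$repo_dir"
git init --quiet

mkdir data
touch data/raw_sales.csv    # Stand-in for a large dataset (hypothetical name).

# Tell Git to ignore everything under data/.
echo "data/" > .gitignore

git status --short                       # Lists .gitignore only; data/ is left out.
git check-ignore -v data/raw_sales.csv   # Reports which pattern matched the file.
```

Committing the .gitignore file itself means the rule stays with the project.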


Undoing Mistakes


We all make mistakes, and Git has your back! Here’s how to undo changes:

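  1. Mixed reset (the default): Move your branch back to an earlier commit and unstage everything changed since then, while keeping those changes in your files:

git reset <commit_hash>

(Adding --soft instead keeps those changes staged.)

  2. Hard reset: Move your branch back to an earlier commit and discard everything changed since then, including uncommitted work (use with caution!):

git reset --hard <commit_hash>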
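To see the difference between the two resets without risking real work, here is a sketch in a throwaway repository. It uses HEAD~1 (the commit just before the current one) in place of a commit hash, and the file contents are made up.

```shell
repo_dir="$(mktemp -d)"
cd "$repo_dir"
git init --quiet
git config user.name "Your Name"                 # Placeholder identity.
git config user.email "your.email@example.com"

echo "step 1" > data_cleaning_script.py
git add data_cleaning_script.py
git commit --quiet -m "First commit"

echo "step 2" >> data_cleaning_script.py
git commit --quiet -am "Second commit"

# Default (mixed) reset: drop the second commit but keep its edits in the file.
git reset --quiet HEAD~1
git status --short            # " M data_cleaning_script.py": modified, not staged.

# Hard reset: throw away the uncommitted edits as well.
git reset --quiet --hard HEAD
cat data_cleaning_script.py   # Only "step 1" is left.
```

After the first reset, the "step 2" line is still in the file. After the hard reset it is gone, which is why that command deserves the caution label.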

Wrapping Up


And there you have it—a beginner-friendly guide to using Git for local version control, tailored for data professionals! Whether you’re cleaning data, training models, or building pipelines, these commands will help you stay organized and in control of your work.


If you found this guide helpful, feel free to share it with your colleagues. And if you have any questions or want to dive deeper into Git, drop a comment below. Happy coding! 😊


 
 
 

© 2024 by Felipe Leite
