20  Working with Data

Note: Chapter Summary

In this chapter, you’ll learn how to work with real-world data files. You’ll discover how to read CSV files, process JSON data, and analyse information - skills that transform your programs from toys to tools. This is where programming becomes practical!

20.1 The Concept First

Up until now, your programs have worked with data you typed in or created yourself. But real programs work with existing data — grade books, weather records, product inventories, survey results. The data already exists somewhere; your job is to read it, make sense of it, and do something useful.

Think of it this way: you already know how to cook (write programs). Now you need to learn how to unpack groceries (read data files). The food arrives in different packaging — cans, bags, boxes — and you need to know how to open each one before you can start cooking.

Two formats dominate the data world. CSV is like a spreadsheet saved as plain text — rows and columns separated by commas, nothing fancy. JSON is like a set of nested labelled boxes — open one and you might find more boxes inside, each with its own label. Most data you’ll encounter comes in one of these two formats.

20.2 Discovering Data with Your AI Partner

Before we write any code, let’s build intuition for how data files work.

Ask your AI:

Show me what a CSV file looks like as plain text, then show me
the same information as JSON. Use a simple example like 3 students
with names and grades.

Notice how CSV is flat and table-like while JSON can nest information inside other information. This difference drives when you’d choose each format.
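A typical answer might look like the following sketch, which builds the JSON from the CSV text so you can see both side by side (the student data here is made up):

```python
import json

# The same three students, first as CSV text...
csv_text = """Name,Grade
Alice,91
Bob,84
Charlie,88"""

# ...then restructured as JSON. CSV stays flat; JSON nests each
# student inside a labelled object.
students = []
for line in csv_text.splitlines()[1:]:   # skip the header row
    name, grade = line.split(',')
    students.append({'name': name, 'grade': int(grade)})

json_text = json.dumps({'students': students}, indent=4)
print(json_text)
```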

20.3 CSV Files: Your Gateway to Spreadsheet Data

CSV stands for “Comma-Separated Values” - it’s the simplest way to store table-like data. Every spreadsheet program can export to CSV, making it a universal data format.

Note: Mental Model: CSV as a Text Spreadsheet

Picture a spreadsheet. Now imagine stripping away the grid lines, the colours, the formulas — and just keeping the text, with commas where the column borders were. That’s a CSV file. Each line is a row, each comma marks a new column. You could create one in any text editor.

Understanding CSV Structure

Imagine a grade book:

Name,Quiz1,Quiz2,MidTerm,Final
Alice,85,92,88,91
Bob,78,85,82,79
Charlie,91,88,94,96

Each line is a row; commas separate columns. Simple, but powerful!

The AI Partnership Approach

Let’s explore CSV files together:

Tip: Prompt Engineering for CSV

“I have a CSV file with student grades. Show me how to read it and calculate each student’s average. Keep it simple - just the basics.”

AI will likely show you Python’s csv module. But here’s the learning approach:

  1. First, understand the structure - Read the file as plain text first
  2. Then, parse manually - Split by commas yourself
  3. Finally, use the tools - Apply the csv module

Building a Grade Analyzer

Let’s design a program that reads student grades and provides insights:

def read_grades_simple(filename):
    """Read grades from CSV - learning version"""
    grades = []
    
    with open(filename, 'r') as file:
        # Skip header line
        header = file.readline()
        
        # Read each student
        for line in file:
            parts = line.strip().split(',')
            student = {
                'name': parts[0],
                'grades': [int(parts[i]) for i in range(1, len(parts))]
            }
            grades.append(student)
    
    return grades

def calculate_average(grades):
    """Calculate average grade"""
    return sum(grades) / len(grades)

# Use the functions
students = read_grades_simple('grades.csv')
for student in students:
    avg = calculate_average(student['grades'])
    print(f"{student['name']}: {avg:.1f}")
Warning: Expression Explorer: List Comprehension

The line [int(parts[i]) for i in range(1, len(parts))] is a list comprehension. Ask AI: “Explain this list comprehension by showing me the loop version first.”
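If you asked, the loop version the AI shows you would look something like this sketch:

```python
parts = ['Alice', '85', '92', '88', '91']

# Loop version of [int(parts[i]) for i in range(1, len(parts))]:
grades = []
for i in range(1, len(parts)):       # start at 1 to skip the name
    grades.append(int(parts[i]))

print(grades)   # [85, 92, 88, 91]
```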

Common CSV Patterns

When working with CSV files, you’ll often need to:

  1. Skip headers - First line often contains column names
  2. Handle missing data - Empty cells are common
  3. Convert types - Everything starts as text
  4. Deal with special characters - Commas in data, quotes, etc.
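Python's built-in csv module handles the header row and quoted commas for you; type conversion and missing cells are still your job. A sketch using csv.DictReader on an in-memory sample:

```python
import csv
import io

# csv.DictReader pairs each value with its header name and copes
# with commas inside quoted fields, which a plain split() would break on.
raw = io.StringIO('Name,Quiz1,Note\n"Smith, Alice",85,ok\nBob,78,\n')

rows = []
for row in csv.DictReader(raw):
    rows.append(row)

print(rows[0]['Name'])   # Smith, Alice  (the quoted comma survives)
print(rows[1]['Note'])   # empty string: a missing cell, not a crash
```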

20.4 From Tables to Trees: JSON

CSV works brilliantly for flat, table-shaped data. But what happens when your data has structure within structure — a contact who has multiple phone numbers, or a student who belongs to several clubs? You need a format that can nest information. That’s where JSON comes in.

JSON (JavaScript Object Notation) is how modern applications share data. It’s like Python dictionaries written as text - perfect for complex, nested information.

Note: Mental Model: JSON as Labelled Boxes

Think of JSON like a set of nested labelled boxes. You open a box marked “contacts” and find smaller boxes inside, each labelled with a person’s name. Open one of those and you find even smaller boxes: “phone”, “email”, “tags”. Each box either contains a value or more boxes. If you’ve used Python dictionaries, you already understand the idea.

Ask your AI:

Take this CSV data and convert it to JSON. Then show me a case
where CSV can't easily represent the data but JSON can — like a
student who has multiple email addresses.

This will show you exactly why both formats exist.

Understanding JSON Structure

Here’s a contact list in JSON:

{
    "contacts": [
        {
            "name": "Alice Smith",
            "phone": "555-1234",
            "email": "alice@email.com",
            "tags": ["friend", "work"]
        },
        {
            "name": "Bob Jones",
            "phone": "555-5678",
            "email": "bob@email.com",
            "tags": ["family"]
        }
    ],
    "last_updated": "2024-03-15"
}

Look familiar? It’s like the dictionaries you’ve been using!
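Once json.load has parsed the text, you navigate the result exactly like the dictionaries and lists you already know. A sketch using a trimmed copy of the contact data above:

```python
data = {
    "contacts": [
        {"name": "Alice Smith", "tags": ["friend", "work"]},
        {"name": "Bob Jones", "tags": ["family"]},
    ],
    "last_updated": "2024-03-15",
}

# Each level of nesting is one more [] lookup.
first_name = data["contacts"][0]["name"]     # open the outer box, then the first inner box
first_tag = data["contacts"][0]["tags"][0]   # and a box inside that one

print(first_name, first_tag)
```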

Working with JSON Data

Python makes JSON easy:

import json

def load_contacts(filename):
    """Load contacts from JSON file"""
    with open(filename, 'r') as file:
        data = json.load(file)
    return data

def save_contacts(contacts, filename):
    """Save contacts to JSON file"""
    with open(filename, 'w') as file:
        json.dump(contacts, file, indent=4)

# Use it
data = load_contacts('contacts.json')
print(f"You have {len(data['contacts'])} contacts")
Tip: AI Learning Pattern

Ask AI: “I have a JSON file with nested data. Show me how to navigate through it step by step, printing what’s at each level.”
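One answer the AI might give is a small recursive walker. A sketch (the function name describe is my own, not from any library):

```python
def describe(value, indent=0, lines=None):
    """Collect one line per level of a nested JSON-style structure."""
    if lines is None:
        lines = []
    pad = '  ' * indent
    if isinstance(value, dict):
        for key, inner in value.items():
            lines.append(f"{pad}{key}:")
            describe(inner, indent + 1, lines)   # go one box deeper
    elif isinstance(value, list):
        for item in value:
            describe(item, indent + 1, lines)
    else:
        lines.append(f"{pad}{value}")            # a plain value: stop here
    return lines

for line in describe({"contacts": [{"name": "Alice", "tags": ["friend"]}]}):
    print(line)
```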

JSON vs CSV: Choosing the Right Format

Use CSV when:

  - Data is tabular (rows and columns)
  - You need Excel compatibility
  - Structure is simple and flat

Use JSON when:

  - Data has nested relationships
  - You need flexible structure
  - Working with web APIs

20.5 Real-World Data Analysis

Let’s combine everything into a practical example - analysing weather data:

The Weather Data Project

Imagine you have weather data in CSV format:

Date,Temperature,Humidity,Conditions
2024-03-01,72,65,Sunny
2024-03-02,68,70,Cloudy
2024-03-03,65,80,Rainy

Let’s build an analyzer:

def analyze_weather(filename):
    """Analyse weather patterns"""
    data = []
    
    # Read the data
    with open(filename, 'r') as file:
        header = file.readline()
        for line in file:
            parts = line.strip().split(',')
            data.append({
                'date': parts[0],
                'temp': int(parts[1]),
                'humidity': int(parts[2]),
                'conditions': parts[3]
            })
    
    # Find patterns
    temps = [day['temp'] for day in data]
    avg_temp = sum(temps) / len(temps)
    
    rainy_days = [day for day in data if day['conditions'] == 'Rainy']
    
    return {
        'average_temperature': avg_temp,
        'total_days': len(data),
        'rainy_days': len(rainy_days),
        'data': data
    }
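To try the analyzer without a file already on disk, you can write the sample rows to a temporary file first. A self-contained sketch that inlines the same parsing steps:

```python
import os
import tempfile

sample = (
    "Date,Temperature,Humidity,Conditions\n"
    "2024-03-01,72,65,Sunny\n"
    "2024-03-02,68,70,Cloudy\n"
    "2024-03-03,65,80,Rainy\n"
)

# Write the sample data to a temporary file so there is something to read.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as f:
    f.write(sample)
    path = f.name

# Same parsing steps as analyze_weather, inlined for the demo.
data = []
with open(path) as file:
    file.readline()                       # skip the header
    for line in file:
        parts = line.strip().split(',')
        data.append({'date': parts[0], 'temp': int(parts[1])})

avg_temp = sum(d['temp'] for d in data) / len(data)
print(f"Average temperature: {avg_temp:.1f}")
os.remove(path)
```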

20.6 Data Cleaning: The Hidden Challenge

Real-world data is messy! Here’s what you’ll encounter:

Common Data Problems

  1. Missing values - Empty cells or “N/A”
  2. Inconsistent formats - “3/15/24” vs “2024-03-15”
  3. Extra spaces - “ Alice ” vs “Alice”
  4. Wrong types - “123” stored as text

Cleaning Strategies

def clean_value(value):
    """Clean a data value"""
    # Remove extra spaces
    value = value.strip()
    
    # Handle empty values
    if value == "" or value == "N/A":
        return None
    
    return value

def safe_int(value):
    """Convert to int safely"""
    try:
        return int(value)
    except (ValueError, TypeError):
        # TypeError covers the None that clean_value returns
        return 0
Important: Data Cleaning Reality

Professional programmers often spend more time cleaning data than analysing it! When working with AI, always ask: “What could go wrong with this data? Show me how to handle those cases.”
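The two helpers are meant to run in sequence on each cell. A self-contained sketch (redefined here so it runs on its own; it catches TypeError as well, since clean_value can hand safe_int a None):

```python
def clean_value(value):
    """Strip spaces and turn empty or N/A cells into None."""
    value = value.strip()
    if value == "" or value == "N/A":
        return None
    return value

def safe_int(value):
    """Convert to int, falling back to 0 on bad input."""
    try:
        return int(value)
    except (ValueError, TypeError):
        # TypeError covers None, ValueError covers text like "ninety"
        return 0

# A messy row, as it might arrive from a real spreadsheet export.
raw_row = [" Alice ", "85", "N/A", "ninety"]

cleaned = [clean_value(v) for v in raw_row]
print(cleaned)                                    # ['Alice', '85', None, 'ninety']
print(safe_int(cleaned[1]), safe_int(cleaned[2]), safe_int(cleaned[3]))   # 85 0 0
```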

20.7 Building a Data Pipeline

A data pipeline is a series of steps that transform raw data into useful information:

  1. Load - Read from file
  2. Clean - Fix problems
  3. Transform - Calculate new values
  4. Analyse - Find patterns
  5. Report - Present results

Example: Student Performance Pipeline

def process_student_data(csv_file):
    """Complete pipeline for student data"""
    # Load
    students = load_csv(csv_file)
    
    # Clean
    for student in students:
        student['grades'] = [safe_int(g) for g in student['grades']]
    
    # Transform
    for student in students:
        student['average'] = calculate_average(student['grades'])
        student['letter_grade'] = get_letter_grade(student['average'])
    
    # Analyse
    class_average = sum(s['average'] for s in students) / len(students)
    
    # Report
    print(f"Class Average: {class_average:.1f}")
    print("\nTop Students:")
    top_students = sorted(students, key=lambda s: s['average'], reverse=True)[:3]
    for student in top_students:
        print(f"  {student['name']}: {student['average']:.1f}")
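The pipeline assumes helpers from earlier in the chapter (safe_int, calculate_average) plus two that haven't been defined: load_csv and get_letter_grade. Here is one possible get_letter_grade; the cut-off values are an assumption, not a standard:

```python
def get_letter_grade(average):
    """Map a numeric average onto a letter grade (cut-offs are assumed)."""
    if average >= 90:
        return 'A'
    elif average >= 80:
        return 'B'
    elif average >= 70:
        return 'C'
    elif average >= 60:
        return 'D'
    else:
        return 'F'

print(get_letter_grade(86.5))   # B
```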

20.8 Working with Large Files

Sometimes data files are huge - millions of rows! Here’s how to handle them:

Reading Files in Chunks

def process_large_file(filename, chunk_size=1000):
    """Process a large file in chunks"""
    with open(filename, 'r') as file:
        header = file.readline()
        
        chunk = []
        for line in file:
            chunk.append(line.strip())
            
            if len(chunk) >= chunk_size:
                process_chunk(chunk)
                chunk = []
        
        # Don't forget the last chunk!
        if chunk:
            process_chunk(chunk)
Tip: Memory Management

When AI suggests loading entire files into memory, ask: “What if this file had a million rows? Show me how to process it in chunks.”
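Chunking is one approach; a generator gets the same flat memory profile by yielding one parsed row at a time. A sketch with a tiny demo file:

```python
import os
import tempfile

def read_rows(filename):
    """Yield one parsed row at a time instead of loading the whole file."""
    with open(filename, 'r') as file:
        file.readline()                  # skip the header
        for line in file:
            yield line.strip().split(',')

# Tiny demo file; a real one could have millions of rows and
# memory use would stay flat either way.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as f:
    f.write("Name,Score\nAlice,95\nBob,78\n")
    path = f.name

rows = [row for row in read_rows(path)]
print(rows)
os.remove(path)
```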

20.9 Data Formats Quick Reference

CSV Quick Reference

# Read CSV
with open('data.csv', 'r') as file:
    lines = file.readlines()

# Write CSV
with open('output.csv', 'w') as file:
    file.write('Name,Score\n')
    file.write('Alice,95\n')

JSON Quick Reference

# Read JSON
import json
with open('data.json', 'r') as file:
    data = json.load(file)

# Write JSON
with open('output.json', 'w') as file:
    json.dump(data, file, indent=4)

20.10 Common Pitfalls and Solutions

Pitfall 1: Assuming Clean Data

Problem: Your code crashes on real data
Solution: Always validate and clean first

Pitfall 2: Loading Everything at Once

Problem: Program runs out of memory
Solution: Process in chunks

Pitfall 3: Hardcoding Column Positions

Problem: Code breaks when columns change
Solution: Use header row to find columns
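One fix is to build a name-to-position map from the header row instead of writing row[4]. A sketch using the grade-book header from earlier:

```python
header_line = "Name,Quiz1,Quiz2,MidTerm,Final"
row_line = "Alice,85,92,88,91"

# Build a name -> position map from the header instead of hardcoding indexes.
columns = {name: i for i, name in enumerate(header_line.split(','))}

row = row_line.split(',')
final_score = int(row[columns['Final']])   # survives a new column being inserted
print(final_score)   # 91
```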

Pitfall 4: Ignoring Encoding Issues

Problem: Special characters appear as ???
Solution: Specify encoding when opening files
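A sketch of the fix: name the encoding explicitly on both write and read (UTF-8 here is an assumption; match whatever produced the file):

```python
import os
import tempfile

# Write a name with an accented character, then read it back.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf-8') as f:
    f.write("Name\nRenée\n")
    path = f.name

# Naming the encoding explicitly avoids mojibake on platforms whose
# default encoding isn't UTF-8.
with open(path, 'r', encoding='utf-8') as f:
    text = f.read()

print(text)
os.remove(path)
```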

20.11 Common AI Complications

When you ask AI to help with data processing, watch for these patterns where it overcomplicates things.

Pandas for everything. Ask AI to read a CSV file and it will often reach for pandas, a powerful data analysis library. For simple tasks — reading rows, calculating averages, filtering — Python’s built-in csv module or even plain string splitting is simpler and teaches you more. Save pandas for when you actually need its power.

One-liner list comprehensions. AI loves to compress data processing into dense single lines like [{k: (int(v) if v.isdigit() else v) for k, v in zip(headers, line.split(','))} for line in open('data.csv').readlines()[1:]]. This is clever but unreadable. A simple for loop with clear variable names is almost always better when you’re learning.
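Unfolded into a loop with named variables, the same logic becomes something you can read and debug. A sketch operating on in-memory lines rather than a real file:

```python
def parse_csv_lines(lines):
    """Readable version of the dense one-liner above."""
    headers = lines[0].strip().split(',')
    records = []
    for line in lines[1:]:                 # skip the header row
        record = {}
        for key, value in zip(headers, line.strip().split(',')):
            if value.isdigit():            # convert numeric cells to int
                value = int(value)
            record[key] = value
        records.append(record)
    return records

sample = ["Name,Quiz1\n", "Alice,85\n", "Bob,78\n"]
print(parse_csv_lines(sample))
```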

Over-engineered error handling. AI might wrap every line in try/except blocks or create elaborate validation classes for a 20-line script. Start with the straightforward version. Add error handling where your program actually crashes on real data.

Ask your AI:

Show me the simplest possible way to read a CSV file and print
each row. No pandas, no list comprehensions, just basic Python.

Compare what you get to earlier responses. Simpler is almost always better when you’re learning.

20.12 Practice Projects

Project 1: Grade Book Analyzer

Create a program that:

  - Reads student grades from CSV
  - Calculates averages and letter grades
  - Identifies struggling students
  - Generates a summary report

Project 2: Weather Tracker

Build a system that:

  - Loads historical weather data
  - Finds temperature trends
  - Identifies extreme weather days
  - Exports summaries to JSON

Project 3: Sales Data Processor

Develop a tool that:

  - Processes sales transactions (CSV)
  - Calculates daily/monthly totals
  - Finds best-selling products
  - Handles refunds and errors

20.13 Connecting to the Real World

Working with data files is your bridge to real-world programming. Data is everywhere:

  - Scientists analyse research data
  - Teachers track student progress
  - Businesses monitor sales and inventory
  - Developers process application logs

The skills you’ve learned here apply everywhere!

20.14 Looking Ahead

Next chapter, you’ll learn to get data from the internet using APIs - taking your programs from working with static files to live, updating information. Imagine weather data that’s always current, or stock prices that update in real-time!

20.15 Chapter Summary

You’ve learned to:

  - Read and write CSV files for tabular data
  - Work with JSON for complex, nested data
  - Clean and validate real-world data
  - Process large files efficiently
  - Build complete data pipelines

These aren’t just programming skills - they’re data literacy skills that apply whether you’re coding, using spreadsheets, or just understanding how modern applications work.

20.16 Reflection Prompts

  1. Data Format Choice: When would you choose CSV vs JSON for a project?
  2. Error Handling: What could go wrong when reading data files?
  3. Real Applications: What data would you like to analyse with these skills?
  4. Pipeline Thinking: How does breaking processing into steps help?

Remember: Every major application works with data files. You now have the foundation to build real tools that solve real problems!