Chapter 20: Working with Data
In this chapter, you’ll learn how to work with real-world data files. You’ll discover how to read CSV files, process JSON data, and analyse information - skills that transform your programs from toys to tools. This is where programming becomes practical!
20.1 The Concept First
Up until now, your programs have worked with data you typed in or created yourself. But real programs work with existing data — grade books, weather records, product inventories, survey results. The data already exists somewhere; your job is to read it, make sense of it, and do something useful.
Think of it this way: you already know how to cook (write programs). Now you need to learn how to unpack groceries (read data files). The food arrives in different packaging — cans, bags, boxes — and you need to know how to open each one before you can start cooking.
Two formats dominate the data world. CSV is like a spreadsheet saved as plain text — rows and columns separated by commas, nothing fancy. JSON is like a set of nested labelled boxes — open one and you might find more boxes inside, each with its own label. Most data you’ll encounter comes in one of these two formats.
20.2 Discovering Data with Your AI Partner
Before we write any code, let’s build intuition for how data files work.
Ask your AI:
Show me what a CSV file looks like as plain text, then show me
the same information as JSON. Use a simple example like 3 students
with names and grades.
Notice how CSV is flat and table-like while JSON can nest information inside other information. This difference drives when you’d choose each format.
20.3 CSV Files: Your Gateway to Spreadsheet Data
CSV stands for “Comma-Separated Values” - it’s the simplest way to store table-like data. Every spreadsheet program can export to CSV, making it a universal data format.
Picture a spreadsheet. Now imagine stripping away the grid lines, the colours, the formulas — and just keeping the text, with commas where the column borders were. That’s a CSV file. Each line is a row, each comma marks a new column. You could create one in any text editor.
Understanding CSV Structure
Imagine a grade book:
```
Name,Quiz1,Quiz2,MidTerm,Final
Alice,85,92,88,91
Bob,78,85,82,79
Charlie,91,88,94,96
```
Each line is a row, commas separate columns. Simple, but powerful!
The AI Partnership Approach
Let’s explore CSV files together:
“I have a CSV file with student grades. Show me how to read it and calculate each student’s average. Keep it simple - just the basics.”
AI will likely show you Python’s csv module. But here’s the learning approach:
- First, understand the structure - Read the file as plain text first
- Then, parse manually - Split by commas yourself
- Finally, use the tools - Apply the csv module
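These three stages can be sketched in a few lines. The sample data below is made up, and an in-memory string stands in for a real file:

```python
import csv
import io

# A tiny CSV sample standing in for a real file (made-up data).
raw = "Name,Quiz1,Quiz2\nAlice,85,92\nBob,78,85\n"

# 1. Understand the structure: look at the raw text first.
print(raw)

# 2. Parse manually: split each line on commas yourself.
lines = raw.strip().split('\n')
rows_manual = [line.split(',') for line in lines]

# 3. Use the tools: let the csv module do the same job.
rows_csv = list(csv.reader(io.StringIO(raw)))

print(rows_manual[1])  # ['Alice', '85', '92']
print(rows_csv[1])     # ['Alice', '85', '92']
```

For simple data the two approaches agree; the csv module earns its keep once quoted fields and embedded commas appear.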
Building a Grade Analyzer
Let’s design a program that reads student grades and provides insights:
```python
def read_grades_simple(filename):
    """Read grades from CSV - learning version"""
    grades = []
    with open(filename, 'r') as file:
        # Skip header line
        header = file.readline()
        # Read each student
        for line in file:
            parts = line.strip().split(',')
            student = {
                'name': parts[0],
                'grades': [int(parts[i]) for i in range(1, len(parts))]
            }
            grades.append(student)
    return grades

def calculate_average(grades):
    """Calculate average grade"""
    return sum(grades) / len(grades)

# Use the functions
students = read_grades_simple('grades.csv')
for student in students:
    avg = calculate_average(student['grades'])
    print(f"{student['name']}: {avg:.1f}")
```

The line `[int(parts[i]) for i in range(1, len(parts))]` is a list comprehension. Ask AI: “Explain this list comprehension by showing me the loop version first.”
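For comparison, here is what that comprehension looks like unrolled into a plain loop, using one split row from the sample grade book:

```python
# One CSV row after line.strip().split(',')
parts = ['Alice', '85', '91', '88', '91']

# The loop version of [int(parts[i]) for i in range(1, len(parts))]
grades = []
for i in range(1, len(parts)):    # start at 1 to skip the name
    grades.append(int(parts[i]))  # convert each grade from text to int

print(grades)  # [85, 91, 88, 91]
```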
Common CSV Patterns
When working with CSV files, you’ll often need to:
- Skip headers - First line often contains column names
- Handle missing data - Empty cells are common
- Convert types - Everything starts as text
- Deal with special characters - Commas in data, quotes, etc.
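Python's csv module handles all four patterns. A sketch with made-up data that includes a quoted comma and an empty cell:

```python
import csv
import io

# Sample with a quoted comma and a missing value (made-up data).
raw = 'Name,Score,Notes\n"Smith, Alice",85,Good work\nBob,,Absent\n'

reader = csv.reader(io.StringIO(raw))
header = next(reader)  # skip the header row

rows = []
for name, score, notes in reader:
    # Empty cells arrive as '' - convert only when there is a value.
    score = int(score) if score else None
    rows.append((name, score, notes))

print(rows)  # [('Smith, Alice', 85, 'Good work'), ('Bob', None, 'Absent')]
```

Notice that csv.reader keeps "Smith, Alice" as one field; a naive split(',') would have broken it in two.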
20.4 From Tables to Trees: JSON
CSV works brilliantly for flat, table-shaped data. But what happens when your data has structure within structure — a contact who has multiple phone numbers, or a student who belongs to several clubs? You need a format that can nest information. That’s where JSON comes in.
JSON (JavaScript Object Notation) is how modern applications share data. It’s like Python dictionaries written as text - perfect for complex, nested information.
Think of JSON like a set of nested labelled boxes. You open a box marked “contacts” and find smaller boxes inside, each labelled with a person’s name. Open one of those and you find even smaller boxes: “phone”, “email”, “tags”. Each box either contains a value or more boxes. If you’ve used Python dictionaries, you already understand the idea.
Ask your AI:
Take this CSV data and convert it to JSON. Then show me a case
where CSV can't easily represent the data but JSON can — like a
student who has multiple email addresses.
This will show you exactly why both formats exist.
Understanding JSON Structure
Here’s a contact list in JSON:
```json
{
    "contacts": [
        {
            "name": "Alice Smith",
            "phone": "555-1234",
            "email": "alice@email.com",
            "tags": ["friend", "work"]
        },
        {
            "name": "Bob Jones",
            "phone": "555-5678",
            "email": "bob@email.com",
            "tags": ["family"]
        }
    ],
    "last_updated": "2024-03-15"
}
```

Look familiar? It’s like the dictionaries you’ve been using!
Working with JSON Data
Python makes JSON easy:
```python
import json

def load_contacts(filename):
    """Load contacts from JSON file"""
    with open(filename, 'r') as file:
        data = json.load(file)
    return data

def save_contacts(contacts, filename):
    """Save contacts to JSON file"""
    with open(filename, 'w') as file:
        json.dump(contacts, file, indent=4)

# Use it
data = load_contacts('contacts.json')
print(f"You have {len(data['contacts'])} contacts")
```

Ask AI: “I have a JSON file with nested data. Show me how to navigate through it step by step, printing what’s at each level.”
JSON vs CSV: Choosing the Right Format
Use CSV when:
- Data is tabular (rows and columns)
- You need Excel compatibility
- Structure is simple and flat

Use JSON when:
- Data has nested relationships
- You need flexible structure
- Working with web APIs
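The two formats also convert into each other easily, which helps when a tool wants one and you have the other. A sketch going from flat CSV rows to nested JSON (the grade data is made up):

```python
import csv
import io
import json

# Flat CSV in, nested JSON out.
raw = "Name,Quiz1,Quiz2\nAlice,85,92\nBob,78,85\n"

reader = csv.DictReader(io.StringIO(raw))
students = []
for row in reader:
    students.append({
        'name': row['Name'],
        'grades': [int(row['Quiz1']), int(row['Quiz2'])]
    })

print(json.dumps({'students': students}, indent=4))
```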
20.5 Real-World Data Analysis
Let’s combine everything into a practical example - analysing weather data:
The Weather Data Project
Imagine you have weather data in CSV format:
```
Date,Temperature,Humidity,Conditions
2024-03-01,72,65,Sunny
2024-03-02,68,70,Cloudy
2024-03-03,65,80,Rainy
```
Let’s build an analyzer:
```python
def analyze_weather(filename):
    """Analyse weather patterns"""
    data = []
    # Read the data
    with open(filename, 'r') as file:
        header = file.readline()
        for line in file:
            parts = line.strip().split(',')
            data.append({
                'date': parts[0],
                'temp': int(parts[1]),
                'humidity': int(parts[2]),
                'conditions': parts[3]
            })
    # Find patterns
    temps = [day['temp'] for day in data]
    avg_temp = sum(temps) / len(temps)
    rainy_days = [day for day in data if day['conditions'] == 'Rainy']
    return {
        'average_temperature': avg_temp,
        'total_days': len(data),
        'rainy_days': len(rainy_days),
        'data': data
    }
```

20.7 Building a Data Pipeline
A data pipeline is a series of steps that transform raw data into useful information:
- Load - Read from file
- Clean - Fix problems
- Transform - Calculate new values
- Analyse - Find patterns
- Report - Present results
Example: Student Performance Pipeline
```python
# Assumes helpers like load_csv, safe_int, calculate_average and
# get_letter_grade are defined elsewhere.
def process_student_data(csv_file):
    """Complete pipeline for student data"""
    # Load
    students = load_csv(csv_file)
    # Clean
    for student in students:
        student['grades'] = [safe_int(g) for g in student['grades']]
    # Transform
    for student in students:
        student['average'] = calculate_average(student['grades'])
        student['letter_grade'] = get_letter_grade(student['average'])
    # Analyse
    class_average = sum(s['average'] for s in students) / len(students)
    # Report
    print(f"Class Average: {class_average:.1f}")
    print("\nTop Students:")
    top_students = sorted(students, key=lambda s: s['average'], reverse=True)[:3]
    for student in top_students:
        print(f"  {student['name']}: {student['average']:.1f}")
```

20.8 Working with Large Files
Sometimes data files are huge - millions of rows! Here’s how to handle them:
Reading Files in Chunks
```python
# Assumes process_chunk is defined elsewhere.
def process_large_file(filename, chunk_size=1000):
    """Process a large file in chunks"""
    with open(filename, 'r') as file:
        header = file.readline()
        chunk = []
        for line in file:
            chunk.append(line.strip())
            if len(chunk) >= chunk_size:
                process_chunk(chunk)
                chunk = []
        # Don't forget the last chunk!
        if chunk:
            process_chunk(chunk)
```

When AI suggests loading entire files into memory, ask: “What if this file had a million rows? Show me how to process it in chunks.”
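The same chunking pattern works on any sequence of items, not just file lines. A self-contained sketch with a stand-in process_chunk that simply records how many lines it was handed:

```python
totals = []

def process_chunk(chunk):
    """Stand-in handler: just tally the size of each chunk."""
    totals.append(len(chunk))

def process_in_chunks(lines, chunk_size=3):
    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            process_chunk(chunk)
            chunk = []
    # Don't forget the last, partial chunk!
    if chunk:
        process_chunk(chunk)

# Ten made-up rows, processed three at a time.
process_in_chunks([f"row {i}" for i in range(10)])
print(totals)  # [3, 3, 3, 1]
```

Only one chunk lives in memory at a time, which is the whole point: memory use depends on chunk_size, not on the size of the input.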
20.9 Data Formats Quick Reference
CSV Quick Reference
```python
# Read CSV
with open('data.csv', 'r') as file:
    lines = file.readlines()

# Write CSV
with open('output.csv', 'w') as file:
    file.write('Name,Score\n')
    file.write('Alice,95\n')
```

JSON Quick Reference
```python
import json

# Read JSON
with open('data.json', 'r') as file:
    data = json.load(file)

# Write JSON
with open('output.json', 'w') as file:
    json.dump(data, file, indent=4)
```

20.10 Common Pitfalls and Solutions
Pitfall 1: Assuming Clean Data
Problem: Your code crashes on real data.
Solution: Always validate and clean first.
Pitfall 2: Loading Everything at Once
Problem: Program runs out of memory.
Solution: Process in chunks.
Pitfall 3: Hardcoding Column Positions
Problem: Code breaks when columns change.
Solution: Use the header row to find columns.
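One way to apply that solution: look the column up by name in the header instead of hardcoding its position. The sample data is made up, with the columns deliberately reordered:

```python
import csv
import io

# The column order may change - look columns up by name instead.
raw = "Quiz1,Name,Quiz2\n85,Alice,92\n"  # Name is not first here!

reader = csv.reader(io.StringIO(raw))
header = next(reader)
name_col = header.index('Name')  # find the column by its label

for row in reader:
    print(row[name_col])  # Alice
```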
Pitfall 4: Ignoring Encoding Issues
Problem: Special characters appear as ???.
Solution: Specify the encoding when opening files.
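For the encoding pitfall, passing encoding= to open() is usually all it takes. A round-trip sketch using a temporary file:

```python
import os
import tempfile

# Write and read with an explicit encoding so accented characters
# survive the round trip.
path = os.path.join(tempfile.mkdtemp(), 'data.csv')

with open(path, 'w', encoding='utf-8') as file:
    file.write('Name,City\nZoë,Zürich\n')

with open(path, 'r', encoding='utf-8') as file:
    text = file.read()

print(text)
```

Without the encoding argument, Python falls back to a platform-dependent default, which is exactly how the ??? characters sneak in.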
20.11 Common AI Complications
When you ask AI to help with data processing, watch for these patterns where it overcomplicates things.
Pandas for everything. Ask AI to read a CSV file and it will often reach for pandas, a powerful data analysis library. For simple tasks — reading rows, calculating averages, filtering — Python’s built-in csv module or even plain string splitting is simpler and teaches you more. Save pandas for when you actually need its power.
One-liner list comprehensions. AI loves to compress data processing into dense single lines like [{k: (int(v) if v.isdigit() else v) for k, v in zip(headers, line.split(','))} for line in open('data.csv').readlines()[1:]]. This is clever but unreadable. A simple for loop with clear variable names is almost always better when you’re learning.
Over-engineered error handling. AI might wrap every line in try/except blocks or create elaborate validation classes for a 20-line script. Start with the straightforward version. Add error handling where your program actually crashes on real data.
Ask your AI:
Show me the simplest possible way to read a CSV file and print
each row. No pandas, no list comprehensions, just basic Python.
Compare what you get to earlier responses. Simpler is almost always better when you’re learning.
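If you want a baseline to compare against, here is one version in that spirit, with io.StringIO standing in for an opened file so the snippet runs on its own:

```python
import csv
import io

# Stand-in for an open file handle (made-up data).
file = io.StringIO("Name,Score\nAlice,95\nBob,87\n")

rows = []
for row in csv.reader(file):
    print(row)
    rows.append(row)
```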
20.12 Practice Projects
Project 1: Grade Book Analyzer
Create a program that:
- Reads student grades from CSV
- Calculates averages and letter grades
- Identifies struggling students
- Generates a summary report
Project 2: Weather Tracker
Build a system that:
- Loads historical weather data
- Finds temperature trends
- Identifies extreme weather days
- Exports summaries to JSON
Project 3: Sales Data Processor
Develop a tool that:
- Processes sales transactions (CSV)
- Calculates daily/monthly totals
- Finds best-selling products
- Handles refunds and errors
20.13 Connecting to the Real World
Working with data files is your bridge to real-world programming. Every field runs on data:
- Scientists analyse research data
- Teachers track student progress
- Businesses monitor sales and inventory
- Developers process application logs
The skills you’ve learned here apply everywhere!
20.14 Looking Ahead
Next chapter, you’ll learn to get data from the internet using APIs - taking your programs from working with static files to live, updating information. Imagine weather data that’s always current, or stock prices that update in real-time!
20.15 Chapter Summary
You’ve learned to:
- Read and write CSV files for tabular data
- Work with JSON for complex, nested data
- Clean and validate real-world data
- Process large files efficiently
- Build complete data pipelines
These aren’t just programming skills - they’re data literacy skills that apply whether you’re coding, using spreadsheets, or just understanding how modern applications work.
20.16 Reflection Prompts
- Data Format Choice: When would you choose CSV vs JSON for a project?
- Error Handling: What could go wrong when reading data files?
- Real Applications: What data would you like to analyse with these skills?
- Pipeline Thinking: How does breaking processing into steps help?
Remember: Every major application works with data files. You now have the foundation to build real tools that solve real problems!