# Test Dataset for Duplicate Finder Accuracy Validation

## Dataset goals

1. **Recall (completeness):** does the app find all real duplicates?
2. **Precision (accuracy):** does the app avoid marking unique files as duplicates?

**Validation rule:** Every duplicate group contains exactly 3 files. A group that the app reports with any other number of files is an error.

---

## Dataset structure

| Category | Files | Duplicate groups | Test goal |
|----------|------:|-----------------:|-----------|
| **Real duplicates** | 600 | 200 | Recall |
| **Traps (unique)** | 154 | 0 | Precision (FP=0) |
| **Edge cases (duplicates)** | 90 | 30 | Edge cases |
| **Edge cases (traps)** | 7 | 0 | Bundles |
| **Total** | **851** | **230** | |

**Expected result:** 230 groups of 3 files, 0 false positives

---

## Part 1: Real duplicates (600 files → 200 groups)

### 1.1. Simple duplicates (300 files → 100 groups)

Each group: original + 2 copies = 3 files

| Type | Groups | Files | Description |
|------|-------:|------:|-------------|
| JPEG photos | 20 | 60 | 20 originals × 3 |
| PNG photos | 10 | 30 | 10 originals × 3 |
| PDF documents | 15 | 45 | 15 originals × 3 |
| MP4 videos | 10 | 30 | 10 originals × 3 |
| MP3 audio | 10 | 30 | 10 originals × 3 |
| ZIP archives | 10 | 30 | 10 originals × 3 |
| TXT text | 10 | 30 | 10 originals × 3 |
| Code (py, js, swift) | 15 | 45 | 15 originals × 3 |
| **Total** | **100** | **300** | |

**Copy variations in each group:**
- Original: `folder_A/photo.jpg`
- Copy 1: `folder_A/photo_copy.jpg` (same folder, different name)
- Copy 2: `folder_B/photo.jpg` (different folder, same name)
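
The layout above could be produced with a small helper along these lines. This is a minimal sketch, not the generator actually used for the dataset: the function name `make_simple_group` and the idea of copying from a separate source file are assumptions made for illustration.

```python
import shutil
from pathlib import Path


def make_simple_group(original: Path, root: Path) -> list[Path]:
    """Lay out one 3-file group: original, same-folder copy, other-folder copy.

    `original` is assumed to live outside `root` (hypothetical source file).
    """
    folder_a = root / "folder_A"
    folder_b = root / "folder_B"
    folder_a.mkdir(parents=True, exist_ok=True)
    folder_b.mkdir(parents=True, exist_ok=True)

    first = folder_a / original.name
    # Copy 1: same folder, "_copy" inserted before the extension.
    second = folder_a / f"{original.stem}_copy{original.suffix}"
    # Copy 2: different folder, same name.
    third = folder_b / original.name

    for target in (first, second, third):
        shutil.copyfile(original, target)  # copies file content only
    return [first, second, third]
```

Calling `make_simple_group(Path("source/photo.jpg"), Path("photos_jpeg"))` would yield `folder_A/photo.jpg`, `folder_A/photo_copy.jpg`, and `folder_B/photo.jpg` with identical bytes.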

### 1.2. Duplicates with modified metadata (300 files → 100 groups)

Each group: original + 2 copies with modified metadata = 3 files

| Scenario | Groups | Files | What is changed in the copies |
|----------|-------:|------:|-------------------------------|
| Different filename | 20 | 60 | `photo.jpg` → `IMG_2024.jpg`, `DSC_001.jpg` |
| Different extension case | 10 | 30 | `file.jpg` → `file.JPG`, `file.Jpg` |
| Different modification date | 20 | 60 | `touch -t` — different dates |
| Different creation date | 10 | 30 | Birth date changed |
| Different permissions | 10 | 30 | `chmod 644`, `chmod 755`, `chmod 600` |
| Added xattr | 10 | 30 | Different extended attributes |
| In a hidden folder | 10 | 30 | `visible/`, `.hidden/`, `.secret/` |
| Hidden file (leading dot in the name) | 10 | 30 | `file.jpg`, `.file.jpg`, `..file.jpg` |
| **Total** | **100** | **300** | |
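
As an illustration of why these groups are still real duplicates, the sketch below copies a file, changes its modification time and permissions, and shows that the content hash stays the same. It covers only two of the scenarios in the table (creation date and xattr are platform-specific and left out), and the helper names `sha256_of` and `copy_with_metadata` are hypothetical.

```python
import hashlib
import os
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 of the file content (metadata is not part of the hash)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def copy_with_metadata(src: Path, dst: Path, mtime: float, mode: int) -> Path:
    """Copy the bytes of `src` to `dst`, then alter only filesystem metadata."""
    shutil.copyfile(src, dst)
    os.utime(dst, (mtime, mtime))  # set access and modification time
    os.chmod(dst, mode)            # e.g. 0o644, 0o755, 0o600
    return dst
```

After `copy_with_metadata(original, copy, 1_700_000_000, 0o600)`, `sha256_of(original) == sha256_of(copy)` still holds, which is why a content-based finder is expected to group these files.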

---

## Part 2: Traps — NOT duplicates (154 files)

**Rule:** For each trap subtype — 2 sets:
- Set A: 2 files
- Set B: 5 files
- Total: 7 files per subtype

### 2.1. Same names, different content (28 files)

```
folder_A/report.pdf  (version 1)
folder_B/report.pdf  (version 2 — DIFFERENT content)
```

| Subtype | Sets | Files | Description |
|--------|-----:|------:|-------------|
| Document versions | 2 (2+5) | 7 | Same name, different document versions |
| Re-shot photos | 2 (2+5) | 7 | `sunset.jpg` — different sunset photos |
| Configs | 2 (2+5) | 7 | `.gitignore` from different projects |
| README.md | 2 (2+5) | 7 | README from different projects |
| **Total** | **8** | **28** | |

### 2.2. Same size, different content (42 files)

```
file_1.bin  (1000 bytes, content AAA...)
file_2.bin  (1000 bytes, content BBB...)
```
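
Files like the pair above can be generated deterministically. The following is a minimal sketch under the assumption that each file is filled with one repeated letter, as in the example; `make_same_size_set` is a hypothetical helper, not part of any existing tool.

```python
from pathlib import Path


def make_same_size_set(folder: Path, size: int, count: int) -> list[Path]:
    """Write `count` files of exactly `size` bytes, each filled with a different letter."""
    if not 0 < count <= 26:
        raise ValueError("count must be between 1 and 26 (one letter per file)")
    folder.mkdir(parents=True, exist_ok=True)
    paths = []
    for index in range(count):
        fill = bytes([ord("A") + index])  # b"A", b"B", b"C", ...
        path = folder / f"file_{index + 1}.bin"
        path.write_bytes(fill * size)
        paths.append(path)
    return paths
```

Note that with `size=0` every file is empty, so the contents cannot differ; that case only varies the names.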

| Size | Sets | Files | Description |
|------|-----:|------:|-------------|
| 0 bytes | 2 (2+5) | 7 | Empty files with different names; their content is trivially identical, but they must not be reported as duplicates |
| 1 KB | 2 (2+5) | 7 | Different content |
| 10 KB | 2 (2+5) | 7 | Different content |
| 100 KB | 2 (2+5) | 7 | Different content |
| 1 MB | 2 (2+5) | 7 | Different content |
| 10 MB | 2 (2+5) | 7 | Different content |
| **Total** | **12** | **42** | |

### 2.3. Similar names, different content (28 files)

```
photo.jpg
photo_copy.jpg      ← DIFFERENT content!
photo (1).jpg       ← DIFFERENT content!
```

| Pattern | Sets | Files |
|---------|-----:|------:|
| `file` vs `file_copy` | 2 (2+5) | 7 |
| `file` vs `file (1)` | 2 (2+5) | 7 |
| `file` vs `file-backup` | 2 (2+5) | 7 |
| `file` vs `file_2024` | 2 (2+5) | 7 |
| **Total** | **8** | **28** |

### 2.4. Same file prefix, different length (28 files)

```
video_full.mp4      (100 MB full video)
video_cut.mp4       (10 MB — first 10 MB identical, but the file is shorter)
```
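
A trap of this kind can be built by writing only the leading bytes of the full file. The sketch below assumes a plain byte-level truncation, which matches the "first 10 MB identical" description but does not guarantee a playable video or a valid PDF; both helper names are hypothetical.

```python
from pathlib import Path


def make_truncated_copy(full: Path, cut: Path, keep_bytes: int) -> Path:
    """Write the first `keep_bytes` bytes of `full` into `cut`."""
    with full.open("rb") as src, cut.open("wb") as dst:
        dst.write(src.read(keep_bytes))
    return cut


def shares_prefix(short: Path, long: Path) -> bool:
    """True if `long` begins with the entire content of `short`."""
    data = short.read_bytes()
    with long.open("rb") as handle:
        return handle.read(len(data)) == data
```

A finder that compared only a leading chunk of each file would presumably report such a pair as duplicates, which is the false positive this subsection is meant to catch.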

| Subtype | Sets | Files | Description |
|--------|-----:|------:|-------------|
| Trimmed videos | 2 (2+5) | 7 | Full vs trimmed |
| Trimmed audio | 2 (2+5) | 7 | Full track vs preview |
| Partial archives | 2 (2+5) | 7 | Full vs corrupted archive |
| PDF with removed pages | 2 (2+5) | 7 | 10 pages vs 5 pages |
| **Total** | **8** | **28** | |

### 2.5. Visually similar, but technically different (28 files)

| Subtype | Sets | Files | Description |
|--------|-----:|------:|-------------|
| Photo re-saved with different quality | 2 (2+5) | 7 | JPEG 100% vs JPEG 80% |
| Photo with a minimal difference | 2 (2+5) | 7 | 1 pixel differs |
| Documents with an invisible difference | 2 (2+5) | 7 | Trailing space at end of line |
| Photo with different EXIF | 2 (2+5) | 7 | Same image, different embedded metadata, so the file bytes differ |
| **Total** | **8** | **28** | |

---

## Part 3: Edge cases (97 files)

### 3.1. Special filenames (60 files → 20 groups)

Real duplicates with unusual names. Each group = 3 files.

| Case | Groups | Files | Example name |
|------|-------:|------:|--------------|
| Unicode emoji | 3 | 9 | `📷photo.jpg` |
| Cyrillic | 3 | 9 | `фото.jpg` |
| CJK characters | 3 | 9 | `写真.jpg` |
| Spaces in name | 3 | 9 | `my vacation photo.jpg` |
| Special characters | 3 | 9 | `file@#$%&.jpg` |
| Very long names | 3 | 9 | 200+ characters |
| Starts with a dot | 2 | 6 | `.hidden_photo.jpg` |
| **Total** | **20** | **60** | |

### 3.2. Deep nesting (30 files → 10 groups)

Real duplicates in folders of different depths. Each group = 3 files.

```
level1/level2/level3/.../level15/deep_file.jpg    (copy 1)
level1/level2/shallow_file.jpg                     (copy 2)
root_file.jpg                                      (copy 3)
```
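
One way to lay out such a group is sketched below. It assumes the folder names `level1` … `level15` and the three file names from the example; the helpers `nested_dir` and `make_depth_group` are hypothetical.

```python
import shutil
from pathlib import Path


def nested_dir(root: Path, depth: int) -> Path:
    """Return root/level1/.../level<depth>, creating it if needed (depth 0 = root)."""
    path = root.joinpath(*(f"level{i}" for i in range(1, depth + 1)))
    path.mkdir(parents=True, exist_ok=True)
    return path


def make_depth_group(original: Path, root: Path) -> list[Path]:
    """Place three identical copies at depths 15, 2, and 0 below `root`."""
    targets = [
        nested_dir(root, 15) / "deep_file.jpg",
        nested_dir(root, 2) / "shallow_file.jpg",
        nested_dir(root, 0) / "root_file.jpg",
    ]
    for target in targets:
        shutil.copyfile(original, target)
    return targets
```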

| Depth | Groups | Files |
|-------|-------:|------:|
| 15 levels vs 2 levels vs 0 levels | 10 | 30 |

### 3.3. `.app` bundles as traps (7 files → 0 groups)

```
MyApp.app/           (an application bundle — it is a folder)
MyApp_copy.app/      (bundle copy)
```

| Subtype | Sets | Files | Description |
|--------|-----:|------:|-------------|
| Copies of .app bundles | 2 (2+5) | 7 | Must NOT be detected as duplicate files |

---

## Folder structure

```
~/DuplicateTestSet/
├── 01_true_duplicates/
│   ├── 1.1_simple/
│   │   ├── photos_jpeg/
│   │   ├── photos_png/
│   │   ├── documents_pdf/
│   │   ├── videos_mp4/
│   │   ├── audio_mp3/
│   │   ├── archives_zip/
│   │   ├── text_txt/
│   │   └── code/
│   └── 1.2_metadata_changed/
│       ├── renamed/
│       ├── extension_case/
│       ├── date_modified/
│       ├── date_created/
│       ├── permissions/
│       ├── xattr/
│       ├── hidden_folder/
│       └── dotfiles/
│
├── 02_traps_not_duplicates/
│   ├── 2.1_same_name_diff_content/
│   ├── 2.2_same_size_diff_content/
│   ├── 2.3_similar_names/
│   ├── 2.4_truncated/
│   └── 2.5_visually_similar/
│
├── 03_edge_cases/
│   ├── 3.1_special_names/
│   │   ├── unicode_emoji/
│   │   ├── cyrillic/
│   │   ├── hieroglyphs/
│   │   ├── spaces/
│   │   ├── special_chars/
│   │   ├── long_names/
│   │   └── dotfiles/
│   ├── 3.2_deep_nesting/
│   └── 3.3_bundles/
│
└── manifest.json
```

---

## manifest.json — for automated validation

```json
{
  "version": "1.0",
  "created": "2025-01-27",
  "summary": {
    "total_files": 851,
    "expected_duplicate_groups": 230,
    "expected_files_per_group": 3,
    "expected_false_positives": 0
  },
  "duplicate_groups": [
    {
      "id": "G001",
      "category": "1.1_simple",
      "type": "photos_jpeg",
      "files": [
        "01_true_duplicates/1.1_simple/photos_jpeg/folder_A/sunset.jpg",
        "01_true_duplicates/1.1_simple/photos_jpeg/folder_A/sunset_copy.jpg",
        "01_true_duplicates/1.1_simple/photos_jpeg/folder_B/sunset.jpg"
      ],
      "hash_sha256": "a1b2c3d4..."
    }
  ],
  "traps": [
    {
      "id": "TRAP_001",
      "category": "2.1_same_name_diff_content",
      "type": "documents_versions",
      "reason": "Same filename, different content",
      "files": [
        "02_traps_not_duplicates/2.1_same_name_diff_content/docs_v1/report.pdf",
        "02_traps_not_duplicates/2.1_same_name_diff_content/docs_v2/report.pdf"
      ]
    }
  ]
}
```
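
Before scanning, the dataset itself can be checked against the manifest. The sketch below verifies one entry of `duplicate_groups`: the listed files exist, there are exactly 3 of them, and their contents match. It assumes `hash_sha256` holds the full hex digest (the example above shows an abbreviated value), and `file_sha256` and `check_group` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large videos do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_group(root: Path, group: dict, expected_size: int = 3) -> list[str]:
    """Return a list of problems for one manifest group (empty list = OK)."""
    problems = []
    files = group["files"]
    if len(files) != expected_size:
        problems.append(
            f"{group['id']}: {len(files)} files listed, expected {expected_size}"
        )
    missing = [name for name in files if not (root / name).is_file()]
    if missing:
        problems.append(f"{group['id']}: missing {missing}")
        return problems
    digests = {file_sha256(root / name) for name in files}
    if len(digests) != 1:
        problems.append(f"{group['id']}: file contents differ")
    elif group["hash_sha256"] not in digests:
        # Assumes the manifest stores the full hex digest, not an abbreviation.
        problems.append(f"{group['id']}: hash does not match the manifest")
    return problems
```

A driver would load the file with `json.load` and call `check_group(root, group)` for every entry in `manifest["duplicate_groups"]`.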

---

## Metrics for evaluation

After scanning, compare the results with `manifest.json`:

| Metric | Formula | Expected |
|--------|---------|----------|
| **Recall** | Correctly found groups / 230 | ≥ 99% |
| **Precision** | Correctly found groups / all reported groups | 100% |
| **False Positives** | Reported groups not in the manifest | 0 |
| **Group size** | Files in each group | Exactly 3 |
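
These metrics could be computed roughly as follows. The sketch assumes that both the manifest groups and the app's results are available as lists of file-path lists, and that a reported group counts as correct only when its set of files matches a manifest group exactly; the function name `score_scan` and that matching rule are assumptions, not part of the dataset definition.

```python
def score_scan(
    expected: list[list[str]],
    found: list[list[str]],
    group_size: int = 3,
) -> dict:
    """Compare reported groups with manifest groups, ignoring file order."""
    expected_sets = {frozenset(group) for group in expected}
    found_sets = {frozenset(group) for group in found}
    correct = expected_sets & found_sets
    return {
        # Share of manifest groups that were reported exactly.
        "recall": len(correct) / len(expected_sets) if expected_sets else 1.0,
        # Share of reported groups that exist in the manifest.
        "precision": len(correct) / len(found_sets) if found_sets else 1.0,
        # Reported groups that match no manifest group.
        "false_positives": len(found_sets - expected_sets),
        # Reported groups that violate the "exactly 3 files" rule.
        "wrong_size_groups": sum(1 for g in found_sets if len(g) != group_size),
    }
```

With this strict matching, a group that contains the right files plus one extra counts both as a missed group and as a false positive; a more lenient rule is possible but would need to be defined separately.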

---

## Quick visual verification

1. Run the app on `~/DuplicateTestSet/`
2. Check the number of groups found → should be **230**
3. Go through the groups → each must have **exactly 3 files**
4. Verify that **no** files from `02_traps_not_duplicates/` or `03_edge_cases/3.3_bundles/` appear in the results
