godocs

Upload Deduplication — Client Guide

Instructions for building a client that uploads only new documents to godocs.

API Reference

Upload a document

POST /api/document/upload
Content-Type: multipart/form-data
Form field: file (the document file)

Responses:

Status Meaning Body
201 Created — new document ingested {"ulid": "01J...", "name": "file.pdf", "hash": "abc123...", "id": 42}
409 Conflict — duplicate already exists {"error": "duplicate document", "hash": "abc123...", "ulid": "01J...", "name": "file.pdf", "id": 42}
400 Bad request — sidecar file rejected {"error": "cannot upload sidecar files directly; ..."}
200 Ingested but ULID lookup failed {"path": "/ingress/file.pdf"}

The server computes the MD5 hash of the uploaded bytes and checks the database before writing to disk. A 409 response includes the existing document’s ULID, so the client can proceed with metadata/tag operations without re-uploading.

Supported file types: .pdf, .jpg, .jpeg, .png, .tiff, .doc, .docx, .odf, .rtf, .txt

Rejected sidecar extensions: .ocr.txt, .thumb.png, .tags.json, .tn_256.png
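A client that walks a directory can skip sidecar files before attempting an upload (the server would reject them with 400 anyway). A minimal sketch; the helper names are ours, and the extension lists mirror the two lists above:

```go
import "strings"

// sidecarSuffixes are the sidecar extensions the server rejects with 400.
var sidecarSuffixes = []string{".ocr.txt", ".thumb.png", ".tags.json", ".tn_256.png"}

// supportedExts are the document extensions the server accepts.
var supportedExts = []string{".pdf", ".jpg", ".jpeg", ".png", ".tiff", ".doc", ".docx", ".odf", ".rtf", ".txt"}

// isSidecar reports whether name ends in a sidecar suffix.
func isSidecar(name string) bool {
    lower := strings.ToLower(name)
    for _, s := range sidecarSuffixes {
        if strings.HasSuffix(lower, s) {
            return true
        }
    }
    return false
}

// isSupported reports whether name has an accepted extension
// and is not a sidecar (checked first, since .thumb.png also ends in .png).
func isSupported(name string) bool {
    if isSidecar(name) {
        return false
    }
    lower := strings.ToLower(name)
    for _, e := range supportedExts {
        if strings.HasSuffix(lower, e) {
            return true
        }
    }
    return false
}
```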

Look up a document by hash

GET /api/document/lookup?hash=<md5_hex>

Responses:

Status Meaning Body
200 Found {"ulid": "01J...", "name": "file.pdf", "path": "L/00/00/42/000042.orig.pdf", "id": 42, "hash": "abc123..."}
404 Not found

Set OCR text

PUT /api/document/:ulid/ocr
Content-Type: application/json
Body: {"text": "extracted text content"}

Set metadata

PUT /api/document/:ulid/metadata
Content-Type: application/json
Body: {"author": "...", "source": "scanner", ...}

Also auto-generates the thumbnail.

Add a tag

POST /api/documents/:ulid/tags
Content-Type: application/json
Body: {"tag_id": 1}

List all tags

GET /api/tags

Returns array of {"id": 1, "name": "Finance", "color": "#3273dc", ...}.

Hash Algorithm

MD5, lowercase hex string (32 characters). Example: d41d8cd98f00b204e9800998ecf8427e.

Go: use the standard library crypto/md5, the same algorithm the server applies.

import (
    "crypto/md5"
    "fmt"
    "io"
    "os"
)

// hashFile returns the lowercase hex MD5 of the file at path.
func hashFile(path string) (string, error) {
    f, err := os.Open(path)
    if err != nil {
        return "", err
    }
    defer f.Close()
    h := md5.New()
    if _, err := io.Copy(h, f); err != nil {
        return "", err
    }
    return fmt.Sprintf("%x", h.Sum(nil)), nil
}

Shell: md5sum file.pdf | cut -d' ' -f1

MD5 throughput is roughly 1.5 GB/s on modern CPUs, so a 50 MB file hashes in about 33 ms. On a Raspberry Pi, expect 50–200 MB/s, still under a second for large files.

Client Upload Strategy

Simple: upload and handle 409

Upload every file. If the server returns 409, use the ULID from the response body to continue with metadata/tag operations. No client-side hashing needed.

for each file:
    resp = POST /api/document/upload with file
    if resp.status == 201:
        ulid = resp.body.ulid       # new document
    elif resp.status == 409:
        ulid = resp.body.ulid       # already exists
    else:
        handle error
    # continue with OCR, metadata, tags using ulid

This is simplest but transfers every file over the network.

Efficient: hash-before-upload

Compute MD5 locally, check via lookup endpoint, skip upload if the document already exists.

for each file:
    hash = md5(file)
    resp = GET /api/document/lookup?hash={hash}
    if resp.status == 200:
        ulid = resp.body.ulid       # already on server
    else:
        resp = POST /api/document/upload with file
        ulid = resp.body.ulid       # 201 created
    # continue with OCR, metadata, tags using ulid

Optimal: hash-before-upload + local manifest

For repeated syncs, maintain a local manifest (path → {size, mtime, md5}) to skip files that haven’t changed since the last sync.

for each file:
    stat = os.Stat(file)
    if manifest[path].mtime == stat.mtime && manifest[path].size == stat.size:
        skip                        # unchanged since last sync

    hash = md5(file)
    if manifest[path].hash == hash:
        update manifest mtime, skip # content unchanged despite mtime change

    resp = GET /api/document/lookup?hash={hash}
    if resp.status == 200:
        ulid = resp.body.ulid
    else:
        resp = POST /api/document/upload with file
        ulid = resp.body.ulid

    update manifest {path, size, mtime, hash}
    # continue with OCR, metadata, tags using ulid

Complete Upload Flow (Go pseudocode)

func uploadDocument(client *http.Client, baseURL, filePath string) (string, error) {
    // 1. Hash locally
    hash, err := hashFile(filePath)
    if err != nil {
        return "", err
    }

    // 2. Check if already on server
    resp, err := client.Get(baseURL + "/api/document/lookup?hash=" + hash)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    if resp.StatusCode == 200 {
        var doc struct{ ULID string `json:"ulid"` }
        json.NewDecoder(resp.Body).Decode(&doc)
        return doc.ULID, nil // already exists
    }

    // 3. Upload
    body := &bytes.Buffer{}
    writer := multipart.NewWriter(body)
    part, err := writer.CreateFormFile("file", filepath.Base(filePath))
    if err != nil {
        return "", err
    }
    f, err := os.Open(filePath)
    if err != nil {
        return "", err
    }
    _, copyErr := io.Copy(part, f)
    f.Close()
    if copyErr != nil {
        return "", copyErr
    }
    writer.Close()

    resp, err = client.Post(baseURL+"/api/document/upload", writer.FormDataContentType(), body)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    var result struct {
        ULID  string `json:"ulid"`
        Error string `json:"error"`
    }
    json.NewDecoder(resp.Body).Decode(&result)

    switch resp.StatusCode {
    case 201:
        return result.ULID, nil // new document
    case 409:
        return result.ULID, nil // race condition: duplicate appeared between lookup and upload
    default:
        return "", fmt.Errorf("upload failed: %d %s", resp.StatusCode, result.Error)
    }
}

Approaches Considered and Rejected

Filename + filesize pre-filter — Not reliable. Same content can have different names; different content can share names.

Streaming hash during upload — Wastes bandwidth on duplicates. Hash-before-upload avoids the transfer entirely.

Partial hashing (first N bytes) — Not collision-safe for scanned documents with identical headers. Full MD5 is already milliseconds.