A Python library for manipulating, converting, and combining data. It provides robust error handling, progress tracking, and detailed logging, and uses only the Python standard library.
- 🔄 Smart Data Combining - Merge multiple JSON files with validation and error recovery
- 📊 Format Conversion - Convert LinkedIn Sales Navigator data to CSV
- 🛡️ Robust Error Handling - Comprehensive error catching and recovery
- 📝 Rich Logging - Detailed operation logs with customizable verbosity
- 📈 Progress Tracking - Real-time progress bars for long operations
- 🔒 Data Protection - Built-in .gitignore to prevent accidental data commits
- 🧪 Well Tested - Unit tests for core functionality
- 📚 Comprehensive Docs - Detailed guides and examples
data-extraction/
├── src/                          # Core library
│   ├── combiners/                # JSON file combining tools
│   │   └── json_merger.py
│   ├── converters/               # Format conversion tools
│   │   ├── linkedin_to_csv.py
│   │   └── linkedin_to_csv_enhanced.py
│   └── utils/                    # Utility functions
│       ├── file_utils.py         # File handling utilities
│       ├── logging_utils.py      # Logging setup and helpers
│       └── progress_utils.py     # Progress bars and spinners
├── workflows/                    # Pre-built complete workflows
│   └── linkedin_salesnav_pipeline.py
├── examples/                     # Usage examples
│   └── example_usage.py
├── docs/                         # Documentation
│   └── LINKEDIN_SALESNAV_GUIDE.md
├── tests/                        # Unit tests
│   └── test_suite.py
└── requirements.txt              # Dependencies (none - stdlib only!)
- JSON Combiner: Intelligently merge multiple JSON files
  - Supports both list and object JSON formats
  - Automatic error detection and recovery
  - Handles nested data structures
  - Error handling for malformed JSON
- LinkedIn JSON to CSV: Convert LinkedIn Sales Navigator data to CSV
  - Extracts company information
  - Builds LinkedIn URLs from entity URNs
  - Normalizes text and handles image artifacts
  - Processes spotlight badges
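
To make the list/object handling and malformed-file recovery concrete, here is a minimal standard-library sketch. It is not the code in `json_merger.py`: the function name `merge_json_files` and its return shape are assumptions made purely for illustration.

```python
import json
from pathlib import Path


def merge_json_files(input_dir, pattern="*.json"):
    """Hypothetical helper: merge JSON files into one list, skipping bad files.

    Illustrates the behaviour listed above; not the library's actual API.
    """
    merged, skipped = [], []
    for path in sorted(Path(input_dir).glob(pattern)):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError, OSError) as exc:
            # Malformed or unreadable file: record it and keep going.
            skipped.append((path.name, str(exc)))
            continue
        if isinstance(data, list):
            merged.extend(data)  # list format: add every element
        else:
            merged.append(data)  # object format: keep as a single record
    return merged, skipped
```

Returning the skipped files alongside the merged records lets the caller decide whether a partial merge is acceptable, rather than failing on the first bad file.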
# Clone the repository
git clone https://github.com/rahalio/data-extraction.git
cd data-extraction
# No external dependencies required - uses Python standard library only
# Python 3.8+ required

The easiest way to process LinkedIn Sales Navigator exports:
# Process all JSON files and create a CSV
python workflows/linkedin_salesnav_pipeline.py \
    --input-dir /path/to/json/files \
    --output-dir ./output

See the complete LinkedIn Sales Navigator Guide (docs/LINKEDIN_SALESNAV_GUIDE.md) for detailed instructions.
from src.combiners import combine_json_files
from src.converters import convert_json_to_csv

# Combine JSON files
result = combine_json_files(
    input_dir="./data",
    output_file="combined.json",
    pattern="*.json"
)

# Convert LinkedIn JSON to CSV
result = convert_json_to_csv(
    input_pattern="*.json",
    output_file="companies.csv",
    input_dir="./data"
)

To run the JSON combiner from the command line:
python src/combiners/json_merger.py \
    --input-dir ./data \
    --output combined.json \
    --pattern "*.json" \
    --verbose

To run the LinkedIn converter from the command line:
python src/converters/linkedin_to_csv_enhanced.py \
    --pattern "*.json" \
    --output companies.csv \
    --input-dir ./data \
    --verbose

JSON combiner options:
- --input-dir: Directory containing JSON files (default: current directory)
- --output: Output filename (default: combined.json)
- --pattern: Glob pattern for matching files (default: *.json)
- --verbose, -v: Enable verbose output with progress tracking

LinkedIn converter options:
- --pattern: Glob pattern for input JSON files (default: *.json)
- --output: Output CSV filename (default: companies.csv)
- --input-dir: Directory containing JSON files (default: current directory)
- --verbose, -v: Enable verbose output with progress tracking

Workflow pipeline options:
- --input-dir: Directory containing JSON export files (required)
- --output-dir: Directory for output files (default: same as input-dir)
- --keep-combined: Keep the intermediate combined.json file
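
For readers wiring up a similar command-line tool, the JSON combiner options documented above (`--input-dir`, `--output`, `--pattern`, `--verbose`/`-v`) could be declared with `argparse` roughly as follows. This is a sketch, not the parser in `json_merger.py`: the `build_parser` name is hypothetical, and using `"."` for "current directory" is an assumption.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented JSON combiner options."""
    parser = argparse.ArgumentParser(description="Combine JSON files")
    # "." stands in for "current directory" (an assumption for this sketch).
    parser.add_argument("--input-dir", default=".",
                        help="Directory containing JSON files")
    parser.add_argument("--output", default="combined.json",
                        help="Output filename")
    parser.add_argument("--pattern", default="*.json",
                        help="Glob pattern for matching files")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Enable verbose output with progress tracking")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--input-dir", "./data", "-v"])
print(args.input_dir, args.output, args.pattern, args.verbose)
# prints: ./data combined.json *.json True
```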
- Python 3.8 or higher
- No external dependencies (uses Python standard library only)
See LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
Repository: https://github.com/rahalio/data-extraction