A Node.js web scraping application that uses Puppeteer to search for job listings and extract email addresses from search engine results. Specifically designed for Hebrew job searches in the Israeli market.
Built in January 2019. This application demonstrates crawling search engine results, extracting URLs, parsing email addresses, validating and cleaning data, and storing results in MongoDB.
- 🔍 Automated search engine crawling using Puppeteer
- 📧 Email address extraction and validation
- 🧹 Extensive email cleaning and normalization (Israeli domains)
- 💾 MongoDB storage with duplicate prevention
- ⚡ Performance optimization with resource blocking
- 🧪 Test mode support for development
- 🇮🇱 Hebrew search query support
- 🔧 Configurable search parameters and limits
```mermaid
flowchart TD
    A[Start Application] --> B[Load Settings]
    B --> C{Test Mode?}
    C -->|Yes| D[Load Local Sources]
    C -->|No| E[Initialize Puppeteer]
    D --> F[Extract URLs]
    E --> G[Navigate to Search URL]
    G --> H[Block Resources]
    H --> I[Wait for Page Load]
    I --> F
    F --> J[Parse Email Addresses]
    J --> K[Clean & Validate Emails]
    K --> L[Store in MongoDB]
    L --> M{Goal Reached?}
    M -->|No| N{Max Processes?}
    M -->|Yes| O[End]
    N -->|No| G
    N -->|Yes| O
```
```mermaid
flowchart LR
    A[Search Keys] --> B[Build Query]
    B --> C[Search Engine]
    C --> D[HTML Content]
    D --> E[URL Extraction]
    E --> F[Email Regex]
    F --> G[Raw Emails]
    G --> H[Cleaning Logic]
    H --> I[Validation]
    I --> J[MongoDB]
    J --> K[Unique Emails]
```
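The "Email Regex" step of this pipeline can be sketched as a small extraction helper. The pattern below is illustrative only; the actual regex lives in `text.utils.js` and may differ:

```javascript
// Pull raw email candidates out of fetched HTML with a simple regex.
// Candidates still need cleaning and validation downstream.
function extractEmails(html) {
    const pattern = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
    return html.match(pattern) || [];
}
```

Matches are raw candidates; broken or typo'd addresses are handled later by the cleaning logic.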
- Node.js (v10 or higher)
- npm or yarn
- MongoDB (local or remote instance)
- Clone the repository:

```bash
git clone https://github.com/orassayag/puppeteer-example.git
cd puppeteer-example
```

- Install dependencies:

```bash
npm install
```

- Ensure MongoDB is running:

```bash
mongod
```

Edit the settings in `src/settings/settings.js`:
```javascript
{
    IS_TEST_MODE: true, // Test mode vs live crawling
    SEARCH_ENGINE_TYPE: 'bing', // Search engine to use
    GOAL: 1000, // Target email count
    MAXIMUM_SEARCH_PROCESSES_COUNT: 100, // Max search iterations
    SEARCH_ENGINE_PAGES_COUNT_PER_PROCESS: 3, // Pages per process
    MONGO_DATA_BASE_CONNECTION_STRING: 'mongodb://localhost:27017/crawl'
}
```

Customize search terms in `src/core/lists/searchKeys.list.js`:
- Job keywords (want)
- Professions (profession)
- Cities (city)
- Email keywords (email)
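A minimal sketch of how one keyword per category might be combined into a search query. The function name and random selection are assumptions for illustration; the real construction logic lives in `searchKey.service.js`:

```javascript
// Build one search query by picking a random keyword from each category.
function buildQuery(keys) {
    const pick = (arr) => arr[Math.floor(Math.random() * arr.length)];
    return [pick(keys.want), pick(keys.profession), pick(keys.city), pick(keys.email)].join(' ');
}
```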
```bash
npm start
```

The application will:
- Connect to MongoDB
- Begin crawling based on configured search parameters
- Extract and validate email addresses
- Store unique emails in the database
- Stop when goal is reached or max processes completed
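The stop conditions above can be sketched as a simple loop. All names here are illustrative; the real orchestration lives in `crawl.logic.js`:

```javascript
// Run search processes until the email goal is met or the
// maximum process count is exhausted, whichever comes first.
async function crawl({ goal, maxProcesses }, fetchEmails, saveUnique) {
    let total = 0;
    for (let i = 0; i < maxProcesses && total < goal; i++) {
        const emails = await fetchEmails(i); // crawl one search process
        total += await saveUnique(emails);   // count only newly stored emails
    }
    return total;
}
```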
For testing without live crawling:
- Set `IS_TEST_MODE: true` in settings
- Place test HTML files in the `sources/` directory
- Run the application
```
puppeteer-example/
├── src/
│   ├── core/
│   │   ├── enums/              # Color and search enums
│   │   │   └── files/
│   │   ├── lists/              # Search keys and filters
│   │   └── models/             # MongoDB schemas
│   ├── logics/                 # Main business logic
│   │   └── crawl.logic.js
│   ├── scripts/                # Entry points
│   │   └── crawl.script.js
│   ├── services/               # Services layer
│   │   └── files/
│   │       ├── database.service.js
│   │       ├── searchKey.service.js
│   │       ├── setup.service.js
│   │       └── source.service.js
│   ├── settings/               # Configuration
│   │   └── settings.js
│   └── utils/                  # Utility functions
│       └── files/
│           ├── color.utils.js
│           ├── file.utils.js
│           ├── log.utils.js
│           ├── path.utils.js
│           ├── system.utils.js
│           ├── text.utils.js
│           ├── time.utils.js
│           └── validation.utils.js
├── sources/                    # Test sources (test mode)
├── dist/                       # Output directory
├── package.json
├── README.md
├── LICENSE
├── CONTRIBUTING.md
└── INSTRUCTIONS.md
```
`crawl.logic.js` contains the main orchestration logic for the crawling process.
`text.utils.js` provides extensive text processing utilities, including:
- Email address extraction
- URL extraction
- Email cleaning and normalization
- Domain parsing
`database.service.js` handles the MongoDB connection and operations.
`searchKey.service.js` manages search query construction from the configured keywords.
The application includes sophisticated email cleaning for Israeli domains:
- Fixes common typos (`.co` → `.co.il`, `.ill` → `.il`)
- Removes invalid characters
- Handles `mailto:` links
- Validates against an email regex
- Uses the `validator` library for final validation
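A minimal sketch of the typo-fixing rules listed above; the exact rules, their order, and their scope in `text.utils.js` may differ:

```javascript
// Normalize a raw email candidate for Israeli domains:
// strip mailto:, lowercase, and fix common TLD typos.
function cleanEmail(raw) {
    let email = raw.trim().toLowerCase().replace(/^mailto:/, '');
    email = email.replace(/\.ill$/, '.il');        // ".ill" -> ".il"
    if (/@[^.]+\.co$/.test(email)) email += '.il'; // ".co"  -> ".co.il"
    return email;
}
```

A full pass would also drop invalid characters and run the result through the email regex and the `validator` library.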
- Resource Blocking: Blocks images, stylesheets, fonts, and scripts for faster page loads
- Request Interception: Puppeteer request interception for fine-grained control
- Configurable Limits: Control search depth and total processes
- Database Indexing: Unique constraint on email addresses prevents duplicates
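The resource-blocking decision can be isolated as a small predicate, which the crawler would call from Puppeteer's request-interception handler (the predicate name is illustrative):

```javascript
// Resource types blocked for faster page loads, as listed above.
const BLOCKED = new Set(['image', 'stylesheet', 'font', 'script']);
const shouldBlock = (resourceType) => BLOCKED.has(resourceType);

// Sketch of the handler wiring in the real crawler:
// await page.setRequestInterception(true);
// page.on('request', (req) =>
//     shouldBlock(req.resourceType()) ? req.abort() : req.continue());
```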
- Node.js - JavaScript runtime
- Puppeteer - Headless browser automation
- Mongoose - MongoDB object modeling
- Validator - String validation
- fs-extra - File system operations
- log-update - Terminal output
- ESLint - Code quality
```bash
npm run lint
```

Test with local HTML files by enabling test mode in settings.
Contributions to this project are released to the public under the project's open source license.
Everyone is welcome to contribute. Contributing doesn't just mean submitting pull requests—there are many different ways to get involved, including answering questions and reporting issues.
For more details, see CONTRIBUTING.md.
- Or Assayag - Initial work - orassayag
- Or Assayag [email protected]
- GitHub: https://github.com/orassayag
- StackOverflow: https://stackoverflow.com/users/4442606/or-assayag?tab=profile
- LinkedIn: https://linkedin.com/in/orassayag
This project is licensed under the MIT License - see the LICENSE file for details.
- Built as a demonstration of Puppeteer web scraping capabilities
- Designed for the Israeli job market with Hebrew language support
- Includes domain-specific email cleaning for Israeli TLDs