AIS Downloader
The AISDataDownloaderDuckDB class downloads historical AIS vessel-traffic
data from Marine Cadastre and stores it efficiently in a local DuckDB database
with Parquet files for fast spatial and temporal queries.
- class ecosound.environment.ais_downloader.AISDataDownloaderDuckDB(db_path: str, parquet_dir: str, temp_dir: str | None = None)
Bases: object
Downloads and processes AIS data from Marine Cadastre with DuckDB + Parquet storage. Optimized for fast temporal and geographic queries.
- BASE_URL = 'https://coast.noaa.gov/htdata/CMSP/AISDataHandler'
- __init__(db_path: str, parquet_dir: str, temp_dir: str | None = None)
Initialize the AIS data downloader with DuckDB.
- Parameters:
db_path – Path to DuckDB database file
parquet_dir – Directory to store Parquet files (partitioned by date)
temp_dir – Temporary directory for downloads (default: system temp)
- create_ais_view()
Create a view that unions all Parquet files for easy querying. This view enables querying all AIS data as a single table.
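One way such a view could be defined is with a `read_parquet` glob over the Parquet directory. The sketch below only builds the SQL statement. The view name `ais`, the recursive `**/*.parquet` layout, and the helper `build_view_sql` are assumptions made for illustration, not the class's actual implementation.

```python
from pathlib import Path


def build_view_sql(parquet_dir: str, view_name: str = "ais") -> str:
    """Return SQL defining a view over every Parquet file under parquet_dir.

    Hypothetical helper: the view name and directory layout are assumptions.
    """
    # as_posix() keeps forward slashes so the glob is the same on every OS.
    pattern = (Path(parquet_dir) / "**" / "*.parquet").as_posix()
    # Double any single quotes so the path is a valid SQL string literal.
    pattern = pattern.replace("'", "''")
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS "
        f"SELECT * FROM read_parquet('{pattern}')"
    )


sql = build_view_sql("data/ais_parquet")
print(sql)
```

With the `duckdb` package installed, the statement could be run with `duckdb.connect(db_path).execute(sql)`, after which the view can be queried like a single table.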
- generate_date_urls(start_date: str, end_date: str) → List[Tuple[str, str]]
Generate download URLs for a date range.
- Parameters:
start_date – Start date in YYYY-MM-DD format
end_date – End date in YYYY-MM-DD format
- Returns:
List of (url, date_string) tuples
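A minimal sketch of how one URL per day might be generated. The path layout `<BASE_URL>/<year>/AIS_<YYYY>_<MM>_<DD>.zip`, the inclusive end date, and the function name `daily_urls` are assumptions; only `BASE_URL` comes from the class.

```python
from datetime import date, timedelta
from typing import List, Tuple

BASE_URL = "https://coast.noaa.gov/htdata/CMSP/AISDataHandler"


def daily_urls(start_date: str, end_date: str) -> List[Tuple[str, str]]:
    """Return (url, date_string) pairs for each day in the range, inclusive."""
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    urls = []
    day = start
    while day <= end:
        # Assumed file naming: one ZIP per day, grouped in a per-year folder.
        name = f"AIS_{day:%Y_%m_%d}.zip"
        urls.append((f"{BASE_URL}/{day.year}/{name}", day.isoformat()))
        day += timedelta(days=1)
    return urls


pairs = daily_urls("2023-12-31", "2024-01-01")
for url, day_str in pairs:
    print(day_str, url)
```

Stepping with `timedelta(days=1)` handles month and year boundaries without any special cases.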
- is_date_in_database(date_str: str) → bool
Check if a date has already been processed and is in the database.
- Parameters:
date_str – Date string in YYYY-MM-DD format
- Returns:
True if date exists in database, False otherwise
- async download_file(session: ClientSession, url: str, date_str: str, force_download: bool = False) → Path | None
Download a single AIS data file.
- Parameters:
session – aiohttp client session
url – Download URL
date_str – Date string for filename
force_download – If True, download even if file already exists
- Returns:
Path to the downloaded file, or None if the download failed
- async download_files(start_date: str, end_date: str, max_concurrent: int = 5, force_download: bool = False) → List[Path]
Download multiple AIS data files concurrently.
- Parameters:
start_date – Start date in YYYY-MM-DD format
end_date – End date in YYYY-MM-DD format
max_concurrent – Maximum concurrent downloads
force_download – If True, download files even if already in database
- Returns:
List of downloaded file paths
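The `max_concurrent` limit can be illustrated with an `asyncio.Semaphore`. This is a sketch of the general pattern, not the class's code: `fetch_one` and `fetch_all` are hypothetical names, and a short sleep stands in for the real aiohttp request.

```python
import asyncio
from typing import List, Optional, Tuple


async def fetch_one(name: str, sem: asyncio.Semaphore, stats: dict) -> Optional[str]:
    """Stand-in for a real download; sleeps instead of doing network I/O."""
    async with sem:  # at most max_concurrent tasks hold the semaphore at once
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        try:
            await asyncio.sleep(0.01)  # placeholder for the HTTP request
            return name
        finally:
            stats["active"] -= 1


async def fetch_all(names: List[str], max_concurrent: int = 5) -> Tuple[List[str], int]:
    sem = asyncio.Semaphore(max_concurrent)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(*(fetch_one(n, sem, stats) for n in names))
    # Drop failed downloads (None), mirroring a "Path | None" return value.
    return [r for r in results if r is not None], stats["peak"]


files, peak = asyncio.run(fetch_all([f"day{i}.zip" for i in range(12)], max_concurrent=5))
print(len(files), "files fetched; peak concurrency:", peak)
```

All twelve tasks are created up front, but the semaphore keeps the number of in-flight requests at or below the limit.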
- extract_and_process_file(zip_path: Path, min_lat: float | None = None, max_lat: float | None = None, min_lon: float | None = None, max_lon: float | None = None, force_process: bool = False) → int
Extract ZIP file and process CSV data into Parquet with geographic filtering.
- Parameters:
zip_path – Path to ZIP file
min_lat – Minimum latitude of the bounding box (optional)
max_lat – Maximum latitude of the bounding box (optional)
min_lon – Minimum longitude of the bounding box (optional)
max_lon – Maximum longitude of the bounding box (optional)
force_process – If True, reprocess even if already in database
- Returns:
Number of records inserted
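The geographic filter can be sketched with the standard library alone. The example below reads CSV members straight out of a ZIP archive and counts rows inside the bounding box. It omits the Parquet write, and the column names `LAT` and `LON` as well as the helper names are assumptions about the input files.

```python
import csv
import io
import zipfile
from typing import Optional


def in_range(value: float, low: Optional[float], high: Optional[float]) -> bool:
    """True if value lies within [low, high]; a None bound means unbounded."""
    return (low is None or value >= low) and (high is None or value <= high)


def count_rows_in_bbox(zip_file, min_lat=None, max_lat=None,
                       min_lon=None, max_lon=None) -> int:
    """Count CSV rows in the archive whose position falls inside the box."""
    kept = 0
    with zipfile.ZipFile(zip_file) as zf:
        for member in zf.namelist():
            if not member.lower().endswith(".csv"):
                continue
            with zf.open(member) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                for row in csv.DictReader(text):
                    try:
                        # Assumed column names for latitude and longitude.
                        lat, lon = float(row["LAT"]), float(row["LON"])
                    except (KeyError, TypeError, ValueError):
                        continue  # skip rows with missing or malformed positions
                    if in_range(lat, min_lat, max_lat) and in_range(lon, min_lon, max_lon):
                        kept += 1
    return kept


# Build a tiny in-memory archive to exercise the filter.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("AIS_demo.csv", "MMSI,LAT,LON\n1,48.5,-123.4\n2,30.0,-80.0\n3,bad,-123.0\n")
buf.seek(0)
n = count_rows_in_bbox(buf, min_lat=48.0, max_lat=49.0, min_lon=-124.0, max_lon=-123.0)
print(n)
```

Treating a `None` bound as unbounded matches the optional parameters: leaving all four unset keeps every row with a valid position.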
- process_all_files(downloaded_files: List[Path], min_lat: float | None = None, max_lat: float | None = None, min_lon: float | None = None, max_lon: float | None = None, max_workers: int = 4, force_process: bool = False)
Process all downloaded files in parallel using ThreadPoolExecutor.
- Parameters:
downloaded_files – List of downloaded ZIP file paths
min_lat – Minimum latitude of the bounding box
max_lat – Maximum latitude of the bounding box
min_lon – Minimum longitude of the bounding box
max_lon – Maximum longitude of the bounding box
max_workers – Maximum number of parallel workers (default: 4)
force_process – If True, reprocess even if already in database
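The parallel step can be sketched with `concurrent.futures.ThreadPoolExecutor`. The helper `process_in_parallel` and the stand-in `fake_worker` are hypothetical; in practice the worker would be something like `extract_and_process_file`. Recording a failed file as zero records is also an assumption about error handling.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Dict, List


def process_in_parallel(files: List[Path], worker: Callable[[Path], int],
                        max_workers: int = 4) -> Dict[Path, int]:
    """Run worker on each file in a thread pool and collect record counts."""
    results: Dict[Path, int] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, f): f for f in files}
        for fut in as_completed(futures):
            path = futures[fut]
            try:
                results[path] = fut.result()
            except Exception as exc:
                # One bad archive should not abort the whole batch.
                print(f"failed: {path}: {exc}")
                results[path] = 0
    return results


def fake_worker(path: Path) -> int:
    """Stand-in for per-file processing; returns a fake record count."""
    if "bad" in path.name:
        raise ValueError("corrupt archive")
    return len(path.name)


counts = process_in_parallel(
    [Path("a.zip"), Path("bad.zip"), Path("ccc.zip")], fake_worker, max_workers=2
)
print(sum(counts.values()), "records processed")
```

Mapping each future back to its path makes it possible to report which file failed even though results arrive in completion order.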