A simple data collection and cleaning pipeline built on Common Crawl and datatrove, designed to collect text from a designated list of domains. The pipeline:
- Creates a CDX index of the URLs to fetch from each domain
- Builds a database index of the CDX data to track fetch progress
- Fetches chunks from Common Crawl in parallel, using the offset and length information in the CDX records (see the sketch after this list)
- Extracts text and applies heuristic quality filters
- Runs deduplication to remove near-duplicate documents
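
The chunked download works because each capture in Common Crawl is stored as an independently gzipped WARC record, and its CDX entry carries the containing WARC filename plus the byte offset and length of that record. A minimal sketch of fetching one capture with an HTTP Range request (the record values below are made up):

```python
import gzip

import requests

# Hypothetical CDX record, as emitted by the JSON (-j) output of the index
# client; the field names match Common Crawl's CDX API, the values are fake.
record = {
    "filename": "crawl-data/CC-MAIN-2023-50/segments/0000/warc/0000.warc.gz",
    "offset": "1234567",
    "length": "8910",
}

# Ask only for the bytes of this capture.
start = int(record["offset"])
end = start + int(record["length"]) - 1
resp = requests.get(
    f"https://data.commoncrawl.org/{record['filename']}",
    headers={"Range": f"bytes={start}-{end}"},
    timeout=60,
)
resp.raise_for_status()

# The slice is a complete gzip member, so it decompresses on its own.
print(gzip.decompress(resp.content)[:200])
```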
Edit and use the provided bash shell script:
build_cdx_index.sh
Alternatively, run the following command directly (edit before using):
python src/cccd/cdx-index-client.py -c all https://www.aa.com.tr/tr/* -p 1 --max-retries 10 -j -z -d cdx_root/aa
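
The client script pages through Common Crawl's public CDX API. For orientation, a sketch of the equivalent raw query against a single crawl's index (the crawl ID and page number are only examples; the script above adds paging, retries, and the all-collections mode):

```python
import requests

# Query one crawl's CDX endpoint directly; -c all in the client loops
# over every available crawl instead of just this one.
resp = requests.get(
    "https://index.commoncrawl.org/CC-MAIN-2023-50-index",
    params={"url": "www.aa.com.tr/tr/*", "output": "json", "page": 0},
    timeout=60,
)
resp.raise_for_status()
for line in resp.text.splitlines()[:5]:
    print(line)  # one JSON object per capture: url, filename, offset, length, ...
```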
Then build the database index over the downloaded CDX files:
python -m src.cccd.build_db_index --pattern "cdx_root/aa/prefix-aa.com.tr-CC-MAIN-*" --root work/aa
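
The database index is what the downloader walks: one row per capture with its WARC location and a fetch status. The actual schema built by src.cccd.build_db_index is not shown in this README; purely as an illustration, a fetch-tracking table could look like this:

```python
import sqlite3

# Hypothetical fetch-tracking schema; the real index that
# src.cccd.build_db_index writes under work/aa may differ.
con = sqlite3.connect("work/aa/index.db")
con.execute(
    """
    CREATE TABLE IF NOT EXISTS captures (
        url           TEXT,
        warc_filename TEXT,              -- WARC path inside crawl-data/
        warc_offset   INTEGER,           -- byte offset of the capture
        warc_length   INTEGER,           -- compressed length of the capture
        fetched       INTEGER DEFAULT 0  -- 0 = pending, 1 = downloaded
    )
    """
)
con.commit()
```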
Use the CDX index to download the file chunks from Common Crawl:
python -m src.cccd.chunk_downloader --root work/aa
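
A sketch of the parallel range-fetch loop this step performs, reusing the hypothetical schema above; src.cccd.chunk_downloader is the authoritative implementation:

```python
import os
import sqlite3
from concurrent.futures import ThreadPoolExecutor

import requests

os.makedirs("work/aa/fetched", exist_ok=True)
con = sqlite3.connect("work/aa/index.db")
jobs = con.execute(
    "SELECT rowid, warc_filename, warc_offset, warc_length "
    "FROM captures WHERE fetched = 0"
).fetchall()

def fetch(job):
    rowid, filename, offset, length = job
    resp = requests.get(
        f"https://data.commoncrawl.org/{filename}",
        headers={"Range": f"bytes={offset}-{offset + length - 1}"},
        timeout=60,
    )
    resp.raise_for_status()
    # Each chunk is a self-contained gzipped WARC record.
    with open(f"work/aa/fetched/{rowid}.warc.gz", "wb") as f:
        f.write(resp.content)
    return rowid

with ThreadPoolExecutor(max_workers=8) as pool:
    for rowid in pool.map(fetch, jobs):
        con.execute("UPDATE captures SET fetched = 1 WHERE rowid = ?", (rowid,))
con.commit()
```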
This step requires datatrove to be installed. Edit the Python file before running:
python datatrove/process_common_crawl_fetched_files.py
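
This script wires the fetched WARC chunks through a datatrove pipeline. A condensed sketch of what it can contain, with assumed folder names (work/aa/fetched, work/aa/filtered) and a typical extractor/filter choice; the provided script is the authoritative version:

```python
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import GopherQualityFilter, GopherRepetitionFilter
from datatrove.pipeline.readers import WarcReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

executor = LocalPipelineExecutor(
    pipeline=[
        WarcReader("work/aa/fetched"),    # downloaded WARC chunks
        Trafilatura(),                    # HTML -> plain text
        GopherRepetitionFilter(),         # heuristic quality filters
        GopherQualityFilter(),
        JsonlWriter("work/aa/filtered"),  # one JSONL shard per task
    ],
    tasks=4,
)
executor.run()
```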
This step also requires datatrove to be installed. Edit the Python file before running:
python datatrove/local_minhash_deduplication.py
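
datatrove's MinHash deduplication runs in four stages: per-document signatures, collision detection within buckets, clustering of duplicates, and filtering. A condensed sketch with assumed folder names; the provided local_minhash_deduplication.py is the authoritative version:

```python
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.dedup import MinhashDedupSignature
from datatrove.pipeline.dedup.minhash import (
    MinhashConfig,
    MinhashDedupBuckets,
    MinhashDedupCluster,
    MinhashDedupFilter,
)
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

config = MinhashConfig()  # defaults: 14 buckets x 8 hashes per bucket
mh = "work/aa/minhash"    # assumed working folder

# Stage 1: one MinHash signature per document.
stage1 = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("work/aa/filtered"),
        MinhashDedupSignature(output_folder=f"{mh}/signatures", config=config),
    ],
    tasks=4,
)
# Stage 2: find documents whose signatures collide within a bucket.
stage2 = LocalPipelineExecutor(
    pipeline=[
        MinhashDedupBuckets(
            input_folder=f"{mh}/signatures",
            output_folder=f"{mh}/buckets",
            config=config,
        ),
    ],
    tasks=config.num_buckets,  # one task per bucket
)
# Stage 3: union collisions into clusters; all but one document per
# cluster are marked for removal (needs a global view, so one task).
stage3 = LocalPipelineExecutor(
    pipeline=[
        MinhashDedupCluster(
            input_folder=f"{mh}/buckets",
            output_folder=f"{mh}/remove_ids",
            config=config,
        ),
    ],
    tasks=1,
)
# Stage 4: drop the flagged near-duplicates and write the final corpus.
stage4 = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("work/aa/filtered"),
        MinhashDedupFilter(input_folder=f"{mh}/remove_ids"),
        JsonlWriter("work/aa/deduped"),
    ],
    tasks=4,
)

for stage in (stage1, stage2, stage3, stage4):
    stage.run()
```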