`Hyperscaler-Features` : A Feature Engineering Framework for e-Commerce Impressions Dataset

This repository contains a feature engineering framework that closely captures the scale, data types, and query workloads of feature engineering pipelines at various large-scale companies.

We narrow the scope to the e-Commerce scenario in TPC-DS and introduce User Browser History, so that the data generator can produce the interactions between features and user behavior within a user session along a sequential timeline.

License

The SQL table creation and dataset derive from material licensed as follows:
TPC Benchmark™ and TPC-DS are trademarks of the Transaction Processing Performance Council.
All parties are granted permission to copy and distribute to any party without fee all or part of this material provided that:
1. copying and distribution is done for the primary purpose of disseminating TPC material;
2. the TPC copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the Transaction Processing Performance Council.

This feature engineering framework codebase is licensed under Apache License as follows:
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Requirements

The following packages are needed:

Docker 25.0.3
OpenJDK 1.8.0

Important

Set up docker to run without sudo access. Please refer to Docker's Linux post-installation guide.
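The requirements above can be checked before running any of the steps. The following is a minimal sketch, not part of the repository: the function names `have_cmd` and `check_prereqs` are hypothetical, and it only verifies that `docker` and `java` are on the PATH and that the Docker daemon is reachable by the current user. It does not check the versions listed above.

```shell
# Return success if the named command is available in the current shell.
have_cmd() { command -v "$1" >/dev/null 2>&1; }

# Report missing prerequisites; returns non-zero if any check fails.
check_prereqs() {
  rc=0
  for cmd in docker java; do
    if ! have_cmd "$cmd"; then
      echo "missing: $cmd" >&2
      rc=1
    fi
  done
  # `docker info` talks to the Docker daemon, so it fails when the daemon
  # is not running or the current user needs sudo to reach it.
  if have_cmd docker && ! docker info >/dev/null 2>&1; then
    echo "docker is installed but the daemon is not reachable without sudo" >&2
    rc=1
  fi
  return "$rc"
}
```

Calling `check_prereqs` prints one line per failed check and returns a non-zero status if anything is missing.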

The following Python packages are needed:

numpy
sqlalchemy
'pyhive[presto]'
pandas
sql_formatter
minio
pyarrow
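One possible way to install the packages listed above is with pip. This is a sketch under assumptions: it presumes `python3` with pip is available, and both the function name `install_py_packages` and the `PIP_CMD` override are hypothetical conveniences, not something the repository provides.

```shell
# Install the Python packages from the list above into the active environment.
install_py_packages() {
  # 'pyhive[presto]' is quoted so the shell does not treat [] as a glob.
  set -- numpy sqlalchemy 'pyhive[presto]' pandas sql_formatter minio pyarrow
  # PIP_CMD is a hypothetical override; PIP_CMD=echo gives a dry run.
  ${PIP_CMD:-python3 -m pip} install "$@"
}
```

Running `PIP_CMD=echo install_py_packages` prints the arguments that would be passed to pip without installing anything.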

`Step 1` : Run node_setup.sh

./node_setup.sh -i <install_path> [ -p <http_proxy_url:port> ]

<install_path> is the directory where we map volumes for various docker containers. We recommend that this directory exist outside the repo.

Optional: use the -p flag to set a proxy (<URL>:<port>) for the docker containers.

Example:

# If proxy is NOT required
./node_setup.sh -i <install_path>

# When proxy needs to be used
./node_setup.sh -i <install_path> -p http://proxy.example.com:3128

This results in the creation of four docker containers:

PostgreSQL
Hive
MinIO
Coordinator (Presto)

These containers exist within a docker network (prestonet) and can communicate with each other.
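To confirm that the setup produced the expected containers, the network can be queried with the standard docker CLI. The sketch below assumes the network is named `prestonet` as stated above and that four containers should be attached to it. The function names are hypothetical.

```shell
# List the names of running containers attached to a docker network
# (default: prestonet).
prestonet_containers() {
  docker ps --filter "network=${1:-prestonet}" --format '{{.Names}}'
}

# Succeed only if the expected number of containers (default: 4) is running.
check_prestonet() {
  expected="${1:-4}"
  running="$(prestonet_containers | wc -l | tr -d ' ')"
  echo "containers on prestonet: $running (expected $expected)"
  [ "$running" -eq "$expected" ]
}
```

Calling `check_prestonet` after Step 1 returns a non-zero status if any of the four containers is not running.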

`Step 2` : Run prepare.sh

Creates the MinIO bucket corresponding to the scale factor.
Generates SQL queries inside <working_dir>/framework/sfx/. This directory is mapped to the presto docker container and can be accessed from within it.
Generates the TPC-DS dataset and tables.
Generates the click log dataset and queries based on the TPC-DS tables.
Generates the create and drop table SQL query files for the click log dataset and saves them inside <working_dir>/framework/sfx/.

cd framework/
./prepare.sh -s <scale_factor> -l <install_path>

<install_path> is the same path as provided in Step 1.

<scale_factor> accepts integer values. A scale factor of 1 produces a 1 GB dataset. Typical values are 1, 10, 100, ...
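The `sfx` placeholder in the generated paths stands for `sf` followed by the scale factor. As an illustration, the hypothetical helper below (not part of the repository) validates a scale factor and prints the matching query directory relative to the working directory.

```shell
# Print the query directory for a scale factor, e.g. 1 -> framework/sf1.
# Rejects empty input, non-digits, zero, and leading zeros.
sf_query_dir() {
  case "$1" in
    ''|*[!0-9]*|0*)
      echo "scale factor must be a positive integer: '$1'" >&2
      return 1
      ;;
  esac
  echo "framework/sf$1"
}
```

For example, `sf_query_dir 100` prints `framework/sf100`, the directory that holds the SQL files generated by `./prepare.sh -s 100 -l <install_path>`.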

`Step 3` : Create click log table and run other queries

Follow the steps below to create the partitioned click_log_formatted table from the generated log files stored inside the MinIO S3 bucket. Currently, these log files are generated in Parquet (default) or CSV format.

# Log into the presto docker container
docker exec -it coordinator /bin/bash

# Create the partitioned click log table
presto-cli --file /framework/sfx/create_click_log_table.sql

# Example: when scale factor = 1, replace sfx with sf1
presto-cli --file /framework/sf1/create_click_log_table.sql
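The two commands above can also be combined into a single non-interactive call from the host, which may be convenient in scripts. This is a sketch, assuming the container is named `coordinator` and `presto-cli` is on its PATH, as the commands above suggest. The function name `create_click_log_table` is hypothetical.

```shell
# Create the partitioned click log table for a given scale factor by
# running presto-cli inside the coordinator container (no interactive shell).
create_click_log_table() {
  sf="$1"
  docker exec coordinator presto-cli \
    --file "/framework/sf${sf}/create_click_log_table.sql"
}
```

For scale factor 1, `create_click_log_table 1` runs the same SQL file as the example above.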

Note

As an alternative to logging into the container as in Step 3, queries can be run directly from localhost:

cd <install_path>/work
./run_query.sh framework/sfx/<sql_query_file>

# Example: Create the click_log_formatted table for scale factor 1
# ./run_query.sh framework/sf1/create_click_log_table.sql
