You're overwhelmed with data sources for ETL processes. Which ones should you prioritize for faster speed?
Juggling multiple data sources for ETL (Extract, Transform, Load) can be overwhelming, but prioritizing the right ones can streamline your process.
When faced with numerous data sources for ETL processes, it's crucial to focus on those that boost efficiency and speed. Here's how you can effectively prioritize:
What strategies have you found effective in managing ETL data sources?
-
Introduce a microservices architecture to create reusable, scalable ETL frameworks. This modular approach allows independent development and maintenance of ETL components, which in turn enhances flexibility and efficiency in data processing. Transition from a de-normalized model to a star schema to simplify the data structure. Implement automated data quality governance frameworks to enable real-time monitoring of data quality; automated profiling and cleansing routines should be an integral part of these frameworks. Techniques like parallel processing, data partitioning, and schema optimization can be employed to boost ETL operations. Incorporate indexing, SQL tuning, and query optimization.
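As a rough illustration of the parallel-processing and partitioning idea above, here is a minimal Python sketch that extracts date partitions of one table concurrently. The `extract_partition` function and the partition labels are hypothetical stand-ins for a real source-system query:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_partition(table, date_range):
    """Placeholder extract: in practice this would query the source system,
    e.g. SELECT * FROM {table} WHERE load_date BETWEEN ..."""
    return [f"{table}:{date_range}"]

def parallel_extract(table, partitions, max_workers=4):
    """Extract date-partitioned slices of one table concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda p: extract_partition(table, p), partitions)
    # Flatten the per-partition result lists into one batch
    return [row for chunk in results for row in chunk]

rows = parallel_extract("sales", ["2024-01", "2024-02", "2024-03"])
```

Because each partition is independent, the same pattern extends to process pools or distributed workers without changing the pipeline's shape.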
-
If you need data prepared in advance, like reports and aggregations, without any time delays, opt for ETL. However, if your focus is on data analysis and running ad-hoc queries, ELT is the better choice.
-
Rodrigo Oliveira
Head of Data Science | Machine Learning | Big Data | Agro | Artificial Intelligence
To prioritize ETL processes, I first identify which data sources offer the most valuable information for business goals. This ensures alignment with strategic objectives and maximizes impact. Next, I prioritize based on business rule complexity and data volume. Even small data sources with intricate rules can significantly affect ETL performance. I've found success using ELT (Extract, Load, Transform) instead of ETL. This approach enables building multiple pipelines and intermediate datasets, allowing parallel and interconnected execution for data transformations. It also enhances maintenance, encapsulates business rules effectively, and improves data quality monitoring and governance.
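The ELT approach described above — loading raw data first, then building intermediate datasets that encapsulate business rules — can be sketched as follows. The dataset names and the "drop non-positive amounts" rule are illustrative assumptions, not the author's actual pipeline:

```python
# Hypothetical ELT flow: raw data is loaded as-is, then transformed
# in independent, composable steps over intermediate datasets.
raw = {"orders": [{"id": 1, "amount": 120.0}, {"id": 2, "amount": 80.0}]}

def stage_orders(raw):
    """Intermediate dataset: cleaned orders (business rule: drop non-positive amounts)."""
    return [o for o in raw["orders"] if o["amount"] > 0]

def daily_revenue(staged):
    """Final dataset built from the intermediate one."""
    return sum(o["amount"] for o in staged)

staged = stage_orders(raw)      # reusable by other downstream pipelines
revenue = daily_revenue(staged)
```

Keeping each rule in its own step is what makes the parallel, interconnected execution and the per-step quality monitoring mentioned above practical.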
-
Switch to an ELT process instead of ETL: storage is cheap compared with the time spent cleaning data before loading. Once the data is stored safely, prioritize transforming the sources behind business-critical applications first, with clean data.
-
In general, the following data sets are faster for ETL:
1. 1-to-1 mapping data sets (without transformation).
2. Smaller data sets.
3. Data sets that support delta capability, due to lower volume.
4. Good-quality data (since corrections are not required during ETL).
5. In many cases, native connectors are faster than generic connectors.
ETL priority among data sources depends on:
1. Business-critical first, then non-critical.
2. Operational (daily) first, then strategic (daily, weekly, monthly).
3. Single-source data sets first, then multi-source (federated) data sets.
Priority within a source system:
a. Master data
b. Transaction data (with no dependency)
c. Transaction data (with dependency on a and/or b)
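The delta-capability point above — extracting only rows changed since the last run — can be sketched with a simple watermark, shown here with hypothetical field names (`updated_at`) and string timestamps for brevity:

```python
def delta_extract(rows, last_watermark):
    """Pull only rows changed since the previous run (delta extraction)."""
    delta = [r for r in rows if r["updated_at"] > last_watermark]
    # Advance the watermark so the next run skips everything seen so far
    new_watermark = max((r["updated_at"] for r in delta), default=last_watermark)
    return delta, new_watermark

source = [
    {"id": 1, "updated_at": "2024-05-01"},
    {"id": 2, "updated_at": "2024-05-03"},
    {"id": 3, "updated_at": "2024-05-05"},
]
delta, wm = delta_extract(source, "2024-05-02")
# delta holds rows 2 and 3; wm advances to "2024-05-05"
```

In a real pipeline the watermark would be persisted between runs and the filter pushed down to the source query so only the delta ever crosses the wire.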
-
In my experience with ETL tools, I have extensively worked with Informatica, which I regard as one of the best solutions in the field. When dealing with an overwhelming number of data sources, the priority should be on those that are critical for business insights, have the highest frequency of updates, and require minimal transformation to maintain speed. Informatica's robust features, such as efficient data mapping and seamless integration capabilities, significantly contribute to optimizing these processes, ensuring both speed and reliability in data handling.
-
Overwhelmed with ETL data sources? Prioritize for speed by focusing on key factors. Analyze data: prioritize high volume, velocity, and structured formats. Consider source system performance and data quality. From a business perspective, focus on valuable, time-sensitive, or regulated data. Technically, optimize extraction methods, ensure strong connectivity, and consider data virtualization or Change Data Capture (CDC). Employ the 80/20 rule and data profiling to pinpoint bottlenecks. This combined approach of data, business, and technical considerations will accelerate your ETL processes.
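The 80/20 rule mentioned above can be applied directly to per-source runtimes: profile each extraction, then focus tuning on the few sources that dominate total time. A minimal sketch, with made-up source names and timings:

```python
def top_bottlenecks(timings, share=0.8):
    """Return the smallest set of sources accounting for `share` of total runtime."""
    total = sum(timings.values())
    picked, acc = [], 0.0
    for name, secs in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
        picked.append(name)
        acc += secs
        if acc >= share * total:
            break
    return picked

# Hypothetical per-source extraction times in seconds
timings = {"crm": 120, "erp": 900, "weblogs": 2400, "hr": 60}
print(top_bottlenecks(timings))  # ['weblogs', 'erp'] — ~95% of total runtime
```

Optimizing just those two sources (say, via CDC or partitioned extraction) moves the needle far more than tuning the long tail.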
-
Decide which data sources are most critical, then define which can be refreshed at a lower frequency (slowly changing dimensions, for example) and which data you need on a daily, weekly, or monthly basis. Stop integrating data "for future usage" and focus on genuinely necessary data. Among those identified, prioritize the sources that are already well structured and need little investment in complex transformation.
-
Prioritizing ETL data sources can be overwhelming, especially when there are more than 100. There are several ways to optimize, and their relevance depends on context. 1. I would not touch any source before understanding my business data flows. Relevance of specific data elements is key; any elements that are not used often can typically wait. 2. Split the source files into key/critical, useful, and maybe useful, and partition vertically. You may load into a landing zone to do this. 3. Evaluate for sparsity: any field that is very sparse may have limited general value. Also evaluate reference-data mapping; if the hierarchies do not level with your target, that matters before you load. 4. Size, frequency, and conformance all matter.
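The sparsity evaluation in point 3 above can be done with a quick profile over a sample of rows. A minimal sketch, where the column names and sample data are hypothetical:

```python
def column_sparsity(rows):
    """Fraction of missing (None) values per column across a sample of rows."""
    cols = {k for r in rows for k in r}
    n = len(rows)
    return {c: sum(1 for r in rows if r.get(c) is None) / n for c in cols}

sample = [
    {"id": 1, "fax": None, "email": "a@x.com"},
    {"id": 2, "fax": None, "email": None},
    {"id": 3, "fax": None, "email": "c@x.com"},
]
sparsity = column_sparsity(sample)
# "fax" is 100% sparse and likely low value; "email" is about 33% sparse
```

Columns above a chosen sparsity threshold are candidates to defer or drop, shrinking both the load and the downstream transformation work.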
-
Do ETL processes feel like juggling bowling balls? Here's how to take control of the chaos: Imagine you're a retailer drowning in data sources - point-of-sale systems, online transactions, social media analytics. Instead of tackling everything at once, you focus on the biggest player: your online data, which makes up 70% of your total volume. A few smart optimizations later, you've halved the processing time. Next, you tackle the messy customer feedback data, streamlining the cleaning process so it’s as easy as pitting cherries. The result? Faster insights, less stress, and finally time for your second coffee. So, start with the heavy hitters - clean, important, and easy to integrate. What’s your go-to ETL strategy for cutting the chaos?