One of the main differences between a data scientist and a data engineer has to do with ETL versus DAD:
- ETL (Extract/Transform/Load) is for data engineers, or sometimes data architects or database administrators (DBAs).
- DAD (Discover/Access/Distill) is for data scientists.
Data engineers tend to focus on software engineering, database design, production code, and making sure data flows smoothly between the source (where it is collected) and the destination (where it is extracted and processed, and from which the statistical summaries and output produced by data science algorithms are eventually moved back to the source or elsewhere). Data scientists need to understand this data flow and how it is optimized, especially when working with Hadoop, but they do not optimize the data flow itself. They optimize the data processing step: extracting value from data. They do, however, work with engineers and business people to define the metrics, design data collection schemes, and make sure data science processes integrate efficiently with the enterprise data systems (storage, data flow). This is especially true for data scientists working in small companies, and it is one reason why data scientists should be able to write code (increasingly in Python) that engineers can reuse.
ETL = Extract, Transform and Load
ETL stands for Extract, Transform and Load. The ETL process typically extracts data from the source (transactional) systems, transforms it to fit the data warehouse's model, and finally loads it into the data warehouse. The transformation step involves cleansing and enriching the data and applying whatever other transformations are needed to produce the desired output. Data is usually dumped into a staging area after extraction. In some cases, the transformations are applied on the fly and the data is loaded into the target system without an intermediate staging area. The diagram below illustrates a typical ETL process.
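In code, the extract, transform, and load steps described above can be sketched roughly as follows. This is a minimal sketch, not a production pipeline: the inline CSV text stands in for a transactional source system, an in-memory SQLite database stands in for the data warehouse, and the names `SOURCE_CSV`, `extract`, `transform`, `load`, and the `orders` table are all assumptions made up for this illustration.

```python
import csv
import io
import sqlite3

# Hypothetical export from a transactional source system (assumption).
SOURCE_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,20.00
3,carol,
"""


def extract(text):
    """Extract: read raw rows from the source into a staging list."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(staged_rows):
    """Transform: cleanse names, drop incomplete rows, cast types."""
    cleaned = []
    for row in staged_rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # cleansing: skip records with no amount
        cleaned.append(
            (int(row["order_id"]), row["customer"].strip().title(), float(amount))
        )
    return cleaned


def load(rows, conn):
    """Load: write the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


# In-memory SQLite stands in for the data warehouse (assumption).
warehouse = sqlite3.connect(":memory:")
staged = extract(SOURCE_CSV)          # staging area: a plain Python list
load(transform(staged), warehouse)    # transform before loading

loaded = warehouse.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

Note that the transformation happens before anything reaches the warehouse, so only cleaned rows are ever stored there.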
For big data scenarios, the ELT (Extract, Load, Transform) process lets you create copies of the source data and move them into Hadoop before transforming them. This is less resource-intensive than ETL, where transporting and transforming the data can be cumbersome. In ELT, because the data is already in Hadoop and the transformation takes advantage of large-scale parallel processing, there is less stress on source systems and the transformation takes less time.
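The ELT ordering can be sketched in the same spirit: the raw data is copied into the target system untouched, and the transformation then runs inside the target. This is only an illustration under loud assumptions: an in-memory SQLite database stands in for Hadoop (so there is no parallel processing here), and `RAW_ROWS`, `load_raw`, `transform_in_target`, and the table names are hypothetical.

```python
import sqlite3

# Hypothetical raw records copied from the source as-is (assumption).
RAW_ROWS = [
    ("1", " alice ", "10.50"),
    ("2", "BOB", "20.00"),
    ("3", "carol", ""),
]


def load_raw(conn, rows):
    """Load: copy the source data into the target without changing it."""
    conn.execute(
        "CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)


def transform_in_target(conn):
    """Transform: cleanse and cast using the target system's own engine."""
    conn.execute(
        """
        CREATE TABLE orders AS
        SELECT CAST(order_id AS INTEGER) AS order_id,
               LOWER(TRIM(customer))     AS customer,
               CAST(amount AS REAL)      AS amount
        FROM raw_orders
        WHERE TRIM(amount) <> ''
        """
    )


# In-memory SQLite stands in for the big data platform (assumption).
target = sqlite3.connect(":memory:")
load_raw(target, RAW_ROWS)        # load first ...
transform_in_target(target)       # ... transform afterwards, in place

elt_rows = target.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
raw_count = target.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(elt_rows, raw_count)
```

The design point is that the raw copy stays available in the target after the transformation, and the source system is only touched once, during the initial copy.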
Here are some of the top ETL vendors and their key products:
Big Data Dashboards
The difference between “big data” and useful data is having the right tools to analyze data. In an era when almost every department is flooded with information about clients, prospects, processes, and operations, effective data analysis can easily become a source of competitive advantage.
Business intelligence (BI) software aids this process by pulling data from your various client-side and back-end systems and providing visualization and analysis tools. By transforming your raw data into intelligible reports, dashboards, and illustrations, you can gain quicker insights, make better decisions about your business, and move toward positive revenue goals faster.
Here are some of the leading big data dashboard products on the market today:
- IBM Cognos
- SAP BusinessObjects
- Oracle BI
- Microsoft Business Intelligence