Exam 70-458 – Objective 5 – Extract and transform data

This entry is part 5 of 8 in the series Exam 70-458

 

Design data flow

  • Define data sources and destinations (http://technet.microsoft.com/en-us/library/ms140080%28v=sql.110%29.aspx)
  • Distinguish blocking and non-blocking transformations
  • Use different methods to pull out changed data from data sources
    • Change Data Capture

      • Run sys.sp_cdc_enable_db on database then for each table run sys.sp_cdc_enable_table
      • There are CDC transforms in SSIS
    • Change Tracking
      • Only keeps track that record has changed, no details
      • ALTER DATABASE <db_name> SET CHANGE_TRACKING = ON
      • Alter table <tablename> ENABLE CHANGE_TRACKING
      • Uses extraction log to keep versions 
  • Determine appropriate data flow components (http://msdn.microsoft.com/en-us/library/ms137612(v=sql.105).aspx)
  • Determine the need for supporting Slowly Changing Dimensions (SCD) (http://msdn.microsoft.com/en-us/library/ms141715.aspx)
  • Determine whether to use SQL Joins or SSIS lookup or merge join transformations
    • Merge Tranforms

      • Merge (similar to UNION)
        • Input must be sorted in same order
        • Sort in data source when possible
        • Mark source as sorted in properties
        • Input column metadata must be same
      • Merge join
        • Supports inner, left, and full
        • Requires both inputs to sorted in the same order
        • Join columns must have same metadata
    • Lookup Transforms
      • Can use OLEDB connection or cache connection
      • Can treat unmatched rows as errors or regular output
      • Can use caching for rows matched or to pre-load to reduce trips to data source
    • Data Modification Transforms
      • Derived column
      • Data conversion
      • Character map 
  • Batch processing vs. row by row processing
  • Determine the appropriate transform to use for a specific task
    • Slowly Changing Dimension Transformation (http://msdn.microsoft.com/en-us/ms141715)
      • Changing attribute – overwrite existing records
      • Historical attribute – create new records for changes
      • Fixed attribute – column cannot change
      • Inferred member – reference dimension value not loaded yet
  • Determine the need and method for identity mapping and deduplicating
  • Fuzzy lookup, fuzzy grouping, and Data Quality Services (DQS) transformation
  • Determine the need for text mining
  • Determine the need for custom data sources, destinations, and transforms
  • Determine what to do with erroneous rows (http://msdn.microsoft.com/en-us/library/ms140083.aspx)
    • Every transform can output an error
    • Types of errors include data truncation, no matches, and type conversions
    • Each type of error can be handled with different responses
    • OnError Event 
  • Determine auditing needs
  • Determine sampling needs for data mining
  • Trusted/authoritative data sources, including warehouse metadata

 

Implement data flow

  • Debug data flow (http://msdn.microsoft.com/en-us/library/ms137944.aspx)
    • Data Viewers – display data between two components
    • Row Counts
    • Progress Reporting – Color coded for success, errors, running, or waiting
  • Use the appropriate data flow component
  • SQL/SSIS data transformation
    • Split
      • Mulit-cast – allows you to send the same data to multiple places
      • Conditional – allows split data to different destinations
  • Create SSIS packages that support slowly changing dimensions (http://msdn.microsoft.com/en-us/library/ms141662.aspx)
  • Use the Lookup task in SSIS
  • Map identities using SSIS Fuzzy Lookup
  • Specify a data source and destination
  • Use data flows
  • Different categories of transformations (http://msdn.microsoft.com/en-us/library/ms141713(v=sql.110).aspx)
    • Business Intelligence
    • Row
    • Rowset
    • Split and Join
    • Auditing
    • Custom
  • Read, transform, and load data
  • Understand which transforms to use to accomplish a specific business task
  • Data correction transformation
  • Performance tune an SSIS dataflow
  • Optimize Integration Services packages for speed of execution
  • Maintain data integrity, including good data flow

 

Implement script tasks in SSIS

 

Preparation resources

Series Navigation<< Exam 70-458 – Objective 4 – Design and implement a data warehouseExam 70-458 – Objective 6 – Load data >>
Tracy Boggiano
Follow me

Tracy Boggiano

Database Administrator at ChannelAdvisor
Tracy has spent over 20 years in IT and has been using SQL Server since 1999 and is currently certified as a MCSE Data Platform. She has worked on SQL Server 6.5 and up including currently SQL 2017 CTP 2.0. She enjoys monitoring, performance tuning, and disaster recovery technologies.

She also tinkered with databases in middle school to keep her baseball card collection organized.

Her passion outside of SQL Server is volunteering with foster children as their advocate in court through http://www.casaforchildren.org.
Tracy Boggiano
Follow me