Section 1: Designing data processing systems
1.1 Designing flexible data representations. Considerations include:
- future advances in data technology
- changes to business requirements
- awareness of current state and how to migrate the design to a future state
- data modeling
- tradeoffs
- distributed systems
- schema design
1.2 Designing data pipelines. Considerations include:
- future advances in data technology
- changes to business requirements
- awareness of current state and how to migrate the design to a future state
- data modeling
- tradeoffs
- system availability
- distributed systems
- schema design
- common sources of error (e.g., removing selection bias)
1.3 Designing data processing infrastructure. Considerations include:
- future advances in data technology
- changes to business requirements
- awareness of current state and how to migrate the design to a future state
- data modeling
- tradeoffs
- system availability
- distributed systems
- schema design
- capacity planning
- different types of architectures: message brokers, message queues, middleware, service-oriented
Section 2: Building and maintaining data structures and databases
2.1 Building and maintaining flexible data representations
2.2 Building and maintaining pipelines. Considerations include:
- data cleansing
- batch and streaming
- transformation
- acquiring and importing data
- testing and quality control
- connecting to new data sources
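As an illustration of how the cleansing, transformation, and quality-control considerations in 2.2 fit together, here is a minimal batch sketch. It is not tied to any particular product or service. The record fields (`name`, `amount`), the function names, and the rule that amounts must be non-negative are all assumptions made for this example.

```python
from __future__ import annotations

from typing import Iterable, Optional


def cleanse(record: dict) -> Optional[dict]:
    """Drop records with no usable amount; trim whitespace from both fields."""
    raw = record.get("amount")
    if raw is None or str(raw).strip() == "":
        return None  # hypothetical rule: a record without an amount is unusable
    return {
        "name": str(record.get("name", "")).strip(),
        "amount": str(raw).strip(),
    }


def transform(record: dict) -> dict:
    """Convert the amount from text to a float (raises ValueError if not numeric)."""
    return {"name": record["name"], "amount": float(record["amount"])}


def run_batch(records: Iterable[dict]) -> list[dict]:
    """Cleanse every record, discard the rejected ones, then transform the rest."""
    cleaned = (cleanse(r) for r in records)
    return [transform(r) for r in cleaned if r is not None]


def check_quality(rows: list[dict]) -> None:
    """Quality gate: fail loudly if any row breaks the assumed non-negative rule."""
    bad = [r for r in rows if r["amount"] < 0]
    if bad:
        raise ValueError(f"{len(bad)} row(s) have a negative amount")


if __name__ == "__main__":
    rows = run_batch([
        {"name": " a ", "amount": " 3.5 "},
        {"name": "b", "amount": None},
    ])
    check_quality(rows)
    print(rows)
```

Keeping each stage a small pure function makes it straightforward to unit test the stages separately and to reuse them in a streaming setting, where the same functions would be applied per record rather than per batch.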
2.3 Building and maintaining processing infrastructure. Considerations include:
- provisioning resources
- monitoring pipelines
- adjusting pipelines
- testing and quality control
Section 3: Analyzing data and enabling machine learning
3.1 Analyzing data. Considerations include:
- data collection and labeling
- data visualization
- dimensionality reduction
- data cleaning/normalization
- defining success metrics
3.2 Machine learning. Considerations include:
- feature selection/engineering
- algorithm selection
- debugging a model
3.3 Machine learning model deployment. Considerations include:
- performance/cost optimization
- online/dynamic learning
Section 4: Modeling business processes for analysis and optimization
4.1 Mapping business requirements to data representations. Considerations include:
- working with business users
- gathering business requirements
4.2 Optimizing data representations, data infrastructure performance and cost. Considerations include:
- resizing and scaling resources
- data cleansing
- distributed systems
- high-performance algorithms
- common sources of error (e.g., removing selection bias)
Section 5: Ensuring reliability
5.1 Performing quality control. Considerations include:
- verification
- building and running test suites
- pipeline monitoring
5.2 Assessing, troubleshooting, and improving data representations and data processing infrastructure.
5.3 Recovering data. Considerations include:
- planning (e.g., fault tolerance)
- executing (e.g., rerunning failed jobs, performing retrospective re-analysis)
- stress testing data recovery plans and processes
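One way to picture "rerunning failed jobs" from 5.3 is a retry wrapper with exponential backoff. The following is a minimal sketch under stated assumptions: the helper name `run_with_retries` and its parameters are hypothetical, and it assumes the job is safe to run more than once (idempotent).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    job: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Run job(), rerunning it on failure up to max_attempts times in total.

    The wait before each rerun doubles: base_delay, 2 * base_delay, and so on.
    The last failure is re-raised so the caller can alert or escalate.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")  # the loop always returns or raises


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_job() -> str:
        # Hypothetical job that fails twice before succeeding.
        state["calls"] += 1
        if state["calls"] < 3:
            raise RuntimeError("transient failure")
        return "done"

    print(run_with_retries(flaky_job, max_attempts=3, base_delay=0.0))
```

A production system would usually retry only errors known to be transient and would record each failure, which is also what makes retrospective re-analysis possible later.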
Section 6: Visualizing data and advocating policy
6.1 Building (or selecting) data visualization and reporting tools. Considerations include:
- automation
- decision support
- data summarization (e.g., translation up the chain, fidelity, trackability, integrity)
6.2 Advocating policies and publishing data and reports.
Section 7: Designing for security and compliance
7.1 Designing secure data infrastructure and processes. Considerations include:
- Identity and Access Management (IAM)
- data security
- penetration testing
- Separation of Duties (SoD)
- security control
7.2 Designing for legal compliance. Considerations include:
- legislation (e.g., Health Insurance Portability and Accountability Act (HIPAA), Children’s Online Privacy Protection Act (COPPA))
- audits