AWS Solutions Architect certification
The AWS Certified Solutions Architect certification, offered by Amazon Web Services (AWS), is designed to validate an
individual's expertise in designing resilient, high-performing, secure, and cost-efficient architectures on the AWS platform and its services.
According to 2023 data, AWS holds over 30% of the global cloud computing market. Obtaining the AWS Solutions Architect certification not
only enhances your cloud knowledge but also equips you to design better solutions that meet SLAs and lower TCO.
How do I prepare
My experience in cloud computing spans several years of working closely with customers on both on-premises and hybrid-cloud environments. This has provided me
with a solid foundational knowledge of storage, networking, compute instances, VMs, and more. However, the vast array of over 200 AWS services can initially seem
overwhelming. The "Ultimate AWS Certified Solution Architect Associate SAA-C03" course on Udemy, taught by Stephane Maarek, was instrumental in helping me navigate
these services. Stephane's clear and structured explanations of AWS services make them easily understandable. The course's hands-on labs encourage practical engagement,
ensuring concepts are not only learned but fully grasped.
After grasping the technical concepts of each service, it's beneficial to develop a holistic architectural perspective, linking services together to meet
specific requirements, for instance, integrating Kinesis Data Streams and Kinesis Data Firehose to ingest real-time data into an S3 bucket for analysis. Akin
to connecting LEGO pieces, each service interlocks to build a comprehensive solution. Resources like the AWS blog
and the AWS Architecture Center offer valuable material to reference and help establish that holistic view.
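To make the LEGO analogy concrete, below is a minimal boto3 sketch of that pattern: producers write records into a Kinesis Data Stream, and a Firehose delivery stream with the data stream as its source lands the records in S3. The stream name, bucket, account ID, and IAM role ARNs are placeholders for illustration, and error handling is omitted.

```python
import json
import boto3

# Hypothetical names/ARNs used only for illustration.
STREAM_NAME = "clickstream-demo"
REGION = "us-east-1"

kinesis = boto3.client("kinesis", region_name=REGION)
firehose = boto3.client("firehose", region_name=REGION)

# 1) Producers write real-time records into a Kinesis Data Stream.
kinesis.put_record(
    StreamName=STREAM_NAME,
    Data=json.dumps({"user": "u-123", "action": "click"}).encode(),
    PartitionKey="u-123",
)

# 2) A Firehose delivery stream with the data stream as its source
#    batches the records and delivers them into an S3 bucket for analysis.
firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-to-s3",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": f"arn:aws:kinesis:{REGION}:123456789012:stream/{STREAM_NAME}",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-read-kinesis",
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-write-s3",
        "BucketARN": "arn:aws:s3:::analytics-landing-bucket",
        "Prefix": "clickstream/",
    },
)
```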
The AWS Skill Builder offers practice exams that are instrumental in familiarizing yourself with the
format and types of questions you'll encounter. Similar practice exams are available on Udemy as well. Aim to score above 800 on your practice exam,
review your answers, and take notes on the practice exams to solidify your understanding and readiness for the actual exam.
Exam experience
The exam lasts 130 minutes and costs $150. It comprises 65 questions, which are either multiple-choice to select one correct answer out of four
or multiple-response where two or more answers must be selected from five options. There is no penalty for guessing, so it is advisable not to leave any
answer blank. A minimum score of 720 out of 1000 is required to pass. This exam is challenging; thorough preparation and practice are essential for success.
To register, create an account on the AWS certification
site. The exam can be taken online via OnVUE or in-person at a test center. I took my exam at a test center and highly recommend this option to minimize
distractions that might occur at home. Below are some of my best-known methods (BKMs).
- Understanding why incorrect (distractor) answers are wrong is crucial. Begin by eliminating these options first.
- If you are not completely certain about an answer, flag the question and return to it later for review.
- One of my tricks is to preview the choices before fully reading each question. This approach provides additional insight and context.
Material
Note
I successfully passed the exam with a score exceeding 810 on my first attempt. AWS dispatches your score and results within 7 working days, though I received mine in just 2 days. You
can also download your score report from the exam registration portal. The report shows not only the result but also your performance percentage in each domain.
The digital badge is issued with a three-year expiration. An additional benefit of passing is a 50% discount from AWS on your next exam. Don't forget to claim it.
AWS ML Specialty certification
The AWS Certified Machine Learning – Specialty certification is an advanced certification offered by Amazon Web Services that validates your ability
to design, implement, deploy, and maintain Machine Learning (ML) solutions for given business problems. It demonstrates the candidate's expertise in using AWS
services to design and create ML solutions that are scalable and efficient.
The exam content domains include Data Engineering, Exploratory Data Analysis, Modeling, and
ML Implementation and Operations. Around 50% of the exam covers core ML concepts, e.g., handling missing or unbalanced data, evaluating model performance, and model tuning. The other
50% covers AWS ML offerings and best practices, e.g., SageMaker, data ingestion on AWS, AI services, and custom algorithms.
Machine learning and artificial intelligence
are rapidly evolving fields that reach a wide array of industries; this certification ensures your professional skills stay up-to-date with the latest ML technologies
on the AWS cloud.
How do I prepare
My years of experience working with customers on data and AI workloads for training and inference have fortified my skills in AI and ML, which I believed would
align well with the content of the related certification exams. Indeed, my background facilitated a quicker grasp of key concepts, such as employing
a confusion matrix for classification tasks and using metrics like recall, precision, accuracy, or the F1 score to measure model performance. Delving deeper into
model tuning, I dedicated time to understanding underfitting and overfitting, hyperparameter optimization, and data manipulation to meet specific metric criteria.
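As a quick refresher on those classification metrics, here is a small scikit-learn sketch with made-up labels; the exam expects you to reason about these numbers rather than compute them, but seeing them side by side helps.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy ground-truth and predicted labels for a binary classifier.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Confusion matrix: rows are actual classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))

# Precision = TP / (TP + FP), recall = TP / (TP + FN),
# and F1 is the harmonic mean of the two.
print("accuracy ", accuracy_score(y_true, y_pred))
print("precision", precision_score(y_true, y_pred))
print("recall   ", recall_score(y_true, y_pred))
print("f1       ", f1_score(y_true, y_pred))
```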
The "AWS Certified Machine Learning Specialty" course on Udemy, presented by Frank Kane and Stephane Maarek, was an good resource that bolstered my
understanding of AWS ML concepts. Frank's lecture on SageMaker and AWS ML services and operations were particularly enlightening. A significant portion of my
study time was invested in under the hood of SageMaker algorithms and exploring the technical nuances of SageMaker tools such as Canvas, Data Wrangler,
and Autopilot, all of which offered extensive hands-on experience. I highly recommend becoming well-acquainted with these algorithms and toolkits,
as they are integral to the exam.
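To get hands-on with a built-in algorithm, a SageMaker training job can be launched with a few lines of the SageMaker Python SDK. The sketch below uses the built-in XGBoost container; the IAM role ARN, S3 bucket, and hyperparameters are placeholders I chose for illustration.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder role ARN

# Resolve the registry path of the built-in XGBoost container for this region.
image = image_uris.retrieve("xgboost", region=session.boto_region_name, version="1.7-1")

estimator = Estimator(
    image_uri=image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-ml-bucket/output/",  # placeholder bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="binary:logistic", num_round=100, max_depth=5)

# Train on CSV data staged in S3 (first column is the label for built-in XGBoost).
estimator.fit({"train": TrainingInput("s3://my-ml-bucket/train/", content_type="text/csv")})
```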
Additional resources I found exceptionally helpful include the AWS blog and
the AWS Architecture Center. Newly posted blogs on ML usage scenarios
provide an insightful perspective on potential solutions and keep the learning interesting.
As for the exam itself, frequent practice increases your chances of passing.
The AWS Skill Builder offers practice exams that are instrumental in
familiarizing yourself with the format and types of questions you'll encounter. Similar practice exams are available on Udemy as well.
Aim to score above 800 on your practice exam, review your answers, and take notes on the practice exams to solidify your understanding and readiness for the actual exam.
Exam experience
The exam lasts 180 minutes and costs $300. It comprises 65 questions, which are either multiple-choice to select one correct answer out of four
or multiple-response where two or more answers must be selected from five options. There is no penalty for guessing, so it is advisable not to leave any
answer blank. A minimum score of 750 out of 1000 is required to pass. This exam is at an advanced specialty level, which makes it even more challenging.
Thorough preparation and practice are essential for success.
To register, log in and register on the same web portal as for my previous AWS SA exam. If you are new to AWS exams, create an account on the AWS certification
site. The exam can be taken online via OnVUE or in-person at a test center. I took my exam at a test center and highly recommend this option to minimize
distractions that might occur at home. Below are some of my best-known methods (BKMs).
- Understanding why incorrect (distractor) answers are wrong is crucial. Begin by eliminating these options first.
- If you are not completely certain about an answer, flag the question and return to it later for review. I flagged more questions in this exam, as the distractors
are not so obvious to eliminate.
- One of my tricks is to preview the choices before fully reading each question. This approach provides additional insight and context.
Material
Note
I opted to reschedule my first exam appointment, as I did not feel highly confident about such a rigorous three-hour exam. It paid off when
I successfully passed the exam with a score exceeding 800 on my first attempt. AWS dispatches your score and results within 7 working days, though I received mine in
just 2 days. You can also download your score report from the exam registration portal. The report shows not only the result but also your performance percentage in each domain.
The digital badge is issued with a three-year expiration. An additional benefit of passing is a 50% discount from AWS on your next exam. Don't forget to claim it.
Remember to clean up your AWS services and environment after completing your hands-on sessions to avoid unnecessary costs. Be aware that some of the ML
services are not included in the free tier. Setting up budget alerts and disabling unnecessary services are crucial lessons I learned while preparing for this exam.
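For the budget-alert lesson, here is a minimal boto3 sketch that creates a monthly cost budget with an email notification at 80% of actual spend; the account ID, budget amount, and email address are placeholders.

```python
import boto3

budgets = boto3.client("budgets", region_name="us-east-1")
ACCOUNT_ID = "123456789012"  # placeholder account ID

# A monthly cost budget of $20 with an email alert at 80% of actual spend.
budgets.create_budget(
    AccountId=ACCOUNT_ID,
    Budget={
        "BudgetName": "ml-study-budget",
        "BudgetLimit": {"Amount": "20", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "me@example.com"}],
        }
    ],
)
```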
GCP Professional Cloud Architect certification
The Google Cloud Professional Cloud Architect (PCA) certification is a prestigious credential offered by Google Cloud to validate an individual's ability to
design, develop, and manage robust, secure, scalable, highly available, and dynamic solutions to drive business objectives on Google Cloud Platform (GCP).
GCP has integrated Google Workspace and the Vertex AI platform to boost collaborative and advanced AI capabilities, attracting more customers to the GCP cloud.
This integration provides a competitive edge in the rapidly evolving cloud computing domain.
How do I prepare
I leveraged my experience with AWS hybrid-cloud services to broaden my knowledge in GCP. Both platforms share similar concepts in cloud infrastructure components,
such as compute instances, VMs, storage, databases, networking, and containers, though their implementations differ. For example, creating a new VM on both GCP and
AWS involves similar configurations for instance type, image, storage, and networking. Despite significant differences in APIs and console interfaces, the underlying
cloud concepts remain similar, which accelerated my learning curve in GCP. I found the documentation comparing
AWS and Azure services to Google Cloud particularly helpful for mapping equivalent services between GCP and AWS.
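To illustrate how similar the building blocks are, here is a rough side-by-side sketch that launches a small VM with each platform's Python SDK; the AMI ID, project, zone, and names are placeholders, and credentials/quotas are assumed to be in place.

```python
import boto3
from google.cloud import compute_v1

# --- AWS: launch a t3.micro from an AMI (AMI ID is a placeholder) ---
ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
)

# --- GCP: insert an e2-micro instance in a project/zone (placeholders) ---
project, zone = "my-gcp-project", "us-central1-a"

boot_disk = compute_v1.AttachedDisk(
    boot=True,
    auto_delete=True,
    initialize_params=compute_v1.AttachedDiskInitializeParams(
        source_image="projects/debian-cloud/global/images/family/debian-12",
        disk_size_gb=10,
    ),
)
instance = compute_v1.Instance(
    name="demo-vm",
    machine_type=f"zones/{zone}/machineTypes/e2-micro",
    disks=[boot_disk],
    network_interfaces=[compute_v1.NetworkInterface(network="global/networks/default")],
)
compute_v1.InstancesClient().insert(project=project, zone=zone, instance_resource=instance)
```

In both cases the same decisions are made: instance type, boot image, disk, and network; only the API shapes differ.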
Google Cloud training offers a diverse set of learning resources, including videos, documents,
and labs, providing an excellent jumpstart. The labs grant temporary credentials to log into GCP for learning purposes, ensuring that there is no risk to your
personal account or unexpected charges. I began with the Cloud Architect learning path on the
GCP Skill Boost
site; completing courses earns you points, and seeing your position on the leaderboard added a fun and motivating element to my learning experience.
To explore the breadth and scope of the domains covered in the exam, the Preparing
for the Architect Journey course offers diagnostic questions that are instrumental in familiarizing yourself with the format and types of questions you'll
encounter. Review your answers and take notes on these diagnostic questions to solidify your understanding and readiness for the actual exam. Another good
resource is the Cloud Architecture Center, which provides design guides
and reference architectures that were quite helpful in my learning.
Exam experience
The exam lasts 120 minutes and costs $200. It consists of 50-60 multiple-choice and multi-select questions, as outlined in the exam guide. In my exam session,
however, I encountered only 50 multiple-choice questions, each with one correct answer out of four options. There is no penalty for guessing, so it is advisable
not to leave any answers blank. A minimum score of 800 out of 1000 is required to pass.
To register for the exam, create an account on the Webassessor
site. The exam can be taken remotely proctored or onsite proctored at a test center close to you.
I took my exam at a test center and highly recommend this option to minimize distractions that might occur at home. Below are some of my best-known methods (BKMs).
- Understanding why incorrect (distractor) answers are wrong is crucial. Begin by eliminating these options first.
- If you are not completely certain about an answer, flag the question and return to it later for review.
Material
Note
I successfully passed the exam on my first attempt. Unlike the AWS certifications, the result is displayed immediately after completing and submitting the exam,
providing a straightforward way to know the outcome without the delay of waiting for days. Exciting! The official report, indicating whether you passed or failed,
is sent within 7 working days.
Accelerate ZFS performance with Persistent Memory
ZFS (Zettabyte File System) is an advanced file system and logical volume manager originally designed by Sun Microsystems. Initially integrated into Sun's
Solaris Operating System, ZFS has been ported to other Unix-like systems, including FreeBSD and OpenZFS on Linux. ZFS is widely utilized in
Software-Defined Storage (SDS) for NAS (Network Attached Storage) or NFS (Network File System) storage products. This blog post explores how to enhance performance
by leveraging fast SSDs, such as persistent memory, as tiering and caching layers.
ZFS storage architecture
Traditional memory and storage tiering adopt a two-tier pyramid layout; memory has a smaller capacity and is closer to the CPU with nanosecond latency,
while storage disks, although positioned farther from the CPU, offer significantly larger capacity with millisecond-level latency (as shown in Fig. 1).
Cache misses incur substantial performance penalties by necessitating data retrieval from disk storage. Modern storage architectures incorporate additional layers,
such as fast SSDs that sit between memory and storage disks, offering larger capacity than memory and lower latency than disks (as shown in Fig. 2). This mitigates the penalties associated
with memory cache misses and enhances both read and write speeds, thereby accelerating data retrieval back to the CPU. In the ZFS software stack, the ZIL
(ZFS Intent Log) and L2ARC (Level 2 Adaptive Replacement Cache) serve as tiering and caching layers for synchronous write and read operations using fast SSDs
(as shown in Fig. 3).
ZIL stands for ZFS Intent Log. In ZFS, the ZIL is crucial for logging synchronous operations to disk before they are written to disk-based zpool storage.
This synchronous logging ensures that operations are completed and writes are committed to persistent storage, rather than merely being cached in memory.
The ZIL functions as a persistent write buffer, facilitating quick and safe handling of synchronous operations. This is particularly important before the
spa_sync() operation, which can take considerable time to access disk-based zpool storage. Essentially, ZIL is designed to enhance the reliability and integrity
of data by processing writes more efficiently prior to long-term storage commits during spa_sync().
The Level 2 ARC (L2ARC) is a cache layer in between memory and the disk-based zpool storage. It uses a dedicated fast SSD to enlarge the cache capacity, holding more cached data and
boosting random read performance. There is no direct eviction path from the ARC to the L2ARC; instead, the L2ARC copies data from the ARC before it is evicted,
so the ARC and L2ARC together act as a larger cache that reduces cache misses. arc_read() only reads from the disk-based zpool storage when the data exists in neither the ARC nor the L2ARC.
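In practice, both layers are attached as dedicated vdevs when the pool is created (or added later with zpool add). Below is a hedged sketch of the commands, wrapped in Python only to stay consistent with the other examples in this blog; the pool name and device paths are placeholders for your data disks and fast SSDs.

```python
import subprocess

def run(cmd):
    """Echo and execute a command (requires root and ZFS installed)."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Placeholder device names: adjust to your NAND SSDs/HDDs and
# persistent-memory SSDs (e.g. Optane P1600X/P5800X namespaces).
data_disks = ["/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde", "/dev/sdf", "/dev/sdg"]
zil_dev = "/dev/nvme0n1"    # fast, low-latency SSD for the ZIL (SLOG)
l2arc_dev = "/dev/nvme1n1"  # fast SSD for the L2ARC read cache

# Create a RAIDZ2 pool over the data disks, with a separate log (ZIL)
# vdev and a cache (L2ARC) vdev.
run(["zpool", "create", "tank", "raidz2", *data_disks,
     "log", zil_dev,
     "cache", l2arc_dev])

# Verify the log and cache vdevs are attached.
run(["zpool", "status", "tank"])
```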
Performance profile
Intel Optane persistent memory SSDs, such as the P5800X and P1600X with micro-second latency, are ideal for the ZIL and L2ARC layers. In the test
configuration (shown in Fig. 4), six QLC NAND SSDs configured with RAIDZ2 are used as zpool storage, with a P1600X serving as the ZIL log. This setup
reduced 4 KByte random write latency by 67%. A similar configuration that replaced the zpool storage with 20 HDDs significantly enhanced the 4K random
write IOPS, improving it by more than 100 times (as depicted in Fig. 5). In the data cache test scenario, with the zpool storage comprising 20 HDDs in a RAID1 array and a P5800X
as the L2ARC, random read IOPS also improved by more than 100 times (shown in Fig. 6). The data indicates that the tiering storage architecture significantly
enhances both read and write performance.
Reference
Well-architecting and profiling Datacenter NVMe SSDs
Solid State Drives (SSDs) have rapidly evolved due to advancements in NAND flash media processes and interface protocols, leading to higher capacities
and lower latencies. This blog aims to provide an overview of the architectural considerations and tools necessary for adopting and profiling Datacenter
NVMe flash SSDs. Important considerations include leveraging a tiering architecture to balance Total Cost of Ownership (TCO) and performance as NAND flash
media transitions from SLC (Single-Level Cell) to denser formats like TLC (Triple-Level Cell), QLC (Quad-Level Cell), and even PLC (Penta-Level Cell),
with endurance decreasing significantly. Additionally, integrating low-latency storage software stacks such as SPDK or io_uring becomes crucial
when employing low-latency SSDs in mission-critical workloads.
NVMe SSD interface and form factor
A few key factors for NVMe SSDs:
- Interface: Modern NVMe SSDs based on PCIe 5 offer a staggering 128 GT/s, providing more than 20 times the bandwidth of the latest SATA3
(Serial ATA) interface, which supports 6 Gbit/s; a quick bandwidth arithmetic sketch follows this list. NVMe SSDs based on PCIe can significantly influence the performance and scalability of the storage
system. Additionally, the next-generation Compute Express Link (CXL), based on PCIe 5, can also be utilized to set up storage pooling/sharing in
heterogeneous computing environments.
- Flash Media: As NAND flash media capacity increases with bits-per-cell, write performance does not scale similarly because the write block
size also increases. High-capacity QLC SSDs exhibit lower performance and reduced endurance in small block write workloads compared to TLC SSDs.
A tiering architecture that leverages the high performance and endurance of SLC or persistent memory combined with QLC/PLC SSDs emerges as a
genuine solution.
- SSD Form Factor: The choice of form factor influences hardware design and backplane configurations in your server. General M.2 and U.2
form factors are commonly used in today's data centers. The new E1.s form factor by SNIA SFF-TA offers benefits in capacity, throughput,
and thermal management, positioning it as a superior replacement for M.2 and U.2 in 1U/2U servers.
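As a quick sanity check of the "20 times" claim in the interface bullet above, here is the back-of-the-envelope arithmetic using nominal line rates and encoding overhead only:

```python
# Nominal per-direction bandwidth, ignoring protocol overhead.
pcie5_lane_gt_s = 32                                      # PCIe 5.0: ~32 GT/s per lane
pcie5_x4_gb_s = 4 * (pcie5_lane_gt_s * 128 / 130) / 8     # x4 link, 128b/130b encoding -> ~15.8 GB/s

sata3_gb_s = 6 * (8 / 10) / 8                             # SATA3: 6 Gbit/s, 8b/10b encoding -> ~0.6 GB/s

print(f"PCIe 5.0 x4 ~ {pcie5_x4_gb_s:.1f} GB/s")
print(f"SATA3       ~ {sata3_gb_s:.1f} GB/s")
print(f"ratio       ~ {pcie5_x4_gb_s / sata3_gb_s:.0f}x")  # roughly 26x, i.e. more than 20x
```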
Low latency storage stack
io_uring is a modern Linux I/O interface designed to reduce overhead and enhance performance. Its zero-copy and no-locking design can improve latency by
up to 60% compared to traditional asynchronous I/O methods like libaio. io_uring was introduced in Linux kernel v5.1 and is supported by benchmark applications
such as fio and t/io_uring, which have integrated io_uring capabilities. However, io_uring still relies on Linux context switches for I/O operations, which may
introduce performance penalties. In contrast, the Storage Performance Development Kit (SPDK) moves all necessary drivers into user space, leveraging polling and
avoiding system calls to achieve superior storage performance. SPDK is also widely used in RDMA and NVMe-oF disaggregated storage architectures.
The following test from SNIA shows that io_uring reduces latency by 60% compared to libaio, and SPDK reduces latency by a further 60% compared to io_uring.
Performance profile
Before profiling your NVMe SSD, it is crucial to verify the SSD's health and ensure it has the correct firmware version ready for benchmarking. The open-source
tool nvme-cli is used for managing, monitoring health, and updating firmware on NVMe SSDs. A good practice is to use nvme-cli to install the latest firmware,
check the appropriate namespace for your target, and monitor health via SMART logs. Low-level formatting and pre-conditioning are recommended before initiating
benchmark workloads. Persistent memory SSDs, such as Optane, which feature highly consistent flash media, can streamline the profiling process by reducing the
need for lengthy pre-conditioning cycles.
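Below is a hedged sketch of that nvme-cli routine, wrapped in Python for consistency with the other examples; the device paths and firmware image are placeholders, and the fw-commit slot/action values should be checked against your drive's documentation before use.

```python
import subprocess

DEV = "/dev/nvme0"        # controller (placeholder)
NS = "/dev/nvme0n1"       # namespace under test (placeholder)
FW = "new_firmware.bin"   # vendor-provided firmware image (placeholder)

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["nvme", "list"])            # enumerate NVMe devices and namespaces
run(["nvme", "id-ctrl", DEV])    # confirm model and firmware revision
run(["nvme", "smart-log", DEV])  # health: temperature, media errors, wear

# Update firmware (slot/action values depend on the drive; check vendor docs).
run(["nvme", "fw-download", DEV, f"--fw={FW}"])
run(["nvme", "fw-commit", DEV, "--slot=1", "--action=1"])

# Low-level format (secure erase) of the namespace before pre-conditioning.
run(["nvme", "format", NS, "--ses=1"])
```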
The Flexible I/O Tester (fio) is a widely recognized and versatile tool for I/O benchmarking and stress-testing storage
systems. It supports extensive configuration options with different I/O engines, read/write ratios, and block sizes to simulate realistic workload patterns.
The following graph profiles a TLC NVMe SSD under a 70/30 read/write ratio and a 4K random workload over 60 minutes. It took approximately 50 minutes for the NVMe
drive to reach a steady performance state. A performance drop of about 70% was observed after 20 minutes of runtime, coinciding with the start of the garbage collection
process. Ensure that the profiling duration is long enough to reach a steady state for reliable and accurate measurements.
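Below is a sketch of an fio invocation approximating the workload described above (4K random, 70/30 read/write, 60 minutes); the target device is a placeholder whose data will be destroyed, and io_uring is chosen as the engine in line with the earlier section.

```python
import subprocess

TARGET = "/dev/nvme0n1"  # placeholder; all data on this device will be destroyed

fio_cmd = [
    "fio",
    "--name=mixed-4k-randrw",
    f"--filename={TARGET}",
    "--ioengine=io_uring",  # low-latency engine discussed above (libaio also works)
    "--direct=1",           # bypass the page cache
    "--rw=randrw",
    "--rwmixread=70",       # 70% reads / 30% writes
    "--bs=4k",
    "--iodepth=32",
    "--numjobs=4",
    "--time_based",
    "--runtime=3600",       # 60 minutes, long enough to reach steady state
    "--group_reporting",
]
subprocess.run(fio_cmd, check=True)
```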
Reference
Observability with Grafana, Prometheus, and Intel VTune
Observability in cloud computing involves the collection and analysis of various telemetry types, such as metrics, distributed traces, and logs.
It enables understanding and diagnosing a system's internal state, performance, and issues without disrupting operations. Similar tools are offered
by cloud service providers, including AWS CloudWatch, AWS CloudTrail, Google Cloud Operations, and Azure Monitor. On-premises environments often utilize tools
like Prometheus and OpenTelemetry, integrated with Grafana for observability. Intel's oneAPI VTune is also widely adopted to address application performance bottlenecks.
This blog aims to provide a deeper insight into Prometheus, Grafana, and Intel oneAPI VTune.
Prometheus, Grafana, and Intel VTune
Prometheus is an open-source monitoring and alerting toolkit that collects and stores metrics as time-series data. It supports a variety of libraries and
servers, known as exporters, which facilitate the exporting of metrics from systems into Prometheus-compatible formats. The widely used node_exporter
exposes hardware and OS-level metrics from the kernel, such as CPU usage, disk I/O, memory info, and network stats. Users can also create custom
exporters for proprietary metrics using the official client libraries, most commonly in Go.
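As a hedged illustration of a custom exporter (using the Python client library rather than Go, purely for consistency with the other examples here), the sketch below exposes a made-up gauge on an HTTP endpoint that Prometheus can scrape.

```python
import random
import time

from prometheus_client import Gauge, start_http_server

# A made-up "proprietary" metric exposed as a Prometheus gauge.
QUEUE_DEPTH = Gauge("appliance_queue_depth", "Current depth of the appliance work queue")

def collect_queue_depth() -> float:
    """Placeholder for real telemetry collection (random value for the demo)."""
    return random.randint(0, 100)

if __name__ == "__main__":
    start_http_server(9101)  # scrape target: http://localhost:9101/metrics
    while True:
        QUEUE_DEPTH.set(collect_queue_depth())
        time.sleep(5)        # refresh interval; Prometheus scrapes on its own schedule
```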
Grafana is an open-source visualization and analytics platform that enables you to query, visualize, alert on, and explore your metrics, regardless of
their storage location. It transforms time-series data from Prometheus node_exporters into compelling graphs and visualizations. Grafana supports a wide
array of data sources including InfluxDB, Graphite, and Elasticsearch. Additionally, it allows for the integration of multiple data sources within the same
dashboard, facilitating a comprehensive view of data across various environments and platforms.
Intel VTune Profiler is a powerful tool for deep performance optimization, ensuring that software runs efficiently on hardware. It examines
application code and identifies threading issues and resource bottlenecks. VTune also analyzes code and thread hotspots that significantly impact performance.
Below is a Grafana dashboard displaying metrics sourced from Prometheus, followed by the VTune Profiler user interface.
Summary
Prometheus and Grafana together deliver exceptional observability by monitoring and alerting on time-series data, enhanced by rich visualizations for
on-premises environments. Intel VTune provides in-depth analysis for application and resource performance optimization. All these tools are
readily available to enhance observability in private cloud environments. In public cloud or hybrid-cloud settings, leveraging tools from cloud service providers can
significantly extend observability capabilities beyond what Grafana or VTune alone can offer.
Reference