AWS Solutions Architect certification
The AWS Certified Solutions Architect certification, offered by Amazon Web Services (AWS), is designed to validate an
individual's expertise in designing resilient, high-performing, secure, and cost-efficient architectures on the AWS platform and its services.
According to 2023 data, AWS holds over 30% of the global cloud computing market. Obtaining the AWS Solutions Architect certification not
only enhances your cloud knowledge but also equips you to design better solutions that meet SLAs and lower TCO.
How do I prepare
My experience in cloud computing spans several years of working closely with customers on both on-premises and hybrid-cloud environments. This has provided me
with a solid foundational knowledge of storage, networking, compute instances, VMs, and more. However, the vast array of over 200 AWS services can initially seem
overwhelming. The "Ultimate AWS Certified Solution Architect Associate SAA-C03" course on Udemy, taught by Stephane Maarek, was instrumental in helping me navigate
these services. Stephane's clear and structured explanations of AWS services make them easily understandable. The course's hands-on labs encourage practical engagement,
ensuring concepts are not only learned but fully grasped.
After grasping the technical concepts of each service, it's beneficial to develop a holistic architectural perspective, linking services together to meet
specific requirements, for instance, integrating Kinesis Data Streams and Kinesis Data Firehose to ingest real-time data into an S3 bucket for analysis. Akin
to connecting LEGO pieces, each service interlocks to build a comprehensive solution. Resources like the AWS blog
and the AWS Architecture Center offer valuable material to reference and help establish that holistic view.
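To make the LEGO analogy concrete, below is a minimal boto3 sketch of that pattern: producers write records into a Kinesis Data Stream, and a Firehose delivery stream with the data stream as its source lands the records in S3. The stream name, bucket, account ID, and IAM role ARNs are placeholders for illustration, and error handling is omitted.

```python
import json
import boto3

# Hypothetical names/ARNs used only for illustration.
STREAM_NAME = "clickstream-demo"
REGION = "us-east-1"

kinesis = boto3.client("kinesis", region_name=REGION)
firehose = boto3.client("firehose", region_name=REGION)

# 1) Producers write real-time records into a Kinesis Data Stream.
kinesis.put_record(
    StreamName=STREAM_NAME,
    Data=json.dumps({"user": "u-123", "action": "click"}).encode(),
    PartitionKey="u-123",
)

# 2) A Firehose delivery stream with the data stream as its source
#    batches the records and delivers them into an S3 bucket for analysis.
firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-to-s3",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": f"arn:aws:kinesis:{REGION}:123456789012:stream/{STREAM_NAME}",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-read-kinesis",
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-write-s3",
        "BucketARN": "arn:aws:s3:::analytics-landing-bucket",
        "Prefix": "clickstream/",
    },
)
```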
The AWS Skill Builder offers practice exams that are instrumental in familiarizing yourself with the
format and types of questions you'll encounter. Similar practice exams are available on Udemy as well. Aim to score above 800 on your practice exam,
review your answers, and take notes on the practice exams to solidify your understanding and readiness for the actual exam.
Exam experience
The exam lasts 130 minutes and costs $150. It comprises 65 questions, which are either multiple-choice to select one correct answer out of four
or multiple-response where two or more answers must be selected from five options. There is no penalty for guessing, so it is advisable not to leave any
answer blank. A minimum score of 720 out of 1000 is required to pass. This exam is challenging; thorough preparation and practice are essential for success.
To register, create an account on the AWS certification
site. The exam can be taken online via OnVUE or in-person at a test center. I took my exam at a test center and highly recommend this option to minimize
distractions that might occur at home. Below are some of my best-known methods (BKMs).
- Understanding why incorrect (distractor) answers are wrong is crucial. Begin by eliminating these options first.
- If you are not completely certain about an answer, flag the question and return to it later for review.
- One of my tricks is to preview the choices before fully reading each question. This approach provides additional insight and context.
Material
Note
I successfully passed the exam with a score exceeding 810 on my first attempt. AWS dispatches your score and results within 7 working days, though I received mine in just 2 days. You
can also download your score report from the exam registration portal. The report shows not only the result but also your performance percentage in each domain.
The digital badge is issued with a three-year expiration. An additional benefit of passing is a 50% discount from AWS on your next exam. Don't forget to claim it.
AWS ML Specialty certification
The AWS Certified Machine Learning – Specialty certification is an advanced certification offered by Amazon Web Services that validates your ability
to design, implement, deploy, and maintain Machine Learning (ML) solutions for given business problems. It demonstrates the candidate's expertise in using AWS
services to design and create ML solutions that are scalable and efficient.
The exam content domains include Data Engineering, Exploratory Data Analysis, Modeling, and
ML Implementation and Operations. Around 50% of the exam covers core ML concepts, e.g., handling missing or unbalanced data, evaluating model performance, and model tuning. The other
50% covers AWS ML offerings and best practices, e.g., SageMaker, data ingestion on AWS, AI services, and custom algorithms.
Machine learning and artificial intelligence
are rapidly evolving fields that reach a wide array of industries; this certification ensures your professional skills stay up-to-date with the latest ML technologies
on the AWS cloud.
How do I prepare
My years of experience working with customers on data and AI workloads for training and inference have fortified my skills in AI and ML, which I believed would
align well with the content of the related certification exams. Indeed, my background facilitated a quicker grasp of key concepts, such as employing
a confusion matrix for classification tasks and using metrics like recall, precision, accuracy, or the F1 score to measure model performance. Delving deeper into
model tuning, I dedicated time to understanding underfitting and overfitting, hyperparameter optimization, and data manipulation to meet specific metric criteria.
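As a quick refresher on those classification metrics, here is a small scikit-learn sketch with made-up labels; the exam expects you to reason about these numbers rather than compute them, but seeing them side by side helps.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy ground-truth and predicted labels for a binary classifier.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Confusion matrix: rows are actual classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))

# Precision = TP / (TP + FP), recall = TP / (TP + FN),
# and F1 is the harmonic mean of the two.
print("accuracy ", accuracy_score(y_true, y_pred))
print("precision", precision_score(y_true, y_pred))
print("recall   ", recall_score(y_true, y_pred))
print("f1       ", f1_score(y_true, y_pred))
```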
The "AWS Certified Machine Learning Specialty" course on Udemy, presented by Frank Kane and Stephane Maarek, was an good resource that bolstered my
understanding of AWS ML concepts. Frank's lecture on SageMaker and AWS ML services and operations were particularly enlightening. A significant portion of my
study time was invested in under the hood of SageMaker algorithms and exploring the technical nuances of SageMaker tools such as Canvas, Data Wrangler,
and Autopilot, all of which offered extensive hands-on experience. I highly recommend becoming well-acquainted with these algorithms and toolkits,
as they are integral to the exam.
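To get hands-on with a built-in algorithm, a SageMaker training job can be launched with a few lines of the SageMaker Python SDK. The sketch below uses the built-in XGBoost container; the IAM role ARN, S3 bucket, and hyperparameters are placeholders I chose for illustration.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder role ARN

# Resolve the registry path of the built-in XGBoost container for this region.
image = image_uris.retrieve("xgboost", region=session.boto_region_name, version="1.7-1")

estimator = Estimator(
    image_uri=image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-ml-bucket/output/",  # placeholder bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="binary:logistic", num_round=100, max_depth=5)

# Train on CSV data staged in S3 (first column is the label for built-in XGBoost).
estimator.fit({"train": TrainingInput("s3://my-ml-bucket/train/", content_type="text/csv")})
```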
Additional resources I found exceptionally helpful include the AWS blog and
the AWS Architecture Center. Newly posted blogs on ML usage scenarios
provide an insightful perspective on potential solutions and keep the learning interesting.
As for the exam itself, frequent practice increases your chances of passing.
The AWS Skill Builder offers practice exams that are instrumental in
familiarizing yourself with the format and types of questions you'll encounter. Similar practice exams are available on Udemy as well.
Aim to score above 800 on your practice exam, review your answers, and take notes on the practice exams to solidify your understanding and readiness for the actual exam.
Exam experience
The exam lasts 180 minutes and costs $300. It comprises 65 questions, which are either multiple-choice to select one correct answer out of four
or multiple-response where two or more answers must be selected from five options. There is no penalty for guessing, so it is advisable not to leave any
answer blank. A minimum score of 750 out of 1000 is required to pass. This exam is at an advanced specialty level, which makes it even more challenging.
Thorough preparation and practice are essential for success.
To register, log in and register on the same web portal as for my previous AWS SA exam. If you are new to AWS exams, create an account on the AWS certification
site. The exam can be taken online via OnVUE or in-person at a test center. I took my exam at a test center and highly recommend this option to minimize
distractions that might occur at home. Below are some of my best-known methods (BKMs).
- Understanding why incorrect (distractor) answers are wrong is crucial. Begin by eliminating these options first.
- If you are not completely certain about an answer, flag the question and return to it later for review. I flagged more questions in this exam, as the distractors
are not so obvious to eliminate.
- One of my tricks is to preview the choices before fully reading each question. This approach provides additional insight and context.
Material
Note
I opted to reschedule my first exam appointment, as I did not feel highly confident about such a rigorous three-hour exam. It paid off when
I successfully passed the exam with a score exceeding 800 on my first attempt. AWS dispatches your score and results within 7 working days, though I received mine in
just 2 days. You can also download your score report from the exam registration portal. The report shows not only the result but also your performance percentage in each domain.
The digital badge is issued with a three-year expiration. An additional benefit of passing is a 50% discount from AWS on your next exam. Don't forget to claim it.
Remember to clean up your AWS services and environment after completing your hands-on sessions to avoid unnecessary costs. Be aware that some of the ML
services are not included in the free tier. Setting up budget alerts and disabling unnecessary services are crucial lessons I learned while preparing for this exam.
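For the budget-alert lesson, here is a minimal boto3 sketch that creates a monthly cost budget with an email notification at 80% of actual spend; the account ID, budget amount, and email address are placeholders.

```python
import boto3

budgets = boto3.client("budgets", region_name="us-east-1")
ACCOUNT_ID = "123456789012"  # placeholder account ID

# A monthly cost budget of $20 with an email alert at 80% of actual spend.
budgets.create_budget(
    AccountId=ACCOUNT_ID,
    Budget={
        "BudgetName": "ml-study-budget",
        "BudgetLimit": {"Amount": "20", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "me@example.com"}],
        }
    ],
)
```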
GCP Professional Cloud Architect certification
The Google Cloud Professional Cloud Architect (PCA) certification is a prestigious credential offered by Google Cloud to validate an individual's ability to
design, develop, and manage robust, secure, scalable, highly available, and dynamic solutions to drive business objectives on Google Cloud Platform (GCP).
GCP has integrated Google Workspace and the Vertex AI platform to boost collaborative and advanced AI capabilities, attracting more customers to the GCP cloud.
This integration provides a competitive edge in the rapidly evolving cloud computing domain.
How do I prepare
I leveraged my experience with AWS hybrid-cloud services to broaden my knowledge in GCP. Both platforms share similar concepts in cloud infrastructure components,
such as compute instances, VMs, storage, databases, networking, and containers, though their implementations differ. For example, creating a new VM on both GCP and
AWS involves similar configurations for instance type, image, storage, and networking. Despite significant differences in APIs and console interfaces, the underlying
cloud concepts remain similar, which accelerated my learning curve in GCP. I found the documentation comparing
AWS and Azure services to Google Cloud particularly helpful for mapping equivalent services between GCP and AWS.
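To illustrate how similar the building blocks are, here is a rough side-by-side sketch that launches a small VM with each platform's Python SDK; the AMI ID, project, zone, and names are placeholders, and credentials/quotas are assumed to be in place.

```python
import boto3
from google.cloud import compute_v1

# --- AWS: launch a t3.micro from an AMI (AMI ID is a placeholder) ---
ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
)

# --- GCP: insert an e2-micro instance in a project/zone (placeholders) ---
project, zone = "my-gcp-project", "us-central1-a"

boot_disk = compute_v1.AttachedDisk(
    boot=True,
    auto_delete=True,
    initialize_params=compute_v1.AttachedDiskInitializeParams(
        source_image="projects/debian-cloud/global/images/family/debian-12",
        disk_size_gb=10,
    ),
)
instance = compute_v1.Instance(
    name="demo-vm",
    machine_type=f"zones/{zone}/machineTypes/e2-micro",
    disks=[boot_disk],
    network_interfaces=[compute_v1.NetworkInterface(network="global/networks/default")],
)
compute_v1.InstancesClient().insert(project=project, zone=zone, instance_resource=instance)
```

In both cases the same decisions are made: instance type, boot image, disk, and network; only the API shapes differ.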
Google Cloud training offers a diverse set of learning resources, including videos, documents,
and labs, providing an excellent jumpstart. The labs grant temporary credentials to log into GCP for learning purposes, ensuring that there is no risk to your
personal account or unexpected charges. I began with the Cloud Architect learning path on the
GCP Skill Boost
site; completing courses earns you points, and seeing your position on the leaderboard added a fun and motivating element to my learning experience.
To explore the breadth and scope of the domains covered in the exam, the Preparing
for the Architect Journey course offers diagnostic questions that are instrumental in familiarizing yourself with the format and types of questions you'll
encounter. Review your answers and take notes on these diagnostic questions to solidify your understanding and readiness for the actual exam. Another good
resource is the Cloud Architecture Center, which provides design guides
and reference architectures that were quite helpful in my learning.
Exam experience
The exam lasts 120 minutes and costs $200. It consists of 50-60 multiple-choice and multi-select questions, as outlined in the exam guide. In my exam session,
however, I encountered only 50 multiple-choice questions, each with one correct answer out of four options. There is no penalty for guessing, so it is advisable
not to leave any answers blank. A minimum score of 800 out of 1000 is required to pass.
To register for the exam, create an account on the Webassessor
site. The exam can be taken remotely proctored or onsite proctored at a test center close to you.
I took my exam at a test center and highly recommend this option to minimize distractions that might occur at home. Below are some of my best-known methods (BKMs).
- Understanding why incorrect (distractor) answers are wrong is crucial. Begin by eliminating these options first.
- If you are not completely certain about an answer, flag the question and return to it later for review.
Material
Note
I successfully passed the exam on my first attempt. Unlike the AWS certifications, the result is displayed immediately after completing and submitting the exam,
providing a straightforward way to know the outcome without the delay of waiting for days. Exciting! The official report, indicating whether you passed or failed,
is sent within 7 working days.
Accelerate ZFS performance with Persistent Memory
ZFS (Zettabyte File System) is an advanced file system and logical volume manager originally designed by Sun Microsystems. Initially integrated into Sun's
Solaris Operating System, ZFS has been ported to other Unix-like systems, including FreeBSD and OpenZFS on Linux. ZFS is widely utilized in
Software-Defined Storage (SDS) for NAS (Network Attached Storage) or NFS (Network File System) storage products. This blog post explores how to enhance performance
by leveraging fast SSDs, such as persistent memory, as tiering and caching layers.
ZFS storage architecture
Traditional memory and storage tiering adopt a two-tier pyramid layout; memory has a smaller capacity and is closer to the CPU with nanosecond latency,
while storage disks, although positioned farther from the CPU, offer significantly larger capacity with millisecond-level latency (as shown in Fig. 1).
Cache misses incur substantial performance penalties by necessitating data retrieval from disk storage. Modern storage architectures incorporate additional layers,
such as fast SSDs that sit between memory and storage disks, offering larger capacity than memory and lower latency than disks (as shown in Fig. 2). This mitigates the penalties associated
with memory cache misses and enhances both read and write speeds, thereby accelerating data retrieval back to the CPU. In the ZFS software stack, the ZIL
(ZFS Intent Log) and L2ARC (Level 2 Adaptive Replacement Cache) serve as tiering and caching layers for synchronous write and read operations using fast SSDs
(as shown in Fig. 3).
ZIL stands for ZFS Intent Log. In ZFS, the ZIL is crucial for logging synchronous operations to disk before they are written to disk-based zpool storage.
This synchronous logging ensures that operations are completed and writes are committed to persistent storage, rather than merely being cached in memory.
The ZIL functions as a persistent write buffer, facilitating quick and safe handling of synchronous operations. This is particularly important before the
spa_sync() operation, which can take considerable time to access disk-based zpool storage. Essentially, ZIL is designed to enhance the reliability and integrity
of data by processing writes more efficiently prior to long-term storage commits during spa_sync().
The Level 2 ARC (L2ARC) is a cache layer in between memory and the disk-based zpool storage. It uses a dedicated fast SSD to enlarge the cache capacity, holding more cached data and
boosting random read performance. There is no direct eviction path from the ARC to the L2ARC; instead, the L2ARC copies data from the ARC before it is evicted,
so the ARC and L2ARC together act as a larger cache that reduces cache misses. arc_read() only reads from the disk-based zpool storage when the data exists in neither the ARC nor the L2ARC.
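In practice, both layers are attached as dedicated vdevs when the pool is created (or added later with zpool add). Below is a hedged sketch of the commands, wrapped in Python only to stay consistent with the other examples in this blog; the pool name and device paths are placeholders for your data disks and fast SSDs.

```python
import subprocess

def run(cmd):
    """Echo and execute a command (requires root and ZFS installed)."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Placeholder device names: adjust to your NAND SSDs/HDDs and
# persistent-memory SSDs (e.g. Optane P1600X/P5800X namespaces).
data_disks = ["/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde", "/dev/sdf", "/dev/sdg"]
zil_dev = "/dev/nvme0n1"    # fast, low-latency SSD for the ZIL (SLOG)
l2arc_dev = "/dev/nvme1n1"  # fast SSD for the L2ARC read cache

# Create a RAIDZ2 pool over the data disks, with a separate log (ZIL)
# vdev and a cache (L2ARC) vdev.
run(["zpool", "create", "tank", "raidz2", *data_disks,
     "log", zil_dev,
     "cache", l2arc_dev])

# Verify the log and cache vdevs are attached.
run(["zpool", "status", "tank"])
```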
Performance profile
Intel Optane persistent memory SSDs, such as the P5800X and P1600X with micro-second latency, are ideal for the ZIL and L2ARC layers. In the test
configuration (shown in Fig. 4), six QLC NAND SSDs configured with RAIDZ2 are used as zpool storage, with a P1600X serving as the ZIL log. This setup
reduced 4 KByte random write latency by 67%. A similar configuration that replaced the zpool storage with 20 HDDs significantly enhanced the 4K random
write IOPS, improving it by more than 100 times (as depicted in Fig. 5). In the data cache test scenario, with the zpool storage comprising 20 HDDs in a RAID1 array and a P5800X
as the L2ARC, random read IOPS also improved by more than 100 times (shown in Fig. 6). The data indicates that the tiering storage architecture significantly
enhances both read and write performance.
Reference
Well-architecting and profiling Datacenter NVMe SSDs
Solid State Drives (SSDs) have rapidly evolved due to advancements in NAND flash media processes and interface protocols, leading to higher capacities
and lower latencies. This blog aims to provide an overview of the architectural considerations and tools necessary for adopting and profiling Datacenter
NVMe flash SSDs. Important considerations include leveraging a tiering architecture to balance Total Cost of Ownership (TCO) and performance as NAND flash
media transitions from SLC (Single-Level Cell) to denser formats like TLC (Triple-Level Cell), QLC (Quad-Level Cell), and even PLC (Penta-Level Cell),
with endurance decreasing significantly. Additionally, integrating low-latency storage software stacks such as SPDK or io_uring becomes crucial
when employing low-latency SSDs in mission-critical workloads.
NVMe SSD interface and form factor
A few key factors for NVMe SSDs:
- Interface: Modern NVMe SSDs based on PCIe 5 offer a staggering 128 GT/s, providing more than 20 times the bandwidth of the latest SATA3
(Serial ATA) interface, which supports 6 Gbit/s; a quick bandwidth arithmetic sketch follows this list. NVMe SSDs based on PCIe can significantly influence the performance and scalability of the storage
system. Additionally, the next-generation Compute Express Link (CXL), based on PCIe 5, can also be utilized to set up storage pooling/sharing in
heterogeneous computing environments.
- Flash Media: As NAND flash media capacity increases with bits-per-cell, write performance does not scale similarly because the write block
size also increases. High-capacity QLC SSDs exhibit lower performance and reduced endurance in small block write workloads compared to TLC SSDs.
A tiering architecture that leverages the high performance and endurance of SLC or persistent memory combined with QLC/PLC SSDs emerges as a
genuine solution.
- SSD Form Factor: The choice of form factor influences hardware design and backplane configurations in your server. General M.2 and U.2
form factors are commonly used in today's data centers. The new E1.s form factor by SNIA SFF-TA offers benefits in capacity, throughput,
and thermal management, positioning it as a superior replacement for M.2 and U.2 in 1U/2U servers.
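As a quick sanity check of the "20 times" claim in the interface bullet above, here is the back-of-the-envelope arithmetic using nominal line rates and encoding overhead only:

```python
# Nominal per-direction bandwidth, ignoring protocol overhead.
pcie5_lane_gt_s = 32                                      # PCIe 5.0: ~32 GT/s per lane
pcie5_x4_gb_s = 4 * (pcie5_lane_gt_s * 128 / 130) / 8     # x4 link, 128b/130b encoding -> ~15.8 GB/s

sata3_gb_s = 6 * (8 / 10) / 8                             # SATA3: 6 Gbit/s, 8b/10b encoding -> ~0.6 GB/s

print(f"PCIe 5.0 x4 ~ {pcie5_x4_gb_s:.1f} GB/s")
print(f"SATA3       ~ {sata3_gb_s:.1f} GB/s")
print(f"ratio       ~ {pcie5_x4_gb_s / sata3_gb_s:.0f}x")  # roughly 26x, i.e. more than 20x
```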
Low latency storage stack
io_uring is a modern Linux I/O interface designed to reduce overhead and enhance performance. Its zero-copy and no-locking design can improve latency by
up to 60% compared to traditional asynchronous I/O methods like libaio. io_uring was introduced in Linux kernel v5.1 and is supported by benchmark applications
such as fio and t/io_uring, which have integrated io_uring capabilities. However, io_uring still relies on Linux context switches for I/O operations, which may
introduce performance penalties. In contrast, the Storage Performance Development Kit (SPDK) moves all necessary drivers into user space, leveraging polling and
avoiding system calls to achieve superior storage performance. SPDK is also widely used in RDMA and NVMe-oF disaggregated storage architectures.
The following test from SNIA shows that io_uring reduces latency by 60% compared to libaio, and SPDK reduces latency by a further 60% compared to io_uring.
Performance profile
Before profiling your NVMe SSD, it is crucial to verify the SSD's health and ensure it has the correct firmware version ready for benchmarking. The open-source
tool nvme-cli is used for managing, monitoring health, and updating firmware on NVMe SSDs. A good practice is to use nvme-cli to install the latest firmware,
check the appropriate namespace for your target, and monitor health via SMART logs. Low-level formatting and pre-conditioning are recommended before initiating
benchmark workloads. Persistent memory SSDs, such as Optane, which feature highly consistent flash media, can streamline the profiling process by reducing the
need for lengthy pre-conditioning cycles.
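Below is a hedged sketch of that nvme-cli routine, wrapped in Python for consistency with the other examples; the device paths and firmware image are placeholders, and the fw-commit slot/action values should be checked against your drive's documentation before use.

```python
import subprocess

DEV = "/dev/nvme0"        # controller (placeholder)
NS = "/dev/nvme0n1"       # namespace under test (placeholder)
FW = "new_firmware.bin"   # vendor-provided firmware image (placeholder)

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["nvme", "list"])            # enumerate NVMe devices and namespaces
run(["nvme", "id-ctrl", DEV])    # confirm model and firmware revision
run(["nvme", "smart-log", DEV])  # health: temperature, media errors, wear

# Update firmware (slot/action values depend on the drive; check vendor docs).
run(["nvme", "fw-download", DEV, f"--fw={FW}"])
run(["nvme", "fw-commit", DEV, "--slot=1", "--action=1"])

# Low-level format (secure erase) of the namespace before pre-conditioning.
run(["nvme", "format", NS, "--ses=1"])
```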
The Flexible I/O Tester (fio) is a widely recognized and versatile tool for I/O benchmarking and stress-testing storage
systems. It supports extensive configuration options with different I/O engines, read/write ratios, and block sizes to simulate realistic workload patterns.
The following graph profiles a TLC NVMe SSD under a 70/30 read/write ratio and a 4K random workload over 60 minutes. It took approximately 50 minutes for the NVMe
drive to reach a steady performance state. A performance drop of about 70% was observed after 20 minutes of runtime, coinciding with the start of the garbage collection
process. Ensure that the profiling duration is long enough to reach a steady state for reliable and accurate measurements.
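Below is a sketch of an fio invocation approximating the workload described above (4K random, 70/30 read/write, 60 minutes); the target device is a placeholder whose data will be destroyed, and io_uring is chosen as the engine in line with the earlier section.

```python
import subprocess

TARGET = "/dev/nvme0n1"  # placeholder; all data on this device will be destroyed

fio_cmd = [
    "fio",
    "--name=mixed-4k-randrw",
    f"--filename={TARGET}",
    "--ioengine=io_uring",  # low-latency engine discussed above (libaio also works)
    "--direct=1",           # bypass the page cache
    "--rw=randrw",
    "--rwmixread=70",       # 70% reads / 30% writes
    "--bs=4k",
    "--iodepth=32",
    "--numjobs=4",
    "--time_based",
    "--runtime=3600",       # 60 minutes, long enough to reach steady state
    "--group_reporting",
]
subprocess.run(fio_cmd, check=True)
```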
Reference
Observability with Grafana, Prometheus, and Intel VTune
Observability in cloud computing involves the collection and analysis of various telemetry types, such as metrics, distributed traces, and logs.
It enables understanding and diagnosing a system's internal state, performance, and issues without disrupting operations. Similar tools are offered
by cloud service providers, including AWS CloudWatch, AWS CloudTrail, Google Cloud Operations, and Azure Monitor. On-premises environments often utilize tools
like Prometheus and OpenTelemetry, integrated with Grafana for observability. Intel's oneAPI VTune is also widely adopted to address application performance bottlenecks.
This blog aims to provide a deeper insight into Prometheus, Grafana, and Intel oneAPI VTune.
Prometheus, Grafana, and Intel VTune
Prometheus is an open-source monitoring and alerting toolkit that collects and stores metrics as time-series data. It supports a variety of libraries and
servers, known as exporters, which facilitate the exporting of metrics from systems into Prometheus-compatible formats. The widely used node_exporter
exposes hardware and OS-level metrics from the kernel, such as CPU usage, disk I/O, memory info, and network stats. Users can also create custom
exporters for proprietary metrics using the official client libraries, most commonly in Go.
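As a hedged illustration of a custom exporter (using the Python client library rather than Go, purely for consistency with the other examples here), the sketch below exposes a made-up gauge on an HTTP endpoint that Prometheus can scrape.

```python
import random
import time

from prometheus_client import Gauge, start_http_server

# A made-up "proprietary" metric exposed as a Prometheus gauge.
QUEUE_DEPTH = Gauge("appliance_queue_depth", "Current depth of the appliance work queue")

def collect_queue_depth() -> float:
    """Placeholder for real telemetry collection (random value for the demo)."""
    return random.randint(0, 100)

if __name__ == "__main__":
    start_http_server(9101)  # scrape target: http://localhost:9101/metrics
    while True:
        QUEUE_DEPTH.set(collect_queue_depth())
        time.sleep(5)        # refresh interval; Prometheus scrapes on its own schedule
```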
Grafana is an open-source visualization and analytics platform that enables you to query, visualize, alert on, and explore your metrics, regardless of
their storage location. It transforms time-series data from Prometheus node_exporters into compelling graphs and visualizations. Grafana supports a wide
array of data sources including InfluxDB, Graphite, and Elasticsearch. Additionally, it allows for the integration of multiple data sources within the same
dashboard, facilitating a comprehensive view of data across various environments and platforms.
Intel VTune Profiler is a powerful tool for deep performance optimization, ensuring that software runs efficiently on hardware. It examines
application code and identifies threading issues and resource bottlenecks. VTune also analyzes code and thread hotspots that significantly impact performance.
Below is a Grafana dashboard displaying metrics sourced from Prometheus, followed by the VTune Profiler user interface.
Summary
Prometheus and Grafana together deliver exceptional observability by monitoring and alerting on time-series data, enhanced by rich visualizations for
on-premises environments. Intel VTune provides in-depth analysis for application and resource performance optimization. All these tools are
readily available to enhance observability in private cloud environments. In public cloud or hybrid-cloud settings, leveraging tools from cloud service providers can
significantly extend observability capabilities beyond what Grafana or VTune alone can offer.
Reference