This is the first time we have talked with Hammerspace and Michael Kade (Hammerspace on X), Senior Solutions Architect. We have known about Hammerspace for years now and, over the last couple of years, as large AI clusters have come into use, Hammerspace's popularity has gone through the roof.
Mike’s been benchmarking storage for decades now and recently submitted results for MLperf Storage v1.0, an AI benchmark that focuses on storage activity for AI training and inferencing work. We have written previously on v0.5 of the benchmark, (see: AI benchmark for storage, MLperf Storage). Listen to the podcast to learn more.
Some of the changes between v0.5 and v1.0 of MLperf’s Storage benchmark include:
MLperf Storage benchmarks have to be run 5 times in a row and results reported are the average of the 5 runs. Metrics include samples/second (~files processed/second), overall storage bandwidth (MB/sec) and number of accelerators kept busy during the run (90% busy for U-net3D & ResNet-50 and 70% for CosmoFlow).
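To make those run rules concrete, here's a minimal sketch (my own illustration, not MLCommons' harness; all names and numbers are hypothetical) of averaging the 5 required runs and checking the accelerator utilization floor:

```python
# Illustrative sketch of the MLperf Storage v1.0 run rules described above.
# Not MLCommons' code; workload names and the example numbers are hypothetical.

UTILIZATION_THRESHOLD = {"unet3d": 0.90, "resnet50": 0.90, "cosmoflow": 0.70}

def summarize_runs(workload: str, runs: list[dict]) -> dict:
    """Average samples/sec and MB/sec over the 5 required runs and verify
    every run kept the simulated accelerators busy enough to count."""
    assert len(runs) == 5, "results must be the average of 5 consecutive runs"
    threshold = UTILIZATION_THRESHOLD[workload]
    for r in runs:
        if r["accelerator_utilization"] < threshold:
            raise ValueError(f"run failed: utilization {r['accelerator_utilization']:.0%} "
                             f"is below the {threshold:.0%} requirement")
    return {
        "samples_per_sec": sum(r["samples_per_sec"] for r in runs) / len(runs),
        "mb_per_sec": sum(r["mb_per_sec"] for r in runs) / len(runs),
    }

# Example with made-up numbers:
runs = [{"samples_per_sec": 1000 + i, "mb_per_sec": 140_000, "accelerator_utilization": 0.93}
        for i in range(5)]
print(summarize_runs("unet3d", runs))
```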
Hammerspace submitted 8 benchmarks: 2 workloads (U-net3D & ResNet-50) X 2 accelerators (A100 & H100 GPUs) X 2 client configurations (1 & 5 clients). Clients are workstations that perform training or inferencing work for the models and can be any size. GPUs or accelerators are not physically used during the benchmark; their compute is simulated as think (dead) time, which depends on the workload and GPU type (note: this doesn't depend on client size).
Hammerspace also ran their benchmarks with 5 and 22 DSX storage servers. Storage configurations matter for MLperf storage benchmarks and for v0.5, storage configurations weren’t well documented. V1.0 was intended to fix this but it seems there’s more work to get this right.
For ResNet-50 inferencing, Hammerspace drove 370 simulated A100s and 135 simulated H100s and for U-net3D training, Hammerspace drove 35 simulated A100s and 10 simulated H100s. Storage activity for training demands a lot more data than inferencing.
It turns out that training IO also includes checkpointing (which occasionally writes out the model to save it in case of a run failure). The rest of training IO is essentially sequential reads of randomly selected samples. Inferencing has much more randomized IO activity to it.
Hammerspace is a parallel file system (PFS) that uses NFSv4.2, which is available natively in the Linux kernel. The main advantages of a PFS are that IO activity can be parallelized by spreading it across many independent storage servers and that data can move around without operational impact.
Mike ran their benchmarks in AWS. I asked about cloud noisy neighbors and networking congestion, and he said that if you ask for a big enough (EC2) instance, high speed networks come with it, and noisy neighbors and networking congestion are not a problem.
Michael Kade has over a 45-year history with the computer industry and over 35 years of experience working with storage vendors. He has held various positions with EMC, NetApp, Isilon, Qumulo, and Hammerspace.
He specializes in writing software that bridges different vendors and allows their software to work harmoniously together. He also enjoys benchmarking and discovering new ways to improve performance through the correct use of software tuning.
In his free time, Michael has been an EMS helicopter flight instructor for over 25 years.
I've known Gina Rosenthal (@[email protected]), Founder & CEO, Digital Sunshine Solutions, for what seems like forever, and she's been on the very short list to be a GBoS co-host, but she's got her own Tech Aunties Podcast now. We were both at VMware Explore last week in Vegas. Gina was working in the community hub and I was in their analyst program.
VMware (World) Explore has changed a lot since last year. I found the presentations/sessions to be just as insightful and full of users as last year's, but it seems like there may have been fewer of them. Gina found the community hub sessions to be just as busy, and the Code groups were also very well attended. On the other hand, the Expo was smaller than last year and there were a lot fewer participants (and [maybe] analysts) at the show. Listen to the podcast to learn more.
The really big news was VCF 9.0. Both a new number (for VCF) and an indicator of a major change in direction for how VMware functionality will be released in the future. As one executive told me, VCF has now become the main (release) delivery vehicle for all functionality.
In the past, VCF would generally come out with some VMware functionality downlevel from what was generally available in the market. With VCF 9, that's going to change. From now on, all individual features/functions of VCF 9.0 will be at the current VMware functionality levels. Gina mentioned this is a major change in how VMware releases functionality, and it signals much better product integration than was available in the past.
Much of VMware's distinct functionality has been integrated into VCF 9, including SDDC, Aria and other packages. They did, however, create a new class of "advanced services" that runs on top of VCF 9. We believe these are individually charged for; Private AI was one of them, and there were others we didn't discuss on the podcast.
I would have to say that Private AI was of most interest to me and many other analysts at the show. In fact, I heard that it’s VMware’s (and supposedly NVIDIA’s) intent to reach functional parity with GCP Vertex and others with Private AI. This could come as soon as VCF 9.0 is released. I pressed them on this point and they held firm to that release number.
My only doubt is that neither VMware nor NVIDIA has its own LLM. Yes, they can use Meta's Llama 3.1, OpenAI or any other LLM on the market. But running them in-house on enterprise VCF servers is another question.
The lack of an “owned” LLM should present some challenges with reaching functional parity with organizations that have one. On the other hand, Chris Walsh mentioned that they (we believe VMware internal AI services) have been able to change their LLM 3 times over the last year using Private AI Foundation.
Chris repeated more than once that VMware's long history with DRS and HA makes VCF 9 Private AI Foundation an ideal solution for enterprises to run AI workloads. He specifically mentioned GPU HA that can take GPUs from data scientists when enterprise inferencing activities suffer GPU failures. It's unclear whether any other MLops solution, cloud or otherwise, can do the same.
From a purely storage perspective, I heard a lot about vVols 2.0. This is less a functional enhancement than a new certification to make sure primary storage vendors offer full vVol support in their storage.
Gina mentioned, and it came up in the analyst sessions, that Broadcom has stopped offering discounts for charities and non-profits. This is going to hurt most of those organizations, which are now forced to make a choice: pay full subscription costs or move off VMware.
The other thing of interest was that Broadcom spent some time trying to smooth over the bad feelings of VMware's partners. There was a special session on "Doing business with Broadcom VMware for partners" but we both missed it, so we can't report any details.
Finally, Gina and I, given our (lengthy) history in the IT industry and Gina's recent attendance at IBM Share, started hypothesizing on a potential link-up between Broadcom's CA and VMware offerings.
I mentioned multiple times there wasn't even a hint of the word "mainframe" during the analyst program. We probably spent more time discussing this than we should have, but it's hard to take the mainframe out of IT (as most large enterprises no doubt lament).
As the Founder and CEO of Digital Sunshine Solutions, Gina brings over a decade of expertise in providing marketing services to B2B technology vendors. Her strong technical background in cloud computing, SaaS, and virtualization enables her to offer specialized insights and strategies tailored to the tech industry.
She excels in communication, collaboration, and building communities. These skills help her create product positioning, messaging, and content that educates customers and supports sales teams. Gina breaks down complex technical concepts and turns them into simple, relatable terms that connect with business goals.
She is the co-host of The Tech Aunties podcast, where she shares thoughts on the latest trends in IT, especially the buzz around AI. Her goal is to help organizations tackle the communication and organizational challenges associated with modern datacenter transitions.
Jim Handy, General Director, Objective Analysis, is our long-time go-to guy on SSD and memory technologies, and we were both at the FMS (Future of Memory and Storage – new name/broader focus) 2024 conference last week in Santa Clara, CA. Lots of new SSD technology, both on and off the show floor, as well as new memory offerings and more.
Jim helps Jason and me understand what's happening with NAND and other storage/memory technologies that matter to today's IT infrastructure. Listen to the podcast to learn more.
First off, I heard at the show that the race for more (3D NAND) layers is over. According to Jim, companies are finding it’s more expensive to add layers than it is just to do a lateral (2D, planar) shrink (adding more capacity per layer).
One vendor mentioned that CapEx efficiencies were degrading as they add more layers. Nonetheless, I saw more than one slide at the show with a "3xx" layers column.
Kioxia and WDC introduced a 218-layer BICS8 NAND technology with 1Tb TLC and up to 2Tb QLC NAND per chip. Micron announced a 233-layer Gen 9 NAND chip.
Some vendor showed a 128TB (QLC) SSD drive. The challenge with PCIe Gen 5 is that it's limited to 4GB/sec per lane, so for 16 lanes that's 64GB/s of bandwidth, and Gen 4 is half that. Jim likened using Gen 4/Gen 5 interfaces for a 128TB SSD to using a soda straw to get to the data.
The latest Kioxia 2Tb QLC chip is capable of 3.6Gbps (source: Kioxia America), and with (4*128 or) 512 of these 2Tb chips needed to create a 128TB drive, that's ~230GB/s of bandwidth coming off the chips being funneled down to 64GB/s of 16-lane PCIe Gen5 bandwidth, wasting ~3/4ths of the chip bandwidth (see the back-of-the-envelope arithmetic below).
Of course they need (~1.3x?) more than 512 chips to make a durable/functioning 128TB drive, which would only make this problem worse. And I saw one slide that showed a 240TB SSD!
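Here's that back-of-the-envelope arithmetic worked through in a few lines (chip speed is the Kioxia figure quoted above; the rest is unit conversion):

```python
# Back-of-the-envelope check of the 128TB QLC SSD bandwidth mismatch discussed above.
chip_capacity_tb = 2 / 8                 # 2Tb (terabits) per chip = 0.25 TB
chips_needed = 128 / chip_capacity_tb    # 512 chips for 128TB (before over-provisioning)
chip_speed_gbps = 3.6                    # per-chip interface speed (Gbps, Kioxia figure)

aggregate_chip_gbs = chips_needed * chip_speed_gbps / 8   # Gbps -> GB/s
pcie_gen5_x16_gbs = 16 * 4                                # ~4 GB/s per lane x 16 lanes

print(f"{chips_needed:.0f} chips -> {aggregate_chip_gbs:.0f} GB/s off the NAND")
print(f"PCIe Gen5 x16 host interface: {pcie_gen5_x16_gbs} GB/s")
print(f"fraction of chip bandwidth stranded: {1 - pcie_gen5_x16_gbs / aggregate_chip_gbs:.0%}")
# -> 512 chips, ~230 GB/s of NAND bandwidth vs 64 GB/s of PCIe, ~72% stranded
```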
Enough on bandwidth, let's talk data growth. Jason's been doing some research and had current numbers on data growth. According to his research, the world's data (maybe transmitted over the internet) in 2010 was 2ZB (ZB, zettabytes = 10^21 bytes), in 2023 it was 120ZB, and by 2025 it should be 180ZB. For 2023, that's over 328 million TB/day or 328EB/day (EB, exabytes = 10^18 bytes).
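A quick sanity check of the per-day conversion:

```python
# Sanity check: 120 ZB/year expressed per day (1 ZB = 10^21 bytes, 1 EB = 10^18, 1 TB = 10^12).
zb_per_year = 120
eb_per_day = zb_per_year * 1000 / 365          # ZB -> EB, spread over a year
tb_per_day = eb_per_day * 1_000_000            # EB -> TB
print(f"{eb_per_day:.0f} EB/day, or about {tb_per_day / 1e6:.0f} million TB/day")
# -> ~329 EB/day, ~329 million TB/day, matching the ~328EB/day figure above
```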
Jason said ~54% of this is video. He attributes the major data growth spurt since 2010 to mainly social media videos.
Jason also mentioned that, as of 2023(?), the USA had 5,388 data centers, Germany 522, the UK 517, and China 448. That last number seems way low to all of us, but they could just be very, very big data centers.
There was no mention of average data center size (meters^2, # servers, # GPUs, storage capacity, etc). But we know, because of AI, they are getting bigger and more power hungry.
There were more FMS 2024 topics discussed, like the continuing interest in TLC SSDs, new memory offerings, computational storage/memory, etc.
Jim Handy of Objective Analysis has over 35 years in the electronics industry, including 20 years as a leading semiconductor and SSD industry analyst. Early in his career he held marketing and design positions at leading semiconductor suppliers including Intel, National Semiconductor, and Infineon.
A frequent presenter at trade shows, Mr. Handy is known for his technical depth, accurate forecasts, widespread industry presence and volume of publication.
He has written hundreds of market reports, articles for trade journals, and white papers, and is frequently interviewed and quoted in the electronics trade press and other media.
He posts blogs at www.TheMemoryGuy.com and www.TheSSDguy.com.
Dr J Metz (@drjmetz, blog) has been on our podcast before, mostly in his role as SNIA spokesperson and BoD Chair, but this time he's here discussing some of his latest work with the Ultra Ethernet Consortium (UEC) (LinkedIn: @ultraethernet, X: @ultraethernet).
The UEC is a full stack re-think of what Ethernet could do for large single application environments. UEC was originally focused on HPC, with 400-800 Gbps networks and single applications like simulating a hypersonic missile or airplane. But with the emergence of GenAI and LLMs, UEC could also be very effective for large AI model training with massive clusters doing a single LLM training job over months. Listen to the podcast to learn more.
The UEC is outside the realm of normal enterprise environments. But as AI training becomes more ubiquitous, who knows, UEC may yet find a place in the enterprise. However, it's not intended for mixed network environments with multiple applications; it's a single application network.
One wouldn't think HPC was a big user of Ethernet for its main network. But Dr J pointed out that the top 3 systems on the HPC Top500 all use Ethernet, and more are looking to use it in the future.
UEC is essentially an optimized software stack and hardware for networking used by single application environments. These types of workloads are constantly pushing the networking envelope. And by taking advantage of the “special networking personalities” of these workloads, UEC can significantly reduce networking overheads, boosting bandwidth and workload execution.
The scale of these networks is extreme. The UEC is targeting up to a million endpoints, over >100K servers, with each network link >100Gbps and more likely 400-800Gbps. With the new (AMD and others) networking cards coming out that support 4 400/800Gbps network ports, having a pair of these on each server in a 100K-server cluster gives you 800K endpoints. A million is not that far away when you think of it at that scale.
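The endpoint math in a few lines (server, NIC, and port counts are the examples from the discussion, not UEC requirements):

```python
# Illustrative endpoint math from the discussion above (all counts are examples).
servers = 100_000
nics_per_server = 2        # a pair of the new multi-port NICs per server
ports_per_nic = 4          # 4 x 400/800Gbps ports per NIC
endpoints = servers * nics_per_server * ports_per_nic
print(f"{endpoints:,} network endpoints")   # -> 800,000; a million isn't far away
```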
Moreover, LLM training and HPC work are starting to look more alike these days. Yes, there are differences, but the scale of their clusters is similar, and the way work is sometimes fed to them is similar, which leads to similar networking requirements.
UEC is attempting to handle a 5% problem. That is, 95% of users will not have 1M endpoints in their LAN, but maybe 5% will, and for those 5%, a more mixed networking workload is unnecessary. In fact, a mixed network becomes a burden, slowing down packet transmission.
UEC is finding that with a few select networking parameters, almost like workload fingerprints, network stacks can be much more optimized than current Ethernet and thereby support reduced packet overheads, and more bandwidth.
AI and HPC networks share a very limited set of characteristics which can be used as fingerprints. These characteristics include things like reliable or unreliable transport, ordered or unordered delivery, multi-path packet spraying or not, etc. With a set of these types of parameters, selected for an environment, UEC can optimize a network stack to better support a million networking endpoints (see the sketch below).
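As a sketch of the fingerprint idea (illustrative field names only, not UEC specification terms), the fingerprint amounts to a handful of per-workload transport knobs the stack can key off of:

```python
# Sketch of the idea of a workload "fingerprint": a handful of transport
# characteristics that let the stack drop overhead it doesn't need.
# Field names are illustrative, not UEC specification terms.
from dataclasses import dataclass

@dataclass
class TransportFingerprint:
    reliable_delivery: bool      # does the app need the network to guarantee delivery?
    ordered_delivery: bool       # must packets arrive in order, or can the app reorder?
    multipath_spraying: bool     # may packets be sprayed across multiple paths?

# An AI-training collective might tolerate unordered delivery and love packet
# spraying, letting the stack skip reordering buffers and use every path:
ai_training = TransportFingerprint(reliable_delivery=True,
                                   ordered_delivery=False,
                                   multipath_spraying=True)
print(ai_training)
```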
We asked where CXL fits in with UEC. Dr J said it could potentially be an entity on the network, but he sees CXL more as a within-server or tight (limited) cluster-of-servers solution rather than something on a UEC network.
Just 12 months ago the UEC had 10 members or so and this past week they were up to 60. UEC seems to have struck a chord.
The UEC plans to release a 1.0 specification near the end of this year. UEC 1.0 is intended to operate on current (>100Gbps) networking equipment with firmware/software changes.
Considering the UEC was just founded in 2023, putting out their 1.0 technical spec within 1.5 years is astonishing, and it speaks volumes about the interest in the technology.
The UEC has a blog post which talks more about UEC 1.0 specification and the technology behind it.
J works to coordinate and lead strategy on various industry initiatives related to systems architecture. Recognized as a leading storage networking expert, J is an evangelist for all storage-related technology and has a unique ability to dissect and explain complex concepts and strategies. He is passionate about the inner workings and application of emerging technologies.
J has previously held roles in both startups and Fortune 100 companies as a Field CTO, R&D Engineer, Solutions Architect, and Systems Engineer. He has been a leader in several key industry standards groups, sitting on the Board of Directors for the SNIA, Fibre Channel Industry Association (FCIA), and Non-Volatile Memory Express (NVMe). A popular blogger and active on Twitter, his areas of expertise include NVMe, SANs, Fibre Channel, and computational storage.
J is an entertaining presenter and prolific writer. He has won multiple awards as a speaker and author, writing over 300 articles and giving presentations and webinars attended by over 10,000 people. He earned his PhD from the University of Georgia.
Steffen Hellmold, Director, Cerabyte Inc. is extremely knowledgeable about the storage device business. He has worked for WDC in storage technology and possesses an in-depth understanding of tape and disk storage technology trends.
Cerabyte, a German startup, is developing cold storage. Steffen likened Cerabyte storage to ceramic punch cards that dominated IT and pre-IT over much of the last century. Once cards were punched, they created near-WORM storage that could be obliterated or shredded but was very hard to modify. Listen to the podcast to learn more.
Cerabyte uses a unique combination of semiconductor (lithographic) technology, ceramic coated glass, LTO tape (form factor) cartridge and LTO automation in their solution. So, for the most part, their critical technologies all come from somewhere else.
Their main technology uses a laser-lithographic process to imprint onto a sheet (ceramic coated glass) a data page (block?). There are multiple sheets in each cartridge.
Their intent is to offer a robotic system (based on LTO technology) to retrieve and replace their multi-sheet cartridges and mount them in their read-write drive.
As mentioned above, the write operation is akin to a lithographic data encoded mask that is laser imprinted on the glass. Once written, the data cannot be erased. But it can be obliterated, by something akin to writing all ones or it can be shredded and recycled as glass.
The read operation uses a microscope and camera to take scans of the sheet’s imprint and convert that into data.
Cerabyte's solution is cold or ultra-cold (frozen) storage. If LTO robotics are any indication, a Cerabyte cartridge with multiple sheets can be presented to a read-write drive in a matter of seconds. However, extracting the appropriate sheet from a cartridge and mounting it in a read-write drive will take more time. But this may be similar in time to an LTO tape leader being threaded through a tape drive, again a matter of seconds.
Steffen didn’t supply any specifications on how much data could be stored per sheet other than to say it’s on the order of many GB. He did say that both sides of a Cerabyte sheet could be recording surfaces.
With their current prototype, an LTO form factor cartridge holds fewer than 5 sheets of media, but they are hoping to get this to 100 or more in time.
We talked about the history of disk and tape storage technology. Steffen is convinced (as are many in the industry) that disk-tape capacity increases have slowed over time and that this is unlikely to change. I happen to believe that storage density increases tend to happen in spurts, as new technology is adopted and then trails off as that technology is built up. We agreed to disagree on this point.
Steffen predicted that Cerabyte will be able to cross over disk cost/capacity this decade and LTO cost/capacity sometime in the next decade.
We discussed the market for cold and frozen storage. Steffen mentioned that the Office of the Director of National Intelligence (ODNI) has tasked the National Academies of Sciences, Engineering, and Medicine to conduct a rapid expert consultation on large-scale cold storage archives. And that most hyperscalers have use for cold and frozen storage in their environments and some even sell this (Glacier storage) to their customers.
The Library of Congress and similar entities in other nations are also interested in digital preservation that cold and frozen technology could provide. He also thinks that medical is a prime market that is required to retain information for the life of a patient. IBM, Cerabyte, and Fujifilm co-sponsored a report on sustainable digital preservation.
And of course, the media libraries of some entertainment companies represent a significant asset which, if on tape, has to be re-hosted every 5 years or so. Steffen and much of the industry are convinced that a sizeable market for cold and frozen storage exists.
I mentioned that long archives suffer from data format drift (data formats are no longer supported). Steffen mentioned there’s also software version drift (software that processed that data is no longer available/runnable on current OSs). And of course the current problem with tape is media drift (LTO media formats can be read only 2 versions back).
Steffen seemed to think format and software drift are industry-wide problems that are being worked on. Cerabyte seems to have a great solution for media drift, as its media can be read with a microscope. And the (ceramic glass) media has a predicted life of 100 years or more.
I mentioned the "new technology R&D" problem. Historically, as new storage technologies have emerged, they have always ended up being left behind (in capacity) because disk, tape, and NAND R&D ($Bs each) outspends them. Steffen said it's certainly NOT $Bs of R&D for tape and disk.
Steffen countered by saying that all storage technology R&D spending pales in comparison to semiconductor R&D spending focused on reducing feature size. And as Cerabyte uses semiconductor technologies to write data, sheet capacity is directly a function of semiconductor technology. So, Cerabyte’s R&D technology budget should not be a problem. And in fact they have been able to develop their prototype, with just $7M in funding.
Steffen mentioned there is an upcoming Storage Technology Showcase conference in early March, which Cerabyte will attend.
Steffen has more than 25 years of industry experience in product, technology, business & corporate development as well as strategy roles in semiconductor, memory, data storage and life sciences.
He served as Senior Vice President, Business Development, Data Storage at Twist Bioscience and held executive management positions at Western Digital, Everspin, SandForce, Seagate Technology, Lexar Media/Micron, Samsung Semiconductor, SMART Modular and Fujitsu.
He has been deeply engaged in various industry trade associations and standards organizations including co-founding the DNA Data Storage Alliance in 2020 as well as the USB Flash Drive Alliance, serving as their president from 2003 to 2007.
He holds an economic electrical engineering degree (EEE) from the Technical University of Darmstadt, Germany.
We talked with Andy Warfield (@AndyWarfield), VP Distinguished Engineer, Amazon, about 10 years ago when he was at Coho Data (see our (005:) Greybeards talk scale out storage … podcast). Andy has been a good friend for a long time and he's been with Amazon S3 for over 5 years now. Since the recent S3 announcements at AWS re:Invent, we thought it a good time to have him back on the show. Andy has a great knack for explaining technology. I suppose that comes from his time as a professor, but whatever the reason, he was great to have on the show again.
Lately, Andy's been working on S3 Express One Zone storage, announced last November, a new version of S3 object storage with lower response time. We talked about this later in the podcast, but first we touched on S3's history and other advances. S3 and its ancillary services have advanced considerably over the years. Listen to the podcast to learn more.
S3 is ~18 years old now and was one of the first AWS offerings. It was originally intended to be the internet’s file system which is why it was based on HTTP protocols.
Andy said that S3 was designed for 11-9s durability and high availability options. AWS constantly monitors server and storage failures/performance to ensure that they can maintain this level of durability. The problem with durability is that when a drive/server goes down, the data needs to be rebuilt onto another drive before yet another drive fails. One way to do this is to have more replicas of the data; another is to speed up rebuild times. I'm sure AWS does both.
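To see why rebuild speed matters, here's a toy reliability model of the rebuild race (my own illustration, not AWS's durability math; the AFR, rebuild times and replica counts are made-up assumptions):

```python
# Toy model of the "rebuild race" described above: after a drive fails, data is
# at reduced redundancy until it is re-replicated, so shorter rebuilds (or more
# replicas) shrink the window in which further failures can cause loss.
# All numbers are illustrative assumptions, not AWS figures.
HOURS_PER_YEAR = 8760

def p_second_failure_during_rebuild(afr: float, rebuild_hours: float, peer_drives: int) -> float:
    """Rough probability that one of the peer drives holding the remaining
    replicas also fails before the rebuild completes."""
    p_one = afr * rebuild_hours / HOURS_PER_YEAR      # per-drive failure prob in the window
    return 1 - (1 - p_one) ** peer_drives

slow = p_second_failure_during_rebuild(afr=0.01, rebuild_hours=24, peer_drives=2)
fast = p_second_failure_during_rebuild(afr=0.01, rebuild_hours=1,  peer_drives=2)
print(f"24h rebuild: {slow:.2e}   1h rebuild: {fast:.2e}")   # ~24x less exposure
```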
S3 high availability requires replicas across availability zones (AZs). AWS availability zone data centers are carefully located so that they are power- and network-isolated from other data centers in the region. Further, AZ site locations are deliberately selected with an eye toward ensuring they are not susceptible to the same physical disasters.
Andy discussed other AWS file data services such as their FSx systems (Amazon FSx for Lustre, for OpenZFS, for Windows File Server, & for NetApp ONTAP) as well as Elastic File System (EFS). Andy said they sped up one of these FSx services by 3-5X over the last year.
Andy mentioned one of the guiding principles for a lot of AWS storage is to try to eliminate any hard decisions for enterprise developers. By offering FSx files, S3 objects, and their other storage and data services, customers already using similar systems in house can just migrate apps to AWS without having to modify code.
Andy said one thing that struck him as he came on the S3 team was the careful deliberation that occurred whenever they considered S3 API changes. He said the team is focused on the long term future of S3 and any API changes go through a long and deliberate review before implementation.
One workload that drove early S3 adoption was data analytics. Hadoop and BigTable have significant data requirements. Early on, someone wrote an HDFS interface to S3 and over time lots of data analytics activity moved to S3 object hosted data.
Databases have also changed over the last decade or so. Keith mentioned that many customers are foregoing traditional databases to use open source database solutions with S3 as their backend storage. It turns out that Open Table Format offerings such as Apache Iceberg, Apache Hudi and Delta Lake are all available on AWS and use S3 objects as their storage.
We talked a bit about Lambda Server-less processing triggered by S3 objects. This was a new paradigm for computing when it came out and many customers have adopted Lambda to reduce cloud compute spend.
Recently Amazon introduced Mountpoint for Amazon S3, a file system client that lets customers mount an S3 bucket and access its objects through ordinary file system calls.
Amazon also supports the Registry of Open Data on AWS, which holds just about every canonical data set (stored as S3 objects) used for AI training.
At the last re:Invent, Amazon announced S3 Express One Zone, which is a high performance, low latency version of S3 storage. The goal for S3 Express was to get latency down from 40-60 msec to under 10 msec.
They ended up making a number of changes to S3 to get there, which we discussed on the podcast.
Somewhere during our talk Andy said that, in aggregate, S3 is providing 100TBytes/sec of data bandwidth. How's that for scale-out storage?
Andy is a Vice President and Distinguished Engineer at Amazon Web Services. He focuses primarily on data storage and analytics.
Andy holds a PhD from the University of Cambridge, where he was one of the authors of the Xen hypervisor. Xen is an open source hypervisor that was used as the initial virtualization layer in AWS, among multiple other early cloud companies. Andy was a founder at XenSource, a startup based on Xen that was subsequently acquired by Citrix Systems for $500M.
Following XenSource, Andy was a professor at the University of British Columbia (UBC), where he was awarded a Canada Research Chair and a Sloan Research Fellowship. As a professor, Andy did systems research in areas including operating systems, networking, security, and storage.
Andy's second startup, Coho Data, was a scale-out enterprise storage array that integrated NVMe SSDs with programmable networks. It raised over $80M in funding from VCs including Andreessen Horowitz, Intel Capital, and Ignition Partners.
Sponsored By: RackTop Systems
This is the last in this year's GreyBeards-RackTop Systems podcast series and once again we are talking with Jonathan Halstuch (@JAHGT), Co-Founder and CTO, RackTop Systems. This time we discuss why traditional security practices can't cut it alone anymore. Listen to the podcast to learn more.
It turns out traditional security practices are aimed at keeping the bad guys out, supplying perimeter security with networking equivalents. But the problem is that sometimes the bad guy is internal, and at other times the bad guys pretend to be good guys with valid credentials. Neither of these is something that networking or perimeter security can catch.
As a result, the enterprise needs both traditional security practices as well as something else. Something that operates inside the network, in a more centralized place, that can be used to detect bad behavior in real time.
Jonathan walked us through a typical attack.
By the time security systems detect the malware, the attacker has been in your systems and all over your network for months, and it’s way too late to stop them from doing anything they want with your data.
In the past, detection like this came from 3rd party tools that scanned backups for malware, or from storage systems copying logs off to be assessed on a periodic basis.
The problem with such tools is that they always lag behind the time when the theft/corruption has occurred.
The need to detect in real time, at something like the storage system, is self-evident. The storage is the central point of access to data. If you could detect illegal or bad behavior there, and stop it before it could cause more harm that would be ideal.
In the past, storage system processors were extremely busy just doing IO. But with today's modern, multi-core, NUMA CPUs, this is no longer the case.
Along with high performing IO, RackTop Systems supports user and admin behavioral analysis and activity assessors. These processes run continuously, monitoring user and admin IO and command activity, looking for known, bad or suspect behaviors.
When such behavior is detected, the storage system can prevent further access automatically, if so configured, or at a minimum, warn the security operations center (SOC) that suspicious behavior is happening and inform SOC of who is doing what. In this case, with a click of a link in the warning message, SOC admins can immediately stop the activity.
If it turns out the suspicious behavior was illegal, having the detection at the storage system can also provide SOC a list of files that have been accessed/changed/deleted by the user/admin. With these lists, SOC has a rapid assessment of what’s at risk or been lost.
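For a flavor of what storage-side behavioral detection can look like, here's a generic sketch (not RackTop's actual assessors; the threshold, window and function names are invented) that flags a burst of file modifications, a common ransomware tell:

```python
# Generic sketch of storage-side behavioral detection: watch per-user file write
# activity and flag a sudden burst of modifications (a common ransomware tell).
# Illustrative only -- not RackTop's implementation; thresholds are made up.
from collections import defaultdict, deque
import time

WINDOW_SECS = 60
MAX_WRITES_PER_WINDOW = 500          # hypothetical "normal" ceiling for a user

recent_writes = defaultdict(deque)   # user -> recent (timestamp, path) write/rename ops

def alert_soc(user, files):
    # In a real system this would notify the SOC (and optionally block the
    # session); the file list gives a rapid assessment of what's at risk.
    print(f"ALERT: {user} modified {len(files)} files in {WINDOW_SECS}s")

def record_write(user, path, now=None):
    now = now if now is not None else time.time()
    q = recent_writes[user]
    q.append((now, path))
    while q and q[0][0] < now - WINDOW_SECS:   # drop events outside the window
        q.popleft()
    if len(q) > MAX_WRITES_PER_WINDOW:
        alert_soc(user, [p for _, p in q])
```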
Jonathan and I talked about RackTop Systems deployment options, which span physical appliances, SAN gateways, and virtual appliances. Jonathan mentioned that RackTop Systems has a free trial offer using their virtual appliance that any customer can download to try them out.
Jonathan Halstuch is the Chief Technology Officer and Co-Founder of RackTop Systems. He holds a bachelor’s degree in computer engineering from Georgia Tech as well as a master’s degree in engineering and technology management from George Washington University.
With over 20 years of experience as an engineer, technologist, and manager for the federal government, he provides organizations the most efficient and secure data management solutions to accelerate operations while reducing the burden on admins, users, and executives.
Jason and Keith joined Ray for our annual year end wrap up and look ahead to 2024. I planned to discuss infrastructure technical topics but was overruled. Once we started talking AI, we couldn’t stop.
It’s hard to realize that Generative AI and ChatGPT in particular, haven’t been around that long. We discussed some practical uses Keith and Jason had done with the technology.
Keith mentioned its primary skill is language expertise. He has used it to help write up proposals. He often struggles to convince CTO Advisor non-sponsors of the value they can bring and found that using GenAI has helped do this better.
Jason mentioned he uses it to create BASH, perl, and PowerShell scripts. He says it’s not perfect but can get ~80% there and with a few tweaks, is able to have something a lot faster than if he had to do it completely by hand. He also mentioned its skill in translating from one scripting language to others and how well the code it generates is documented (- that hurt).
I was the odd GreyBeard out, having not used any GenAI, proprietary or not. I'm still working to get a reinforcement learning task to work well and consistently. I figured once I mastered that, I'd train an LLM on my body of (text and code) work (assuming of course someone gifts me a gang of GPUs).
I agreed GenAI is good at (English) language and some coding tasks (where lots of source code exists, such as Java, scripting, Python, etc.).
However, I was on a MLops slack channel and someone asked if GenAI could help with IBM RPG II code. I answered, probably not. There’s just not a lot of RPG II code publicly accessible on the web and the structure of RPG was never line of text/commands oriented.
We had some heated discussion on where LLMs get the data to train with. Keith was fine with them using his data. I was not. Jason was neutral.
We then turned to what this means to the white collar workers who are coding and writing text. Keith made the point that this has been a concern throughout history, at least since the industrial revolution.
Machines come along, displace work that was done by hand, increase production immensely, reduce costs. Organizations benefit, but people doing those jobs need to up level their skills, to take advantage of the new capabilities.
Easy for us to say, as we, except for Jason, in his present job, are essentially entrepreneurs and anything that helps us deliver more value, faster, easier or less expensively, is a boon for our businesses.
Jason mentioned that Stephen Wolfram wrote a great blog post discussing LLM technology (see What is ChatGPT doing … and why does it work). Both Jason and Keith thought it did a great job of explaining the science and practice behind LLMs.
We moved on to a topic harder to discuss but of great relevance to our listeners, GenAI’s impact on the enterprise.
It reminds me of when cloud became prominent. Then, "C" suites tasked their staff to adopt "the cloud" any way they could. Today, "C" suites are tasking their staff to determine what their "AI strategy" is and when it will be implemented.
Keith mentioned that this is wrong-headed. The true path forward (for the enterprise) is to focus on what the business problems are and how (Gen)AI can address (some of) them.
AI is so varied, and its capabilities across so many fields are so good nowadays, that organizations should really look at AI as a new facility that can recognize patterns; index, analyze, and transform images; summarize, understand, and transform text/code; etc., in near real time, and see where in the enterprise that could help.
We talked about how enterprises can size AI infrastructure needed to perform these activities. And it’s more than just a gaggle of GPUs.
MLCommons' MLperf benchmarks can help show the way for some cases, but they are not exhaustive. Still, it's a start.
The consensus was maybe deploy in the cloud first and, when the workload is dialed in there, re-home it later, with the proviso that the needed hardware is available.
Our final topic was the Broadcom VMware acquisition. Keith mentioned their recent subscription pricing announcements vastly simplified VMware licensing, that had grown way too complex over the decades.
And although everyone hates the expense of VMware solutions, they often forget the real value VMware brings to enterprise IT.
Yes, hyperscalers and their clutch of coders can roll their own hypervisor services stacks using open source virtualization. But the enterprise has other needs for its developers. And the value of VMware virtualization services, now that 128-core CPUs are out, is even higher.
We mentioned the need for hybrid cloud and how VCF can get you part of the way there. Keith said that dev teams really want something like “AWS software” services running on GCP or Azure.
Keith mentioned that IBM Cloud is the closest he’s seen so far to doing what Dev wants in a hybrid cloud.
We all thought when DNNs came out and became trainable, and reinforcement learning started working well, that AI had turned a real corner. Turns out, that was just a start. GenAI has taken DNNs to a whole other level and DeepMind and others are doing the same with reinforcement learning.
This time AI may actually help advance mankind, if it doesn’t kill us first. On the latter topic you may want to checkout my RayOnStorage AGI series of blog posts (latest … AGI part-8)
Jason Collier (@bocanuts) is a long time friend, technical guru and innovator who has over 25 years of experience as a serial entrepreneur in technology.
He was founder and CTO of Scale Computing and has been an innovator in the field of hyperconvergence and an expert in virtualization, data storage, networking, cloud computing, data centers, and edge computing for years.
He's on LinkedIn. He's currently working with AMD on new technology and he has been a GreyBeards on Storage co-host since the beginning of 2022.
Keith Townsend (@CTOAdvisor) is an IT thought leader who has written articles for many industry publications, interviewed many industry heavyweights, worked with Silicon Valley startups, and engineered cloud infrastructure for large government organizations.
Keith is the co-founder of The CTO Advisor, blogs at Virtualized Geek, and can be found on LinkedIn.
This is the 2nd time Brian Dean, Technical Marketing, Dell PowerFlex Storage has been on our show discussing their storage. Since last time there’s been a new release with significant functional enhancements to file services, Dell CloudIQ integration and other services. We discussed these and other topics on our talk with Brian. Please listen to the podcast to learn more.
We began the discussion on the recent (version 4.5) changes to Powerflex for file services. PowerFlex file services are provided by File Nodes each running a NAS Container, which supplies multiple NAS Servers. NAS servers supply tenant network namespaces, security policies and host file systems, each of which resides on a single PowerFlex volume.
File Nodes are deployed in HA pairs, each on a separate hardware server. One can have up to 16 File Nodes or 8 pairs of File Nodes running on a PowerFlex cluster. If one of the pair goes down, file access fails over to the other File Node in a pair.
Each NAS Server supports multiple file systems each of which can be up to 256TB. The NAS Container is also used for other Dell storage file services, so it’s full featured and very resilient.
PowerFlex file services support multiple NFS and SMB versions as well as SFTP/FTP and other essential file data services. In addition, it also supports a global name space which allows all PowerFlex cluster file systems to be accessed under a single name space and IP target.
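A rough sketch of that containment hierarchy, as I understood it (class and field names are my own, not Dell's object model):

```python
# Sketch of the PowerFlex file-services hierarchy as described above:
# File Node -> NAS Container -> NAS Servers -> file systems (each on one volume).
# Class and field names are illustrative, not Dell's actual object model.
from dataclasses import dataclass, field

@dataclass
class FileSystem:
    name: str
    volume: str                 # each file system resides on a single PowerFlex volume
    size_tb: int = 1            # can grow up to 256TB

@dataclass
class NASServer:                # tenant network namespace, security policy, file systems
    tenant: str
    filesystems: list[FileSystem] = field(default_factory=list)

@dataclass
class FileNode:                 # deployed in HA pairs, up to 16 File Nodes per cluster
    name: str
    nas_servers: list[NASServer] = field(default_factory=list)

node = FileNode("file-node-1",
                [NASServer("tenant-a", [FileSystem("projects", "vol-0001", 64)])])
print(node)
```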
Next, we discussed PowerFlex's automated LCM (Life Cycle Management) services, which are specific to the PowerFlex appliance and fully integrated rack deployment models. Recall that PowerFlex can be deployed as an appliance, a rack solution, or a software-only solution on x86 servers.
With the appliance and rack models, a PowerFlex Manager (PFxM) service is used to deploy, change, monitor and manage PowerFlex cluster nodes. It discovers networking and PowerFlex servers/storage, loads appropriate firmware, BIOS, PowerFlex storage data services software and then brings up PowerFlex block services.
PFxM also offers automated LCM by maintaining an intelligent catalog, which declares all current software/firmware/BIOS and hardware versions compatible with PowerFlex software. When changes are made to the cluster, say when storage is increased or a server is added, the PFxM service detects the change and goes about bringing any new hardware up to proper software levels.
Finally the PFxM service can non-disruptively update the cluster whenever a PowerFlex code change is deployed. This would involve an intelligent catalog update, after which the PFxM service detects the cluster is out of compliance, and then it would serially go through, bringing each cluster node up to the proper level, without host IO access interruption.
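Conceptually, the rolling update flow reads something like this sketch (my own illustration, not PFxM internals; every name is invented):

```python
# Illustration of the rolling, non-disruptive LCM flow described above: compare
# each node against the intelligent catalog, then update nodes one at a time so
# host IO access is never interrupted. Not PFxM internals; all names are invented.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    versions: dict                        # e.g. {"bios": "1.1", "powerflex": "4.5"}

def rolling_update(nodes: list[Node], target_versions: dict) -> None:
    stale = [n for n in nodes if n.versions != target_versions]
    for node in stale:                    # serially, one node at a time
        print(f"draining {node.name}")          # move IO off this node first
        node.versions = dict(target_versions)   # apply firmware/BIOS/software levels
        print(f"{node.name} updated; rejoining cluster before the next node")

rolling_update([Node("sds-1", {"bios": "1.1"}), Node("sds-2", {"bios": "1.2"})],
               {"bios": "1.2"})
```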
Finally, we discussed changes made to CloudIQ-PowerFlex interface, so that CloudIQ can now troubleshoot and report performance-capacity trends at the PowerFlex storage pool, fault set, and fault domain level. Previously, CloudIQ could only do this at the full PowerFlex system level.
CloudIQ is Dell's free cloud service used to monitor and troubleshoot all Dell storage systems and many other Dell solutions, whether on premises or in the cloud.
Brian mentioned that all technical information for PowerFlex is available on their InfoHub.
Brian is a 16+ year veteran of the technology industry, and before that spent a decade in higher education. Brian has worked at EMC and Dell for 7 years, first as Solutions Architect and then as TME, focusing primarily on PowerFlex and software-defined storage ecosystems.
Prior to joining EMC, Brian was on the consumer/buyer side of large storage systems, directing operations for two Internet-based digital video surveillance startups.
When he’s not wrestling with computer systems, he might be found hiking and climbing in the mountains of North Carolina.
Bryan Cantrill (@bcantrill), CTO, Oxide Computer, was a hard man to interrupt once started, but the GreyBeards did their best to have a conversation. Nonetheless, this is a long podcast. Oxide is making a huge bet on rack scale computing and has done everything it can to make its rack easy to unbox, set up and deploy VMs on.
They use commodity parts (AMD EPYC CPUs) and package them in their own designed hardware (server) sleds, which blind mate to networking and power in the back of their own designed rack. They use their own OS, Helios (an OpenSolaris derivative), with their own RTOS, Hubris, for system bringup, monitoring and the start of their hardware root of trust. And of course, to make it all connect easier, they designed and developed their own programmable networking switch. Listen to the podcast to learn more.
Oxide essentially provides rack hardware which supports EC2-like compute and EBS-like storage to customers. It also has Terraform plugins to support infrastructure as code. In addition, all their software is completely API driven.
Bryan said time and time again that developing their own hardware and software made everything easier for them and their customers. Customers pay for hardware but there are absolutely NO SOFTWARE LICENSING FEES, because all their software is open source.
For example, the problem with AMI BIOS and UEFI is their opacity. There's really no way to understand what packages are included in the root of trust because it's proprietary. Bryan said one company's UEFI they examined had URLs embedded in the firmware. It seemed odd to have another vendor's web pages linked into their root of trust.
Bryan said they did their own switch to reduce integration and validation test time. The Oxide rack supports all internal networking, compute sled to compute sled, and ToR switch (with no external cabling) and has 32 networking ports to connect the rack to the data center’s core networking.
As for storage, Bryan said each of the 10 U.2 NVMe drives in their compute sled is a separate ZFS file system, and customer data is 3-way mirrored across them. ZFS also provides end-to-end checksumming across all customer data for IO integrity.
Bryan said Oxide Computer rack bring-up is 1) plug it in to core networking and power, 2) power it on, 3) attach a laptop to its service processor, 4) SSH into it, and 5) run a configuration script, and you're ready to assign VMs. He said that the time from when an Oxide rack hits your dock until you are up and firing up VMs could be as short as an HOUR.
The Rust programming language is the other secret to Oxide's success. More to the point, the company is named after Rust (oxide, get it?). Apparently just about all the software they developed is written in Rust.
The question for Oxide and every other computer and storage vendor is: do you believe that on-premises computing will continue for the foreseeable future? The GreyBeards and Oxide believe yes, not just for compliance and better latency, but also because it often costs less.
Bryan mentioned they have their own podcast, Oxide and Friends. On their podcast, they did a board bring up series (Tales from the Bring-Up Lab) and a series on taking their rack through FCC compliance (Oxide and the Chamber of Mysteries).
Bryan Cantrill is a software engineer who has spent over a quarter of a century at the hardware/software interface. He is the co-founder and CTO of Oxide Computer Company, the creator of the world’s first commercial cloud computer.
Prior to Oxide, he spent nearly a decade at Joyent, a cloud computing pioneer; prior to Joyent, he spent 14 years at Sun Microsystems.
Bryan received his Sc.B. magna cum laude with honors in Computer Science from Brown University, and is an MIT Technology Review 35 Top Young Innovators alumnus.
You can learn more about his work with Oxide at oxide.computer, or listen in on their weekly live show, Oxide and Friends (link above), on Discord or anywhere you get your podcasts.