[SAC] New Hardware, can we purchase now

harrison.grundy at astrodoggroup.com harrison.grundy at astrodoggroup.com
Sat Mar 31 17:47:19 PDT 2018


I'll do some digging and come up with a summary for the board. I suspect you're right on the layout, but I've had a few boards with really nasty PCIe channel layouts. (Sharing the disk controller and a 4-port NIC on the same 4x bus is evil!)

On the writes, it depends a lot on how we expect load on the machine to be organized... a single database hitting the write cache can hurt from write amplification, but a large group of writers will generally be coalesced effectively as the deeper disk queue is flushed in a single operation. Since ZFS was originally designed for spinning rust, it's pretty clever about doing as much as it can in a single op.
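
(For reference, on ZFS-on-Linux the coalescing window is just a module
parameter -- a rough sketch, assuming ZoL paths rather than FreeBSD:)

  # Seconds between transaction-group flushes; writes accumulated in
  # this window are coalesced into a single txg.
  cat /sys/module/zfs/parameters/zfs_txg_timeout
  # Illustrative only: widen the window to favor coalescing.
  echo 10 > /sys/module/zfs/parameters/zfs_txg_timeout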

Is there a target in terms of operations per second, database throughput, or something else we can use to calculate the actual load we plan to put on the drives?

On the speed side, since it's the write cache, you can disable the cache, secure erase the drive, and re-enable it to restore prior performance. When I've done SSD-backed ZFS caches, I usually just put that on a weekly crontab.
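
The script behind that crontab entry is only a few lines -- a sketch;
the pool name, device path, and using blkdiscard in place of a vendor
secure-erase tool are all assumptions:

  #!/bin/sh
  # Weekly SLOG refresh: pull the log device from the pool, TRIM the
  # whole drive to restore write performance, then put it back.
  POOL=tank                      # hypothetical pool name
  DEV=/dev/disk/by-id/ata-ssd0   # hypothetical log device
  zpool remove $POOL $DEV
  blkdiscard $DEV
  zpool add $POOL log $DEV

...driven by a crontab line like: @weekly /usr/local/sbin/slog-refresh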

I can hop on IRC for a chat if you want to run through it quickly. Also, don't hold this up on my account; I'm an intermittent participant at best, and the concerns I've got are outweighed by how helpful the new machine would be, regardless of the answers to the above!

Harrison

  Original Message  
From: chrisgiorgi at gmail.com
Sent: April 1, 2018 08:28
To: harrison.grundy at astrodoggroup.com
Cc: tech at wildintellect.com; sac at lists.osgeo.org
Subject: Re: [SAC] New Hardware, can we purchase now

Hi Harrison,
I understand what you mean regarding PCIe lanes, but I would be highly
surprised if Supermicro chose to share the lanes of the slot used for
the lone PCIe riser with any other devices, given the large number of
PCIe lanes provided by the controllers on dual Xeon processors. Of
course, if you can verify or refute that, it would be valuable. In any
case, U.2 would still be far faster than a SATA3 interconnect even if
it were bridged and only had four lanes :)
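
One quick empirical check once the box is in hand (plain lspci; the
device address is a placeholder):

  # LnkCap is the maximum link width the slot offers; LnkSta is what
  # was actually negotiated. A device stuck behind a shared or
  # bridged bus shows up here immediately.
  lspci -vv -s 01:00.0 | grep -E 'LnkCap|LnkSta'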

The write cache is responsible for logging every write before
returning from a synchronous write, so each DB transaction to
insert, update, or delete an entry may represent dozens or even
hundreds of block writes to the cache by the time it completes; these
blocks are, however, reordered into a transaction group for the write
to the permanent storage pool. Standard SSDs used in such an
application have to be grossly oversized and/or replaced fairly
frequently, and they also don't perform nearly as well in general.
The previous-generation SLC-flash-based enterprise SSDs such as the
Intel 3700 have been discontinued, and prices are absurdly high for
those units which are still available. SSDs also tend to slow
noticeably after a period of frequent writes.

The expected useful life of this server is 5-7 years minimum, and one
of the stated criteria is avoiding the necessity for on-site
modifications or service if at all possible. A rough estimate is 2-5
years of service for SSDs of around twice the size, vs. 6-10 years on
the Optanes, based on both DWPD and MTBF. Please take a look at the
specs and let me know if I missed something.
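
For a purely illustrative comparison using published endurance ratings
and an assumed, made-up load of 200GB/day of log writes: the 280GB
Optane 900p is rated around 5.11 PBW, while a 512GB Samsung 960 PRO is
rated 400 TBW, so:

  900p:    5110 TBW / (0.2 TB/day * 365 day/yr) ~= 70 years
  960 PRO:  400 TBW / (0.2 TB/day * 365 day/yr) ~=  5.5 years

Real service life would be shorter once you derate for write
amplification and fill ratio, but the order-of-magnitude gap is the
point.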

Thanks,
   ~~~Chris~~~

On Sat, Mar 31, 2018 at 4:05 PM,  <harrison.grundy at astrodoggroup.com> wrote:
> I mean that, while the slot may be 16x electrically, do we know how it's wired on the board, with the on-board RAID, NIC, M.2, etc.? Supermicro has this somewhere; I'll see if I can dig it up.
>
> While the life of the Optanes may be a great deal longer than an SSD's, is it material to how hard we'd push the ZFS write cache? (Since that's transactional on most implementations, you can minimize the write-amplification issues.)
>
> My thinking is that first-generation drives that will last for 80 years of writes instead of 8 probably aren't worth the significantly increased cost unless we'll actually take advantage of the increased throughput, since we can always swap them later for faster/larger units without impacting the pool.
>
> Harrison
>
>   Original Message
> From: chrisgiorgi at gmail.com
> Sent: March 31, 2018 05:54
> To: harrison.grundy at astrodoggroup.com
> Cc: tech at wildintellect.com; sac at lists.osgeo.org
> Subject: Re: [SAC] New Hardware, can we purchase now
>
> I'm not sure how we would go about fitting a third Optane device --
> the quote had HHHL PCIe cards listed, not the required U.2 devices,
> which go in place of the Micron SATA SSDs.
> The PCIe -> U.2 interface card provides 4 PCIe 3.0 lanes to each U.2
> interface, which then connects by cable to the drives themselves.
> The M.2 card slot on the board should be on its own set of lanes, as
> none of the remaining PCIe slots on the board are occupied, due to
> space constraints.
> The reason for using the more expensive (and faster) Optanes for the
> write cache is that a write-cache failure can lead to data corruption,
> and they have an order of magnitude more write endurance than a
> standard SSD.
> The read cache can use a larger, cheaper (but still fast) SSD because
> it sees much lower write amplification than the write cache and a
> failure won't cause corruption.
>
>    ~~~Chris~~~
>
> On Fri, Mar 30, 2018 at 11:53 AM,  <harrison.grundy at astrodoggroup.com> wrote:
>> Can someone confirm that the 4x PCIe slots aren't shared with the M.2 slot on the board and that 2 independent 4x slots are available?
>>
>> If all three (the SSD and both Optanes) are on a single 4x bus, it kinda defeats the purpose.
>>
>> Harrison
>>
>>   Original Message
>> From: tech_dev at wildintellect.com
>> Sent: March 31, 2018 02:21
>> To: sac at lists.osgeo.org
>> Reply-to: tech at wildintellect.com; sac at lists.osgeo.org
>> Cc: chrisgiorgi at gmail.com
>> Subject: Re: [SAC] New Hardware, can we purchase now
>>
>> Here's the latest quote with the modifications Chris suggested.
>>
>> One question: any reason we can't just use the Optanes for both read &
>> write caches?
>>
>> Otherwise, unless there are other suggestions or clarifications, I will
>> send out another thread for an official vote to approve. Note the price
>> is about $1,000 more than originally budgeted.
>>
>> Thanks,
>> Alex
>>
>> On 03/14/2018 09:47 PM, Chris Giorgi wrote:
>>> Further investigation into the chassis shows this is the base system
>>> Supermicro is using:
>>> https://www.supermicro.com/products/system/1U/6019/SYS-6019P-MT.cfm
>>> It has a full-height PCIe 3.0 x8 port, as well as an M.2 PCIe 3.0 x4
>>> slot on the motherboard.
>>> In light of this, I am changing my recommendation to the following;
>>> please follow up with Supermicro for pricing:
>>> 2ea. Intel Optane 900p 280GB PCIe 3.0 x4 with U.2 interfaces,
>>> replacing the SATA SSDs,
>>> ...connected to either a Supermicro AOC-SLG3-2E4 or AOC-SLG3-2E4R
>>> (depending on compatibility).
>>> Then, a single M.2 SSD such as a 512GB Samsung 960 PRO in the motherboard slot.
>>>
>>> With this configuration, the Optanes supply a very fast mirrored write
>>> cache (ZFS ZIL SLOG), while the M.2 card provides read caching (ZFS
>>> L2ARC), with no further cache configuration needed.
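>>>
>>> For concreteness, that whole setup is two commands once the pool
>>> exists -- a sketch; the pool name and device paths are placeholders:
>>>
>>>   # Mirrored write cache (SLOG) on the two U.2 Optanes:
>>>   zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1
>>>   # Read cache (L2ARC) on the M.2 drive; no redundancy needed:
>>>   zpool add tank cache /dev/nvme2n1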
>>>
>>> Let me know if that sounds more palatable.
>>>    ~~~Chris~~~
>>>
>>>
>>> On Wed, Mar 14, 2018 at 10:36 AM, Chris Giorgi <chrisgiorgi at gmail.com> wrote:
>>>> Alex,
>>>>
>>>> Simply put, write caching requires redundant devices; read caching does not.
>>>>
>>>> The write cache can be relatively small -- it only needs to handle
>>>> writes which have not yet been committed to disks. This allows sync
>>>> writes to finish as soon as the data hits the SSD, with the write to
>>>> disk being done async. Failure of the write cache device(s) may result
>>>> in data loss and corruption, so they MUST be redundant for
>>>> reliability.
>>>>
>>>> The read cache should be large enough to handle all hot data and much
>>>> of the warm data. It provides a second-level cache behind the
>>>> in-memory block cache, so that cache misses to evicted blocks can be
>>>> serviced very quickly without waiting for drives to seek. Failure of
>>>> the read cache device degrades performance, but has no impact on data
>>>> integrity.
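>>>>
>>>> Both caches are easy to watch in operation (a sketch, assuming a
>>>> pool named tank):
>>>>
>>>>   # Per-vdev I/O stats every 5 seconds; log and cache devices are
>>>>   # listed separately, so you can see how hard each is being hit.
>>>>   zpool iostat -v tank 5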
>>>>
>>>>   ~~~Chris~~~
>>>>
>>>> On Wed, Mar 14, 2018 at 9:05 AM, Alex M <tech_dev at wildintellect.com> wrote:
>>>>> My overall response: I'm a little hesitant to implement so many new
>>>>> technologies at the same time with only 1 person who knows them (Chris G).
>>>>>
>>>>> My opinion:
>>>>> +1 on some use of ZFS, if we have a good guide
>>>>> -1 on use of Funtoo; we've preferred Debian or Ubuntu for many years and
>>>>> have more people comfortable with them
>>>>> +1 on trying LXD
>>>>> +1 on Optane
>>>>> ?0 on the SSD caching
>>>>>
>>>>> 1. What tool are we using to configure write-caching on the SSDs? I'd
>>>>> rather not be making an overly complicated database configuration.
>>>>>
>>>>> 2. That seems a reasonable answer to me, though do we still need the
>>>>> SSDs if we use the Optane for caching? It sounds to me like Optane or
>>>>> SSD would suffice.
>>>>>
>>>>> 3. Disks - Yes, if we plan to archive OSGeo Live, that would benefit
>>>>> from larger disks. I'm a -1 on storing data for the geodata committee,
>>>>> unless they can find large data that is not publicly hosted elsewhere,
>>>>> at which point I would recommend we find partners to host the data,
>>>>> like GeoForAll members or companies like Amazon/Google, etc. Keep in
>>>>> mind we also need to plan for backup space. Note: I don't see the
>>>>> total usable disk size of the backups in the wiki; can someone figure
>>>>> that out and add it? We need to update
>>>>> https://wiki.osgeo.org/wiki/SAC:Backups
>>>>>
>>>>> New question: which disk are we installing the OS on, and therefore the
>>>>> ZFS packages?
>>>>>
>>>>> Thanks,
>>>>> Alex
>>>>>
>>>>> On 03/13/2018 12:57 PM, Chris Giorgi wrote:
>>>>>>  Hi Alex,
>>>>>> Answers inline below:
>>>>>> Take care,
>>>>>>    ~~~Chris~~~
>>>>>>
>>>>>> On Mon, Mar 12, 2018 at 10:41 AM, Alex M <tech_dev at wildintellect.com> wrote:
>>>>>>> On 03/02/2018 12:25 PM, Regina Obe wrote:
>>>>>>>> I'm in the IRC meeting with Chris, and he recalls the only outstanding
>>>>>>>> thing before the hardware purchase was the disk size.
>>>>>>>>
>>>>>>>> [15:17] <TemptorSent> From my reply to the mailing list a while back, the
>>>>>>>> pricing for larger drives: (+$212 for 4x10he or +$540 for 4x12he)
>>>>>>>>  [15:19] <TemptorSent> That gives us practical double-redundant storage of
>>>>>>>> 12-16TB and 16-20TB respectively, depending how we use it.
>>>>>>>>
>>>>>>>>
>>>>>>>> If that is all, can we just get the bigger disks and move forward with
>>>>>>>> the hardware purchase? Unless, of course, the purchase has already been
>>>>>>>> made.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Regina
>>>>>>>>
>>>>>>>
>>>>>>> Apologies, I dropped the ball on many things while traveling for work...
>>>>>>>
>>>>>>> My take on this: I was unclear on whether we really understood how we
>>>>>>> would utilize the hardware for our needs, since there are a few new
>>>>>>> technologies in discussion we haven't used before. I was also in favor
>>>>>>> of small savings, as we're over the line item, and that money could be
>>>>>>> used for things like people hours, 3rd-party hosting, spare parts,
>>>>>>> etc.
>>>>>>>
>>>>>>> So a few questions:
>>>>>>> 1. If we get the optane card, do we really need the SSDs? What would we
>>>>>>> put on the SSDs that would benefit from it, considering the Optane card?
>>>>>>
>>>>>> The Optane is intended for caching frequently read data on very fast storage.
>>>>>> As a single unmirrored device, it is not recommended for write-caching of
>>>>>> important data, but will serve quite well for temporary scratch space.
>>>>>>
>>>>>> Mirrored SSDs are required for write caching to prevent failure of a single
>>>>>> device causing data loss. The size of the write cache is very small by
>>>>>> comparison to the read cache, but the write-to-read ratio is much higher,
>>>>>> necessitating the larger total DWPD*size rating. The SSDs can also provide
>>>>>> the fast tablespace for databases as needed, which also have high write-
>>>>>> amplification. The total allocated space should probably be 40-60% of the
>>>>>> device size to ensure long-term endurance. The data stored on the SSDs
>>>>>> can be automatically backed up to the spinning rust on a regular basis for
>>>>>> improved redundancy.
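>>>>>>
>>>>>> As a sketch (dataset names are hypothetical, and the very first
>>>>>> run needs a full send rather than an incremental one):
>>>>>>
>>>>>>   # Snapshot the SSD-backed dataset, then replicate the delta
>>>>>>   # since the previous snapshot onto the spinning rust.
>>>>>>   zfs snapshot ssd/db@2018-03-13
>>>>>>   zfs send -i ssd/db@2018-03-12 ssd/db@2018-03-13 | \
>>>>>>     zfs recv rust/backup/db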
>>>>>>
>>>>>>> 2. What caching tool will we use with the Optane? Something like
>>>>>>> fscache/CacheFS that just does everything accessed, or something
>>>>>>> configured per site like varnish/memcache etc?
>>>>>>
>>>>>> We can do both, allocating a large cache for the fs (L2ARC in ZFS)
>>>>>> as well as providing an explicit cache where desirable. This
>>>>>> configuration can be modified at any time, as the system's operation
>>>>>> is not dependent on the caching device being active.
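>>>>>>
>>>>>> Per-dataset cache policy is just a property (dataset names are
>>>>>> hypothetical):
>>>>>>
>>>>>>   # Let web data use the L2ARC; keep bulk scratch data out of it.
>>>>>>   zfs set secondarycache=all tank/www
>>>>>>   zfs set secondarycache=none tank/scratch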
>>>>>>
>>>>>>> 3. Our storage growth is modest, not that I don't consider the quoted 8
>>>>>>> or 10 TB to be reliable, but the 2 and 4 TB models have a lot more
>>>>>>> reliability data, and take significantly less time to rebuild in a Raid
>>>>>>> configuration. So how much storage do we really need for Downloads and
>>>>>>> Foss4g archives?
>>>>>>
>>>>>> OSGeo-Live alone has a growth rate and retention policy that indicates
>>>>>> needs on the order of 100GB-1TB over the next 5 years, from my quick
>>>>>> calculations, not including any additional large datasets. Supporting
>>>>>> the geodata project would likely consume every bit of storage we throw
>>>>>> at it and still be thirsty for more in short order, so I would propose
>>>>>> serving only the warm data on the new server and re-purposing one of
>>>>>> the older machines for bulk cold storage and backups once services
>>>>>> have been migrated successfully.
>>>>>>
>>>>>> Remember, the usable capacity will approximately equal the total
>>>>>> capacity of a single drive in a doubly redundant configuration with 4
>>>>>> drives at proper filesystem fill ratios. We'll gain some due to
>>>>>> compression, but we also want to provision for snapshots and backup of
>>>>>> the SSD-based storage, so 1x single drive size is a good SWAG.
>>>>>> Resilver times for ZFS are based on actual stored data, not disk size,
>>>>>> and can be done online with minimal degradation of service, so that's
>>>>>> a moot point, I believe.
>>>>>>
>>>>>>> 4. Do we know what we plan to put on the SSD drives vs the Spinning Disks?
>>>>>>
>>>>>> See (1).
>>>>>>
>>>>>>> I think with the answers to these we'll be able to vote this week and order.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Alex
>>>>>>> _______________________________________________
>>>>>>> Sac mailing list
>>>>>>> Sac at lists.osgeo.org
>>>>>>> https://lists.osgeo.org/mailman/listinfo/sac
>>>>>
>>
>> _______________________________________________
>> Sac mailing list
>> Sac at lists.osgeo.org
>> https://lists.osgeo.org/mailman/listinfo/sac

