[SAC] [Hosting] ftp-osl storage upgrade (full rebuild required) - Jun 18, 2018 9:30AM PDT (Jun 18 1630 UTC)

Lance Albertson lance at osuosl.org
Fri Jun 22 13:57:26 PDT 2018


This has been completed and I've put ftp-osl back into rotation! Thanks for
your patience.

On Fri, Jun 22, 2018 at 12:44 PM, Lance Albertson <lance at osuosl.org> wrote:

> The sync has been completed and I will be switching this over to the local
> drives at 1:30PM PDT (2030 UTC) today. I'm going to also reboot the machine
> so that it's running back on the normal CentOS kernel instead of our custom
> mainline kernel we needed for Ceph. This outage should only last for about
> 10 minutes while the machine reboots.
>
> This does not affect anything pointed at ftp.osuosl.org, only ftp-osl
> (which is out of rotation).
>
> Thanks-
>
> On Tue, Jun 19, 2018 at 8:57 AM, Lance Albertson <lance at osuosl.org> wrote:
>
>> It's taking longer than I expected to sync the data back to the local
>> disks. This is due to the fact that the system is also rebuilding two RAID6
>> arrays, which I forgot to account for. This is also making the system
>> slower than I expected. At this rate it might take a few days to copy all
>> of the data back. Hopefully once the RAID6 arrays have finished rebuilding,
>> the I/O rate will speed up the syncing. Both arrays are currently at 55%
>> and 47% and we've transferred over 993G of 8.8T of data to the local disks.
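The "few days" figure can be sanity-checked with a simple linear extrapolation. This is only a rough sketch: it assumes a constant transfer rate, assumes the copy began around the time of the previous update (Mon Jun 18, 3:49 PM, which the thread does not confirm), and assumes 1T = 1024G. The function name is hypothetical.

```python
from datetime import datetime


def estimate_remaining_hours(copied_gb, total_gb, elapsed_hours):
    """Linear extrapolation: time left if the average rate so far holds."""
    rate_gb_per_hour = copied_gb / elapsed_hours
    return (total_gb - copied_gb) / rate_gb_per_hour


# Assumed start of the copy (time of the previous update); not stated in the thread.
assumed_start = datetime(2018, 6, 18, 15, 49)
# Time of this update.
status_time = datetime(2018, 6, 19, 8, 57)

elapsed_hours = (status_time - assumed_start).total_seconds() / 3600
total_gb = 8.8 * 1024  # assumes 1T = 1024G
remaining_hours = estimate_remaining_hours(993, total_gb, elapsed_hours)

print(f"about {remaining_hours / 24:.1f} more days at the current rate")
```

Because the RAID6 rebuilds compete for I/O, the real rate should improve once they finish, so this kind of estimate is an upper bound at best.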
>>
>> I will send another update once I'm ready to switch the system back over.
>>
>> Thanks-
>>
>> On Mon, Jun 18, 2018 at 3:49 PM, Lance Albertson <lance at osuosl.org>
>> wrote:
>>
>>> I just wanted to send you all an update on where we're at in the process.
>>>
>>> As of right now, ftp-osl is back online and serving its content from
>>> the Ceph volume. I've gone ahead and kicked off a few manual syncs to
>>> catch everything up; however, if you're using us as a master I recommend you
>>> kick off an update job right now. I'm also currently copying the content to
>>> the local disks which I expect to run through tomorrow sometime.
>>>
>>> The rebuild took a little bit longer than originally planned due to some
>>> issues I ran into building the new RAID array. My original plan didn't work
>>> so I had to go with plan B which took a little longer. Plan B resulted in
>>> creating two separate RAID6 arrays which means I lost about 2T in capacity
>>> from my original plan.
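The capacity loss follows from how RAID6 works: each array gives up two disks' worth of space to parity, so splitting the same disks into two arrays costs two extra disks. A minimal sketch of that arithmetic is below. The disk count and per-disk size are hypothetical, chosen only to illustrate a 2T difference; the thread does not state the actual layout.

```python
def raid6_usable(disks, disk_size_t):
    """Usable capacity of one RAID6 array: two disks' worth goes to parity."""
    if disks < 4:
        raise ValueError("RAID6 needs at least 4 disks")
    return (disks - 2) * disk_size_t


# Hypothetical layout: 22 disks of 1T each (not taken from the thread).
single_array = raid6_usable(22, 1.0)      # one large array
two_arrays = 2 * raid6_usable(11, 1.0)    # same disks split into two arrays

print(f"one array: {single_array}T, two arrays: {two_arrays}T, "
      f"lost: {single_array - two_arrays}T")
```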
>>>
>>> I'm keeping ftp-osl out of the public rotation for now since its I/O
>>> throughput likely isn't as good as before because it's serving the content
>>> via Ceph.
>>>
>>> I'll send another update tomorrow when I'm ready to switch back over to
>>> local storage. Please let me know if you notice any issues.
>>>
>>> Thanks-
>>>
>>> On Thu, Jun 14, 2018 at 3:52 PM, Lance Albertson <lance at osuosl.org>
>>> wrote:
>>>
>>>> I had a few questions regarding this outage that I wanted to clarify
>>>> for everyone.
>>>>
>>>> 1. There should be no outage during the 5.5 hour outage window for
>>>> anything pointed to ftp.osuosl.org (unless your DNS is directly
>>>> pointing at ftp-osl.osuosl.org)
>>>> 2. During the 18-24hr sync from Ceph to local storage, ftp-osl should
>>>> have normal read/write operations. There might be a slight I/O
>>>> performance hit during that window, but it's hard to tell. There will,
>>>> however, be a short (likely 5 min) outage to reads/writes on ftp-osl
>>>> when I do the final switch back to local storage.
>>>>
>>>> On Thu, Jun 14, 2018 at 10:00 AM, Lance Albertson <lance at osuosl.org>
>>>> wrote:
>>>>
>>>>> Service(s) affected: ftp.osuosl.org
>>>>>
>>>>> During the outage, the master syncing node for our FTP cluster
>>>>> (ftp-osl) will be offline which means any updates to our software mirrors
>>>>> will be delayed.
>>>>>
>>>>> Outage Window:
>>>>> Start: Mon, Jun 18 9:30AM PDT (Mon Jun 18 1630 UTC)
>>>>> End: Mon, Jun 18 3:00PM PDT (Mon Jun 18 2200 UTC)
>>>>>
>>>>> Reason for outage:
>>>>>
>>>>> Our FTP cluster is starting to run low on disk space and we will be
>>>>> adding additional hard drives to the system. Our system currently has
>>>>> 9.375T of disk space and we're planning on upgrading it to 18.75T (this
>>>>> takes into account the RAID6 configuration).
>>>>>
>>>>> Unfortunately, due to the nature of how the disk arrays are
>>>>> configured, we will not be able to grow the RAID array without a complete
>>>>> rebuild. This means we're going to have to re-copy all 8.8TB of data off of
>>>>> the machine and back onto it. Since this task is rather large and
>>>>> time-consuming, we've come up with a better alternative so that we don't
>>>>> have our master FTP server offline for very long.
>>>>>
>>>>> We have just recently built a new Ceph cluster for some new storage
>>>>> needs at the OSL and we are going to temporarily use this cluster to serve
>>>>> the ftp-osl content. I've already copied the content onto a new volume and
>>>>> have tested it enough to feel it can handle the load. This should make the
>>>>> transition plan much easier and quicker than initially planned. This server is
>>>>> already out of DNS rotation and we are planning on keeping it out of
>>>>> rotation until this process is complete to reduce the I/O load.
>>>>>
>>>>> So here's the plan thus far starting on Monday:
>>>>>
>>>>> 1. Stop all services on the system and do one final rsync to
>>>>> the Ceph volume
>>>>> 2. Reboot the machine, destroy the current RAID, and create a
>>>>> new one with the new disks
>>>>> 3. Reinstall the OS
>>>>> 4. Bootstrap the machine without FTP components initially and set up
>>>>> the Ceph volume
>>>>> 5. Deploy FTP components after the Ceph volume is set up and ready to go
>>>>> 6. Ensure inter-FTP-node syncing is working using the Ceph volume
>>>>> 7. Sync data from the Ceph volume back over to local disks (I'm guessing
>>>>> this will take 18-24 hours)
>>>>> 8. Once the sync is complete, shut down all services and switch the mount
>>>>> point over to the local disks
>>>>> 9. Profit!
>>>>>
>>>>> I would like to thank IBM for donating the hard drives needed for this
>>>>> upgrade.
>>>>>
>>>>> We will plan on doing the storage upgrades on our two other nodes
>>>>> (ftp-nyc & ftp-chi) soon, however we won't be using the Ceph cluster for
>>>>> this since they are remote. The current plan is to take one machine out for
>>>>> several days and sync the data back between the nodes. I will send another
>>>>> outage announcement for those two nodes once we're ready for that. We still
>>>>> need to ship the drives to the locations and work with the local data
>>>>> centers to get them installed.
>>>>>
>>>>> Projects affected: Any project using our FTP cluster as a master
>>>>> syncing point
>>>>>
>>>>
>> --
>> Lance Albertson
>> Director
>> Oregon State University | Open Source Lab
>>
>
>
>
> --
> Lance Albertson
> Director
> Oregon State University | Open Source Lab
>



-- 
Lance Albertson
Director
Oregon State University | Open Source Lab