From the course: CompTIA Advanced Security Practitioner (CASP+) (CAS-004) Cert Prep

Data lifecycle

- All data and information has a lifecycle associated with it. The data lifecycle is the entire period of time that data exists within your systems. Data goes through six main stages throughout its lifecycle, from creation, to usage, to sharing, to storage, to archival, to destruction.
First, we have data creation. Data can be created in your system whenever it's acquired, entered, or captured. Data acquisition occurs when existing data that's produced outside of your system is imported automatically into your system. For example, if I created an email and I sent it to you, your system has acquired that data and it has begun its lifecycle within your systems. Data entry is going to occur when information is manually typed into your system by personnel within your organization. For example, if you open up a Word document and you start taking notes while watching this lesson, you're going to be performing data entry. Now, data capture occurs when data is generated by a device used in your organization. For example, if your routers and switches are constantly generating log files, those are a form of data capture.
Second, we have data use. Now, data use is the phase of the lifecycle where data is put to work to achieve some purpose within your organization. If you're viewing, processing, modifying, or saving the data, you are currently performing data use. Every time a critical piece of data is opened and accessed, there should be an audit trail that maintains a log of who accessed the data and when.
Third, we have data sharing. Now, data sharing occurs when a user makes the data available to somebody else outside of the organization. For example, when I began recording this video, only my staff had access to this video so we could create this course and all the subtitles for this particular lesson. But, once we were at a point where we wanted you to be able to see this video, we had to share it with other organizations and people outside of Dion Training.
When data is shared, it's important that you put the right protections in place based on who should be able to access the data being shared and where that data should be shared to.
Fourth, we have data storage. Now, data storage occurs when the data is not being actively used. Every piece of data needs to be stored for later retrieval, processing, use, or transfer. But, while it isn't actively being used, it's going to have to be stored someplace. Now, the data may be stored as a digital file, such as a Word document, or as a single item within a larger database, depending on the type of data and the protections it requires. Data that is going to be stored is going to be placed into an area that is instantly accessible when needed by your users.
Fifth, we have data archival. Now, data archival is the copying of data to an environment where it's going to be stored in case it's needed in an active production environment again later on. For example, your organization might conduct nightly backups of all of your servers and put them onto a backup tape or a cloud-based glacial server. In that case, the data won't be instantly available anymore, but your organization can recover it and restore from it if they need to, taking those backups from the archives and putting them back onto your production servers in the case of an emergency or an investigation.
Sixth, we have data destruction. At some point, the data you've created, used, shared, stored, and archived is going to be no longer valuable to you. At that point, it's going to be time to destroy the data and bring it to the end of its useful life. After all, we can't keep all of our data indefinitely because we're going to end up running out of storage space, or it's going to simply cost us too much to buy more storage space for all of that data that has no useful purpose.
This destruction could be as simple as running a delete command on a server, or it could be overwriting that area of a hard disk with zeroes, or you could physically destroy a tape backup by shredding that tape. The exact method here isn't really important, but the concept is that data has to go through a lifecycle, and that's what we're concerned with here. Remember, all data moves through this lifecycle, from creation, to use, to sharing, to storage, to archiving, to destruction.
Now that we understand the basic data lifecycle, we need to discuss the concept of a data inventory. Now, a data inventory serves as a single source of truth within your organization. A data inventory is going to be used to provide instant insight into all the sources of data that an organization has access to, what information is being collected by these sources, where that data is being stored, and what will ultimately happen to that data. This is also referred to as a data mapping in some organizations.
So why is it important to conduct a data inventory or data mapping? Well, if we're going to be responsible for protecting our organization's data, then it's really important that we understand exactly where all that data is located. Now, this may sound easy, but these days, it's actually quite challenging because we have data located all over the place. Do you have data on your company's shared drive and email servers? Well, most likely you do, and you probably have full control over those servers. But, there's a lot more of your data out there as well. In my own company, we have data in our accounting software and our credit card processing software. Both of these are software-as-a-service solutions. Now, we also have some of our data in our learning management system and other parts of our data in our customer relationship management system. We use tools like Slack, Office 365, and Google Workspace, and all of these have our data, too.
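To make the idea of a data inventory a little more concrete, here's a minimal Python sketch of what one might look like for the places just mentioned. The class name, the field names, and the hosting labels ("self-managed", "saas", "unspecified") are my own assumptions for illustration, and the "TBD" values are deliberate placeholders, not real details about any organization.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class InventoryEntry:
    """One row of a hypothetical data inventory (data mapping)."""
    source: str       # system or service that holds the data
    hosting: str      # "self-managed", "saas", or "unspecified" (assumed labels)
    collected: str    # what information the source holds (placeholder)
    disposition: str  # what ultimately happens to the data (placeholder)


# The nine places named in the lesson. Only the hosting column is filled in;
# the other columns stay "TBD" until someone actually documents them.
inventory: List[InventoryEntry] = [
    InventoryEntry("shared drive", "self-managed", "TBD", "TBD"),
    InventoryEntry("email servers", "self-managed", "TBD", "TBD"),
    InventoryEntry("accounting software", "saas", "TBD", "TBD"),
    InventoryEntry("credit card processing software", "saas", "TBD", "TBD"),
    InventoryEntry("learning management system", "unspecified", "TBD", "TBD"),
    InventoryEntry("customer relationship management system", "unspecified", "TBD", "TBD"),
    InventoryEntry("Slack", "saas", "TBD", "TBD"),
    InventoryEntry("Office 365", "saas", "TBD", "TBD"),
    InventoryEntry("Google Workspace", "saas", "TBD", "TBD"),
]


def sources_hosted_as(entries: List[InventoryEntry], hosting: str) -> List[str]:
    """Return the names of all sources with the given hosting label."""
    return [entry.source for entry in entries if entry.hosting == hosting]


def incomplete(entries: List[InventoryEntry]) -> List[str]:
    """Return sources whose contents or final disposition are still undocumented."""
    return [
        entry.source
        for entry in entries
        if "TBD" in (entry.collected, entry.disposition)
    ]


if __name__ == "__main__":
    print(len(inventory), "sources recorded")
    print("Outside our direct control:", sources_hosted_as(inventory, "saas"))
    print("Still to document:", len(incomplete(inventory)))
```

The point of a structure like this is simply that every source gets the same questions asked of it, so gaps in the inventory show up as unfilled fields instead of going unnoticed.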
So, I only just scratched the surface here, but I've already listed out nine different places where our data resides, and we're a really small company. This is why conducting a data inventory or data mapping is truly important here because once you know where all your data is, you can then begin to determine how you're going to secure that data and protect that data across all of these disparate storage locations that you've now created.
Now, once you've identified all of this data, you need to figure out how to ensure its integrity is also being maintained. This is known as data integrity management. Now, data integrity is all about protecting data against improper maintenance, modification, or alteration, and it also includes data authenticity. Integrity has to do with the accuracy of information, including its authenticity and trustworthiness. Now, information with low integrity concerns may be considered unimportant to your business because it doesn't have a precise operational function, and therefore it's not necessary to vigorously check it for errors. Information with high integrity concerns, though, is considered to be crucial and critical to your functions, and therefore, it must be accurate in order to prevent negative impacts to your organization's activities.
For example, if you're dealing with your accounting software, you likely need to ensure it has a high level of integrity because you don't want to have a customer's balance saying that they owe you $10,000 when they only owe you $1,000. That would be a big problem, and it would be due to a lack of integrity because those numbers were changed. Therefore, you want to build out your data protection plans for your accounting systems and implement things like journaling and hashing of your data to ensure the integrity remains intact at all times. Conversely, if you're dealing with some kind of data that doesn't require high integrity, you might choose not to implement these more expensive controls.
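As a rough illustration of the hashing control, here's a minimal Python sketch that records a SHA-256 digest of an accounting record and later checks whether the record still matches it. The record layout, the customer number, and the function names are hypothetical, and this is only one simple way hashing could be applied, not a description of how any particular accounting product does it.

```python
import hashlib
import hmac


def fingerprint(record: bytes) -> str:
    """Return the SHA-256 digest of a record as a hex string."""
    return hashlib.sha256(record).hexdigest()


def is_unchanged(record: bytes, baseline: str) -> bool:
    """Check a record against the digest that was saved when it was written."""
    return hmac.compare_digest(fingerprint(record), baseline)


# Hypothetical record: the customer owes $1,000.
original = b"customer=1001;balance=1000.00"
baseline = fingerprint(original)  # saved somewhere the record's editors can't change

# Later, the balance has been altered to $10,000.
tampered = b"customer=1001;balance=10000.00"

if __name__ == "__main__":
    print("original intact:", is_unchanged(original, baseline))
    print("tampered intact:", is_unchanged(tampered, baseline))
```

Keep in mind that a plain hash only helps if the saved digest itself is protected. If someone can change both the record and its digest, the check will still pass.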
Whether to implement these controls is ultimately a decision that's going to be made using your risk management process, considering the cost versus the benefits of adding these additional controls to each of your data processing systems.
Finally, we need to discuss data storage a bit more in depth here. By far, the most common place we're going to store our data is a RAID. A redundant array of inexpensive disks, or a RAID, is a hard-drive technology that allows data to be written to a logical partition that's going to be spread across multiple physical disk drives. With the redundant RAID levels, this ensures that even if a single disk drive in the array fails, that data is still going to be available by restoring it from the RAID itself, instead of having to restore it from a tape backup. Now, there are four main types of RAID arrays that you should be familiar with. RAID 0, which is referred to as disk striping; RAID 1, which is called disk mirroring; RAID 3, which is called byte-level data striping with dedicated parity; and RAID 5, which is block-level data striping with distributed parity.
A RAID 0, or disk striping, is going to involve a minimum of two physical disks. In this configuration, half of the data is stored on one of the physical drives, while the other half is stored on the other drive. This increases the responsiveness and the delivery of the data stored on this kind of RAID, but there is no added redundancy to this data. If either of these physical drives fails, all the data is going to be lost.
Now, if we want to have some redundancy, we can move to a RAID 1. RAID 1, or disk mirroring, places the importance of redundancy over speed in this array. In this type of configuration, you need to have at least two physical disks and you're going to have a copy of the data written to both disks at the same time. This provides an always-ready and available backup in case either of those individual drives fails.
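To see the difference between striping and mirroring, here's a small Python sketch that treats each disk as a plain byte string. The function names and the four-byte stripe size are assumptions I've picked for illustration. Real arrays work on much larger stripes and are handled by a controller or the operating system.

```python
from typing import Tuple


def stripe(data: bytes, stripe_size: int = 4) -> Tuple[bytes, bytes]:
    """RAID 0 idea: alternate fixed-size chunks of the data between two disks."""
    disk_a, disk_b = bytearray(), bytearray()
    for offset in range(0, len(data), stripe_size):
        chunk = data[offset:offset + stripe_size]
        if (offset // stripe_size) % 2 == 0:
            disk_a += chunk
        else:
            disk_b += chunk
    return bytes(disk_a), bytes(disk_b)


def unstripe(disk_a: bytes, disk_b: bytes, stripe_size: int = 4) -> bytes:
    """Read the chunks back in alternating order. Both disks are required."""
    data = bytearray()
    for offset in range(0, max(len(disk_a), len(disk_b)), stripe_size):
        data += disk_a[offset:offset + stripe_size]
        data += disk_b[offset:offset + stripe_size]
    return bytes(data)


def mirror(data: bytes) -> Tuple[bytes, bytes]:
    """RAID 1 idea: write a full copy of the data to each of two disks."""
    return data, data


if __name__ == "__main__":
    payload = b"ABCDEFGHIJ"

    a, b = stripe(payload)
    print(a, b)            # each disk holds only part of the data
    print(unstripe(a, b))  # the full data needs both disks

    copy_1, copy_2 = mirror(payload)
    print(copy_1 == payload and copy_2 == payload)  # either disk alone is enough
```

With striping, losing either byte string leaves you with only fragments of the payload. With mirroring, either copy on its own still holds everything.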
The next type we have is known as a RAID 3, or byte-level data striping with a dedicated parity drive, and this uses a minimum of three disks. In this type of configuration, a portion of your data is placed on the first drive and another portion is placed on the second drive. Then, we use a mathematical algorithm to calculate a parity that's going to be stored on the third drive. If a single drive fails, then the parity can be used to recalculate the values that were stored on the drive that failed once we put in a new drive and we rebuild the array. This allows the array to rebuild itself very quickly and provide data to our users in no time at all.
Next, we have a RAID 5. Now, RAID 5 is the most commonly used RAID. It is known as block-level data striping with distributed parity. In this array, a minimum of three drives is also required. When the data is stored on this array, a piece of the data is placed on each of the drives and the parity is also stored on those drives. Instead of reserving a single drive for all the parity storage, we're going to have data and parity equally distributed across all three drives. This type of array is very popular because we can replace any single drive without having to shut down the server, and this allows operations to continue while we're rebuilding a failed drive.
Now, RAIDs can be implemented using either software or hardware. It's cheaper to use a software-based solution, but hardware-based solutions will operate faster for most environments.
Another storage option we have is known as storage area networks, or SANs, and these are very common in our larger enterprise networks. A SAN provides high-capacity storage by connecting storage devices using a high-speed private network that is going to be interconnected by storage-specific switches. This is usually going to be handled by a Fibre Channel network.
SANs are going to be great for their scalability and high availability, but they're quite expensive to produce and to procure, and they require a high level of skill to maintain.
These days, a lot of our data is also going to be stored in the cloud. This can be inside of a database, block-level storage, or a binary large object known as a blob. Regardless of where we end up storing our data, it's always important for us to have a backup and recovery plan for that data because all forms of storage are subject to outages and, eventually, data loss.
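Going back to the parity idea from the RAID 3 and RAID 5 discussion for a moment, here's a minimal Python sketch of how a lost drive can be recalculated. It assumes the parity is a byte-by-byte XOR, which is the usual choice for these RAID levels. The drive contents, the function names, and the particular parity rotation shown for RAID 5 are made up for illustration, and real controllers may lay parity out differently.

```python
def xor_bytes(left: bytes, right: bytes) -> bytes:
    """XOR two equal-length byte strings together."""
    if len(left) != len(right):
        raise ValueError("blocks must be the same length")
    return bytes(x ^ y for x, y in zip(left, right))


def parity_drive_for(stripe_index: int, drive_count: int = 3) -> int:
    """One possible RAID 5 rotation: parity moves to a different drive each stripe."""
    return (drive_count - 1 - stripe_index) % drive_count


# RAID 3 style: two data drives and one dedicated parity drive.
drive_1 = b"DATA"
drive_2 = b"LIFE"
parity = xor_bytes(drive_1, drive_2)

# Pretend drive_2 has failed. XOR the surviving drive with the parity
# to recalculate what the failed drive was holding.
rebuilt_drive_2 = xor_bytes(drive_1, parity)

if __name__ == "__main__":
    print("rebuilt matches original:", rebuilt_drive_2 == drive_2)
    for stripe_index in range(3):
        print("stripe", stripe_index, "parity on drive", parity_drive_for(stripe_index))
```

The same XOR trick works no matter which single drive is lost, which is why spreading the parity around in RAID 5 doesn't change how a rebuild works. It only changes where the parity for each stripe lives.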
