AEM 65 - Find Duplicate Assets (Binaries) in existing Repository

Goal


AEM Assets detects duplicate binaries on upload when the detect duplicate setting of the Create Asset servlet is enabled (for more info check the documentation)

That check runs only on upload, so to find duplicates already present in a repository you can compare the jcr:content/metadata/dam:sha1 property of the asset nodes (the index /oak:index/damAssetLucene/indexRules/dam:Asset/properties/damSha1 should speed up the query). The following is a simple DavEx script for detecting duplicates (for more info on DavEx check this post)

Github



Solution


1) Add the necessary jars to the classpath...



2) Run the standalone program apps.FindDuplicateBinariesInAEM with the following code

package apps;

import org.apache.jackrabbit.commons.JcrUtils;

import javax.jcr.*;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;
import java.util.*;

public class FindDuplicateBinariesInAEM {
    public static void main(String[] args) throws Exception{
        String REPO = "http://localhost:4502/crx/server";
        String WORKSPACE = "crx.default";

        Repository repository = JcrUtils.getRepository(REPO);

        Session session = repository.login(new SimpleCredentials("admin", "admin".toCharArray()), WORKSPACE);
        QueryManager qm = session.getWorkspace().getQueryManager();

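        // Sort by SHA-1 so that assets with the same digest are adjacent; the loop below relies on this ordering.
        // The property name uses the JCR-SQL2 bracket notation rather than a quoted string.
        String stmt = "SELECT * FROM [dam:Asset] WHERE ISDESCENDANTNODE(\"/content/dam\") ORDER BY [jcr:content/metadata/dam:sha1]";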
        Query q = qm.createQuery(stmt, Query.JCR_SQL2);

        NodeIterator results = q.execute().getNodes();
        Node node = null, metadata;
        String previousSha1 = null, currentSha1 = null, paths = null, previousPath = null;
        Map<String, String> duplicates = new LinkedHashMap<String, String>();

        while(results.hasNext()){
            node = (Node)results.next();

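            // Skip assets without a metadata node; getNode() would throw PathNotFoundException
            if(!node.hasNode("jcr:content/metadata")){
                continue;
            }

            metadata = node.getNode("jcr:content/metadata");

            // Skip assets for which no SHA-1 was computed
            if(metadata.hasProperty("dam:sha1")){
                currentSha1 = metadata.getProperty("dam:sha1").getString();
            }else{
                continue;
            }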

            if(currentSha1.equals(previousSha1)){
                paths = duplicates.get(currentSha1);

                if( paths == null){
                    paths = previousPath;
                }else{
                    if(!paths.contains(previousPath)){
                        paths = paths + "," + previousPath;
                    }
                }

                paths = paths + "," + node.getPath();

                duplicates.put(currentSha1, paths);
            }

            previousSha1 = currentSha1;
            previousPath = node.getPath();
        }

        String[] dupPaths = null;

        System.out.println("--------------------------------------------------------------------");
        System.out.println("Duplicate Binaries in Repository - " + REPO);
        System.out.println("--------------------------------------------------------------------");

        for(Map.Entry<String, String> entry : duplicates.entrySet()){
            System.out.println(entry.getKey());

            dupPaths = entry.getValue().split(",");

            for(String path : dupPaths){
                System.out.println("\t" + path);
            }
        }

        session.logout();
    }
}
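
The program above depends on the query returning assets sorted by dam:sha1. As a rough alternative sketch, the same comparison can be done without relying on sort order by grouping paths per digest in a map. The class name DuplicateGrouper, the method findDuplicates and the sample paths and digests are hypothetical and only for illustration; in practice the rows would be filled from the query results.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DuplicateGrouper {
    // Groups asset paths by SHA-1 and keeps only digests shared by two or more paths.
    // Each row is {path, sha1}; the order of the rows does not matter.
    public static Map<String, List<String>> findDuplicates(List<String[]> rows) {
        Map<String, List<String>> bySha1 = new LinkedHashMap<>();

        for (String[] row : rows) {
            bySha1.computeIfAbsent(row[1], k -> new ArrayList<>()).add(row[0]);
        }

        // Drop digests that belong to a single asset
        bySha1.values().removeIf(paths -> paths.size() < 2);

        return bySha1;
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for query results
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[]{"/content/dam/a/logo.png", "aaa111"});
        rows.add(new String[]{"/content/dam/b/photo.jpg", "bbb222"});
        rows.add(new String[]{"/content/dam/c/logo-copy.png", "aaa111"});

        for (Map.Entry<String, List<String>> entry : findDuplicates(rows).entrySet()) {
            System.out.println(entry.getKey());

            for (String path : entry.getValue()) {
                System.out.println("\t" + path);
            }
        }
    }
}
```

Keeping the paths in a list per digest also avoids splitting a comma separated string when printing.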


Solution - 2


You can also utilize some of the following queries to get duplicate assets. Thank you Himanshu Pathak for the tip...

1) Get dam:sha1 of all assets (if the repository is too large with many assets, filter by folder structure and later combine in MS Excel)

http://localhost:4502/bin/querybuilder.json?p.hits=selective&path=/content/dam&p.properties=jcr:path%20jcr:content/metadata/dam:sha1&p.limit=-1&property=jcr:content/metadata/dam:sha1&property.operation=exists 



2) Convert JSON to CSV. A sample converter is available at https://konklone.io/json/



3) Download the CSV file, open it in Excel and highlight the cells with duplicate values



4) Apply a filter by the selected cell's color



5) The filtered rows are the duplicates in the repository
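
To make the query in step 1 easier to read and adapt (for example to a sub folder of /content/dam), here is a small sketch that assembles the same QueryBuilder URL from its parameters. The class name Sha1QueryUrl and the method build are hypothetical; the parameter names and values are the ones from the URL in step 1. The sketch percent-encodes the values, so the result looks different from the hand written URL but carries the same parameters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class Sha1QueryUrl {
    // Builds the QueryBuilder URL from step 1 for a given host and DAM folder.
    public static String build(String host, String damPath) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("p.hits", "selective");                               // return only the listed properties
        params.put("path", damPath);                                     // search root
        params.put("p.properties", "jcr:path jcr:content/metadata/dam:sha1");
        params.put("p.limit", "-1");                                     // no paging, return all hits
        params.put("property", "jcr:content/metadata/dam:sha1");
        params.put("property.operation", "exists");                      // only nodes that have a dam:sha1

        StringBuilder url = new StringBuilder(host).append("/bin/querybuilder.json");
        char separator = '?';

        for (Map.Entry<String, String> entry : params.entrySet()) {
            url.append(separator).append(encode(entry.getKey())).append('=').append(encode(entry.getValue()));
            separator = '&';
        }

        return url.toString();
    }

    private static String encode(String value) {
        try {
            // URLEncoder emits '+' for spaces; use %20 as in the original URL
            return URLEncoder.encode(value, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:4502", "/content/dam"));
    }
}
```

Passing a narrower folder as damPath is one way to split a large repository into smaller exports, as suggested in step 1.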







