Large Java Cache Solutions

I am implementing a web crawler which is expected to create an index of approximately 100,000 webpages. My current implementation has worked fine until I approached an index of approximately 30,000 webpages. The problem is being able to supply several web crawler "clients" with urls to crawl at a rate faster than the "clients" can actually crawl the webpage. Currently my database of urls to crawl is getting emptied by crawler clients at a faster rate than I can repopulate it.
The problem is that for each new url I find, I need to check it against the entire index of "crawled urls" to ensure it's a "new" url (I don't want to crawl a url more than once). My current implementation uses db4o, and each lookup, with a database of 30,000 urls, takes approx 0.8 seconds.
My question is, would it be better to keep a large scale cache of "crawled urls", rather than querying db4o each time I need to check if a url has been previously crawled?
I have found two possible cache systems:
http://jakarta.apache.org/jcs/index.html
and
http://ehcache.sourceforge.net/
Before I go about trying to work out how these systems work I just need to get some opinions on whether this is a correct solution to my problem.
Also, if anyone has had any previous experience with these, or similar, systems, could you point me in the direction of a good "learners" tutorial? They both seem rather complex at first glance.
Thanks

Maybe try hashing the url down to a 16ish bit integer. Then, when you're looking for a url, first hash it, select everything in your table with that integer hash, and then search that subset for your url.
Like,
SELECT * FROM (SELECT * FROM urls WHERE hash = myhashfunction('www.someurl.com')) AS candidates WHERE candidates.name = 'www.someurl.com';
Instead of the embedded select statement you could perhaps have a stored view, and you may want to have the hash function be a client-side thing instead of a stored function.
I'm not really into database optimization though.
EDIT: to clarify, your insert would be:
INSERT INTO urls (hash, name) VALUES (myhashfunction('www.someurl.com'), 'www.someurl.com');
Edited by: endasil on Oct 17, 2007 6:23 AM
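The same hash-first, compare-second lookup is what an in-memory hash set does, which is one way to answer the original question without a cache framework. The sketch below is only an illustration: the class name SeenUrls and its methods are made up here, and it assumes the full set of crawled urls fits in the heap, which seems plausible for roughly 100,000 strings but has not been measured.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper: remembers which urls have already been queued or crawled. */
final class SeenUrls {
    // Concurrent hash set: a lookup hashes the url, then compares it only
    // against entries in the same bucket, so cost does not grow with the index.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Returns true only the first time a given url is offered. */
    boolean markIfNew(String url) {
        return seen.add(normalize(url));
    }

    int size() {
        return seen.size();
    }

    // Assumption: trimming whitespace is enough for this sketch. A real crawler
    // would probably also canonicalise the host, default ports and fragments.
    private static String normalize(String url) {
        return url.trim();
    }
}
```

Several crawler threads can call markIfNew concurrently. The set is lost on restart, so it would still need to be rebuilt from db4o (or whatever store holds the index) at startup.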

Similar Messages

  • Webutil_file_transfer url_to_client places objects in the Java Cache

    Hi,
    I'm using webutil_file_transfer url_to_client to download a PDF file.
    Once I have downloaded the PDF file, it is visible in the Client Java Cache Viewer. When I try to run the download again, webutil takes the PDF from the cache instead of taking the newer PDF from the server the URL points to.
    The URL I'm passing is the same each time, but the PDF file on the server is updated very often.
    I don't use the WEBCACHE port!
    My question now is: can I prevent webutil from putting those PDF files in the Java Cache on the client? Or is there any utility to remove selected objects from the Java Cache?
    Does anybody have an idea how to avoid this behaviour?
    Fatih

    Hi Craig,
    with the help of Oracle Support, we found the reason and also a workaround for this issue.
    The real cause:
    By default the Java applet parameter DefaultUseCaches is true, and all files downloaded within an applet go through the Java Client Cache.
    WEBUTIL does not touch this parameter, so it stays at the default (true).
    Solution/workaround:
    With the method setDefaultUseCaches it is possible to disable/enable the default cache setting.
    So it's possible to disable the cache with a small Java Bean running in Forms, which disables the cache before WEBUTIL_FILE_TRANSFER and enables it again after the successful download.
    Here is an extract of the bean code:
    // Any URL will do: the setting applies to connections created afterwards.
    URL u = new URL("http:");
    URLConnection uc = u.openConnection();
    uc.setDefaultUseCaches(false);
    thank you for your time!
    regards,
    Fatih
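The disable-before, enable-after pattern described above could be wrapped roughly as follows. This is a sketch, not the bean Oracle Support supplied: the class UrlCacheSwitch and its method names are invented for illustration, and it assumes the download can be handed over as a Runnable. Only the java.net.URLConnection calls getDefaultUseCaches and setDefaultUseCaches come from the JDK.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

/** Hypothetical helper (not part of WebUtil) around URLConnection's default cache flag. */
final class UrlCacheSwitch {
    private UrlCacheSwitch() {}

    // The default is read and written through an instance, so any connection
    // object will do. openConnection() does not contact the server.
    private static URLConnection probe() {
        try {
            return URI.create("http://localhost/").toURL().openConnection();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Current default used by connections created from now on. */
    static boolean currentDefault() {
        return probe().getDefaultUseCaches();
    }

    /** Changes the default and returns the previous value. */
    static boolean setDefaultUseCaches(boolean enabled) {
        URLConnection uc = probe();
        boolean previous = uc.getDefaultUseCaches();
        uc.setDefaultUseCaches(enabled);
        return previous;
    }

    /** Runs the download with caching off, then restores the earlier default. */
    static void runUncached(Runnable download) {
        boolean previous = setDefaultUseCaches(false);
        try {
            download.run();
        } finally {
            setDefaultUseCaches(previous);
        }
    }
}
```

Restoring the previous value in a finally block means a failed download does not leave caching switched off for the rest of the session.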

  • Java Caching Framework

    Hi All,
    I'm in the process of evaluating some open source Java caching frameworks that could help our web application reduce response time.
    I have some open source caching frameworks on my list, like
    JCS
    OSCache
    JOCache
    but I have never used any of them. If anyone in the group has used them in the past, or is working on some open source based framework, please share your experience so that it can help us decide on the best available solution.
    Thanks in advance
    -Umesh

    You may want to check out Hazelcast. It is an open source, transactional, distributed cache for Java. A Hibernate second level cache plug-in is also available.
    Hazelcast is released under the Apache license. It also has distributed lock, topic, multimap, queue and executor service implementations. This 10 minute video (http://www.hazelcast.com/screencast.jsp) is very good to get started.
    -talip
    Edited by: talip_ozturk on May 2, 2010 1:59 PM
    Edited by: talip_ozturk on May 2, 2010 2:16 PM

  • Java web start fails to launch application when java cache is off

    Hi,
    There's no problem when the Java cache is used, but when the Java cache isn't used, my application fails to launch via JWS (JNLP).
    The following is the error:
    java.lang.NullPointerException
    at java.util.jar.JarVerifier.mapSignersToCodeSource(Unknown Source)
    at java.util.jar.JarVerifier.mapSignersToCodeSources(Unknown Source)
    at java.util.jar.JarVerifier.getCodeSources(Unknown Source)
    at java.util.jar.JarFile.getCodeSources(Unknown Source)
    at java.util.jar.JavaUtilJarAccessImpl.getCodeSources(Unknown Source)
    at com.sun.deploy.cache.DeployCacheJarAccessImpl.getCodeSources(Unknown Source)
    at com.sun.javaws.security.SigningInfo.getCommonCodeSignersForJar(Unknown Source)
    at com.sun.javaws.security.SigningInfo.check(Unknown Source)
    at com.sun.javaws.LaunchDownload.checkSignedResourcesHelper(Unknown Source)
    at com.sun.javaws.LaunchDownload.checkSignedResources(Unknown Source)
    at com.sun.javaws.Launcher.prepareResources(Unknown Source)
    at com.sun.javaws.Launcher.prepareAllResources(Unknown Source)
    at com.sun.javaws.Launcher.prepareToLaunch(Unknown Source)
    at com.sun.javaws.Launcher.prepareToLaunch(Unknown Source)
    at com.sun.javaws.Launcher.launch(Unknown Source)
    at com.sun.javaws.Main.launchApp(Unknown Source)
    at com.sun.javaws.Main.continueInSecureThread(Unknown Source)
    at com.sun.javaws.Main$1.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
    this is the jnlp file
    <?xml version="1.0" encoding="UTF-8"?>
    <jnlp spec="1.0+" version="1.0" codebase="http://11.4.100.41" href="secuiNXG.jnlp">
    <information>
    <title>secuiNXG U start GUI</title>
    <vendor>secui.com Ltd.</vendor>
    <homepage href="/"/>
    <icon href="web_login_ci.gif"/>
    <shortcut online="true">
    <desktop/>
    <menu submenu="secuiNXG U"/>
    </shortcut>
    </information>
    <security>
    <all-permissions/>
    </security>
    <resources>
    <j2se version="1.4+" initial-heap-size="50m" max-heap-size="250m" />
    <jar href="SES.jar"/>
    <jar href="skin_alloy.jar"/>
    <jar href="Borders.jar"/>
    <jar href="informa.jar"/>
    <jar href="jaxen-1.1-beta-12.jar"/>
    <jar href="jcelements.jar"/>
    <jar href="jctable.jar"/>
    <jar href="jdom.jar"/>
    <jar href="log4j-1.2.14.jar"/>
    <jar href="jnlp.jar"/>
    <jar href="jxl.jar"/>
    </resources>
    <application-desc main-class="secui.firewall.SecuiLogin">
    <argument>11.4.100.41:80</argument>
    </application-desc>
    </jnlp>
    Test Environment
    JRE : 1.6.0_19
    O/S : Windows 7 (32 bit)
    Browser : IE8
    I checked other JRE versions, and with those, turning off the Java cache didn't cause any problems.
    I checked the release notes for update 19, but I have no clue.
    Is there any way to launch a Java application without using the Java cache?

    greencosmos wrote:
    ..I had a problem changing the cache location. The button for this action is disabled. I can't figure out how to enable it.
    On my system, it is enabled when the 'cache files' box is checked, and disabled when it isn't.
    1) Without changing the location I assumed that deleting the cached files could be a similar job, so I clicked "Delete Files..." and deleted with all checkboxes checked.
    I am not convinced that would entirely clear the cache, but have done no specific tests to check.
    2) I unchecked "Keep temporary files on my computer".
    So it was checked when you were trying to change the cache location?
    3) Applied all the changes.
    4) I launched your demo.
    Result - It was launched without any prompt.
    I revisited the page with the URL and successfully launched the app. again.
    The application launched just fine, but no prompt.
    If you follow those (links and) steps I outlined exactly, does the file service demo launch twice for you?
    I'm sorry I don't exactly understand your question. ..
    Your description is enough to convince me that answer is 'yes'.
    ..I tried twice and it launched twice, but not twice at the same time (it launched once at a time).
    Surprise, surprise. That is the first mention in this thread of 'same time'/'simultaneously'. Care to share the other defining factors that you forgot to mention, or is guessing part of the 'fun' of helping?
    But still I'm having problems with launching my application.
    I assume that it could be jnlp syntax problem. The "main=true" subelement is missing.
    It is a good idea to validate the launch files of JWS based launches that are failing for any reason. For that purpose, I offer JaNeLA.
    Having said that, a missing main='true' will not be detected by JaNeLA, since it is not a compulsory attribute. ..But check them anyway.

  • Too large java heap error while starting the domain.Help me please..

    I am using WebLogic 10.2. After creating the domain, I get this error while starting it. Can anyone help me? Please treat this as an urgent request.
    <Oct 10, 2009 4:09:24 PM> <Info> <NodeManager> <Server output log file is "/nfs/appl/XXXXX/weblogic/XXXXX_admin/servers/XXXXX_admin_server/logs/XXXXX_admin_server.out">
    [ERROR] Too large java heap setting.
    Try to reduce the Java heap size using -Xmx:<size> (e.g. "-Xmx128m").
    You can also try to free low memory by disabling
    compressed references, -XXcompressedRefs=false.
    Could not create the Java virtual machine.
    <Oct 10, 2009 4:09:25 PM> <Debug> <NodeManager> <Waiting for the process to die: 29643>
    <Oct 10, 2009 4:09:25 PM> <Info> <NodeManager> <Server failed during startup so will not be restarted>
    <Oct 10, 2009 4:09:25 PM> <Debug> <NodeManager> <runMonitor returned, setting finished=true and notifying waiters>

    Thanks Kevin.
    Let me try that.
    More than 8 domains have already been created successfully and are running fine. Now the newly created domain has this problem. I need 1GB for my domain. Is there any way to do this?

  • Cisco Web Caching Solution

    Hi There,
    I need a high availability web caching solution for 500 users, preferably with proxy functionality.
    Cisco had the Cisco Web Caching Engine, but it is End-of-Sale already. The new product that will replace the EOS product is the Wide Area Application Engine (WAE) platform. Is that right?
    The Cisco Global Price List has a SKU product SE512-IPROXY-K9 (Security / iProxy WAE-512 bundle, 1GB MEM, 1 250GB HDD Incl.).
    My questions are:
    Is the SKU product SE512-IPROXY-K9 the right product for my needs?
    What is the difference between the SKU SE512-IPROXY-K9 and the SKU WAE-512-K9?
    Regards,
    Pedro Vasques

    Hi Pedro,
    The WAE product line does WAN Optimization, which includes HTTP optimization. The WAN Optimization solution works between two WAEs (one at a branch office and the other at the data center). When employees at the branch office access resources such as servers at the data center, their TCP based traffic gets optimized. For more info please go to http://www.cisco.com/go/waas
    Cisco's Web Caching was a different solution. It caches the web objects when the first user goes to a given web site. Subsequent user requests are served locally as long as the cached copy of that web page is available.
    I suggest you contact either a Cisco partner or Cisco representative in your area.
    thanks
    Nat

  • Speed of animation fluctuates based on java cache..

    Hi everyone,
    It seems I am experiencing a rather weird problem.
    In my app I am having a small animation on a BufferedImage using the swingx.Timer class.
    The problem is that sometimes the animation runs very slowly (I get a feeling of the computer struggling to keep up)
    while at other times it's running at full speed! (I have set the Timer constructor to 1ms)
    What could be causing this nondeterministic behaviour? I've looked at the code a million times
    and there seems to be nothing wrong with it...!
    P.S I have a CORE2DUO and am running the latest JRE 1.6.024
    UPDATE: I can replicate the problem if I clear the java cache from the control panel
    in windows..! Strange... isn't it?
    Edited by: konos5 on Mar 14, 2011 4:10 PM

    konos5 wrote:
    UPDATE: I can replicate the problem if I clear the java cache from the control panel
    in windows..! Strange... isn't it?
    Depends. Are you perhaps loading the image in the same method you are drawing it? If so: that is horribly inefficient.
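A rough sketch of what that reply points at: create the image once, keep it in a field, and have the paint code only draw it. The class SpriteAnimation and its parameters are invented for illustration, and the time-based position is an extra assumption, added because a 1 ms timer probably does not fire at an even rate on every machine.

```java
import java.awt.image.BufferedImage;

/** Hypothetical animation state: the image is loaded once and reused for every frame. */
final class SpriteAnimation {
    private final BufferedImage sprite;   // created up front, never inside the paint method
    private final double pixelsPerSecond; // speed expressed in time, not in timer ticks
    private final int trackWidth;         // horizontal distance after which the sprite wraps

    SpriteAnimation(BufferedImage sprite, double pixelsPerSecond, int trackWidth) {
        this.sprite = sprite;
        this.pixelsPerSecond = pixelsPerSecond;
        this.trackWidth = trackWidth;
    }

    /** Image to draw; the paint method only reads this field. */
    BufferedImage sprite() {
        return sprite;
    }

    /** X position after the given elapsed time, independent of how often the timer fired. */
    int xAt(long elapsedNanos) {
        double travelled = pixelsPerSecond * (elapsedNanos / 1_000_000_000.0);
        return (int) (travelled % trackWidth);
    }
}
```

In the timer callback one would pass System.nanoTime() minus the start time to xAt and then repaint, so a slow or irregular timer changes how smooth the motion looks but not how fast the sprite travels.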

  • Problem while adding SAP Netweaver 7.3 As Java in Solution Manager 7.1

    Hi All,
    We are facing a strange problem while configuring a managed system, SAP Netweaver 7.3 AS Java, in Solution Manager 7.1 SP3. We added this system to the SLD of Solution Manager and synchronized it with LMDB. As a result, this AS Java system is present under Technical Systems in SMSY. Now we are assigning a Product System to it with the Landscape Verification tool. The problem is that when we try to save after assigning Product Version SAP Netweaver and Product SAP Netweaver 7.3, it always gives the error below:
    Fatal error trying to update Technical System (UPDATE_TSYSTEMS). Update could not be executed. 
    Product instance 54 not found
    The product instance number in the error keeps changing depending on which usage I select, like ADS 7.3, Application Server Java 7.3 etc.
    Please suggest if someone has faced this issue.
    Thanks
    Sunny

    Hi Sunny,
    You can refer to my first blog on this topic:
    #sapadmin:: How to assign Product System in SOLMAN 7.1 & How LMDB, SLD, SMSY and Landscape Verification  work in SOLMAN7.1
    Try selecting only one product instance in LMDB and save. After that, you can add more product instances in SMSY.
    Hope it helps.
    Cheers,
    Nicholas Chang

  • Value mapping replication - java cache

    I used value mapping replication and successfully loaded values into the Java cache (I can see them in the cache monitoring but not in the configuration).
    When I try to use a scenario with value mapping, using the agencies and schemes that I loaded, the mapping doesn't work and I receive the same value that was entered.
    What could be the problem?
    P.S.
    I used the same group ID for each pair of source and target values when I loaded them. Is that the right way?

    Tomer,
    What is the context that you used? Did you give the same context in the mapping as in the Runtime Cache?
    Regards,
    Sudharshan N A

  • Issue with Java cache

    Hi friends,
    I'm in a PI 7.3 environment. My problem is that when I try to access ESB and IB, it says that it is unable to download.
    So every time, I clear the Java cache and try the download again to get into ESB and IB. Obviously I'm wasting hours just logging in. If the connection drops and I want to get into ESB again, I have to clear the Java cache completely (the Java cache is not being recognized).
    I went through all the links on SDN, but none of them solved my issue. Kindly share the exact steps to fix it.
    Many thanks.
    Swarnaalu.

    Hi Swarna,
    There might be some problem with Java Web Start. You can try uninstalling Java and reinstalling it so that it will work.
    OR
    I am not sure about PI 7.3, but in earlier versions you can do the re-initialization, which works.
    follow the below:
    1. Login to your XI/PI server
    2. Go to Exchange Infrastructure Tools main web screen. You will see
    Tools list and other options.
    http://SERVER:PORT/rep/start/index.jsp
    3. On the Upper right part of your screen you will see Tools
    Administration Client Installation and Guidelines Documentation.
    Click Administration
    4. Login with user.
    5. On the Exchange Infrastructure Administration screen, make sure you are in
    the correct tab (Repository, Directory, Runtime). The tab will
    determine the administration configuration that you will perform.
    5.1 When this problem occurs with Integration Repository > make sure you
    are in the Integration REPOSITORY tab. Proceed with step 6. Skip step 5.2.
    5.2 When this problem occurs with Integration Directory > make sure you
    are in the Integration DIRECTORY tab. Proceed with step 6.
    6. Click on Java web start administration.
    7. Click on Re-initialization and force-signing. This will
    re-authenticate all new JAR files deployed. This will also let the new
    JAR files adapt to the current certificate deployed.
    This function will cause the above re-collection and additionally a
    re-signing of ALL resources with a dummy certificate. The original SAP
    signatures of the jarfiles will be lost. To get back the original SAP signatures
    the application has to be deployed again.
    8. A Java(TM) Web start Application reset text will appear.
    9. Wait for 5 to 15 minutes for re-initialization to complete
    10. Start your "Integration Repository" or "Integration Directory"
    again. It should work now.
    This will solve the issue
    Regards,
    Naveen

  • Clear java cache

    Hi
    Hope someone can assist
    I'm struggling to download items on some websites on Safari and the forum on the website suggested I clear my Java cache. I can't find any references to Java under system Preferences or Utilities. There is also no option to clear the cache in Safari.
    I've cleared the history and the cookies but that hasn't helped either.
    Can anyone point me in the right direction?
    Much appreciated
    K

    Thanks,
    For some reason I don't see Java under System Preferences?

  • Store XML data in java cache (hashmap as a key value pair)

    Hi,
    I have to store an XML file in a Java cache so that I can reuse it. The flow is like this:
    The DAO layer reads the database, creates an XML and sends it to --> IBM MQ --> our Java code should read this XML file from MQ and store it in a cache (preferably a HashMap). The file contains a unique id for every customer.
    How can we achieve this? One way is to store the XML as a String, where the key is the "id" and the value is the whole XML. Is this a good way, or is there any other way available? Please suggest.
    Sample xml:
    <Client>
      <ClientId>1234</ClientId>
      <ClientName>STechnology</ClientName>
      <ID>10</ID>
      <ClientStatus>ACTIVE</ClientStatus>
      <LEAccount>
        <ClientLE>678989</ClientLE>
        <LEId>56743</LEId>
        <Account>
          <AccountNumber>9876543678</AccountNumber>
        </Account>
      </LEAccount>
      <Service>
        <Cindicator>Y2Y</Cindicator>
        <PrefCode>980</PrefCode>
        <BSCode>876</BSCode>
        <MandatoryContent>MSP</MandatoryContent>
      </Service>
    </Client>
    Thanks
    Sumit

    A HashMap can work, but then still store the customer related data in a bean (and perhaps it will have some child objects as well, if for example the service subelement can repeat). So you get a HashMap of Customer objects, with the clientID as the index into the hashmap.
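A minimal sketch of that suggestion, using only the JDK's DOM parser. The Client bean, the ClientCache class and the choice of three fields are assumptions made for illustration; reading the message from MQ is left out, so put simply takes the XML as a String.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical bean holding a few fields of one <Client> message. */
final class Client {
    final String clientId;
    final String name;
    final String status;

    Client(String clientId, String name, String status) {
        this.clientId = clientId;
        this.name = name;
        this.status = status;
    }
}

/** Hypothetical cache: parsed Client beans keyed by their ClientId. */
final class ClientCache {
    // ConcurrentHashMap rather than HashMap, assuming the MQ listener and
    // the readers run on different threads.
    private final Map<String, Client> byId = new ConcurrentHashMap<>();

    /** Parses one <Client> document and stores (or replaces) it under its ClientId. */
    Client put(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Client client = new Client(
                    text(doc, "ClientId"), text(doc, "ClientName"), text(doc, "ClientStatus"));
            byId.put(client.clientId, client);
            return client;
        } catch (Exception e) {
            // Covers malformed XML as well as a missing expected element.
            throw new IllegalArgumentException("Not a usable <Client> document", e);
        }
    }

    /** Returns the cached bean, or null if that id has not been seen. */
    Client get(String clientId) {
        return byId.get(clientId);
    }

    // Text of the first element with the given tag name.
    private static String text(Document doc, String tag) {
        return doc.getElementsByTagName(tag).item(0).getTextContent().trim();
    }
}
```

Parsing once on arrival means later lookups do no XML work. Repeating elements such as Service would become child objects or lists on the bean, as the reply suggests.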

  • Java Caching techniques..

    Can anyone please give me some pointers on Java caching techniques for improving the performance of code?
    Thanks
    Naxy

    Commons Pool (http://commons.apache.org/pool/)?
    What are you caching and why?
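Until that question is answered, one common starting point is a small bounded cache that evicts the least recently used entry. The sketch below is not taken from the thread: the class name LruCache is invented, and it relies only on java.util.LinkedHashMap's access-order mode and its removeEldestEntry hook.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded cache: drops the least recently used entry when full. */
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    LruCache(int maxEntries) {
        // accessOrder = true: get() and put() move an entry to the "newest" end.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true evicts
    // the eldest (least recently used) entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

This class is not thread-safe; for shared use it would need wrapping, for example with Collections.synchronizedMap. Whether it helps at all depends on what is being cached and how expensive a miss is.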

  • Java Cache Viewer shows duplicate URL with different versions

    In doing some testing lately we have come across a couple of situations where the same URL is showing in the JAVA cache viewer with different versions to them.  In the case of a 1.7 JRE this is not causing any adverse effects however when utilizing a 1.6 JRE we are getting messages about mixed mode for the JAR files.
    The jar file is signed as below:
    Manifest-Version: 1.0
    Ant-Version: Apache Ant 1.8.2
    Application-Library-Allowable-Codebase: *
    Application-Name: named_applet
    Built-By: relbuild
    Permissions: all-permissions
    Created-By: 1.6.0_17-b04 (Sun Microsystems Inc.)
    Caller-Allowable-Codebase: *
    Codebase: *
    We understand the need to change the wildcards to domains as we move forward. I am wondering why the jar file is being listed and called twice, and why it is recognized as signed one time and not the next.
    The URL is identical in the cache viewer, and the Java console says it is not found in the cache when it is used the second time, even though earlier in the session it finds it with the correct version.
    Any advice would be helpful

    That I already know, but the problem is that I do not have that version of the Java RE installed. I found distribution 07 of the same Java version, so I'll try it out with that.
    If it still doesn't work, I would still like to find the link where I can download Java version 1.4.2_06.

  • Java Cache Issue on SAP MII 12.1 SP09 (Build 116)

    Hi All,
    Has anybody experienced this issue before?
    Problem Description
    This installation is a central MII instance, which is HP hosted in Swindon, UK. On Login we are experiencing a delay in loading the Java Applets, i.e. iGrids, iCharts, iCommands etc. After the first screen load the performance improves. When we close the session and re-enter, the performance on the first screen is delayed.
    Problem Investigation
    We have compared this project to other MII instances and investigated the cache loading of each of the MII installations. What we have noticed is that the Java is not being cached on the client side for the MII server V12.1 SP09 (Build 116). We had three MII servers, with this version, and we are experiencing the same issue. We have other instances of version 12.1.7, 12.1.8, and 12.2.0 and the cache functionality is working fine.
    To view the cache we are opening the Java Cache viewer on the Java Control Panel, and we are looking for the file "illum8.zip", which is the MII Java library. This investigation was done using the same Client Java Runtime version and IE Version.
    Software Versions used: JavaRuntime version: 1.0.6_22 & Internet Explorer 8.0
    Any advice/help much appreciated.
    Regards,
    Henry

    Hi Henry,
    Not sure this has any bearing, but what NW version and SP is MII sitting on top of?
    Thanks,
    Mike
