Merged
30 changes: 28 additions & 2 deletions README.md
@@ -361,7 +361,7 @@ For complex projects, specifically those written in C or C++, you may need to pa

This flag accepts a comma-separated list of key-value pairs in the format `key=value`.

### usage
### Usage

```bash
--frontend-args key1=value1,key2=value2,key3=value3
@@ -379,6 +379,9 @@ The following arguments are supported when `--language` is set to `c`, `cpp`, or
| `function-bodies` | Boolean | Whether to extract function bodies. | `function-bodies=false` |
| `parse-inactive-code` | Boolean | Parse code within disabled preprocessor blocks (e.g., inside `#if 0`). | `parse-inactive-code=true` |
| `with-image-locations` | Boolean | Create image locations (explains how a name made it into the translation unit). | `with-image-locations=true` |
| `enable-ast-cache` | Boolean | Cache parsed ASTs to disk to speed up subsequent runs on unchanged files. | `enable-ast-cache=true` |
| `ast-cache-dir` | String | Directory to store cached AST files. Defaults to `ast_out` in the input directory. | `ast-cache-dir=/tmp/cache` |
| `only-ast-cache` | Boolean | Only generate AST cache files and exit. Useful for large projects to avoid out-of-memory (OOM) errors. | `only-ast-cache=true` |

> **Note:** Boolean values must be passed as the strings `true` or `false`.

@@ -414,9 +417,32 @@ java -jar atom.jar \
--frontend-args parse-inactive-code=true
```

**4. Large Projects: Two-Stage Generation (Memory Optimization)**
For very large C/C++ codebases, generating the full graph in one pass might consume too much memory. You can split the process into two stages using the AST cache.

_Stage 1: Generate AST Cache Only_
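This parses files one by one and saves their ASTs to disk, keeping memory usage low. The cache goes to `ast_out` inside the input directory by default (`./src/ast_out` here) unless `ast-cache-dir` is set, as in the command below, which writes to `/tmp/cache`.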

```bash
java -jar atom.jar \
--language c \
--input ./src \
--frontend-args only-ast-cache=true,ast-cache-dir=/tmp/cache
```

_Stage 2: Generate Atom from Cache_
Run atom again with `enable-ast-cache=true` and the same `ast-cache-dir`. It loads the pre-computed ASTs from disk instead of re-parsing the sources, which significantly speeds up graph creation.

```bash
java -jar atom.jar \
--language c \
--input ./src \
--frontend-args enable-ast-cache=true,ast-cache-dir=/tmp/cache
```

## Troubleshooting

### atom file is incomplete for large projects
### atom file is incomplete for large JS/TS projects

astgen may need a large heap for large JavaScript projects, especially Flow projects. Use the `NODE_OPTIONS` environment variable to increase the memory available.

2 changes: 1 addition & 1 deletion build.sbt
@@ -3,7 +3,7 @@ ThisBuild / organization := "io.appthreat"
ThisBuild / version := "2.5.0"
ThisBuild / scalaVersion := "3.7.4"

val chenVersion = "2.5.15"
val chenVersion = "2.5.16"

lazy val atom = Projects.atom

54 changes: 53 additions & 1 deletion docs/docs/cli.md
@@ -77,7 +77,7 @@ For complex projects, specifically those written in C or C++, you may need to pa

This flag accepts a comma-separated list of key-value pairs in the format `key=value`.

### usage
### Usage

```bash
--frontend-args key1=value1,key2=value2,key3=value3
@@ -95,6 +95,9 @@ The following arguments are supported when `--language` is set to `c`, `cpp`, or
| `function-bodies` | Boolean | Whether to extract function bodies. | `function-bodies=false` |
| `parse-inactive-code` | Boolean | Parse code within disabled preprocessor blocks (e.g., inside `#if 0`). | `parse-inactive-code=true` |
| `with-image-locations` | Boolean | Create image locations (explains how a name made it into the translation unit). | `with-image-locations=true` |
| `enable-ast-cache` | Boolean | Cache parsed ASTs to disk to speed up subsequent runs on unchanged files. | `enable-ast-cache=true` |
| `ast-cache-dir` | String | Directory to store cached AST files. Defaults to `ast_out` in the input directory. | `ast-cache-dir=/tmp/cache` |
| `only-ast-cache` | Boolean | Only generate AST cache files and exit. Useful for large projects to avoid out-of-memory (OOM) errors. | `only-ast-cache=true` |

> **Note:** Boolean values must be passed as the strings `true` or `false`.

@@ -129,3 +132,52 @@ java -jar atom.jar \
--input ./src \
--frontend-args parse-inactive-code=true
```

**4. Large Projects: Two-Stage Generation (Memory Optimization)**
For very large C/C++ codebases, generating the full graph in one pass might consume too much memory. You can split the process into two stages using the AST cache.

_Stage 1: Generate AST Cache Only_
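This parses files one by one and saves their ASTs to disk, keeping memory usage low. The cache goes to `ast_out` inside the input directory by default (`./src/ast_out` here) unless `ast-cache-dir` is set, as in the command below, which writes to `/tmp/cache`.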

```bash
java -jar atom.jar \
--language c \
--input ./src \
--frontend-args only-ast-cache=true,ast-cache-dir=/tmp/cache
```

_Stage 2: Generate Atom from Cache_
Run atom again with `enable-ast-cache=true` and the same `ast-cache-dir`. It loads the pre-computed ASTs from disk instead of re-parsing the sources, which significantly speeds up graph creation.

```bash
java -jar atom.jar \
--language c \
--input ./src \
--frontend-args enable-ast-cache=true,ast-cache-dir=/tmp/cache
```
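
The two stages above can be chained in a small wrapper. This is a minimal sketch, not part of atom: the function names `frontend_args` and `two_stage_atom` are hypothetical, and it assumes `atom.jar` sits in the current directory as in the examples above. It uses only the `only-ast-cache`, `enable-ast-cache`, and `ast-cache-dir` keys documented in the table.

```shell
#!/usr/bin/env bash

# Hypothetical helper: builds the --frontend-args value for one stage.
#   cache -> stage 1 (write the AST cache only)
#   build -> stage 2 (build the atom from the cache)
frontend_args() {
  local stage="$1" cache_dir="$2"
  case "$stage" in
    cache) printf 'only-ast-cache=true,ast-cache-dir=%s\n' "$cache_dir" ;;
    build) printf 'enable-ast-cache=true,ast-cache-dir=%s\n' "$cache_dir" ;;
    *) echo "unknown stage: $stage" >&2; return 1 ;;
  esac
}

# Hypothetical wrapper: runs stage 1, then stage 2 only if stage 1 succeeded.
# Both stages must point at the same cache directory.
two_stage_atom() {
  local input="$1" cache_dir="$2"
  java -jar atom.jar --language c --input "$input" \
    --frontend-args "$(frontend_args cache "$cache_dir")" &&
  java -jar atom.jar --language c --input "$input" \
    --frontend-args "$(frontend_args build "$cache_dir")"
}
```

Call it as `two_stage_atom ./src /tmp/cache`. Because `--frontend-args` takes a comma-separated list, a cache path that itself contains a comma would presumably be split incorrectly, so the sketch assumes paths without commas.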

---

## Tips & Tricks

### C/C++ monorepos

Given a large monorepo of C/C++ source code (such as MongoDB), atom and chen cannot reliably determine the base directory to use for every sub-project. These base directories are crucial for compiling the project successfully and are usually set by build tools such as CMake or Ninja.

One trick that worked for us is to first run atom in `only-ast-cache` mode on each parent directory of a `src`, `include`, or `source` folder, with every run writing to the same cache directory.

```shell
find . -type d \( -name "src" -o -name "source" -o -name "include" \) -print0 | \
xargs -0 -n1 dirname | \
sort -u -r | \
while read -r parent; do
echo "Processing: $parent"
~/work/AppThreat/atom/atom.sh -l c -o foo.atom --frontend-args enable-ast-cache=true,ast-cache-dir=/home/appthreat/sandbox/mongo/ast_out,only-ast-cache=true "$parent"
done
```

Re-running atom against the populated cache then led to fewer time-out errors.

```shell
~/work/AppThreat/atom/atom.sh --with-data-deps -l c -o foo.atom --frontend-args enable-ast-cache=true,ast-cache-dir=/home/appthreat/sandbox/mongo/ast_out "$parent"
```
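
Before launching the cache-only loop above on a large tree, it can help to preview which directories it will visit. The following is a minimal dry-run sketch under stated assumptions: `list_source_parents`, `print_cache_commands`, and the `ATOM_CMD` variable are hypothetical names, and the discovery step uses `find -exec dirname` rather than `xargs` so that an empty match set produces no output instead of an error. It prints the commands rather than running them.

```shell
#!/usr/bin/env bash

# Print the unique parent directories of every src/, source/, or include/
# folder under a root, reverse-sorted like the loop above.
list_source_parents() {
  local root="${1:-.}"
  find "$root" -type d \( -name "src" -o -name "source" -o -name "include" \) \
    -exec dirname {} \; |
    sort -u -r
}

# Hypothetical dry run: print one cache-only atom command per parent directory
# instead of executing it. ATOM_CMD and the cache path are placeholders.
print_cache_commands() {
  local root="$1" cache_dir="$2" atom_cmd="${ATOM_CMD:-atom.sh}"
  list_source_parents "$root" | while IFS= read -r parent; do
    printf '%s -l c -o foo.atom --frontend-args only-ast-cache=true,ast-cache-dir=%s %s\n' \
      "$atom_cmd" "$cache_dir" "$parent"
  done
}
```

Running `print_cache_commands . /path/to/ast_out` lists one line per discovered parent directory. The printed paths are not shell-quoted, so treat the output as a preview rather than something to pipe back into a shell.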
36 changes: 30 additions & 6 deletions src/main/scala/io/appthreat/atom/Atom.scala
@@ -146,7 +146,7 @@ object Atom:
)
opt[Map[String, String]]("frontend-args")
.text(
"Advanced frontend configuration (key=value). E.g. --frontend-args defines=DEBUG,cpp-standard=c++17"
"Advanced frontend configuration (key=value). E.g. --frontend-args defines=DEBUG,enable-ast-cache=true,only-ast-cache=true"
)
.action((x, c) =>
c match
@@ -586,16 +586,22 @@ object Atom:

private def generateForLanguage(language: String, config: AtomConfig): Either[String, String] =
val outputAtomFile = config.outputAtomFile.pathAsString
val onlyAstCache = extractArgBoolean(config, "only-ast-cache", default = false)

getOrCreateAtom(language, config, outputAtomFile) match
case Failure(exception) =>
Left(exception.getStackTrace.take(20).mkString("\n"))
case Success(ag) =>
for
_ <- enhanceCpg(config, ag)
_ <- generateSlice(config, ag)
_ <- closeCpg(ag)
yield "Atom generation successful"
if onlyAstCache then
closeCpg(ag)
Try(File(outputAtomFile).delete(true))
Right("AST cache generated successfully. Skipped CPG enhancement and slicing.")
else
for
_ <- enhanceCpg(config, ag)
_ <- generateSlice(config, ag)
_ <- closeCpg(ag)
yield "Atom generation successful"

private def getOrCreateAtom(
language: String,
@@ -659,6 +665,11 @@ object Atom:
val defines = extractArgSet(config, "defines")
val extraIncludes = extractArgSet(config, "includes") ++ extractArgSet(config, "include-paths")
val cppStandard = extractArgString(config, "cpp-standard")
val onlyAstCache = extractArgBoolean(config, "only-ast-cache", default = false)
val enableAstCache =
extractArgBoolean(config, "enable-ast-cache", default = false) || onlyAstCache
val defaultCacheDir = (config.inputPath / "ast_out").pathAsString
val cacheDir = extractArgString(config, "ast-cache-dir", default = defaultCacheDir)
val baseConfig = CConfig(
includeComments = false,
logProblems = false,
@@ -672,6 +683,10 @@ object Atom:
.withParseInactiveCode(false)
.withImageLocations(false)
.withIncludeTrivialExpressions(false)
.withAstCache(enableAstCache)
.withCacheDir(cacheDir)
.withOnlyAstCache(onlyAstCache)

val finalConfig = baseConfig
.withDefines(defines)
.withCppStandard(cppStandard)
@@ -689,6 +704,11 @@ object Atom:
val includeComments = extractArgBoolean(config, "include-comments", default = false)
val includeTrivialExpressions =
extractArgBoolean(config, "include-trivial-expressions", default = false)
val onlyAstCache = extractArgBoolean(config, "only-ast-cache", default = false)
val enableAstCache =
extractArgBoolean(config, "enable-ast-cache", default = false) || onlyAstCache
val defaultCacheDir = (config.inputPath / "ast_out").pathAsString
val cacheDir = extractArgString(config, "ast-cache-dir", default = defaultCacheDir)
val baseConfig = CConfig(
includeComments = includeComments,
logProblems = false,
@@ -702,6 +722,10 @@ object Atom:
.withParseInactiveCode(parseInactive)
.withImageLocations(imageLocations)
.withIncludeTrivialExpressions(includeTrivialExpressions)
.withAstCache(enableAstCache)
.withCacheDir(cacheDir)
.withOnlyAstCache(onlyAstCache)

val finalConfig = baseConfig
.withDefines(defines)
.withCppStandard(cppStandard)
18 changes: 13 additions & 5 deletions src/main/scala/io/appthreat/atom/slicing/UsageSlicing.scala
@@ -252,11 +252,14 @@ object UsageSlicing:
resolvedMethod: Option[String],
language: Option[String]
): Option[String] =
if !language.contains(Languages.JSSRC) || baseCall.code.isEmpty || !baseCall.code.contains(
"("
)
then
resolvedMethod
val isJs = language.contains(Languages.JSSRC) || language.contains(Languages.JAVASCRIPT)
if !isJs || baseCall.code.isEmpty || !baseCall.code.contains("(") then
resolvedMethod
else
val taggedArg =
baseCall.argument.filter(_.tag.nameExact(FRAMEWORK_ROUTE).nonEmpty).isLiteral.headOption
if taggedArg.isDefined then
Option(taggedArg.get.code)
else
var code = baseCall.code.takeWhile(_ != '(')
if code.contains(" ") then code = code.split(" ").last
@@ -269,6 +272,8 @@ object UsageSlicing:
"\\t"
)
Option(code)
end if
end handleJavaScriptLogic

private def getDefNode(tgt: Declaration): Option[AstNode] = tgt match
case local: Local =>
@@ -321,10 +326,13 @@ object UsageSlicing:
(externalCalleesAsSlices(atom), routesAsUDT(atom))
case Some(Languages.RUBYSRC) =>
(danglingRouteCallsAsSlices(atom) ++ httpEndpointsAsSlices(atom), routesAsUDT(atom))
case Some(lang) if lang == Languages.JSSRC || lang == Languages.JAVASCRIPT =>
(unusedTypeDeclAsSlices(atom), routesAsUDT(atom))
case _ =>
(unusedTypeDeclAsSlices(atom), Nil)

ProgramUsageSlice(slices ++ extraSlices, userDefTypes ++ extraTypes)
end createProgramUsageSlice

private def createMethodUsageSlice(
method: Method,